Commit Graph

31 Commits

Katherine Wu
a3e64f721c Add SaveableObjects to SavedModel.
When objects are loaded from the SavedModel, they don't retain their `_gather_saveables_for_checkpoint` functions, which can result in values not being loaded from the checkpoint.

This CL adds a field in the SavedModel proto that stores a save and restore function for each SaveableObject in each node. When loading into Python, the SaveableObjects are restored using the functions.

PiperOrigin-RevId: 318549786
Change-Id: I688c72d7658e1bca98abf373a13a0e15a7fb83e2
2020-07-01 14:28:41 -07:00
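
A hedged sketch of the gap this CL closes, using the internal SaveableObject API (class names here are illustrative): a Trackable can supply custom save/restore logic through `_gather_saveables_for_checkpoint`, but since that hook is plain Python it was not serialized into the SavedModel, so restored objects silently skipped those values.

```python
import tensorflow as tf
from tensorflow.python.training.saving import saveable_object

class _StateSaveable(saveable_object.SaveableObject):
    # Illustrative SaveableObject wrapping a single tensor of state.
    def __init__(self, obj, name):
        self._obj = obj
        spec = saveable_object.SaveSpec(
            tensor=obj.state, slice_spec="", name=name)
        super().__init__(obj.state, [spec], name)

    def restore(self, restored_tensors, restored_shapes):
        return self._obj.state.assign(restored_tensors[0])

class Stateful(tf.Module):
    def __init__(self):
        super().__init__()
        self.state = tf.Variable(0.0)

    # Consulted by tf.train.Checkpoint when saving; as pure Python it
    # did not survive a SavedModel round trip, which is what this CL's
    # new proto field fixes.
    def _gather_saveables_for_checkpoint(self):
        return {"state": lambda name: _StateSaveable(self, name)}
```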
Bruce Fontaine
c27b834b49 Wrap save/restore logic in tf.function when in eager mode. This allows parallel saving and restoring when using multiple devices.
PiperOrigin-RevId: 317719780
Change-Id: Ifb7e34f708da4121b49fb38d8dad046d45fedc42
2020-06-22 13:23:14 -07:00
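
A minimal sketch of the idea (the names below are hypothetical stand-ins, not the saver's actual internals): tracing the per-variable assignments inside a tf.function lets ops placed on different devices run concurrently, where eager execution would serialize them.

```python
import tensorflow as tf

@tf.function
def restore_all(variables, restored_tensors):
    # Traced once; assignments on different devices can then be
    # dispatched in parallel by the runtime.
    for var, value in zip(variables, restored_tensors):
        var.assign(value)
```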
A. Unique TensorFlower
bc1c0e86a6 Wrap save/restore logic in tf.function when in eager mode. This allows parallel saving and restoring when using multiple devices.
PiperOrigin-RevId: 317180143
Change-Id: Icdc740d02beb7c2d3236191add3b72fa103fc134
2020-06-18 14:29:28 -07:00
Bruce Fontaine
a31d5da026 Wrap save/restore logic in tf.function when in eager mode. This allows parallel saving and restoring when using multiple devices.
PiperOrigin-RevId: 317144560
Change-Id: Iebc230589a5e2712da03c5db3f45e4fd7eeb5ff9
2020-06-18 11:34:42 -07:00
A. Unique TensorFlower
e853835634 Add an option to choose the I/O Device for saving and loading models.
This option enables saving and restoring models to or from filesystems only
accessible from the localhost when using multiple devices.

The option is available for:
 - Saving models: tf.saved_model.save()
 - Checkpoints: tf.train.Checkpoint()

PiperOrigin-RevId: 307858098
Change-Id: I4cd0a81424e306f0eac40bfb30d5067dfc02d1be
2020-04-22 11:27:15 -07:00
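
The option surfaced publicly as `experimental_io_device` (name taken from later TF releases; treat the snippet as a sketch of both entry points rather than the exact API at this revision):

```python
import tensorflow as tf

module = tf.Module()
module.v = tf.Variable(1.0)

# Route SavedModel file I/O through the local host, e.g. when worker
# devices cannot see the target filesystem.
tf.saved_model.save(
    module, "/tmp/saved_model",
    options=tf.saved_model.SaveOptions(
        experimental_io_device="/job:localhost"))

# The same option for checkpoints.
ckpt = tf.train.Checkpoint(module=module)
ckpt.write("/tmp/ckpt/module",
           options=tf.train.CheckpointOptions(
               experimental_io_device="/job:localhost"))
```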
TensorFlower Gardener
eacf534690 Merge pull request from rahul003:s3_skip_temp
PiperOrigin-RevId: 297455352
Change-Id: I41411282776981e9cf4e347b25d238557151f9e6
2020-02-26 14:49:31 -08:00
Rahul Huilgol
e65e99c433 Do not use temp files when writing to S3
Address feedback

Add test for the python method has_atomic_move

Removed old comment, and fixed indentation

Remove unnecessary imports

Remove the test which checks for reference cycles when saving: the filesystem check introduces a conditional op, which itself creates a reference cycle, so the check no longer applies

Fighting lint

Fix lint errors

Use returned status of hasAtomicMove
2020-02-20 20:00:53 +00:00
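
A hedged sketch of the behavior this PR introduces; `has_atomic_move` is the helper it adds to the internal `file_io` module, while the surrounding writer logic is illustrative:

```python
from tensorflow.python.lib.io import file_io

def write_file(path, contents):
    if file_io.has_atomic_move(path):
        # POSIX-like filesystems: write to a temp file, then rename
        # into place atomically.
        tmp = path + ".tmp"
        file_io.write_string_to_file(tmp, contents)
        file_io.rename(tmp, path, overwrite=True)
    else:
        # Object stores such as S3 have no atomic rename; the temp-file
        # dance would copy the object twice, so write directly.
        file_io.write_string_to_file(path, contents)
```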
Rohan Jain
c19c8167c2 We can't rely on the tensor.device attribute always being set, so it is better to get the device for a SaveSpec from the device passed in. This was an issue when saving iterators: the iterator resource usually has a device specification, but the serialized tensor derived from it might not have one set. As a result, when saving iterators in a sharded fashion, all iterators ended up on the '' device, which is not what was intended.
Also adding support for saving iterators in a sharded fashion to avoid unnecessary copying during checkpointing.

PiperOrigin-RevId: 286310419
Change-Id: I1a957af783f7f69753992ce220b59eb43df2c02f
2019-12-18 19:11:04 -08:00
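
A hedged sketch of the fix using the internal SaveSpec API: the device is passed in explicitly rather than read from tensor.device, which can be empty for tensors serialized from resources such as iterators.

```python
import tensorflow as tf
from tensorflow.python.training.saving import saveable_object

# Stand-in for a serialized iterator state whose .device may be unset.
serialized_state = tf.constant("opaque-iterator-state")

spec = saveable_object.SaveSpec(
    tensor=serialized_state,
    slice_spec="",
    name="iterator_state",
    # The explicit device keeps sharded saving from lumping every
    # iterator onto the '' device.
    device="/job:worker/task:0/device:CPU:0")
```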
Brian Atkinson
2fcfa6085b Move additional_deps to deps for cuda_py_test.
PiperOrigin-RevId: 285283853
Change-Id: I2534d9fb51955cc9a86d1900ec60fc265f451ddc
2019-12-12 15:28:04 -08:00
Revan Sopher
5304e1240a Add SaveableHook, a special SaveableObject which registers callbacks.
Registers a single constant tensor in order to conform to the SaveableObject API; I feel that's cleaner than special-casing SaveableHook throughout the codebase.

PiperOrigin-RevId: 280708433
Change-Id: I5872949eca35c7fe3dcc401c52a63b66a141d865
2019-11-15 12:18:18 -08:00
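
A sketch of a SaveableHook subclass; the callback names follow the internal API added here, but treat the whole snippet as illustrative:

```python
from tensorflow.python.training.saving import saveable_object

class LoggingHook(saveable_object.SaveableHook):
    """Runs callbacks around save/restore; only a dummy constant is
    actually written to the checkpoint."""

    def before_save(self):
        print("about to save:", self.name)

    def after_restore(self):
        print("finished restoring:", self.name)

hook = LoggingHook(name="logging_hook")
```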
Gaurav Jain
c3973c78f0 Rename internal_convert_to_tensor for performance
Calling ops.internal_convert_to_tensor is more efficient than calling
ops.convert_to_tensor because it skips deprecated_argument_lookup and
carries less Python function-call overhead. We thus swap the two
functions' names so that most code paths get the optimized version.

PiperOrigin-RevId: 274321742
2019-10-12 01:28:49 -07:00
Thomas O'Malley
89e33e5ef3 Add ShardedVariable class.
PiperOrigin-RevId: 272745815
2019-10-03 15:05:15 -07:00
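
A minimal sketch of the new class (internal at this point; the module path and constructor are assumed from later revisions): a ShardedVariable presents a list of smaller variables as one logical variable, e.g. a large embedding split across parameter servers.

```python
import tensorflow as tf
from tensorflow.python.distribute.sharded_variable import ShardedVariable

shards = [tf.Variable(tf.zeros([512, 64])),
          tf.Variable(tf.zeros([512, 64]))]
big = ShardedVariable(shards)  # behaves like one [1024, 64] variable
```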
Tres Popp
e01699ccbd [TF:XLA] Cleanup cuda_py_test calls with xla_enable_strict_auto_jit = True.
PiperOrigin-RevId: 270858131
2019-09-24 02:07:26 -07:00
Gaurav Jain
6e60889d0e Avoid hashing tensors directly in collections
PiperOrigin-RevId: 260844204
2019-07-30 20:40:56 -07:00
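
Context, sketched: hashing a Tensor by value was being deprecated, so collections that keyed on tensors moved to identity-based containers (the internal `object_identity` helpers):

```python
import tensorflow as tf
from tensorflow.python.util import object_identity

t = tf.constant([1.0])
seen = object_identity.ObjectIdentitySet()
seen.add(t)      # keyed by id(t), not by tensor contents
assert t in seen
```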
Alexandre Passos
aa93ea6441 Clean up the ResourceVariable inheritance hierarchy a bit.
PiperOrigin-RevId: 253592362
2019-06-17 09:19:01 -07:00
A. Unique TensorFlower
69feb7e97d Adjust structure of all BUILD files to recommended style (https://docs.bazel.build/versions/master/skylark/build-style.html#file-structure), moving loads to top.
PiperOrigin-RevId: 252072215
2019-06-07 11:35:53 -07:00
A. Unique TensorFlower
e44f32560d Apply 'buildozer fix moveLicensesAndDistribs movePackageToTop' to all BUILD files.
PiperOrigin-RevId: 249812574
2019-05-24 04:53:01 -07:00
Tom Hennigan
e4b9fda1d0 Fixes for multi-GPU continuous integration failures.
1) Test central storage strategy against numpy dataset.
2) Implement non-overridden methods from Variable in AggregatingVariable.
3) Make ResourceVariableSaveable support `is_resource_variable` instances.

PiperOrigin-RevId: 243518258
2019-04-14 12:46:11 -07:00
Taylor Robie
8d24f6ae5c Implement several optimizations to reduce graph construction time.
In approximately decreasing order of significance:

1) Cache various to_string, from_string, and string to string functionality in device.py.

2) Optimize DeviceSpec.to_string to reduce unnecessary string copies.

3) Skip no-op device assignments when creating ops (when possible).

4) Remove hash caching in DeviceSpec (since it can now be computed much more cheaply) which allows less aggressive locking.

5) Miscellaneous finessing of high-traffic functions (millions of calls).

PiperOrigin-RevId: 242996847
2019-04-10 21:08:37 -07:00
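
A hedged sketch of optimization 1): parsing a device string is a pure operation, so the string-to-DeviceSpec conversion can be memoized and shared across op-construction calls.

```python
import functools
import tensorflow as tf

@functools.lru_cache(maxsize=None)
def cached_device_spec(spec_string):
    # "/job:worker/task:0/device:GPU:0" parses to the same DeviceSpec
    # every time, so cache the parse instead of repeating it per op.
    return tf.DeviceSpec.from_string(spec_string)

spec = cached_device_spec("/job:worker/task:0/device:GPU:0")
```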
Alexandre Passos
8b5c79a7c7 Narrow scope of assertion which prevents hub module loading inside wrap_function
PiperOrigin-RevId: 242896943
2019-04-10 10:50:09 -07:00
Tom Hennigan
ea46787326 Make {Mirrored,Aggregating,TPU}Variable extend tf.Variable.
PiperOrigin-RevId: 242345741
2019-04-07 05:59:56 -07:00
Allen Lavoie
d7f5bf5ef2 Switch to sharding checkpoints by default in tf.train.Checkpoint
Each worker will do its own local read and write operations rather than copying to one device. This assumes a shared filesystem for all tasks.

Largely a copy-and-paste job from tf.train.Saver, except that to figure out the proper sharding when executing eagerly we need a device up front when restoring, so SaveSpecs with callable ops now require devices. Previously we evaluated the save Tensor and checked its device in order to shard restores, but when executing eagerly that means allocating lots of unused memory.

PiperOrigin-RevId: 240629173
2019-03-27 13:36:06 -07:00
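
Caller-visible usage is unchanged; the sharding happens under the hood. Illustrative file layout (shard counts vary with the number of tasks):

```python
import tensorflow as tf

ckpt = tf.train.Checkpoint(v=tf.Variable(1.0))
path = ckpt.save("/tmp/ckpt/model")
# With sharding on by default, each task writes its own shard:
#   /tmp/ckpt/model-1.index
#   /tmp/ckpt/model-1.data-00000-of-00002
#   /tmp/ckpt/model-1.data-00001-of-00002
```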
A. Unique TensorFlower
2c0441c286 Enable all currently passing tests.
I ran each test 200 times, and these passed without any flakes.

PiperOrigin-RevId: 238414725
2019-03-14 04:13:43 -07:00
Allen Lavoie
bd36b48c55 Rename Checkpointable -> Trackable and AutoCheckpointable -> AutoTrackable
No API changes in this CL. Just more refactoring for a future API change.

PiperOrigin-RevId: 234242335
2019-02-15 17:38:56 -08:00
Allen Lavoie
a3d38e4805 Allow tf.train.Checkpoint.write in tf.functions
We can't update the Checkpoint proto state (e.g. for tf.train.latest_checkpoint), but this at least throws an informative error and gives the user the option to write a checkpoint from a function without updating the Checkpoint proto.

Restore is more complicated, since we do restore-on-create. I think the right thing to do is to inherit the eagerness of the context where restore() was first called. Too much for this change.

PiperOrigin-RevId: 233082153
2019-02-08 11:01:28 -08:00
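
A sketch of what the change permits: write() now works inside a tf.function, while the checkpoint-management bookkeeping stays eager-only.

```python
import tensorflow as tf

ckpt = tf.train.Checkpoint(v=tf.Variable(1.0))

@tf.function
def save_step():
    # Allowed after this change. ckpt.save(), which also updates the
    # Checkpoint proto, would still raise an informative error here.
    return ckpt.write("/tmp/ckpt/manual")

save_step()
print(tf.train.latest_checkpoint("/tmp/ckpt"))  # None: proto not updated
```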
Allen Lavoie
f9b9cf52eb Checkpointable -> AutoCheckpointable
CheckpointableBase -> Checkpointable

In preparation for adding public symbols.

PiperOrigin-RevId: 229459684
2019-01-15 16:13:50 -08:00
Allen Lavoie
750b30d2aa Make layer JSON config saved in checkpoints optional for restore
In general, if a Python string is in the checkpoint but not used directly by the saving program, assert_consumed will pass even if the attribute is entirely absent on restore.

This should fix checkpoint compatibility with saved_model.load() even without proper Keras revived types. There's no reason it should fail if those aren't available for some reason.

PiperOrigin-RevId: 229448783
2019-01-15 15:23:45 -08:00
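
An illustrative flow (paths and variable names hypothetical): the saving program checkpoints both variables and a layer's JSON config string; a restoring program that recreates only the variables no longer trips assert_consumed on the unused string.

```python
import tensorflow as tf

# Restoring program: matches the checkpoint's variables but has no
# attribute for the checkpointed JSON config string.
restored = tf.train.Checkpoint(w=tf.Variable(tf.zeros([4])))
status = restored.restore("/tmp/ckpt/keras_model-1")
status.assert_consumed()  # passes despite the unmatched config string
```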
Allen Lavoie
efe565bc09 Make the initial tf.train.Checkpoint.restore() read Tensors in a batch
Should be faster when reading from distributed file systems. Does not affect cases where restore-on-create is necessary, but as long as variable objects have been created and tracked before restore() their reads should be batched together.

PiperOrigin-RevId: 227911381
2019-01-04 14:13:56 -08:00
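
A sketch of the access pattern that benefits: create and track the variables first, then call restore(), so the initial restore can fetch every tensor from the (possibly remote) filesystem in one batched read.

```python
import tensorflow as tf

model = tf.Module()
model.w = tf.Variable(tf.zeros([1024, 1024]))
model.b = tf.Variable(tf.zeros([1024]))

ckpt = tf.train.Checkpoint(model=model)
# Variables already exist, so no restore-on-create is needed and the
# reads can be issued as a single batch.
ckpt.restore("/tmp/ckpt/model-1")
```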
Allen Lavoie
2f4d4da52f Workaround for PartitionedCall trying and failing to run on TPUs when saving
Just omits the function decorator for now. This is pretty terrible and we should fix it, but it will need some work on the TPU side.

Spoke to iga@. Apparently the CPU annotations don't work because the function captures a resource which is on the TPU (and so the eager placer puts the call op on the TPU). One option is to then XLA-compile the function, although that fails right now because we're trying to save strings and XLA doesn't have a kernel for that.

I should also follow up with TPU+checkpointing integration tests.

PiperOrigin-RevId: 226390521
2018-12-20 14:18:46 -08:00
Allen Lavoie
a2bf042d36 Use functional_saver to write SaverDefs in tf.saved_model.save
Replaces the restore() code with tf.train.Saver's bulk restore logic, which was its default. I only noticed because apparently the other path fails on some saveables, and the restore code gets more thoroughly tested via to_proto.

PiperOrigin-RevId: 226077043
2018-12-18 16:19:55 -08:00
Allen Lavoie
66ca3cd10d Add a functional saver, use it for object-based checkpointing
Pulls some utilities out of saver.py that are necessary to actually use it. The functional saver takes only SaveableObjects, so these utilities convert whatever users pass in into SaveableObjects.

One other code move for object-based checkpointing to avoid circular imports.

Applications which need a SaverDef still use the old Saver. Serialization to SaverDef will be added to this saver in a followup.

Does not actually wrap the new Saver's methods in @tf.function yet, since there are memory issues which need to be fixed first.

PiperOrigin-RevId: 224561069
2018-12-07 12:52:23 -08:00
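
The primitive underneath, sketched with raw ops: a "functional" saver boils down to pure functions over tensors, which is what makes them traceable later. (This is a simplified view, not the saver's actual code.)

```python
import tensorflow as tf

def functional_save(prefix, tensors):
    # tensors: dict of checkpoint key -> Tensor. SaveV2 is the op that
    # SaveableObjects ultimately feed.
    names = list(tensors)
    tf.raw_ops.SaveV2(
        prefix=prefix,
        tensor_names=names,
        shape_and_slices=[""] * len(names),
        tensors=[tensors[n] for n in names])

def functional_restore(prefix, dtypes):
    # dtypes: dict of checkpoint key -> dtype of the saved tensor.
    names = list(dtypes)
    values = tf.raw_ops.RestoreV2(
        prefix=prefix,
        tensor_names=names,
        shape_and_slices=[""] * len(names),
        dtypes=[dtypes[n] for n in names])
    return dict(zip(names, values))

functional_save("/tmp/ckpt/fn", {"v": tf.constant([1.0, 2.0])})
print(functional_restore("/tmp/ckpt/fn", {"v": tf.float32})["v"])
```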