When objects are loaded from the SavedModel, they don't retain their `_gather_saveables_for_checkpoint` functions, which can result in values not being loaded from the checkpoint.
This CL adds a field in the SavedModel proto that stores a save and restore function for each SaveableObject in each node. When loading into Python, the SaveableObjects are restored using the functions.
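A rough sketch of the pattern whose restore logic used to be dropped (class and attribute names here are hypothetical):

    import tensorflow as tf
    from tensorflow.python.training.saving import saveable_object

    class _StateSaveable(saveable_object.SaveableObject):
      """Hypothetical SaveableObject wrapping a module's extra state."""

      def __init__(self, obj, name):
        spec = saveable_object.SaveSpec(
            tensor=obj.state, slice_spec="", name=name)
        super(_StateSaveable, self).__init__(obj.state, [spec], name)
        self._obj = obj

      def restore(self, restored_tensors, restored_shapes):
        # This is the logic that used to vanish after a SavedModel round
        # trip; the new proto field records it as save/restore functions.
        return self._obj.state.assign(restored_tensors[0])

    class Stateful(tf.Module):

      def __init__(self):
        self.state = tf.Variable(1.0)

      def _gather_saveables_for_checkpoint(self):
        return {"state": lambda name="state": _StateSaveable(self, name)}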
PiperOrigin-RevId: 318549786
Change-Id: I688c72d7658e1bca98abf373a13a0e15a7fb83e2
This option enables saving and restoring models to or from filesystems
that are only accessible from localhost when using multiple devices.
The option is available to:
- Save models: tf.saved_model.save()
- Checkpoints: tf.train.Checkpoint()
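Example usage (option name as exposed in the TF 2.x API):

    import tensorflow as tf

    module = tf.Module()

    # SavedModel: write files through the localhost job's filesystem.
    tf.saved_model.save(
        module, "/tmp/saved_model",
        options=tf.saved_model.SaveOptions(
            experimental_io_device="/job:localhost"))

    # Checkpoints: same option on tf.train.CheckpointOptions.
    ckpt = tf.train.Checkpoint(module=module)
    ckpt.write(
        "/tmp/ckpt",
        options=tf.train.CheckpointOptions(
            experimental_io_device="/job:localhost"))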
PiperOrigin-RevId: 307858098
Change-Id: I4cd0a81424e306f0eac40bfb30d5067dfc02d1be
Address feedback
Add test for the Python method has_atomic_move
Remove old comment and fix indentation
Remove unnecessary imports
Remove the test which checks for reference cycles when saving. The file system check introduces a conditional op, which itself creates a reference cycle, so the check no longer applies.
Fighting lint
Fix lint errors
Use returned status of hasAtomicMove
Also adding support for saving iterators in a sharded fashion to avoid unnecessary copying during checkpointing.
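Sketch of sharded iterator checkpointing from the user's side (standard tf.data and tf.train.Checkpoint APIs):

    import tensorflow as tf

    dataset = tf.data.Dataset.range(10)
    iterator = iter(dataset)
    next(iterator)  # the iterator's position is part of its state

    # The iterator state is written shard-by-shard next to the variables
    # instead of being funneled through a single device first.
    ckpt = tf.train.Checkpoint(iterator=iterator)
    path = ckpt.save("/tmp/iterator_ckpt")
    ckpt.restore(path)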
PiperOrigin-RevId: 286310419
Change-Id: I1a957af783f7f69753992ce220b59eb43df2c02f
Registers a single constant tensor in order to conform to the SaveableObject API; I feel that's cleaner than special casing SaveableHook throughout the codebase.
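Sketched shape of such a hook (details hypothetical):

    import tensorflow as tf
    from tensorflow.python.training.saving import saveable_object

    class LoggingHook(saveable_object.SaveableObject):
      """Hypothetical hook: runs callbacks around save and restore."""

      def __init__(self, name):
        # The single constant below carries no real state; it exists only
        # so this object is a well-formed SaveableObject.
        dummy = tf.constant(0)
        spec = saveable_object.SaveSpec(
            tensor=dummy, slice_spec="", name=name)
        super(LoggingHook, self).__init__(dummy, [spec], name)

      def before_save(self):
        print("about to save")

      def after_restore(self):
        print("just restored")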
PiperOrigin-RevId: 280708433
Change-Id: I5872949eca35c7fe3dcc401c52a63b66a141d865
Calling ops.internal_convert_to_tensor is more efficient than calling
ops.convert_to_tensor, since it skips deprecated_argument_lookup and has
less Python function-call overhead. We therefore swap the names of these
functions so that most code paths can be optimized.
PiperOrigin-RevId: 274321742
1) Test central storage strategy against a numpy dataset.
2) Implement non-overridden methods from Variable in AggregatingVariable (see the delegation sketch below).
3) Make ResourceVariableSaveable support `is_resource_variable` instances.
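The delegation in item 2 looks roughly like this (illustrative, not the real class):

    class AggregatingVariable(object):
      """Sketch: wraps a variable, forwarding non-overridden behavior."""

      def __init__(self, v):
        self._v = v

      def assign(self, value, **kwargs):
        # Overridden methods insert aggregation-specific logic here.
        return self._v.assign(value, **kwargs)

      def __getattr__(self, name):
        # Anything not explicitly overridden falls through to the wrapped
        # variable, so this behaves like a tf.Variable everywhere else.
        return getattr(self._v, name)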
PiperOrigin-RevId: 243518258
In approximately decreasing order of significance:
1) Cache various to_string, from_string, and string-to-string functionality in device.py (see the sketch after this list).
2) Optimize DeviceSpec.to_string to reduce unnecessary string copies.
3) Skip no-op device assignments when creating ops (when possible).
4) Remove hash caching in DeviceSpec (since it can now be computed much more cheaply) which allows less aggressive locking.
5) Misc finesse around high-traffic functions (millions of calls).
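The caching in item 1 is plain memoization; a minimal sketch (not the actual device.py code):

    import functools

    # Device strings repeat heavily across op creation, so parse each
    # distinct string only once and reuse the result.
    @functools.lru_cache(maxsize=None)
    def _parse_device_string(spec_string):
      components = {}
      for part in spec_string.split("/"):
        if ":" in part:
          key, _, value = part.partition(":")
          components[key] = value
      return components

    _parse_device_string("/job:worker/replica:0/task:0/device:GPU:0")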
PiperOrigin-RevId: 242996847
Each worker will do its own local read and write operations rather than copying to one device. This assumes a shared filesystem for all tasks.
Largely a copy-and-paste job from tf.train.Saver, except that to figure out the proper sharding when executing eagerly we need a device up front when restoring, so SaveSpecs with callable ops now require devices. Previously we evaluated the save Tensor and checked its device in order to shard restores, but when executing eagerly that means allocating lots of unused memory.
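The SaveSpec change looks roughly like this (constructor details hedged):

    import tensorflow as tf
    from tensorflow.python.training.saving import saveable_object

    v = tf.Variable(1.0)
    # The tensor is a callable evaluated only at save time; declaring the
    # device up front lets restores be sharded without materializing the
    # value eagerly just to inspect its device.
    spec = saveable_object.SaveSpec(
        tensor=v.read_value,  # callable, not a Tensor
        slice_spec="",
        name="v",
        dtype=tf.float32,
        device=v.device)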
PiperOrigin-RevId: 240629173
We can't update the Checkpoint proto state (e.g. for tf.train.latest_checkpoint), but this at least throws an informative error and gives the user the option to write a checkpoint from a function without updating the Checkpoint proto.
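A sketch of the write-from-a-function path this leaves open:

    import tensorflow as tf

    ckpt = tf.train.Checkpoint(v=tf.Variable(1.0))

    @tf.function
    def checkpoint_step():
      # save() would have to update the Checkpoint proto consumed by
      # tf.train.latest_checkpoint, which can't happen here; write()
      # skips that bookkeeping, so it works inside a function.
      ckpt.write("/tmp/func_ckpt")

    checkpoint_step()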
Restore is more complicated, since we do restore-on-create. I think the right thing to do is to inherit the eagerness of the context where restore() was first called. Too much for this change.
PiperOrigin-RevId: 233082153
In general, if a Python string is in the checkpoint but not used directly in the saving program, assert_consumed() will pass even if the attribute is totally absent on restore.
This should fix checkpoint compatibility with saved_model.load() even without proper Keras revived types. There's no reason it should fail if those aren't available for some reason.
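Typical use of the restore status (standard API):

    import tensorflow as tf

    path = tf.train.Checkpoint(v=tf.Variable(0.0)).save("/tmp/ckpt")

    status = tf.train.Checkpoint(v=tf.Variable(1.0)).restore(path)
    # Passes even when a Python-string attribute recorded in the
    # checkpoint has no counterpart in the restoring program.
    status.assert_consumed()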
PiperOrigin-RevId: 229448783
Should be faster when reading from distributed file systems. Does not affect cases where restore-on-create is necessary, but as long as variable objects have been created and tracked before restore() their reads should be batched together.
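Concretely, creating and tracking the variables before restore() is what enables the batching:

    import tensorflow as tf

    # Both variables exist and are tracked before restore(), so their
    # reads from the (possibly remote) filesystem can be issued together
    # rather than one at a time.
    root = tf.train.Checkpoint(a=tf.Variable(0.0), b=tf.Variable(0.0))
    root.restore(tf.train.latest_checkpoint("/tmp/ckpts"))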
PiperOrigin-RevId: 227911381
Just omits the function decorator for now. This is pretty terrible and we should fix it, but it will need some work on the TPU side.
Spoke to iga@. Apparently the CPU annotations don't work because the function captures a resource which is on the TPU (and so the eager placer puts the call op on the TPU). One option is to then XLA-compile the function, although that fails right now because we're trying to save strings and XLA doesn't have a kernel for that.
I should also follow up with TPU+checkpointing integration tests.
PiperOrigin-RevId: 226390521
Replaces the restore() code with tf.train.Saver's bulk restore logic, which was its default. I only noticed because apparently the other path fails on some saveables, and the restore code gets more thoroughly tested via to_proto.
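The bulk path boils down to one restore op for everything, roughly (tensor keys illustrative):

    import tensorflow as tf

    # One RestoreV2 op returns all tensors at once instead of issuing a
    # separate restore op per SaveableObject.
    tensors = tf.raw_ops.RestoreV2(
        prefix="/tmp/ckpt-1",
        tensor_names=["v1/.ATTRIBUTES/VARIABLE_VALUE",
                      "v2/.ATTRIBUTES/VARIABLE_VALUE"],
        shape_and_slices=["", ""],
        dtypes=[tf.float32, tf.float32])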
PiperOrigin-RevId: 226077043
Pulls some utilities out of saver.py which are necessary to actually use it. The functional saver takes only SaveableObjects, so these are utilities for taking a list of whatever users pass in and converting them to those.
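Roughly how the moved utilities get used (module and function names as in this refactor; treat as illustrative):

    import tensorflow as tf
    from tensorflow.python.training.saving import saveable_object_util

    v = tf.Variable(1.0, name="v")
    # Map whatever the user passed (variables here) to checkpoint names,
    # then expand to the SaveableObjects the functional saver requires.
    names_to_saveables = saveable_object_util.op_list_to_dict([v])
    saveables = saveable_object_util.validate_and_slice_inputs(
        names_to_saveables)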
One other code move for object-based checkpointing to avoid circular imports.
Applications which need a SaverDef still use the old Saver. Serialization to SaverDef will be added to this saver in a followup.
Does not actually wrap the new Saver's methods in @tf.function yet, since there are memory issues which need to be fixed first.
PiperOrigin-RevId: 224561069