
# TensorFlow Distribute Libraries

## Overview

`tf.distribute.Strategy` is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, users can distribute their existing models and training code with minimal code changes.

It can be used with TensorFlow's high-level APIs, `tf.keras` and `tf.estimator`, with only a few lines of code changed. It does so by making the underlying components of TensorFlow strategy-aware; this includes variables, layers, models, optimizers, metrics, summaries, and checkpoints.
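
Concretely, variables created inside a strategy scope become distributed variables whose updates are kept in sync across replicas. A minimal sketch (the printed type is illustrative and depends on the strategy and the devices available):

```python
import tensorflow as tf

# With MirroredStrategy, variables created under the scope are mirrored
# onto every replica device and kept in sync.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  v = tf.Variable(1.0)

print(type(v).__name__)  # e.g. "MirroredVariable"
```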

## Documentation

*   Distributed Training Guide
*   Distributed Training With Keras Tutorial
*   Distributed Training With Custom Training Loops Tutorial
*   Multi-worker Training With Keras Tutorial
*   Multi-worker Training With Estimator Tutorial
*   Save and Load with Distribution Strategy

## Simple Examples

### Using `compile`/`fit` with GPUs

```python
import tensorflow as tf

# Create the strategy instance. It will automatically detect all the GPUs.
mirrored_strategy = tf.distribute.MirroredStrategy()

# Create and compile the Keras model under strategy.scope().
with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

# Call model.fit and model.evaluate as before.
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=2)
model.evaluate(dataset)
```

### Custom training loop with TPUs

```python
import tensorflow as tf

# Connect to the TPU cluster and create the strategy instance.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
tpu_strategy = tf.distribute.TPUStrategy(resolver)

# Create the model under strategy.scope().
with tpu_strategy.scope():
  model = tf.keras.layers.Dense(1, name="dense")

# Define the custom training loop body as a tf.function.
@tf.function
def train_step(iterator):
  def step_fn(inputs):
    images, targets = inputs
    with tf.GradientTape() as tape:
      outputs = model(images)
      loss = tf.reduce_sum(outputs - targets)
    grads = tape.gradient(loss, model.variables)
    return grads

  return tpu_strategy.run(
      step_fn, args=(next(iterator),))

# Run the loop body once on the input dataset.
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
input_iterator = iter(tpu_strategy.experimental_distribute_dataset(dataset))
train_step(input_iterator)
```
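
The loop body above only computes gradients. A real training loop would typically create an optimizer under the same scope, apply the gradients inside `step_fn`, and aggregate the per-replica losses with `strategy.reduce`. A minimal sketch of that extension (the SGD optimizer and learning rate are illustrative choices, not part of the original example):

```python
with tpu_strategy.scope():
  optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_and_apply(iterator):
  def step_fn(inputs):
    images, targets = inputs
    with tf.GradientTape() as tape:
      outputs = model(images)
      loss = tf.reduce_sum(outputs - targets)
    grads = tape.gradient(loss, model.trainable_variables)
    # By default, apply_gradients aggregates the gradients across replicas.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

  per_replica_loss = tpu_strategy.run(step_fn, args=(next(iterator),))
  # Sum the per-replica losses into a single scalar.
  return tpu_strategy.reduce(
      tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)
```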

## Testing

Tests here should cover all distribution strategies to ensure feature parity. This can be done using the test decorators in `strategy_combinations.py`.
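
A parameterized test looks roughly like the sketch below; the class and method names are illustrative, while `combinations.generate`, `combinations.combine`, and `strategy_combinations.all_strategies` come from the modules in this directory:

```python
from absl.testing import parameterized

from tensorflow.python.distribute import combinations
from tensorflow.python.distribute import strategy_combinations
from tensorflow.python.platform import test


class ExampleDistributeTest(test.TestCase, parameterized.TestCase):

  @combinations.generate(
      combinations.combine(
          distribution=strategy_combinations.all_strategies,
          mode=["eager"]))
  def testRunsUnderEachStrategy(self, distribution):
    # The body runs once per strategy listed in `all_strategies`.
    self.assertGreaterEqual(distribution.num_replicas_in_sync, 1)


if __name__ == "__main__":
  test.main()
```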