
# TensorFlow Distribute Libraries

## Overview

`tf.distribute.Strategy` is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, users can distribute their existing models and training code with minimal code changes.

It can be used with TensorFlow's high-level APIs, `tf.keras` and `tf.estimator`, with only a few lines of code changed. It does so by making the underlying components of TensorFlow strategy-aware; this includes variables, layers, models, optimizers, metrics, summaries, and checkpoints.
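
Concretely, variables created inside a strategy scope become distributed variables whose updates are kept in sync across replicas. A minimal sketch (the printed type is illustrative and depends on the strategy and the devices available):

```python
import tensorflow as tf

# With MirroredStrategy, variables created under the scope are mirrored
# onto every replica device and kept in sync.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  v = tf.Variable(1.0)

print(type(v).__name__)  # e.g. "MirroredVariable"
```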

## Documentation

*   Distributed Training Guide
*   Distributed Training With Keras Tutorial
*   Distributed Training With Custom Training Loops Tutorial
*   Multi-worker Training With Keras Tutorial
*   Multi-worker Training With Estimator Tutorial
*   Save and Load with Distribution Strategy

## Simple Examples

### Using `compile`/`fit` with GPUs

```python
import tensorflow as tf

# Create the strategy instance. It will automatically detect all the GPUs.
mirrored_strategy = tf.distribute.MirroredStrategy()

# Create and compile the Keras model under strategy.scope().
with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

# Call model.fit and model.evaluate as before.
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=2)
model.evaluate(dataset)
```

### Custom training loop with TPUs

```python
import tensorflow as tf

# Connect to the TPU cluster and create the strategy instance.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
tpu_strategy = tf.distribute.TPUStrategy(resolver)

# Create the model under strategy.scope().
with tpu_strategy.scope():
  model = tf.keras.layers.Dense(1, name="dense")

# Define the custom training loop body as a tf.function.
@tf.function
def train_step(iterator):
  def step_fn(inputs):
    images, targets = inputs
    with tf.GradientTape() as tape:
      outputs = model(images)
      loss = tf.reduce_sum(outputs - targets)
    grads = tape.gradient(loss, model.variables)
    return grads

  return tpu_strategy.run(
      step_fn, args=(next(iterator),))

# Run the loop body once on the input dataset.
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
input_iterator = iter(tpu_strategy.experimental_distribute_dataset(dataset))
train_step(input_iterator)
```
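
The loop body above only computes gradients. A real training loop would typically create an optimizer under the same scope, apply the gradients inside `step_fn`, and aggregate the per-replica losses with `strategy.reduce`. A minimal sketch of that extension (the SGD optimizer and learning rate are illustrative choices, not part of the original example):

```python
with tpu_strategy.scope():
  optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_and_apply(iterator):
  def step_fn(inputs):
    images, targets = inputs
    with tf.GradientTape() as tape:
      outputs = model(images)
      loss = tf.reduce_sum(outputs - targets)
    grads = tape.gradient(loss, model.trainable_variables)
    # By default, apply_gradients aggregates the gradients across replicas.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

  per_replica_loss = tpu_strategy.run(step_fn, args=(next(iterator),))
  # Sum the per-replica losses into a single scalar.
  return tpu_strategy.reduce(
      tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)
```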

## Testing

Tests here should cover all distribution strategies to ensure feature parity. This can be done using the test decorators in `strategy_combinations.py`.
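
A parameterized test looks roughly like the sketch below; the class and method names are illustrative, while `combinations.generate`, `combinations.combine`, and `strategy_combinations.all_strategies` come from the modules in this directory:

```python
from absl.testing import parameterized

from tensorflow.python.distribute import combinations
from tensorflow.python.distribute import strategy_combinations
from tensorflow.python.platform import test


class ExampleDistributeTest(test.TestCase, parameterized.TestCase):

  @combinations.generate(
      combinations.combine(
          distribution=strategy_combinations.all_strategies,
          mode=["eager"]))
  def testRunsUnderEachStrategy(self, distribution):
    # The body runs once per strategy listed in `all_strategies`.
    self.assertGreaterEqual(distribution.num_replicas_in_sync, 1)


if __name__ == "__main__":
  test.main()
```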