
TensorFlow Distribute Libraries

Overview

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines or TPUs. Using this API, users can distribute their existing models and training code with minimal code changes.

It can be used with TensorFlow's high-level APIs, tf.keras and tf.estimator, by changing only a few lines of code. It does so by making the underlying components of TensorFlow strategy-aware, including variables, layers, models, optimizers, metrics, summaries, and checkpoints.
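
For example, a variable created inside a strategy's scope becomes a distributed variable. The sketch below is illustrative only; it assumes a machine with zero or more GPUs (MirroredStrategy falls back to CPU if none are found), and the exact wrapper type is an implementation detail.

import tensorflow as tf

# Variables created under a strategy scope are strategy-aware. With
# MirroredStrategy, each replica gets its own copy that is kept in sync.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  v = tf.Variable(1.0)

# Outside the scope the variable still behaves like a regular tf.Variable,
# but reads and updates are coordinated across replicas by the strategy.
print(type(v).__name__)  # e.g. MirroredVariable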

Documentation

Distributed Training Guide

Distributed Training With Keras Tutorial

Distributed Training With Custom Training Loops Tutorial

Multi-worker Training With Keras Tutorial

Multi-worker Training With Estimator Tutorial

Save and Load with Distribution Strategy

Simple Examples

Using Keras compile/fit with GPUs.

# Create the strategy instance. It will automatically detect all the GPUs.
mirrored_strategy = tf.distribute.MirroredStrategy()

# Create and compile the keras model under strategy.scope()
with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

# Call model.fit and model.evaluate as before.
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=2)
model.evaluate(dataset)
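
The same compile/fit workflow can scale to multiple workers by swapping in a different strategy. A rough sketch, assuming each worker process has a TF_CONFIG environment variable describing the cluster:

# Sketch only: MultiWorkerMirroredStrategy reads the cluster layout from the
# TF_CONFIG environment variable set on each worker process.
multi_worker_strategy = tf.distribute.MultiWorkerMirroredStrategy()

with multi_worker_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

# model.fit and model.evaluate are then called exactly as in the
# single-machine example above.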

Custom training loop with TPUs.

# Create the strategy instance. A TPUClusterResolver locates the TPU to use;
# connect to it and initialize the TPU system before building the strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
tpu_strategy = tf.distribute.TPUStrategy(resolver)

# Create the keras model under strategy.scope()
with tpu_strategy.scope():
  model = tf.keras.layers.Dense(1, name="dense")

# Create custom training loop body as tf.function.
@tf.function
def train_step(iterator):
  def step_fn(inputs):
    images, targets = inputs
    with tf.GradientTape() as tape:
      outputs = model(images)
      loss = tf.reduce_sum(outputs - targets)
    grads = tape.gradient(loss, model.variables)
    return grads

  return tpu_strategy.run(
      step_fn, args=(next(iterator),))

# Run the loop body once on a dataset.
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
input_iterator = iter(tpu_strategy.experimental_distribute_dataset(dataset))
train_step(input_iterator)
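
In a full training loop the gradients are usually applied inside step_fn rather than returned. Below is a hedged extension of the example above, assuming a Keras optimizer created under the same strategy scope:

# Sketch: apply the gradients on each replica. The optimizer must be created
# under the same strategy scope as the model so its slot variables are
# distributed too.
with tpu_strategy.scope():
  optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_step_and_update(iterator):
  def step_fn(inputs):
    images, targets = inputs
    with tf.GradientTape() as tape:
      outputs = model(images)
      loss = tf.reduce_sum(outputs - targets)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

  return tpu_strategy.run(step_fn, args=(next(iterator),))

train_step_and_update(input_iterator)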

Testing

Tests here should cover all distribution strategies to ensure feature parity. This can be done using the test decorators in strategy_combinations.py.
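
A typical test is parameterized over strategies with these decorators. A rough sketch, assuming the combinations and strategy_combinations modules in this directory; names and the exact test main may differ from what a given test file uses.

from absl.testing import parameterized
import tensorflow as tf

from tensorflow.python.distribute import combinations
from tensorflow.python.distribute import strategy_combinations
from tensorflow.python.platform import test


class ReduceTest(test.TestCase, parameterized.TestCase):

  @combinations.generate(
      combinations.combine(
          strategy=[
              strategy_combinations.default_strategy,
              strategy_combinations.mirrored_strategy_with_gpu_and_cpu,
          ],
          mode=["eager"]))
  def test_replica_sum(self, strategy):
    # The decorator runs this test once per strategy in the list.
    per_replica = strategy.run(lambda: tf.constant(1.0))
    total = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
    self.assertEqual(self.evaluate(total), strategy.num_replicas_in_sync)


if __name__ == "__main__":
  test.main()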