Commit Graph

382 Commits

Author SHA1 Message Date
A. Unique TensorFlower
034633f23b PY2 removal cleanup
PiperOrigin-RevId: 352106691
Change-Id: I382d53c64f0d29da430b8cb6d2395a2cb281509e
2021-01-15 16:48:57 -08:00
Scott Zhu
c4aab50762 Move the LossReduction class from tf to Keras.
PiperOrigin-RevId: 351654535
Change-Id: I405d4f79568c05ddeaa3755c3171960db633e6de
2021-01-13 13:40:14 -08:00
Jeremy Lau
b6c59262a8 Temporarily disable flaky input_lib_test under tsan.
PiperOrigin-RevId: 351641375
Change-Id: I24b3ccb0f8080d9e85e25faf224ad8e3b363b1d0
2021-01-13 12:40:04 -08:00
Revan Sopher
58fa23c36c Disable random_generator_test on Cloud TPU.
PiperOrigin-RevId: 351631939
Change-Id: I990a305c084f6a35752da2b6a6e7ed8555cf7a24
2021-01-13 12:03:26 -08:00
Ran Chen
3b5bb5a706 Add convenient methods to write test combinations with and without tf.function
Instead of using def_function, tests can be parameterized with these two
objects to cover both tf.function and eager execution.

PiperOrigin-RevId: 351420316
Change-Id: I037d1678ca843f6df88694981efd4519c2947cd3
2021-01-12 12:30:57 -08:00
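
For illustration, a minimal sketch of the pattern described above: parameterizing a test over a wrapper that either leaves the function eager or compiles it with tf.function. The wrapper names here are hypothetical, not the actual combination objects added by this commit.

    import tensorflow as tf
    from absl.testing import parameterized

    def no_tf_function(fn):
      return fn                  # run the body eagerly

    def with_tf_function(fn):
      return tf.function(fn)     # compile the body into a tf.function

    class AddTest(tf.test.TestCase, parameterized.TestCase):

      @parameterized.named_parameters(
          ("eager", no_tf_function),
          ("tf_function", with_tf_function))
      def test_add(self, wrap):
        fn = wrap(lambda x, y: x + y)
        self.assertAllEqual(fn(tf.constant(1), tf.constant(2)), 3)

    if __name__ == "__main__":
      tf.test.main()
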
Peng Wang
587ac71f68 Allows creating tf.random.Generator under distribution-strategy scopes. Different replicas will get different random-number streams.
All strategies are supported except for CentralStorageStrategy and ParameterServerStrategy.

This CL also removes the CompositeTensor superclass from Generator. Generator is a wrapper around tf.Variable, and because tf.Variable is not a CompositeTensor, Generator can't be a CompositeTensor in theory. Previously we made it a CompositeTensor by returning Variable.handle, but that breaks down when the variable is a DistributedVariable (in cross-replica context).

PiperOrigin-RevId: 350851648
Change-Id: I5f4d77ddb990557fcc9c7336987203ecdaec5b9a
2021-01-08 15:58:22 -08:00
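
For illustration, a minimal sketch of what this enables, assuming a supported strategy such as MirroredStrategy on a single machine:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
      # Creating the generator under the scope gives each replica its own stream.
      gen = tf.random.Generator.from_seed(1234)

    @tf.function
    def sample():
      # Each replica draws from its own random-number stream.
      return gen.normal(shape=[2])

    per_replica = strategy.run(sample)
    print(strategy.experimental_local_results(per_replica))
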
Chenkai Kuang
12c67c0d47 Raise a meaningful error message when loading a ShardedVariable.
PiperOrigin-RevId: 348539354
Change-Id: I2c4a8466c3d1355ec8e5984ed039194c18c4305c
2020-12-21 15:59:59 -08:00
TensorFlower Gardener
ee11cf13ea Merge pull request from ROCmSoftwarePlatform:google_upstream_rocm_add_remove_no_rocm_tag
PiperOrigin-RevId: 348425014
Change-Id: I1f4ec5ab94f27ea201a9dbc8b53d337d03e01d6c
2020-12-21 00:32:48 -08:00
Yanhua Sun
fcef8dd078 Disable failing tsan test
PiperOrigin-RevId: 348115144
Change-Id: I71af99a588804dec96923781d52c0aace9543b9e
2020-12-17 16:33:37 -08:00
Yanhua Sun
39cbd155f4 Disable failing asan test
PiperOrigin-RevId: 347902281
Change-Id: I8ed7c6589627a6f3e9fd4e3c2566c1e6b3f57649
2020-12-16 14:55:32 -08:00
Yanhua Sun
112778a924 add BUILD file for python/util and refactor python/BUILD
PiperOrigin-RevId: 347652147
Change-Id: I97aa24d6a109ed5aec8fca9a9b5e3b2b1a96b760
2020-12-15 11:43:18 -08:00
Deven Desai
6ccddc5d3f Removing no_rocm tag from unit-tests that are now passing on the ROCm platform 2020-11-25 02:06:27 +00:00
Ran Chen
239a96c0ac Enable NCCL for all all-reduces
PiperOrigin-RevId: 344134571
Change-Id: I7234c42d196716570c820c1714376a2b4311cc06
2020-11-24 14:55:12 -08:00
Priya Gupta
b0d40302ec tf.distribute: Move old/unused all_reduce util to v1/
PiperOrigin-RevId: 343975576
Change-Id: Idb77013708cf40fb21127adf6e5f65f063f16029
2020-11-23 19:46:50 -08:00
Chenkai Kuang
ace1f5d6c9 Internal test change.
PiperOrigin-RevId: 343524786
Change-Id: Ifaafaa83b7e6b7c3f3d30744b0b65ea257013a48
2020-11-20 11:03:15 -08:00
Ran Chen
de33613977 [rollforward] Guess test binary path from TEST_TARGET env var
PiperOrigin-RevId: 343132917
Change-Id: I6af62b595875070dac9e47f22564852dd4976252
2020-11-18 16:41:53 -08:00
Ran Chen
3d03ae2086 [rollback] Guess test binary path from TEST_TARGET env var
PiperOrigin-RevId: 343121732
Change-Id: I8945009c09f1d6b555000db1f8c17452cb0e6d64
2020-11-18 16:21:13 -08:00
Ran Chen
448e080c75 Guess test binary path from TEST_TARGET env var
PiperOrigin-RevId: 342941997
Change-Id: Ibb7d8da784b869927c26f9eb4fe7564aee05a21d
2020-11-17 14:11:55 -08:00
Peng Wang
ce6f0682d2 Some internal changes
PiperOrigin-RevId: 341884008
Change-Id: I585bf735d032503679f45f71f9820784e339b012
2020-11-11 12:25:10 -08:00
Ran Chen
07b9eccccf Order NCCL all-reduce with ordering token
Auto control dependencies will chain operations with the same resource input. We'll do the same for all-gather after some refactoring is done.

PiperOrigin-RevId: 341868107
Change-Id: I5570a28c2e1c638980e3509088c0525e957c463b
2020-11-11 11:18:30 -08:00
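
The mechanism relied on above, shown in a toy form: inside a tf.function, automatic control dependencies serialize stateful ops that touch the same resource. Here an ordinary tf.Variable stands in for the ordering token; the real change threads such a token through the NCCL all-reduce ops.

    import tensorflow as tf

    token = tf.Variable(0)  # stands in for the ordering-token resource

    @tf.function
    def ordered_updates():
      # Both ops use the same resource, so auto control dependencies chain them
      # and they execute in program order instead of racing.
      token.assign_add(1)
      token.assign_add(10)
      return token.read_value()

    print(ordered_updates().numpy())  # 11
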
Scott Zhu
a861238b03 Fork keras related sharded_variable_test to keras/distribute.
TF distribute shouldn't rely on any keras code.

PiperOrigin-RevId: 340483958
Change-Id: I4c3774dce1e914dc1f257d13117420a3fb9b3406
2020-11-03 11:18:23 -08:00
Anna R
a877252856 Disable tests that flakily time out in the tensorflow.cuda_asan project.
PiperOrigin-RevId: 339922525
Change-Id: I88c0d616154484ddd353d262f2fe7408cfc058c4
2020-10-30 12:55:46 -07:00
Anna R
1b9ba7cb6e Disable more tests that time out with cuda asan.
PiperOrigin-RevId: 339157584
Change-Id: I834a765fbcb1aa782756c0ef92ed7bb35feb176a
2020-10-26 18:12:25 -07:00
Scott Zhu
b8f0d8c418 Move the distribute_file_utils to Keras.
The only usage of it is from Keras callbacks.

PiperOrigin-RevId: 339128802
Change-Id: I70e7911f678404680d40ed9b5d2bdbb1b4d4ff5a
2020-10-26 15:29:19 -07:00
Ran Chen
3a62bc8168 Use the same worker pool for tests that require the same number of workers
We used to use one worker pool per strategy combination, but that's not necessary:
if the cluster topology is the same, tests can share a worker pool. This
reduces the overhead of initializing worker pools, which can take O(10s) for
GPU builds.

PiperOrigin-RevId: 339117353
Change-Id: I1f631f79597b07991528c77482c44c201a01abe4
2020-10-26 14:31:49 -07:00
Ran Chen
5fa0d44596 Enable a few tests on py3.8 version
PiperOrigin-RevId: 339076865
Change-Id: I6b622191b1f952f0b749cd4592046c4f03391c08
2020-10-26 11:35:47 -07:00
Anna R
4a1e7e0f66 Disable tests that time out with cuda asan.
PiperOrigin-RevId: 338706453
Change-Id: If24250d7893df112a01026c7eb40ecd4536ea976
2020-10-23 11:09:48 -07:00
Kibeom Kim
0c67638ac2 Remove deprecated tfrt_enabled test target flag.
PiperOrigin-RevId: 338530097
Change-Id: I0bd2ad366210330ece06f99a4fdb16de395ece05
2020-10-22 12:55:06 -07:00
A. Unique TensorFlower
4e1e1499fe Disable a few failing tests on py3.8 version
PiperOrigin-RevId: 338381183
Change-Id: I95b58ff09033376936e05364c6ec35ec38389ea0
2020-10-21 19:31:09 -07:00
Chenkai Kuang
785353f8d4 Use op dispatch to override the behavior of embedding_lookup ops when called with a ShardedVariable. Otherwise the ShardedVariable would be converted to a dense tensor when passed to embedding_lookup.
Ops like `tf.nn.nce_loss` and `tf.nn.sampled_softmax_loss` also benefit from this, as they use embedding_lookup internally.

PiperOrigin-RevId: 338369985
Change-Id: I89ebe2a452fc1d599567cb80e80ee9b023e5aa1c
2020-10-21 18:02:03 -07:00
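
For context, tf.nn.embedding_lookup already understands a list of shard variables; the dispatch above makes a ShardedVariable take that path instead of being densified. A self-contained sketch with plain variables (no ParameterServerStrategy cluster is assumed):

    import tensorflow as tf

    # Two shards of a 4 x 3 embedding table, split along the first axis.
    shard0 = tf.Variable(tf.reshape(tf.range(0., 6.), [2, 3]))
    shard1 = tf.Variable(tf.reshape(tf.range(6., 12.), [2, 3]))

    # embedding_lookup routes each id to the shard that owns it instead of
    # first concatenating the shards into one dense tensor.
    ids = tf.constant([0, 3])
    print(tf.nn.embedding_lookup([shard0, shard1], ids).numpy())
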
Chenkai Kuang
e185142f34 Replace v1 partitioners with newly exported partitioners symbol in ps tests and example code.
PiperOrigin-RevId: 338301301
Change-Id: Ie83b62c5b16db684be04aabd55603fed0c2ffaaa
2020-10-21 11:50:53 -07:00
Ran Chen
9f51b98f0b [retry] DistributedDataset creates elements with fixed spec to help avoid retracing
tf.function tracing depends on the inputs to the function. For a typical training loop:

x, y = next(iter)
train_fn(x,y)

it may retrace when it gets partial batches. This is problematic for multi-client training, since different clients may retrace at different times. We assign collective instance keys when tracing a function, so retracing results in different sets of instance keys.

This change overrides the PerReplica type spec, which is used to calculate the function cache key. This avoids retracing in common cases, but it doesn't guarantee that retracing won't happen.

Note that after this change the function also gets only partial shape information. This is the reason we only do it for the multi-client strategy (MWMS), to avoid a performance penalty to e.g. TPU.

PiperOrigin-RevId: 338203534
Change-Id: Iae9d6c3c82113d623707e19142fbebe5597d7898
2020-10-20 22:48:17 -07:00
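
The retracing trigger described above, reduced to a toy example with a plain dataset (the commit's fix applies to DistributedDataset/PerReplica values under multi-client strategies, which this sketch does not construct):

    import tensorflow as tf

    trace_count = 0

    @tf.function
    def train_step(x):
      global trace_count
      trace_count += 1          # Python side effect: runs once per trace
      return tf.reduce_sum(x)

    # 10 examples with batch size 4 -> the last batch has only 2 examples.
    for x in tf.data.Dataset.range(10).batch(4):
      train_step(x)

    print(trace_count)  # 2: one trace for the full batch, one for the partial batch
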
Pavithra Vijay
fbcdf129b9 Disable broken failing test
PiperOrigin-RevId: 338200487
Change-Id: If0d5b224326843446333d81bb47a45387fe2232f
2020-10-20 22:16:58 -07:00
Mihai Maruseac
fd914a2b02 Disable broken Windows test
PiperOrigin-RevId: 338180192
Change-Id: Ic3f582bba112b7f0f73fac52617fc8425553aa5b
2020-10-20 19:04:05 -07:00
Ran Chen
0e14b0fdc4 [retry] Graduate MultiWorkerMirroredStrategy out of experimental
Over the past months we've made several improvements:
  - Test coverage is now on par with other strategies.
  - Peer failure will no longer cause the cluster to hang.
  - Major issues with saving are fixed.
  - gather() API is added.

PiperOrigin-RevId: 338175223
Change-Id: I3c52a4d53d1c487558f1caaae7d094fe2245183b
2020-10-20 18:16:47 -07:00
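
At the call site, the graduation means the strategy is constructed from the stable namespace; the old experimental alias is shown for comparison:

    import tensorflow as tf

    # Stable symbol after this change:
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Previously:
    # strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
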
Chenkai Kuang
3438443618 Add and export V2 partitioner methods in tf.distribute namespace.
compat.v1 partitioners are left unchanged. V2 partitioners are exported in the tf.distribute namespace, as they are meant to work with sharded variables, which are a tf.distribute concept. The implementations of the partitioners are reused.

While at it, we also took the opportunity to refine the naming:

- variable_axis_size_partitioner -> MaxSizePartitioner (partitioner that keeps shards under a maximum size)
- min_max_variable_partitioner -> MinSizePartitioner (partitioner that allocates shards above a minimum size)
- fixed_size_partitioner -> FixedShardsPartitioner (partitioner that allocates a fixed number of shards).

PiperOrigin-RevId: 338157380
Change-Id: I19f517e38f20e4e9c85745863e764da0aad6eeeb
2020-10-20 16:15:55 -07:00
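
A small sketch of the renamed partitioners, based on the exported names above; a partitioner is a callable mapping a shape and dtype to the number of shards per axis:

    import tensorflow as tf

    partitioner = tf.distribute.experimental.partitioners.FixedShardsPartitioner(
        num_shards=2)

    # Two shards along axis 0, no partitioning along axis 1.
    print(partitioner(tf.TensorShape([10, 3]), tf.float32))  # [2, 1]
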
Rick Chao
32f35aabce PSv2: Export a few tf.distribute symbols related to TF2 parameter server training.
This change exports the following class symbols, and adds relevant documentation and example code to:

tf.distribute.experimental.ParameterServerStrategy
tf.distribute.experimental.coordinator.ClusterCoordinator
tf.distribute.experimental.coordinator.PerWorkerValues
tf.distribute.experimental.coordinator.RemoteValue

PiperOrigin-RevId: 338151262
Change-Id: If2d1c513d30a999c728cecc2e73b75adda1948c2
2020-10-20 15:42:17 -07:00
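
A compressed sketch of how the exported symbols fit together, assuming a cluster with "worker" and "ps" jobs is already running and described by TF_CONFIG (cluster setup omitted):

    import tensorflow as tf

    cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
    coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

    with strategy.scope():
      v = tf.Variable(0.0)      # placed on a parameter server

    @tf.function
    def step():
      v.assign_add(1.0)
      return v.read_value()

    # schedule() dispatches the step to some worker and returns a RemoteValue.
    remote_value = coordinator.schedule(step)
    coordinator.join()          # wait for all scheduled steps to finish
    print(remote_value.fetch())
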
A. Unique TensorFlower
d5ab30ca14 Graduate MultiWorkerMirroredStrategy out of experimental
Over the past months we've made several improvements:
  - Test coverage is now on par with other strategies.
  - Peer failure will no longer cause the cluster to hang.
  - Major issues with saving are fixed.
  - gather() API is added.

PiperOrigin-RevId: 338132035
Change-Id: I384c084717cd5f2b6167668ebe96af0f7b371530
2020-10-20 14:05:53 -07:00
Ran Chen
ebab4d6209 Graduate MultiWorkerMirroredStrategy out of experimental
Over the past months we've made several improvements:
  - Test coverage is now on par with other strategies.
  - Peer failure will no longer cause the cluster to hang.
  - Major issues with saving are fixed.
  - gather() API is added.

PiperOrigin-RevId: 338110984
Change-Id: I92eeb981c67acb0c44f658316b6ad564162508bc
2020-10-20 12:32:12 -07:00
Chen Chen
d9e09b0723 Increase shard_count of strategy_gather_test to avoid timeout.
PiperOrigin-RevId: 337996150
Change-Id: Ifa68a95b8a21f88c5e7fe9dd5f8796afec37a2e3
2020-10-19 22:07:06 -07:00
Ran Chen
fcd8113d1b [rollback] DistributedDataset creates elements with fixed spec to help avoid retracing
PiperOrigin-RevId: 337887709
Change-Id: Ifd2863ca4d0a6f619ba050ff32d37001118f0d7c
2020-10-19 10:52:55 -07:00
Ran Chen
7ba60d5a29 DistributedDataset creates elements with fixed spec to help avoid retracing
tf.function tracing depends on the inputs to the function. For a typical training loop:

x, y = next(iter)
train_fn(x,y)

it may retrace when it gets partial batches. This is problematic for multi-client training, since different clients may retrace at different times. We assign collective instance keys when tracing a function, so retracing results in different sets of instance keys.

This change overrides the PerReplica type spec, which is used to calculate the function cache key. This avoids retracing in common cases, but it doesn't guarantee that retracing won't happen.

Note that after this change the function also gets only partial shape information. This is the reason we only do it for the multi-client strategy (MWMS), to avoid a performance penalty to e.g. TPU.

PiperOrigin-RevId: 337792983
Change-Id: Ib029d61cd360d6a25e38e894913e4d78af20d1dd
2020-10-18 22:25:39 -07:00
Ran Chen
f196a243ea Graduate experimental_hints to options in all_reduce/reduce/batch_reduce
The CollectiveHints class is also renamed to CommunicationOptions. The communication enum is added to it.

CommunicationOptions stays experimental since the detailed options may change, but it's rather clear we need an options argument for these cross-device communications.

PiperOrigin-RevId: 337547832
Change-Id: I376171672698d5923b4e52f2567d4a584c8e21b6
2020-10-16 11:54:24 -07:00
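
What the new options argument looks like at a call site, in a minimal sketch (MirroredStrategy is used here only so the example is self-contained; the options matter most for multi-worker collectives):

    import tensorflow as tf

    options = tf.distribute.experimental.CommunicationOptions(
        bytes_per_pack=50 * 1024 * 1024,  # pack small tensors into ~50MB all-reduces
        implementation=tf.distribute.experimental.CommunicationImplementation.AUTO)

    strategy = tf.distribute.MirroredStrategy()

    @tf.function
    def step():
      def replica_fn():
        ctx = tf.distribute.get_replica_context()
        # `options` replaces the old `experimental_hints` argument.
        return ctx.all_reduce(tf.distribute.ReduceOp.SUM, tf.constant(1.0),
                              options=options)
      return strategy.run(replica_fn)

    print(step())
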
Ran Chen
380478ff5f Disallow saving if the function cannot be used for inference
With a distribution strategy, traced ConcreteFunctions may contain training-specific logic that assumes the variable is a distributed variable. Such functions cannot be used for inference. Since we do not know whether such a ConcreteFunction will be saved for inference, we always mark it as unsaveable unless it is traced under a save context.

The user can use a tf.function instead, which can be retraced during saving.

Impacted usages:
- MultiWorkerMirroredStrategy
  - Reading a synchronization=ON_READ variable. E.g. a batch norm layer.
- MultiWorkerMirroredStrategy, MirroredStrategy, TPUStrategy
  - Updating a variable.
  - Reading a synchronization=ON_READ aggregation=SUM variable.

It's TBD whether we also need to mark functions that use a packed handle as unsaveable. They do contain TPU:0 device annotations, but with soft placement that may not be a problem.

PiperOrigin-RevId: 337438256
Change-Id: Ie89d0d6beb3e71d3ebbb867d1f91f2953468840c
2020-10-15 21:08:51 -07:00
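
A minimal sketch of the suggested workaround: attach a tf.function (rather than a pre-traced ConcreteFunction) so it is retraced under the save context when the SavedModel is written. The model here is a stand-in:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
      model = tf.Module()
      model.v = tf.Variable(1.0)

    @tf.function(input_signature=[tf.TensorSpec([], tf.float32)])
    def serve(x):
      return x * model.v

    model.serve = serve
    # The tf.function is retraced during save, under a save context, so the
    # resulting ConcreteFunction is usable for inference.
    tf.saved_model.save(model, "/tmp/sketch_saved_model",
                        signatures={"serving_default": model.serve})
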
Mihai Maruseac
7c595a218b Disable a few asan tests that fail with -fsanitize-null
PiperOrigin-RevId: 337397375
Change-Id: Iada5ce9193c053599b5f83a2ba2105878577dc05
2020-10-15 15:54:09 -07:00
Chenkai Kuang
d6e0181f1d Make ShardedVariable a composite tensor.
This allows us to:
1. Pass `ShardedVariable` to tf.function inputs while avoiding retracing if the spec of the ShardedVariable doesn't change.
2. Use `nest.flatten(sharded_variable, expand_composites=True)` to retrieve the list of component variables. This is used by `tf.module` and Keras Layer to collect variables from attributes that are nested structures, so this change enables them to collect component variables when a ShardedVariable is assigned to an attribute.

`layer.add_weight` already works; this change adds a test for that.

PiperOrigin-RevId: 337382403
Change-Id: I4c7e490cdc8fd772ed57c4074894637147986dac
2020-10-15 14:32:44 -07:00
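
A sketch of the two behaviors listed above, using the internal ShardedVariable class from this directory (not a public symbol at the time, so the import path is an implementation detail):

    import tensorflow as tf
    from tensorflow.python.distribute.sharded_variable import ShardedVariable

    sv = ShardedVariable([tf.Variable([0.0, 1.0]), tf.Variable([2.0])])

    # Flattening with expand_composites yields the component variables...
    print(tf.nest.flatten(sv, expand_composites=True))

    # ...so tf.Module (and Keras layers) can collect them from attributes.
    m = tf.Module()
    m.sv = sv
    print(len(m.variables))  # 2
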
Rick Chao
e70e880bd5 MultiProcessRunner: Open source multi_process_runner with an OSS backend.
Some tests are timing out and being disabled on tap. Cause TBD.

PiperOrigin-RevId: 337376027
Change-Id: Ia0e58be434ce59469498db3a24c3bd32cc17c023
2020-10-15 14:09:23 -07:00
Richard Uhler
b881485eb5 Fix logic for enabling MLIR bridge depending on has_tensor_list_arg.
It appears that the polarity of the use of has_tensor_list_arg was
inadvertently flipped.

Disable any MLIR-bridge-enabled tests that were only passing because, due to
this issue, they weren't actually using the MLIR bridge.

PiperOrigin-RevId: 337125651
Change-Id: I93e9e61acda9a2aeffaee5cce13e93635d33f5a4
2020-10-14 11:20:34 -07:00
Ran Chen
56124bd1ab Support batching all-reduce with concat and split
Collective v2 doesn't support the scoped allocator. While it's possible to make the scoped allocator work with it, concat/split is much simpler.

PiperOrigin-RevId: 337109439
Change-Id: I3535f5e0b090696f3bb620617f2b57f1f4b78b22
2020-10-14 10:16:07 -07:00
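
The concat/split idea above, sketched with plain tensor ops; tf.add_n stands in for the single fused all-reduce that the collective implementation issues:

    import tensorflow as tf

    def batched_all_reduce(per_replica_grads):
      """Fuse many small reductions into one: concat, reduce once, then split."""
      shapes = [g.shape for g in per_replica_grads[0]]
      sizes = [s.num_elements() for s in shapes]

      # 1) Each replica packs its gradients into one flat buffer.
      packed = [tf.concat([tf.reshape(g, [-1]) for g in grads], axis=0)
                for grads in per_replica_grads]

      # 2) A single reduction over the packed buffers (stand-in for one all-reduce).
      reduced = tf.add_n(packed)

      # 3) Split and reshape back into the original gradient structure.
      return [tf.reshape(p, s) for p, s in zip(tf.split(reduced, sizes), shapes)]

    grads_a = [tf.ones([2, 2]), tf.ones([3])]
    grads_b = [2.0 * tf.ones([2, 2]), 2.0 * tf.ones([3])]
    print(batched_all_reduce([grads_a, grads_b]))
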
A. Unique TensorFlower
c0d20ffd82 Support batching all-reduce with concat and split
Collective v2 doesn't support the scoped allocator. While it's possible to make the scoped allocator work with it, concat/split is much simpler.

PiperOrigin-RevId: 337021023
Change-Id: I6e6e2fdc3c94ffbc59a52c20a451dcd74fd864e4
2020-10-13 22:16:32 -07:00