Commit Graph

1596 Commits

Yujing Zhang
58f1434ed4 Disable multi_worker_continuous_run_test on asan
PiperOrigin-RevId: 358344741
Change-Id: I41b0536dfbde6168085477306cab6a8b7c7f65a6
2021-02-18 23:30:57 -08:00
Yuefeng Zhou
426558e017 Remove cluster_coordinator_test.py from pip, add it to oss.
PiperOrigin-RevId: 358113811
Change-Id: I3e04a42ba6e3c264c3a725d0b13cdd3565bbe6a5
2021-02-17 23:43:11 -08:00
Rick Chao
18bfd69f92 PSv2: Enable cluster_coordinator_test in OSS as it has been previously fixed.
PiperOrigin-RevId: 358055546
Change-Id: Idc309a81d052e03bdc44c2165295c11528a39564
2021-02-17 16:14:47 -08:00
Chenkai Kuang
9663abe4c9 Document limitation of using tf.data.Options in coordinator.create_per_worker_dataset.
PiperOrigin-RevId: 357907240
Change-Id: Ib6ce02efeca322409969af14be3318248ce928b4
2021-02-17 02:37:47 -08:00
Rick Chao
e66d5d56a8 Multi-worker testing: Reenable multi_worker_continuous_run_test on TAP as it has been fixed.
PiperOrigin-RevId: 357889388
Change-Id: Ib713df996bd150c1ef1fcd1e33cd69fd9689612c
2021-02-17 00:10:06 -08:00
Rick Chao
7db445b3fb PSv2: Enable cluster_coordinator_test in OSS as it has been previously fixed.
PiperOrigin-RevId: 357655428
Change-Id: I0fdfbb6c6664058b25a8c29e21447b0f4e01ef4a
2021-02-15 23:30:39 -08:00
Chenkai Kuang
aa9bd19fe3 Fix a multi-gpu test failure.
The test uses tf.constant as input to all_reduce in pure eager mode; however, in eager mode tf.constant always creates host tensors regardless of the enclosing device scope, which leads to an NCCL error.

PiperOrigin-RevId: 357252841
Change-Id: Iddaf5f52fe6634ec29dd385a9fa034761f3df91f
2021-02-12 13:12:33 -08:00
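
A minimal repro-style sketch of the issue described in the entry above, assuming a machine with two or more GPUs so NCCL is in play; the values and helper choices here are illustrative, not the actual test code:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # assumes >= 2 GPUs so NCCL is used

    def replica_fn():
        ctx = tf.distribute.get_replica_context()
        # In pure eager mode a bare tf.constant is placed on the host regardless
        # of the enclosing device scope; copying it onto the replica device
        # (here via tf.identity) gives NCCL a device tensor to reduce.
        value = tf.identity(tf.constant(1.0))
        return ctx.all_reduce(tf.distribute.ReduceOp.SUM, value)

    result = strategy.run(replica_fn)
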
Yunxing Dai
443f13e41a [xla_compiler] Do not promote TF shape constant folding when input is an XLA dynamic shape.
PiperOrigin-RevId: 356904310
Change-Id: Iff329b8f81777f895333726e8ca98e2d3ad4ddb5
2021-02-10 22:36:47 -08:00
Rick Chao
4835f6ec4f PSv2/cfit: tf.distribute changes to accompany compile-fit support.
1) Single instance of ClusterCoordinator given a Strategy object
2) Circular references of ClusterCoordinator and ParameterServerStrategy
3) Attribute of a Strategy indicating if it is supposed to be used with a ClusterCoordinator

PiperOrigin-RevId: 356868615
Change-Id: If19600c0101f40a9e840fe71abb848f386e32735
2021-02-10 17:51:16 -08:00
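
For context, a minimal sketch of how a ClusterCoordinator is typically paired with a ParameterServerStrategy, as in the entry above (cluster setup details omitted and assumed to come from TF_CONFIG):

    import tensorflow as tf

    # Assumes the cluster spec is provided via the TF_CONFIG environment variable.
    cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

    # A single coordinator instance is associated with the strategy object.
    coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)
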
A. Unique TensorFlower
24493c8698 Fix the returned response for experimental_local_results in the case of MirroredStrategy for dict, list and tuple types.
PiperOrigin-RevId: 356663538
Change-Id: Ie2338c6dbb63ac0e129b9051dae56626e47f6450
2021-02-09 21:48:26 -08:00
Isha Arkatkar
21273f6e32 Fix the returned response for experimental_local_results in the case of MirroredStrategy for dict, list and tuple types.
PiperOrigin-RevId: 356645884
Change-Id: I0c5d5628e8bb88d661ecba41a90ddbe13aa59543
2021-02-09 19:29:21 -08:00
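
A short usage sketch of the API touched by the two entries above; this is a generic illustration, not the fixed code path itself:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    def step():
        # Return a dict of per-replica values; lists and tuples behave the same way.
        return {"loss": tf.constant(1.0), "accuracy": tf.constant(0.5)}

    per_replica = strategy.run(step)
    # experimental_local_results unwraps the result into a tuple with one entry
    # per local replica, preserving the dict/list/tuple structure inside.
    local_results = strategy.experimental_local_results(per_replica)
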
Chenkai Kuang
eb31d8660d Add all_reduce APIs that can be called in replica context to class CrossDeviceOps and StrategyExtended.
For `StrategyExtended`, it is a private API that will be used by `ReplicaContext.all_reduce`.

This is in preparation for deprecation of merge_call from user API.

PiperOrigin-RevId: 356604626
Change-Id: I2528b35b87db1b93907b17a246dbfbcfcb64ad33
2021-02-09 15:29:14 -08:00
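
The public-facing counterpart, ReplicaContext.all_reduce, can be used as in this small sketch (illustrative values):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    def replica_fn(x):
        ctx = tf.distribute.get_replica_context()
        # Sums `x` across all replicas without leaving replica context,
        # i.e. without an explicit merge_call in user code.
        return ctx.all_reduce(tf.distribute.ReduceOp.SUM, x)

    total = strategy.run(replica_fn, args=(tf.constant(2.0),))
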
Rick Chao
d4c8c579e1 MultiProcessRunner: Enable multi_process_runner_no_init_test in OSS with MultiProcessRunner's availability.
PiperOrigin-RevId: 356560120
Change-Id: Id34aebde1e405bcca65f2d3d42ef439923f9b434
2021-02-09 12:29:40 -08:00
Ran Chen
beab125d24 [rollback] Use self.handle inside ResourceVariable to allow tf.distribute to customize handle behavior

PiperOrigin-RevId: 356541183
Change-Id: If4dbfc32a834c464bc94ce1c3ae71b3fb72e1e55
2021-02-09 11:01:03 -08:00
TensorFlower Gardener
9829af63fe Merge pull request from 8bitmp3:patch-1
PiperOrigin-RevId: 356526123
Change-Id: Ie4e5815ba6291dc7505bf08629f355085c50f3e4
2021-02-09 10:05:41 -08:00
8bitmp3
38784a85c3 Address Ubuntu Sanity error
2021-02-07 01:27:31 +00:00
Ran Chen
525524f99e Fix a silly mistake in update
PiperOrigin-RevId: 355902594
Change-Id: I02261440a6c3d42c57ee285e64cd5efbdc45bb35
2021-02-05 12:34:34 -08:00
Xinyi Wang
66e3bde76a Delete meaningless comment.
PiperOrigin-RevId: 355821017
Change-Id: Iadbaac2670ebf3456bdc939fd81b02d5947da97d
2021-02-05 03:58:15 -08:00
Ran Chen
0ac07a2fc4 Retire AutoPolicy
This is part of the effort to refactor distributed variables. Auto is somewhat confusing and adds implementation complexity.

PiperOrigin-RevId: 355733796
Change-Id: I7446c3ed706624178fcb26c9b992632a93b939f6
2021-02-04 16:16:59 -08:00
Revan Sopher
a0bd36e7f4 Support passing and returning Nones in TPUStrategy.
This is already supported in non-TPU strategies, and the discrepancy surprises users trying to migrate.
The lower-level input and output replication ops can't handle values that aren't convertible to Tensor, so we need to do some massaging around this. Nones in inputs are temporarily replaced with a constant, then restored after replication. Nones in outputs are simply not replicated, as output replication happens at a per-value granularity.

PiperOrigin-RevId: 355690080
Change-Id: I9d2435e953c8feb7818a882cb5280327f310c919
2021-02-04 13:07:39 -08:00
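
A sketch of the kind of user code this enables; a non-TPU strategy stands in here, since the change brings TPUStrategy in line with behavior the other strategies already support:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # stand-in for TPUStrategy

    def step(inputs):
        features, label = inputs          # label may legitimately be None
        loss = tf.reduce_sum(features)
        return loss, None                 # None outputs pass through unreplicated

    loss, _ = strategy.run(step, args=((tf.constant([1.0, 2.0]), None),))
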
Ran Chen
f4b06261c9 Use self.handle inside ResourceVariable to allow tf.distribute to customize handle behavior

I'm working on a new version of DistributedVariable which directly inherits from BaseResourceVariable. Its handle would return different resource tensors under different contexts, e.g. self.handle would be a replicated tensor under a TPU context. This avoids the need to use raw variable operations for special resource handles like the TPU replicate handle or the parallel device handle.

PiperOrigin-RevId: 355663353
Change-Id: I16201f94ef27a0dc7ac1491c616d7bd68397123a
2021-02-04 11:26:57 -08:00
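
A toy, framework-free sketch of the pattern described above (all names here are hypothetical): operations go through a handle property each time, so a subclass or context flag can decide which resource handle is exposed:

    class ToyVariable:
        """Illustration only: routing ops through self.handle lets a subclass
        decide which resource tensor to expose in a given context."""

        def __init__(self, plain_handle, replicated_handle=None):
            self._plain_handle = plain_handle
            self._replicated_handle = replicated_handle
            self.in_tpu_context = False  # stand-in for an enclosing-TPU-context check

        @property
        def handle(self):
            if self.in_tpu_context and self._replicated_handle is not None:
                return self._replicated_handle
            return self._plain_handle

        def read_value(self):
            # Uses self.handle rather than a cached raw handle, so the context
            # flag controls which handle the read targets.
            return ("read", self.handle)
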
Ran Chen
a58d44afc8 [retry] Move enclosing_tpu_context to a separate util file
This is part of the variable refactor work to avoid dependency cycles.

PiperOrigin-RevId: 355654271
Change-Id: I92f0d00ddff6655c174d999abd0290ae3e4c1849
2021-02-04 11:04:21 -08:00
A. Unique TensorFlower
c37243b055 Fix a condition used for collective ops.
PiperOrigin-RevId: 355497796
Change-Id: I297c1841569d81742006fc5470a311fd2317fd10
2021-02-03 15:55:24 -08:00
Chenkai Kuang
6ff048e951 Comment on why NCCL can't be ordered in tf1.
PiperOrigin-RevId: 355484203
Change-Id: I23606db9759b72355f58147cdf7ad61d3076641d
2021-02-03 14:39:20 -08:00
A. Unique TensorFlower
ace531b56d Move enclosing_tpu_context to a separate util file
This is part of the variable refactor work to avoid dependency cycles.

PiperOrigin-RevId: 355456079
Change-Id: I7a8afc89d17eee9372afb3fd4c6da8126791e499
2021-02-03 12:35:59 -08:00
Ran Chen
428ce93ee4 Move enclosing_tpu_context to a separate util file
This is part of the variable refactor work to avoid dependency cycles.

PiperOrigin-RevId: 355414142
Change-Id: I36651a7be6462c198aae477923bc2ef0f7e7d0fb
2021-02-03 09:36:08 -08:00
Yuefeng Zhou
40d5a5685f Fix strategy.run's docstring: Python literals are not supported in args or kwargs.
PiperOrigin-RevId: 355343919
Change-Id: I3bc8cf411d799a9ff36b722812e9600955278015
2021-02-03 01:18:09 -08:00
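
In line with the docstring fix above, arguments to strategy.run should be tensors (or tf.distribute values) rather than bare Python literals; a minimal sketch:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    def replica_fn(x):
        return x * 2.0

    # Wrap literals in tf.constant before passing them through args/kwargs.
    result = strategy.run(replica_fn, args=(tf.constant(3.0),))
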
Xinyi Wang
c4d1165b15 Disable collective ops for MS.
PiperOrigin-RevId: 355266199
Change-Id: I58e71e086542b39effb5c4f0f3d088169b8810b1
2021-02-02 15:34:13 -08:00
Chenkai Kuang
b4bf78ffec Support slicing in ShardedVariable. The slicing semantic is identical to Tensor/Variable.
PiperOrigin-RevId: 355249212
Change-Id: Ic9a14b5ae5cc0a446142eaa529f052c09c445396
2021-02-02 14:14:36 -08:00
Chen Chen
0add9081c1 Skip //tensorflow/python/distribute:strategy_gather_test_tpu in oss to save the build
PiperOrigin-RevId: 355241184
Change-Id: I4be22489929d8f80bbbaae69a7cb58c7c55610d8
2021-02-02 13:43:54 -08:00
A. Unique TensorFlower
f7d0a77b53 An internal change.
PiperOrigin-RevId: 355220346
Change-Id: I78e8d291cf2a6168ec5ba9b2679f8292fc1271e6
2021-02-02 12:07:52 -08:00
Ran Chen
e8262389c4 strategy.extended.update allows assigning non-mirrored values to non-mirrored variables

PiperOrigin-RevId: 355043692
Change-Id: I4b33e840636d77489358880ee506868f6dae787b
2021-02-01 15:50:31 -08:00
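
A small usage sketch of strategy.extended.update, the API the entry above loosens; the values here are illustrative:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        v = tf.Variable(0.0)

    def assign_fn(var, value):
        return var.assign(value)

    # update() runs assign_fn on each component of `v` in an update context;
    # per the change above, `value` no longer has to be a mirrored value.
    strategy.extended.update(v, assign_fn, args=(tf.constant(5.0),))
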
A. Unique TensorFlower
055896a275 Always enable get_next_as_optional unless the dataset is finite.
PiperOrigin-RevId: 354672864
Change-Id: I3a490952e8bd075bf035a0126e62b9cf5082104e
2021-01-29 23:11:37 -08:00
Ruoxin Sang
cf3d55222d Always enable get_next_as_optional unless the dataset is finite.
PiperOrigin-RevId: 354668482
Change-Id: I5af5fffa27bdda4b0774a231ca804995c78f9bde
2021-01-29 22:11:49 -08:00
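
For reference, a sketch of the get_next_as_optional pattern on a distributed iterator, which tolerates exhausted or partial batches (eager mode assumed):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    dataset = tf.data.Dataset.range(8).batch(4)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    iterator = iter(dist_dataset)
    optional = iterator.get_next_as_optional()
    if optional.has_value():          # eager: scalar bool tensor usable directly
        batch = optional.get_value()
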
Ran Chen
1a46fdc4a2 Remove collective v1 code path
PiperOrigin-RevId: 354577402
Change-Id: I200d98a6a80dfe1e463044f9dedef9291ff7d846
2021-01-29 11:47:05 -08:00
Ran Chen
8a356e8ca5 [retry] Use same var key in _create_slots/get_slot in V1 optimizer
We have special handling for distributed variables in get_slot, but not in create_slot, while these keys need to match. This change modifies get_slot to use _var_key as well to avoid confusion. It also prepares for an upcoming refactor in the dist strat code.

Note that we need to make sure the keys don't change, so existing checkpoints can still be used.

A bunch of build rules are modified to break cyclic dependencies.

PiperOrigin-RevId: 354341520
Change-Id: Ifd9786263024a11806ddde0c3bd1d36157ab8db7
2021-01-28 10:48:00 -08:00
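
A toy, framework-free sketch of the consistency requirement described above: slot creation and slot lookup must derive their keys through the same function, otherwise distributed variables can miss their slots. The names and key scheme here are hypothetical:

    def _var_key(var):
        # Hypothetical: a distributed variable may expose a primary component
        # whose name keys the slot; plain variables key on their own name.
        primary = getattr(var, "primary", var)
        return primary.name

    class SlotStore:
        def __init__(self):
            self._slots = {}

        def create_slot(self, var, slot_name, value):
            self._slots.setdefault(_var_key(var), {})[slot_name] = value

        def get_slot(self, var, slot_name):
            # Uses the same _var_key as create_slot, so keys always line up.
            return self._slots[_var_key(var)][slot_name]
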
8bitmp3
95930eeb24 Review tpu_strategy.py following feedback
2021-01-28 17:12:04 +00:00
Andrew Audibert
4b124a09df Add no_oss tag for flaky parameter_server_strategy_v2_test
PiperOrigin-RevId: 354262891
Change-Id: I5f14527aad5caa11dd52176201b2d813d9682b3c
2021-01-28 01:19:11 -08:00
8bitmp3
54a8ca01a0 Improve rendering for tf.distribute.cluster_resolver.TPUClusterResolver in tpu_strategy API docs
2021-01-28 00:55:35 +00:00
Ran Chen
3db793ee03 Remove the workaround that sets PerReplica spec to dynamic batch
It's no longer needed as we stopped reusing collective instance keys. Note that we still modify element_spec to have a dynamic batch for multi-worker strategies when partial batch is enabled, so that element_spec is compatible with the data produced.

PiperOrigin-RevId: 354132185
Change-Id: I3857b4bb25c825befdd1f7c667437dc3bbf4ba50
2021-01-27 11:29:31 -08:00
Christian Sigg
4838793e12 Comment out a number of google-internal targets when copybara-exporting instead of removing them.
PiperOrigin-RevId: 353848826
Change-Id: I0801c0e713a0c63597deb5aed31c8bdb37999c6a
2021-01-26 05:47:31 -08:00
Xinyi Wang
30fb80d468 Swap the use of NcclAllReduce for NCCL Collective Ops in MirroredStrategy.
Also remove the use of the async executor to launch collective ops in eager mode and use one thread per device instead. This resolves the issue of not being able to call numpy() on the result of the async executor. This change applies to MWMS too.

PiperOrigin-RevId: 353355403
Change-Id: I9c9f30dfe18dc830a4a8fa9bbaec042c7c2edd8f
2021-01-22 18:19:19 -08:00
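
If a specific cross-device reduction implementation is still wanted, it can be chosen explicitly when constructing the strategy; a minimal sketch:

    import tensorflow as tf

    # cross_device_ops can be supplied to pick a particular reduction
    # implementation explicitly; by default the strategy chooses one itself.
    strategy = tf.distribute.MirroredStrategy(
        cross_device_ops=tf.distribute.NcclAllReduce())
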
Rick Chao
140a2b2bcb PSv2: Merge cluster_coordinator_mpr_test into fault_tolerance_test: step 3: verifying 1) executing functions on workers after PS failure results in expected failure types.
PiperOrigin-RevId: 353334801
Change-Id: I4aeab0b088acec0c204d3aba7dd7a6bac84817e2
2021-01-22 16:01:31 -08:00
TensorFlower Gardener
b444969f2a Merge pull request from ROCmSoftwarePlatform:google_upstream_rocm_misc_update_210118
PiperOrigin-RevId: 353101032
Change-Id: I1250b4f0b23ae581d10f33a461986c1f31fc7372
2021-01-21 14:21:02 -08:00
Ruoxin Sang
5642c34f2e In dynamic padder, use xla.set_dynamic_dimension_size to set dimension upper bound rather than propagating padding_map to XLABuilder.
PiperOrigin-RevId: 352929263
Change-Id: Ie1b284536a0ca25abdb51fde9462034d0f835894
2021-01-20 20:03:11 -08:00
A. Unique TensorFlower
985ad0276a PY2 removal cleanup
PiperOrigin-RevId: 352907145
Change-Id: I82de30d92dc9c2b53215d6d5732c67afe339c23d
2021-01-20 17:11:44 -08:00
Deven Desai
fbf8a4a1f1 Adding no_rocm tag to unit tests that are FLAKY on the ROCm CI nodes. The cause of their flakiness has been identified and the fix will be in ROCm 4.1. See JIRA ticket SWDEV 263833 for details.
2021-01-20 03:32:35 +00:00
Revan Sopher
37d51318b0 Fix handling of TPUStrategy.run() when passing Variables to methods.
If the user function defines "self" as the first argument, we skip it.
Note that this will fail in the weird case of a user function that defines "self" without being a method, in which case the fix (and best practice) would be to name the arg something else.

PiperOrigin-RevId: 352576482
Change-Id: I82622536fd89ce77993bcfe1c65f5f172e8ebcd4
2021-01-19 08:49:06 -08:00
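
A generic illustration of passing a method to strategy.run, the pattern the fix above targets; a non-TPU strategy stands in here, and the class is hypothetical:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # stand-in for TPUStrategy

    class Trainer:
        def step(self, x):
            # "self" is the method's first argument; per the fix above,
            # TPUStrategy.run skips a leading "self" when the callable is a method
            # instead of treating it as a replica input.
            return x + 1.0

    trainer = Trainer()
    result = strategy.run(trainer.step, args=(tf.constant(1.0),))
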
A. Unique TensorFlower
034633f23b PY2 removal cleanup
PiperOrigin-RevId: 352106691
Change-Id: I382d53c64f0d29da430b8cb6d2395a2cb281509e
2021-01-15 16:48:57 -08:00
RJ Skerry-Ryan
102e1f9855 Expand distribute_utils.regroup to work with collections.abc.Mapping-derived containers.
Motivation: This enables user-defined dict-like types inheriting from collections.abc.Mapping to work as return values of functions used with DistributionStrategy.run. Without this change, the entire collection is wrapped in a PerReplica, which breaks assumptions of downstream code.
PiperOrigin-RevId: 352064455
Change-Id: Iefda92654fa73d12ab213abe7ea13e0007201f95
2021-01-15 12:46:22 -08:00
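
A sketch of the user-side pattern this enables: a dict-like type derived from collections.abc.Mapping returned from a replica function. The container class here is hypothetical, and its constructor form is an assumption:

    import collections.abc
    import tensorflow as tf

    class Metrics(collections.abc.Mapping):
        """Hypothetical user-defined dict-like container."""
        def __init__(self, **values):
            self._values = dict(values)
        def __getitem__(self, key):
            return self._values[key]
        def __iter__(self):
            return iter(self._values)
        def __len__(self):
            return len(self._values)

    strategy = tf.distribute.MirroredStrategy()

    def step():
        return Metrics(loss=tf.constant(1.0))

    # With Mapping support in regroup, the values inside the container are
    # regrouped per replica instead of the whole container being wrapped
    # in a single PerReplica.
    per_replica = strategy.run(step)
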