The test uses tf.constant as input to all_reduce in pure eager mode. However, in eager mode tf.constant always creates host tensors regardless of the enclosing device scope, which leads to an NCCL error.
PiperOrigin-RevId: 357252841
Change-Id: Iddaf5f52fe6634ec29dd385a9fa034761f3df91f
1) Single instance of ClusterCoordinator given a Strategy object
2) Circular references of ClusterCoordinator and ParameterServerStrategy
3) Attribute of a Strategy indicating if it is supposed to be used with a ClusterCoordinator
PiperOrigin-RevId: 356868615
Change-Id: If19600c0101f40a9e840fe71abb848f386e32735
For `StrategyExtended`, it is a private API that will be used by `ReplicaContext.all_reduce`.
This is in preparation for the deprecation of merge_call from the user API.
PiperOrigin-RevId: 356604626
Change-Id: I2528b35b87db1b93907b17a246dbfbcfcb64ad33
This is part of the effort to refactor distributed variables. Auto is somewhat
confusing and adds additional implementation complexity.
PiperOrigin-RevId: 355733796
Change-Id: I7446c3ed706624178fcb26c9b992632a93b939f6
This is supported in non-TPU strategies, and surprises users trying to migrate.
The lower-level input and output replication ops can't handle values that aren't convertible to Tensor, so we need to do some massaging around this. Nones in inputs are temporarily replaced with a constant, then restored to None after replication. Nones in outputs are simply not replicated, as output replication happens at a per-value granularity.
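
The None handling described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual strategy code: `replicate_inputs_with_nones`, `replicate_outputs_with_nones`, `PLACEHOLDER`, and the stand-in replication functions are hypothetical names.

```python
# Hypothetical sketch; none of these names come from TensorFlow.
PLACEHOLDER = 0  # stands in for the temporary constant


def replicate_inputs_with_nones(flat_inputs, replicate_all):
    """Swap Nones for a placeholder, replicate, then put the Nones back."""
    none_indices = {i for i, v in enumerate(flat_inputs) if v is None}
    patched = [PLACEHOLDER if i in none_indices else v
               for i, v in enumerate(flat_inputs)]
    replicated = replicate_all(patched)
    return [None if i in none_indices else v
            for i, v in enumerate(replicated)]


def replicate_outputs_with_nones(flat_outputs, replicate_one):
    """Outputs are replicated one value at a time, so Nones are skipped."""
    return [None if v is None else replicate_one(v) for v in flat_outputs]


def fake_replicate_all(values):
    # Stand-in for an input replication op that rejects None.
    if any(v is None for v in values):
        raise TypeError("cannot convert None to a tensor")
    return [("replicated", v) for v in values]


def fake_replicate_one(value):
    # Stand-in for a per-value output replication op.
    return ("replicated", value)


inputs_result = replicate_inputs_with_nones([1, None, 3], fake_replicate_all)
outputs_result = replicate_outputs_with_nones([None, 2], fake_replicate_one)
```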
PiperOrigin-RevId: 355690080
Change-Id: I9d2435e953c8feb7818a882cb5280327f310c919
handle behavior
I'm working on a new version of DistributedVariable which directly inherits from BaseResourceVariable. Its handle would return different resource tensors in different contexts, e.g. self.handle would be a replicated tensor in a TPU context. This avoids the need to use raw variable operations for special resource handles such as the TPU replicate handle or the parallel device handle.
PiperOrigin-RevId: 355663353
Change-Id: I16201f94ef27a0dc7ac1491c616d7bd68397123a
This is part of the variable refactor work to avoid dependency cycles.
PiperOrigin-RevId: 355654271
Change-Id: I92f0d00ddff6655c174d999abd0290ae3e4c1849
This is part of the variable refactor work to avoid dependency cycles.
PiperOrigin-RevId: 355456079
Change-Id: I7a8afc89d17eee9372afb3fd4c6da8126791e499
This is part of the variable refactor work to avoid dependency cycles.
PiperOrigin-RevId: 355414142
Change-Id: I36651a7be6462c198aae477923bc2ef0f7e7d0fb
We have special handling for distributed variables in get_slot, but not in
create_slot, while these keys need to match. This change modifies get_slot to use _var_key as well to avoid confusion. It also prepares for an upcoming refactor in the distribution strategy code.
Note that we need to make sure the keys don't change, so existing checkpoints can still be used.
A bunch of build rules are modified to break cyclic dependencies.
PiperOrigin-RevId: 354341520
Change-Id: Ifd9786263024a11806ddde0c3bd1d36157ab8db7
It's no longer needed as we stopped reusing collective instance keys. Note that we still modify element_spec to have a dynamic batch dimension for multi worker strategies when partial batch is enabled, so that element_spec is compatible with the data produced.
PiperOrigin-RevId: 354132185
Change-Id: I3857b4bb25c825befdd1f7c667437dc3bbf4ba50
Also remove the use of the async executor to launch collective ops in eager mode and use one thread per device instead. This resolves the issue of not being able to call numpy() on results produced by the async executor. This change applies to MultiWorkerMirroredStrategy (MWMS) too.
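
As a rough illustration of the one-thread-per-device launch pattern, here is a standard-library sketch. It assumes a hypothetical helper `run_one_thread_per_device` and a fake collective built on `threading.Barrier`; it does not reflect the real launch code. The barrier mimics the property that a collective only completes once every participant has joined, which is why each device needs its own thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_one_thread_per_device(devices, op):
    """Run op(device) on a dedicated thread per device, results in device order."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [pool.submit(op, device) for device in devices]
        # result() returns a concrete value the caller can inspect directly.
        return [future.result() for future in futures]


devices = ["GPU:0", "GPU:1"]  # illustrative device names
barrier = threading.Barrier(len(devices))


def fake_collective(device):
    # Blocks until every device has joined, like a real collective would.
    barrier.wait(timeout=5)
    return f"reduced on {device}"


results = run_one_thread_per_device(devices, fake_collective)
```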
PiperOrigin-RevId: 353355403
Change-Id: I9c9f30dfe18dc830a4a8fa9bbaec042c7c2edd8f
If the user function defines "self" as the first argument, we skip it.
Note that this will fail in the weird case of a user function that defines "self" without being a method, in which case the fix (and best practice) would be to name the arg something else.
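
A minimal sketch of the rule, assuming a hypothetical helper `arg_names_without_self` built on `inspect.signature` (the actual implementation may differ). It also shows the failure case noted above: a plain function whose first argument happens to be named "self" loses that argument.

```python
import inspect


def arg_names_without_self(fn):
    """Return fn's parameter names, dropping a leading "self" if present."""
    names = list(inspect.signature(fn).parameters)
    if names and names[0] == "self":
        names = names[1:]
    return names


class Model:
    def step(self, x, y):
        return x + y


def not_a_method(self, x):
    # "self" here is an ordinary argument, but it is skipped anyway.
    return x


method_args = arg_names_without_self(Model.step)
weird_case_args = arg_names_without_self(not_a_method)
```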
PiperOrigin-RevId: 352576482
Change-Id: I82622536fd89ce77993bcfe1c65f5f172e8ebcd4
Motivation: This enables user-defined dict-like types inheriting from collections.abc.Mapping to work as return values of functions used with DistributionStrategy.run. Without this change, the entire collection is wrapped in a PerReplica, which breaks assumptions of downstream code.
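
The difference can be sketched without TensorFlow. Everything below is a hypothetical stand-in: `PerReplica` is a toy dataclass, `regroup` is a simplified illustration, and the sketch assumes the user's Mapping type can be constructed from a dict.

```python
import collections.abc
import dataclasses


@dataclasses.dataclass(frozen=True)
class PerReplica:
    """Toy stand-in for a per-replica value container."""
    values: tuple


class Metrics(collections.abc.Mapping):
    """User-defined dict-like type that is not a dict subclass."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def regroup(per_replica_results):
    """Group per-replica results leaf by leaf, preserving Mapping types."""
    first = per_replica_results[0]
    if isinstance(first, collections.abc.Mapping):
        # Recurse into each key so the container type survives.
        return type(first)(
            {key: regroup([r[key] for r in per_replica_results])
             for key in first})
    # Leaf value: wrap the per-replica values together.
    return PerReplica(tuple(per_replica_results))


grouped = regroup([Metrics({"loss": 1.0}), Metrics({"loss": 2.0})])
```

With this approach `grouped` is still a `Metrics` whose "loss" entry holds the per-replica values, rather than a single `PerReplica` wrapping two `Metrics` objects.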
PiperOrigin-RevId: 352064455
Change-Id: Iefda92654fa73d12ab213abe7ea13e0007201f95