Enabling this logic removes the cross-worker send/recv dependencies that TPUExecuteOp nodes would otherwise need in order to access a model's variables, reducing overhead at the start of a training loop. The approach is to replace remote variable reads with zero tensors on every worker except the primary; these zero tensors feed the TPUExecute nodes local to that worker. For large distributed systems with large variables, this eliminates the initial Send/Recv variable broadcast, which can be expensive.

PiperOrigin-RevId: 351904109
Change-Id: I9f1ed63c2401f227646010a94a70c04f1c96cb7e
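To see why feeding zeros can preserve correctness, below is a minimal sketch in JAX. It is not TensorFlow's actual graph-rewrite pass: the function `step`, the axis name `workers`, and the use of a sum collective to redistribute the primary's value are all illustrative assumptions. The idea it models is that each participant except the first contributes a zero tensor, and a collective inside the compiled program reconstructs the primary's value everywhere, so no variable data has to be sent between workers before execution starts.

```python
# Illustrative sketch only: models the zeros-plus-collective idea with JAX's
# pmap; it is not the TensorFlow rewrite described in the commit message.
import jax
import jax.numpy as jnp
from jax import lax

n = jax.local_device_count()

def step(fed_value):
    # Summing across participants recovers the primary's value on every
    # device, because every other contribution is a zero tensor of the
    # same shape -- a broadcast expressed as an in-program collective.
    var = lax.psum(fed_value, axis_name="workers")
    return var * 2.0  # stand-in for the computation that reads the variable

real_value = jnp.arange(4, dtype=jnp.float32)  # the variable on the primary
# Only participant 0 feeds the real value; the rest feed zeros, so there is
# no point-to-point send/recv of variable data before the program runs.
inputs = jnp.stack([real_value if i == 0 else jnp.zeros_like(real_value)
                    for i in range(n)])
print(jax.pmap(step, axis_name="workers")(inputs))
```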