Tayo Oguntebi 6983bacea1 Enables per-host dummy args for TPUExecute (TF1) and adds XLA options.
Enabling this logic removes the cross-worker send/recv dependencies that TPUExecuteOp nodes would otherwise need in order to access a model's variables. This decreases overhead at the start of a training loop.

The approach replaces remote variable reads with zero tensors on every worker except the primary worker. The zero tensors feed the TPUExecute nodes local to that worker. For large distributed systems with large variables, this removes the need for the initial Send/Recv variable broadcast, which can be expensive.
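Conceptually, the substitution looks like the following minimal sketch. The helper name and structure are illustrative only; the actual change is a graph rewrite, not user-facing Python:

    import tensorflow.compat.v1 as tf

    def per_host_args(variables, is_primary_worker):
      """Builds the argument list fed to a worker-local TPUExecute.

      On the primary worker, real variable reads are used. On all other
      workers, same-shaped zero tensors are substituted, so no cross-worker
      Send/Recv of variable values is required.
      """
      if is_primary_worker:
        return [v.read_value() for v in variables]
      # Dummy args: zeros matching each variable's shape and dtype.
      return [tf.zeros(v.shape, dtype=v.dtype) for v in variables]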

PiperOrigin-RevId: 351904109
Change-Id: I9f1ed63c2401f227646010a94a70c04f1c96cb7e
2021-01-14 17:03:51 -08:00