RefreshRemoteAttributes() initializes device attributes across workers when CollectiveOp is used. It needs to send the GetStatus RPC with fail_fast = false so that each worker can block until the other workers have started up.
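
The difference between the two fail_fast settings can be sketched with a small stand-alone simulation. This is not the TensorFlow or gRPC API: FakeWorker and WaitForLateWorker are hypothetical names introduced only for illustration, and the string results stand in for real Status values. The sketch assumes that fail_fast = true returns an error immediately when the peer is not up, while fail_fast = false waits for the peer to become ready.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a remote worker that may not have started yet.
class FakeWorker {
 public:
  // Marks the worker as started and wakes any callers blocked in GetStatus.
  void Start() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      started_ = true;
    }
    cv_.notify_all();
  }

  // Simulated GetStatus call; returns "OK" or "UNAVAILABLE".
  // fail_fast = true:  give up at once if the worker is not up.
  // fail_fast = false: block until the worker has started.
  std::string GetStatus(bool fail_fast) {
    std::unique_lock<std::mutex> lock(mu_);
    if (!started_) {
      if (fail_fast) return "UNAVAILABLE";
      cv_.wait(lock, [this] { return started_; });
    }
    return "OK";
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool started_ = false;
};

// Queries a worker that only starts after `delay`, using fail_fast = false.
// The call blocks until the worker is up, so the result is always "OK".
std::string WaitForLateWorker(std::chrono::milliseconds delay) {
  FakeWorker worker;
  std::thread starter([&worker, delay] {
    std::this_thread::sleep_for(delay);
    worker.Start();
  });
  std::string result = worker.GetStatus(/*fail_fast=*/false);
  starter.join();
  return result;
}
```

In this sketch, calling GetStatus(/*fail_fast=*/true) on a worker that has not started yields "UNAVAILABLE" right away, which is the startup race the change avoids, whereas WaitForLateWorker blocks for roughly the start-up delay and then succeeds.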

PiperOrigin-RevId: 266207854
Dong Lin 2019-08-29 12:57:42 -07:00 committed by TensorFlower Gardener
parent 988882bb84
commit 42652b1699

@@ -98,7 +98,7 @@ void DeviceResolverDistributed::RefreshRemoteAttributes(
WorkerInterface* worker = worker_cache_->GetOrCreateWorker(task);
CHECK(worker) << "Failed to get worker for " << task;
worker->GetStatusAsync(
- req, resp, /*fail_fast=*/true,
+ req, resp, /*fail_fast=*/false,
[this, device, task, req, resp, worker, done](Status s) {
if (s.ok()) {
mutex_lock l(mu_);