STT-tensorflow/tensorflow/core/distributed_runtime
Bramandia Ramadhana 80498bb94f - Adds support to optionally adds replica as part of job name
PiperOrigin-RevId: 351419660
Change-Id: I59d6f801f40fd598e51f80284b4db0064f86e5c5
2021-01-12 12:26:47 -08:00
eager Stop holding custom devices in TensorHandles 2021-01-12 11:22:48 -08:00
rpc Disables CUDA_ASAN for //third_party/tensorflow/core/distributed_runtime/rpc:grpc_session_test due to timeout. 2020-12-29 07:15:28 -08:00
base_rendezvous_mgr.cc Fix cancellation race condition in BaseRendezvousMgr::RegisterCall 2020-06-19 13:04:53 -07:00
base_rendezvous_mgr.h Fix cancellation race condition in BaseRendezvousMgr::RegisterCall 2020-06-19 13:04:53 -07:00
BUILD Support aborting RING communication in multi worker collectives 2020-10-21 17:03:09 -07:00
call_options_test.cc
call_options.cc
call_options.h Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
cancellable_call.cc Support aborting RING communication in multi worker collectives 2020-10-21 17:03:09 -07:00
cancellable_call.h Support aborting RING communication in multi worker collectives 2020-10-21 17:03:09 -07:00
cluster_function_library_runtime_test.cc Pass in GrpcWorkerEnv when creating GrpcWorkerCache. 2020-06-04 11:46:09 -07:00
cluster_function_library_runtime.cc Change the function output type, either a Tensor for a local output or a TensorShape for a remote output, preparing for the support of function outputs placed on remote workers. 2020-08-04 19:13:03 -07:00
cluster_function_library_runtime.h Use the original output indices when adding a component function output to RemoteMgr. 2020-08-19 14:41:05 -07:00
collective_param_resolver_distributed_test.cc Set a timeout to check health RPC 2020-10-21 13:02:25 -07:00
collective_param_resolver_distributed.cc Support aborting param resolution in multi worker collectives 2020-10-21 18:40:58 -07:00
collective_param_resolver_distributed.h Support aborting param resolution in multi worker collectives 2020-10-21 18:40:58 -07:00
collective_rma_distributed_test.cc Set a timeout to check health RPC 2020-10-21 13:02:25 -07:00
collective_rma_distributed.cc Support aborting RING communication in multi worker collectives 2020-10-21 17:03:09 -07:00
collective_rma_distributed.h Support aborting RING communication in multi worker collectives 2020-10-21 17:03:09 -07:00
device_resolver_distributed_test.cc Use device attributes from group resolution 2020-09-09 10:53:49 -07:00
device_resolver_distributed.cc Use device attributes from group resolution 2020-09-09 10:53:49 -07:00
device_resolver_distributed.h Use device attributes from group resolution 2020-09-09 10:53:49 -07:00
graph_mgr.cc [TF2XLA] Remove the serialization of CustomKernelCreator, since there is only one, and we won't add new ones 2020-09-22 10:59:59 -07:00
graph_mgr.h When calling connect_to_cluser, if the options are identical and there is no renaming of local device, reuse existing local DeviceManager, otherwise we keep the old DeviceManager around to allow the old Tensor created to be usable. 2020-05-20 08:53:52 -07:00
local_master.cc
local_master.h
master_env.h Fix two memory leaks and enable asan for C API remote tests. 2020-07-17 10:17:20 -07:00
master_interface.h
master_session.cc fix typos in core directory 2020-10-29 02:52:55 +03:00
master_session.h Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
master_test.cc
master.cc Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
master.h Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
message_wrappers_test.cc
message_wrappers.cc Support TensorProtos as Operation inputs, in order to support remote inputs passed as Tensors to EagerClusterFunctionLibraryRuntime::Run. 2020-03-09 17:10:38 -07:00
message_wrappers.h Support TensorProtos as Operation inputs, in order to support remote inputs passed as Tensors to EagerClusterFunctionLibraryRuntime::Run. 2020-03-09 17:10:38 -07:00
partial_run_mgr_test.cc
partial_run_mgr.cc
partial_run_mgr.h Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
README.md
recent_request_ids_test.cc Internal tests cleanup. 2020-10-27 13:24:35 -07:00
recent_request_ids.cc
recent_request_ids.h Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
remote_device_test.cc Pass in GrpcWorkerEnv when creating GrpcWorkerCache. 2020-06-04 11:46:09 -07:00
remote_device.cc - Adds support to optionally adds replica as part of job name 2021-01-12 12:26:47 -08:00
remote_device.h - Adds support to optionally adds replica as part of job name 2021-01-12 12:26:47 -08:00
rendezvous_mgr_interface.h [Cleanup] Remove unused method RendezvousMgrInterface::CleanupAll(). 2020-04-13 11:26:06 -07:00
request_id_test.cc
request_id.cc
request_id.h
rpc_collective_executor_mgr_test.cc Make NcclManager part of CollectiveExecutorMgr 2020-09-17 14:35:16 -07:00
rpc_collective_executor_mgr.cc Make NcclManager part of CollectiveExecutorMgr 2020-09-17 14:35:16 -07:00
rpc_collective_executor_mgr.h Make NcclManager part of CollectiveExecutorMgr 2020-09-17 14:35:16 -07:00
rpcbench_test.cc Internal tests cleanup. 2020-10-27 13:24:35 -07:00
scheduler.cc
scheduler.h
server_lib_test.cc When calling connect_to_cluser, if the options are identical and there is no renaming of local device, reuse existing local DeviceManager, otherwise we keep the old DeviceManager around to allow the old Tensor created to be usable. 2020-05-20 08:53:52 -07:00
server_lib.cc When calling connect_to_cluser, if the options are identical and there is no renaming of local device, reuse existing local DeviceManager, otherwise we keep the old DeviceManager around to allow the old Tensor created to be usable. 2020-05-20 08:53:52 -07:00
server_lib.h When calling connect_to_cluser, if the options are identical and there is no renaming of local device, reuse existing local DeviceManager, otherwise we keep the old DeviceManager around to allow the old Tensor created to be usable. 2020-05-20 08:53:52 -07:00
session_mgr_test.cc Garbage collect old WorkerSession when the restarted master task create new one. 2020-08-03 11:31:26 -07:00
session_mgr.cc Garbage collect old WorkerSession when the restarted master task create new one. 2020-08-03 11:31:26 -07:00
session_mgr.h Garbage collect old WorkerSession when the restarted master task create new one. 2020-08-03 11:31:26 -07:00
tensor_coding_test.cc Internal tests cleanup. 2020-10-27 13:24:35 -07:00
tensor_coding.cc
tensor_coding.h
test_utils.h Set a timeout to check health RPC 2020-10-21 13:02:25 -07:00
worker_cache_logger.cc
worker_cache_logger.h Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
worker_cache_partial.cc
worker_cache_partial.h Prefixing TensorFlow thread annotation macros with TF_. 2020-03-05 08:42:01 -08:00
worker_cache_wrapper.h
worker_cache.h
worker_env.h Fix two memory leaks and enable asan for C API remote tests. 2020-07-17 10:17:20 -07:00
worker_interface.h Set a timeout to check health RPC 2020-10-21 13:02:25 -07:00
worker_session.cc When calling connect_to_cluser, if the options are identical and there is no renaming of local device, reuse existing local DeviceManager, otherwise we keep the old DeviceManager around to allow the old Tensor created to be usable. 2020-05-20 08:53:52 -07:00
worker_session.h When calling connect_to_cluser, if the options are identical and there is no renaming of local device, reuse existing local DeviceManager, otherwise we keep the old DeviceManager around to allow the old Tensor created to be usable. 2020-05-20 08:53:52 -07:00
worker.cc Set a timeout to check health RPC 2020-10-21 13:02:25 -07:00
worker.h Set a timeout to check health RPC 2020-10-21 13:02:25 -07:00

Distributed TensorFlow

This directory contains the initial open-source implementation of the distributed TensorFlow runtime, using gRPC for inter-process communication.

To learn how to use the distributed runtime to create a TensorFlow cluster, see the Distributed TensorFlow How-To.
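
As a rough illustration of what "creating a cluster" involves, the sketch below models a cluster as a mapping from job names to task addresses and formats the device names that identify a device on a given task. It is plain Python and does not call TensorFlow; the job names, addresses, and helper functions (`task_address`, `device_name`, `all_tasks`) are hypothetical and chosen only for this example. In TensorFlow itself, a dictionary of this shape is what `tf.train.ClusterSpec` accepts.

```python
from typing import Dict, List, Tuple

# Hypothetical cluster layout: job name -> list of "host:port" task addresses.
# The position of an address in its list is the task index within that job.
CLUSTER: Dict[str, List[str]] = {
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
}


def task_address(cluster: Dict[str, List[str]], job: str, task_index: int) -> str:
    """Return the address of one task, validating the job name and index."""
    if job not in cluster:
        raise KeyError(f"unknown job: {job!r}")
    tasks = cluster[job]
    if not 0 <= task_index < len(tasks):
        raise IndexError(f"job {job!r} has no task {task_index}")
    return tasks[task_index]


def device_name(job: str, task_index: int, device_type: str = "CPU",
                device_index: int = 0, replica: int = 0) -> str:
    """Format a fully qualified device name for a device on a given task."""
    return (f"/job:{job}/replica:{replica}/task:{task_index}"
            f"/device:{device_type}:{device_index}")


def all_tasks(cluster: Dict[str, List[str]]) -> List[Tuple[str, int, str]]:
    """List every (job, task_index, address) triple in a stable order."""
    return [(job, index, address)
            for job in sorted(cluster)
            for index, address in enumerate(cluster[job])]


if __name__ == "__main__":
    for job, index, address in all_tasks(CLUSTER):
        print(f"{device_name(job, index)} -> {address}")
```

Each process in a real cluster would start one server bound to its own address and would refer to devices on other tasks by names of this form. The How-To remains the authoritative reference for the actual API.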