- MaybePinSmallOpsToCpu
- MaybePinToResourceDevice
- MaybePinToCustomDevice
We are going to reuse MaybePinSmallOpsToCpu in TFRT but not the other two, because TFRT has neither native resource support nor custom devices.
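The heuristic behind a helper like MaybePinSmallOpsToCpu can be sketched roughly as follows. The threshold, type names, and function signature below are illustrative assumptions, not the actual TensorFlow implementation: tiny integer tensors (typically shapes and indices) are cheaper to compute on the CPU than to launch on an accelerator.

```cpp
#include <cstdint>
#include <string>

// Assumed threshold: ops whose inputs have at most this many elements are
// candidates for CPU pinning.
constexpr int64_t kMaxElementsForCpuPin = 64;

enum class DType { kInt32, kInt64, kFloat32 };

bool IsPinnableDType(DType d) {
  // Only small integer tensors (typically shapes/indices) are candidates.
  return d == DType::kInt32 || d == DType::kInt64;
}

// Returns true if an op with an input of `num_elements` elements of type
// `dtype`, currently placed on `device`, should be pinned to the CPU.
bool MaybePinSmallOpToCpu(int64_t num_elements, DType dtype,
                          const std::string& device) {
  if (device.find("CPU") != std::string::npos) return false;  // already on CPU
  return IsPinnableDType(dtype) && num_elements <= kMaxElementsForCpuPin;
}
```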
PiperOrigin-RevId: 317766813
Change-Id: I43241b5786120ddf39dc4bfff6071239afdfd785
1. Thread the RPC cancel signal through the eager service RunComponentFunction calls;
2. Always pass the cancellation manager to the underlying executor, instead of passing it only when `is_eager` is true (i.e., for pure eager ops). With this we do not need to cancel the rendezvous from the process FLR; instead the ExecutorState takes care of it when an op fails.
3. Do not mark all statuses as derived when aborting a rendezvous or triggering cancellation. Doing so usually results in the original error being buried as one of the derived errors.
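The cancellation pattern described in points 1 and 2 can be sketched as below. This is a minimal illustration, not the real tensorflow CancellationManager API: the caller threads one manager through every executor, and starting a cancel runs all registered cleanup callbacks, e.g. aborting a rendezvous.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class CancellationManager {
 public:
  // Registers a callback to run on cancellation. Returns false if already
  // cancelled, in which case the caller should clean up immediately.
  bool RegisterCallback(std::function<void()> cb) {
    std::lock_guard<std::mutex> l(mu_);
    if (cancelled_) return false;
    callbacks_.push_back(std::move(cb));
    return true;
  }

  // Marks the manager cancelled and runs all registered callbacks.
  void StartCancel() {
    std::vector<std::function<void()>> to_run;
    {
      std::lock_guard<std::mutex> l(mu_);
      if (cancelled_) return;
      cancelled_ = true;
      to_run.swap(callbacks_);
    }
    for (auto& cb : to_run) cb();  // e.g. abort the rendezvous
  }

  bool IsCancelled() {
    std::lock_guard<std::mutex> l(mu_);
    return cancelled_;
  }

 private:
  std::mutex mu_;
  bool cancelled_ = false;
  std::vector<std::function<void()>> callbacks_;
};
```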
PiperOrigin-RevId: 313814162
Change-Id: Ia866f5f522a0b1aa54e9dce7b9cc0bcf7682136a
In executing a multi-device or distributed function, one component function failure could cause other component functions to hang due to dependencies (e.g., they are waiting to receive tensors from the failed component function). This often leads to issues that are hard to debug, especially with a large number of workers.
This change cancels local and remote component functions in multi-device function execution if one component function fails, by cancelling the function rendezvous and the component function execution request RPCs. Since the cancelled errors are marked as derived, the original failure error message will be reported to users.
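The derived-status idea can be illustrated as follows, with hypothetical types rather than the real tensorflow::Status API: cancellation errors triggered as a side effect of the first failure are marked derived, so error reporting can prefer the original, root-cause error.

```cpp
#include <string>
#include <vector>

struct Status {
  std::string message;
  bool is_derived = false;  // true for errors caused by cancellation fan-out
};

// Returns the first non-derived (root-cause) error if any; otherwise the
// first error, or OK if there are none.
Status SummarizeErrors(const std::vector<Status>& errors) {
  for (const auto& s : errors) {
    if (!s.is_derived) return s;
  }
  return errors.empty() ? Status{"OK", false} : errors.front();
}
```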
PiperOrigin-RevId: 311805431
Change-Id: I2f0b819e2b0a228fdeb242361b41ef4cadc7e3d2
- Support copying a packed TensorHandle from a client to a remote worker.
PiperOrigin-RevId: 311404609
Change-Id: Iadf2c7793dc3631f7be05de611d059733bbfdd63
Introduce a C API TFE_CreatePackedTensorHandle which creates a TFE_TensorHandle referring to multiple TFE_TensorHandles.
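Conceptually, a packed handle is one handle that refers to several per-device component handles. The sketch below illustrates only that concept; the actual API is the C function TFE_CreatePackedTensorHandle, and the types here are invented for illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for a per-device tensor handle (illustrative only).
struct TensorHandle {
  int device_id;
};

// A packed handle simply refers to multiple component handles, one per device.
struct PackedTensorHandle {
  std::vector<TensorHandle> components;
  size_t NumComponents() const { return components.size(); }
};

PackedTensorHandle CreatePackedTensorHandle(std::vector<TensorHandle> handles) {
  return PackedTensorHandle{std::move(handles)};
}
```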
PiperOrigin-RevId: 310610230
Change-Id: Icc0ffd5c58ad7780eca38d552c1a2f4617f04891
- Expand packed _Arg nodes when the graph is ready for graph partitioning.
- Introduce an optional sub-index for function _Arg nodes, in order to distinguish between two arguments with the same "index". This situation arises after replacing a packed _Arg node assigned to a CompositeDevice with multiple replica nodes (one per device).
The "index" of an _Arg node is unique before expansion, and also unique within each subgraph after graph partitioning.
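The (index, sub-index) identity can be sketched as below. The names mirror the description above but are illustrative, not the actual graph code: replicas produced by expanding a packed _Arg share the original index and are told apart by the sub-index.

```cpp
#include <set>
#include <tuple>

struct ArgKey {
  int index;      // original argument index, unique before expansion
  int sub_index;  // -1 if unset; >= 0 for per-device replica nodes
  bool operator<(const ArgKey& o) const {
    return std::tie(index, sub_index) < std::tie(o.index, o.sub_index);
  }
  bool operator==(const ArgKey& o) const {
    return index == o.index && sub_index == o.sub_index;
  }
};
```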
PiperOrigin-RevId: 309781835
Change-Id: Ic6e351f45b7523288b5dae30997ddf0dae86660b
This change modifies these includes to point to
"tensorflow/core/common_runtime/graph_constructor.h" instead. This change will enable us to remove the accidental dependency from //tensorflow/core/graph to //tensorflow/core/common_runtime.
PiperOrigin-RevId: 309035649
Change-Id: I2af0fdd6a6ccc4ae8d351a9117a69b6fc80c22e9
The current implementation of KernelAndDeviceFunc::Run ties up a thread while the function execution is pending. This leads to distributed deadlock in large-scale parameter server training when the number of workers exceeds the thread pool size.
This change leverages the RunComponentFunction request to execute component functions in a non-blocking manner. By avoiding tying up threads, it removes the limit on the number of component functions that can execute in parallel.
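The non-blocking shape can be sketched as follows (illustrative code, not KernelAndDeviceFunc itself): the call returns immediately and reports completion through a `done` callback, so no caller thread is tied up while the component function is pending.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Runs `fn` (standing in for a remote component function) asynchronously and
// invokes `done` on completion. The caller's thread is never blocked.
void RunComponentFunctionAsync(std::function<void()> fn,
                               std::function<void()> done) {
  std::thread([fn = std::move(fn), done = std::move(done)] {
    fn();
    done();  // completion callback
  }).detach();
}
```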
PiperOrigin-RevId: 308939721
Change-Id: I086f9ee587c4df76b303158f27c362a9bcb8314c
Modularize components in eager execute and eager service implementation. They will be shared and reused in the upcoming async code path.
PiperOrigin-RevId: 308347746
Change-Id: Ida9cca1a1a88d3e6509c61950f4eaa4f18dbe864
After this change, we no longer guarantee that an enqueue request does not overlap with a cluster update request. This means a function execution can see an inconsistent cluster view (e.g., some parts using the old server def and some the new one). Functions executed in parallel with or shortly after a cluster update can fail with connection errors due to this inconsistency if they span remote targets that are being updated. To prevent this race condition, the client needs to coordinate so that remote functions are not executed concurrently with cluster updates.
PiperOrigin-RevId: 308342265
Change-Id: I57fd5f3ec9acd1da906d2718d7b61398156bb4f4
* When setting a TensorHandle's remote shape, the context_view_id is currently used to index mirrors, but it should not be used as a check when setting the shape of the RemoteTensorHandleData itself.
* If a tensor is ready because it was previously poisoned, return the original error instead of the less useful error "SetShape is only called on non-ready handles".
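The poisoned-handle fix can be illustrated with the hypothetical types below: once a handle is poisoned with an error, later operations surface that original error rather than the generic "SetShape is only called on non-ready handles" message.

```cpp
#include <string>

struct RemoteHandle {
  bool ready = false;
  std::string poison;  // non-empty if the handle was poisoned with an error

  // Returns an error message, or "OK" on success.
  std::string SetShape() {
    if (!poison.empty()) return poison;  // original, useful error
    if (ready) {
      return "InternalError: SetShape is only called on non-ready handles";
    }
    ready = true;
    return "OK";
  }
};
```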
PiperOrigin-RevId: 307633262
Change-Id: I22436402c6beeb41731802060b59851e807627d9
ReusePort scenario: the parent process occupies the port, shares it through a service such as ZooKeeper, and then the child (TensorFlow) process reuses the port.
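On POSIX systems, the scenario above relies on both the parent and the TensorFlow child setting SO_REUSEPORT before bind(). A minimal sketch (port 0 asks the kernel for any free port; a real deployment would use the port published through ZooKeeper):

```cpp
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Creates a TCP socket with SO_REUSEPORT set and binds it to `port` on
// loopback. Returns the file descriptor, or -1 on failure. The caller owns
// the descriptor and must close() it.
int BindWithReusePort(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  int one = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) != 0) {
    close(fd);
    return -1;
  }
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```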
use with the latter since lookups / deletions from the rendezvous table block globally.
PiperOrigin-RevId: 306252615
Change-Id: I07e4402436ca75801804a578e3253ab5ff4ab762
In practice this is already done by expanding its absl::variant<>
definition in a handful of places. By making the type visible we can
properly account for its usage.
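The change amounts to giving the variant one visible, named definition that call sites share, instead of spelling out its alternatives at every use site. A sketch using std::variant in place of absl::variant, with invented alternatives:

```cpp
#include <string>
#include <variant>

// Before: call sites each wrote out
// std::variant<std::monostate, int, std::string> inline. After: one named,
// visible type that usage can be accounted for.
using HandleData = std::variant<std::monostate, int, std::string>;

bool HoldsError(const HandleData& d) {
  return std::holds_alternative<std::string>(d);
}
```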
PiperOrigin-RevId: 305760610
Change-Id: I95d65461ebb70c2d4e33eb59985b01d6cb18554e