The Abstract interfaces are shared with tracing mode.
Introduce an AbstractFunction that handles the conversion between an MLIR function and a FunctionDef, so that the runtime can query whichever representation is suitable. Right now this only supports GetFunctionDef, but an API for fetching the MLIR function directly will be added in future changes.
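As a rough illustration of the idea (not the actual C++ interface), the sketch below models a function object that owns one representation and converts on request. `FunctionDef`, `MlirFunction`, and the string-based "conversion" are hypothetical stand-ins; only the names AbstractFunction and GetFunctionDef come from this change.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class FunctionDef:
    """Hypothetical stand-in for the FunctionDef proto."""
    name: str
    body: str


class AbstractFunction(ABC):
    """Hides which representation a traced function is stored in."""

    @abstractmethod
    def get_function_def(self) -> FunctionDef:
        """Return the function as a FunctionDef, converting if needed."""


class MlirFunction(AbstractFunction):
    """Hypothetical function that is stored as MLIR text."""

    def __init__(self, name: str, mlir_text: str):
        self.name = name
        self.mlir_text = mlir_text

    def get_function_def(self) -> FunctionDef:
        # Placeholder conversion: a real one would lower MLIR to a proto.
        return FunctionDef(name=self.name, body="converted: " + self.mlir_text)


fn = MlirFunction("add_one", "func @add_one(...)")
fdef = fn.get_function_def()
```

The caller only depends on `AbstractFunction`, so a second accessor for the MLIR form could be added later without changing existing call sites.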
PiperOrigin-RevId: 316942774
Change-Id: I1abebbe853b98dd0048bab9fc092252f4caf3d1b
Creates one sync executor per thread.
Requires fixing a tangential use-after-free where the context assumed all of the thread-local executors were still allocated at shutdown.
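The per-thread arrangement can be sketched in Python with thread-local storage. This is only a model of the idea: `SyncExecutor`, `current_executor`, and `collect_executors` are hypothetical names, and the use-after-free at shutdown has no Python equivalent, so it is not modeled here.

```python
import threading


class SyncExecutor:
    """Hypothetical executor that runs each op inline on the calling thread."""

    def run(self, fn, *args):
        return fn(*args)


_thread_state = threading.local()


def current_executor() -> SyncExecutor:
    """Return this thread's executor, creating it on first use."""
    executor = getattr(_thread_state, "executor", None)
    if executor is None:
        executor = SyncExecutor()
        _thread_state.executor = executor
    return executor


def collect_executors(num_threads: int) -> list:
    """Run one op on each of num_threads threads; return the executors used."""
    seen = []
    lock = threading.Lock()

    def work():
        executor = current_executor()
        executor.run(lambda x: x + 1, 1)
        with lock:
            seen.append(executor)

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


executors = collect_executors(4)
```

Each worker thread ends up with its own executor object, and repeated lookups on the same thread return the same one.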
PiperOrigin-RevId: 316783819
Change-Id: I62e7a91dcccb847d4e1c2a5f08e30c2877556618
I wasn't sure this would work rather than deadlocking. But executors are per-thread, so it looks like ParallelDevice making new threads implicitly disables async anyway (which is what we need for now).
This will deadlock if the Context's default executor is async. We do have a public API for that, so I'll fix that integration in a follow-up.
PiperOrigin-RevId: 316004808
Change-Id: I2c5c946557c5ff28b0fc5252850f6de5648690fd
Does not create new executors for each of its threads at the moment, so I'm not sure how this will interact with a user-level async setting. It'll be easy enough to create one sync executor per thread if that's a problem.
Requires some tweaks to TFE_OpAddAttrs to make it safe to call for different ops in different threads (due to protobuf arena destruction being global and not thread-safe). These changes avoid serialization+deserialization of attributes, so they make sense anyway.
PiperOrigin-RevId: 315990383
Change-Id: I211ceae1eb34e490b4d2a1df76226995c7ae7929
I think we'll eventually need to move implicit broadcasting out of execute entirely for gradients to work, and moving it a little bit out helps simplify things for now.
PiperOrigin-RevId: 315575773
Change-Id: Ib7e42e5f68d7261a431a4d0de01ca471090cd967
Some limitations of the current implementation:
1. There is no cache for compiled functions. Right now the function is JIT-compiled on every invocation.
2. Nested functions are not supported.
3. Variables are not supported.
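For the first limitation, a cache of compiled functions might look roughly like the sketch below. This is not part of the change; `CompileCache`, its key of (function name, input signature), and the `compile_fn` callback are all assumptions made for illustration.

```python
class CompileCache:
    """Hypothetical cache that compiles each (name, signature) pair once."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._compiled = {}
        self.compile_count = 0

    def get(self, name, signature):
        key = (name, signature)
        if key not in self._compiled:
            # Cache miss: pay the JIT compilation cost exactly once per key.
            self.compile_count += 1
            self._compiled[key] = self._compile_fn(name, signature)
        return self._compiled[key]


def fake_compile(name, signature):
    # Stand-in for JIT compilation; returns a callable "executable".
    return lambda *args: (name, signature, args)


cache = CompileCache(fake_compile)
first = cache.get("f", ("float32",))
second = cache.get("f", ("float32",))
other = cache.get("f", ("int32",))
```

Repeated invocations with the same signature reuse the compiled object, while a new signature triggers one more compilation.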
PiperOrigin-RevId: 315031999
Change-Id: I8d96ed26d0da7c071b7f89e65d6ada7bbc290a37
The tests are failing, potentially due to issues shutting down RPC servers.
PiperOrigin-RevId: 314979132
Change-Id: Ie1fa76a1d8f096ddb09491d3ace4319bc0d5565b
The actual delta isn't huge; I'm moving some test utilities to a testlib since the remote tests need them. The remote tests are in a different target because they need to disable global heap checking, which I'd like to keep on for the rest of the tests.
PiperOrigin-RevId: 313698670
Change-Id: I846294a748e3b007eba0472901b0e58358b8edd5
This is plumbing just enough to pass all the unit tests.
The conversion to the function library is quite inefficient, but it isn't
clear if we want to optimize this or just focus on TFRT moving forward.
PiperOrigin-RevId: 313356850
Change-Id: I83815317d4958786d0103168b5d88498f89511ed
When executing a multi-device or distributed function, the failure of one component function can cause other component functions to hang due to dependencies (e.g., they are waiting to receive tensors from the failed component function). This often leads to issues that are hard to debug, especially with a large number of workers.
This change cancels local and remote component functions in multi-device function execution if one component function fails, by cancelling the function rendezvous and the component function execution request RPCs. Since the cancelled errors are marked as derived, the original failure error message will be reported to users.
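The cancellation behavior can be modeled with threads and a shared cancellation flag. This is a simplified sketch under assumptions: `ComponentError`, `run_components`, and the use of a `threading.Event` in place of the function rendezvous and RPC cancellation are all hypothetical.

```python
import threading


class ComponentError(Exception):
    """Hypothetical error type; `derived` marks errors caused by cancellation."""

    def __init__(self, message, derived=False):
        super().__init__(message)
        self.derived = derived


def run_components(components):
    """Run all components; on the first failure, cancel the rest.

    Returns the original (non-derived) error if there is one, else None.
    """
    cancelled = threading.Event()
    errors = []
    lock = threading.Lock()

    def runner(fn):
        try:
            fn(cancelled)
        except ComponentError as e:
            with lock:
                errors.append(e)
            cancelled.set()  # Unblock every component that is still waiting.

    threads = [threading.Thread(target=runner, args=(fn,)) for fn in components]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    originals = [e for e in errors if not e.derived]
    return (originals or errors or [None])[0]


def failing(cancelled):
    raise ComponentError("worker 1: out of memory")


def waiting(cancelled):
    # Stands in for a component blocked on a tensor from the failed one.
    if cancelled.wait(timeout=5.0):
        raise ComponentError("cancelled", derived=True)


error = run_components([waiting, failing, waiting])
```

Without the cancellation flag, the two waiting components would block until their timeout; with it, they exit promptly and the caller still sees the root-cause message.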
PiperOrigin-RevId: 311805431
Change-Id: I2f0b819e2b0a228fdeb242361b41ef4cadc7e3d2
This patch intends to make function tracing more of a first-class concept in
the API. It tries to move away from the "flat graph" model, in which
"placeholder" operations are introduced with the expectation of turning them
into function parameters later. Instead, the user starts by creating an empty
function, which is an ExecutionContext (and as such can trace operations).
Function parameters can be added to this context using a dedicated API that
returns an AbstractTensor.
The diff in UnifiedCAPI/TestBasicGraph is probably a good illustration of the
change from a client point of view.
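From the client side, the new flow can be pictured with the small Python model below. It is only a sketch: `FunctionContext`, `add_parameter`, and `execute` are hypothetical names chosen for illustration and do not correspond to the C API functions in this patch.

```python
class AbstractTensor:
    """Hypothetical symbolic tensor produced while tracing."""

    def __init__(self, name):
        self.name = name


class FunctionContext:
    """Hypothetical empty function that doubles as a tracing context."""

    def __init__(self, name):
        self.name = name
        self.parameters = []
        self.ops = []

    def add_parameter(self, dtype):
        # Parameters are declared up front rather than as placeholder ops.
        tensor = AbstractTensor(f"arg{len(self.parameters)}")
        self.parameters.append((tensor.name, dtype))
        return tensor

    def execute(self, op_type, inputs):
        # "Executing" an op in this context records it instead of running it.
        output = AbstractTensor(f"{op_type.lower()}_{len(self.ops)}")
        self.ops.append((op_type, [t.name for t in inputs], output.name))
        return output


fn = FunctionContext("double")
x = fn.add_parameter("float32")
y = fn.execute("Add", [x, x])
```

The traced body contains only real operations, and the parameter list is explicit from the start.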
Another important point of this patch is that no public C API is defined in
the `c_api_unified_experimental_graph.cc` file. Instead, the implementation is
dispatched through a registered factory function that creates the tracing
context. This will allow swapping the tracing implementation through injection
later.
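The factory-based dispatch can be sketched as a small registry. The names `register_tracing_factory` and `create_function`, and the idea of selecting an implementation by string key, are assumptions for illustration rather than the actual registration mechanism.

```python
_TRACING_FACTORIES = {}


def register_tracing_factory(impl_name, factory):
    """Register a callable that builds a tracing context (hypothetical API)."""
    _TRACING_FACTORIES[impl_name] = factory


def create_function(impl_name, fn_name):
    """Create a tracing context without naming a concrete implementation."""
    try:
        factory = _TRACING_FACTORIES[impl_name]
    except KeyError:
        raise ValueError(f"no tracing implementation named {impl_name!r}")
    return factory(fn_name)


class GraphTracingContext:
    """Hypothetical graph-based tracing implementation."""

    def __init__(self, fn_name):
        self.fn_name = fn_name


# The implementation file registers itself; callers never import it directly.
register_tracing_factory("graphdef", GraphTracingContext)
ctx = create_function("graphdef", "my_fn")
```

Because callers go through the registry, a different tracing backend can be injected by registering another factory under a new key.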
PiperOrigin-RevId: 311529850
Change-Id: I822047f4306835abc0e044dc87c14179596f64bd
- Support copying a packed TensorHandle from a client to a remote worker.
PiperOrigin-RevId: 311404609
Change-Id: Iadf2c7793dc3631f7be05de611d059733bbfdd63
Enable setting soft device placement as well as logging dynamically.
This required ensuring the device placement policy was part of the cache
key.
Further, we fix the logging so that, in eager mode, an execution is still
logged when its kernel is retrieved from the kernel cache. We also log closer
to the actual op execution to avoid logging before all checks have been done.
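Both points can be illustrated with a toy kernel cache. This is a sketch under assumptions: `KernelCache`, the (op, device, soft_placement) key, and the log message format are hypothetical and much simpler than the real cache key.

```python
class KernelCache:
    """Hypothetical kernel cache keyed on op, device, and placement policy."""

    def __init__(self):
        self._kernels = {}
        self.log = []

    def _get_kernel(self, op, device, soft_placement):
        # The placement policy is part of the key, so toggling it at runtime
        # cannot return a kernel that was selected under the old policy.
        key = (op, device, soft_placement)
        if key not in self._kernels:
            self._kernels[key] = f"kernel<{op}@{device}>"
        return self._kernels[key]

    def execute(self, op, device, soft_placement, log_placement):
        kernel = self._get_kernel(op, device, soft_placement)
        # Log right before running, whether the kernel was cached or not.
        if log_placement:
            self.log.append(f"Executing op {op} in device {device}")
        return kernel

    def size(self):
        return len(self._kernels)


cache = KernelCache()
cache.execute("MatMul", "CPU:0", soft_placement=False, log_placement=True)
cache.execute("MatMul", "CPU:0", soft_placement=False, log_placement=True)
cache.execute("MatMul", "CPU:0", soft_placement=True, log_placement=False)
```

The second call is a cache hit but is still logged, and the third call creates a separate cache entry because the policy changed.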
PiperOrigin-RevId: 311271808
Change-Id: I9765228894f84a3447cc03332a2559f6d933165b
Cluster updates (due to recreated distribution strategies, remote worker failures, etc.) can lead to segfault crashes when accessing resources created before the update. Some common patterns are:
* Accessing datasets created on old remote workers;
* Accessing variables created on failed workers;
* Garbage collecting datasets/iterators created on old remote workers.
This CL validates the remote devices before executing the ops, to make sure the access is safe, by looking up the device in a set of device pointers and checking its incarnation ID. Remote workers on restarted devices will have different incarnation IDs, and accessing resources on those devices will fail gracefully.
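The validation step can be sketched as follows. The sketch assumes a dictionary from device name to incarnation ID in place of the set of device pointers, and `Device`, `DeviceRegistry`, and `StaleDeviceError` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    incarnation: int


class StaleDeviceError(Exception):
    """Raised instead of crashing when a resource's device is gone."""


class DeviceRegistry:
    """Hypothetical view of the devices in the current cluster."""

    def __init__(self):
        self._live = {}

    def update_cluster(self, devices):
        # A restarted worker reports the same name with a new incarnation.
        self._live = {d.name: d.incarnation for d in devices}

    def validate(self, device):
        if self._live.get(device.name) != device.incarnation:
            raise StaleDeviceError(
                f"{device.name} (incarnation {device.incarnation}) "
                "is not part of the current cluster")


registry = DeviceRegistry()
old_worker = Device("/job:worker/task:0", incarnation=111)
registry.update_cluster([old_worker])
registry.validate(old_worker)  # Passes: the device is current.

# The worker restarts, so the same name now has a new incarnation ID.
registry.update_cluster([Device("/job:worker/task:0", incarnation=222)])
try:
    registry.validate(old_worker)
    failed_gracefully = False
except StaleDeviceError:
    failed_gracefully = True
```

A resource handle that still points at the old incarnation is rejected with an error before any op touches it.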
PiperOrigin-RevId: 311261000
Change-Id: Ifc07862229b06301e0275fe80975565d9df28152
The op doesn't really make sense to register kernels for, so I'm not registering it anywhere by default yet; it's currently just registered in the parallel device tests.
PiperOrigin-RevId: 311141160
Change-Id: Iff1839112dac6fe3406e4b31f0e6f7239809a5bb
Introduce a C API TFE_CreatePackedTensorHandle which creates a TFE_TensorHandle referring to multiple TFE_TensorHandles.
PiperOrigin-RevId: 310610230
Change-Id: Icc0ffd5c58ad7780eca38d552c1a2f4617f04891
c_api_tfrt is an implementation of the C API, so it should not be in c/eager/.
Move it to core/tfrt/eager to mirror the setup for the current TF runtime
directory core/common_runtime/eager.
PiperOrigin-RevId: 310590751
Change-Id: I6840756c321c29eec2a6b648c3484ec4fc8bd46e
We'll want this for implementing copy for `TF_AbstractOp`s backed by `TFE_Op`s (since we want to copy the type/attributes but not the inputs).
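The intended copy semantics can be sketched in Python. `Op` and `copy_without_inputs` are hypothetical names; the sketch only models "same type and attributes, no inputs" and says nothing about how the C API exposes it.

```python
import copy


class Op:
    """Hypothetical op with a type, attributes, and inputs."""

    def __init__(self, op_type, attrs=None):
        self.op_type = op_type
        self.attrs = dict(attrs or {})
        self.inputs = []

    def add_input(self, tensor):
        self.inputs.append(tensor)

    def copy_without_inputs(self):
        # Deep-copy the attributes so the two ops do not share mutable state.
        return Op(self.op_type, copy.deepcopy(self.attrs))


original = Op("MatMul", {"transpose_a": False, "T": "float32"})
original.add_input("x")
original.add_input("y")

clone = original.copy_without_inputs()
clone.attrs["transpose_a"] = True  # Does not affect the original.
```

The clone can then be given its own inputs, for example one copy per device.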
PiperOrigin-RevId: 309756974
Change-Id: I07a8c48f50ab6d3c8a7d7db972fb60202b86434d