Enable setting soft device placement as well as logging dynamically.
This required ensuring the device placement policy was part of the cache
key.
Further, we fix the logging so that, in eager mode, the execution is
still logged when a kernel is retrieved from the kernel cache. We also
log closer to the actual op execution, so that nothing is logged before
all checks have been done.
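A minimal sketch of why the placement policy has to be part of the cache key, written as a plain-Python model rather than the real runtime. `CacheKey`, `KernelCache`, and the device names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheKey:
    op_name: str
    requested_device: str
    # Part of the key, so toggling the policy cannot reuse a stale entry.
    allow_soft_placement: bool


class KernelCache:
    """Toy model of a kernel cache; not the TensorFlow implementation."""

    def __init__(self):
        self._kernels = {}
        self.log = []

    def run(self, op_name, requested_device, allow_soft_placement, log_placement):
        key = CacheKey(op_name, requested_device, allow_soft_placement)
        device = self._kernels.get(key)
        if device is None:
            # All checks happen here, before anything is logged.
            device = self._select_device(requested_device, allow_soft_placement)
            self._kernels[key] = device
        # Log right before "execution", on cache hits and misses alike.
        if log_placement:
            self.log.append(f"Executing op {op_name} in device {device}")
        return device

    @staticmethod
    def _select_device(requested_device, allow_soft_placement):
        available = {"CPU:0"}  # assumption: only a CPU is present
        if requested_device in available:
            return requested_device
        if allow_soft_placement:
            return "CPU:0"
        raise ValueError(f"Device {requested_device} is not available")
```

In this model, a second call with the same arguments is a cache hit that is still logged, while turning soft placement off produces a different key and therefore re-runs the placement check instead of silently reusing the CPU kernel.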
PiperOrigin-RevId: 311271808
Change-Id: I9765228894f84a3447cc03332a2559f6d933165b
Introduce a C API TFE_CreatePackedTensorHandle which creates a TFE_TensorHandle referring to multiple TFE_TensorHandles.
PiperOrigin-RevId: 310610230
Change-Id: Icc0ffd5c58ad7780eca38d552c1a2f4617f04891
We'll want this for implementing copy for `TF_AbstractOp`s backed by `TFE_Op`s (since we want to copy the type/attributes but not the inputs).
PiperOrigin-RevId: 309756974
Change-Id: I07a8c48f50ab6d3c8a7d7db972fb60202b86434d
The op name was included twice, and TFE_OpGetAttrs is unusable without a way to allocate a TFE_OpAttrs on the heap (and so has no callers). I'm removing it for now.
PiperOrigin-RevId: 308859222
Change-Id: Ibb3901a1821ffc2e9ebc0efb26592e5b3d8bb88f
Extern, plus it was missing TF_CAPI_EXPORT which is probably the main reason it wasn't in the Windows DLL
PiperOrigin-RevId: 305795200
Change-Id: I7ab3d847f3f60f71588f19bfa962a861d02bba44
The API accepts TFE_RegisterCustomDevice arguments as PyCapsules, so each custom device will need some method to create those. Presumably most custom devices will end up wrapping the PyCapsule creation+registration rather than exposing it to the user.
No public API yet, but this is roughly what I have in mind at the moment.
This only works with --config=monolithic or when the custom device registration is bundled with pywrap_tensorflow.so right now since that has its own copy of the C API. Something like this could work if we switched pywrap_tensorflow.so to instead rely on libtensorflow.so for the C API, then custom device extensions could link against that.
PiperOrigin-RevId: 305762978
Change-Id: I4d2d9bd9c01ba22391e138244a3948bae8963c5c
The existing TF_AllocateTensor & TFE_NewTensorHandle APIs do not take a
TFE_Context, which is undesirable as the TFE_Context indicates ownership
of the tensor. Thus we add new APIs to supersede the existing ones.
PiperOrigin-RevId: 305126310
Change-Id: I9863ebc692d48875c61b79197ab418f29503a8c6
Implicit mirroring is set to true by default already and is essential
for eager performance. This CL just removes dead code since there is no
API to disable mirroring for tensors.
We also shouldn't have this in the TensorHandleInterface class since
mirroring is a runtime-specific implementation detail.
PiperOrigin-RevId: 304421014
Change-Id: I383fa24da08a86028cabb3a4b1c5f2612d57336d
This will be useful for switching between graph building and eager execution (although that may need a different context type), but also gives us the option to pass a custom device representation into language bindings without requiring them to expose their TFE_Context directly (they still expose it to the custom device when executing operations).
PiperOrigin-RevId: 300630552
Change-Id: I41083c63db1b137af60f932114f1fcaae8ac2eb0
Add tests to demonstrate the usage of the primitives in handling exceptions thrown in remote async execution.
PiperOrigin-RevId: 297041596
Change-Id: Ibc9ffa7c5eaaa9b62c6849e815c0c933ff0ec86c
Also adds an experimental eager C API for serializing op attributes as generic name->value mappings.
It's a bit sad that being generic requires serializing here, but I don't see a great way around it if the attributes will be used generically (e.g. to build a FunctionDef). We can add special cases that don't require serialization for fetching attributes when the type is known.
PiperOrigin-RevId: 297003316
Change-Id: Id6e65bc7a8178fbbb8a85a542bd31def08225fe6
The eager executor tried to prevent forwarding of any input tensors by
incrementing the reference count of any "non-consumed" inputs. This
involved delicate logic that first identified "non-consumed" inputs as
those with a reference count greater than 1 (1 from Python and another
from the EagerOperation class), which then required "protecting" by
incrementing the reference count of the underlying tensor buffer. This
logic is heavyweight for the common case of synchronous execution. We
thus simplify it by protecting all TensorHandle Tensors at construction
and "unprotecting" them if the reference count is 1.
- Hold 2 reference counts on a TensorHandle's backing Tensor. This
protects the Tensor from being forwarded.
- Add the ability to unprotect a TensorHandle's backing Tensor when the
reference count is 1.
- Split ExecuteNode into Async implementation. The sync ExecuteNode
class can avoid various copies such as the list of inputs and the
forwarding map.
- Remove the experimental TFE_OpConsumeInput API. Input forwarding can
be achieved by releasing the handle after calling TFE_OpAddInput as
demonstrated by the added tests.
- Fix TF_AllocateTensor to return a forwardable tensor. Forwarding was
previously disabled because it re-used the logic in TF_NewTensor.
- Save mirror tensor when calling TFE_TensorHandleResolve.
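The protect/unprotect scheme described above can be modeled roughly as follows. This is a plain-Python sketch with manual counters, not the C++ implementation; `Buffer`, `TensorHandle`, `unprotect`, and `can_forward` are hypothetical names.

```python
class Buffer:
    """Stand-in for a tensor buffer with a manual reference count."""

    def __init__(self, data):
        self.data = data
        self.refs = 0


class TensorHandle:
    def __init__(self, data):
        self.buffer = Buffer(data)
        # One real reference plus one extra "protecting" reference.
        self.buffer.refs = 2
        # References to the handle itself (e.g. Python, EagerOperation).
        self.handle_refs = 1
        self._protected = True

    def unprotect(self):
        """Drop the protecting reference, but only for a sole owner."""
        if self.handle_refs == 1 and self._protected:
            self.buffer.refs -= 1
            self._protected = False
            return True
        return False


def can_forward(handle):
    """A kernel may reuse the input buffer only if it holds the unique reference."""
    return handle.buffer.refs == 1
```

In this model a freshly constructed handle is never forwardable. Forwarding becomes possible only after the caller has released its own reference to the handle, which mirrors the "release the handle after TFE_OpAddInput" pattern mentioned in the list above.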
PiperOrigin-RevId: 296225251
Change-Id: I484cfccbef8b44e82757b8bda0981cd7fd2f8096
Right now you can only fetch the whole attribute map and set it wholesale, but we can add more fine-grained attribute control in the future.
This allows the custom device API to pass in attributes, and custom devices to forward these to their own TFE_Execute calls. This is required for creating variables.
PiperOrigin-RevId: 296096192
Change-Id: I98c23bdcd13e479235b3e27850b1bb0bd7a53bba
We allow a TensorHandle to reference multiple tensors on the local host.
This allows us to essentially cache any implicit copies that occur
before executing an op. This helps avoid repeated copies if a tensor is
constantly fed to an op on a different device.
Additional clean-ups:
- Move CustomDevice TensorHandle constructor to separate constructor
- If the TensorHandle is on the host CPU device, ensure that device_ is
set to nullptr.
- Clean up CAPI test to use ASSERT_EQ instead of ASSERT_TRUE
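As a rough illustration of the local mirroring idea, the sketch below caches one copy per target device on the handle itself. It is a toy model under assumed names (`TensorHandle`, `on_device`, `copies`), with a list copy standing in for a device transfer.

```python
class TensorHandle:
    """Toy handle that remembers copies it has already made."""

    def __init__(self, value, device):
        self.value = value
        self.device = device
        self._mirrors = {}  # device name -> copied value
        self.copies = 0     # number of real copies performed

    def on_device(self, device):
        if device == self.device:
            return self.value
        if device not in self._mirrors:
            # Stand-in for an implicit host-to-device copy.
            self._mirrors[device] = list(self.value)
            self.copies += 1
        return self._mirrors[device]
```

Feeding the same handle to an op on another device several times then costs a single copy in this model, since later calls return the cached mirror.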
PiperOrigin-RevId: 294180977
Change-Id: I26892e9058973eebac557fc529b46de793418e12
Custom devices are an experimental hook into eager op execution, allowing experimentation outside the TensorFlow codebase. These devices do not work in traced code at the moment.
PiperOrigin-RevId: 293615055
Change-Id: I031da213e964caa7d4e11e0f491a3985d034b175
A lot of functionality in TFE_Op was simply a pass-through to
EagerOperation. We instead want the TFE_Op to be a simple struct and
have the functionality defined in the operation member.
The following changes were made:
- Remove a pointer to the TFE_Context in TFE_Op as the context is stored
in EagerOperation.
- Modify the constructor of EagerOperation to only take an EagerContext
pointer and require the caller to call Reset. This allows callers to
handle any errors from construction.
- We expect the context to not be null. We enforce this with references
and clean up the code to ensure that an eager operation is never reset
with a different context. As a result the `ctx` parameter has been
removed from TFE_OpReset.
- Move OpInferenceContext into EagerOperation
PiperOrigin-RevId: 290386452
Change-Id: I3ffb62b01dce230ddc555d84d6ae39fd4ec90b2f
Introduce `update_server_def`, which supports running remote ops and functions in a cluster with dynamic membership. The client will register new contexts on the newly added workers, remove old contexts from the removed servers, and rebuild the connections between workers for proper communication.
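The membership bookkeeping this implies can be sketched as a set difference between the old and new worker lists. This is only an illustrative model: `plan_cluster_update` and the worker addresses are hypothetical, and the real update also has to rebuild connections between workers.

```python
def plan_cluster_update(old_workers, new_workers):
    """Classify workers when the cluster definition changes."""
    old, new = set(old_workers), set(new_workers)
    return {
        "added": sorted(new - old),    # need a new context registered
        "removed": sorted(old - new),  # old context should be dropped
        "kept": sorted(old & new),     # context stays, connections refresh
    }


plan = plan_cluster_update(
    ["worker0:2222", "worker1:2222"],
    ["worker1:2222", "worker2:2222"],
)
```

Here `plan["added"]` holds `worker2:2222`, `plan["removed"]` holds `worker0:2222`, and `plan["kept"]` holds `worker1:2222`.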
PiperOrigin-RevId: 271234187
1. Remove async_wait() and async_clear_error() in EagerContext.
2. Allow getting current executor from EagerContext.
3. Remove StartAsync() method in EagerExecutor.
PiperOrigin-RevId: 262445965
This fixes a breakage on Python 3.7+, where the SWIG wrapper uses the reserved keyword `async` as a parameter name. This was recently fixed in https://github.com/swig/swig/pull/1382.
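The failure mode can be reproduced without SWIG. The sketch below compiles two small function definitions from source strings; `set_async` and its parameter names are made up for illustration and are not the generated wrapper.

```python
def compiles(source):
    """Return True if `source` is accepted by this interpreter's parser."""
    try:
        compile(source, "<wrapper>", "exec")
        return True
    except SyntaxError:
        return False


# A wrapper whose parameter is literally named `async`.
old_style = "def set_async(options, async):\n    pass\n"
# The same wrapper with the parameter renamed.
renamed = "def set_async(options, enable):\n    pass\n"
```

On Python 3.7 and later, where `async` is a reserved keyword, `compiles(old_style)` is false while `compiles(renamed)` is true.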
PiperOrigin-RevId: 260301284
The experimental interface uses `cancellation.CancellationManager`:
```python
c_mgr = cancellation.CancellationManager()

@tf.function
def f(...):
  ...

cancelable_f = c_mgr.get_cancelable_function(f.get_concrete_function(...))
# Call a function that might run for a long time.
cancelable_f(...)
# Asynchronously:
c_mgr.start_cancel()
```
A subsequent change will add a publicly-accessible (probably experimental) API endpoint for `CancellationManager`.
PiperOrigin-RevId: 258648702
This change is a step towards supporting user-driven cancellation for eager function calls. In a future change, I plan to add an experimental method for calling a `tf.function` and passing a `CancellationManager` argument, so that the caller can cancel execution asynchronously.
PiperOrigin-RevId: 256369003
When executing on a remote worker, we may have to copy the TensorHandle
for each executed op. To avoid duplicated work, we expand the
TensorHandle to keep track of mirrors which are tied to the lifetime of
the TensorHandle. If a mirror already exists on a remote worker, no
additional copy is needed.
The change consists of the following:
- Add map of remote mirrors in TensorHandle.
- Add `mirror` boolean argument to EagerCopyToDevice, which requests
that a mirror be configured if possible.
- Add Device argument to RemoteAddress to handle mirrors.
- Expose a ContextMirroringPolicy for the EagerContext. We plan to add
additional policies in the future, such as local tensor mirroring.
- Rename ContextDevicePlacementPolicy variables to be consistent with
ContextMirroringPolicy.
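How the `mirror` argument and a context-level policy might interact can be sketched as follows. This is a simplified model, not the runtime code: `MirroringPolicy`, `RemoteTensorHandle`, `copy_to_worker`, and the worker names are hypothetical.

```python
from enum import Enum


class MirroringPolicy(Enum):
    NONE = 0  # never keep remote mirrors
    ALL = 1   # keep a mirror whenever a copy is made


class RemoteTensorHandle:
    def __init__(self, value):
        self.value = value
        self.remote_mirrors = {}  # worker name -> mirrored value
        self.transfers = 0        # number of real transfers performed


def copy_to_worker(handle, worker, policy, mirror=True):
    """Return the value as seen on `worker`, reusing a mirror if one exists."""
    if worker in handle.remote_mirrors:
        return handle.remote_mirrors[worker]
    handle.transfers += 1
    copied = list(handle.value)  # stand-in for a remote copy
    # The mirror lives on the handle, so it shares the handle's lifetime.
    if mirror and policy is MirroringPolicy.ALL:
        handle.remote_mirrors[worker] = copied
    return copied
```

With `MirroringPolicy.ALL`, repeated ops on the same worker trigger one transfer in this model. With `MirroringPolicy.NONE`, or with `mirror=False`, every call transfers again.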
PiperOrigin-RevId: 253945140