Commit Graph

499 Commits

Author SHA1 Message Date
VoVAllen
ed8d661b5d cp 2020-07-02 07:37:10 +00:00
VoVAllen
8cdfc53a63 Cherry pick dlpack fix into r2.3 2020-07-02 07:31:34 +00:00
Haitang Hu
780c0a29fe Update device_id to be int32 rather than int64.
PiperOrigin-RevId: 317709385
Change-Id: I577dd469d223cc05c50dbdb6a8bd908e2e757344
2020-06-22 13:09:26 -07:00
Brian Zhao
ebf57bdfc7 Moving RAII helpers for TensorHandle, Tensor, and Operation to their respective classes.
PiperOrigin-RevId: 317578771
Change-Id: Iaf674696ea7d7dfdf94924f4c60d555a613c5f57
2020-06-21 19:12:38 -07:00
A. Unique TensorFlower
e647a3b425 Add experimental C API to access EagerContext context ID.
PiperOrigin-RevId: 317476439
Change-Id: I9e97bce61cf526695f0c903b5f4f837116fef455
2020-06-20 11:32:20 -07:00
Saurabh Saxena
8d5171bad7 Split Abstract interfaces into Abstract and ImmediateExecution interfaces.
The Abstract interfaces are shared with tracing mode.
Introduce an AbstractFunction which handles the conversion between MLIR function and FunctionDef and the runtime can query whichever representation is suitable. Right now this only supports GetFunctionDef but an API for fetching the MLIR function directly will be added in future changes.

PiperOrigin-RevId: 316942774
Change-Id: I1abebbe853b98dd0048bab9fc092252f4caf3d1b
2020-06-17 12:39:13 -07:00
Allen Lavoie
f8657c62c6 Parallel device: avoid deadlocks when the EagerContext's default executor is async
Creates one sync executor per thread.

Requires fixing a tangential use-after-free where the context assumed all of the thread-local executors were still allocated at shutdown.

PiperOrigin-RevId: 316783819
Change-Id: I62e7a91dcccb847d4e1c2a5f08e30c2877556618
2020-06-16 16:53:43 -07:00
Kibeom Kim
aee694363c Support TFRT async config.
PiperOrigin-RevId: 316733436
Change-Id: I0eef2279b9c77d6084c1300a8d38a987a2cee065
2020-06-16 12:26:42 -07:00
Brian Zhao
08420ed0b6 Removing LoadSavedModelAPI from AbstractContextInterface. SavedModel is conceptually layered on top of the runtime, as it uses the context to eagerly execute ops (like reloading resources, restoring tensors, etc). This fixes what otherwise would be a circular dependency once we implement SavedModelAPI (context -> SavedModelAPI -> context).
PiperOrigin-RevId: 316169246
Change-Id: I2349803c185f771a65e836273f326bd81622c42b
2020-06-12 13:57:13 -07:00
Allen Lavoie
854df86f53 Parallel device: Add tests for async executors+collectives
I wasn't sure this would work rather than deadlocking. But executors are per-thread, so it looks like ParallelDevice making new threads implicitly disables async anyway (which is what we need for now).

This will deadlock if the Context's default executor is async. We do have a public API for that, so I'll fix that integration in a followup.

PiperOrigin-RevId: 316004808
Change-Id: I2c5c946557c5ff28b0fc5252850f6de5648690fd
2020-06-11 16:54:54 -07:00
Allen Lavoie
348b3e6224 Parallel device: switch to manual threading to avoid synchronization issues with remote eager
Does not create new executors for each of its threads at the moment, so I'm not sure how this will interact with a user-level async setting. It'll be easy enough to create one sync executor per thread if that's a problem.

Requires some tweaks to TFE_OpAddAttrs to make it safe to call for different ops in different threads (due to protobuf arena destruction being global and not thread-safe). These changes avoid serialization+deserialization of attributes, so they make sense anyway.

PiperOrigin-RevId: 315990383
Change-Id: I211ceae1eb34e490b4d2a1df76226995c7ae7929
2020-06-11 15:33:39 -07:00
Allen Lavoie
0507043058 Parallel device: move broadcasting a bit earlier to simplify type signatures
I think we'll eventually need to move implicit broadcasting out of execute entirely for gradients to work, and moving it a little bit out helps simplify things for now.

PiperOrigin-RevId: 315575773
Change-Id: Ib7e42e5f68d7261a431a4d0de01ca471090cd967
2020-06-09 15:52:53 -07:00
Xiao Yu
54d74710e4 Initial @tf.function support in TFRT. And enable some tests in def_function_test.py.
Some limitations of the current implementation:
1. We don't have a cache for compiled functions. Right now the function is JIT-compiled on each invocation.
2. Does not support nested functions.
3. Does not support variables.

PiperOrigin-RevId: 315031999
Change-Id: I8d96ed26d0da7c071b7f89e65d6ada7bbc290a37
2020-06-05 18:37:04 -07:00
Gunhan Gulsoy
21715bfb30 Disable flaky tests on windows.
The tests are failing potentially due to issues shutting down rpc servers

PiperOrigin-RevId: 314979132
Change-Id: Ie1fa76a1d8f096ddb09491d3ace4319bc0d5565b
2020-06-05 13:00:24 -07:00
Allen Lavoie
bfc32671fd Factor out C++ types from the parallel device
PiperOrigin-RevId: 314807016
Change-Id: I4e41ac3e8a08ea0f1db93826652142a083f17fd1
2020-06-04 14:49:16 -07:00
Allen Lavoie
fb6e0c1cdd Parallel device: sync executors after each parallel op
PiperOrigin-RevId: 314396769
Change-Id: I4697f3e488d351d8610cfb80da1701e2fa24848e
2020-06-02 13:56:53 -07:00
Haoyu Zhang
58f1e31019 Fix c_api_remote_test tsan flakiness.
PiperOrigin-RevId: 313813747
Change-Id: I428eaa271adcb0cca3236edd0c52232dda9719a6
2020-05-29 13:08:17 -07:00
Allen Lavoie
69c0447f01 ParallelDevice: Sync executors when returning non-parallel TensorHandles, add remote tests
The actual delta isn't huge; I'm moving some test utilities to a testlib since the remote tests need them. The remote tests are in a different target because they need to disable global heap checking, which I'd like to keep on for the rest of the tests.

PiperOrigin-RevId: 313698670
Change-Id: I846294a748e3b007eba0472901b0e58358b8edd5
2020-05-28 18:30:40 -07:00
Mehdi Amini
f0ef163443 Add an MLIR tracing implementation to the C unified API
This is plumbing just enough to pass all the unit-tests.
The conversion to the function library is quite inefficient, but it isn't
clear if we want to optimize this or just focus on TFRT moving forward.

PiperOrigin-RevId: 313356850
Change-Id: I83815317d4958786d0103168b5d88498f89511ed
2020-05-27 03:00:08 -07:00
Bramandia Ramadhana
550581f6bd When calling connect_to_cluster, if the options are identical and no local device is renamed, reuse the existing local DeviceManager; otherwise, keep the old DeviceManager around so that previously created Tensors remain usable.
PiperOrigin-RevId: 312489501
Change-Id: Id392d0324aba7e7f9e92f8efeaf33683157470e1
2020-05-20 08:53:52 -07:00
Yifei Feng
81f379b0f4 Disable flaky tensorflow/c/eager:c_api_remote_test on asan
PiperOrigin-RevId: 312403791
Change-Id: Ide29a0e661c6dcb44abb0d39657d1e97fecf04d6
2020-05-19 19:44:41 -07:00
Alexandre Passos
b4360f894c Remove unused experimental APIs
PiperOrigin-RevId: 312330039
Change-Id: I721642d67294ea5e0ba3702058106ea423db72d1
2020-05-19 12:36:03 -07:00
Yifei Feng
714092f360 Disable flaky tensorflow/c/eager:c_api_test
PiperOrigin-RevId: 312195494
Change-Id: I7cbd78f2142ef586e6ca78da73c2cf53304ae3b6
2020-05-18 18:40:40 -07:00
Haoyu Zhang
c61bc6a4f3 Support cancellation in multi-device and distributed function execution.
In executing a multi-device or distributed function, one component function failure could cause other component functions to hang due to dependencies (e.g., they are pending receiving tensors from the failed component function). This can often lead to issues that are hard to debug especially with a large number of workers.

This change cancels local and remote component functions in multi-device function execution if one component function fails, by cancelling the function rendezvous and the component function execution request RPCs. Since the cancelled errors are marked as derived, the original failure error message will be reported to users.

PiperOrigin-RevId: 311805431
Change-Id: I2f0b819e2b0a228fdeb242361b41ef4cadc7e3d2
2020-05-15 14:53:58 -07:00
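The cancellation behavior described in c61bc6a4f3 can be illustrated with a small sketch. This is not TensorFlow code: it uses Python's `asyncio` as a stand-in for component function execution, and `run_components` is a hypothetical name chosen for this example. It shows the general pattern only: when one component fails, the still-pending components are cancelled and the original error (not the derived cancellation) is reported.

```python
import asyncio
from typing import Any, Awaitable, List


async def run_components(components: List[Awaitable[Any]]) -> List[Any]:
    """Run all components; if one fails, cancel the rest and re-raise its error.

    Hypothetical helper for illustration; not part of any TensorFlow API.
    """
    tasks = [asyncio.ensure_future(c) for c in components]

    # Return as soon as any component raises (or when all have finished).
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)

    # Cancel components that would otherwise hang waiting on the failed one.
    for task in pending:
        task.cancel()
    if pending:
        # Collect the cancellations so they do not surface as the reported error.
        await asyncio.gather(*pending, return_exceptions=True)

    # Report the original failure rather than the derived cancellations.
    for task in done:
        if task.exception() is not None:
            raise task.exception()

    return [task.result() for task in tasks]
```

A component that would block forever (for example, one awaiting a tensor from the failed component) is cancelled here instead of hanging, and the caller sees only the root-cause exception.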
Xiao Yu
dc1c299833 Add Unsupported dtype in tfrt for backward compatibility. We will use this dtype to support legacy types (e.g. DT_RESOURCE, DT_VARIANT) that are not natively implemented in TFRT.
PiperOrigin-RevId: 311791879
Change-Id: Ied0bfadf68f07e68fe8eb941c0d02bcb9f1a0b40
2020-05-15 13:36:05 -07:00
Xiao Yu
d968853cc6 Skip TFE_ContextAsyncWait for tfrt. In current TF-TFRT integration, all ops are executed synchronously. We will revisit this later.
PiperOrigin-RevId: 311777624
Change-Id: I3a27805dcce53ccf572f3c500d6fd0a532b286b2
2020-05-15 12:18:55 -07:00
Yujing Zhang
5cf4311435 Fix a memory leak.
PiperOrigin-RevId: 311662668
Change-Id: I59f9c9cdb8baed7a9828bb818ce1d293d185e6b6
2020-05-14 21:03:46 -07:00
Mehdi Amini
215616fddc Add support for setting up a TF_OutputList from the client and use it to build function with multiple results
PiperOrigin-RevId: 311585364
Change-Id: I5245fd0f5e5c0e8e7e22350d970c508e0154d59b
2020-05-14 12:38:55 -07:00
A. Unique TensorFlower
5d3c548620 Resolve trivial aliases for portable TensorFlow targets.
PiperOrigin-RevId: 311548335
Change-Id: I837aa5a62500682783607841f0c993c2b6c238ed
2020-05-14 09:39:36 -07:00
Mehdi Amini
ec2cc2903f Introduce a higher-level function handling in the tracing oriented unified API
This patch intends to make function tracing more of a first-class concept in
the API. It tries to move away from the "flat graph" model, in which "placeholder"
operations are introduced with the expectation of turning them into function
parameters later. Instead, the user starts by creating an empty function, which
is an ExecutionContext (and as such can trace operations). Function parameters
can be added to this context using a dedicated API that returns an AbstractTensor.

The diff in UnifiedCAPI/TestBasicGraph is probably a good illustration of the
change from a client point of view.

Another important point of this patch is that no public C API is
defined in the `c_api_unified_experimental_graph.cc` file; instead, the
implementation is dispatched based on a registered factory function that creates the
tracing context. This will allow swapping the tracing implementation through
injection later.

PiperOrigin-RevId: 311529850
Change-Id: I822047f4306835abc0e044dc87c14179596f64bd
2020-05-14 07:45:39 -07:00
Yujing Zhang
0ac3572e8d Make SerializeRemoteTensorHandle block only when the remote op is a function, in order to still benefit from async execution.
PiperOrigin-RevId: 311423473
Change-Id: I87a3973ddf1954facb69c14499ce2fa07a9d6e99
2020-05-13 16:10:53 -07:00
Yujing Zhang
8588e0aab8 Support running a remote function with packed input handles.
- Support copying a packed TensorHandle from a client to a remote worker.

PiperOrigin-RevId: 311404609
Change-Id: Iadf2c7793dc3631f7be05de611d059733bbfdd63
2020-05-13 14:31:22 -07:00
Gaurav Jain
d5b3ec27d1 Allow dynamically configuring device placement
Enable setting soft device placement as well as logging dynamically.
This required ensuring the device placement policy was part of the cache
key.

Further, we fix the logging to ensure in eager mode if a kernel is
retrieved from the kernel cache, then the execution is still logged. We
also log closer to the actual op execution to avoid logging before all
checks have been done.

PiperOrigin-RevId: 311271808
Change-Id: I9765228894f84a3447cc03332a2559f6d933165b
2020-05-12 23:17:39 -07:00
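To make the two points in d5b3ec27d1 concrete, here is a minimal sketch of a kernel cache whose key includes the placement policy and which logs every execution, including cache hits. All names (`KernelCacheKey`, `KernelCache`, the device strings) are hypothetical and chosen for this example; the device-selection rule is a toy placeholder, not TensorFlow's actual placement logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class KernelCacheKey:
    op_name: str
    input_dtypes: Tuple[str, ...]
    # The policy is part of the key, so changing it at runtime cannot
    # return a kernel that was placed under the previous policy.
    soft_placement: bool


class KernelCache:
    """Toy kernel cache; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._kernels: Dict[KernelCacheKey, str] = {}
        self.soft_placement = False  # may be changed dynamically
        self.log: List[str] = []

    def lookup(self, op_name: str, input_dtypes: Sequence[str]) -> str:
        key = KernelCacheKey(op_name, tuple(input_dtypes), self.soft_placement)
        kernel = self._kernels.get(key)
        if kernel is None:
            # Toy placement rule (an assumption of this sketch).
            device = "CPU:0" if self.soft_placement else "GPU:0"
            kernel = f"{op_name}@{device}"
            self._kernels[key] = kernel
        # Log on every execution, whether or not the kernel came from the cache.
        self.log.append(kernel)
        return kernel
```

If the policy were left out of the key, the second lookup after toggling `soft_placement` would silently reuse the kernel placed under the old policy.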
Haoyu Zhang
1c74b32aa2 Validate remote resource devices before safe access of resources.
Cluster updates (due to recreated distribution strategies, remote worker failures, etc.) can lead to crashing failures with segfaults when accessing resources created before the update. Some common patterns are:
* Accessing datasets created on old remote workers;
* Accessing variables created on failed workers;
* Garbage collecting datasets/iterators created on old remote workers;

This CL validates the remote devices before executing the ops, to make sure the access is safe, by looking up the device in a set of device pointers and checking its incarnation ID. Remote workers on restarted devices will have different incarnation IDs, and accessing resources on those devices will fail gracefully.

PiperOrigin-RevId: 311261000
Change-Id: Ifc07862229b06301e0275fe80975565d9df28152
2020-05-12 21:27:15 -07:00
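The incarnation-ID check described in 1c74b32aa2 can be sketched as follows. This is an illustrative model only: `Cluster`, `ResourceHandle`, and `StaleDeviceError` are hypothetical names, and a dictionary stands in for the set of device pointers mentioned in the commit message.

```python
from dataclasses import dataclass
from typing import Dict


class StaleDeviceError(RuntimeError):
    """Raised instead of crashing when a resource's device is gone or restarted."""


@dataclass(frozen=True)
class ResourceHandle:
    name: str
    device: str
    incarnation: int  # incarnation ID of the device when the resource was created


class Cluster:
    """Toy device registry; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._incarnations: Dict[str, int] = {}

    def update_device(self, device: str, incarnation: int) -> None:
        # Called on cluster updates; a restarted worker reports a new incarnation.
        self._incarnations[device] = incarnation

    def remove_device(self, device: str) -> None:
        self._incarnations.pop(device, None)

    def create_resource(self, name: str, device: str) -> ResourceHandle:
        return ResourceHandle(name, device, self._incarnations[device])

    def validate(self, handle: ResourceHandle) -> None:
        # Check before executing any op that touches the resource.
        current = self._incarnations.get(handle.device)
        if current is None:
            raise StaleDeviceError(f"device {handle.device} is no longer in the cluster")
        if current != handle.incarnation:
            raise StaleDeviceError(f"device {handle.device} was restarted")
```

Accessing a dataset or variable created before a worker restart then fails with a clear error at validation time rather than with a segfault.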
Brian Zhao
5100abc4af Initial checkin of C++ header-only TensorHandle as part of RFC https://github.com/tensorflow/community/pull/207.
PiperOrigin-RevId: 311179503
Change-Id: Ib3cfb2547150d09ee655db6ca6bc72ef3ef7adde
2020-05-12 12:37:48 -07:00
Allen Lavoie
8e3bc844b1 Add support for a device ID op in parallel_device
The op doesn't really make sense to register kernels for, so I'm not registering it anywhere by default yet; it's currently just registered in the parallel device tests.

PiperOrigin-RevId: 311141160
Change-Id: Iff1839112dac6fe3406e4b31f0e6f7239809a5bb
2020-05-12 09:34:03 -07:00
Allen Lavoie
22a24beeee Add a TF-internal visibility declaration for the parallel device
PiperOrigin-RevId: 311028466
Change-Id: Ic19ed07c49b796c94e0fd3370aa0cf5c83fe3fd6
2020-05-11 17:33:03 -07:00
Mehdi Amini
4da3e08cd5 Fix typo in the test names: UnifedCAPI->UnifiedCAPI
PiperOrigin-RevId: 310941575
Change-Id: Ie6b16eb317bd15e98de54652568f25827cf147df
2020-05-11 10:40:46 -07:00
Mehdi Amini
3528e494a2 Add missing ASSERT_EQ on status after API call in c_api_unified_experimental_test
PiperOrigin-RevId: 310671562
Change-Id: Id0b07d6889340d631f6144f8fa6dd3f3309b3776
2020-05-08 19:35:36 -07:00
Kibeom Kim
837b493f3c Implement Numpy to tensor conversion for TFRT.
PiperOrigin-RevId: 310657168
Change-Id: I3133a28194f41586f377d688dc64bff52f120d33
2020-05-08 17:18:26 -07:00
Yujing Zhang
7e6ea21148 Support running a function with packed input handles through C APIs.
Introduce a C API TFE_CreatePackedTensorHandle which creates a TFE_TensorHandle referring to multiple TFE_TensorHandles.

PiperOrigin-RevId: 310610230
Change-Id: Icc0ffd5c58ad7780eca38d552c1a2f4617f04891
2020-05-08 12:53:55 -07:00
Jing Dong
ae3c619cf7 Move c_api_tfrt to core/tfrt/eager (NFC)
c_api_tfrt is an implementation of the C API, so it should not be in c/eager/.
Move it to core/tfrt/eager to mirror the setup for the current TF runtime
directory core/common_runtime/eager.

PiperOrigin-RevId: 310590751
Change-Id: I6840756c321c29eec2a6b648c3484ec4fc8bd46e
2020-05-08 11:15:16 -07:00
Yujing Zhang
b5b150f79c Fix an issue of out of order execution. Don't serialize a remote input handle for function execution until it's ready on a remote device. Otherwise, on a remote worker, a remote function execution request could be enqueued before a request for producing a function input.
PiperOrigin-RevId: 310253012
Change-Id: I20e649494ec27f4bd581798d2ed458453f75d30f
2020-05-06 16:47:48 -07:00
Xiao Yu
d243d43513 Add AbstractContextInterface::StartStep() and AbstractContextInterface::EndStep() which are used in training models.
PiperOrigin-RevId: 309790225
Change-Id: I909affc76f0e09a6f3a8cf5994f0e64b933bd381
2020-05-04 11:57:06 -07:00
Allen Lavoie
6e3bea20a1 Less pointer indirection for TFE_OpAttrs, add TFE_OpGetAttrs
We'll want this for implementing copy for `TF_AbstractOp`s backed by `TFE_Op`s (since we want to copy the type/attributes but not the inputs).

PiperOrigin-RevId: 309756974
Change-Id: I07a8c48f50ab6d3c8a7d7db972fb60202b86434d
2020-05-04 09:24:03 -07:00
Haoyu Zhang
1fd3d693a9 Fix flakiness in c_api_remote_test.
PiperOrigin-RevId: 309452719
Change-Id: I1ac91638464b2a3a1914cee5cadb0353b307e6df
2020-05-01 12:30:47 -07:00
Haoyu Zhang
9596f52784 Avoid partially creating/updating cluster when some workers fail during update.
PiperOrigin-RevId: 309397022
Change-Id: I0c20db46a5c0bc629a662a5854d4b72e39b82322
2020-05-01 06:19:03 -07:00
Mihai Maruseac
ac54b6e7f1 Disable flaky guitar test
PiperOrigin-RevId: 309346828
Change-Id: I13b7651af2532e7cab78bfbefc61d69ea7b2de5b
2020-04-30 19:45:52 -07:00
Mihai Maruseac
fa5194b3b5 Disable flaky test on TSAN
PiperOrigin-RevId: 309344029
Change-Id: I2aab1976de7d95f05c07869be1bae6fae70fc7d1
2020-04-30 19:13:13 -07:00
Mihai Maruseac
387429dd3b Rollback of "Avoid partially creating/updating cluster when some workers fail during update." as it breaks on Windows
PiperOrigin-RevId: 309330222
Change-Id: Ib769799aa072af218f92e8693ce5be764e171dc6
2020-04-30 17:28:56 -07:00