869 Commits

Author SHA1 Message Date
Xiao Yu
fe6e64b098 Refactor eager placement logic into three util methods:
- MaybePinSmallOpsToCpu
- MaybePinToResourceDevice
- MaybePinToCustomDevice

We are going to reuse MaybePinSmallOpsToCpu in TFRT but not the other two, because TFRT has neither a native Resource nor a Custom Device.

PiperOrigin-RevId: 317766813
Change-Id: I43241b5786120ddf39dc4bfff6071239afdfd785
2020-06-22 18:05:46 -07:00
Yujing Zhang
38d95ad2d8 [Cleanup] Remove allowed_devices of ResourceHandle since it's no longer used.
PiperOrigin-RevId: 317710941
Change-Id: Ib1920c5ee25d405290f852b725d693ee5ea09766
2020-06-22 13:18:43 -07:00
Yujing Zhang
39d080e8b9 Use the same CompositeDevice name on remote workers as the one on a client.
PiperOrigin-RevId: 317702206
Change-Id: I7068efb25eb930252f89a167108ed59c69c2078f
2020-06-22 12:42:05 -07:00
Haoyu Zhang
18c43b7bd4 Clear cancel callback when gRPC eager call returns with state.
PiperOrigin-RevId: 317393892
Change-Id: Ife800821494dd4cc2992eec9a5470d989596a6d7
2020-06-19 15:48:45 -07:00
Haoyu Zhang
bd98ba765a Fix cancellation race condition in BaseRendezvousMgr::RegisterCall
PiperOrigin-RevId: 317363743
Change-Id: Ide89dd360a9885b5e8f67b12f362cbce8cb85d80
2020-06-19 13:04:53 -07:00
Jiho Choi
287a60c12a Change the prefix of xprof arguments from "$" to "_".
PiperOrigin-RevId: 316166741
Change-Id: I3a83f550755e6c34a179781d61ab6c24b03cca13
2020-06-12 13:44:24 -07:00
Jiho Choi
b46d4e1c09 Fix FunctionRun's TraceMe and apply the new TraceMe APIs.
PiperOrigin-RevId: 315830899
Change-Id: Ic16e3e98efa6bbcb702f3a59c9236eb35e3f5e6f
2020-06-10 21:47:11 -07:00
A. Unique TensorFlower
b6b9f0815e LSC: Replace cord.ToString() with std::string(cord)
PiperOrigin-RevId: 314966296
Change-Id: Icfadb1d6848c10f5ab5f081f36c496bdc98b0402
2020-06-05 11:45:40 -07:00
Haoyu Zhang
48678a1e2d Pass in GrpcWorkerEnv when creating GrpcWorkerCache.
PiperOrigin-RevId: 314769356
Change-Id: I154786dbba5eeb69d151baed83fc89a9dcd6a989
2020-06-04 11:46:09 -07:00
Jiri Simsa
fe33f393b8 Allow tf.debugging.set_log_device_placement to control logging of device placement for tf.functions.
PiperOrigin-RevId: 314746176
Change-Id: I2516ec91e99f5d2b21db11b1cbfa2fdd1d5ef796
2020-06-04 09:47:31 -07:00
Haoyu Zhang
356121e563 Server-side cancellation support for distributed function execution.
1. Thread the RPC cancel signal through the eager service RunComponentFunction calls;
2. Always pass the cancellation manager to the underlying executor (instead of only passing it when `is_eager` is true, i.e., for pure eager ops). With this we do not need to cancel the rendezvous from the process FLR; instead the ExecutorState takes care of it when an op fails.
3. Do not mark all statuses as derived when aborting the rendezvous or triggering cancellation. Marking them all as derived usually buries the original error among the derived errors.

PiperOrigin-RevId: 313814162
Change-Id: Ia866f5f522a0b1aa54e9dce7b9cc0bcf7682136a
2020-05-29 13:11:45 -07:00
Haoyu Zhang
58f1e31019 Fix c_api_remote_test tsan flakiness.
PiperOrigin-RevId: 313813747
Change-Id: I428eaa271adcb0cca3236edd0c52232dda9719a6
2020-05-29 13:08:17 -07:00
Haoyu Zhang
956278ab3d Make GrpcEagerClientCache::GetClient thread safe.
PiperOrigin-RevId: 313211894
Change-Id: I3195db70af77816183cf041d024f694c32613164
2020-05-26 10:11:43 -07:00
Haoyu Zhang
09af9319d9 Make sure the rendezvous abort check is finished before triggering the callback.
PiperOrigin-RevId: 313204522
Change-Id: I88f38391d9ee2296fac9a6e86bb9f9d2c477f1c8
2020-05-26 09:31:52 -07:00
Bramandia Ramadhana
550581f6bd When calling connect_to_cluster, if the options are identical and no local device is renamed, reuse the existing local DeviceManager; otherwise keep the old DeviceManager around so that previously created Tensors remain usable.
PiperOrigin-RevId: 312489501
Change-Id: Id392d0324aba7e7f9e92f8efeaf33683157470e1
2020-05-20 08:53:52 -07:00
Haoyu Zhang
32254100c3 Look up eager client directly from target in eager cluster FLR.
PiperOrigin-RevId: 312403019
Change-Id: I05e0e4f039e2f92d404eac1fbc9561249d6c3d1f
2020-05-19 19:36:08 -07:00
TensorFlower Gardener
d4c048beee Merge pull request from fesun:remove_localhost
PiperOrigin-RevId: 312118557
Change-Id: I19620028d8c5532a6f333a638b0cca5004142ae1
2020-05-18 11:28:33 -07:00
Haoyu Zhang
dbc0fffedb Report remote target name for worker service RPCs.
PiperOrigin-RevId: 312095453
Change-Id: I73fc7948f994426b8d62bdefd5573cfe3b5b793d
2020-05-18 09:43:13 -07:00
Haoyu Zhang
c61bc6a4f3 Support cancellation in multi-device and distributed function execution.
When executing a multi-device or distributed function, the failure of one component function can cause other component functions to hang due to dependencies (e.g., they are waiting to receive tensors from the failed component function). Such hangs are often hard to debug, especially with a large number of workers.

This change cancels local and remote component functions in multi-device function execution if one component function fails, by cancelling the function rendezvous and the component function execution request RPCs. Since the cancelled errors are marked as derived, the original failure error message will be reported to users.

PiperOrigin-RevId: 311805431
Change-Id: I2f0b819e2b0a228fdeb242361b41ef4cadc7e3d2
2020-05-15 14:53:58 -07:00
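The cancellation behaviour described in c61bc6a4f3 can be illustrated with a minimal Python sketch. It is not TensorFlow code: `run_components`, `Cancelled`, and the shared `threading.Event` are hypothetical stand-ins for the function rendezvous and the RPC cancel signal. It only shows the idea that one failure wakes up the waiting components, and that the original error is reported in preference to the derived cancellation errors.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Cancelled(Exception):
    """Derived error: raised by components that stop because of a cancellation."""


def run_components(components):
    """Run all component callables; cancel the rest as soon as one fails.

    Each component receives a shared threading.Event and is expected to
    wait on it instead of blocking forever on a peer that may have failed.
    Returns the original (non-derived) error, or None on success.
    """
    cancel_event = threading.Event()
    errors = []
    lock = threading.Lock()

    def wrap(fn):
        try:
            fn(cancel_event)
        except Exception as exc:
            with lock:
                errors.append(exc)
            cancel_event.set()  # wake up every component waiting on a peer

    # The pool's context manager waits for all submitted components to finish.
    with ThreadPoolExecutor(max_workers=max(1, len(components))) as pool:
        for fn in components:
            pool.submit(wrap, fn)

    # Prefer the root cause over derived cancellation errors.
    original = [e for e in errors if not isinstance(e, Cancelled)]
    if original:
        return original[0]
    return errors[0] if errors else None


def failing(cancel_event):
    raise ValueError("component 0 failed")


def waiting(cancel_event):
    # Stands in for a component blocked on a tensor from the failed peer.
    if cancel_event.wait(timeout=5.0):
        raise Cancelled("cancelled while waiting for input")
    raise TimeoutError("never cancelled")


result = run_components([failing, waiting])
```

Here `result` is the `ValueError` raised by the failing component, not the `Cancelled` error raised by the component that was woken up.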
Haoyu Zhang
d5e0f468cd Report remote target in error messages for gRPC eager service requests.
PiperOrigin-RevId: 311634462
Change-Id: Ib0550c172e419ea17dac9ffa28c18b9e1a03b3cc
2020-05-14 17:09:29 -07:00
Yujing Zhang
0ac3572e8d Make SerializeRemoteTensorHandle block only when the remote op is a function, in order to still benefit from async execution.
PiperOrigin-RevId: 311423473
Change-Id: I87a3973ddf1954facb69c14499ce2fa07a9d6e99
2020-05-13 16:10:53 -07:00
Yujing Zhang
8588e0aab8 Support running a remote function with packed input handles.
- Support copying a packed TensorHandle from a client to a remote worker.

PiperOrigin-RevId: 311404609
Change-Id: Iadf2c7793dc3631f7be05de611d059733bbfdd63
2020-05-13 14:31:22 -07:00
Yujing Zhang
7e6ea21148 Support running a function with packed input handles through C APIs.
Introduce a C API TFE_CreatePackedTensorHandle which creates a TFE_TensorHandle referring to multiple TFE_TensorHandles.

PiperOrigin-RevId: 310610230
Change-Id: Icc0ffd5c58ad7780eca38d552c1a2f4617f04891
2020-05-08 12:53:55 -07:00
TensorFlower Gardener
7cfa63117f Merge pull request from liutongxuan:features/reuse_port_option
PiperOrigin-RevId: 310392434
Change-Id: Ia7304812ec321085dc0a9f2fd408d386a0244021
2020-05-07 10:50:13 -07:00
Yujing Zhang
b5b150f79c Fix an out-of-order execution issue. Don't serialize a remote input handle for function execution until it's ready on a remote device. Otherwise, on a remote worker, a remote function execution request could be enqueued before the request that produces a function input.
PiperOrigin-RevId: 310253012
Change-Id: I20e649494ec27f4bd581798d2ed458453f75d30f
2020-05-06 16:47:48 -07:00
tongxuan.ltx
fe3c1035c2 Check return status of reading environment variable
2020-05-06 12:08:11 +08:00
Frank Chen
a5efcd5d0c Update criteria for TPU job/worker experiments
PiperOrigin-RevId: 309946989
Change-Id: I5d2fe6d20ead6f0b2ee9dec96c09c4fc62ab60f7
2020-05-05 08:08:48 -07:00
Yujing Zhang
8fcb130e92 Support packed tensor inputs in ProcessFunctionLibraryRuntime.
- Expand the packed _Arg nodes when the graph is ready for graph partition.
- Introduce an optional sub-index to function Arg nodes, in order to distinguish between two arguments with the same "index". This happens after a packed _Arg node that is assigned to a CompositeDevice is replaced with multiple replica nodes (one per device).

The "index" of an _Arg node is unique before expanding it. It's also unique within each subgraph after graph partition.

PiperOrigin-RevId: 309781835
Change-Id: Ic6e351f45b7523288b5dae30997ddf0dae86660b
2020-05-04 11:18:39 -07:00
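The sub-index idea from 8fcb130e92 can be sketched in a few lines of Python. This is an illustration only, not the ProcessFunctionLibraryRuntime implementation: `ArgNode`, `expand_packed_args`, and the device names are hypothetical. The sketch assumes that a packed argument is expanded into one replica per underlying device and that `(index, sub_index)` is what keeps the replicas distinguishable.

```python
from typing import Dict, List, NamedTuple, Optional


class ArgNode(NamedTuple):
    index: int                       # position of the argument in the function
    device: str                      # device the node is assigned to
    sub_index: Optional[int] = None  # set only on replicas of a packed argument


def expand_packed_args(args: List[ArgNode],
                       composite_devices: Dict[str, List[str]]) -> List[ArgNode]:
    """Replace each arg placed on a composite device with one replica per
    underlying device, keeping "index" and adding a distinguishing sub_index."""
    expanded = []
    for arg in args:
        underlying = composite_devices.get(arg.device)
        if underlying is None:
            expanded.append(arg)  # ordinary argument, left untouched
            continue
        for sub_index, device in enumerate(underlying):
            expanded.append(ArgNode(arg.index, device, sub_index))
    return expanded


# Hypothetical composite device made of two physical devices.
composite = {"COMPOSITE:0": ["GPU:0", "GPU:1"]}
args = [ArgNode(0, "CPU:0"), ArgNode(1, "COMPOSITE:0")]
expanded = expand_packed_args(args, composite)
keys = [(a.index, a.sub_index) for a in expanded]
```

After expansion the two replicas share index 1 but carry sub-indices 0 and 1, so the pair `(index, sub_index)` stays unique even though "index" alone no longer is.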
Derek Murray
000c8f09ea [Build cleanup] Update #includes of moved header "graph/graph_constructor.h".
This change modifies these includes to point to "tensorflow/core/common_runtime/graph_constructor.h" instead. This change will enable us to remove the accidental dependency from //tensorflow/core/graph to //tensorflow/core/common_runtime.

PiperOrigin-RevId: 309035649
Change-Id: I2af0fdd6a6ccc4ae8d351a9117a69b6fc80c22e9
2020-04-29 09:20:48 -07:00
Fei Sun
3a8b6ba5c1 Edit according to PR comments
2020-04-29 10:35:01 +08:00
Haoyu Zhang
448f351cfe Introduce non-blocking component function execution.
The current implementation of KernelAndDeviceFunc::Run ties up a thread while the function execution is pending. This leads to distributed deadlock issues in large-scale parameter server training when the number of workers exceeds the thread pool size.

This change leverages the RunComponentFunction request to execute component functions in a non-blocking manner. By not tying up a thread per pending function, it removes the constraint on the number of component functions that can execute concurrently.

PiperOrigin-RevId: 308939721
Change-Id: I086f9ee587c4df76b303158f27c362a9bcb8314c
2020-04-28 18:41:46 -07:00
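The difference between blocking and non-blocking execution in 448f351cfe can be sketched with Python's standard thread pool. This is a hypothetical illustration, not the KernelAndDeviceFunc or RunComponentFunction code: `run_component_async` and `run_function_async` are made-up names. The point is only that the caller registers a callback instead of parking a thread per pending component, so three components complete on a pool with a single thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_component_async(pool, component, done):
    """Start `component` and return immediately.

    `done(result)` is invoked when the component finishes (normally on the
    pool thread that ran it). Error handling is omitted in this sketch.
    """
    future = pool.submit(component)
    future.add_done_callback(lambda f: done(f.result()))


def run_function_async(pool, components, done):
    """Run all components without blocking a thread per pending component."""
    results = [None] * len(components)
    remaining = [len(components)]
    lock = threading.Lock()

    def make_callback(i):
        def on_done(value):
            with lock:
                results[i] = value
                remaining[0] -= 1
                finished = remaining[0] == 0
            if finished:
                done(results)  # last component to finish reports the result
        return on_done

    for i, component in enumerate(components):
        run_component_async(pool, component, make_callback(i))


# More components than pool threads: nothing waits on a thread it cannot get.
pool = ThreadPoolExecutor(max_workers=1)
all_done = threading.Event()
output = []


def on_all_done(results):
    output.extend(results)
    all_done.set()


run_function_async(pool, [lambda: 1, lambda: 2, lambda: 3], on_all_done)
completed = all_done.wait(timeout=5.0)
pool.shutdown()
```

The call to `run_function_async` returns before the components have run; the results arrive through `on_all_done` in component order.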
Yujing Zhang
4430ba27bb Introduce PackedTensorHandleData to TensorHandle. A PackedTensorHandleData refers to a list of TensorHandles of the same dtype and shape.
PiperOrigin-RevId: 308702161
Change-Id: Ide047f4cde1c17e7be9e0d64f78f499a022a430e
2020-04-27 14:47:57 -07:00
Haoyu Zhang
47995cbaf7 Non-functional change in preparation for introducing non-blocking function execution.
Modularize components in eager execute and eager service implementation. They will be shared and reused in the upcoming async code path.

PiperOrigin-RevId: 308347746
Change-Id: Ida9cca1a1a88d3e6509c61950f4eaa4f18dbe864
2020-04-24 16:28:01 -07:00
Haoyu Zhang
0713f9aede Remove the necessity of acquiring context_update_mu_ in cluster update.
After this change, we no longer guarantee that an enqueue request does not overlap with a cluster update request. This means that a function execution can see inconsistent cluster views (e.g., some parts using the old server def and some using the new one). Functions executed in parallel with or shortly after cluster update requests can potentially fail with connection errors due to this inconsistency if they span remote targets that are being updated. To prevent this race condition, the client needs to coordinate and avoid executing remote functions and updating the cluster concurrently.

PiperOrigin-RevId: 308342265
Change-Id: I57fd5f3ec9acd1da906d2718d7b61398156bb4f4
2020-04-24 15:53:28 -07:00
Haoyu Zhang
70a0ea2da3 Avoid re-creating cluster FLR and process FLR when updating worker context.
PiperOrigin-RevId: 308153596
Change-Id: I2f8bc377213855fd5eeb6307fd1a2e74d26f4140
2020-04-23 16:34:37 -07:00
tongxuan.ltx
231edfa418 fix build break
2020-04-23 10:13:45 +08:00
Haoyu Zhang
ffac1f3e0d Fix two issues in remote tensor handle.
* When setting TensorHandle remote shape, it currently uses context_view_id to index mirrors, but should not use it to check when setting the shape of RemoteTensorHandleData itself.
* If a tensor is ready because it was previously poisoned, return the original error instead of the less useful error "SetShape is only called on non-ready handles".

PiperOrigin-RevId: 307633262
Change-Id: I22436402c6beeb41731802060b59851e807627d9
2020-04-21 10:40:44 -07:00
A. Unique TensorFlower
73b1648c8e Make static const class/struct members constexpr
PiperOrigin-RevId: 307458656
Change-Id: I4705d18ec0f2be8e306de6484f8a9b09df0a2beb
2020-04-20 12:59:32 -07:00
tongxuan.ltx
bbe13474e7 fix typo
2020-04-20 23:12:36 +08:00
tongxuan.ltx
a4d22c8d01 Support an option (environment variable) to enable gRPC port reuse.
ReusePort scenario: a parent process occupies the port and shares it through a service such as ZooKeeper; a child process (the TensorFlow process) then reuses the port.
2020-04-20 19:16:52 +08:00
Fei Sun
966ed1cafc Use provided host name/ip instead of localhost if possible
2020-04-16 18:41:20 +08:00
Gaurav Jain
f290fd7f48 Clean up some methods in EagerOperation
Rename AddInput and make it, along with a few other methods, private.

PiperOrigin-RevId: 306745935
Change-Id: I1e333419552a28e96755bb249448974ba6a49eb7
2020-04-15 16:49:51 -07:00
Derek Murray
709f61ee06 [Cleanup] Remove unused method RendezvousMgrInterface::CleanupAll().
PiperOrigin-RevId: 306271233
Change-Id: I2cc278deab6bdb9f70d5bb9a7e8678d2969220bd
2020-04-13 11:26:06 -07:00
A. Unique TensorFlower
b1fe37760f FlatMap seems to have poorer performance than absl::flat_hash_map. Replacing its
use with the latter since lookups / deletions from the rendezvous table block globally.

PiperOrigin-RevId: 306252615
Change-Id: I07e4402436ca75801804a578e3253ab5ff4ab762
2020-04-13 10:01:32 -07:00
Haoyu Zhang
67397a2ad6 Unify KernelAndDevice::Run API to optionally take step_container.
PiperOrigin-RevId: 305900352
Change-Id: Ia6c0b14a6de87ebc7f3663bf1205bf65d267612d
2020-04-10 10:35:22 -07:00
Yujing Zhang
63b323acd2 [Cleanup] Remove redundant inputs ref/unref during execution, since inputs are not set during instantiation.
PiperOrigin-RevId: 305897768
Change-Id: Ia1a697e198287aeadeb643cec7f11b918dd99bb4
2020-04-10 10:23:02 -07:00
Cesar Crusius
747c37add5 Make the VariantDevice type visible at the same level as CustomDevice.
In practice this is already done by expanding its absl::variant<>
definition in a handful of places. By making the type visible we can
properly account for its usage.

PiperOrigin-RevId: 305760610
Change-Id: I95d65461ebb70c2d4e33eb59985b01d6cb18554e
2020-04-09 14:31:48 -07:00
A. Unique TensorFlower
27058058e3 Add annotations for memory region type, tensor data type and shape.
PiperOrigin-RevId: 305585689
Change-Id: I6fec53e29afa0f91e99351cc50d3d9128241d173
2020-04-08 17:14:43 -07:00
Yujing Zhang
8424ef8160 Merge EagerPFLR and PFLR.
PiperOrigin-RevId: 305145252
Change-Id: Ie6f14c5a1af5335fefd21fbe107dc2c946dd7d66
2020-04-06 16:46:20 -07:00
Yujing Zhang
67570827b3 Deprecate inputs in Operation, since op_inputs has been in use.
PiperOrigin-RevId: 304728711
Change-Id: I386264fdf07b6a76910a03a23f9bd04ca7e63a35
2020-04-03 18:08:47 -07:00