Commit Graph

1463 Commits

Andrew Audibert
3d28cdc603 [tf.data service] Improve cancellation for tf.data service requests.
1. If a DataServiceDataset iterator is cancelled, it will now call TryCancel on its outstanding RPCs.
2. As a result, we can reduce the frequency of returning from blocked round-robin requests to check whether the iterator is cancelled. This may avoid delays in GetNext() that could happen if one consumer reads from a round earlier than others, and needs to perform multiple retries with exponential backoff.
3. Because of (2), server shutdown may take up to 1 minute if a round-robin request is blocked waiting for other consumers. To prevent slow unit tests, certain tests store their servers globally so that they are destroyed immediately at process exit without waiting for their outstanding RPCs to finish.

Running data_service_ops_test.py locally, this CL reduces the running time from 27 seconds to 20 seconds.

PiperOrigin-RevId: 351825888
Change-Id: Iba20a456bdabf251d03b94f090fe760616d3da4d
2021-01-14 10:26:18 -08:00
Andrew Audibert
2cafdafc93 Improve xprof tracing for data service requests.
PiperOrigin-RevId: 351681153
Change-Id: I42b542851f788995165b82e4422d23b707cc0454
2021-01-13 15:53:27 -08:00
Andrew Audibert
706350f023 [tf.data service] Implement round-robin reads.
This enables a new mode of reading from the tf.data service, where consumers read from tasks in a coordinated fashion instead of the usual first-come, first-served order.

The main use case for this is coordinated bucketization for synchronous training, where we want to ensure that at each step consumers get batches with elements of similar sizes. This mitigates the inefficiency of some consumers slowly training on large examples while others quickly train on small examples and then block waiting for the slower ones to finish.

When `consumer_index` and `num_consumers` are specified to `distribute`, each task will enforce a strict round-robin order, where its first element goes to consumer 0, second element to consumer 1, and so on. This requires that all consumers consume the same number of elements.
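
A rough sketch of the intended usage (the dispatcher address and job name below are hypothetical; each consumer passes its own index plus the shared consumer count to `distribute`):

    import tensorflow as tf

    def make_consumer(consumer_index, num_consumers):
        # Round-robin reads need an infinite dataset so no round ends early
        # for a subset of consumers.
        ds = tf.data.Dataset.range(100).repeat()
        return ds.apply(tf.data.experimental.service.distribute(
            processing_mode="parallel_epochs",
            service="grpc://dispatcher:5050",   # hypothetical dispatcher address
            job_name="shared_job",              # all consumers must share one job
            consumer_index=consumer_index,      # this consumer's slot in each round
            num_consumers=num_consumers))       # total consumers per round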

PiperOrigin-RevId: 351625063
Change-Id: I9b400f55ad61406cb125af8225096e7ff5dc4b0c
2021-01-13 11:29:05 -08:00
TensorFlower Gardener
692d06a0f3 Merge pull request from kvignesh1420:iterator-state
PiperOrigin-RevId: 351423787
Change-Id: If07e63a43b7d25bc2c6f1a7cc112734c3c32e1a8
2021-01-12 13:06:49 -08:00
Jay Shi
7306a4392a [tf.data] Rolls out the map_parallelization optimization as an experiment to 5% of Borg jobs.
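
For context, the experiment flips the same rewrite users can opt into explicitly; a minimal sketch using the public tf.data.Options knob:

    import tensorflow as tf

    options = tf.data.Options()
    options.experimental_optimization.map_parallelization = True  # opt in explicitly

    # A stateless map like this is eligible to be rewritten into a parallel map.
    ds = tf.data.Dataset.range(10).map(lambda x: x * 2).with_options(options)
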
PiperOrigin-RevId: 351218900
Change-Id: Iad9c50673c49d6cd136ad7c1826ffe59035c877e
2021-01-11 13:14:20 -08:00
Vignesh Kothapalli
ef2a6a68c8
handle name conflicts 2021-01-12 02:28:10 +05:30
Vignesh Kothapalli
d86b3271d6
convert accessors to lower-case 2021-01-12 02:05:34 +05:30
Vignesh Kothapalli
fa75f1142e
pass references instead of copies 2021-01-07 03:50:16 +05:30
Vignesh Kothapalli
45ed98546c
bump to trigger ci jobs 2021-01-07 03:06:55 +05:30
Vignesh Kothapalli
ac37981874
[tf.data] use a Class for IteratorResource State 2021-01-07 02:39:54 +05:30
Jay Shi
242f42a186 [tf.data] Rolls out the map_parallelization optimization as an experiment to 1% of Borg jobs.
PiperOrigin-RevId: 350163563
Change-Id: I5e842ad12d10eca205a2e50027e96965197203ce
2021-01-05 10:18:55 -08:00
A. Unique TensorFlower
4ef33e3c38 Internal change
PiperOrigin-RevId: 349608848
Change-Id: Idb0733edbdedc55ea588875f2f542c3f22af2954
2020-12-30 15:34:17 -08:00
Andrew Audibert
7b31eb3cf8 [tf.data] Try many times before giving up on empty repeat().
The dataset `Dataset.range(0).repeat()` could previously loop forever, because
`repeat()` will keep re-instantiating `range(0)` to look for the next element.
We avoid this by detecting when the input to `repeat()` is empty and returning
early from `repeat()`. However, we may return *too* early if the input to
`repeat` has a chance of being nonempty. For example, consider
`Dataset.range(2).shuffle(2).take(1).filter(lambda x: x == 0).repeat()`. It
should produce {0, 0, 0, ...}, but would actually produce a random number of
zeros before reporting end of sequence. This change increases the number of
empty sequences we need to see before giving up: we now only exit after
seeing 10000 consecutive empty sequences. Iterating over an empty sequence
takes on the order of microseconds, so we still exit quickly in cases
where the input is truly empty.
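
A small sketch of the example above (eager mode):

    import tensorflow as tf

    # Empty on some shuffle orders, nonempty on others, so repeat() must not
    # give up after seeing a single empty pass over its input.
    ds = tf.data.Dataset.range(2).shuffle(2).take(1).filter(lambda x: x == 0)
    ds = ds.repeat()

    for value in ds.take(5):
        print(value.numpy())  # prints 0 five times; never end-of-sequence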

PiperOrigin-RevId: 349596829
Change-Id: I588f8abf6cae3a4bec616cc43085e109687bc86c
2020-12-30 13:44:42 -08:00
Andrew Audibert
59f5abfbc8 [tf.data] Support asserting infinite cardinality.
Previously using tf.data.experimental.assert_cardinality(tf.data.experimental.INFINITE_CARDINALITY) would cause the assertion to fail as soon as the first dataset element was produced, even if the dataset actually was infinite. After this CL, we will only raise an error if the dataset runs out of elements.

Fixes https://github.com/tensorflow/tensorflow/issues/45894
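
A minimal sketch of the fixed behavior:

    import tensorflow as tf

    ds = tf.data.Dataset.range(1).repeat()  # genuinely infinite
    ds = ds.apply(tf.data.experimental.assert_cardinality(
        tf.data.experimental.INFINITE_CARDINALITY))

    # Previously this loop failed on the first element; now an error is
    # raised only if the dataset unexpectedly reaches end of sequence.
    for _ in ds.take(3):
        pass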

PiperOrigin-RevId: 349321521
Change-Id: I54804225da55f49cef4fa69e498a239854d16e22
2020-12-28 13:09:55 -08:00
Andrew Audibert
c5c8ad8892 [tf.data] Record element sizes in IteratorGetNext xprof traces.
PiperOrigin-RevId: 348100471
Change-Id: Ic534eb373936f9d610e36d3ce1028da549450de1
2020-12-17 15:09:08 -08:00
Andrew Audibert
45370f0d3b [tf.data] Fix segfault in parallel_map
This CL addresses a race condition where parallel map could segfault when intra-op parallelism is disabled.

The race can happen if an iterator is destroyed while there are outstanding map function calls. Normally the destructor will block until all outstanding calls are finished. But before this CL, `CallFunction` might call `done()` too early, decrementing `num_calls_` and allowing the destructor to run before `CallFunction`'s final calls to RecordStop/RecordStart.
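
The invariant is easier to see in a toy sketch (Python stands in for the C++ kernel here; the names mirror the description above, not TF's actual code):

    import threading

    class ParallelMapLike:
        """Toy model of the ordering invariant, not the real kernel."""

        def __init__(self):
            self._cv = threading.Condition()
            self._num_calls = 0

        def _call_function(self, fn, arg):
            with self._cv:
                self._num_calls += 1
            try:
                fn(arg)  # ... per-call work and bookkeeping ...
            finally:
                # Decrement only after ALL bookkeeping is finished.
                # Decrementing earlier lets close() return while this
                # thread is still touching member state.
                with self._cv:
                    self._num_calls -= 1
                    self._cv.notify_all()

        def close(self):
            with self._cv:
                while self._num_calls > 0:  # block for outstanding calls
                    self._cv.wait()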

PiperOrigin-RevId: 347747695
Change-Id: Iacf790c082dca80ef5800593c9f885bc481d78b6
2020-12-15 20:55:14 -08:00
Frank Chen
816bb157e9 Add absl::Cord support for builds with TF_CORD_SUPPORT enabled
Also fixes various bugs within TF's absl::Cord handling.

PiperOrigin-RevId: 346884244
Change-Id: I04cec023bedb5d772833614e19c766a7557bef5e
2020-12-10 16:03:38 -08:00
Jay Shi
3ce8f8f39c [tf.data] Temporarily turn off the map_parallelization experiment.
PiperOrigin-RevId: 346681247
Change-Id: Ie1382f41696a0b2e6b6a8ff19e3bde2a31396bbf
2020-12-09 18:19:38 -08:00
Jay Shi
a0a365e408 [tf.data] Add vlog to record how much time each optimization step takes.
PiperOrigin-RevId: 345557199
Change-Id: I54f8f549b10e98ac605905853505e828e9502f45
2020-12-03 16:15:55 -08:00
TensorFlower Gardener
61e1d72034 Merge pull request from linux-on-ibm-z:s390x-fix-10-28
PiperOrigin-RevId: 345267871
Change-Id: I12b3ca3fa5cdd49f3725a0f456c9ae9c9d4b32e7
2020-12-02 11:08:06 -08:00
Andrew Audibert
cde6ac1b83 [tf.data] Set cardinality to UNKNOWN when using ignore_errors transformation.
ignore_errors may cause some elements to be dropped, so we no longer know the exact cardinality.
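
A minimal sketch of the observable effect:

    import tensorflow as tf

    ds = tf.data.Dataset.range(5)
    print(tf.data.experimental.cardinality(ds).numpy())  # 5

    ds = ds.apply(tf.data.experimental.ignore_errors())
    # Elements may now be silently dropped, so the exact count is unknowable.
    print(tf.data.experimental.cardinality(ds).numpy())  # -2 (UNKNOWN_CARDINALITY)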

PiperOrigin-RevId: 344875178
Change-Id: I98beadcf8322d5eb23da14192cfeb83e1303e36a
2020-11-30 13:30:50 -08:00
Andrew Audibert
31160b7c21 [tf.data service] Combine CreateJob and GetOrCreateJob into one RPC.
This avoids error-prone duplication of logic.

PiperOrigin-RevId: 343883195
Change-Id: Ib02a2ae7751502493e3d37b9c2cdad2cf89bd2b1
2020-11-23 11:05:20 -08:00
Jay Shi
e2de99a4e3 [tf.data] Increase the rollout percentage of the map_parallelization optimization to 100%.
PiperOrigin-RevId: 343870179
Change-Id: I0fff3616e045149a883c3c9eb6538400ba2f7a4c
2020-11-23 09:55:42 -08:00
Andrew Audibert
2e7a684d3f [tf.data] Implement InputDatasets for DirectedInterleaveDataset
PiperOrigin-RevId: 343367125
Change-Id: I56de1e4a427cd3d607bbaa6131a8a7d104fe3310
2020-11-19 14:28:31 -08:00
Andrew Audibert
525b147a41 [tf.data service] Relax client-side timeout.
Previously we would time out if we repeatedly failed to connect to the dispatcher for one hour, or if we failed to connect to any individual worker for one hour. This CL changes the logic to retry indefinitely. This prevents us from raising an error if just one worker can't be contacted.
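
Roughly, the client now behaves like the following sketch (illustrative only, not TF's actual client code):

    import time

    def call_with_indefinite_retry(rpc, initial_delay=0.1, max_delay=10.0):
        # Retry forever with capped exponential backoff instead of giving
        # up after a fixed one-hour deadline.
        delay = initial_delay
        while True:
            try:
                return rpc()
            except ConnectionError:
                time.sleep(delay)
                delay = min(delay * 2, max_delay)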

PiperOrigin-RevId: 343363558
Change-Id: I3ade1d057ad86e5857b1bfca328bbcdc7511f906
2020-11-19 14:12:43 -08:00
Amit Patankar
cdae1a1ee0 Move the data/experimental proto files into the top-level proto directory. This effort is part of the build file and package cleanup.
PiperOrigin-RevId: 342975430
Change-Id: I009ef3bc1d963a8cd29e456e32b97127e545d32b
2020-11-17 17:02:55 -08:00
Andrew Audibert
abe233392e [tf.data] Check cycle length when restoring parallel interleave iterator.
If we try to restore into an iterator with a smaller cycle length than the original, it will produce a segmentation fault. This can happen either due to user error or due to the cycle_length being autotuned.

This CL is a stopgap solution to give a better error message than a segmentation fault. In the long term we aim to support adjusting the cycle_length so that autotuned cycle_length + checkpointing just works.
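
A sketch of the failure mode (hypothetical checkpoint path; eager mode):

    import tensorflow as tf

    def make_ds(cycle_length):
        return tf.data.Dataset.range(10).interleave(
            tf.data.Dataset.from_tensors, cycle_length=cycle_length)

    it = iter(make_ds(cycle_length=4))
    next(it)
    path = tf.train.Checkpoint(it=it).save("/tmp/interleave_ckpt")

    # Restoring into a smaller cycle length used to segfault; after this
    # change it surfaces a descriptive error instead.
    it2 = iter(make_ds(cycle_length=2))
    tf.train.Checkpoint(it=it2).restore(path)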

PiperOrigin-RevId: 342733442
Change-Id: Ie9869224cc1598e74e6eb00397df35e6a1a46859
2020-11-16 15:36:27 -08:00
Andrew Audibert
52480d12ee Use TF_ prefix for thread annotations in tf.data
PiperOrigin-RevId: 342707406
Change-Id: Ia1a88dc6b7beac73ab76f6b9385b6f2aafff8f87
2020-11-16 13:37:15 -08:00
Andrew Audibert
1d1ad47792 Prefix thread annotations with ABSL_.
PiperOrigin-RevId: 342696677
Change-Id: I8923e0d6aae1a29d8674e1196339e8e224baf25c
2020-11-16 13:06:26 -08:00
A. Unique TensorFlower
e7165535db Prefix thread annotations with ABSL_.
PiperOrigin-RevId: 342683326
Change-Id: I270a31085e8613f269eabd0144f3fb31a49a6308
2020-11-16 12:00:49 -08:00
TensorFlower Gardener
66efcc65fa Merge pull request from mkuchnik:mkuchnik_fix_tfrecord_patch
PiperOrigin-RevId: 342681111
Change-Id: I56f191f5d73ab70ac44e1c81572b40e3033cfc5b
2020-11-16 11:56:52 -08:00
Jay Shi
7b637feb1d [tf.data] Increase the rollout percentage of the map_parallelization optimization to 50%.
PiperOrigin-RevId: 342653375
Change-Id: I56c9bbb5bfae4d8cc8fba3d06641c987e2db6cf3
2020-11-16 09:35:08 -08:00
Jay Shi
6c838fcbf8 [tf.data] Modify the autotune_buffer_sizes optimization so that it injects autotuned PrefetchDatasets after non-prefetched asynchronous datasets. The optimization also rewrites existing non-autotuned PrefetchDatasets into autotuned ones, using a fixed starting value and a minimal value for the tunable buffer_size parameter.
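
Roughly, eligible pipelines end up equivalent to this hand-written version (a sketch; the rewriter operates on the dataset graph rather than emitting Python):

    import tensorflow as tf

    AUTOTUNE = tf.data.experimental.AUTOTUNE

    ds = tf.data.Dataset.range(1000)
    ds = ds.map(lambda x: x + 1, num_parallel_calls=AUTOTUNE)  # asynchronous
    ds = ds.prefetch(AUTOTUNE)  # injected prefetch with tunable buffer_size
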
PiperOrigin-RevId: 342347611
Change-Id: I1d4fe8f00944595a6fb9b5b99a1679a493b32edf
2020-11-13 15:22:13 -08:00
Jiri Simsa
d7cfe3c6f5 [tf.data] Adding TraceMe metadata for the Model transformation.
PiperOrigin-RevId: 342124378
Change-Id: I5c43ab69ab8a744c77b6777ebfee1fdc27beb452
2020-11-12 13:59:11 -08:00
Jay Shi
4cd80bd641 [tf.data] Temporarily turn off the enable_gradient_descent experiment.
PiperOrigin-RevId: 341903198
Change-Id: Ic015cc29b9d54a7270bef86b9c0e96035b5b1b14
2020-11-11 14:15:51 -08:00
Michael Kuchnik
da50078b0b Fix TFRecord uncompressed test cases 2020-11-11 15:23:11 -05:00
Ruoxin Sang
489074cda4 Preserving infinite cardinality information for the tf.data.experimental.sample_from_datasets transformation.
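
A sketch of the effect, assuming infinite-cardinality inputs are what the inference keys on:

    import tensorflow as tf

    a = tf.data.Dataset.range(3).repeat()  # infinite
    b = tf.data.Dataset.range(5).repeat()  # infinite

    mixed = tf.data.experimental.sample_from_datasets([a, b], weights=[0.5, 0.5])
    # With every input infinite, the result can report INFINITE (-1)
    # instead of UNKNOWN (-2).
    print(tf.data.experimental.cardinality(mixed).numpy())
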
PiperOrigin-RevId: 341768470
Change-Id: If6f379472834640e35146116987a9348efb662f6
2020-11-10 23:03:43 -08:00
Jay Shi
b3d45cd17c [tf.data] Apply gradient descent method as default algorithm for autotuning optimization.
PiperOrigin-RevId: 341499875
Change-Id: Ie2eab5ed5e85e0c9afac1fb5b612057e51bd0e12
2020-11-09 16:04:32 -08:00
Jay Shi
1afab85484 [tf.data] Apply gradient descent method as default algorithm for autotuning optimization.
PiperOrigin-RevId: 341444332
Change-Id: I359e8269166d5e8e89514f5fe8e53f0733dab456
2020-11-09 11:21:25 -08:00
Jay Shi
d0d33f6d04 [tf.data] Increase the rollout percentage of the map_parallelization optimization to 20%.
PiperOrigin-RevId: 341436467
Change-Id: Ieeefda39f9896d7c34f089c9e6eccf07b904d222
2020-11-09 10:50:40 -08:00
Reed Wanderman-Milne
5e33e7418e Fix issue preventing debug mode from working.
The issue was that some constants were declared but not defined. When optimizations were enabled, the constants would be inlined, which is why this wasn't causing issues in opt mode. @timshen91 helped me debug this issue.

This was reproducible by running the following command to build the pip package, then installing the pip package and running `import tensorflow as tf`:

    bazel build --config=dbg //tensorflow/tools/pip_package:build_pip_package

The error was:

    ImportError: /home/reedwm/venvs/tf_dbg/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow4data12experimental19SnapshotDatasetV2Op5kHashE

PiperOrigin-RevId: 341209683
Change-Id: If71c6ca529bd87cd1a589422440454548165e813
2020-11-07 09:48:51 -08:00
Ruoxin Sang
5ab2ad101e Preserving infinite cardinality information for the dataset.interleave transformation.
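
A sketch of the analogous effect for interleave (assuming an infinite input is what drives the inference):

    import tensorflow as tf

    ds = tf.data.Dataset.range(3).repeat()  # infinite input
    ds = ds.interleave(tf.data.Dataset.from_tensors, cycle_length=2)

    # An infinite input implies an infinite result, so cardinality can
    # report INFINITE (-1) rather than UNKNOWN (-2).
    print(tf.data.experimental.cardinality(ds).numpy())
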
PiperOrigin-RevId: 341064372
Change-Id: I628f362366d52286118753e7a64a334204daf739
2020-11-06 09:42:23 -08:00
TensorFlower Gardener
609bad4d8b Merge pull request from mkuchnik:executor_based_runsync_metrics
PiperOrigin-RevId: 341011613
Change-Id: Ic4e8ec1beeae76fe9b6ba2ece93a1a27c38004a4
2020-11-06 02:10:14 -08:00
Andrew Audibert
1d2800b2f8 [tf.data service] Add additional traceme information.
PiperOrigin-RevId: 340996118
Change-Id: I093f7afdfad644d0cf114da0f211df3e3c875d64
2020-11-05 23:32:56 -08:00
Michael Kuchnik
68ddda9cf6 Add missing out_iterator argument 2020-11-05 13:08:14 -05:00
Michael Kuchnik
a5ac7edf39 Add overload to MakeIteratorFromInputElement 2020-11-03 18:04:55 -05:00
Andrew Audibert
cbc7f31b7d [tf.data] Determine snapshot hashes before optimization.
This fixes two issues:
1) We would hash the AUTO compression type into the snapshot hash. We should instead resolve AUTO before including it in the snapshot hash, so that we can share snapshots based on their actual compression type.
2) Snapshot hashes could previously be determined either before or after dataset graph optimization. This can be a problem if dataset graph optimization changes the graph, since we don't always optimize a dataset before iterating over it (e.g. if the dataset was produced by a `flat_map` or `interleave` function). To address this, we will now always use the hash of the pre-optimized input dataset. This has an additional benefit of avoiding issues with optimizations being potentially non-deterministic.

This issue was originally raised by a user in https://github.com/tensorflow/tensorflow/issues/44278.
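
For reference, the user-facing API in question (the path below is hypothetical):

    import tensorflow as tf

    ds = tf.data.Dataset.range(100)
    # With AUTO, the concrete compression is chosen at runtime; hashing the
    # resolved choice rather than the literal "AUTO" lets runs that end up
    # with the same compression share snapshot data.
    ds = ds.apply(tf.data.experimental.snapshot("/tmp/snap", compression="AUTO"))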

PiperOrigin-RevId: 340490555
Change-Id: Iab6fb39a9ff94b7857061adec551d6813ba9b8f9
2020-11-03 11:59:47 -08:00
Michael Kuchnik
a2ac69be20 Add modeling support for MakeIteratorFromInputElement 2020-11-03 14:15:46 -05:00
Jiri Simsa
b0140088d4 [tf.data] Various changes to iterator metrics collection.
This CL:
- adds support for collecting the aggregate time a tf.data iterator spends actively servicing `GetNext` requests (accounting for concurrent requests)
- adds support for collecting the "lifetime" of a tf.data iterator, that is, the time between receiving the first `GetNext` request and servicing the last `GetNext` request
- removes support for collecting the time between subsequent calls to IteratorGetNextOp

PiperOrigin-RevId: 340474712
Change-Id: Icdfd35c46623160e9faacf1af69f897af88049f6
2020-11-03 10:40:09 -08:00
Michael Kuchnik
26aea76751 Replace default args with function overload 2020-11-03 13:31:08 -05:00