1. If a DataServiceDataset iterator is cancelled, it will now call TryCancel on its outstanding RPCs.
2. As a result, we can reduce the frequency of returning from blocked round-robin requests to check whether the iterator is cancelled. This may avoid delays in GetNext() that could happen if one consumer reads from a round earlier than others, and needs to perform multiple retries with exponential backoff.
3. Because of (2), server shutdown may take up to 1 minute if a round-robin request is blocked waiting for other consumers. To prevent slow unit tests, certain tests store their servers globally so that they are destroyed immediately at process exit without waiting for their outstanding RPCs to finish.
When running data_service_ops_test.py locally, this CL reduces the test time from 27 seconds to 20 seconds.
PiperOrigin-RevId: 351825888
Change-Id: Iba20a456bdabf251d03b94f090fe760616d3da4d
This enables a new mode of reading from the tf.data service, where consumers read from tasks in a coordinated fashion instead of the normal first-come, first-served order.
The main use case for this is coordinated bucketization for synchronous training, where we want to ensure that at each step consumers get batches with elements of similar sizes. This mitigates the inefficiency of some consumers training slowly on large examples while others train quickly on small examples and then block waiting for the slower examples to be processed.
When `consumer_index` and `num_consumers` are specified to `distribute`, each task will enforce a strict round-robin order, where its first element goes to consumer 0, second element to consumer 1, and so on. This requires that all consumers consume the same number of elements.
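A minimal usage sketch (the dispatcher address, job name, and processing mode below are placeholders; each of the two consumers would pass its own consumer_index):

import tensorflow as tf

dataset = tf.data.Dataset.range(100).repeat()
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",  # illustrative choice
    service="grpc://dispatcher_address:5000",  # placeholder dispatcher address
    job_name="shared_job",  # all coordinated consumers must share one job
    consumer_index=0,  # this consumer's slot in the round-robin order
    num_consumers=2))  # total number of coordinated consumers
for element in dataset.take(3):
  print(element.numpy())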
PiperOrigin-RevId: 351625063
Change-Id: I9b400f55ad61406cb125af8225096e7ff5dc4b0c
The dataset `Dataset.range(0).repeat()` could potentially loop forever, because
`repeat()` will keep trying to instantiate `range(0)` to get the next element.
We avoid this by detecting when the input to `repeat()` is empty, and returning
early from `repeat()`. However, we may return *too* early if the input to
`repeat` has a chance of being nonempty. For example, consider
`Dataset.range(2).shuffle(2).take(1).filter(lambda x: x == 0).repeat()`. It
should produce {0, 0, 0, ...}, but will actually produce a random number of
zeros before reporting end of sequence. This change increases the number of
empty sequences we need to see before giving up. We'll now only exit after
seeing 10000 consecutive empty sequences. Iterating over an empty sequence
takes on the order of microseconds, so we will still exit quickly in cases
where the input is truly empty.
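For reference, a runnable sketch of the example above (assuming eager execution):

import tensorflow as tf

ds = tf.data.Dataset.range(2).shuffle(2).take(1).filter(lambda x: x == 0).repeat()
# Each repetition reshuffles, so roughly half of the repetitions are empty.
# With the new limit of 10000 consecutive empty repetitions, this effectively
# produces an endless stream of zeros, as intended:
for x in ds.take(5):
  print(x.numpy())  # prints 0 five times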
PiperOrigin-RevId: 349596829
Change-Id: I588f8abf6cae3a4bec616cc43085e109687bc86c
Previously, using `tf.data.experimental.assert_cardinality(tf.data.experimental.INFINITE_CARDINALITY)` would cause the assertion to fail as soon as the first dataset element was produced, even if the dataset actually was infinite. After this CL, we only raise an error if the dataset runs out of elements.
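A small sketch of the case that previously failed (assuming eager execution):

import tensorflow as tf

ds = tf.data.Dataset.range(10).repeat()  # genuinely infinite
ds = ds.apply(tf.data.experimental.assert_cardinality(
    tf.data.experimental.INFINITE_CARDINALITY))
# Before this CL, producing the first element raised an assertion error even
# though the dataset is infinite; now an error is raised only if the dataset
# runs out of elements.
for x in ds.take(3):
  print(x.numpy())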
Fixes https://github.com/tensorflow/tensorflow/issues/45894
PiperOrigin-RevId: 349321521
Change-Id: I54804225da55f49cef4fa69e498a239854d16e22
This CL addresses a race condition where parallel map could segfault when intra-op parallelism is disabled.
The race can happen if an iterator is destroyed while there are outstanding map function calls. Normally the destructor blocks until all outstanding calls have finished. But before this CL, `CallFunction` could call `done()` too early, decrementing `num_calls_` and allowing the destructor to run before `CallFunction`'s final calls to `RecordStop`/`RecordStart`.
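A simplified Python sketch (not the actual C++ code) of the invariant the fix restores: the step that decrements the outstanding-call count and wakes the waiter must be the last thing a call does.

import threading

class OutstandingCalls:
  # Tracks in-flight calls so teardown can wait for them (illustrative only).

  def __init__(self):
    self._cv = threading.Condition()
    self._num_calls = 0

  def start_call(self, fn):
    with self._cv:
      self._num_calls += 1

    def run():
      try:
        fn()
        # Any bookkeeping that touches shared state (the analogue of
        # RecordStop/RecordStart on the iterator) must happen here, before
        # done(); doing it after done() races with teardown.
      finally:
        self._done()  # must be last: unblocks the waiter below

    threading.Thread(target=run).start()

  def _done(self):
    with self._cv:
      self._num_calls -= 1
      self._cv.notify_all()

  def wait_for_outstanding_calls(self):
    # The analogue of the iterator destructor blocking on outstanding calls.
    with self._cv:
      self._cv.wait_for(lambda: self._num_calls == 0)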
PiperOrigin-RevId: 347747695
Change-Id: Iacf790c082dca80ef5800593c9f885bc481d78b6
`ignore_errors` may cause some elements to be dropped, so we no longer know the exact cardinality of the resulting dataset.
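For example (a sketch, assuming eager execution):

import tensorflow as tf

ds = tf.data.Dataset.range(5).apply(tf.data.experimental.ignore_errors())
# The reported cardinality is now unknown
# (tf.data.experimental.UNKNOWN_CARDINALITY), since errors may drop elements.
print(tf.data.experimental.cardinality(ds))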
PiperOrigin-RevId: 344875178
Change-Id: I98beadcf8322d5eb23da14192cfeb83e1303e36a
Previously we would time out if we repeatedly failed to connect to the dispatcher for one hour, or if we failed to connect to any individual worker for one hour. This CL changes the logic to retry indefinitely. This prevents us from raising an error if just one worker can't be contacted.
PiperOrigin-RevId: 343363558
Change-Id: I3ade1d057ad86e5857b1bfca328bbcdc7511f906
If we try to restore into an iterator with a smaller cycle length than the original, it will produce a segmentation fault. This can happen either due to user error, or due to the cycle_length being autotuned.
This CL is a stopgap solution to give a better error message than a segmentation fault. In the long term we aim to support adjusting the cycle_length so that autotuned cycle_length + checkpointing just works.
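A sketch of the scenario (the use of tf.train.Checkpoint and the path below are illustrative):

import tensorflow as tf

def make_dataset(cycle_length):
  return tf.data.Dataset.range(10).interleave(
      lambda x: tf.data.Dataset.from_tensors(x).repeat(3),
      cycle_length=cycle_length)

iterator = iter(make_dataset(cycle_length=4))
next(iterator)
path = tf.train.Checkpoint(iterator=iterator).save("/tmp/iter_ckpt")  # placeholder path

# Restoring into an iterator with a smaller cycle_length used to segfault;
# with this CL it raises a clear error instead.
restored = iter(make_dataset(cycle_length=2))
tf.train.Checkpoint(iterator=restored).restore(path)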
PiperOrigin-RevId: 342733442
Change-Id: Ie9869224cc1598e74e6eb00397df35e6a1a46859
The issue was that some constants were declared but not defined. When optimizations were enabled, the constants would be inlined, which is why this wasn't causing issues in opt mode. @timshen91 helped me debug this issue.
This was reproducible by running the following command to build the pip package, then installing the pip package and running `import tensorflow as tf`:
bazel build --config=dbg //tensorflow/tools/pip_package:build_pip_package
The error was:
ImportError: /home/reedwm/venvs/tf_dbg/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow4data12experimental19SnapshotDatasetV2Op5kHashE
PiperOrigin-RevId: 341209683
Change-Id: If71c6ca529bd87cd1a589422440454548165e813
This fixes two issues:
1) We would hash the AUTO compression type into the snapshot hash. We should instead resolve AUTO before including it in the snapshot hash, so that we can share snapshots based on their actual compression type (see the sketch below).
2) Snapshot hashes could previously be determined either before or after dataset graph optimization. This can be a problem if dataset graph optimization changes the graph, since we don't always optimize a dataset before iterating over it (e.g. if the dataset was produced by a `flat_map` or `interleave` function). To address this, we will now always use the hash of the pre-optimized input dataset. This has the additional benefit of avoiding issues with optimizations being potentially non-deterministic.
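A sketch of the transformation that (1) concerns (the path is a placeholder; AUTO is the default compression):

import tensorflow as tf

dataset = tf.data.Dataset.range(100)
# With compression="AUTO", the snapshot hash now reflects the resolved
# compression type rather than the literal "AUTO" setting, so equivalent
# pipelines can share the snapshot.
dataset = dataset.apply(
    tf.data.experimental.snapshot("/tmp/snapshot_dir", compression="AUTO"))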
This issue was originally raised by a user in https://github.com/tensorflow/tensorflow/issues/44278.
PiperOrigin-RevId: 340490555
Change-Id: Iab6fb39a9ff94b7857061adec551d6813ba9b8f9
This CL:
- adds support for collecting the aggregate time a tf.data iterator spent actively servicing `GetNext` requests (accounting for concurrent requests)
- adds support for collecting the "lifetime" of a tf.data iterator, that is, the time between receiving the first `GetNext` request and servicing the last `GetNext` request
- removes support for collecting the time between consecutive calls to `IteratorGetNextOp`
PiperOrigin-RevId: 340474712
Change-Id: Icdfd35c46623160e9faacf1af69f897af88049f6