In tests, instead of using def_function we can parameterize over these two
objects to cover both tf.function and eager execution.
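For illustration, a minimal sketch of the pattern using absl's parameterized tests; the concrete combination helpers in the TF test suite may differ:

from absl.testing import parameterized
import tensorflow as tf

def run_eagerly(fn):
  # Leave the function as-is so it executes eagerly.
  return fn

class AddTest(parameterized.TestCase):

  @parameterized.named_parameters(
      ("eager", run_eagerly),
      ("tf_function", tf.function),
  )
  def test_add(self, wrap):
    fn = wrap(lambda x, y: x + y)
    self.assertEqual(3, fn(tf.constant(1), tf.constant(2)).numpy())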
PiperOrigin-RevId: 351420316
Change-Id: I037d1678ca843f6df88694981efd4519c2947cd3
All strategies are supported except for CentralStorageStrategy and ParameterServerStrategy.
This CL also removes the CompositeTensor superclass from Generator. Generator is a wrapper around tf.Variable, and because tf.Variable is not a CompositeTensor, Generator can't be a CompositeTensor in theory. Previously we made it a CompositeTensor by returning Variable.handle, but that breaks down when the variable is a DistributedVariable (in cross-replica context).
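A minimal sketch of the supported usage, assuming a locally constructible MirroredStrategy; each replica is expected to get its own random stream:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  # The generator's state variable is created under the strategy, so it
  # becomes a distributed variable.
  gen = tf.random.Generator.from_seed(1234)

@tf.function
def sample():
  return gen.normal(shape=[2])

per_replica = strategy.run(sample)
print(strategy.experimental_local_results(per_replica))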
PiperOrigin-RevId: 350851648
Change-Id: I5f4d77ddb990557fcc9c7336987203ecdaec5b9a
Auto control dependencies will chain operations that share the same resource input. We'll do the same for all-gather after some refactoring is done.
PiperOrigin-RevId: 341868107
Change-Id: I5570a28c2e1c638980e3509088c0525e957c463b
We used to use one worker pool per strategy combination, but that's not
necessary: if the cluster topology is the same, combinations can share the same
worker pool. This reduces the overhead of initializing worker pools, which can
take O(10s) for GPU builds.
PiperOrigin-RevId: 339117353
Change-Id: I1f631f79597b07991528c77482c44c201a01abe4
Ops like `tf.nn.nce_loss` and `tf.nn.sampled_softmax_loss` also benefit from this as they use embedding_lookup internally.
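For context, a small hypothetical sketch of a sharded lookup: tf.nn.embedding_lookup accepts a list of variables and treats them as one partitioned table, and the sampled-loss ops above route their lookups through the same path.

import tensorflow as tf

# Hypothetical 8x8 embedding table split into two 4x8 shards.
shards = [tf.Variable(tf.random.normal([4, 8])),
          tf.Variable(tf.random.normal([4, 8]))]
ids = tf.constant([0, 5, 3])

# embedding_lookup gathers rows across the shards as if they were one table.
embeddings = tf.nn.embedding_lookup(shards, ids)
print(embeddings.shape)  # (3, 8)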
PiperOrigin-RevId: 338369985
Change-Id: I89ebe2a452fc1d599567cb80e80ee9b023e5aa1c
tf.function tracing depends on the inputs to the function. For a typical training loop:
x, y = next(iterator)
train_fn(x, y)
it may retrace when it gets a partial batch. This is problematic for multi-client training, since different clients may retrace at different times. We assign collective instance keys when tracing a function, so retracing results in different sets of instance keys.
With this change we override the PerReplica type spec, which is used to calculate the function cache key. This avoids retracing in common cases, but it doesn't guarantee that retracing won't happen.
Note that after this change the traced function only gets partial shape information, which is why we only do it for multi-client strategies (MWMS), to avoid a performance penalty for e.g. TPU.
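A runnable single-worker sketch of the situation; MirroredStrategy stands in for a multi-client strategy so the snippet runs standalone, and the last, smaller batch is what would trigger a retrace without a relaxed PerReplica spec:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  dense = tf.keras.layers.Dense(1)
  dense.build((None, 4))
  optimizer = tf.keras.optimizers.SGD(0.1)

def step_fn(x, y):
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(dense(x) - y))
  grads = tape.gradient(loss, dense.trainable_variables)
  optimizer.apply_gradients(zip(grads, dense.trainable_variables))
  return loss

@tf.function
def train_fn(x, y):
  # Collective instance keys are assigned while tracing, so a retrace on only
  # one client would leave clients with mismatched keys.
  return strategy.run(step_fn, args=(x, y))

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([10, 4]), tf.random.normal([10, 1]))).batch(4)
for x, y in strategy.experimental_distribute_dataset(dataset):
  train_fn(x, y)  # The final batch of 2 changes the input shape.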
PiperOrigin-RevId: 338203534
Change-Id: Iae9d6c3c82113d623707e19142fbebe5597d7898
Over the past months we've made several improvements:
- Test coverage is now on par with other strategies.
- Peer failure will no longer cause the cluster to hang.
- Major issues with saving are fixed.
- gather() API is added.
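An illustrative sketch of the new gather() API; MirroredStrategy is used so the snippet runs standalone, and the strategy.gather(value, axis) entry point shown is assumed to be where the API landed:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

@tf.function
def value_fn():
  ctx = tf.distribute.get_replica_context()
  # Each replica contributes one row tagged with its replica id.
  return tf.fill([1, 2], tf.cast(ctx.replica_id_in_sync_group, tf.float32))

per_replica = strategy.run(value_fn)
# gather() concatenates the per-replica tensors along the given axis.
print(strategy.gather(per_replica, axis=0))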
PiperOrigin-RevId: 338175223
Change-Id: I3c52a4d53d1c487558f1caaae7d094fe2245183b
compat.v1 partitioners are left unchanged. V2 partitioners are exported in the tf.distribute namespace, since they are supposed to work with sharded variables, which are a tf.distribute concept. The implementations of the partitioners are reused.
While at it, we also took the opportunity to refine the naming:
- variable_axis_size_partitioner -> MaxSizePartitioner (partitioner that keeps shards under a maximum size)
- min_max_variable_partitioner -> MinSizePartitioner (partitioner that allocates shards above a minimum size)
- fixed_size_partitioner -> FixedShardsPartitioner (partitioner that allocates a fixed number of shards).
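A small sketch of the renamed partitioners under the assumed tf.distribute.experimental.partitioners export path; a partitioner is a callable mapping a variable's shape and dtype to the number of partitions per axis:

import tensorflow as tf

partitioner = tf.distribute.experimental.partitioners.FixedShardsPartitioner(
    num_shards=2)
# Number of partitions along each axis of a [10, 3] float32 variable.
print(partitioner(shape=tf.TensorShape([10, 3]), dtype=tf.float32))  # [2, 1]

# Typical use is with ParameterServerStrategy, which shards variables created
# under its scope (cluster_resolver setup elided):
# strategy = tf.distribute.experimental.ParameterServerStrategy(
#     cluster_resolver, variable_partitioner=partitioner)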
PiperOrigin-RevId: 338157380
Change-Id: I19f517e38f20e4e9c85745863e764da0aad6eeeb
This change exports the following class symbols, and adds relevant documentation and example code to them:
tf.distribute.experimental.ParameterServerStrategy
tf.distribute.experimental.coordinator.ClusterCoordinator
tf.distribute.experimental.coordinator.PerWorkerValues
tf.distribute.experimental.coordinator.RemoteValue
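An illustrative sketch of how the exported pieces fit together; it assumes a TF_CONFIG-based cluster with worker and ps tasks, so it will not run standalone:

import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

@tf.function
def step_fn():
  return tf.constant(1.0)

# schedule() returns a RemoteValue immediately; the function runs on a worker.
remote_value = coordinator.schedule(step_fn)
coordinator.join()           # Block until all scheduled functions finish.
print(remote_value.fetch())  # Fetch the result held by the RemoteValue.

# Per-worker datasets produce PerWorkerValues, one iterator per worker:
# per_worker_dataset = coordinator.create_per_worker_dataset(dataset_fn)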
PiperOrigin-RevId: 338151262
Change-Id: If2d1c513d30a999c728cecc2e73b75adda1948c2
Over the past months we've made several improvements:
- Test coverage is now on par with other strategies.
- Peer failure will no longer cause the cluster to hang.
- Major issues with saving are fixed.
- gather() API is added.
PiperOrigin-RevId: 338132035
Change-Id: I384c084717cd5f2b6167668ebe96af0f7b371530
Over the past months we've made several improvements:
- Test coverage is now on par with other strategies.
- Peer failure will no longer cause the cluster to hang.
- Major issues with saving are fixed.
- gather() API is added.
PiperOrigin-RevId: 338110984
Change-Id: I92eeb981c67acb0c44f658316b6ad564162508bc
tf.function tracing depends on the inputs to the function. For a typical training loop:
x, y = next(iterator)
train_fn(x, y)
it may retrace when it gets a partial batch. This is problematic for multi-client training, since different clients may retrace at different times. We assign collective instance keys when tracing a function, so retracing results in different sets of instance keys.
With this change we override the PerReplica type spec, which is used to calculate the function cache key. This avoids retracing in common cases, but it doesn't guarantee that retracing won't happen.
Note that after this change the traced function only gets partial shape information, which is why we only do it for multi-client strategies (MWMS), to avoid a performance penalty for e.g. TPU.
PiperOrigin-RevId: 337792983
Change-Id: Ib029d61cd360d6a25e38e894913e4d78af20d1dd
The CollectiveHints class is also renamed to CommunicationOptions, and the communication enum is added to it.
CommunicationOptions stays experimental since the detailed options may change, but it's clear that we need an options argument for these cross-device communications.
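A minimal sketch of the renamed options object, assuming the TF 2.4-era surface; constructing the strategy without a cluster falls back to single-worker behavior:

import tensorflow as tf

options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.AUTO)

# The options object is passed when constructing the strategy and can also be
# supplied to individual cross-device calls that accept an options argument.
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=options)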
PiperOrigin-RevId: 337547832
Change-Id: I376171672698d5923b4e52f2567d4a584c8e21b6
With a distribution strategy, traced ConcreteFunctions may contain training-specific logic that assumes the variable is a distributed variable. Such functions cannot be used for inference. Since we do not know whether such a ConcreteFunction will be saved for inference or not, we always mark it as unsaveable unless it is traced under a save context.
The user can pass a tf.function instead, which can be retraced during saving (see the sketch below).
Impacted usages:
- MultiWorkerMirroredStrategy
  - Reading a synchronization=ON_READ variable. E.g. a batch norm layer.
- MultiWorkerMirroredStrategy, MirroredStrategy, TPUStrategy
  - Updating a variable.
  - Reading a synchronization=ON_READ aggregation=SUM variable.
It's TBD whether we also need to mark functions that use a packed handle as unsaveable. They do contain TPU:0 device annotations, but with soft placement that may not be a problem.
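A runnable sketch of the recommended pattern: export a tf.function with an input signature so it is retraced under the save context rather than reusing a strategy-traced ConcreteFunction (the model and export path are illustrative):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

@tf.function(input_signature=[tf.TensorSpec([None, 4], tf.float32)])
def serve(x):
  return model(x)

# Passing the tf.function (not a pre-traced ConcreteFunction) lets the saver
# trace it under the save context, where distributed variables are handled.
tf.saved_model.save(model, "/tmp/exported_model", signatures=serve)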
PiperOrigin-RevId: 337438256
Change-Id: Ie89d0d6beb3e71d3ebbb867d1f91f2953468840c
This allows us to:
1. Pass `ShardedVariable` to tf.function inputs while avoiding retracing if the spec of the ShardedVariable doesn't change.
2. Use `nest.flatten(sharded_variable, expand_composites=True)` to retrieve the list of component variables. This is used by `tf.Module` and Keras Layer to collect variables from attributes that are nested structures, so this change enables them to collect component variables when a ShardedVariable is assigned to one of their attributes.
`layer.add_weight` already works; this change adds a test for that.
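A minimal sketch of point 2; the ShardedVariable import path below is an internal module and is an assumption for illustration:

import tensorflow as tf
from tensorflow.python.distribute.sharded_variable import ShardedVariable

sv = ShardedVariable([tf.Variable([0., 1.]), tf.Variable([2., 3.])])

# With ShardedVariable acting as a composite, expand_composites=True recovers
# the component variables, which is how tf.Module / Keras layers track them.
print(tf.nest.flatten(sv, expand_composites=True))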
PiperOrigin-RevId: 337382403
Change-Id: I4c7e490cdc8fd772ed57c4074894637147986dac
It appears that the polarity of the use of has_tensor_list_arg was
inadvertently flipped.
Disable any MLIR-bridge-enabled tests that were passing only because, due to
this issue, they weren't actually using the MLIR bridge.
PiperOrigin-RevId: 337125651
Change-Id: I93e9e61acda9a2aeffaee5cce13e93635d33f5a4
Collective v2 doesn't support scoped allocator. While it's possible to make scoped allocator work with it, concat/split is much simpler.
PiperOrigin-RevId: 337109439
Change-Id: I3535f5e0b090696f3bb620617f2b57f1f4b78b22
Collective v2 doesn't support scoped allocator. While it's possible to make scoped allocator work with it, concat/split is much simpler.
PiperOrigin-RevId: 337021023
Change-Id: I6e6e2fdc3c94ffbc59a52c20a451dcd74fd864e4