Commit Graph

1042 Commits

Author SHA1 Message Date
A. Unique TensorFlower
58b1c0f401 Internal change
PiperOrigin-RevId: 292220288
Change-Id: Ib7e23f56f7b79174669d10ae7d938a82d9c19900
2020-01-29 14:37:24 -08:00
Reed Wanderman-Milne
837b673aa7 Improve performance of argmax and argmin GPU kernel in some cases.
Eigen has very poor performance when the output tensor has few elements. With this change, if there are at most 1024 elements, a different implementation is used. The new implementation uses the functor::ReduceImpl function that is used for most other TF reductions. Eigen performs better than functor::ReduceImpl when there are many output elements, which is why Eigen is still used when the number of output elements is greater than 1024.
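
As an illustration of the dispatch described above, here is a minimal C++ sketch; the helper names (RunWithReduceImpl, RunWithEigen, kMaxOutputElementsForReduceImpl) are illustrative placeholders, not the actual TensorFlow symbols.

    #include <cstdint>

    namespace {
    // Placeholder stand-ins for the two reduction paths.
    void RunWithReduceImpl() { /* functor::ReduceImpl-based reduction */ }
    void RunWithEigen() { /* Eigen-based reduction */ }
    }  // namespace

    constexpr int64_t kMaxOutputElementsForReduceImpl = 1024;

    void ComputeArgMaxOnGpu(int64_t num_output_elements) {
      // Few output elements: Eigen performs poorly here, so use the
      // ReduceImpl-based path; otherwise Eigen remains the better choice.
      if (num_output_elements <= kMaxOutputElementsForReduceImpl) {
        RunWithReduceImpl();
      } else {
        RunWithEigen();
      }
    }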

A benchmark was added. The results from running on my machine with two Xeon E5-2690 v4 CPUs and a Titan V GPU are shown below. All times are in seconds. The benchmarks were run in the internal version of TensorFlow. Only float32 benchmarks are shown, as float16 and float64 results are similar. Also, only benchmarks where the new implementation is used are shown; when the old implementation is used, performance is the same as before this change.

Benchmark                   New time (s)  Old time (s)  New/old (%)
1d_float32_dim0             0.00089       0.06431         1.4%
rectangle1_2d_float32_dim1  0.00285       0.06736         4.2%
rectangle2_2d_float32_dim0  0.00298       0.05501         5.2%
rectangle1_3d_float32_dim0  0.07876       0.12668        62.2%
rectangle2_3d_float32_dim1  0.07869       0.12757        61.7%
rectangle3_3d_float32_dim2  0.07847       0.78461        10.0%

PiperOrigin-RevId: 292206797
Change-Id: Ic586910e0935463190761dc3ec9e7122bba06bd6
2020-01-29 13:36:42 -08:00
George Karpenkov
9b544af08d Change std::call_once to absl::call_once
absl::call_once is faster and supports fibers.
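
A minimal sketch of the migration, assuming only that absl/base/call_once.h is available; the surrounding names (init_flag, InitCacheOnce, cache) are illustrative:

    #include "absl/base/call_once.h"

    // absl::once_flag / absl::call_once mirror the std::call_once API, so the
    // change is largely mechanical: replace std::once_flag with absl::once_flag
    // and std::call_once with absl::call_once.
    absl::once_flag init_flag;
    int* cache = nullptr;

    void InitCacheOnce() {
      absl::call_once(init_flag, [] { cache = new int(42); });
    }
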
PiperOrigin-RevId: 292148213
Change-Id: I66e96d735b722a2642508a7e7a1e73de254234d7
2020-01-29 08:40:19 -08:00
Brian Atkinson
efb083cff3 Make //tensorflow/core/kernels:non_max_suppression_op_gpu_test IWYU clean.
PiperOrigin-RevId: 292045357
Change-Id: I7d799733c5ff20336c4d805f915fd6fb2f9a002a
2020-01-28 17:20:20 -08:00
Srinivas Vasudevan
b105944eb6 Add Broadcasted Matrix Triangular Solve.
Add Numpy-style broadcasting in the batch dimensions for tf.linalg.triangular_solve op. The last two dimensions of both operands constitute the matrix dimensions. The dimensions beyond these are broadcasted to form a common output shape with the standard NumPy broadcasting rules. (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
Note: This implementation differs from Numpy's behavior in that vectors (rank-1 Tensors) are not promoted to matrices (rank-2 Tensors) by appending/prepending dimensions.
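
A minimal, illustrative C++ sketch of the batch-dimension broadcasting rule described above (not the actual TensorFlow shape-inference code):

    #include <algorithm>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Broadcast two batch shapes (all dimensions except the trailing two matrix
    // dimensions) using the standard NumPy rules: right-align the shapes, pad
    // the shorter one with 1s, and require each pair of dimensions to be equal
    // or for one of them to be 1.
    std::vector<int64_t> BroadcastBatchShape(std::vector<int64_t> a,
                                             std::vector<int64_t> b) {
      while (a.size() < b.size()) a.insert(a.begin(), 1);
      while (b.size() < a.size()) b.insert(b.begin(), 1);
      std::vector<int64_t> out(a.size());
      for (size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i] && a[i] != 1 && b[i] != 1) {
          throw std::invalid_argument("incompatible batch dimensions");
        }
        out[i] = std::max(a[i], b[i]);
      }
      return out;
    }
    // Example: batch shapes [2, 1, 3] and [5, 3] broadcast to [2, 5, 3].
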
PiperOrigin-RevId: 291857632
Change-Id: Ifce8f1ae3e0e5b990b71cf468978e1cdc7663d1f
2020-01-27 20:43:43 -08:00
A. Unique TensorFlower
2938772a08 Internal change
PiperOrigin-RevId: 291690233
Change-Id: I294cf4f577d1d17ad5bbb6e3be022ece6268575c
2020-01-27 03:17:56 -08:00
TensorFlower Gardener
4f5d1cc221 Merge pull request from duncanriach:multi-algorithm-deterministic-cudnn-convolutions
PiperOrigin-RevId: 291684013
Change-Id: I818177de66eeec3dd52e276a5894a1d7a7166459
2020-01-27 02:35:26 -08:00
Smit Hinsu
85cc56784d Define TensorList class in a separate library
Defining TensorList outside the list_kernels library allows clients to use the TensorList class without also having to include the kernels that operate on it.

PiperOrigin-RevId: 291474941
Change-Id: Iaab9d6c077b6a6c6236896c80d53ac8196472a82
2020-01-24 17:37:07 -08:00
Brian Atkinson
659c90d9c5 Split eigen_backward_spatial_convolutions_test into two.
The test file tests two distinct headers, and in some contexts its complexity
leads to very long compile times. Splitting it in two should at least enable
parallel compilation, and may also reduce the effect of whatever behavior this
is triggering in the compiler.

PiperOrigin-RevId: 291462169
Change-Id: I4df8934d8eaad1c93f986c074734eb31fa98e91a
2020-01-24 16:07:33 -08:00
Brian Atkinson
23e0a04dbf Add a number of missing headers being transitively pulled in.
This enables a few headers to be removed from implementations and in turn
simplify the build graph some.

PiperOrigin-RevId: 291286881
Change-Id: I0b8c9d1419cf81ea8b3a48b422ea2dc0fd9187d9
2020-01-23 18:18:03 -08:00
Derek Murray
649a04fbe9 [tf.data] Further optimizations for SerializeManySparseOp.
1. Use DMAHelper to access the tensor base pointers without additional
   alignment and type checks, and use these pointers to access the elements of
   the tensors directly.

2. Add a special case for rank == 2 (which is the common case when batching
   Example protos), to avoid a length-1 loop per element.

3. Use `memcpy` where possible (and otherwise, `std::copy_n`) instead of Eigen
   assignment for the group values (see the sketch after this list).
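
A minimal sketch of the copy strategy from point 3, specialized on trivially copyable element types; this is illustrative, not the actual SerializeManySparseOp code:

    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <type_traits>

    // Copy n group values: memcpy for trivially copyable element types,
    // std::copy_n otherwise.
    template <typename T>
    void CopyGroupValues(const T* src, T* dst, size_t n) {
      if constexpr (std::is_trivially_copyable<T>::value) {
        std::memcpy(dst, src, n * sizeof(T));
      } else {
        std::copy_n(src, n, dst);
      }
    }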

PiperOrigin-RevId: 291176480
Change-Id: I331213c0ac1caadf620c87833759b8a6550f1752
2020-01-23 09:00:57 -08:00
Jeremy Lau
d277dc5e68 Disable tests that can't run under msan.
PiperOrigin-RevId: 291020785
Change-Id: I045d262e2e728b8a5dce80a2fa2993065c31f85a
2020-01-22 13:43:56 -08:00
A. Unique TensorFlower
a1bc56203f Fix 64-bit integer portability problems in TensorFlow kernels.
Removes reliance on the assumption that tensorflow::int64 is long long. This is intended to eventually enable changing the definition to int64_t from <cstdint>.
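
A minimal sketch of the kind of portability issue involved, using only standard headers; the function name is illustrative:

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    // int64_t is `long` on typical 64-bit Linux toolchains and `long long`
    // elsewhere, so code that hard-codes `long long` (for example "%lld"
    // format strings or overload choices) breaks if the definition changes.
    // Fixed-width types plus the PRId64 macro stay portable.
    void PrintElementCount(int64_t count) {
      std::printf("count = %" PRId64 "\n", count);
    }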

PiperOrigin-RevId: 290872365
Change-Id: I18534aeabf153d65c3521599855f8cca279fce51
2020-01-21 19:14:20 -08:00
TensorFlower Gardener
bd4c38b3dc Merge pull request from houtoms:pr_cudnn_ctc_loss
PiperOrigin-RevId: 290387603
Change-Id: I28491f42a4559a9f79bd6a7b73d8e6b670f55368
2020-01-17 20:32:44 -08:00
Smit Hinsu
c8e8ba577e Add Broadcasted Matrix Triangular Solve.
Add Numpy-style broadcasting in the batch dimensions for tf.linalg.triangular_solve op. The last two dimensions of both operands constitute the matrix dimensions. The dimensions beyond these are broadcasted to form a common output shape with the standard NumPy broadcasting rules. (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
Note: This implementation differs from Numpy's behavior in that vectors (rank-1 Tensors) are not promoted to matrices (rank-2 Tensors) by appending/prepending dimensions.
PiperOrigin-RevId: 289978628
Change-Id: I66e41e292e57e6df8111745cbe47ccffacb53edc
2020-01-15 18:25:29 -08:00
Srinivas Vasudevan
b1b7f38c25 Add Broadcasted Matrix Triangular Solve.
Add Numpy-style broadcasting in the batch dimensions for tf.linalg.triangular_solve op. The last two dimensions of both operands constitute the matrix dimensions. The dimensions beyond these are broadcasted to form a common output shape with the standard NumPy broadcasting rules. (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
Note: This implementation differs from Numpy's behavior in that vectors (rank-1 Tensors) are not promoted to matrices (rank-2 Tensors) by appending/prepending dimensions.
PiperOrigin-RevId: 289966825
Change-Id: Ib276b9ed1f4b7d10c25617d7ba5f1564b2077610
2020-01-15 17:03:51 -08:00
TensorFlower Gardener
0e2b5a9d2a Merge pull request from ROCmSoftwarePlatform:google_upstream_rocm_csr_sparse_matrix_support
PiperOrigin-RevId: 289617600
Change-Id: Ic1aa3714126d7b867295ae386b6be643c1dc83e4
2020-01-14 03:19:11 -08:00
Ayush Dubey
bdb99e06c5 Disable collective_nccl_test on single GPU and enable on multiple GPUs.
PiperOrigin-RevId: 289142542
Change-Id: I6b9c41f74062accc32173cc7afa4228e500bf31c
2020-01-10 12:33:02 -08:00
A. Unique TensorFlower
b6d83da696 Explicitly export files needed by other packages
PiperOrigin-RevId: 289068233
Change-Id: Iad295a519968341f3765116f5f3c6508efd51d24
2020-01-10 04:12:13 -08:00
TensorFlower Gardener
c1971ab97c Merge pull request from ROCmSoftwarePlatform:google_upstream_rocm_miopen_immediate_mode
PiperOrigin-RevId: 289053613
Change-Id: I233d95adc3aa888460bd39a07fd7e168fea14846
2020-01-10 01:43:54 -08:00
Srinivas Vasudevan
19986377f2 Add tf.math.xlog1py, a safe way to compute x * log1p(y)
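
A minimal sketch of the semantics, assuming the same convention as scipy.special.xlog1py (the result is 0 when x is 0, even where log1p(y) is infinite); illustrative only, not the TensorFlow kernel:

    #include <cmath>

    // Returns x * log1p(y), defined to be 0 when x == 0 so that expressions
    // like 0 * log1p(-1) (i.e. 0 * -inf) yield 0 instead of NaN.
    double Xlog1py(double x, double y) {
      return x == 0.0 ? 0.0 : x * std::log1p(y);
    }
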
PiperOrigin-RevId: 288971952
Change-Id: I3850da3b37f006b11198d203a1b73f3cb336b833
2020-01-09 14:28:37 -08:00
Mihai Maruseac
31b0483fdf Cleanup unused load statements.
PiperOrigin-RevId: 288791479
Change-Id: Ib68dc2bfa2856d839a47e7430d565da766df12ad
2020-01-08 15:53:53 -08:00
Brian Atkinson
229be70a12 Cleanup unused load statements.
PiperOrigin-RevId: 288759430
Change-Id: I45b80521b527d1ea2e3202a76ddc111dc6cbd273
2020-01-08 13:17:44 -08:00
TensorFlower Gardener
448c0b41bc Merge pull request from Intel-tensorflow:yang/mirrorpad
PiperOrigin-RevId: 287920237
Change-Id: I3e7aacbdd52584bfb99cc347ade8c6429d89196d
2020-01-02 17:22:31 -08:00
Deven Desai
f5b5f3d22d [ROCm] Enabling ROCm support for code in gpu_util.cc 2019-12-30 16:10:45 +00:00
Duncan Riach
330e7ad14e Refactor code that enables deterministic operation of cuDNN 2019-12-27 13:08:14 -08:00
A. Unique TensorFlower
50a1c3be8b Add tf.math.sobol_sample for generating Sobol sequences (CPU implementation only).
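
As a rough illustration of what a low-discrepancy sequence looks like (not the TensorFlow kernel), here is a sketch of the base-2 van der Corput (radical-inverse) sequence, the simplest such sequence and the basis of the first Sobol coordinate; the function names are illustrative:

    #include <cstdint>
    #include <vector>

    // Radical inverse in base 2: reflect the binary digits of i about the
    // binary point. Successive values fill [0, 1) much more evenly than
    // pseudo-random draws.
    double RadicalInverseBase2(uint32_t i) {
      double result = 0.0;
      double f = 0.5;
      for (; i != 0; i >>= 1, f *= 0.5) {
        if (i & 1u) result += f;
      }
      return result;
    }

    // First n points of the sequence: 0, 0.5, 0.25, 0.75, ...
    std::vector<double> VanDerCorputSequence(int n) {
      std::vector<double> points(n);
      for (int i = 0; i < n; ++i) {
        points[i] = RadicalInverseBase2(static_cast<uint32_t>(i));
      }
      return points;
    }
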
PiperOrigin-RevId: 286365254
Change-Id: Ia0c2482f4f264f36fe61db5f9c72f24db35faf65
2019-12-19 03:49:17 -08:00
Martin Wicke
9b20626323 Roll back "Use absl::flat_hash_map ... in lookup_table_op."
The change omits necessary is_initialized() checks, which could lead to data races.

PiperOrigin-RevId: 286203946
Change-Id: I678d05ccc7c5220e2d30111a853fd20f505fe933
2019-12-18 09:12:46 -08:00
Derek Murray
7b2c3eb190 Roll back "Use absl::flat_hash_map ... in lookup_table_op."
The change omits necessary is_initialized() checks, which could lead to data races.

PiperOrigin-RevId: 285986418
Change-Id: I12d40188473b855e398437514237b72eddb0443f
2019-12-17 08:41:29 -08:00
Guangda Lai
5d72174749 Changes package visibility.
PiperOrigin-RevId: 285907961
Change-Id: Ie2861ae7106ef34545561397024ef441c020b2ad
2019-12-16 21:13:36 -08:00
Martin Wicke
adfe745b21 Use absl::flat_hash_map instead of std::unordered_map for potential memory savings in lookup_table_op.
I ran the microbenchmarks -- the effect of this isn't obviously significant but there are some speed improvements.
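
A minimal sketch of the swap, assuming only the Abseil container header; this is illustrative, not the lookup_table_op code:

    #include <string>

    #include "absl/container/flat_hash_map.h"

    // absl::flat_hash_map is largely a drop-in replacement for
    // std::unordered_map, but it uses open addressing rather than per-node
    // allocation, which is where the potential memory savings come from.
    int Demo() {
      absl::flat_hash_map<std::string, int> table;
      table.reserve(4);           // same common API as std::unordered_map
      table["a"] = 1;
      table["b"] = 2;
      auto it = table.find("a");  // note: rehashing invalidates iterators and
                                  // pointers, unlike std::unordered_map
      return it != table.end() ? it->second : -1;
    }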

Before/After:

MutableHashTableBenchmark.benchmark_many_repeated_batch_32_insert_scalar
  wall_time: 0.000236988067627 -> 0.000297546386719
  allocator_maximum_num_bytes_cpu: 776.0 -> 776.0

MutableHashTableBenchmark.benchmark_many_repeated_scalar_insert_scalar
  wall_time: 0.000243902206421 -> 0.000259876251221
  allocator_maximum_num_bytes_cpu: 388.0 -> 268.0

MutableHashTableBenchmark.benchmark_single_repeated_batch_32_insert_scalar
  wall_time: 0.000108957290649 -> 9.79900360107e-05
  allocator_maximum_num_bytes_cpu: 640.0 -> 512.0

MutableHashTableBenchmark.benchmark_single_repeated_scalar_insert_scalar
  wall_time: 0.0001060962677 -> 0.000105142593384
  allocator_maximum_num_bytes_cpu: 268.0 -> 268.0

DenseHashTableBenchmark.benchmark_many_repeated_batch_32_insert_scalar
  wall_time: 0.000240087509155 -> 0.000237941741943
  allocator_maximum_num_bytes_cpu: 776.0 -> 776.0

DenseHashTableBenchmark.benchmark_many_repeated_scalar_insert_scalar
  wall_time: 0.000249147415161 -> 0.000262022018433
  allocator_maximum_num_bytes_cpu: 400.0 -> 400.0

DenseHashTableBenchmark.benchmark_single_repeated_batch_32_insert_scalar
  wall_time: 0.0001060962677 -> 9.79900360107e-05
  allocator_maximum_num_bytes_cpu: 640.0 -> 640.0

DenseHashTableBenchmark.benchmark_single_repeated_scalar_insert_scalar
  wall_time: 0.000102996826172 -> 0.000121116638184
  allocator_maximum_num_bytes_cpu: 280.0 -> 280.0

PiperOrigin-RevId: 285790064
Change-Id: I26ed8de9f1332d8a1fb5b0cdaf3842f6e2e55d3b
2019-12-16 10:15:02 -08:00
Deven Desai
7e8ccbd22b Adding ROCm support for the GpuSparse API (TF wrapper for cuSPARSE/hipSPARSE) 2019-12-13 15:24:42 +00:00
Gunhan Gulsoy
57c057b461 Prepare for new ops that have suitable kernels for dynamic loading.
Accept the fact that old kernels will start with static linking.
Once the build restructuring is complete, revisit old kernels.

PiperOrigin-RevId: 285151131
Change-Id: I30fee8a789ff9733ea0573b1ce9f44bfd66a4923
2019-12-12 02:20:37 -08:00
Kaixi Hou
47306c8618 Solved a conflict 2019-12-10 16:30:16 -08:00
Brian Zhao
ead06270dc Adding tensorflow/core/platform/default/BUILD and tensorflow/core/platform/windows/BUILD.
This is part of the refactoring described in the Tensorflow Build Improvements RFC: https://github.com/tensorflow/community/pull/179
Subsequent changes will migrate targets from build_refactor.bzl into the new BUILD files.

PiperOrigin-RevId: 284712709
Change-Id: I650eb200ba0ea87e95b15263bad53b0243732ef5
2019-12-10 00:08:58 -08:00
Duncan Riach
5341e3d299 Add multi-algorithm deterministic cuDNN convolutions 2019-12-06 18:18:38 -08:00
Brian Atkinson
e0fe27afa6 cudnn_rnn_ops and sdca_ops don't use farmhash directly, so the dep can be removed.
PiperOrigin-RevId: 284295152
Change-Id: I6755471f177da420fd4a051a5619392574004c12
2019-12-06 17:49:57 -08:00
Kaixi Hou
cb7e008708 Avoid registering the CTC loss kernel when cuDNN is older than 7.6.3 2019-12-05 13:33:30 -08:00
Kaixi Hou
4a89f04615 Use DnnScratchAllocator 2019-12-05 09:42:44 -08:00
A. Unique TensorFlower
8053a43598 Explicitly export files needed by other packages
PiperOrigin-RevId: 282800860
Change-Id: Ie7b8863629c0e4b2169c654714bf4a2338d667e5
2019-11-27 11:32:26 -08:00
ShengYang1
a1bdc83cc8 change MirrorPad packet region 2019-11-27 08:52:42 +08:00
Brian Atkinson
b12e985cc8 Use redirection point in core/platform for build_config.bzl
PiperOrigin-RevId: 282458453
Change-Id: I55d997c870125aee7179e86cad664f7ecbe1e3a7
2019-11-25 16:38:05 -08:00
Brian Atkinson
b5d2f3677f Add a redirection point to core/platform for build_config_root.bzl
PiperOrigin-RevId: 282394372
Change-Id: Iea26860cafba0304fe0846f4a992ad535292029c
2019-11-25 12:25:58 -08:00
TensorFlower Gardener
f46e758677 Merge pull request from lamarrr:patch-5
PiperOrigin-RevId: 282381036
Change-Id: I9870a352cf7a664ec4f7dfab13c6bb9bff41ec72
2019-11-25 11:24:46 -08:00
A. Unique TensorFlower
0b3bcfe4d9 Explicitly export files needed by other packages
PiperOrigin-RevId: 281482304
Change-Id: Iada44c7e5f7cc5070395d9966ce2a491b9fbc816
2019-11-20 02:57:36 -08:00
Zhuo Peng
de930a5ed3 Fixed a bug in DecodeProtoSparse.
A repeated field may appear on the wire in a non-contiguous manner.
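
A minimal sketch of why this matters, using a hypothetical, simplified view of the decoded wire stream as (field number, value) pairs; this is not the DecodeProtoSparse code:

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // The protobuf wire format allows occurrences of a repeated field to be
    // interleaved with other fields, e.g. field 1, field 2, field 1 again.
    // A decoder therefore has to accumulate values per field number instead
    // of assuming each repeated field arrives as one contiguous run.
    std::map<int, std::vector<uint64_t>> CollectRepeatedFields(
        const std::vector<std::pair<int, uint64_t>>& wire_entries) {
      std::map<int, std::vector<uint64_t>> values_by_field;
      for (const auto& entry : wire_entries) {
        values_by_field[entry.first].push_back(entry.second);
      }
      return values_by_field;
    }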

PiperOrigin-RevId: 280693833
Change-Id: Iff38580a249d2501d3508f9ca28f656cb9ec0dbc
2019-11-15 11:05:30 -08:00
Penporn Koanantakool
90e67385a4 Add Matrix{Diag,SetDiag,DiagPart}V3 ops with alignment options.
V2 ops always align the diagonals to the left (LEFT_LEFT) in the compact format. V3 ops support 4 alignments: RIGHT_LEFT, LEFT_RIGHT, LEFT_LEFT, and RIGHT_RIGHT. We would like to use RIGHT_LEFT as the default alignment. This contradicts v2's behavior, so we need a new version.
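
A minimal sketch of what the alignment options mean for the compact format, using an illustrative helper (not the actual kernel code): a diagonal shorter than the longest one is padded on the right when left-aligned and on the left when right-aligned.

    #include <vector>

    // Pack one diagonal into a row of length max_diag_len, padding on the
    // right for left alignment and on the left for right alignment.
    std::vector<int> PackDiagonal(const std::vector<int>& diag, int max_diag_len,
                                  bool align_left, int padding_value = 0) {
      std::vector<int> row(max_diag_len, padding_value);
      const int offset =
          align_left ? 0 : max_diag_len - static_cast<int>(diag.size());
      for (int i = 0; i < static_cast<int>(diag.size()); ++i) {
        row[offset + i] = diag[i];
      }
      return row;
    }
    // Example: the superdiagonal {1, 2} of a 3x3 matrix (max_diag_len = 3)
    // packs to {1, 2, 0} when left-aligned and {0, 1, 2} when right-aligned.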

V2 has never been exposed to the public APIs. We will skip V2 and go from V1 to V3 directly. V3 features are currently under forward compatibility guards and will be enabled automatically in ~3 weeks from now.

This commit contains
- V3 API definitions.
- Modifications to C++ Matrix{Diag,SetDiag,DiagPart}Op kernels (CPU, GPU, XLA) and shape inference functions to support v3.
- Additional tests and gradient implementations in Python for v3.
- Pfor and TFLite TOCO converters for v3.
- The TFLite MLIR converter for MatrixDiagV3 is intentionally left out because of an MLIR test infrastructure issue and will be added in a separate commit.

Notes:
- Python changes cannot be in a separate follow-up commit because all kernel tests are in Python. (No C++ tests.)
- All three ops have to be in the same commit because their gradients call each other.
PiperOrigin-RevId: 280527550
Change-Id: I88e91abab5c4b50419204807ede4fa60657f048a
2019-11-14 16:02:19 -08:00
Kaixi Hou
46aa1ca220 Put the reusable class CudnnAllocatorInTemp to a separate file 2019-11-08 13:27:12 -08:00
TensorFlower Gardener
8977382c53 Merge pull request from samikama:GenerateBoxProposalsOp
PiperOrigin-RevId: 279101236
Change-Id: Icf3e1b03365161708b906b44b7d544b2a4adba10
2019-11-07 09:37:48 -08:00
Adrian Kuegel
88b38d40a8 internal BUILD file cleanup
PiperOrigin-RevId: 278834295
Change-Id: I5b2624c45e6b4b6aa765b23a9e73f65b4b95c626
2019-11-06 05:10:36 -08:00