Eigen has very poor performance when the output tensor has few elements. With this change, if the output has at most 1024 elements, a different implementation is used. The new implementation uses the functor::ReduceImpl function that is used for most other TF reductions. Eigen performs better than functor::ReduceImpl when there are many output elements, which is why Eigen is still used when the number of output elements is greater than 1024.
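As a minimal sketch of the dispatch rule described above (the function and backend names are illustrative, not TensorFlow's actual code):

```python
# Hypothetical sketch: route reductions with small outputs to the
# ReduceImpl-style path and keep Eigen for larger outputs.
SMALL_OUTPUT_THRESHOLD = 1024  # at most this many output elements

def choose_reduction_impl(num_output_elements):
    """Return which reduction backend the kernel would pick."""
    if num_output_elements <= SMALL_OUTPUT_THRESHOLD:
        return "functor::ReduceImpl"  # faster for few output elements
    return "Eigen"                    # faster for many output elements
```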
A benchmark was added. The results from running on my machine with two Xeon E5-2690 v4 CPUs and a Titan V GPU are shown below. All times are in seconds. The benchmarks were run in the internal version of TensorFlow. Only float32 benchmarks are shown, as the float16 and float64 results are similar. Likewise, only benchmarks where the new implementation is used are shown; when the old implementation is used, performance is the same as before this change.
Benchmark                     New time (s)   Old time (s)   New time as % of old
1d_float32_dim0                    0.00089        0.06431                    1.4%
rectangle1_2d_float32_dim1         0.00285        0.06736                    4.2%
rectangle2_2d_float32_dim0         0.00298        0.05501                    5.2%
rectangle1_3d_float32_dim0         0.07876        0.12668                   62.2%
rectangle2_3d_float32_dim1         0.07869        0.12757                   61.7%
rectangle3_3d_float32_dim2         0.07847        0.78461                   10.0%
PiperOrigin-RevId: 292206797
Change-Id: Ic586910e0935463190761dc3ec9e7122bba06bd6
Add Numpy-style broadcasting in the batch dimensions for tf.linalg.triangular_solve op. The last two dimensions of both operands constitute the matrix dimensions. The dimensions beyond these are broadcasted to form a common output shape with the standard NumPy broadcasting rules. (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
Note: This implementation differs from Numpy's behavior in that vectors (rank-1 Tensors) are not promoted to matrices (rank-2 Tensors) by appending/prepending dimensions.
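The batch-shape computation described above can be sketched in pure Python as follows (a simplified illustration of the standard broadcasting rules, not TensorFlow's implementation):

```python
def broadcast_batch_shape(shape_a, shape_b):
    """Compute the NumPy-broadcast shape of the batch dimensions.

    The last two dimensions of each operand are the matrix dimensions and
    are excluded; the dimensions before them are broadcast pairwise,
    right-aligned, per the standard NumPy rules.
    """
    batch_a, batch_b = list(shape_a[:-2]), list(shape_b[:-2])
    # Right-align the shorter batch shape by padding with 1s on the left.
    n = max(len(batch_a), len(batch_b))
    batch_a = [1] * (n - len(batch_a)) + batch_a
    batch_b = [1] * (n - len(batch_b)) + batch_b
    out = []
    for da, db in zip(batch_a, batch_b):
        if da != db and da != 1 and db != 1:
            raise ValueError(f"Incompatible batch dimensions: {da} vs {db}")
        out.append(max(da, db))
    return tuple(out)
```

For example, a [2, 1, 3, 3] matrix operand against a [4, 3, 1] right-hand side yields the common batch shape (2, 4).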
PiperOrigin-RevId: 291857632
Change-Id: Ifce8f1ae3e0e5b990b71cf468978e1cdc7663d1f
Defining TensorList outside list_kernels library will allow clients to use TensorList class without having to also include the kernels operating on it.
PiperOrigin-RevId: 291474941
Change-Id: Iaab9d6c077b6a6c6236896c80d53ac8196472a82
The test file tests two distinct headers, and in some contexts its complexity
leads to very long compile times. By splitting it in two, we should at least
enable parallel compilation, and we may also reduce the effect of whatever
behavior this is triggering in the compiler.
PiperOrigin-RevId: 291462169
Change-Id: I4df8934d8eaad1c93f986c074734eb31fa98e91a
This enables a few headers to be removed from implementations, which in turn
simplifies the build graph somewhat.
PiperOrigin-RevId: 291286881
Change-Id: I0b8c9d1419cf81ea8b3a48b422ea2dc0fd9187d9
1. Use DMAHelper to access the tensor base pointers without additional
alignment and type checks, and use these pointers to access the elements of
the tensors directly.
2. Add a special case for rank == 2 (which is the common case when batching
Example protos), to avoid a length-1 loop per element.
3. Use `memcpy` where possible (and otherwise, `std::copy_n`) instead of Eigen
assignment for the group values.
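Optimizations 2 and 3 can be illustrated with a small sketch (hypothetical names; the real kernel works on raw tensor buffers in C++): for rank-2 inputs, each group's values are contiguous, so one bulk copy per group (the analogue of `memcpy`) replaces a length-1 inner loop per element.

```python
def copy_groups_rank2(src, group_starts, group_lengths, dst):
    """Copy contiguous value groups from a flat src buffer into dst.

    One slice assignment per group stands in for memcpy; there is no
    per-element inner loop.
    """
    offset = 0
    for start, length in zip(group_starts, group_lengths):
        dst[offset:offset + length] = src[start:start + length]
        offset += length
    return dst
```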
PiperOrigin-RevId: 291176480
Change-Id: I331213c0ac1caadf620c87833759b8a6550f1752
Removes reliance on the assumption that tensorflow::int64 is long long. This is intended to eventually enable changing the definition to int64_t from <cstdint>.
PiperOrigin-RevId: 290872365
Change-Id: I18534aeabf153d65c3521599855f8cca279fce51
Add Numpy-style broadcasting in the batch dimensions for tf.linalg.triangular_solve op. The last two dimensions of both operands constitute the matrix dimensions. The dimensions beyond these are broadcasted to form a common output shape with the standard NumPy broadcasting rules. (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
Note: This implementation differs from Numpy's behavior in that vectors (rank-1 Tensors) are not promoted to matrices (rank-2 Tensors) by appending/prepending dimensions.
PiperOrigin-RevId: 289978628
Change-Id: I66e41e292e57e6df8111745cbe47ccffacb53edc
Add Numpy-style broadcasting in the batch dimensions for tf.linalg.triangular_solve op. The last two dimensions of both operands constitute the matrix dimensions. The dimensions beyond these are broadcasted to form a common output shape with the standard NumPy broadcasting rules. (https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
Note: This implementation differs from Numpy's behavior in that vectors (rank-1 Tensors) are not promoted to matrices (rank-2 Tensors) by appending/prepending dimensions.
PiperOrigin-RevId: 289966825
Change-Id: Ib276b9ed1f4b7d10c25617d7ba5f1564b2077610
The change omits necessary is_initialized() checks, which could lead to data races.
PiperOrigin-RevId: 286203946
Change-Id: I678d05ccc7c5220e2d30111a853fd20f505fe933
The change omits necessary is_initialized() checks, which could lead to data races.
PiperOrigin-RevId: 285986418
Change-Id: I12d40188473b855e398437514237b72eddb0443f
Accept the fact that old kernels will start with static linking.
Once the build restructuring is complete, revisit old kernels.
PiperOrigin-RevId: 285151131
Change-Id: I30fee8a789ff9733ea0573b1ce9f44bfd66a4923
This is part of the refactoring described in the Tensorflow Build Improvements RFC: https://github.com/tensorflow/community/pull/179
Subsequent changes will migrate targets from build_refactor.bzl into the new BUILD files.
PiperOrigin-RevId: 284712709
Change-Id: I650eb200ba0ea87e95b15263bad53b0243732ef5
V2 ops always align the diagonals to the left (LEFT_LEFT) in the compact format. V3 ops support 4 alignments: RIGHT_LEFT, LEFT_RIGHT, LEFT_LEFT, and RIGHT_RIGHT. We would like to use RIGHT_LEFT as the default alignment. This contradicts v2's behavior, so we need a new version.
V2 has never been exposed to the public APIs. We will skip V2 and go from V1 to V3 directly. V3 features are currently under forward compatibility guards and will be enabled automatically in ~3 weeks from now.
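The alignment options can be sketched as follows (a simplified illustration, assuming the convention that the first word of the alignment applies to superdiagonals and the second to subdiagonals, with 0 as the padding value):

```python
PAD = 0  # illustrative padding value

def pack_diagonal(values, max_diag_len, is_superdiag, alignment="RIGHT_LEFT"):
    """Pad one diagonal's values to max_diag_len in the compact format.

    alignment is one of RIGHT_LEFT, LEFT_RIGHT, LEFT_LEFT, RIGHT_RIGHT.
    LEFT-aligned diagonals are padded on the right; RIGHT-aligned ones
    are padded on the left.
    """
    side = alignment.split("_")[0 if is_superdiag else 1]
    pad = [PAD] * (max_diag_len - len(values))
    return pad + list(values) if side == "RIGHT" else list(values) + pad
```

Under the proposed RIGHT_LEFT default, a length-2 superdiagonal packed into a row of length 3 becomes [0, 1, 2], while a subdiagonal becomes [1, 2, 0]; under v2's LEFT_LEFT, both would be [1, 2, 0].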
This commit contains
- V3 API definitions.
- Modifications to C++ Matrix{Diag,SetDiag,DiagPart}Op kernels (CPU, GPU, XLA) and shape inference functions to support v3.
- Additional tests and gradient implementations in Python for v3.
- Pfor and TFLite TOCO converters for v3.
- The TFLite MLIR converter for MatrixDiagV3 is intentionally left out because of an MLIR test infrastructure issue and will be added in a separate commit.
Notes:
- Python changes cannot be in a separate follow-up commit because all kernel tests are in Python. (No C++ tests.)
- All three ops have to be in the same commit because their gradients call each other.
PiperOrigin-RevId: 280527550
Change-Id: I88e91abab5c4b50419204807ede4fa60657f048a