Update eager/context.py and tfe_wrapper to support returning
the real value of mlir_bridge_rollout (enabled/disabled/unspecified)
instead of a bool. This gives users a clearer signal of whether
the MLIR bridge is being used. At the moment, the MLIR bridge is
only enabled when mlir_bridge_rollout is set to enabled, but this
will change in the future.
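A minimal sketch of how the tri-state might be queried from Python after
this change; the accessor and constant names below are assumptions based
on this description, not confirmed API:

    from tensorflow.python.eager import context

    # Hypothetical accessor/constant names, for illustration only.
    rollout = context.context().mlir_bridge_rollout
    if rollout == context.MLIR_BRIDGE_ROLLOUT_ENABLED:
        print("MLIR bridge explicitly enabled")
    elif rollout == context.MLIR_BRIDGE_ROLLOUT_DISABLED:
        print("MLIR bridge explicitly disabled")
    else:
        print("Unspecified: TensorFlow decides during the rollout")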
PiperOrigin-RevId: 338124102
Change-Id: I5c93cbdd2815a698e6b41244db8eed716f4988e6
This change allows linking multithreaded XLA AOT CPU backend objects,
such as multithreaded matmul, conv2d, etc. These are not enabled by
default. New unit tests confirm that the objects are emitted and linked
correctly and that the resulting computations are numerically correct.
MKL service backend objects are not included.
Other changes:
* C++ unit tests now use arg_feed_{x,y} instead of arg0/arg1, since the
  generated names are unstable (they may be swapped relative to the signature).
* Add argument "multithreading=" to the bzl file and saved_model_cli
  (see the sketch after this list).
* Add unit tests that use "nm" to ensure the proper symbols are linked when
  multithreading is enabled or disabled (not sure whether they are
  Windows-friendly).
* Use a simpler and more distinctive string for the entry point.
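As a sketch, the new bzl attribute might be wired up from a BUILD file
like this (the tf_library macro path is assumed from the usual tfcompile
setup; target and file names are hypothetical):

    load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

    tf_library(
        name = "my_graph_mt",
        graph = "my_graph.pb",
        config = "my_graph.config.pbtxt",
        cpp_class = "MyGraph",
        multithreading = True,  # link the multithreaded XLA CPU backend objects
    )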
PiperOrigin-RevId: 338112208
Change-Id: Id734e75e63e72db93a743f451ddb7eb6f489c1c7
The existing tf_mlir_enable_mlir_bridge flag allows models to
selectively enable or disable the MLIR bridge via TF_XLA_FLAGS. If the
flag is not set, it defaults to false.
In order to slowly and safely roll out the MLIR bridge, we need
to distinguish between unspecified and forcibly disabled. If the
flag is unspecified, we can selectively choose when the bridge is
enabled. This will allow us to slowly ramp up the number of models
that use the new bridge.
This patch continues to support the existing TF_XLA_FLAGS interface
(tf_mlir_enable_mlir_bridge can be set to true or false), but
internally TensorFlow can now distinguish between false (forcibly
disabled) and unset (unspecified).
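For reference, the three states as driven through the flag interface
(a sketch; the env var and flag name are taken from the description above,
and leaving the variable unset is the "unspecified" state):

    import os

    # Forcibly enabled:
    os.environ["TF_XLA_FLAGS"] = "--tf_mlir_enable_mlir_bridge=true"
    # Forcibly disabled:
    # os.environ["TF_XLA_FLAGS"] = "--tf_mlir_enable_mlir_bridge=false"
    # Unspecified: leave TF_XLA_FLAGS unset, and TensorFlow may enable the
    # bridge selectively as the rollout proceeds.

    import tensorflow as tf  # the flags are read when TensorFlow initializes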
PiperOrigin-RevId: 337523318
Change-Id: I8ebb49da104663e12e5c1fa6399a1bf79239a44f
The threshold on tensor size was applied only after the value had been
computed, when replacing the old nodes; by that point, large models
could already have run out of memory.
Changed XLA compilation to limit TF constant folding to 1024 bytes,
since the folded values are only used to determine shapes, and XLA
performs its own constant folding internally.
PiperOrigin-RevId: 337226951
Change-Id: Ib7ebb91950e379cac6978027a7162438eb0a58d2
The threshold on tensor size was applied only after the value had been
computed, when replacing the old nodes; by that point, large models
could already have run out of memory.
Changed XLA compilation to limit TF constant folding to 1024 bytes,
since the folded values are only used to determine shapes, and XLA
performs its own constant folding internally.
PiperOrigin-RevId: 337221696
Change-Id: I4cdca20d28141f34b2c85120298bffb89e6df85d
This is in preparation for updating graph pruning to always prune imported function graphs.
PiperOrigin-RevId: 335944889
Change-Id: I3f6156aa08384883eee6227210f8fc8f1b7cc575
Some models are using "TF2" but also using Session and passing
a ConfigProto. This is TF1 code running on the TF2 MLIR bridge.
The TF2 MLIR bridge assumed that this was not possible. This cl
updates the TF2 version of the MLIR bridge to support a ConfigProto
passed in via Session.
Disable MLIR bridge presubmit testing of saved_model_test.py because
this fix reveals that the test is actually broken: it uses TF2 control
flow but loads the model with Session.
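A minimal sketch of the pattern this cl now supports: TF1-style Session
code carrying a ConfigProto under a TF2 install (illustrative, not the
broken test):

    import tensorflow.compat.v1 as tf1

    tf1.disable_eager_execution()  # TF1-style graph mode under TF2
    config = tf1.ConfigProto()     # settings the bridge must now honor
    with tf1.Session(config=config) as sess:
        x = tf1.placeholder(tf1.float32, shape=[2])
        y = x * 2.0
        print(sess.run(y, feed_dict={x: [1.0, 2.0]}))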
PiperOrigin-RevId: 335915669
Change-Id: Ib50bef389449ce0011878dd50b73856e9c520289
The existing tf_mlir_enable_mlir_bridge flag allows models to
selectively enable or disable the MLIR bridge via TF_XLA_FLAGS. If the
flag is not set, it defaults to false.
In order to slowly and safely roll out the MLIR bridge, we need
to distinguish between unspecified and forcibly disabled. If the
flag is unspecified, we can selectively choose when the bridge is
enabled. This will allow us to slowly ramp up the number of models
that use the new bridge.
This patch continues to support the existing TF_XLA_FLAGS interface
(tf_mlir_enable_mlir_bridge can be set to true or false), but
internally TensorFlow can now distinguish between false (forcibly
disabled) and unset (unspecified).
PiperOrigin-RevId: 335662030
Change-Id: Iefc44436620e52ff21a72583d57ebf29124a2691
Merge all language-specific proto libraries into just tf_proto_library.
PiperOrigin-RevId: 333400278
Change-Id: Ic891331668db3e562d42805295eade90fd017e91
Serialization adds a new surface area for bugs, as not all callers
propagate the CustomKernelCreator correctly. Moreover, the mechanism
is quite hacky, and we may switch to a different one in the future.
PiperOrigin-RevId: 333111910
Change-Id: I5a02200dfdffde657bd5d9e4547c470d8644d892
The new ops have three differences from existing (old) stateless RNG ops:
* They take in `key` and `counter` instead of `seed` (thus no seed scrambling).
* They take in an `alg` argument to control which RNG algorithm to use, unlike the old ones, which pick the algorithm based on the device.
* They don't have `HostMemory` constraints on `key` and `counter` (the old ones have such constraints on `seed`).
Two new ops, `StatelessRandomGetKeyCounterAlg` and `RngReadAndSkip`, are also added to bridge the gap between the new stateless ops and the Python APIs for stateless RNGs and tf.random.Generator, so that the Python APIs' behavior doesn't change.
Also adds set_soft_device_placement(False) to tests to control which kernels are tested.
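A sketch of how the new ops fit together via tf.raw_ops (the uniform op
is shown as StatelessRandomUniformV2, an assumption; exact op names and
signatures may differ):

    import tensorflow as tf

    tf.config.set_soft_device_placement(False)  # as in the updated tests

    seed = tf.constant([1, 2], dtype=tf.int64)
    # Derive key/counter/alg from a seed so Python-level behavior is unchanged.
    key, counter, alg = tf.raw_ops.StatelessRandomGetKeyCounterAlg(seed=seed)
    x = tf.raw_ops.StatelessRandomUniformV2(
        shape=[2, 3], key=key, counter=counter, alg=alg, dtype=tf.float32)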
PiperOrigin-RevId: 332346574
Change-Id: Ibe0e41cccce82e50b5581ea6298218efb163157a
- The inputs to strided slice are stored in a stack for the backward pass.
- Popping items from the stack makes the inputs unknown in the backward pass.
- When the begins and ends are unknown, lower the strided slice grads into
  dynamic update slice instead (see the sketch after this list).
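A sketch of a case with compile-time-unknown begins/ends that exercises
this path (illustrative; jit_compile= was spelled experimental_compile=
in older releases):

    import tensorflow as tf

    @tf.function(jit_compile=True)
    def slice_grad(x, begin):
        with tf.GradientTape() as tape:
            tape.watch(x)
            y = tf.strided_slice(x, [begin], [begin + 2])  # dynamic begin/end
            loss = tf.reduce_sum(y)
        # The gradient scatters ones back into a zeros tensor; with dynamic
        # begins/ends this lowers to a dynamic update slice.
        return tape.gradient(loss, x)

    print(slice_grad(tf.range(8.0), tf.constant(3)))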
PiperOrigin-RevId: 330987081
Change-Id: I0116a02f2fd7d660b49757622afc9934bb4b37e6
This allows CompileGraphToXlaHlo to take either std::vector<XlaArgument> or llvm::SmallVector arg parameters under different builds.
PiperOrigin-RevId: 329757301
Change-Id: I1025f3106af21b2672e2157c3f5b80af07ef0d0f
For cases where we cannot infer the bound of a value, compilation would fail. This gives users an escape hatch.
PiperOrigin-RevId: 329626655
Change-Id: Ib5d71054088692697eaf5f2b21c0c5d1a097f1eb
Extend the XLA codegen to generate parallel reductions when there are multiple
reduce instructions in a fusion computation.
We see ~3% e2e gain for NVIDIA JoC BERT.
For the `ManyParallelReductions` unit test with 128 reduce instructions, the
execution time is reduced from 325us to 3.9us (83x), as reported by nvprof below.
Before:
    Type  Time(%)  Time      Calls  Avg       Min       Max       Name
          32.50%   325.54us      1  325.54us  325.54us  325.54us  fusion
After:
    Type  Time(%)  Time      Calls  Avg       Min       Max       Name
           0.59%   3.9030us      1  3.9030us  3.9030us  3.9030us  fusion
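An illustrative trigger for this kind of fusion: several independent
reductions over the same operand, which XLA can fuse into a single
computation containing multiple reduce instructions (a sketch, not the
actual unit test):

    import tensorflow as tf

    @tf.function(jit_compile=True)
    def many_parallel_reductions(x):
        # Independent row-wise reductions over one operand; after this change
        # the fused kernel can execute them in parallel.
        return (tf.reduce_sum(x, axis=1),
                tf.reduce_max(x, axis=1),
                tf.reduce_min(x, axis=1))

    outs = many_parallel_reductions(tf.random.uniform([1024, 1024]))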