The existing tf_mlir_enable_mlir_bridge flag allows models to
selectively enable or disable the MLIR bridge via TF_XLA_FLAGS. If the
flag is not set, it defaults to false.
In order to roll out the mlir_bridge slowly and safely, we will
need to distinguish between unspecified and forcibly disabled.
If the flag is unspecified, we can selectively choose when the
bridge is enabled. This will allow us to slowly ramp up the
number of models that use the new bridge.
This patch continues to support the existing TF_XLA_FLAGS
interface (tf_mlir_enable_mlir_bridge can be set to true or false),
but internally TensorFlow can now distinguish between false
(forcibly disabled) and unset (unspecified).
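The unset/true/false distinction described above amounts to a tri-state value rather than a plain boolean. A minimal sketch of that pattern in Python (the enum, function names, and flag-parsing details are illustrative, not TensorFlow's actual internals):

```python
import enum
import os


class MlirBridgeRollout(enum.Enum):
    """Tri-state rollout policy for the MLIR bridge."""
    UNSPECIFIED = 0   # user set nothing; rollout logic may opt models in
    ENABLED = 1       # user forcibly enabled the bridge
    DISABLED = 2      # user forcibly disabled the bridge


def parse_bridge_flag(env=os.environ):
    """Maps the boolean-looking flag to the tri-state policy."""
    raw = env.get("TF_XLA_FLAGS", "")
    if "--tf_mlir_enable_mlir_bridge=true" in raw:
        return MlirBridgeRollout.ENABLED
    if "--tf_mlir_enable_mlir_bridge=false" in raw:
        return MlirBridgeRollout.DISABLED
    return MlirBridgeRollout.UNSPECIFIED


def bridge_is_enabled(policy, model_opted_in):
    """Only UNSPECIFIED defers to the gradual-rollout decision."""
    if policy is MlirBridgeRollout.ENABLED:
        return True
    if policy is MlirBridgeRollout.DISABLED:
        return False
    return model_opted_in
```

With a plain boolean, `False` and "not set" are indistinguishable, which is exactly what blocks the gradual ramp-up; the enum makes the third state explicit.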
PiperOrigin-RevId: 337523318
Change-Id: I8ebb49da104663e12e5c1fa6399a1bf79239a44f
The threshold on tensor size was applied only after the value was computed,
when replacing the old nodes. Computing the value, however, could already
have caused OOM in large models.
Changed compilation to XLA to limit TF constant folding to results of at
most 1024 bytes, since folding is only used for getting the shapes, and XLA
internally performs its own constant folding anyway.
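The key point above is that the size check must happen before materializing the value, using only the shape. A small sketch of that check (the function name and the fixed 1024-byte cap mirror the description above; everything else is hypothetical):

```python
# Illustrative sketch of a size-capped constant-folding predicate.
MAX_CONSTANT_FOLD_BYTES = 1024


def should_constant_fold(shape, dtype_size_bytes):
    """Decides from the shape alone whether folding is allowed.

    Checking the projected result size *before* computing the value
    avoids the OOM that a post-hoc threshold could not prevent.
    """
    num_elements = 1
    for dim in shape:
        if dim < 0:  # unknown dimension: be conservative, don't fold
            return False
        num_elements *= dim
    return num_elements * dtype_size_bytes <= MAX_CONSTANT_FOLD_BYTES
```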
PiperOrigin-RevId: 337226951
Change-Id: Ib7ebb91950e379cac6978027a7162438eb0a58d2
The threshold on tensor size was applied only after the value was computed,
when replacing the old nodes. Computing the value, however, could already
have caused OOM in large models.
Changed compilation to XLA to limit TF constant folding to results of at
most 1024 bytes, since folding is only used for getting the shapes, and XLA
internally performs its own constant folding anyway.
PiperOrigin-RevId: 337221696
Change-Id: I4cdca20d28141f34b2c85120298bffb89e6df85d
This is in preparation of updating graph pruning to always prune imported function graphs.
PiperOrigin-RevId: 335944889
Change-Id: I3f6156aa08384883eee6227210f8fc8f1b7cc575
The existing tf_mlir_enable_mlir_bridge flag allows models to
selectively enable or disable the MLIR bridge via TF_XLA_FLAGS. If the
flag is not set, it defaults to false.
In order to roll out the mlir_bridge slowly and safely, we will
need to distinguish between unspecified and forcibly disabled.
If the flag is unspecified, we can selectively choose when the
bridge is enabled. This will allow us to slowly ramp up the
number of models that use the new bridge.
This patch continues to support the existing TF_XLA_FLAGS
interface (tf_mlir_enable_mlir_bridge can be set to true or false),
but internally TensorFlow can now distinguish between false
(forcibly disabled) and unset (unspecified).
PiperOrigin-RevId: 335662030
Change-Id: Iefc44436620e52ff21a72583d57ebf29124a2691
This will allow for std::vector<XlaArgument> and llvm::SmallVector arg parameters in CompileGraphToXlaHlo to be used under different builds.
PiperOrigin-RevId: 329757301
Change-Id: I1025f3106af21b2672e2157c3f5b80af07ef0d0f
Explains the option of enabling soft_device_placement when an unsupported
op is encountered by XlaCompiler.
PiperOrigin-RevId: 328636713
Change-Id: I6913818d640902afe0695d05131534a064d3fb61
In some rare cases, a function may be compiled multiple times, and a key with the same data may therefore be inserted multiple times. When this happens, it should not result in an error.
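The behavior described above is an idempotent insert: re-inserting identical data is tolerated, while a genuine conflict still fails loudly. A sketch of that policy (the function name and dict-backed cache are hypothetical stand-ins for the compilation cache):

```python
def insert_or_verify(cache, key, value):
    """Idempotent cache insert.

    Re-inserting the same key with the same data is a no-op, matching
    the case where a function is compiled more than once. Raising only
    on *different* data keeps real bugs detectable.
    """
    existing = cache.get(key)
    if existing is None:
        cache[key] = value
        return value
    if existing != value:
        raise ValueError(f"conflicting cache entries for key {key!r}")
    return existing
```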
PiperOrigin-RevId: 327653747
Change-Id: Ibb5f98e0916721bc50b67241b7fb947472398ff1
- Replacing SetDynamicBinding with SetDimensionSize models the information directly in the IR, which makes problems easier to reproduce just by looking at the HLO graph.
- This is one of the last few places that use SetDynamicBinding; after the cleanup, we should be able to remove this old API.
PiperOrigin-RevId: 327057424
Change-Id: I7fbadef18a9cd076c12fc61a53310311498416a0
This is needed for invoking the MLIR tf2xla bridge from xla_compiler.
This CL breaks apart items from xla_compiler into individual build targets,
which are then depended on from the MLIR TF bridge.
PiperOrigin-RevId: 323640340
Change-Id: I78b972503db9e7b5254014ca7e889005490d8339
Previously, the XLA parameter number was incorrectly assumed to
correspond to the index in the vector of `XlaCompiler::Argument`.
This is not correct, since not all `XlaCompiler::Argument`s become arguments to
the compiler: notably, constants and uninitialized resource variables do not.
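Since constants and uninitialized resource variables are skipped, the parameter number must be derived by counting only the arguments that actually become compiler parameters. A hypothetical sketch of that mapping (the kind strings stand in for `XlaCompiler::Argument` kinds):

```python
# Stand-ins for argument kinds; only KIND_PARAMETER becomes an XLA parameter.
KIND_PARAMETER = "parameter"
KIND_CONSTANT = "constant"
KIND_UNINITIALIZED_RESOURCE = "uninitialized_resource"


def argument_to_parameter_index(argument_kinds):
    """Returns a map from argument index to XLA parameter number.

    Arguments that do not become compiler parameters (constants,
    uninitialized resources) simply have no entry in the map.
    """
    mapping = {}
    next_param = 0
    for arg_index, kind in enumerate(argument_kinds):
        if kind == KIND_PARAMETER:
            mapping[arg_index] = next_param
            next_param += 1
    return mapping
```

Assuming argument index equals parameter number, as the old code did, goes wrong as soon as a constant or uninitialized resource precedes a real parameter in the vector.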
PiperOrigin-RevId: 321709603
Change-Id: I730fd6385949c360b2b831318a5b59c08f8362ef
The code surrounding the handling of _noinline functions is very rarely hit,
and as a result is not well tested. For now, the better approach is to follow
a more well-lit codepath and try to minimize the use of _noinline functions.
As a starting point, inline functions marked _noinline when they appear
inside force-compiled blocks.
PiperOrigin-RevId: 313280139
Change-Id: I9f2d9b95d4bfe15eb2acea2a3d101b82355c14d5
The code surrounding the handling of _noinline functions is very rarely hit,
and as a result is not well tested. For now, the better approach is to follow
a more well-lit codepath and try to minimize the use of _noinline functions.
As a starting point, inline functions marked _noinline when they appear
inside force-compiled blocks.
PiperOrigin-RevId: 313256383
Change-Id: If2f60aac933ac8e27f3dcb65bf6b389611c45bd7
This change modifies these includes to point to
"tensorflow/core/common_runtime/graph_constructor.h" instead. This change will enable us to remove the accidental dependency from //tensorflow/core/graph to //tensorflow/core/common_runtime.
PiperOrigin-RevId: 309035649
Change-Id: I2af0fdd6a6ccc4ae8d351a9117a69b6fc80c22e9
Sharding is present with model parallelism. Depending on what type of sharding is present, argument/result shapes and layouts need to be updated. ShapeRepresentationFn and shardings are used to determine the new shapes and layouts.
PiperOrigin-RevId: 303182568
Change-Id: I4185c1ae12de618b0b2ce9c07d2cd795c4e329b8
This call can be reused when determining argument layouts with sharding.
PiperOrigin-RevId: 302111008
Change-Id: I3607e41dc987e348e8405b96f09ebc549a8427bc
XlaCompilationCache is the only user of single-op compilation, so we can move single-op handling into the cache. This will allow MLIR-based on-demand compilation to reuse this logic in a follow-up change.
PiperOrigin-RevId: 300799049
Change-Id: I50d3f258e815cbc2caa6315eff0d902695146537
1. The previous approach might produce different layouts for computation.GetProgramShape() and xla_output_shape: it applied shape_representation_fn to xla_output_shape but not to the entry computation's program shape. Having these differ is often confusing, and can make it hard to reproduce a bug from an HLO dump that lacks the HloModuleConfig.
2. Output shapes were not updated with layout when there is sharding.
3. The updated value of a resource did not preserve the fast_mem annotation on the argument.
PiperOrigin-RevId: 295811071
Change-Id: I801a46d3039b2349dd0196cbc14ec3d9a8211d55
- When a resource update is present, automatically alias the input and output.
- Also fix an issue where the input/output proto config is overwritten.
PiperOrigin-RevId: 294984983
Change-Id: I45e96513dfeaa91f523db63837355b698bd2fb85
After parameter sharding, per-core arguments might have different layouts. In the XLA compiler we can no longer deduce the layout of a sharded parameter (because we can no longer access shape_representation_fn), so we override the XLA parameter layout with the sharded parameter layout.
In XlaDeviceContext, CopyCPUTensorToDevice() uses shape_representation_fn(cpu_tensor_shape) as the device tensor shape, so we must use the same shape as the XLA compiler's input shape. For CopyDeviceTensorToCPU(), the device tensor shape is defined by the XLA compiler directly, so nothing needs to change there.
PiperOrigin-RevId: 284812560
Change-Id: I567f180a8035ff71982d49910b84c98d07eb25d1
Constant folding inside `XlaCompiler::CompileGraph` is not necessary for
correctness, but is a performance optimization. This optimization is not
necessary though: the XLA compiler performs constant folding in any case (also
using HloEvaluator), and in some cases constant folding in the bridge leads to
severe performance issues (1.5+hrs compile time): the XLA constant evaluator
does not perform folding on "broadcast" and "iota" operations, which can be
overly expensive for the interpreter.
This change pushes calculations which previously went through the HLO
interpreter to the corresponding HLO backend.
Consequently, numeric stability changes due to some optimizations performed
by the backends (namely: fast-math optimizations on the CPU backend, the
A / B => A * (1 / B) rewrite, and "-nvptx-prec-divf32=1" on GPU).
Additionally, reduction numerics are different: float reductions on the
interpreter use a double accumulator, while float reductions in XLA use a
float accumulator, and on non-CPU backends the error grows linearly with
the size of the input.
PiperOrigin-RevId: 281554507
Change-Id: Ic58c547727fc0cb9a93bfc7eb2db763dc1e8b02e
We avoid recreating a ScopedStepContainer by storing one for reuse in
the KernelAndDeviceOp & KernelAndDeviceFunc classes. Further, we can
avoid doing a resource manager lookup during cleanup by adding a
dirty flag to indicate whether the ScopedStepContainer was accessed.
In addition, we simplify the signature of MakeResourceHandle by avoiding
the need to pass in the entire OpKernelContext object.
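The reuse-plus-dirty-flag pattern above can be sketched as follows (class and method names are illustrative stand-ins, not the actual TensorFlow types):

```python
class ReusableStepContainer:
    """Container created once and reused across kernel invocations.

    Cleanup (here, clearing the resource dict) runs only if the
    container was actually touched since the last clean, so untouched
    steps skip the expensive lookup-and-clear entirely.
    """

    def __init__(self):
        self._resources = {}
        self._dirty = False
        self.cleanups = 0  # instrumentation for this example only

    def lookup_or_create(self, name, factory):
        """Any access marks the container dirty."""
        self._dirty = True
        if name not in self._resources:
            self._resources[name] = factory()
        return self._resources[name]

    def clean(self):
        if not self._dirty:
            return  # untouched since last clean: nothing to do
        self._resources.clear()
        self._dirty = False
        self.cleanups += 1
```

The dirty flag trades one boolean check per step for a resource-manager cleanup that would otherwise run unconditionally.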
PiperOrigin-RevId: 281110991
Change-Id: I0a186583a1ff50b08bf68c18cfb99c912e05386d
Make canonicalization and signature generation faster.
Added benchmark for XlaCompilationCache::BuildSignature to measure time
taken to build a signature for the cache.
Base is this CL with just the changes to add the benchmark in
xla_compilation_cache_test.cc, New is this whole CL.
Run on desktop machine (40 X 2793 MHz CPUs); 2019-09-17T08:30:04.125894664-07:00
CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB
Benchmark Base (ns) New (ns) Improvement
----------------------------------------------------------------------------
BM_BuildSignature/0 226 87 +61.5%
BM_BuildSignature/1 337 171 +49.3%
BM_BuildSignature/2 504 259 +48.6%
BM_BuildSignature/5 1008 592 +41.3%
BM_BuildSignature/10 1751 1238 +29.3%
RELNOTES: n/a
PiperOrigin-RevId: 276289188
Change-Id: Ia47343203f6ac587a921a92f86c2428dd04db2a7
Also pass the ConfigProto through distributed function calls both in the standard
graph registration mode and in the new eager master setup.
The PFLR stores a std::optional<ConfigProto> instead of a pointer because it may be created with a pointer that would dangle after its creation. At the same time, we need to know whether a ConfigProto was available at creation time, which is why it is a std::optional rather than a plain copy. In contrast, the FLR gets a pointer directly because it is given a valid pointer that will outlast it in all cases.
PiperOrigin-RevId: 272763578
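The ownership distinction above (copy-into-optional for the long-lived PFLR, borrowed pointer for the shorter-lived FLR) can be sketched with Python stand-ins; class names and fields are illustrative, with a dict playing the role of the ConfigProto:

```python
class ProcessFLR:
    """Copies the config at construction, like std::optional<ConfigProto>.

    None means "no config was supplied", which stays distinguishable
    from an empty config, and the copy cannot dangle if the caller's
    object is destroyed later.
    """

    def __init__(self, config=None):
        self._config = dict(config) if config is not None else None

    def has_config(self):
        return self._config is not None

    def config(self):
        return self._config


class FLR:
    """Borrows a reference, like the raw ConfigProto* in the FLR.

    Valid only because the caller guarantees the config outlives this
    object; no copy is made.
    """

    def __init__(self, config_ref):
        self._config_ref = config_ref

    def config(self):
        return self._config_ref
```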