New order:
MlirBridgePass: Phase 0, before all other passes
Passes that were originally at Phase 0 move to Phase 10
Passes that were originally at Phase 1 move to Phase 20
Passes that were originally at Phase 20+ move to Phase 30+
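Roughly, the reordering looks like the following at the registration sites. This is only a sketch: the grouping constant and SomeExistingPass are placeholders, not the actual registrations, and the pass class headers are omitted.

  // Sketch only; assumes REGISTER_OPTIMIZATION from
  // tensorflow/core/common_runtime/optimization_registry.h.
  #include "tensorflow/core/common_runtime/optimization_registry.h"

  // MlirBridgePass takes Phase 0, so it runs before every other pass in its
  // grouping.
  REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 0,
                        MlirBridgePass);

  // A pass that used to sit at Phase 0 is bumped to Phase 10 (Phase 1
  // becomes Phase 20, Phase 20+ becomes Phase 30+).
  REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 10,
                        SomeExistingPass);  // placeholder name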
PiperOrigin-RevId: 282394988
Change-Id: Ief93072c52fcc073ceb0998271e1e4d5ad2d1f74
This pass uses some heuristics to add scopes to nodes to guide the
clustering results. Currently, the only heuristic is to preserve the
parallelism between TensorFlow pipeline stages.
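As a rough illustration of the kind of annotation the pass adds (the attribute name and the per-stage naming below are assumptions, not the pass's actual code):

  #include <string>
  #include <vector>
  #include "tensorflow/core/graph/graph.h"

  // Tag every node of one pipeline stage with the same scope so that
  // clustering keeps different stages in separate clusters and preserves
  // their parallelism.
  void ScopeStage(const std::vector<tensorflow::Node*>& stage_nodes,
                  int stage_index) {
    const std::string scope = "stage_" + std::to_string(stage_index);
    for (tensorflow::Node* n : stage_nodes) {
      n->AddAttr("_XlaInternalScope", scope);  // attribute name assumed
    }
  }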
1. This is only required for XLA, so it makes sense to move it into XlaCompiler;
2. We need it inside the XLA compiler so that TPU eager mode works (TPU eager mode does not call graph rewrite passes).
PiperOrigin-RevId: 248432264
TF/XLA bridge expects FunctionDef to satisfy the following rules:
1. DT_RESOURCE arguments always come last;
2. DT_RESOURCE values are never returned.
But functions defined by TensorFlow might not satisfy these rules.
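Expressed as a predicate over the function's signature, the two invariants are (a sketch, not the bridge's actual code):

  #include "tensorflow/core/framework/function.pb.h"
  #include "tensorflow/core/framework/types.pb.h"

  bool SatisfiesBridgeResourceRules(const tensorflow::FunctionDef& fdef) {
    const auto& sig = fdef.signature();
    // Rule 1: once a DT_RESOURCE argument appears, every later argument must
    // also be DT_RESOURCE, i.e. resource arguments come last.
    bool seen_resource_arg = false;
    for (const auto& arg : sig.input_arg()) {
      if (arg.type() == tensorflow::DT_RESOURCE) {
        seen_resource_arg = true;
      } else if (seen_resource_arg) {
        return false;
      }
    }
    // Rule 2: no DT_RESOURCE return values.
    for (const auto& ret : sig.output_arg()) {
      if (ret.type() == tensorflow::DT_RESOURCE) return false;
    }
    return true;
  }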
PiperOrigin-RevId: 244714052
I don't particularly love this approach since IMO it is papering over a problem
in mark_for_compilation_pass -- mark_for_compilation_pass should instead
rematerialize constants as necessary to create larger clusters. But this is
what fits in best with the scheme we have today.
PiperOrigin-RevId: 236729916
It was reverted because the pass creates slices from DT_INT64 tensors, which
weren't supported on the GPU. We now support these slices, so the pass can be
re-enabled.
PiperOrigin-RevId: 220738294
Increases the amount of dynamism representable by XLA clusters by rewriting the
TensorFlow graph. See the header for a description.
This pass, combined with jit/partially_decluster_pass, reduces the number of
unnecessary cluster recompilations in some common cases.
The CL is organized as follows:
- cc/framework/scope* and core/graph/node_builder are modified so that new
nodes can now be automatically put in an XLA cluster using
Scope::WithXlaCluster (see the sketch after this list).
- The pass is implemented in jit/increase_dynamism_for_auto_jit_pass.
- In jit/jit_compilation_pass_registration, the new pass is registered to run
between MarkForCompilationPass and PartiallyDeclusterPass.
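A minimal usage sketch of the Scope change from the first bullet; the exact signature of WithXlaCluster and the cluster name used here are assumptions:

  #include "tensorflow/cc/framework/scope.h"
  #include "tensorflow/cc/ops/standard_ops.h"

  // Every node created under `in_cluster` carries the XLA cluster
  // assignment, so the rewrite can emit new nodes straight into an existing
  // cluster instead of patching attributes afterwards.
  void AddNodeIntoCluster(const tensorflow::Scope& root) {
    tensorflow::Scope in_cluster = root.WithXlaCluster("cluster_0");
    auto size =
        tensorflow::ops::Const(in_cluster.WithOpName("slice_size"), 42);
    (void)size;
  }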
PiperOrigin-RevId: 218907734
This CL splits the functionality in XlaLaunch into two separate operations:
- XlaCompile, responsible for compiling a TF function into a LocalExecutable
- XlaRun, responsible for executing a LocalExecutable created by XlaCompile
This CL is a stepping stone towards implementing lazy compilation for TF/XLA.
The XlaCompile op is spec'ed to return a boolean indicating whether the
compilation was successful. Right now that boolean is always set to true by
XlaCompile and its value is otherwise ignored, but in the future it will be used
to indicate whether the TF function was compiled or not, and thus whether we
should execute XlaRun or just directly call the TF function.
XlaLaunch still exists, and will be created by create_xla_launch_op.cc. In the
future we may consider removing it altogether. build_xla_launch_ops.cc, now
renamed to build_xla_ops.cc, creates an XlaCompile/XlaRun pair instead of
XlaLaunch.
This CL is organized as follows:
- jit/ops/xla_ops.cc gets two new XLA-specific operations, XlaCompile and
XlaRun, described above. XlaRun redundantly takes the must-be-constant
inputs to the TensorFlow cluster to keep the implementation simple (simple in
the sense of being similar to XlaLaunch), but I will remove this in a subsequent
cleanup CL.
- jit/kernels/xla_ops.cc implements XlaCompile and XlaRun in a fairly
straightforward manner. XlaCompile compiles the TF function, puts it in a
process-global store, XlaExecutableClosureStore, and produces an int64 key
(see the sketch after this list). XlaRun uses the key to read out the
LocalExecutable and execute it. I'm not sure if XlaExecutableClosureStore
should be a resource like XlaCompilationCache; I did not immediately see any
reason to make it so.
- There are changes to the various _device files to register XlaCompile and
XlaRun for the XLA_* devices.
- Finally, I had to fix some tests that were expecting XlaLaunch in the
execution timeline.
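To make the key hand-off concrete, here is a rough sketch of the kind of process-global store the second bullet describes; the member names and the payload struct are illustrative, not the actual jit/kernels/xla_ops.cc code:

  #include <cstdint>
  #include <unordered_map>
  #include <utility>
  #include "absl/synchronization/mutex.h"

  // Illustrative payload; the real closure wraps the xla::LocalExecutable
  // produced by XlaCompile plus whatever XlaRun needs to invoke it.
  struct ExecutableClosure { /* ... */ };

  class XlaExecutableClosureStore {
   public:
    using KeyT = int64_t;

    static XlaExecutableClosureStore* Global() {
      static XlaExecutableClosureStore* store = new XlaExecutableClosureStore;
      return store;
    }

    // XlaCompile side: stash the compiled closure and return the key that
    // flows through the graph as a tensor into XlaRun.
    KeyT Produce(ExecutableClosure closure) {
      absl::MutexLock lock(&mu_);
      KeyT key = next_key_++;
      closures_.emplace(key, std::move(closure));
      return key;
    }

    // XlaRun side: look the closure up by key and remove it so the store
    // does not grow without bound.
    ExecutableClosure Consume(KeyT key) {
      absl::MutexLock lock(&mu_);
      auto it = closures_.find(key);
      ExecutableClosure closure = std::move(it->second);
      closures_.erase(it);
      return closure;
    }

   private:
    absl::Mutex mu_;
    KeyT next_key_ = 0;  // guarded by mu_
    std::unordered_map<KeyT, ExecutableClosure> closures_;  // guarded by mu_
  };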
PiperOrigin-RevId: 213895405
"Partial declustering" is defined as cloning a clustered node outside its
cluster and transferring some of its outgoing edges to the cloned version.
Some TensorFlow operations expect their inputs in host memory and, because XLA
only produces device tensors, such nodes can incur a device-to-host copy if not
clustered along with their producers. Regular TensorFlow operations, on the
other hand, may produce their outputs in host memory, so cloning the producer
outside the cluster and pointing the host-memory-expecting consumers at the
cloned version lets us avoid the memcpy.
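A rough sketch of the core graph edit (the attribute name and the way host-memory consumers are identified are assumptions; the real logic lives in jit/partially_decluster_pass.cc):

  #include <vector>
  #include "tensorflow/core/graph/graph.h"

  void PartiallyDecluster(
      tensorflow::Graph* graph, tensorflow::Node* producer,
      const std::vector<const tensorflow::Edge*>& edges_to_host_consumers) {
    // Clone the producer and strip its cluster assignment so the clone runs
    // as a regular TensorFlow op, which may produce host-memory outputs.
    // (The real pass also gives the clone a fresh, unique name.)
    tensorflow::Node* clone = graph->CopyNode(producer);
    clone->ClearAttr("_XlaCluster");  // attribute name assumed

    // Mirror the original producer's incoming edges onto the clone.
    for (const tensorflow::Edge* in : producer->in_edges()) {
      graph->AddEdge(in->src(), in->src_output(), clone, in->dst_input());
    }

    // Move only the host-memory-expecting consumers over to the clone; all
    // other consumers keep reading from the clustered original.
    for (const tensorflow::Edge* e : edges_to_host_consumers) {
      tensorflow::Node* dst = e->dst();
      const int dst_input = e->dst_input();
      const int src_output = e->src_output();
      graph->RemoveEdge(e);
      graph->AddEdge(clone, src_output, dst, dst_input);
    }
  }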
PiperOrigin-RevId: 208710603
XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators.
XLA is still experimental; we are releasing it early to get the community involved.
Change: 143990941