Commit Graph

23 Commits

Author SHA1 Message Date
George Karpenkov
9771765f41 [TF/XLA] Force all tensors which need to be constant during XLA compilation to be located on the host
Otherwise, this leads to strange crashes during compilation.

PiperOrigin-RevId: 304226917
Change-Id: Ia2f1e77b13a25c7e15f009787af81f93b90e8bca
2020-04-01 11:34:32 -07:00
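
A minimal sketch of the idea behind this change, assuming a TensorFlow graph-rewrite context; the IsMustBeConstant predicate is a hypothetical stand-in for XLA's compile-time-constant argument analysis:

    // Pin every node feeding a must-be-constant XLA input to the host (CPU)
    // device, so its value can be read during compilation without a device
    // round trip. IsMustBeConstant() is a hypothetical helper.
    #include "tensorflow/core/graph/graph.h"

    bool IsMustBeConstant(const tensorflow::Node* n);  // assumed helper

    void PinMustBeConstantNodesToHost(tensorflow::Graph* graph) {
      for (tensorflow::Node* n : graph->op_nodes()) {
        if (IsMustBeConstant(n)) {
          n->set_assigned_device_name(
              "/job:localhost/replica:0/task:0/device:CPU:0");
        }
      }
    }
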
Yanan Cao
1e71cf27dc Move MLIR bridge pass to before all other passes.
New order:
 - MlirBridgePass: Phase 0, before all other passes
 - Passes originally at Phase 0 are moved to Phase 10
 - Passes originally at Phase 1 are moved to Phase 20
 - Passes originally at Phase 20+ are moved to Phase 30+
PiperOrigin-RevId: 282394988
Change-Id: Ief93072c52fcc073ceb0998271e1e4d5ad2d1f74
2019-11-25 12:30:15 -08:00
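
As a rough illustration of the new ordering, using TensorFlow's graph optimization pass registry; the grouping, the phase numbers, and the second pass named here are assumptions for illustration, not the actual registration code:

    // Lower phase numbers within a grouping run earlier.
    #include "tensorflow/core/common_runtime/optimization_registry.h"

    // Phase 0: the MLIR bridge now runs before every other pass in the group.
    REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 0,
                          MlirBridgePass);
    // A pass that was previously registered at phase 0 moves to phase 10, etc.
    REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 10,
                          MarkForCompilationPass);
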
Trent Lo
b1dd38b13a [TF:XLA] Add cluster_scoping_pass.
This pass uses heuristics to add scopes to nodes to guide the
clustering results. Currently, the only heuristic is to preserve the
parallelism between TensorFlow pipeline stages.
2019-08-13 12:48:03 -07:00
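
A minimal sketch of what adding such a scope might look like; the attribute name below is an assumption used only for illustration:

    // Tag a node with a pipeline-stage scope so the clustering pass keeps
    // different stages in different clusters. "_XlaInternalScope" is an
    // assumed attribute name.
    #include <string>
    #include "tensorflow/core/graph/graph.h"

    void AddClusterScope(tensorflow::Node* n, const std::string& stage) {
      n->AddAttr("_XlaInternalScope", stage);
    }
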
Sanjoy Das
df54488c64 Roll forward "Add an XLA "activity listener" mechanism."
PiperOrigin-RevId: 253323502
2019-06-14 17:14:19 -07:00
Gunhan Gulsoy
eed9bfdeb0 Automated rollback of commit 5f2291877d
PiperOrigin-RevId: 253284324
2019-06-14 13:26:05 -07:00
Sanjoy Das
5f2291877d Add an XLA "activity listener" mechanism.
This allows various components to listen to auto-clustering and JIT compilation
events in TensorFlow.

PiperOrigin-RevId: 253265614
2019-06-14 11:37:35 -07:00
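
A minimal, generic sketch of the listener pattern the message describes; the type and function names below are assumptions, not the actual xla_activity_listener API:

    #include <functional>
    #include <mutex>
    #include <string>
    #include <vector>

    // One "activity" record per auto-clustering or JIT compilation event.
    struct JitActivity {
      std::string cluster_name;
      bool compiled_ok;
    };

    // Process-wide registry: components register callbacks, and the JIT
    // broadcasts each event to every registered listener.
    class JitActivityListeners {
     public:
      void Register(std::function<void(const JitActivity&)> listener) {
        std::lock_guard<std::mutex> lock(mu_);
        listeners_.push_back(std::move(listener));
      }
      void Broadcast(const JitActivity& activity) {
        std::lock_guard<std::mutex> lock(mu_);
        for (const auto& listener : listeners_) listener(activity);
      }

     private:
      std::mutex mu_;
      std::vector<std::function<void(const JitActivity&)>> listeners_;
    };
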
JiangXIAO
bf19e4a398 bugfix: this file has been moved to another dir 2019-06-12 18:41:45 +08:00
Tong Shen
83668b0826 Move function argument rearrangement from a graph pass to XlaCompiler.
1. This is only required for XLA, so it makes sense to move it into XlaCompiler;
2. We need it inside XlaCompiler so TPU eager mode works (TPU eager mode does not run graph rewrite passes).

PiperOrigin-RevId: 248432264
2019-05-15 16:58:38 -07:00
Tong Shen
5399872207 Add a RearrangeFunctionArgumentPass to rearrange function _Arg/_Retval nodes.
The TF/XLA bridge expects a FunctionDef to satisfy the following rules:
1. DT_RESOURCE arguments always come last;
2. DT_RESOURCE values are never returned.
But functions defined by TensorFlow might not satisfy these rules.

PiperOrigin-RevId: 244714052
2019-04-22 12:59:38 -07:00
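
As a sketch of the invariant being described, here is an illustrative checker over a FunctionDef signature (not the rearrangement pass itself):

    #include "tensorflow/core/framework/function.pb.h"
    #include "tensorflow/core/framework/types.pb.h"

    // Returns true iff DT_RESOURCE args come last and nothing DT_RESOURCE is
    // returned, i.e. the function already satisfies the bridge's rules.
    bool SatisfiesBridgeArgumentRules(const tensorflow::FunctionDef& fdef) {
      bool seen_resource_arg = false;
      for (const auto& arg : fdef.signature().input_arg()) {
        if (arg.type() == tensorflow::DT_RESOURCE) {
          seen_resource_arg = true;
        } else if (seen_resource_arg) {
          return false;  // Rule 1 violated: non-resource arg after a resource.
        }
      }
      for (const auto& ret : fdef.signature().output_arg()) {
        if (ret.type() == tensorflow::DT_RESOURCE) {
          return false;  // Rule 2 violated: DT_RESOURCE return value.
        }
      }
      return true;
    }
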
Sanjoy Das
ef5519ab43 Add a debug-only pass that introduces a small error to a designated TF node
This lets us check how susceptible a model or a unit test is to
floating-point differences.

PiperOrigin-RevId: 240824222
2019-03-28 12:25:47 -07:00
Sanjoy Das
0c9cb95315 Add an optimization pass that clones Constant nodes to make larger clusters
I don't particularly love this approach since IMO it is papering over a problem
in mark_for_compilation_pass -- mark_for_compilation_pass should instead
rematerialize constants as necessary to create larger clusters.  But this is
what fits best with the scheme we have today.

PiperOrigin-RevId: 236729916
2019-03-04 15:25:22 -08:00
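
A minimal sketch of the cloning idea, assuming the JIT's "_XlaCluster" node attribute; the helper below is illustrative, not the pass itself:

    #include <string>
    #include "tensorflow/core/graph/graph.h"

    // Give a cluster its own copy of a shared Constant so the constant no
    // longer keeps the clusters sharing it apart. Callers then re-point that
    // cluster's consumers at the copy.
    tensorflow::Node* CloneConstantForCluster(tensorflow::Graph* graph,
                                              tensorflow::Node* constant,
                                              const std::string& cluster) {
      tensorflow::Node* copy = graph->CopyNode(constant);
      copy->set_name(constant->name() + "/cloned_for_" + cluster);
      copy->AddAttr("_XlaCluster", cluster);
      return copy;
    }
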
Sanjoy Das
485923680d Re-enable IncreaseDynamismForAutoJitPass
It was reverted because the pass creates slices from DT_INT64 tensors which
weren't supported on the GPU.  We now support these slices so the pass can be
re-enabled.

PiperOrigin-RevId: 220738294
2018-11-08 19:03:31 -08:00
Sanjoy Das
8efee06786 Disable IncreaseDynamismForAutoJitPass
It creates slices of index type DT_INT64 which do not have kernels on GPU.

PiperOrigin-RevId: 219080243
2018-10-28 22:57:48 -07:00
Sanjoy Das
2c164ed32f Introduce a pass to increase the amount of dynamism supported by an XLA cluster
Increases the amount of dynamism representable by XLA clusters by rewriting the
TensorFlow graph.  See the header for a description.

This pass, combined with jit/partially_decluster_pass, reduces the number of
unnecessary cluster recompilations in some common cases.

The CL is organized as follows:

 - cc/framework/scope* and core/graph/node_builder are modified so that new
   nodes can now be automatically put in an XLA cluster using
   Scope::WithXlaCluster.

 - The pass is implemented in jit/increase_dynamism_for_auto_jit_pass.

 - In jit/jit_compilation_pass_registration, the new pass is registered to run
   between MarkForCompilationPass and PartiallyDeclusterPass.

PiperOrigin-RevId: 218907734
2018-10-26 14:01:34 -07:00
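
The Scope::WithXlaCluster hook mentioned above might be used along these lines; the exact signature and the op being built are assumptions for illustration:

    #include <string>
    #include "tensorflow/cc/framework/scope.h"
    #include "tensorflow/cc/ops/array_ops.h"

    // Build a replacement Slice whose node is placed into the same XLA
    // cluster as the node it replaces.
    tensorflow::Output BuildSliceInCluster(const tensorflow::Scope& root,
                                           const std::string& cluster,
                                           tensorflow::Input input,
                                           tensorflow::Input begin,
                                           tensorflow::Input size) {
      tensorflow::Scope in_cluster = root.WithXlaCluster(cluster);
      return tensorflow::ops::Slice(in_cluster, input, begin, size);
    }
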
Sanjoy Das
4d39844c1d Split XlaLaunch into XlaCompile and XlaRun; NFC
This CL splits the functionality in XlaLaunch into two separate operations:

 - XlaCompile, responsible for compiling a TF function into a LocalExecutable
 - XlaRun, responsible for executing a LocalExecutable created by XlaCompile

This CL is a stepping stone towards implementing lazy compilation for TF/XLA.
The XlaCompile op is spec'ed to return a boolean indicating whether the
compilation was successful.  Right now that boolean is always set to true by
XlaCompile and its value is otherwise ignored, but in the future it will be used
to indicate whether the TF function was compiled or not, and thus whether we
should execute XlaRun or just directly call the TF function.

XlaLaunch still exists, and will be created by create_xla_launch_op.cc.  In the
future we may consider removing it altogether.  build_xla_launch_ops.cc, now
renamed to build_xla_ops.cc, creates an XlaCompile/XlaRun pair instead of
XlaLaunch.

This CL is organized as follows:

 - jit/ops/xla_ops.cc gets two new XLA-specific operations, XlaCompile and
   XlaRun, described above.  XlaRun redundantly takes the must-be-constant
   inputs to the TensorFlow cluster to keep the implementation simple (simple in
   the sense of being similar to XlaLaunch), but I will remove this in a subsequent
   cleanup CL.

 - jit/kernels/xla_ops.cc implements XlaCompile and XlaRun in a fairly
   straightforward manner.  XlaCompile compiles the TF function, puts it in a
   process-global store, XlaExecutableClosureStore, and produces an int64 key.
   XlaRun uses the key to read out the LocalExecutable and execute it.  I'm not
   sure if XlaExecutableClosureStore should be a resource like
   XlaCompilationCache; I did not immediately see any reason to make it so.

 - There are changes to the various _device files to register XlaCompile and
   XlaRun for the XLA_* devices.

 - Finally, I had to fix some tests that were expecting XlaLaunch in the
   execution timeline.

PiperOrigin-RevId: 213895405
2018-09-20 15:45:36 -07:00
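
A minimal sketch of the compile/run handoff described above: the "compile" op produces a key into a process-global store and the "run" op consumes it. The class below is illustrative only; the real XlaExecutableClosureStore holds compiled-executable state rather than a string:

    #include <cstdint>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <utility>

    class ExecutableClosureStore {
     public:
      using KeyT = int64_t;

      // Called by the "compile" op: stash the closure, hand back a key.
      KeyT Produce(std::string closure) {
        std::lock_guard<std::mutex> lock(mu_);
        KeyT key = next_key_++;
        closures_.emplace(key, std::move(closure));
        return key;
      }

      // Called by the "run" op: retrieve and remove the closure for a key.
      std::string Consume(KeyT key) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = closures_.find(key);
        std::string closure = std::move(it->second);
        closures_.erase(it);
        return closure;
      }

      static ExecutableClosureStore* Global() {
        static auto* instance = new ExecutableClosureStore;
        return instance;
      }

     private:
      std::mutex mu_;
      KeyT next_key_ = 0;
      std::unordered_map<KeyT, std::string> closures_;
    };
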
A. Unique TensorFlower
29b56bde1e Automated rollback of commit ac60b46e2c
PiperOrigin-RevId: 212896336
2018-09-13 16:12:39 -07:00
Tong Shen
37ddb13ece Roll forward change "Move control flow functionalization as a graph optimization pass, instead of a step in XlaCompiler.".
PiperOrigin-RevId: 212657932
2018-09-12 10:07:48 -07:00
Yanan Cao
ac60b46e2c Automated rollback of commit 45965cfd8b
PiperOrigin-RevId: 212465918
2018-09-11 09:38:56 -07:00
A. Unique TensorFlower
45965cfd8b Graph optimization pass that creates XlaLaunch ops for the computations that have been explicitly marked to be compiled via xla.compile()
PiperOrigin-RevId: 212407112
2018-09-11 00:54:33 -07:00
Tong Shen
b40ace8f28 Automated rollback of commit a3776a234f
PiperOrigin-RevId: 212182923
2018-09-09 09:54:26 -07:00
Tong Shen
a3776a234f Move control flow functionalization as a graph optimization pass, instead of a step in XlaCompiler.
PiperOrigin-RevId: 212164482
2018-09-09 01:41:14 -07:00
Sanjoy Das
4ab8a1056a Avoid device to host copies by "partially declustering" certain nodes.
"Partial declustering" is defined as cloning a clustered node outside its
cluster and transferring some of its outgoing edges to the cloned version.

Some TensorFlow operations expect their inputs in host memory and, because XLA
only produces device tensors, such nodes can incur a device-to-host copy if not
clustered along with their producers.  TensorFlow operations, on the other hand,
may produce their outputs in host memory, so cloning the producer outside the
cluster and pointing the host-memory-expecting consumers at the cloned version
lets us avoid the memcpy.

PiperOrigin-RevId: 208710603
2018-08-14 14:19:45 -07:00
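
A minimal sketch of the cloning-and-edge-rewiring step described above, assuming the JIT's "_XlaCluster" node attribute and a placeholder predicate for "this consumer wants this input in host memory":

    #include <vector>
    #include "tensorflow/core/graph/graph.h"

    bool ConsumerWantsHostInput(const tensorflow::Edge* e);  // assumed helper

    // Clone 'producer' outside its cluster and point the host-memory-expecting
    // consumers at the clone, so their inputs are produced on the host.
    void PartiallyDecluster(tensorflow::Graph* graph, tensorflow::Node* producer) {
      tensorflow::Node* clone = graph->CopyNode(producer);
      clone->set_name(producer->name() + "/declustered");
      clone->ClearAttr("_XlaCluster");  // The clone lives outside the cluster.

      // The clone needs the same inputs as the original.
      for (const tensorflow::Edge* e : producer->in_edges()) {
        graph->AddEdge(e->src(), e->src_output(), clone, e->dst_input());
      }

      // Move only the host-memory-expecting output edges over to the clone.
      std::vector<const tensorflow::Edge*> to_move;
      for (const tensorflow::Edge* e : producer->out_edges()) {
        if (ConsumerWantsHostInput(e)) to_move.push_back(e);
      }
      for (const tensorflow::Edge* e : to_move) {
        graph->AddEdge(clone, e->src_output(), e->dst(), e->dst_input());
        graph->RemoveEdge(e);
      }
    }
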
Peter Hawkins
1e67c90e2c Initial open-source release of XLA: Accelerated Linear Algebra.
XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators.

XLA is still experimental; we are releasing it early to get the community involved.
Change: 143990941
2017-01-09 12:26:35 -08:00