[XLA/GPU] Sync the XLA/GPU -> MLIR doc.

PiperOrigin-RevId: 322214657
Change-Id: I8326adca5cd1d388e95b7a1cdba7a34f6d6dbca0
Authored by Tim Shen on 2020-07-20 13:35:35 -07:00; committed by TensorFlower Gardener.
Parent: 24c4a6dd90
Commit: 9bf535aed3


@@ -74,7 +74,6 @@ We have several choices on how to lower the host-side part from LHLO:
* (Pro) easy to implement library calls (cuDNN, cuBLAS, cuFFT, etc.), as
TFRT ops are interpreted by C++ code.
* (Con) host side is under development and not tested.
* (Con) the JAX integration isn't clear from a runtime point of view.
* Jitted CPU code
* (Pro) great lower-ability. Create a few loops and conditions and it's
done.
@@ -84,8 +83,7 @@ We have several choices on how to lower the host-side part from LHLO:
dynamic loading, etc).
* Existing (interpreting) XLA runtime
Tentative conclusion: Use jitted CPU code during the transition, and optionally
adopt TFRT in the end.
Decision: adopt TFRT, but also support jitting CPU code in TFRT.
## Migrating Device LLVM IR (Task 3)
@@ -114,7 +112,7 @@ end state of each XLA op:
* (Cost) Will be throw-away work if we want to ultimately migrate to
Standard.
* (Benefit) It is easy and mechanical. Can be done in a short period.
* (Benefit) It doesn't benefit more compared to a).
* (Benefit) It doesn't benefit more compared to (1).
1. Refactor old emitters to be like LHLO -> MLIR GPU + Standard + Loops:
* (Cost) Lifting existing emitters to Standard introduces some challenges.
Pointers and GEPs need to be converted to MemRefs and SubViews. Ensuring
@@ -134,6 +132,19 @@ end state of each XLA op:
* (Benefit) unified stack; community support; portability; more
optimization potentials.
Conclusions:
* Don't go for (2); (1) or (3) are just better than (2). (2) costs more than
(1), since it requires a lot of mechanical refactoring, while with (1) we can
still achieve the goal of enabling XLA to pick up MLIR emitters, by doing
LHLO -> LLVM IR -> run legacy device emitters.
* ElementalIrEmitter ops go for (4), but not incrementally. There is no way to
do it op by op, because all elementally-emitted ops are connected into the
same graph. This work can also serve as a unification point of several
on-going forces (xla/service/mlir\_gpu, the kernel generator, Linalg).
* All other ops go for (1). As a stretch goal, they might be migrated to (3)
or (4); a sketch of what a (3)-style lowering looks like follows this list.
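
To make the (3)-style end state concrete, here is a minimal sketch of a single
elementwise op: the LHLO-level form operating on pre-allocated buffers, and the
same computation after lowering into Standard + Loops. The op spellings
(`lhlo.add`, `scf.for`, `load`/`store`) and the buffer shape are illustrative
assumptions, not the exact forms used in the codebase.

```mlir
// Input: an LHLO-level elementwise add on pre-allocated buffers.
// (Generic op form is used since the exact LHLO spelling varies.)
func @add(%lhs: memref<8xf32>, %rhs: memref<8xf32>, %out: memref<8xf32>) {
  "lhlo.add"(%lhs, %rhs, %out)
      : (memref<8xf32>, memref<8xf32>, memref<8xf32>) -> ()
  return
}

// After lowering to Standard + Loops: an explicit loop with loads, an addf,
// and a store, ready for further lowering to the GPU/LLVM dialects.
func @add_lowered(%lhs: memref<8xf32>, %rhs: memref<8xf32>, %out: memref<8xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c8 = constant 8 : index
  scf.for %i = %c0 to %c8 step %c1 {
    %a = load %lhs[%i] : memref<8xf32>
    %b = load %rhs[%i] : memref<8xf32>
    %sum = addf %a, %b : f32
    store %sum, %out[%i] : memref<8xf32>
  }
  return
}
```
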
## Prioritization
While all three tasks mentioned above are parallelizable, under limited
@@ -210,26 +221,19 @@ The exact profiling can't be easily done for MLIR-generated ops, since:
### Step 3: (Task 2) Migrating Thunks
This step migrates all host ops and library calls. It will eliminate most of
the thunks and produce serializable MLIR instead.
There are roughly three kinds of thunks:
As a note, there are roughly three kinds of thunks:
* KernelThunk, which launches a kernel.
* Control flow thunks, which have host control flow logic (conditional, while,
for, sequence) and launch body kernels.
* Library thunks: cuDNN, cuBLAS, cuFFT, NCCL, etc.
The **bottom line** is to:
The plan is:
* Make Thunks (de)serializable.
* Help improve TFRT to a state where it can support these semantics.
* As the state improves, migrate individual thunks incrementally.
* Create a Thunk dialect that provides (de)serialize logic for all existing
C++-based Thunks.
* Change emitters to emit a graph of the Thunk dialect (a hypothetical sketch
follows this list).
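
As an illustration only, here is a hypothetical sketch of such a Thunk-dialect
graph, written in MLIR's generic op form. The `thunk` dialect, its op names,
and all attributes are invented for this example; the point is that each
existing C++ thunk kind (kernel, control flow, library) becomes an op that can
be (de)serialized instead of living only as a C++ object.

```mlir
// Hypothetical "thunk" dialect in MLIR generic op form. The dialect, op names,
// and attributes are invented for illustration; each op stands for one of the
// existing C++ thunk kinds and carries enough attributes to be serialized and
// re-materialized.
func @main(%a: memref<32x32xf32>, %b: memref<32x32xf32>, %out: memref<32x32xf32>) {
  // KernelThunk: launch a precompiled device kernel by name.
  "thunk.kernel"(%a, %b) {
    kernel_name = "fusion_3",
    launch_dimensions = [16, 16, 1, 2, 2, 1]
  } : (memref<32x32xf32>, memref<32x32xf32>) -> ()

  // Control-flow thunk: a while-style op whose region launches a body kernel.
  "thunk.while"() ({
    "thunk.kernel"(%out) {
      kernel_name = "while_body_7",
      launch_dimensions = [16, 16, 1, 2, 2, 1]
    } : (memref<32x32xf32>) -> ()
    "thunk.yield"() : () -> ()
  }) : () -> ()

  // Library thunk: a cuBLAS GEMM call described entirely by attributes.
  "thunk.gemm"(%a, %b, %out) {alpha = 1.0 : f64, beta = 0.0 : f64}
      : (memref<32x32xf32>, memref<32x32xf32>, memref<32x32xf32>) -> ()
  return
}
```
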
**Optionally**, we can relieve some thunks of their C++ implementations.
KernelThunk can lower to the GPU LaunchKernelOp. Control flow thunks can
leverage the CFG Dialect for loops and conditions, combined with
LaunchKernelOp. This optional step requires profiling and stream support.
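
Below is a rough sketch of the control-flow part of that optional step. It
assumes the kernel launch is hidden behind a declared function, standing in for
the GPU dialect's LaunchKernelOp (whose textual syntax has changed over time);
only the host-side loop structure is the point here.

```mlir
// Stand-in declaration for a kernel launch. The real lowering would use the
// GPU dialect's LaunchKernelOp; a plain call keeps this sketch syntax-stable.
func @launch_body_kernel(memref<1024xf32>)

// A For/While-style control flow thunk lowered to a host-side SCF loop that
// launches the body kernel on each iteration.
func @while_thunk_lowered(%buf: memref<1024xf32>, %trip_count: index) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  scf.for %i = %c0 to %trip_count step %c1 {
    call @launch_body_kernel(%buf) : (memref<1024xf32>) -> ()
  }
  return
}
```

A conditional thunk would lower analogously, e.g. to an scf.if (or plain CFG
branches) around the corresponding launch.
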
These action items are only partially ordered. The actual execution order /
engineering parallelism is to be evaluated as it goes.
### Step 4: (Task 3) Migrating ElementalIrEmitter