From 9bf535aed35a197714251135b9155bd998df68c4 Mon Sep 17 00:00:00 2001
From: Tim Shen <timshen@google.com>
Date: Mon, 20 Jul 2020 13:35:35 -0700
Subject: [PATCH] [XLA/GPU] Sync the XLA/GPU -> MLIR doc.

PiperOrigin-RevId: 322214657
Change-Id: I8326adca5cd1d388e95b7a1cdba7a34f6d6dbca0
---
 .../compiler/mlir/g3doc/xla_gpu_codegen.md    | 40 ++++++++++---------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md b/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md
index 2fe109c1783..8e7e605fc4c 100644
--- a/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md
+++ b/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md
@@ -74,7 +74,6 @@ We have several choices on how to lower the host-side part from LHLO:
     *   (Pro) easy to implement library calls (cuDNN, cuBLAS, cuFFT, etc), as
         TFRT ops are interpreted by C++ code.
     *   (Con) host side is under development and not tested.
-    *   (Con) the JAX integration isn’t clear from a runtime point of view
 *   Jitted CPU code
     *   (Pro) great lower-ability. Create a few loops and conditions and it's
         done.
@@ -84,8 +83,7 @@ We have several choices on how to lower the host-side part from LHLO:
         dynamic loading, etc).
 *   Existing (interpreting) XLA runtime
 
-Tentative conclusion: Use jitted CPU code during the transition, and optionally
-adopt TFRT in the end.
+Decision: adopt TFRT, but also support jitting CPU code in TFRT.
 
 ## Migrating Device LLVM IR (Task 3)
 
@@ -114,7 +112,7 @@ end state of each XLA op:
     *   (Cost) Will be throw-away work if we want to ultimately migrate to
         Standard.
     *   (Benefit) It is easy and mechanical. Can be done in a short period.
-    *   (Benefit) It doesn't benefit more compared to a).
+    *   (Benefit) It offers no additional benefit compared to (1).
 1.  Refactor old emitters to be like LHLO -> MLIR GPU + Standard + Loops:
     *   (Cost) Lifting existing emitters to Standard introduces some challenges.
         Pointers and GEPs need to be converted to MemRefs and SubViews. Ensuring
@@ -134,6 +132,19 @@ end state of each XLA op:
     *   (Benefit) unified stack; community support; portability; more
         optimization potentials.
 
+Conclusions:
+
+*   Don't go for (2). Both (1) and (3) are better than (2). (2) costs more than
+    (1), since it requires a lot of mechanical refactoring. With (1) we can
+    still achieve the goal of enabling XLA to pick up MLIR emitters, by
+    doing LHLO -> LLVM IR -> run legacy device emitters.
+*   ElementalIrEmitter ops go for (4), but not incrementally. There is no way to
+    do it op by op, because all elementally-emitted ops are connected into the
+    same graph. This work can also serve as a unification point for several
+    ongoing efforts (xla/service/mlir\_gpu, the kernel generator, Linalg).
+*   All other ops go for (1). As a stretch goal, they might be migrated to (3)
+    or (4).
+
 ## Prioritization
 
 While all three tasks mentioned above are parallelizable, under limited
@@ -210,26 +221,19 @@ The exact profiling can't be easily done for MLIR-generated ops, since:
 
 ### Step 3: (Task 2) Migrating Thunks
 
-This step migrates all host ops and library calls. This step will eliminate most
-of the thunks and produce serializable MLIR instead.
-
-There are roughly three kinds of thunks:
-
+As a note, there are roughly three kinds of thunks:
 *   KernelThunk, which launches a kernel.
 *   Control flow thunks, which has host control flow logic (conditional, while,
     for, sequence) and launch body kernels.
 *   Library thunks: cuDNN, cuBLAS, cuFFT, NCCL, etc.
 
-The **bottom line** is to:
+The plan is:
+*   Make Thunks (de)serializable.
+*   Help improve TFRT to a state where it can support these semantics.
+*   As the state improves, migrate individual thunks incrementally.
 
-*   Create a Thunk dialect that provides (de)serialize logic for all existing
-    C++-based Thunks.
-*   Change emitters to emit a graph of Thunk dialect.
-
-**Optionally**, we can relieve some thunks from C++ implementation. KernelThunk
-can lower to the GPU LaunchKernelOp. Control flow thunks can leverage the CFG
-Dialect for loops and conditions, combined with LaunchKernelOp. This optional
-step requires profiling and stream support.
+These action items are only partially ordered. The actual execution order and
+engineering parallelism will be evaluated as the work progresses.
 
 ### Step 4: (Task 3) Migrated ElementalIrEmitter