From 9bf535aed35a197714251135b9155bd998df68c4 Mon Sep 17 00:00:00 2001 From: Tim Shen Date: Mon, 20 Jul 2020 13:35:35 -0700 Subject: [PATCH] [XLA/GPU] Sync the XLA/GPU -> MLIR doc. PiperOrigin-RevId: 322214657 Change-Id: I8326adca5cd1d388e95b7a1cdba7a34f6d6dbca0 --- .../compiler/mlir/g3doc/xla_gpu_codegen.md | 40 ++++++++++--------- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md b/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md index 2fe109c1783..8e7e605fc4c 100644 --- a/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md +++ b/tensorflow/compiler/mlir/g3doc/xla_gpu_codegen.md @@ -74,7 +74,6 @@ We have several choices on how to lower the host-side part from LHLO: * (Pro) easy to implement library calls (cuDNN, cuBLAS, cuFFT, etc), as TFRT ops are interpreted by C++ code. * (Con) host side is under development and not tested. - * (Con) the JAX integration isn’t clear from a runtime point of view * Jitted CPU code * (Pro) great lower-ability. Create a few loops and conditions and it's done. @@ -84,8 +83,7 @@ We have several choices on how to lower the host-side part from LHLO: dynamic loading, etc). * Existing (interpreting) XLA runtime -Tentative conclusion: Use jitted CPU code during the transition, and optionally -adopt TFRT in the end. +Decision: adopt TFRT, but also support jitting CPU code in TFRT. ## Migrating Device LLVM IR (Task 3) @@ -114,7 +112,7 @@ end state of each XLA op: * (Cost) Will be throw-away work if we want to ultimately migrate to Standard. * (Benefit) It is easy and mechanical. Can be done in a short period. - * (Benefit) It doesn't benefit more compared to a). + * (Benefit) It doesn't benefit more compared to (1). 1. Refactor old emitters to be like LHLO -> MLIR GPU + Standard + Loops: * (Cost) Lifting existing emitters to Standard introduces some challenges. Pointers and GEPs need to be converted to MemRefs and SubViews. Ensuring @@ -134,6 +132,19 @@ end state of each XLA op: * (Benefit) unified stack; community support; portability; more optimization potentials. +Conclusions: + +* Don't go for (2). (1) or (3) are just better than (2). (2) costs more than + (1), since it requires a lot of mechanical refactoring. With (1) we can + still achieve the goal of enabling XLA to pick up MLIR emitters. This is by + doing LHLO -> LLVM IR -> run legacy device emitters. +* ElementalIrEmitter ops go for (4), but not incrementally. There is no way to + do it op by op, because all elementally-emitted ops are connected into the + same graph. This work can also serve as a unification point of several + on-going forces (xla/service/mlir\_gpu, the kernel generator, Linalg). +* All other ops go for (1). As a stretch goal, they might be migrated to (3) + or (4). + ## Prioritization While all three tasks mentioned above are parallelizable, under limited @@ -210,26 +221,19 @@ The exact profiling can't be easily done for MLIR-generated ops, since: ### Step 3: (Task 2) Migrating Thunks -This step migrates all host ops and library calls. This step will eliminate most -of the thunks and produce serializable MLIR instead. - -There are roughly three kinds of thunks: - +As a note, there are roughly three kinds of thunks: * KernelThunk, which launches a kernel. * Control flow thunks, which has host control flow logic (conditional, while, for, sequence) and launch body kernels. * Library thunks: cuDNN, cuBLAS, cuFFT, NCCL, etc. -The **bottom line** is to: +The plan is: +* Make Thunks (de)serializable. +* Help improve TFRT to a state where it can support these semantics. +* As the state improves, migrate individual thunks incrementally. -* Create a Thunk dialect that provides (de)serialize logic for all existing - C++-based Thunks. -* Change emitters to emit a graph of Thunk dialect. - -**Optionally**, we can relieve some thunks from C++ implementation. KernelThunk -can lower to the GPU LaunchKernelOp. Control flow thunks can leverage the CFG -Dialect for loops and conditions, combined with LaunchKernelOp. This optional -step requires profiling and stream support. +These action items are only partially ordered. The actual execution order / +engineering parallelism is to be evaluated as it goes. ### Step 4: (Task 3) Migrated ElementalIrEmitter