[XLA/GPU] Sync the XLA/GPU -> MLIR doc.

PiperOrigin-RevId: 322214657
Change-Id: I8326adca5cd1d388e95b7a1cdba7a34f6d6dbca0

commit 9bf535aed3
parent 24c4a6dd90
@@ -74,7 +74,6 @@ We have several choices on how to lower the host-side part from LHLO:
     *   (Pro) easy to implement library calls (cuDNN, cuBLAS, cuFFT, etc), as
         TFRT ops are interpreted by C++ code.
     *   (Con) host side is under development and not tested.
-    *   (Con) the JAX integration isn’t clear from a runtime point of view
 *   Jitted CPU code
     *   (Pro) great lower-ability. Create a few loops and conditions and it's
         done.
@@ -84,8 +83,7 @@ We have several choices on how to lower the host-side part from LHLO:
         dynamic loading, etc).
 *   Existing (interpreting) XLA runtime

-Tentative conclusion: Use jitted CPU code during the transition, and optionally
-adopt TFRT in the end.
+Decision: adopt TFRT, but also support jitting CPU code in TFRT.

 ## Migrating Device LLVM IR (Task 3)

@@ -114,7 +112,7 @@ end state of each XLA op:
     *   (Cost) Will be throw-away work if we want to ultimately migrate to
         Standard.
     *   (Benefit) It is easy and mechanical. Can be done in a short period.
-    *   (Benefit) It doesn't benefit more compared to a).
+    *   (Benefit) It doesn't benefit more compared to (1).
 1.  Refactor old emitters to be like LHLO -> MLIR GPU + Standard + Loops:
     *   (Cost) Lifting existing emitters to Standard introduces some challenges.
         Pointers and GEPs need to be converted to MemRefs and SubViews. Ensuring
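The "Pointers and GEPs need to be converted to MemRefs and SubViews" cost in the hunk above is easy to understate, so here is a minimal sketch of the target form. It assumes current upstream MLIR op names (`memref.subview`; these ops lived in the Standard dialect when this doc was written), and the function and value names are hypothetical:

```mlir
// A legacy emitter computes a row pointer with LLVM-style GEP arithmetic.
// Lifted to MemRef, the same access becomes a strided view whose offset and
// strides are tracked in the type rather than in pointer math — this implicit
// bookkeeping is what makes the lifting nontrivial.
func.func @row_view(%buf: memref<16x64xf32>, %row: index)
    -> memref<64xf32, strided<[1], offset: ?>> {
  // Take row %row: offsets [%row, 0], sizes [1, 64], strides [1, 1].
  // The unit dimension is dropped (rank-reducing subview).
  %view = memref.subview %buf[%row, 0] [1, 64] [1, 1]
      : memref<16x64xf32> to memref<64xf32, strided<[1], offset: ?>>
  return %view : memref<64xf32, strided<[1], offset: ?>>
}
```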
@@ -134,6 +132,19 @@ end state of each XLA op:
     *   (Benefit) unified stack; community support; portability; more
         optimization potentials.

+Conclusions:
+
+*   Don't go for (2). (1) or (3) are just better than (2). (2) costs more than
+    (1), since it requires a lot of mechanical refactoring. With (1) we can
+    still achieve the goal of enabling XLA to pick up MLIR emitters. This is by
+    doing LHLO -> LLVM IR -> run legacy device emitters.
+*   ElementalIrEmitter ops go for (4), but not incrementally. There is no way to
+    do it op by op, because all elementally-emitted ops are connected into the
+    same graph. This work can also serve as a unification point of several
+    on-going forces (xla/service/mlir\_gpu, the kernel generator, Linalg).
+*   All other ops go for (1). As a stretch goal, they might be migrated to (3)
+    or (4).
+
 ## Prioritization

 While all three tasks mentioned above are parallelizable, under limited
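A note on the ElementalIrEmitter conclusion added above: because a fused elementwise graph is emitted as one loop nest, the natural MLIR form is a single Linalg op whose body holds the whole subgraph. A hedged sketch in upstream MLIR syntax (the buffer-based form and all names are assumptions for illustration; the actual kernel-generator pipeline may differ):

```mlir
#map = affine_map<(d0) -> (d0)>

// A fused multiply-then-add, emitted as one linalg.generic over buffers.
func.func @fused(%a: memref<1024xf32>, %b: memref<1024xf32>,
                 %c: memref<1024xf32>, %out: memref<1024xf32>) {
  linalg.generic
      {indexing_maps = [#map, #map, #map, #map],
       iterator_types = ["parallel"]}
      ins(%a, %b, %c : memref<1024xf32>, memref<1024xf32>, memref<1024xf32>)
      outs(%out : memref<1024xf32>) {
  ^bb0(%x: f32, %y: f32, %z: f32, %o: f32):
    // The whole elementwise subgraph lives in one body, which is why it
    // cannot be migrated op by op.
    %m = arith.mulf %x, %y : f32
    %s = arith.addf %m, %z : f32
    linalg.yield %s : f32
  }
  return
}
```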
@@ -210,26 +221,19 @@ The exact profiling can't be easily done for MLIR-generated ops, since:

 ### Step 3: (Task 2) Migrating Thunks

-This step migrates all host ops and library calls. This step will eliminate most
-of the thunks and produce serializable MLIR instead.
-
-There are roughly three kinds of thunks:
+As a note, there are roughly three kinds of thunks:

 *   KernelThunk, which launches a kernel.
 *   Control flow thunks, which has host control flow logic (conditional, while,
     for, sequence) and launch body kernels.
 *   Library thunks: cuDNN, cuBLAS, cuFFT, NCCL, etc.

-The **bottom line** is to:
+The plan is:

-*   Create a Thunk dialect that provides (de)serialize logic for all existing
-    C++-based Thunks.
-*   Change emitters to emit a graph of Thunk dialect.
-
-**Optionally**, we can relieve some thunks from C++ implementation. KernelThunk
-can lower to the GPU LaunchKernelOp. Control flow thunks can leverage the CFG
-Dialect for loops and conditions, combined with LaunchKernelOp. This optional
-step requires profiling and stream support.
+*   Make Thunks (de)serializable.
+*   Help improve TFRT to a state where it can support these semantics.
+*   As the state improves, migrate individual thunks incrementally.
+
+These action items are only partially ordered. The actual execution order /
+engineering parallelism is to be evaluated as it goes.

 ### Step 4: (Task 3) Migrated ElementalIrEmitter

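For context on KernelThunk, and on the GPU LaunchKernelOp lowering that the removed text above proposed: roughly, the launch a KernelThunk performs from C++ today, expressed in the upstream GPU dialect. This is a minimal sketch; the kernel body, all names, and the launch dimensions are hypothetical:

```mlir
module attributes {gpu.container_module} {
  gpu.module @kernels {
    // Device side: what a KernelThunk launches (one element per thread).
    gpu.func @add_one(%buf: memref<1024xf32>) kernel {
      %bid  = gpu.block_id x
      %bdim = gpu.block_dim x
      %tid  = gpu.thread_id x
      %off  = arith.muli %bid, %bdim : index
      %idx  = arith.addi %off, %tid : index
      %one  = arith.constant 1.0 : f32
      %v    = memref.load %buf[%idx] : memref<1024xf32>
      %r    = arith.addf %v, %one : f32
      memref.store %r, %buf[%idx] : memref<1024xf32>
      gpu.return
    }
  }

  // Host side: the grid/block launch that KernelThunk issues in C++.
  func.func @host(%buf: memref<1024xf32>) {
    %c1   = arith.constant 1 : index
    %c4   = arith.constant 4 : index
    %c256 = arith.constant 256 : index
    gpu.launch_func @kernels::@add_one
        blocks in (%c4, %c1, %c1) threads in (%c256, %c1, %c1)
        args(%buf : memref<1024xf32>)
    return
  }
}
```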