# Convolution Emitter

## Context

This is a doc that describes a set of patches that are still under review.
TODO(timshen): Change once all patches are checked in.

The convolution emitter is a prototype with the following goals:

* The top priority is performance.
* It supports arbitrarily sophisticated layouts.
* It supports platform-specific high-performance instructions.
* It is as portable as possible.
* It enables fusion support in the future.

## Current Design

### Overview

The prototype consists of the following components:

* The emitter currently focuses on the NVIDIA Volta architecture and the
  N(C/4)HW4 layout.
* An MLIR-based emitter. It takes a set of tuning parameters and a convolution
  configuration, then produces an NVVM device function.
* An autotuner, which generates tuning parameters given a convolution
  configuration.
* A test framework, which executes the generated device function with random
  inputs and compares the results against cuDNN.

### The Emitter - Naive Implementation

The emitter starts with a hand-built, naive implementation that looks like the
following ResNet first-layer convolution (pseudocode):

```mlir
func @Conv(%input : memref<128x1x224x224xvector<4xf16>>,
           %filter : memref<64x1x7x7xvector<4xf16>>,
           %output : memref<128x64x224x224xf16>) {
  affine.parallel (%n, %o, %oh, %ow) = 0 to 128, 0 to 64, 0 to 112, 0 to 112 {
    %acc = alloc() : memref<f32>
    affine.store 0, %acc[]
    affine.for (%c, %fh, %fw) = 0 to 1, 0 to 7, 0 to 7 {
      %a = affine.padded.load %input[%n, %c, %oh * 2 + %fh - 3, %ow * 2 + %fw - 3]
      %b = affine.load %filter[%o, %c, %fh, %fw]
      %old = affine.load %acc[]
      %d = std.fpext %a to vector<4xf32>
      %e = std.fpext %b to vector<4xf32>
      %f = std.multiply %d, %e
      %g = "reduce" %f
      %new = %g + %old
      affine.store %new, %acc[]
    }
    %v = affine.load %acc[]
    affine.store %v, %output[%n, %o, %oh, %ow]
  }
}
```

A few extensions are used in the example above (a sketch of possible expansions
follows the list):

* `affine.padded.load` allows out-of-bounds accesses, in which case the result
  is always 0.
* The "reduce" operation produces the sum of the elements of a vector.

Also notice that the input element type is vector<4xf16> only because the
current implementation happens to use it. A MemRef with <...x4xf16> should work
as well, provided that the buffer is aligned to at least 8 bytes (usually 16).

Then the emitter applies a few semantics-preserving transformations to move the
code toward PTX's structure.

### The Emitter - Tiling

The following is the naive code after loop tiling:

```mlir
func @Conv(%input : memref<128x1x224x224xvector<4xf16>>,
           %filter : memref<64x1x7x7xvector<4xf16>>,
           %output : memref<128x64x224x224xf16>) {
  affine.parallel (%n0, %o0, %oh0, %ow0) = 0 to 128, 0 to 1, 0 to 7, 0 to 7 {
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      %acc = alloc() : memref<f32>
      affine.store 0, %acc[]
      affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
        affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
          %a = affine.padded.load %input[
              %n0 * 1 + %n1,
              %c0 * 1 + %c1,
              (%oh0 * 16 + %oh1) * 2 + %fh0 * 7 + %fh1 - 3,
              (%ow0 * 16 + %ow1) * 2 + %fw0 * 7 + %fw1 - 3]
          %b = affine.load %filter[
              %o0 * 64 + %o1,
              %c0 * 1 + %c1,
              %fh0 * 7 + %fh1,
              %fw0 * 7 + %fw1]
          %old = affine.load %acc[]
          %d = std.fpext %a to vector<4xf32>
          %e = std.fpext %b to vector<4xf32>
          %f = std.multiply %d, %e
          %g = "reduce" %f
          %new = %g + %old
          affine.store %new, %acc[]
        }
      }
      %v = affine.load %acc[]
      affine.store %v, %output[
          %n0 * 1 + %n1,
          %o0 * 64 + %o1,
          %oh0 * 16 + %oh1,
          %ow0 * 16 + %ow1]
    } { ptx_block }
  } { ptx_grid }
}
```

The motivation is obvious: we need to decide which loops are parallelized
across the compute units of the PTX architecture. The `ptx_grid` and
`ptx_block` attributes direct that the annotated loop be parallelized over a
grid or a block, respectively.

Also notice that, to keep the code pattern clean, tiling is implemented in the
following way (see the sketch after the list). Defining a "simple loop" as a
loop with lower bound 0 and step 1, the tiling:

* only takes simple loops.
* only produces simple loops.
* generates no extra operations; all altered index calculations are folded into
  the AffineMaps of each user.

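As a minimal illustration of these rules (the one-dimensional `%buf` below is
made up for this example), tiling a simple loop by 16 produces two simple loops
and folds the index arithmetic into the affine map of each user, rather than
emitting separate index computations:

```mlir
// Tiling the simple loop below by 16 (illustrative only).
affine.for %i = 0 to 112 {
  %x = affine.load %buf[%i]
}

// => two simple loops; the only change on the user side is its affine map.
affine.for %i0 = 0 to 7 {
  affine.for %i1 = 0 to 16 {
    %x = affine.load %buf[%i0 * 16 + %i1]
  }
}
```
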
The contracting dimensions (%c, %fh, %fw) are also tiled once. The significance
of this will become apparent in shared memory promotion.

### The Emitter - Splitting

This step splits the body of the (%n1, %o1, %oh1, %ow1) loop into several
parts:

* The code that sets the accumulators to 0.
* The actual convolution computation code.
* The code that writes the accumulators back to the %output buffer.

This transformation "vectorizes" the accumulator accordingly, as the `alloc()`
gets hoisted out of the `affine.parallel` op.

After splitting:

```mlir
func @Conv(%input : memref<128x1x224x224xvector<4xf16>>,
           %filter : memref<64x1x7x7xvector<4xf16>>,
           %output : memref<128x64x224x224xf16>) {
  affine.parallel (%n0, %o0, %oh0, %ow0) = 0 to 128, 0 to 1, 0 to 7, 0 to 7 {
    %acc = alloc() : memref<1x64x16x16xf32>
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      affine.store 0, %acc[%n1, %o1, %oh1, %ow1]
    } { ptx_block }
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
        affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
          %a = affine.padded.load %input[
              %n0 * 1 + %n1,
              %c0 * 1 + %c1,
              (%oh0 * 16 + %oh1) * 2 + %fh0 * 7 + %fh1 - 3,
              (%ow0 * 16 + %ow1) * 2 + %fw0 * 7 + %fw1 - 3]
          %b = affine.load %filter[
              %o0 * 64 + %o1,
              %c0 * 1 + %c1,
              %fh0 * 7 + %fh1,
              %fw0 * 7 + %fw1]
          %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
          %d = std.fpext %a to vector<4xf32>
          %e = std.fpext %b to vector<4xf32>
          %f = std.multiply %d, %e
          %g = "reduce" %f
          %new = %g + %old
          affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
        }
      }
    } { ptx_block }
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      %v = affine.load %acc[%n1, %o1, %oh1, %ow1]
      affine.store %v, %output[
          %n0 * 1 + %n1,
          %o0 * 64 + %o1,
          %oh0 * 16 + %oh1,
          %ow0 * 16 + %ow1]
    } { ptx_block }
  } { ptx_grid }
}
```

To prepare for the next transformations, we'd also like to sink the (%n1, %o1,
%oh1, %ow1) loop below the contraction loops, as (%c0, %fh0, %fw0) is not
interesting.

```
affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
  affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
    affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
      ...
    }
  }
} { ptx_block }

=>

affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
  affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      ...
    } { ptx_block }
  }
}
```

### The Emitter - Shared Memory Promotion

This transformation is done by `affineDataCopyGenerate`, which precisely
computes how much memory is transferred for a given load operation.

After calculating the sizes of the shared memory buffers (`%promoted_input` and
`%promoted_filter`), the transformation also creates loads and stores that
prefetch data from global memory (`%input`, `%filter`) into the promoted shared
memory. For the running example, a 16-wide output tile with stride 2 and a
7-wide filter tile reads a window of (16 - 1) * 2 + 7 = 37 elements in each
spatial dimension, hence the 1x1x37x37 `%promoted_input` buffer below.

```mlir
// Before
affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
  affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
    %a = affine.padded.load %input[
        %n0 * 1 + %n1,
        %c0 * 1 + %c1,
        (%oh0 * 16 + %oh1) * 2 + %fh0 * 7 + %fh1 - 3,
        (%ow0 * 16 + %ow1) * 2 + %fw0 * 7 + %fw1 - 3]
    %b = affine.load %filter[
        %o0 * 64 + %o1,
        %c0 * 1 + %c1,
        %fh0 * 7 + %fh1,
        %fw0 * 7 + %fw1]
    %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
    %d = std.fpext %a to vector<4xf32>
    %e = std.fpext %b to vector<4xf32>
    %f = std.multiply %d, %e
    %g = "reduce" %f
    %new = %g + %old
    affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
  } { ptx_block }
}
```

```mlir
// After

%promoted_input = alloc() : memref<1x1x37x37, memory_space = 3>
%promoted_filter = alloc() : memref<64x1x7x7, memory_space = 3>
affine.parallel (%i0, %i1, %i2, %i3) = 0 to 1, 0 to 1, 0 to 37, 0 to 37 {
  %v = affine.padded.load %input[
      %n0 * 1 + %i0,
      %c0 * 1 + %i1,
      (%oh0 * 16) * 2 + %fh0 * 7 + %i2 - 3,
      (%ow0 * 16) * 2 + %fw0 * 7 + %i3 - 3]
  affine.store %v, %promoted_input[%i0, %i1, %i2, %i3]
} { ptx_block }
affine.parallel (%i0, %i1, %i2, %i3) = 0 to 64, 0 to 1, 0 to 7, 0 to 7 {
  %v = affine.load %filter[
      %o0 * 64 + %i0,
      %c0 * 1 + %i1,
      %fh0 * 7 + %i2,
      %fw0 * 7 + %i3]
  affine.store %v, %promoted_filter[%i0, %i1, %i2, %i3]
} { ptx_block }
affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
  affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
    %a = affine.load %promoted_input[%n1, %c1, %oh1 * 2 + %fh1, %ow1 * 2 + %fw1]
    %b = affine.load %promoted_filter[%o1, %c1, %fh1, %fw1]
    %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
    %d = std.fpext %a to vector<4xf32>
    %e = std.fpext %b to vector<4xf32>
    %f = std.multiply %d, %e
    %g = "reduce" %f
    %new = %g + %old
    affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
  } { ptx_block }
}
```

### The Emitter - Volta MMA Instruction

This transformation turns the inner loop:

```mlir
affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
  %a = affine.load %promoted_input[%n1, %c1, %oh1 * 2 + %fh1, %ow1 * 2 + %fw1]
  %b = affine.load %promoted_filter[%o1, %c1, %fh1, %fw1]
  %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
  %d = std.fpext %a to vector<4xf32>
  %e = std.fpext %b to vector<4xf32>
  %f = std.multiply %d, %e
  %g = "reduce" %f
  %new = %g + %old
  affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
} { ptx_block }
```

into multiple Volta mma.sync instructions. The result is not shown here,
because the prototype currently only hacks it up to achieve benchmark goals.

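For orientation only, the sketch below gives the flavor of the code this step
targets. It is not the prototype's actual output: the warp decomposition, the
fragment shapes, and the generic-form op spelling are assumptions made for this
illustration.

```mlir
// Rough sketch (illustrative only): the 1x64x16x16 accumulator tile is split
// across warps, and each warp replaces the scalar multiply/"reduce"/add
// sequence with Volta mma.sync operations that compute D = A * B + C on fp16
// inputs with fp32 accumulation held in registers.
affine.parallel (%warp, %lane) = 0 to 8, 0 to 32 {
  affine.for %tile = 0 to 4 {
    // %a_frag, %b_frag and %acc_frag stand for per-thread fragments loaded
    // from shared memory / registers; the op is written in generic form
    // because the exact NVVM spelling is not the point here.
    %d = "nvvm.mma.sync"(%a_frag, %b_frag, %acc_frag)
        : (vector<4xf16>, vector<4xf16>, vector<8xf32>) -> vector<8xf32>
  }
}
```
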
### The Autotuner

As shown above, many parameters dictate how a naive implementation is
transformed. For now, the parameters are all tile sizes. On top of the emitter,
the prototype includes a simple autotuner that enumerates all good combinations
of tile sizes and invokes the emitter with each combination. With the
assistance of in-process benchmarking, the autotuner is able to pick the best
set of parameters.

## Future Improvements

* Explore Linalg/Vector for a higher-level naive implementation. MMA
  instruction handling would be much easier with high-level functional
  constructs.
* Explore other layouts. The current layout corresponds to NVIDIA's
  `CUDNN_TENSOR_NCHW_VECT_C`, but for fp16.
* Iron out GPU-dialect-related lowering. Annotations like `ptx_grid` and
  `ptx_block` should be generalized to more architectures.
* Speed up autotuning through more pruning.
* Support dynamic shapes.