# Convolution Emitter
## Context
This doc describes a set of patches that are still under review.
TODO(timshen): Change once all patches are checked in.
The convolution emitter is a prototype with the following goals:
* The top priority is performance.
* It supports arbitrarily sophisticated layouts.
* It supports platform-specific high-performance instructions.
* It is as portable as possible.
* It enables fusion support in the future.
## Current Design
### Overview
The prototype consists of the following components:
* An MLIR-based emitter. It takes a set of tuning parameters and a convolution
configuration, then produces an NVVM device function. The emitter currently
focuses on the NVIDIA Volta architecture and the N(C/4)HW4 layout.
* An autotuner, which generates tuning parameters given a convolution
configuration.
* A test framework, which executes the generated device function with random
inputs and compares the result against cuDNN.
### The Emitter - Naive Implementation
The emitter starts with a hand-built, naive implementation that looks like the
following ResNet first-layer convolution (pseudo code):
```mlir
func @Conv(%input : memref<128x1x224x224xvector<4xf16>>,
           %filter : memref<64x1x7x7xvector<4xf16>>,
           %output : memref<128x64x112x112xf16>) {
  affine.parallel (%n, %o, %oh, %ow) = 0 to 128, 0 to 64, 0 to 112, 0 to 112 {
    %acc = alloc() : memref<f32>
    affine.store 0, %acc[]
    affine.for (%c, %fh, %fw) = 0 to 1, 0 to 7, 0 to 7 {
      %a = affine.padded.load %input[%n, %c, %oh * 2 + %fh - 3, %ow * 2 + %fw - 3]
      %b = affine.load %filter[%o, %c, %fh, %fw]
      %old = affine.load %acc[]
      %d = std.fpext %a to vector<4xf32>
      %e = std.fpext %b to vector<4xf32>
      %f = std.multiply %d, %e
      %g = "reduce" %f
      %v = %g + %old
      affine.store %v, %acc[]
    }
    %result = affine.load %acc[]
    affine.store %result, %output[%n, %o, %oh, %ow]
  }
}
```
A few extensions are used in the example above:
* `affine.padded.load` allows out-of-bounds accesses, in which case the result is
always 0 (see the sketch right after this list).
* The `"reduce"` operation produces the sum of the elements of a vector.
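To make the padding semantics concrete, here is a minimal sketch in the same
pseudo-MLIR register (`%ih` and `%iw` are hypothetical temporaries standing for
the two spatial index expressions; this is not the actual lowering):

```mlir
// Sketch of what %a = affine.padded.load %input[%n, %c, %ih, %iw] means.
if (0 <= %ih and %ih < 224 and 0 <= %iw and %iw < 224) {
  %a = affine.load %input[%n, %c, %ih, %iw]   // in bounds: a regular load
} else {
  %a = constant dense<0.0> : vector<4xf16>    // out of bounds: the padding value 0
}
```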
Also notice that the input element type is `vector<4xf16>` only because the
current implementation happens to use it. A MemRef with `<...x4xf16>` should
work as well, as long as the buffer is aligned to at least 8 bytes (usually 16).
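For instance, the input argument could equivalently be declared as (illustrative
only):

```mlir
%input : memref<128x1x224x224x4xf16>  // trailing dimension of 4 instead of vector<4xf16>
```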
The emitter then applies a few semantics-preserving transformations to move the
code towards PTX's structure.
### The Emitter - Tiling
The following is the naive code after loop tiling:
```mlir
func @Conv(%input : memref<128x1x224x224xvector<4xf16>>,
           %filter : memref<64x1x7x7xvector<4xf16>>,
           %output : memref<128x64x112x112xf16>) {
  affine.parallel (%n0, %o0, %oh0, %ow0) = 0 to 128, 0 to 1, 0 to 7, 0 to 7 {
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      %acc = alloc() : memref<f32>
      affine.store 0, %acc[]
      affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
        affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
          %a = affine.padded.load %input[
              %n0 * 1 + %n1,
              %c0 * 1 + %c1,
              (%oh0 * 16 + %oh1) * 2 + %fh0 * 7 + %fh1 - 3,
              (%ow0 * 16 + %ow1) * 2 + %fw0 * 7 + %fw1 - 3]
          %b = affine.load %filter[
              %o0 * 64 + %o1,
              %c0 * 1 + %c1,
              %fh0 * 7 + %fh1,
              %fw0 * 7 + %fw1]
          %old = affine.load %acc[]
          %d = std.fpext %a to vector<4xf32>
          %e = std.fpext %b to vector<4xf32>
          %f = std.multiply %d, %e
          %g = "reduce" %f
          %new = %g + %old
          affine.store %new, %acc[]
        }
      }
      %v = affine.load %acc[]
      affine.store %v, %output[
          %n0 * 1 + %n1,
          %o0 * 64 + %o1,
          %oh0 * 16 + %oh1,
          %ow0 * 16 + %ow1]
    } { ptx_block }
  } { ptx_grid }
}
```
The motivation is obvious: we need to decide which loops are parallelized over
which compute units of the PTX architecture. The `ptx_grid` and `ptx_block`
attributes direct that the loop should be parallelized over a grid or a block,
respectively.
Also notice that, to keep the code pattern clean, tiling is implemented in the
following way. Defining a "simple loop" as a loop with lower bound 0 and step 1,
the tiling:
* only takes simple loops.
* only produces simple loops.
* generates no extra operations; all altered index calculations are folded into
each user's AffineMap (see the one-dimensional sketch below).
The contracting dimensions (%c, %fh, %fw) are also tiled once. The significance
of this will become clear in shared memory promotion.
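As a one-dimensional illustration of these rules (a sketch only; `%buf` and the
load expression are placeholders, not code from the emitter), tiling a simple
loop of extent 112 by 16 produces two simple loops, and the rewritten index
`%oh0 * 16 + %oh1` is folded directly into the user's AffineMap rather than
materialized as a separate operation:

```mlir
// Before tiling: a single simple loop.
affine.for %oh = 0 to 112 {
  ... affine.load %buf[%oh * 2 - 3] ...
}

// After tiling by 16: two simple loops (lower bound 0, step 1). No extra
// index-computation ops are emitted; the expression lives in the AffineMap.
affine.for %oh0 = 0 to 7 {
  affine.for %oh1 = 0 to 16 {
    ... affine.load %buf[(%oh0 * 16 + %oh1) * 2 - 3] ...
  }
}
```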
### The Emitter - Splitting
This step splits the body of the (%n1, %o1, %oh1, %ow1) loop into several parts:
* The code that sets the accumulators to 0.
* The actual convolution computation code.
* The code that writes back accumulators to the %output buffer.
This transformation "vectorizes" the accumulator accordingly: as the `alloc()`
gets hoisted out of the `affine.parallel` op, the scalar `memref<f32>` becomes a
`memref<1x64x16x16xf32>` with one element per parallel iteration.
After splitting:
```mlir
func @Conv(%input : memref<128x1x224x224xvector<4xf16>>,
           %filter : memref<64x1x7x7xvector<4xf16>>,
           %output : memref<128x64x112x112xf16>) {
  affine.parallel (%n0, %o0, %oh0, %ow0) = 0 to 128, 0 to 1, 0 to 7, 0 to 7 {
    %acc = alloc() : memref<1x64x16x16xf32>
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      affine.store 0, %acc[%n1, %o1, %oh1, %ow1]
    } { ptx_block }
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
        affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
          %a = affine.padded.load %input[
              %n0 * 1 + %n1,
              %c0 * 1 + %c1,
              (%oh0 * 16 + %oh1) * 2 + %fh0 * 7 + %fh1 - 3,
              (%ow0 * 16 + %ow1) * 2 + %fw0 * 7 + %fw1 - 3]
          %b = affine.load %filter[
              %o0 * 64 + %o1,
              %c0 * 1 + %c1,
              %fh0 * 7 + %fh1,
              %fw0 * 7 + %fw1]
          %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
          %d = std.fpext %a to vector<4xf32>
          %e = std.fpext %b to vector<4xf32>
          %f = std.multiply %d, %e
          %g = "reduce" %f
          %new = %g + %old
          affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
        }
      }
    } { ptx_block }
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      %v = affine.load %acc[%n1, %o1, %oh1, %ow1]
      affine.store %v, %output[
          %n0 * 1 + %n1,
          %o0 * 64 + %o1,
          %oh0 * 16 + %oh1,
          %ow0 * 16 + %ow1]
    } { ptx_block }
  } { ptx_grid }
}
```
To prepare for the next transformations, we'd also like to sink the (%n1, %o1,
%oh1, %ow1) loop below the contracting loops, as (%c0, %fh0, %fw0) is not
interesting here:
```
affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
  affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
    affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
      ...
    }
  }
} { ptx_block }

=>

affine.for (%c0, %fh0, %fw0) = 0 to 1, 0 to 1, 0 to 1 {
  affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
    affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
      ...
    } { ptx_block }
  }
}
```
### The Emitter - Shared Memory Promotion
This transformation is done by `affineDataCopyGenerate`, which precisely
computes how much memory a load operation transfers.
After calculating the sizes of the shared memory buffers (`%promoted_input` and
`%promoted_filter`), the transformation also creates loads and stores that
prefetch data from global memory (`%input`, `%filter`) into the promoted shared
memory buffers.
```mlir
// Before
affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
  affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
    %a = affine.padded.load %input[
        %n0 * 1 + %n1,
        %c0 * 1 + %c1,
        (%oh0 * 16 + %oh1) * 2 + %fh0 * 7 + %fh1 - 3,
        (%ow0 * 16 + %ow1) * 2 + %fw0 * 7 + %fw1 - 3]
    %b = affine.load %filter[
        %o0 * 64 + %o1,
        %c0 * 1 + %c1,
        %fh0 * 7 + %fh1,
        %fw0 * 7 + %fw1]
    %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
    %d = std.fpext %a to vector<4xf32>
    %e = std.fpext %b to vector<4xf32>
    %f = std.multiply %d, %e
    %g = "reduce" %f
    %new = %g + %old
    affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
  } { ptx_block }
}
```
```mlir
// After
%promoted_input = alloc() : memref<1x1x37x37xvector<4xf16>, memory_space = 3>
%promoted_filter = alloc() : memref<64x1x7x7xvector<4xf16>, memory_space = 3>
affine.parallel (%i0, %i1, %i2, %i3) = 0 to 1, 0 to 1, 0 to 37, 0 to 37 {
  %v = affine.padded.load %input[
      %n0 * 1 + %i0,
      %c0 * 1 + %i1,
      (%oh0 * 16) * 2 + %fh0 * 7 + %i2 - 3,
      (%ow0 * 16) * 2 + %fw0 * 7 + %i3 - 3]
  affine.store %v, %promoted_input[%i0, %i1, %i2, %i3]
} { ptx_block }
affine.parallel (%i0, %i1, %i2, %i3) = 0 to 64, 0 to 1, 0 to 7, 0 to 7 {
  %v = affine.load %filter[
      %o0 * 64 + %i0,
      %c0 * 1 + %i1,
      %fh0 * 7 + %i2,
      %fw0 * 7 + %i3]
  affine.store %v, %promoted_filter[%i0, %i1, %i2, %i3]
} { ptx_block }
affine.for (%c1, %fh1, %fw1) = 0 to 1, 0 to 7, 0 to 7 {
  affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
    %a = affine.load %promoted_input[%n1, %c1, %oh1 * 2 + %fh1, %ow1 * 2 + %fw1]
    %b = affine.load %promoted_filter[%o1, %c1, %fh1, %fw1]
    %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
    %d = std.fpext %a to vector<4xf32>
    %e = std.fpext %b to vector<4xf32>
    %f = std.multiply %d, %e
    %g = "reduce" %f
    %new = %g + %old
    affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
  } { ptx_block }
}
```
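For reference, the 37x37 spatial extent of `%promoted_input` can be derived from
the access pattern in the compute loop: `%oh1 * 2 + %fh1` ranges over
`[0, 15 * 2 + 6] = [0, 36]`, i.e. `(16 - 1) * 2 + 7 = 37` elements per spatial
dimension (output tile 16, stride 2, filter extent 7), and likewise for the
width. This is exactly the footprint that `affineDataCopyGenerate` computes.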
### The Emitter - Volta MMA Instruction
This transformation turns the inner loop:
```mlir
affine.parallel (%n1, %o1, %oh1, %ow1) = 0 to 1, 0 to 64, 0 to 16, 0 to 16 {
  %a = affine.load %promoted_input[%n1, %c1, %oh1 * 2 + %fh1, %ow1 * 2 + %fw1]
  %b = affine.load %promoted_filter[%o1, %c1, %fh1, %fw1]
  %old = affine.load %acc[%n1, %o1, %oh1, %ow1]
  %d = std.fpext %a to vector<4xf32>
  %e = std.fpext %b to vector<4xf32>
  %f = std.multiply %d, %e
  %g = "reduce" %f
  %new = %g + %old
  affine.store %new, %acc[%n1, %o1, %oh1, %ow1]
} { ptx_block }
```
into multiple Volta mma.sync instructions. The result is not shown here, because
the prototype currently only hacks this step together to achieve its benchmark
goals.
### The Autotuner
As shown above, many parameters dictate how the naive implementation is
transformed. For now, the parameters are all tile sizes. On top of the emitter,
the prototype includes a simple autotuner that enumerates all good combinations
of tile sizes and invokes the emitter with each combination. With the help of
in-process benchmarking, the autotuner picks the best set of parameters.
## Future Improvements
* Explore Linalg/Vector for a higher-level naive implementation. MMA
instruction handling would be much easier with high-level functional
constructs.
* Explore other layouts. The current layout corresponds to NVIDIA
`CUDNN_TENSOR_NCHW_VECT_C`, but with fp16 elements.
* Iron out GPU dialect related lowering. Annotations like `ptx_grid` and
`ptx_block` should be generalized to more architectures.
* Speed up autotuning through more pruning.
* Support dynamic shapes.