[MLIR:TPU] Add document for space to depth transform pass.

PiperOrigin-RevId: 319886690
Change-Id: I122e13e3fe4bdcc7ba0f607d1b8adc266f3d65be
# Automatic Space to Depth Transform in MLIR Bridge
Author: wangtao@, yuanzx@, hinsu@, lyandy@, chiachenc@, aminim@, jpienaar@,
dehao@
## TL;DR
_This document describes an automatic space to depth transform for the first
convolution in the new MLIR bridge to improve MXU efficiency of low batch size
convolutions._
## Background
For image models, the first layer is usually not MXU friendly as it has a
feature dimension of 3. This results in poor performance, especially with small
batch sizes.
One way to address this issue is to use the `space-to-depth` transform. This
optimization tiles the 2x2 space dimensions into the feature dimension so that
the feature dimension becomes 3\*4=12, which is more MXU friendly. To make this
optimization efficient, the weight needs to be padded and transposed to the
shape that the convolution emitter expects. The input also needs to be
transposed on the host and padded on the device to make the convolution
efficient. Although a 2x2 space-to-depth transform works only when the first
convolution has a stride of 2, many image models, ResNet-like ones in
particular, have a stride-2 convolution in the first layer.
Space to depth helped models such as MaskRCNN, SSD and I3D gain more than a 2X
speedup and reduce memory usage in the first convolution.
The first convolution in many image models, including ResNet or ResNet-like
models, is a (kernel=7, stride=2) 2D convolution. The input of this convolution
is images, which usually have RGB channels. The input of this first convolution
is of shape [batch\_size, height, width, 3] and the kernel size is
[kernel\_size, kernel\_size, 3, out\_channel]. Space to depth transforms this
first convolution's input to [batch\_size, height // stride, width // stride,
3 \* stride \* stride] and the kernel to [kernel\_size // stride, kernel\_size
// stride, 3 \* stride \* stride, out\_channel] to improve TPU MXU utilization.
![drawings](images/sapce_to_depth_transform.png)
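For a concrete sense of these shapes, here is a minimal sketch of the shape
arithmetic, assuming the ResNet-style values used later in this document
(batch size 2, 224x224 RGB inputs, a 7x7x3x64 kernel, stride 2); the general
padding formula for the kernel is an assumption, and the padding itself is
discussed in the weight transformation section below.
```
# Illustrative shape arithmetic only; batch_size=2, 224x224 RGB input and a
# 7x7x3x64 stride-2 kernel are assumed example values.
stride = 2                                   # also the space-to-depth block size
batch_size, height, width, in_channel = 2, 224, 224, 3
kernel_size, out_channel = 7, 64

# Input: [2, 224, 224, 3] -> [2, 112, 112, 12]
new_input_shape = [batch_size, height // stride, width // stride,
                   in_channel * stride * stride]
assert new_input_shape == [2, 112, 112, 12]

# Kernel: [7, 7, 3, 64] -> pad spatial dims to 8 -> [4, 4, 12, 64]
padded_kernel_size = kernel_size + (stride - kernel_size % stride) % stride
new_kernel_shape = [padded_kernel_size // stride, padded_kernel_size // stride,
                    in_channel * stride * stride, out_channel]
assert new_kernel_shape == [4, 4, 12, 64]
```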
This optimization can be done automatically by the graph optimizer, where the
weight transformation is done at variable loading time and the input
transformation is done for every inference invocation. A further optimization
can fuse this (on the host) with the double transpose to minimize memory
operations on the host.
## Proposed Method
**block\_size** is defined as the size of the spatial block that is folded into
the depth dimension. _stride % block\_size == 0_ and _stride >= block\_size_
are required to do the transform. There are three parts to the automatic space
to depth transformation:
1. Transform input on the host.
Space-to-depth performs the following permutation, which is equivalent to
`tf.nn.space_to_depth`.
```
# Fold each block_size x block_size spatial block into the channel dimension.
images = tf.reshape(images, [batch, h // block_size, block_size,
                             w // block_size, block_size, c])
images = tf.transpose(images, [0, 1, 3, 2, 4, 5])
images = tf.reshape(images, [batch, h // block_size, w // block_size,
                             c * (block_size ** 2)])
```
`SpaceToDepthOp` can be called on the host to perform the transform.
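As an illustrative sanity check (not part of the pass itself), the
reshape/transpose sequence above can be compared against the built-in
`tf.nn.space_to_depth` op on random data; the shapes below are assumed example
values.
```
import numpy as np
import tensorflow as tf

batch, h, w, c, block_size = 2, 224, 224, 3, 2
images = tf.constant(np.random.rand(batch, h, w, c), dtype=tf.float32)

# Manual space-to-depth via reshape/transpose, as in the snippet above.
manual = tf.reshape(images, [batch, h // block_size, block_size,
                             w // block_size, block_size, c])
manual = tf.transpose(manual, [0, 1, 3, 2, 4, 5])
manual = tf.reshape(manual, [batch, h // block_size, w // block_size,
                             c * (block_size ** 2)])

# The built-in op should produce the same [2, 112, 112, 12] result.
builtin = tf.nn.space_to_depth(images, block_size)
assert manual.shape == builtin.shape == (2, 112, 112, 12)
assert np.allclose(manual.numpy(), builtin.numpy())
```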
1. Weight Transformation
Weight transformation is similar to the input transformation. A weight
transform is needed to apply the space to depth optimization to a model that
loads a pre-trained checkpoint. This transform can be done on the host or on
the TPU device based on the cost. As the size of the kernel is relatively
small, this won't add additional cost to TPU device time. Below is the logic to
transform a kernel of shape [7, 7, 3, 64] to [4, 4, 12, 64].
```
conv0 = tf.compat.v1.layers.Conv2D(
    filters=filters,
    kernel_size=kernel_size,
    strides=2,
    padding=('SAME' if strides == 1 else 'VALID'),
    use_bias=False,
    kernel_initializer=tf.variance_scaling_initializer(),
    data_format=data_format)

# Use the image size without space-to-depth transform as the input of conv0.
batch_size, h, w, channel = inputs.get_shape().as_list()
conv0.build([
    batch_size, h * space_to_depth_block_size, w * space_to_depth_block_size,
    channel // (space_to_depth_block_size**2)
])

kernel = conv0.weights[0]
# [7, 7, 3, 64] --> [8, 8, 3, 64]
kernel = tf.pad(
    kernel,
    paddings=tf.constant([[1, 0], [1, 0], [0, 0], [0, 0]]),
    mode='CONSTANT',
    constant_values=0.)
# Transform the kernel following the space-to-depth logic:
# https://www.tensorflow.org/api_docs/python/tf/nn/space_to_depth
kernel = tf.reshape(
    kernel,
    [4, space_to_depth_block_size, 4, space_to_depth_block_size, 3, filters])
kernel = tf.transpose(kernel, [0, 2, 1, 3, 4, 5])
kernel = tf.reshape(kernel, [4, 4, int(channel), filters])
kernel = tf.cast(kernel, inputs.dtype)
```
If kernel\_size % block\_size != 0, the weight needs to be padded before the
transform, and the input of the convolution needs to be padded as well.
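A generalized sketch of this kernel transform, including the pad amount for an
arbitrary kernel\_size and block\_size, could look like the following; the
helper name `transform_kernel` is hypothetical and not part of the pass.
```
import tensorflow as tf

def transform_kernel(kernel, block_size):
  """Hypothetical helper: pads and rearranges a [k, k, in_c, out_c] kernel."""
  k, _, in_c, out_c = kernel.shape.as_list()
  # Pad the spatial dims at the front so they are divisible by block_size,
  # e.g. 7 -> 8 for block_size 2 (matching the tf.pad call above).
  pad = (block_size - k % block_size) % block_size
  kernel = tf.pad(kernel, [[pad, 0], [pad, 0], [0, 0], [0, 0]])
  new_k = (k + pad) // block_size
  # Fold each block_size x block_size neighborhood into the input channels.
  kernel = tf.reshape(kernel,
                      [new_k, block_size, new_k, block_size, in_c, out_c])
  kernel = tf.transpose(kernel, [0, 2, 1, 3, 4, 5])
  return tf.reshape(kernel, [new_k, new_k, in_c * block_size**2, out_c])

# Example: a [7, 7, 3, 64] kernel becomes [4, 4, 12, 64] with block_size 2.
assert transform_kernel(tf.zeros([7, 7, 3, 64]), 2).shape == (4, 4, 12, 64)
```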
1. Rewrite the first convolution
The first convolution's input shape needs to be rewritten from [batch\_size,
height, width, 3] to [batch\_size, height // block\_size, width //
block\_size, 3 \* block\_size \* block\_size] and its kernel shape from
[kernel\_size, kernel\_size, 3, out\_channel] to [kernel\_size //
block\_size, kernel\_size // block\_size, 3 \* block\_size \* block\_size,
out\_channel]. The stride of the convolution is reduced accordingly (from 2
to 1 in the example below).
This is the proposed workflow for the automatic space to depth transformation.
All the transformations will be triggered in an MLIR SpaceToDepthRewritePass.
This rewrite pass runs before TPURewrite so that no metadata rewrite is needed.
* First, the rewrite pass walks through all the convolutions in the func of
tf\_device::LaunchOp and gets the first convolution and its shape;
* Second, the rewrite pass applies transformations to the first convolution,
the padding before the first convolution, the first convolution's filter and
its Conv2DBackPropFilter;
* Finally, the rewrite pass inserts a SpaceToDepthOp after IteratorGetNext
where the iterator's result has the same shape as the first convolution's
input.
#### Pseudo MLIR code before and after RewritePass
```
// Example: original program:
//
module {
  func @while_body {
    %input = "tf.IteratorGetNext"(...) {device = "/CPU:0"}:
        -> tensor<2x224x224x3xf32>
    %device_launch = "tf_device.launch_func"(%input, ...) {func = @_func, ...}
    return ...
  }

  func @_func(%input: tensor<2x224x224x3xf32>,
              %filter: tensor<7x7x3x64xf32>) {
    %6 = "tf.Conv2D"(%input, %filter) {strides = [1, 2, 2, 1]}:
        (tensor<2x230x230x3xf32>, tensor<7x7x3x64xf32>) ->
        tensor<2x112x112x64xf32>
  }
}

// With this pass, the program will be transformed into:
module {
  func @while_body {
    %input = "tf.IteratorGetNext"(...) {device = "/CPU:0"}
        -> tensor<2x224x224x3xf32>
    %space_to_depth = "tf.SpaceToDepth"(%input) {block_size = 2, ...}:
        (tensor<2x224x224x3xf32>) -> tensor<2x112x112x12xf32>
    %device_launch = "tf_device.launch_func"(%space_to_depth, ...) {func = @_func, ...}
    return ...
  }

  func @_func(%input: tensor<2x112x112x12xf32>,
              %filter: tensor<7x7x3x64xf32>) {
    %filter_transform = "tf.Pad/tf.Transpose/tf.Reshape"(%filter):
        (tensor<7x7x3x64xf32>) -> tensor<4x4x12x64xf32>
    %conv = "tf.Conv2D"(%input, %filter_transform) {strides = [1, 1, 1, 1]}:
        (tensor<2x112x112x12xf32>, tensor<4x4x12x64xf32>) ->
        tensor<2x112x112x64xf32>
  }
}
```
### SpaceToDepth Trigger Condition
Space to depth will only be triggered when the batch size is small and the
first convolution's channel size is small. The stride of the convolution should
also be greater than 1. A cost model that takes the input shape and host cost
into consideration will be built to trigger the transformation. There will also
be a flag to disable this feature.
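A simplified sketch of such a trigger check is shown below; the function name
and the threshold constants are assumptions for illustration only, since the
actual decision will come from the cost model.
```
def should_apply_space_to_depth(input_shape, strides, block_size=2):
  """Hypothetical trigger check; the real pass will rely on a cost model."""
  batch_size, _, _, in_channels = input_shape
  stride = strides[1]                 # NHWC spatial stride
  return (batch_size <= 8             # small batch size (assumed threshold)
          and in_channels <= 4        # small channel size, e.g. RGB
          and stride > 1              # stride must be greater than 1
          and stride % block_size == 0
          and stride >= block_size)

# Example: a [2, 224, 224, 3] input with a stride-2 convolution qualifies.
assert should_apply_space_to_depth([2, 224, 224, 3], [1, 2, 2, 1])
```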
### Fuse SpaceToDepth with Automatic Double Transpose
The transpose and reshape ops in SpaceToDepthOp on TPU hosts may cause image
models to be infeed bound. To reduce host time, the space to depth transform
can be fused with the `automatic double transpose` to reduce extra overhead on
the host.
### Extend from Conv2D to Conv3D
SpaceToDepth not only helps 2D image models but also 3D image models such as
I3D. The plan is to apply automatic space to depth to Conv2D as a first step.
After Conv2D is well tested, this technique will be generalized to Conv3D.