Add information about quantization support in GPU delegate documentation
PiperOrigin-RevId: 322152589 Change-Id: I452ffa6fabf5bbbb81267a9b5716b1e6277c0ddb
This commit is contained in: parent cb1119ba71, commit e7e026d0ea
@@ -244,6 +244,24 @@ as well. This includes all flavors of quantization, including:
To optimize performance, use models that have floating-point input & output
tensors.

#### How does this work?

Since the GPU backend only supports floating-point execution, we run quantized
models by giving it a ‘floating-point view’ of the original model. At a
high level, this entails the following steps:

*   *Constant tensors* (such as weights/biases) are dequantized once into the
    GPU memory. This happens when the delegate is applied to the TFLite
    Interpreter.

*   *Inputs and outputs* to the GPU program, if 8-bit quantized, are dequantized
    and quantized (respectively) for each inference. This is done on the CPU
    using TFLite’s optimized kernels.

*   The GPU program is modified to mimic quantized behavior by inserting
    *quantization simulators* between operations. This is necessary for models
    where ops expect activations to follow bounds learnt during quantization.
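The per-inference dequantize/quantize step at the GPU program's boundary can be illustrated with the standard affine quantization formula, `real = scale * (quantized - zero_point)`. The sketch below is a standalone illustration of that arithmetic; the class and method names are hypothetical and are not part of the TFLite API:

```java
// Sketch of affine (de)quantization as applied at the GPU program's boundary.
// Helper names here are illustrative, not TFLite kernel code.
public class QuantMath {
    // Dequantize a uint8 value to float before it enters the GPU program.
    static float dequantize(byte q, float scale, int zeroPoint) {
        return scale * ((q & 0xFF) - zeroPoint);  // mask to treat byte as uint8
    }

    // Quantize a float output back to uint8 after GPU inference.
    static int quantize(float real, float scale, int zeroPoint) {
        int q = Math.round(real / scale) + zeroPoint;
        return Math.min(255, Math.max(0, q));  // clamp to the uint8 range
    }

    public static void main(String[] args) {
        float scale = 0.5f;
        int zeroPoint = 128;
        // 200 (uint8) -> (200 - 128) * 0.5 = 36.0
        System.out.println(dequantize((byte) 200, scale, zeroPoint));
        // 36.0 -> round(36.0 / 0.5) + 128 = 200
        System.out.println(quantize(36.0f, scale, zeroPoint));
    }
}
```

Note that this round trip is lossy only up to the quantization step size, which is why running the floating-point view of a quantized model preserves accuracy well.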
This feature can be enabled using delegate options as follows:

#### Android
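The Android snippet itself is cut off in this view. Assuming the `GpuDelegate.Options#setQuantizedModelsAllowed` method from the TensorFlow Lite Java GPU bindings, enabling the feature would look roughly like this (the `modelFile` variable is a placeholder for your own `.tflite` model):

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

// Allow quantized models on the GPU delegate, then attach the delegate
// to the interpreter before running inference.
GpuDelegate delegate = new GpuDelegate(
    new GpuDelegate.Options().setQuantizedModelsAllowed(true));
Interpreter.Options options = new Interpreter.Options().addDelegate(delegate);
Interpreter interpreter = new Interpreter(modelFile, options);
```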