Add information about quantization support in GPU delegate documentation
PiperOrigin-RevId: 322152589
Change-Id: I452ffa6fabf5bbbb81267a9b5716b1e6277c0ddb
parent cb1119ba71
commit e7e026d0ea
@@ -244,6 +244,24 @@ as well. This includes all flavors of quantization, including:
To optimize performance, use models that have floating-point input & output
tensors.

#### How does this work?

Since the GPU backend only supports floating-point execution, we run quantized
models by giving it a ‘floating-point view’ of the original model. At a high
level, this entails the following steps:

*   *Constant tensors* (such as weights/biases) are dequantized once into the
    GPU memory. This happens when the delegate is applied to the TFLite
    Interpreter.

*   *Inputs and outputs* to the GPU program, if 8-bit quantized, are dequantized
    and quantized (respectively) for each inference. This is done on the CPU
    using TFLite’s optimized kernels.

*   The GPU program is modified to mimic quantized behavior by inserting
    *quantization simulators* between operations. This is necessary for models
    where ops expect activations to follow bounds learnt during quantization.

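The conversions in these steps are ordinary affine (de)quantization. The Java
sketch below is a hypothetical illustration, not TFLite's actual kernel code
(the class and method names are invented): it shows dequantizing int8 data with
a tensor's scale and zero-point, quantizing float results back to int8, and a
‘quantization simulator’ that mimics quantized behavior in float space via a
quantize-then-dequantize round trip.

```java
// Hypothetical helper illustrating the steps above; real_value = scale * (q - zeroPoint).
final class QuantizationSketch {

  // int8 -> float32: done once for constant tensors, and per inference for
  // quantized input tensors.
  static float[] dequantize(byte[] quantized, float scale, int zeroPoint) {
    float[] out = new float[quantized.length];
    for (int i = 0; i < quantized.length; i++) {
      out[i] = scale * (quantized[i] - zeroPoint);
    }
    return out;
  }

  // float32 -> int8: applied to GPU outputs so callers still see the model's
  // original quantized type.
  static byte[] quantize(float[] values, float scale, int zeroPoint) {
    byte[] out = new byte[values.length];
    for (int i = 0; i < values.length; i++) {
      int q = Math.round(values[i] / scale) + zeroPoint;
      out[i] = (byte) Math.max(-128, Math.min(127, q));
    }
    return out;
  }

  // A 'quantization simulator': rounding and clamping in float space keeps
  // activations within the bounds learnt during quantization.
  static float[] simulateQuantization(float[] values, float scale, int zeroPoint) {
    return dequantize(quantize(values, scale, zeroPoint), scale, zeroPoint);
  }
}
```
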
This feature can be enabled using delegate options as follows:

#### Android
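As a minimal sketch of enabling this from the Java API, using the
`setQuantizedModelsAllowed` option on `GpuDelegate.Options` (the variable names
`modelBuffer`, `input`, and `output` are placeholders for the app's own data):

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

// Allow the GPU delegate to accept quantized models.
GpuDelegate delegate =
    new GpuDelegate(new GpuDelegate.Options().setQuantizedModelsAllowed(true));
Interpreter.Options options = new Interpreter.Options().addDelegate(delegate);

// Run inference with the delegate attached, then release the delegate.
try (Interpreter interpreter = new Interpreter(modelBuffer, options)) {
  interpreter.run(input, output);
}
delegate.close();
```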