Add information about quantization support in GPU delegate documentation

PiperOrigin-RevId: 322152589
Change-Id: I452ffa6fabf5bbbb81267a9b5716b1e6277c0ddb
Sachin Joglekar 2020-07-20 08:49:59 -07:00 committed by TensorFlower Gardener
parent cb1119ba71
commit e7e026d0ea

@@ -244,6 +244,24 @@ as well. This includes all flavors of quantization, including:
To optimize performance, use models that have floating-point input & output
tensors.

#### How does this work?

Since the GPU backend only supports floating-point execution, we run quantized
models by giving it a floating-point view of the original model. At a high
level, this entails the following steps:

* *Constant tensors* (such as weights/biases) are dequantized once into the
  GPU memory. This happens when the delegate is applied to the TFLite
  Interpreter.

* *Inputs and outputs* to the GPU program, if 8-bit quantized, are dequantized
  and quantized (respectively) for each inference. This is done on the CPU
  using TFLite's optimized kernels; a sketch of the underlying math follows
  this list.

* The GPU program is modified to mimic quantized behavior by inserting
  *quantization simulators* between operations. This is necessary for models
  where ops expect activations to follow bounds learnt during quantization.
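
To make the dequantize/quantize steps above concrete, here is a minimal
sketch of TFLite-style affine quantization math (`real = scale * (q -
zero_point)`). The `QuantParams` struct and function names are illustrative,
not the delegate's actual internals:

```c++
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Affine (asymmetric) quantization parameters, as used by TFLite:
//   real_value = scale * (quantized_value - zero_point)
struct QuantParams {
  float scale;
  int32_t zero_point;
};

// Dequantize: int8 -> float. Done once for constant tensors (weights/biases),
// and per inference for quantized inputs.
std::vector<float> Dequantize(const std::vector<int8_t>& q,
                              const QuantParams& p) {
  std::vector<float> result(q.size());
  for (size_t i = 0; i < q.size(); ++i) {
    result[i] = p.scale * (q[i] - p.zero_point);
  }
  return result;
}

// Quantize: float -> int8. Done per inference for quantized outputs.
std::vector<int8_t> Quantize(const std::vector<float>& real,
                             const QuantParams& p) {
  std::vector<int8_t> result(real.size());
  for (size_t i = 0; i < real.size(); ++i) {
    int32_t q =
        static_cast<int32_t>(std::round(real[i] / p.scale)) + p.zero_point;
    result[i] = static_cast<int8_t>(std::clamp<int32_t>(q, -128, 127));
  }
  return result;
}

// A "quantization simulator" in float: quantize, then immediately dequantize,
// so activations stay within the bounds learnt during quantization.
float FakeQuant(float x, const QuantParams& p) {
  int32_t q = static_cast<int32_t>(std::round(x / p.scale)) + p.zero_point;
  q = std::clamp<int32_t>(q, -128, 127);
  return p.scale * (q - p.zero_point);
}
```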

This feature can be enabled using delegate options as follows:

#### Android
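
As a sketch for C++, assuming the GPU delegate V2 API from
`tensorflow/lite/delegates/gpu/delegate.h` and its experimental
`TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_QUANT` flag (treat this as
illustrative, not the doc's exact snippet):

```c++
#include "tensorflow/lite/delegates/gpu/delegate.h"

// Assuming `interpreter` is an already-built tflite::Interpreter.
// Enable quantized-model support via the delegate's experimental flags.
TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
options.experimental_flags |= TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_QUANT;

TfLiteDelegate* delegate = TfLiteGpuDelegateV2Create(&options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
  // Delegation failed; fall back to CPU execution.
}

// ... run inference ...

TfLiteGpuDelegateV2Delete(delegate);
```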