Add information about quantization support in GPU delegate documentation
PiperOrigin-RevId: 322152589
Change-Id: I452ffa6fabf5bbbb81267a9b5716b1e6277c0ddb
parent cb1119ba71
commit e7e026d0ea
@@ -244,6 +244,24 @@ as well. This includes all flavors of quantization, including:
To optimize performance, use models that have floating-point input & output
tensors.

#### How does this work?

Since the GPU backend only supports floating-point execution, we run quantized
models by giving it a ‘floating-point view’ of the original model. At a high
level, this entails the following steps:

*   *Constant tensors* (such as weights/biases) are dequantized once into the
    GPU memory. This happens when the delegate is applied to the TFLite
    Interpreter.

*   *Inputs and outputs* to the GPU program, if 8-bit quantized, are dequantized
    and quantized (respectively) for each inference. This is done on the CPU
    using TFLite’s optimized kernels.

*   The GPU program is modified to mimic quantized behavior by inserting
    *quantization simulators* between operations. This is necessary for models
    where ops expect activations to follow bounds learnt during quantization.

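The conversions in these steps are ordinary affine (de)quantization. The Java
sketch below is a hypothetical illustration, not TFLite's actual kernel code
(the class and method names are invented): it shows dequantizing int8 data with
a tensor's scale and zero-point, quantizing float results back to int8, and a
‘quantization simulator’ that mimics quantized behavior in float space via a
quantize-then-dequantize round trip.

```java
// Hypothetical helper illustrating the steps above; real_value = scale * (q - zeroPoint).
final class QuantizationSketch {

  // int8 -> float32: done once for constant tensors, and per inference for
  // quantized input tensors.
  static float[] dequantize(byte[] quantized, float scale, int zeroPoint) {
    float[] out = new float[quantized.length];
    for (int i = 0; i < quantized.length; i++) {
      out[i] = scale * (quantized[i] - zeroPoint);
    }
    return out;
  }

  // float32 -> int8: applied to GPU outputs so callers still see the model's
  // original quantized type.
  static byte[] quantize(float[] values, float scale, int zeroPoint) {
    byte[] out = new byte[values.length];
    for (int i = 0; i < values.length; i++) {
      int q = Math.round(values[i] / scale) + zeroPoint;
      out[i] = (byte) Math.max(-128, Math.min(127, q));
    }
    return out;
  }

  // A 'quantization simulator': rounding and clamping in float space keeps
  // activations within the bounds learnt during quantization.
  static float[] simulateQuantization(float[] values, float scale, int zeroPoint) {
    return dequantize(quantize(values, scale, zeroPoint), scale, zeroPoint);
  }
}
```
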
This feature can be enabled using delegate options as follows:

#### Android
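As a minimal sketch of enabling this from the Java API, using the
`setQuantizedModelsAllowed` option on `GpuDelegate.Options` (the variable names
`modelBuffer`, `input`, and `output` are placeholders for the app's own data):

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

// Allow the GPU delegate to accept quantized models.
GpuDelegate delegate =
    new GpuDelegate(new GpuDelegate.Options().setQuantizedModelsAllowed(true));
Interpreter.Options options = new Interpreter.Options().addDelegate(delegate);

// Run inference with the delegate attached, then release the delegate.
try (Interpreter interpreter = new Interpreter(modelBuffer, options)) {
  interpreter.run(input, output);
}
delegate.close();
```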