Add information about quantization support in GPU delegate documentation
PiperOrigin-RevId: 322152589 Change-Id: I452ffa6fabf5bbbb81267a9b5716b1e6277c0ddb
This commit is contained in: parent cb1119ba71, commit e7e026d0ea
@@ -244,6 +244,24 @@ as well. This includes all flavors of quantization, including:
To optimize performance, use models that have floating-point input & output
tensors.

#### How does this work?

Since the GPU backend only supports floating-point execution, we run quantized
models by giving it a ‘floating-point view’ of the original model. At a
high level, this entails the following steps:

*   *Constant tensors* (such as weights/biases) are dequantized once into the
    GPU memory. This happens when the delegate is applied to the TFLite
    Interpreter.

*   *Inputs and outputs* to the GPU program, if 8-bit quantized, are dequantized
    and quantized (respectively) for each inference. This is done on the CPU
    using TFLite’s optimized kernels.

*   The GPU program is modified to mimic quantized behavior by inserting
    *quantization simulators* between operations. This is necessary for models
    where ops expect activations to follow bounds learnt during quantization.
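The per-inference dequantize/quantize step at the GPU program's boundary can be illustrated with the standard affine quantization formula, `real = scale * (quantized - zero_point)`. The sketch below is a standalone illustration of that arithmetic; the class and method names are hypothetical and are not part of the TFLite API:

```java
// Sketch of affine (de)quantization as applied at the GPU program's boundary.
// Helper names here are illustrative, not TFLite kernel code.
public class QuantMath {
    // Dequantize a uint8 value to float before it enters the GPU program.
    static float dequantize(byte q, float scale, int zeroPoint) {
        return scale * ((q & 0xFF) - zeroPoint);  // mask to treat byte as uint8
    }

    // Quantize a float output back to uint8 after GPU inference.
    static int quantize(float real, float scale, int zeroPoint) {
        int q = Math.round(real / scale) + zeroPoint;
        return Math.min(255, Math.max(0, q));  // clamp to the uint8 range
    }

    public static void main(String[] args) {
        float scale = 0.5f;
        int zeroPoint = 128;
        // 200 (uint8) -> (200 - 128) * 0.5 = 36.0
        System.out.println(dequantize((byte) 200, scale, zeroPoint));
        // 36.0 -> round(36.0 / 0.5) + 128 = 200
        System.out.println(quantize(36.0f, scale, zeroPoint));
    }
}
```

Note that this round trip is lossy only up to the quantization step size, which is why running the floating-point view of a quantized model preserves accuracy well.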
This feature can be enabled using delegate options as follows:

#### Android
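The Android snippet itself is cut off in this view. Assuming the `GpuDelegate.Options#setQuantizedModelsAllowed` method from the TensorFlow Lite Java GPU bindings, enabling the feature would look roughly like this (the `modelFile` variable is a placeholder for your own `.tflite` model):

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

// Allow quantized models on the GPU delegate, then attach the delegate
// to the interpreter before running inference.
GpuDelegate delegate = new GpuDelegate(
    new GpuDelegate.Options().setQuantizedModelsAllowed(true));
Interpreter.Options options = new Interpreter.Options().addDelegate(delegate);
Interpreter interpreter = new Interpreter(modelFile, options);
```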