# Converting Quantized Models

This page provides information on how to convert quantized TensorFlow Lite
models. For more details, please see the
[model optimization](../performance/model_optimization.md) page.

## Post-training: Quantizing models for CPU model size

The simplest way to create a small model is to quantize the weights to 8 bits
and quantize the inputs/activations "on-the-fly" during inference. This has
latency benefits, but prioritizes size reduction.

During conversion, set the `optimizations` flag to optimize for size:

```python
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
```
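
The output of `convert()` is a serialized TFLite flatbuffer (a `bytes` object).
As a minimal sketch (the `model_quant.tflite` filename is just an example), it
can be written to disk like this:

```python
# Write the serialized TFLite flatbuffer to disk; the filename is arbitrary.
with open('model_quant.tflite', 'wb') as f:
  f.write(tflite_quant_model)
```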

## Full integer quantization of weights and activations

We can get further latency improvements, reductions in peak memory usage, and
access to integer-only hardware accelerators by making sure all model math is
quantized. To do this, we need to measure the dynamic range of activations and
inputs with a representative dataset: create an input data generator and
provide it to the converter.

```python
import tensorflow as tf

def representative_dataset_gen():
  for _ in range(num_calibration_steps):
    # Get sample input data as a numpy array in a method of your choosing.
    yield [input]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
```
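
By default, any operation that has no integer implementation is left in float.
If your target hardware only supports integer ops, you can ask the converter to
enforce integer-only conversion via `target_spec.supported_ops`; a minimal
sketch, reusing the converter configured above:

```python
# Fail conversion (instead of silently falling back to float) if any op
# lacks an 8-bit integer implementation.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_quant_model = converter.convert()
```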

## During training: Quantizing models for integer-only execution

Quantizing models for integer-only execution produces a model with even lower
latency, smaller size, and compatibility with integer-only accelerators.
Currently, this requires training a model with
["fake-quantization" nodes](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize).
This is only available in the v1 converter. A longer-term solution that is
compatible with TensorFlow 2.0 semantics is in progress.

Convert the graph:

```python
converter = tf.compat.v1.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Produce a model whose real-number tensors are quantized to uint8.
converter.inference_type = tf.lite.constants.QUANTIZED_UINT8
input_arrays = converter.get_input_arrays()
# (mean_value, std_dev_value) describe how uint8 inputs map to real values.
converter.quantized_input_stats = {input_arrays[0]: (0., 1.)}
tflite_model = converter.convert()
```
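
To sanity-check the converted model, it can be run with the TFLite Python
interpreter. The sketch below (not part of the original guide) feeds random
uint8 data, since a fully quantized model expects uint8 inputs:

```python
import numpy as np
import tensorflow as tf

# Load the converted flatbuffer into the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# A fully quantized model expects uint8 input data.
input_shape = tuple(input_details[0]['shape'])
input_data = np.random.randint(0, 256, size=input_shape, dtype=np.uint8)
interpreter.set_tensor(input_details[0]['index'], input_data)

interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
```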

For fully integer models, the inputs are uint8. When the `inference_type` is set
to `QUANTIZED_UINT8` as above, the real input value is standardized using the
[standard score](https://en.wikipedia.org/wiki/Standard_score) as follows:

`real_input_value = (quantized_input_value - mean_value) / std_dev_value`

The `mean_value` and `std_dev_value` specify how those uint8 values map to the
float input values used while training the model: `mean_value` is the integer
value from 0 to 255 that maps to floating point 0.0f, and `std_dev_value` is
255 / (float_max - float_min). For more details, please see the
[TFLiteConverter](https://www.tensorflow.org/api_docs/python/tf/compat/v1/lite/TFLiteConverter)
documentation.
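
As a worked example (an illustrative assumption, not from the original guide):
if the float inputs seen during training were normalized to the range [0, 1],
then `std_dev_value = 255 / (1.0 - 0.0) = 255` and `mean_value = 0`, so uint8
value 0 maps back to 0.0 and 255 maps back to 1.0:

```python
mean_value = 0.0
std_dev_value = 255 / (1.0 - 0.0)  # float_max = 1.0, float_min = 0.0

for quantized_input_value in (0, 128, 255):
  real_input_value = (quantized_input_value - mean_value) / std_dev_value
  print(quantized_input_value, '->', real_input_value)  # 0 -> 0.0, ..., 255 -> 1.0
```
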
For most users, we recommend using post-training quantization. We are working on
new tools for post-training and training-time quantization that we hope will
simplify generating quantized models.