Update post-training quantization docs for float16

PiperOrigin-RevId: 261369284
T.J. Alumbaugh 2019-08-02 12:26:22 -07:00 committed by TensorFlower Gardener
parent d90e521d71
commit 20f7886890
7 changed files with 700 additions and 3 deletions


@ -77,9 +77,11 @@ upper_tabs:
- title: "Post-training quantization"
path: /lite/performance/post_training_quantization
- title: "Post-training quantization example"
path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tutorials/post_training_quant.ipynb
path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/performance/post_training_quant.ipynb
- title: "Post-training integer quantization example"
path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tutorials/post_training_integer_quant.ipynb
path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/performance/post_training_integer_quant.ipynb
- title: "Post-training float16 quantization example"
path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/performance/post_training_float16_quant.ipynb
status: external
- title: "Delegates"
path: /lite/performance/delegates


@ -272,11 +272,16 @@ following Python code quantizes a `SavedModel` and saves it to disk:
 import tensorflow as tf
 converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
-converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
+converter.optimizations = [tf.lite.Optimize.DEFAULT]
 tflite_quant_model = converter.convert()
 open("converted_model.tflite", "wb").write(tflite_quant_model)
 ```
TensorFlow Lite supports reducing the precision of values from full floating
point to half-precision floats (float16) or 8-bit integers. Each choice trades
off model size against accuracy, and some operations have optimized
implementations for these reduced-precision types.
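For example, selecting the float16 option requires only a supported-types hint
on the same converter. A minimal sketch, mirroring the float16 example in the
guide linked below (`saved_model_dir` is the same SavedModel path as above):

```
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# DEFAULT optimizations plus a float16 supported type yields float16 weights.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.lite.constants.FLOAT16]
tflite_fp16_model = converter.convert()
open("converted_model_fp16.tflite", "wb").write(tflite_fp16_model)
```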
To learn more about quantization, see
[Post-training quantization](../performance/post_training_quantization.md).

Binary image file changed (not shown): 31 KiB before, 82 KiB after.


@ -0,0 +1,647 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "post-training-fp16-quant.ipynb",
"version": "0.3.2",
"provenance": [],
"private_outputs": true,
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 2",
"name": "python2"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6Y8E0lw5eYWm"
},
"source": [
"# Post Training FP16 Quantization"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "CIGrZZPTZVeO"
},
"source": [
"<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/tutorials/post_training_float16_quant.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tutorials/post_training_float16_quant.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "BTC1rDAuei_1"
},
"source": [
"## Overview\n",
"\n",
"[TensorFlow Lite](https://www.tensorflow.org/lite/) now supports\n",
"converting weights to 16-bit floating point values during model conversion from TensorFlow to TensorFlow Lite's flat buffer format. This results in a 2x reduction in model size. Some harware, like GPUs, can compute natively in this reduced precision arithmetic, realizing a speedup over traditional floating point execution. The Tensorflow Lite GPU delegate can be configured to run in this way. However, a model converted to float16 weights can still run on the CPU without additional modification: the float16 weights are upsampled to float32 prior to the first inference. This permits a significant reduction in model size in exchange for a minimal impacts to latency and accuracy.\n",
"\n",
"In this tutorial, we train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the saved model into a Tensorflow Lite flatbuffer\n",
"with float16 quantization. We finally check the\n",
"accuracy of the converted model and compare it to the original saved model. We\n",
"run the training script [mnist.py](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py) from\n",
"[Tensorflow official MNIST tutorial](https://github.com/tensorflow/models/tree/master/official/mnist).\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2XsEP17Zelz9"
},
"source": [
"## Building an MNIST model"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dDqqUIZjZjac"
},
"source": [
"### Setup"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "gyqAw1M9lyab",
"colab": {}
},
"source": [
"! pip uninstall -y tensorflow\n",
"! pip install -U tf-nightly"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "WsN6s5L1ieNl",
"colab": {}
},
"source": [
"import tensorflow as tf\n",
"tf.enable_eager_execution()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "00U0taBoe-w7",
"colab": {}
},
"source": [
"! git clone --depth 1 https://github.com/tensorflow/models"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "c6nb7OPlXs_3",
"colab_type": "code",
"colab": {}
},
"source": [
"tf.lite.constants.FLOAT16"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "4XZPtSh-fUOc",
"colab": {}
},
"source": [
"import sys\n",
"import os\n",
"\n",
"if sys.version_info.major >= 3:\n",
" import pathlib\n",
"else:\n",
" import pathlib2 as pathlib\n",
"\n",
"# Add `models` to the python path.\n",
"models_path = os.path.join(os.getcwd(), \"models\")\n",
"sys.path.append(models_path)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "eQ6Q0qqKZogR"
},
"source": [
"### Train and export the model"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "eMsw_6HujaqM",
"colab": {}
},
"source": [
"saved_models_root = \"/tmp/mnist_saved_model\""
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "hWSAjQWagIHl",
"colab": {}
},
"source": [
"# The above path addition is not visible to subprocesses, add the path for the subprocess as well.\n",
"!PYTHONPATH={models_path} python models/official/mnist/mnist.py --train_epochs=1 --export_dir {saved_models_root} --data_format=channels_last"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "5NMaNZQCkW9X"
},
"source": [
"For the example, we only trained the model for a single epoch, so it only trains to ~96% accuracy.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xl8_fzVAZwOh"
},
"source": [
"### Convert to a TensorFlow Lite model\n",
"\n",
"The `savedmodel` directory is named with a timestamp. Select the most recent one: "
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "Xp5oClaZkbtn",
"colab": {}
},
"source": [
"saved_model_dir = str(sorted(pathlib.Path(saved_models_root).glob(\"*\"))[-1])\n",
"saved_model_dir"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "AT8BgkKmljOy"
},
"source": [
"Using the [Python `TFLiteConverter`](https://www.tensorflow.org/lite/convert/python_api), the saved model can be converted into a TensorFlow Lite model.\n",
"\n",
"First load the model using the `TFLiteConverter`:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "_i8B2nDZmAgQ",
"colab": {}
},
"source": [
"import tensorflow as tf\n",
"tf.enable_eager_execution()\n",
"tf.logging.set_verbosity(tf.logging.DEBUG)\n",
"\n",
"converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)\n",
"tflite_model = converter.convert()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "F2o2ZfF0aiCx"
},
"source": [
"Write it out to a `.tflite` file:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "vptWZq2xnclo",
"colab": {}
},
"source": [
"tflite_models_dir = pathlib.Path(\"/tmp/mnist_tflite_models/\")\n",
"tflite_models_dir.mkdir(exist_ok=True, parents=True)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "Ie9pQaQrn5ue",
"colab": {}
},
"source": [
"tflite_model_file = tflite_models_dir/\"mnist_model.tflite\"\n",
"tflite_model_file.write_bytes(tflite_model)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7BONhYtYocQY"
},
"source": [
"To instead quantize the model to float16 on export, first set the `optimizations` flag to use default optimizations. Then specify that float16 is the supported type on the target platform:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "HEZ6ET1AHAS3",
"colab": {}
},
"source": [
"tf.logging.set_verbosity(tf.logging.INFO)\n",
"converter.optimizations = [tf.lite.Optimize.DEFAULT]\n",
"converter.target_spec.supported_types = [tf.lite.constants.FLOAT16]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xW84iMYjHd9t",
"colab_type": "text"
},
"source": [
"Finally, convert the model like usual. Note, by default the converted model will still use float input and outputs for invocation convenience."
]
},
{
"cell_type": "code",
"metadata": {
"id": "yuNfl3CoHNK3",
"colab_type": "code",
"colab": {}
},
"source": [
"tflite_fp16_model = converter.convert()\n",
"tflite_model_fp16_file = tflite_models_dir/\"mnist_model_quant_f16.tflite\"\n",
"tflite_model_fp16_file.write_bytes(tflite_fp16_model)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "PhMmUTl4sbkz"
},
"source": [
"Note how the resulting file is approximately `1/2` the size."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "JExfcfLDscu4",
"colab": {}
},
"source": [
"!ls -lh {tflite_models_dir}"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "L8lQHMp_asCq"
},
"source": [
"## Run the TensorFlow Lite models"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-5l6-ciItvX6"
},
"source": [
"We can run the TensorFlow Lite model using the Python TensorFlow Lite\n",
"Interpreter. \n",
"\n",
"### Load the test data\n",
"\n",
"First, let's load the MNIST test data to feed to the model:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "eTIuU07NuKFL",
"colab": {}
},
"source": [
"import numpy as np\n",
"_, mnist_test = tf.keras.datasets.mnist.load_data()\n",
"images, labels = tf.cast(mnist_test[0], tf.float32)/255.0, mnist_test[1]\n",
"\n",
"mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ap_jE7QRvhPf"
},
"source": [
"### Load the model into the interpreters"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "Jn16Rc23zTss",
"colab": {}
},
"source": [
"interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))\n",
"interpreter.allocate_tensors()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "J8Pztk1mvNVL",
"colab": {}
},
"source": [
"interpreter_fp16 = tf.lite.Interpreter(model_path=str(tflite_model_fp16_file))\n",
"interpreter_fp16.allocate_tensors()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2opUt_JTdyEu"
},
"source": [
"### Test the models on one image"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "AKslvo2kwWac",
"colab": {}
},
"source": [
"for img, label in mnist_ds:\n",
" break\n",
"\n",
"interpreter.set_tensor(interpreter.get_input_details()[0][\"index\"], img)\n",
"interpreter.invoke()\n",
"predictions = interpreter.get_tensor(\n",
" interpreter.get_output_details()[0][\"index\"])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "XZClM2vo3_bm",
"colab": {}
},
"source": [
"import matplotlib.pylab as plt\n",
"\n",
"plt.imshow(img[0])\n",
"template = \"True:{true}, predicted:{predict}\"\n",
"_ = plt.title(template.format(true= str(label[0].numpy()),\n",
" predict=str(predictions[0])))\n",
"plt.grid(False)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "3gwhv4lKbYZ4",
"colab": {}
},
"source": [
"interpreter_fp16.set_tensor(\n",
" interpreter_fp16.get_input_details()[0][\"index\"], img)\n",
"interpreter_fp16.invoke()\n",
"predictions = interpreter_fp16.get_tensor(\n",
" interpreter_fp16.get_output_details()[0][\"index\"])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "CIH7G_MwbY2x",
"colab": {}
},
"source": [
"plt.imshow(img[0])\n",
"template = \"True:{true}, predicted:{predict}\"\n",
"_ = plt.title(template.format(true= str(label[0].numpy()),\n",
" predict=str(predictions[0])))\n",
"plt.grid(False)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LwN7uIdCd8Gw"
},
"source": [
"### Evaluate the models"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "05aeAuWjvjPx",
"colab": {}
},
"source": [
"def eval_model(interpreter, mnist_ds):\n",
" total_seen = 0\n",
" num_correct = 0\n",
"\n",
" input_index = interpreter.get_input_details()[0][\"index\"]\n",
" output_index = interpreter.get_output_details()[0][\"index\"]\n",
" for img, label in mnist_ds:\n",
" total_seen += 1\n",
" interpreter.set_tensor(input_index, img)\n",
" interpreter.invoke()\n",
" predictions = interpreter.get_tensor(output_index)\n",
" if predictions == label.numpy():\n",
" num_correct += 1\n",
"\n",
" if total_seen % 500 == 0:\n",
" print(\"Accuracy after %i images: %f\" %\n",
" (total_seen, float(num_correct) / float(total_seen)))\n",
"\n",
" return float(num_correct) / float(total_seen)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "T5mWkSbMcU5z",
"colab": {}
},
"source": [
"print(eval_model(interpreter, mnist_ds))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Km3cY9ry8ZlG"
},
"source": [
"We can repeat the evaluation on the float16 quantized model to obtain:\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "-9cnwiPp6EGm",
"colab": {}
},
"source": [
"# NOTE: Colab runs on server CPUs. At the time of writing this, TensorFlow Lite\n",
"# doesn't have super optimized server CPU kernels. For this reason this may be\n",
"# slower than the above float interpreter. But for mobile CPUs, considerable\n",
"# speedup can be observed.\n",
"print(eval_model(interpreter_fp16, mnist_ds))\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "L7lfxkor8pgv"
},
"source": [
"In this example, we have quantized a model to float16 with no difference in the accuracy.\n",
"\n",
"It's also possible to evaluate the fp16 quantized model on the GPU. To perform all arithmetic with the reduced precision values, be sure to create the `TfLiteGPUDelegateOptions` struct in your app and set `precision_loss_allowed` to `1`, like this:\n",
"\n",
"```\n",
"//Prepare GPU delegate.\n",
"const TfLiteGpuDelegateOptions options = {\n",
" .metadata = NULL,\n",
" .compile_options = {\n",
" .precision_loss_allowed = 1, // FP16\n",
" .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,\n",
" .dynamic_batch_enabled = 0, // Not fully functional yet\n",
" },\n",
"};\n",
"```\n",
"\n",
"Detailed documentation on the TFLite GPU delegate and how to use it in your application can be found [here](https://www.tensorflow.org/lite/performance/gpu_advanced?source=post_page---------------------------)"
]
}
]
}


@ -8,6 +8,20 @@ conversion.
### Optimization options
There are several post training quantization options to choose from. Here is a
summary table of the choices and the benefits they provide:
| Technique              | Benefits                               | Hardware            |
| ---------------------- | -------------------------------------- | ------------------- |
| Post training "hybrid" | 4x smaller, 2-3x speedup, accuracy     | CPU                 |
| Post training integer  | 4x smaller, more speedup               | CPU, Edge TPU, etc. |
| Post training fp16     | 2x smaller, potential GPU acceleration | CPU/GPU             |
This decision tree can help determine which post-training quantization method is
best for your use case:
![post-training optimization options](images/optimization.jpg)
### Quantizing weights
@ -78,6 +92,35 @@ Note: `target_spec.supported_ops` was previously `target_ops` in the Python API.
This makes the converter throw an error if it encounters an operation it cannot
currently quantize.
### Float16 quantization of weights
We can reduce the size of a floating point model by quantizing the weights to
float16, the IEEE standard for 16-bit floating point numbers. The advantages of
this quantization are as follows:
- reduces model size by up to half (since all weights become half their
original size)
- minimal loss in accuracy
- some delegates (e.g. the GPU delegate) can operate directly on float16 data,
which results in faster execution than float32 computations.
This quantization may not be a good choice if you need maximum performance (in
that case, quantization to fixed-point math would be better). To enable float16
quantization of weights, specify the "DEFAULT" optimization as above, then
specify that float16 is among the supported types for the `target_spec`:
```
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.lite.constants.FLOAT16]
tflite_quant_model = converter.convert()
```
By default, a float16 quantized model will "dequantize" the weight values to
float32 when run on the CPU. The GPU delegate will not perform this
dequantization, since it can operate directly on float16 data.
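
As a quick check (a sketch; `model_fp16.tflite` stands in for your converted
file), the Python interpreter can report each tensor's type, confirming that
weights are stored as float16 while input and output tensors remain float32:

```
import tensorflow as tf

# Load the float16 quantized model and list its tensor types.
interpreter = tf.lite.Interpreter(model_path="model_fp16.tflite")
interpreter.allocate_tensors()

# Weight tensors should report float16; input/output tensors stay float32.
for tensor in interpreter.get_tensor_details():
    print(tensor["name"], tensor["dtype"])
```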
### Model accuracy
Since weights are quantized post training, there could be an accuracy loss,
particularly for smaller networks.