From 4783a85dec10daa2175870555b5c4442687b7a42 Mon Sep 17 00:00:00 2001 From: Khanh LeViet Date: Wed, 11 Mar 2020 19:49:33 -0700 Subject: [PATCH] Minor TF Lite doc update PiperOrigin-RevId: 300461356 Change-Id: I850174418d894144f58d46844ad74b9ef16bf5be --- tensorflow/lite/g3doc/_book.yaml | 23 ++-- .../lite/g3doc/performance/benchmarks.md | 4 +- .../lite/g3doc/performance/best_practices.md | 72 ++++++----- .../lite/g3doc/performance/delegates.md | 3 +- .../g3doc/performance/model_optimization.md | 120 ++++++++++++++---- .../g3doc/performance/quantization_spec.md | 14 +- 6 files changed, 163 insertions(+), 73 deletions(-) diff --git a/tensorflow/lite/g3doc/_book.yaml b/tensorflow/lite/g3doc/_book.yaml index 6c5f5aed6e3..e9891be1c29 100644 --- a/tensorflow/lite/g3doc/_book.yaml +++ b/tensorflow/lite/g3doc/_book.yaml @@ -71,18 +71,9 @@ upper_tabs: path: /lite/performance/best_practices - title: "Benchmarks" path: /lite/performance/benchmarks - - title: "Model optimization" - path: /lite/performance/model_optimization - - title: "Post-training quantization" - path: /lite/performance/post_training_quantization - - title: "Post-training weight quantization" - path: /lite/performance/post_training_quant - - title: "Post-training integer quantization" - path: /lite/performance/post_training_integer_quant - - title: "Post-training float16 quantization" - path: /lite/performance/post_training_float16_quant - title: "Delegates" path: /lite/performance/delegates + status: experimental - title: "GPU delegate" path: /lite/performance/gpu - title: "Advanced GPU" @@ -92,6 +83,18 @@ upper_tabs: - title: "Hexagon delegate" path: /lite/performance/hexagon_delegate status: experimental + + - heading: "Optimize a model" + - title: "Overview" + path: /lite/performance/model_optimization + - title: "Post-training quantization" + path: /lite/performance/post_training_quantization + - title: "Post-training weight quantization" + path: /lite/performance/post_training_quant + - title: "Post-training integer quantization" + path: /lite/performance/post_training_integer_quant + - title: "Post-training float16 quantization" + path: /lite/performance/post_training_float16_quant - title: "Quantization specification" path: /lite/performance/quantization_spec diff --git a/tensorflow/lite/g3doc/performance/benchmarks.md b/tensorflow/lite/g3doc/performance/benchmarks.md index e825f7c41c3..2dc20f2d74c 100644 --- a/tensorflow/lite/g3doc/performance/benchmarks.md +++ b/tensorflow/lite/g3doc/performance/benchmarks.md @@ -7,7 +7,7 @@ These performance benchmark numbers were generated with the [Android TFLite benchmark binary](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark) and the [iOS benchmark app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/ios). -# Android performance benchmarks +## Android performance benchmarks For Android benchmarks, the CPU affinity is set to use big cores on the device to reduce variance (see [details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark#reducing-variance-between-runs-on-android)). @@ -135,7 +135,7 @@ The performance values below are measured on Android 10. 
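If you just want a rough latency sanity check during development, rather than the benchmark binary or app referenced above, a minimal sketch along the following lines (using the TensorFlow Lite Python interpreter on a stand-in model built in-process) can time repeated invocations. The numbers it produces are not comparable to the device benchmarks listed in this document.

```python
import time

import numpy as np
import tensorflow as tf

# Stand-in model converted in-process; substitute your own .tflite model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
dummy_input = np.random.rand(*input_detail["shape"]).astype(np.float32)
interpreter.set_tensor(input_detail["index"], dummy_input)

interpreter.invoke()  # Warm-up run.
start = time.perf_counter()
for _ in range(50):
    interpreter.invoke()
print("Average latency: %.2f ms" % ((time.perf_counter() - start) / 50 * 1e3))
```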
-# iOS benchmarks
+## iOS benchmarks
 
 To run iOS benchmarks, the
 [benchmark app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/ios)
diff --git a/tensorflow/lite/g3doc/performance/best_practices.md b/tensorflow/lite/g3doc/performance/best_practices.md
index 76553cedcfd..56093e63722 100644
--- a/tensorflow/lite/g3doc/performance/best_practices.md
+++ b/tensorflow/lite/g3doc/performance/best_practices.md
@@ -2,8 +2,8 @@
 Mobile and embedded devices have limited computational resources, so it is
 important to keep your application resource efficient. We have compiled a list
-of best practices and strategies that you can use to optimize your model and
-application when using TensorFlow Lite.
+of best practices and strategies that you can use to improve your TensorFlow
+Lite model performance.
 
 ## Choose the best model for the task
 
@@ -23,13 +23,20 @@ One example of models optimized for mobile devices are
 vision applications.
 [Hosted models](../models/hosted.md) lists several other models that have been
 optimized specifically for mobile and embedded devices.
 
-You can retrain the listed models on your own dataset by using transfer learning. Check out our transfer learning tutorial for
-[image classification](https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0) and
- [object detection](https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193).
-
+You can retrain the listed models on your own dataset by using transfer
+learning. Check out our transfer learning tutorial for
+[image classification](https://colab.sandbox.google.com/github/tensorflow/examples/blob/master/tensorflow_examples/lite/model_maker/demo/image_classification.ipynb)
+and
+[object detection](https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193).
 
 ## Profile your model
-Once you have selected a candidate model that is right for your task, it is a good practice to profile and benchmark your model. TensorFlow Lite [benchmarking tool](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark) has a built-in profiler that shows per operator profiling statistics. This can help in understanding performance bottlenecks and which operators dominate the computation time.
+
+Once you have selected a candidate model that is right for your task, it is a
+good practice to profile and benchmark your model. TensorFlow Lite
+[benchmarking tool](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark)
+has a built-in profiler that shows per operator profiling statistics. This can
+help in understanding performance bottlenecks and which operators dominate the
+computation time.
 
 ## Profile and optimize operators in the graph
 
@@ -43,23 +50,12 @@ operator is executed. Check out our
 
 ## Optimize your model
 
-Model compression aims to create smaller models that are generally faster and
-more energy efficient, so that they can be deployed on mobile devices.
+Model optimization aims to create smaller models that are generally faster and
+more energy efficient, so that they can be deployed on mobile devices. There are
+multiple optimization techniques supported by TensorFlow Lite, such as
+quantization.
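As a rough illustration of the kind of optimization this section refers to, the sketch below applies post-training weight quantization while converting a model with the `tf.lite.TFLiteConverter` Python API. The tiny Keras model is a stand-in; in practice you would pass your own trained model.

```python
import tensorflow as tf

# Stand-in model; substitute your own trained tf.keras model or SavedModel.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

# Enabling the default optimizations applies post-training weight quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```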
-### Quantization
-
-If your model uses floating-point weights or activations, then it may be
-possible to reduce the size of model up to ~4x by using quantization, which
-effectively turns the float weights to 8-bit. There are two flavors of
-quantization: [post-training quantization](post_training_quantization.md) and
-[quantized training](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md){:.external}.
-The former does not require model re-training, but, in rare cases, may have
-accuracy loss. When accuracy loss is beyond acceptable thresholds, quantized
-training should be used instead.
-
-We strongly recommend running benchmarks to make sure that the accuracy is not
-impacted during model compression. Check out our
-[model optimization docs](model_optimization.md) for details.
+Check out our [model optimization docs](model_optimization.md) for details.
 
 ## Tweak the number of threads
 
@@ -87,7 +83,14 @@ the Java API is a lot faster if ByteBuffers are used as
 [inputs](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java#L175).
 
 ## Profile your application with platform specific tools
-Platform specific tools like [Android profiler](https://developer.android.com/studio/profile/android-profiler) and [Instruments](https://help.apple.com/instruments/mac/current/) provide a wealth of profiling information that can be used to debug your app. Sometimes the performance bug may be not in the model but in parts of application code that interact with the model. Make sure to familiarize yourself with platform specific profiling tools and best practices for your platform.
+
+Platform specific tools like
+[Android profiler](https://developer.android.com/studio/profile/android-profiler)
+and [Instruments](https://help.apple.com/instruments/mac/current/) provide a
+wealth of profiling information that can be used to debug your app. Sometimes
+the performance bug may not be in the model but in parts of application code
+that interact with the model. Make sure to familiarize yourself with platform
+specific profiling tools and best practices for your platform.
 
 ## Evaluate whether your model benefits from using hardware accelerators available on the device
 
@@ -102,18 +105,23 @@ interpreter execution. TensorFlow Lite can use delegates by:
 efficiency of your model. To enable the Neural Networks API, call
 [UseNNAPI](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/interpreter.h#L343)
 on the interpreter instance.
-* A binary-only GPU delegate has been released for Android and iOS, using
-  OpenGL and Metal, respectively. To try them out, see the
-  [GPU delegate tutorial](gpu.md) and [documentation](gpu_advanced.md).
+* The GPU delegate is available on Android and iOS, using OpenGL/OpenCL and
+  Metal, respectively. To try it out, see the [GPU delegate tutorial](gpu.md)
+  and [documentation](gpu_advanced.md).
+* The Hexagon delegate is available on Android. It leverages the Qualcomm
+  Hexagon DSP if it is available on the device. See the
+  [Hexagon delegate tutorial](hexagon_delegate.md) for more information.
 * It is possible to create your own delegate if you have access to
   non-standard hardware. See [TensorFlow Lite delegates](delegates.md) for more
  information.
 
-Be aware that some accelerators work better for different types of models. It is
-important to benchmark each delegate to see if it is a good choice for your
-application.
For example, if you have a very small model, it may not be worth -delegating the model to either the NN API or the GPU. Conversely, accelerators -are a great choice for large models that have high arithmetic intensity. +Be aware that some accelerators work better for different types of models. Some +delegates only support float models or models optimized in a specific way. It is +important to [benchmark](benchmarks.md) each delegate to see if it is a good +choice for your application. For example, if you have a very small model, it may +not be worth delegating the model to either the NN API or the GPU. Conversely, +accelerators are a great choice for large models that have high arithmetic +intensity. ## Need more help diff --git a/tensorflow/lite/g3doc/performance/delegates.md b/tensorflow/lite/g3doc/performance/delegates.md index 212d959487e..4f383b52e1f 100644 --- a/tensorflow/lite/g3doc/performance/delegates.md +++ b/tensorflow/lite/g3doc/performance/delegates.md @@ -1,7 +1,6 @@ # TensorFlow Lite delegates -_Note: Delegate API is still experimental and is subject to change._ - +Note: Delegate API is still experimental and is subject to change. ## What is a TensorFlow Lite delegate? diff --git a/tensorflow/lite/g3doc/performance/model_optimization.md b/tensorflow/lite/g3doc/performance/model_optimization.md index eea225f7e80..befc40aa738 100644 --- a/tensorflow/lite/g3doc/performance/model_optimization.md +++ b/tensorflow/lite/g3doc/performance/model_optimization.md @@ -1,39 +1,102 @@ # Model optimization +Edge devices often have limited memory or computational power. Various +optimizations can be applied to models so that they can be run within these +constraints. In addition, some optimizations allow the use of specialized +hardware for accelerated inference. + Tensorflow Lite and the [Tensorflow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization) provide tools to minimize the complexity of optimizing inference. -Inference efficiency is particularly important for edge devices, such as mobile -and Internet of Things (IoT). Such devices have many restrictions on processing, -memory, power-consumption, and storage for models. Furthermore, model -optimization unlocks the processing power of fixed-point hardware and next -generation hardware accelerators. +It's recommended that you consider model optimization during your application +development process. This document outlines some best practices for optimizing +TensorFlow models for deployment to edge hardware. -## Model quantization +## Why models should be optimized -Quantizing deep neural networks uses techniques that allow for reduced precision -representations of weights and, optionally, activations for both storage and -computation. Quantization provides several benefits: +There are several main ways model optimization can help with application +development. -* Support on existing CPU platforms. -* Quantization of activations reduces memory access costs for reading and storing intermediate activations. -* Many CPU and hardware accelerator implementations provide SIMD instruction capabilities, which are especially beneficial for quantization. +### Size reduction -TensorFlow Lite provides several levels of support for quantization. +Some forms of optimization can be used to reduce the size of a model. Smaller +models have the following benefits: -* Tensorflow Lite [post-training quantization](post_training_quantization.md) - quantizes weights and activations post training easily. 
-* [Quantization-aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize){:.external}
-  allows for training of networks that can be quantized with minimal accuracy
-  drop; this is only available for a subset of convolutional neural network
-  architectures.
+- **Smaller storage size:** Smaller models occupy less storage space on your
+  users' devices. For example, an Android app using a smaller model will take
+  up less storage space on a user's mobile device.
+- **Smaller download size:** Smaller models require less time and bandwidth to
+  download to users' devices.
+- **Less memory usage:** Smaller models use less RAM when they are run, which
+  frees up memory for other parts of your application to use, and can
+  translate to better performance and stability.
 
-### Latency and accuracy results
+Quantization can reduce the size of a model in all of these cases, potentially
+at the expense of some accuracy. Pruning can reduce the size of a model for
+download by making it more easily compressible.
+
+### Latency reduction
+
+*Latency* is the amount of time it takes to run a single inference with a given
+model. Some forms of optimization can reduce the amount of computation required
+to run inference using a model, resulting in lower latency. Latency can also
+have an impact on power consumption.
+
+Currently, quantization can be used to reduce latency by simplifying the
+calculations that occur during inference, potentially at the expense of some
+accuracy.
+
+### Accelerator compatibility
+
+Some hardware accelerators, such as the
+[Edge TPU](https://cloud.google.com/edge-tpu/), can run inference extremely fast
+with models that have been correctly optimized.
+
+Generally, these types of devices require models to be quantized in a specific
+way. See each hardware accelerator's documentation to learn more about its
+requirements.
+
+## Trade-offs
+
+Optimizations can potentially result in changes in model accuracy, which must be
+considered during the application development process.
+
+The accuracy changes depend on the individual model being optimized, and are
+difficult to predict ahead of time. Generally, models that are optimized for
+size or latency will lose a small amount of accuracy. Depending on your
+application, this may or may not impact your users' experience. In rare cases,
+certain models may gain some accuracy as a result of the optimization process.
+
+## Types of optimization
+
+TensorFlow Lite currently supports optimization via quantization and pruning.
+
+These are part of the
+[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization),
+which provides resources for model optimization techniques that are compatible
+with TensorFlow Lite.
+
+### Quantization
+
+[Quantization](https://www.tensorflow.org/model_optimization/guide/quantization)
+works by reducing the precision of the numbers used to represent a model's
+parameters, which by default are 32-bit floating point numbers. This results in
+a smaller model size and faster computation.
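For instance, a minimal sketch of post-training integer quantization, one of the options summarized in the table below, could look like the following. The model and the random representative dataset are stand-ins for illustration only; a real representative dataset should be a small sample of typical inputs.

```python
import numpy as np
import tensorflow as tf

# Stand-in model; substitute your own trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

def representative_dataset():
    # Yield a few unlabelled samples covering the expected input range.
    for _ in range(100):
        yield [np.random.rand(1, 4).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Optionally restrict conversion to int8 ops only, e.g. for accelerators such
# as the Edge TPU:
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
```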
+ +The following types of quantization are available in TensorFlow Lite: + +Technique | Data requirements | Size reduction | Accuracy | Supported hardware +-------------------------------------------------------------------------------------------------------------- | -------------------------------- | -------------- | --------------------------- | ------------------ +[Post-training float16 quantization](post_training_float16_quant.ipynb) | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU +[Post-training weight quantization](post_training_quant.ipynb) | No data | Up to 75% | Accuracy loss | CPU +[Post-training integer quantization](post_training_integer_quant.ipynb) | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, EdgeTPU, Hexagon DSP +[Quantization-aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize) | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, EdgeTPU, Hexagon DSP Below are the latency and accuracy results for post-training quantization and quantization-aware training on a few models. All latency numbers are measured on -Pixel 2 devices using a single big core. As the toolkit improves, so will the numbers here: +Pixel 2 devices using a single big core CPU. As the toolkit improves, so will +the numbers here:
@@ -61,7 +124,17 @@ Pixel 2 devices using a single big core. As the toolkit improves, so will t -## Choice of tool +### Pruning + +[Pruning](https://www.tensorflow.org/model_optimization/guide/pruning) works by +removing parameters within a model that have only a minor impact on its +predictions. Pruned models are the same size on disk, and have the same runtime +latency, but can be compressed more effectively. This makes pruning a useful +technique for reducing model download size. + +In the future, TensorFlow Lite will provide latency reduction for pruned models. + +## Development workflow As a starting point, check if the models in [hosted models](../guide/hosted_models.md) can work for your application. If @@ -76,3 +149,6 @@ is the better option. See additional optimization techniques under the [Tensorflow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization). Note: Quantization-aware training supports a subset of convolutional neural network architectures. + +If you want to further reduce your model size, you can try [pruning](#pruning) +prior to quantizing your models. diff --git a/tensorflow/lite/g3doc/performance/quantization_spec.md b/tensorflow/lite/g3doc/performance/quantization_spec.md index b0cea36ac1e..9c30fbdc855 100644 --- a/tensorflow/lite/g3doc/performance/quantization_spec.md +++ b/tensorflow/lite/g3doc/performance/quantization_spec.md @@ -1,6 +1,10 @@ # TensorFlow Lite 8-bit quantization specification -### Specification summary +The following document outlines the specification for TensorFlow Lite's 8-bit +quantization scheme. This is intended to assist hardware developers in providing +hardware support for inference with quantized TensorFlow Lite models. + +## Specification summary We are providing a specification, and we can only provide some guarantees on behaviour if the spec is followed. We also understand different hardware may @@ -27,14 +31,14 @@ Note: In the past our quantized tooling used per-tensor, asymmetric, `uint8` quantization. New tooling, reference kernels, and optimized kernels for 8-bit quantization will use this spec. -### Signed integer vs unsigned integer +## Signed integer vs unsigned integer TensorFlow Lite quantization will primarily prioritize tooling and kernels for `int8` quantization for 8-bit. This is for the convenience of symmetric quantization being represented by zero-point equal to 0. Additionally many backends have additional optimizations for `int8xint8` accumulation. -### Per-axis vs per-tensor +## Per-axis vs per-tensor Per-tensor quantization means that there will be one scale and/or zero-point per entire tensor. Per-axis quantization means that there will be one scale and/or @@ -56,7 +60,7 @@ without performance implications. This has large improvements to accuracy. TFLite has per-axis support for a growing number of operations. At the time of this document support exists for Conv2d and DepthwiseConv2d. -### Symmetric vs asymmetric +## Symmetric vs asymmetric Activations are asymmetric: they can have their zero-point anywhere within the signed `int8` range `[-128, 127]`. Many activations are asymmetric in nature and @@ -95,7 +99,7 @@ The \\(\sum_{i=0}^{n} q_{a}^{(i)} z_b\\) term needs to be computed every inferen since the activation changes every inference. By enforcing weights to be symmetric we can remove the cost of this term. -### int8 quantized operator specifications +## int8 quantized operator specifications Below we describe the quantization requirements for our int8 tflite kernels: