Minor TF Lite doc update

PiperOrigin-RevId: 300461356
Change-Id: I850174418d894144f58d46844ad74b9ef16bf5be
Khanh LeViet 2020-03-11 19:49:33 -07:00 committed by TensorFlower Gardener
parent a4f2960e0e
commit 4783a85dec
6 changed files with 163 additions and 73 deletions


@ -71,18 +71,9 @@ upper_tabs:
path: /lite/performance/best_practices
- title: "Benchmarks"
path: /lite/performance/benchmarks
- title: "Model optimization"
path: /lite/performance/model_optimization
- title: "Post-training quantization"
path: /lite/performance/post_training_quantization
- title: "Post-training weight quantization"
path: /lite/performance/post_training_quant
- title: "Post-training integer quantization"
path: /lite/performance/post_training_integer_quant
- title: "Post-training float16 quantization"
path: /lite/performance/post_training_float16_quant
- title: "Delegates"
path: /lite/performance/delegates
status: experimental
- title: "GPU delegate"
path: /lite/performance/gpu
- title: "Advanced GPU"
@ -92,6 +83,18 @@ upper_tabs:
- title: "Hexagon delegate"
path: /lite/performance/hexagon_delegate
status: experimental
- heading: "Optimize a model"
- title: "Overview"
path: /lite/performance/model_optimization
- title: "Post-training quantization"
path: /lite/performance/post_training_quantization
- title: "Post-training weight quantization"
path: /lite/performance/post_training_quant
- title: "Post-training integer quantization"
path: /lite/performance/post_training_integer_quant
- title: "Post-training float16 quantization"
path: /lite/performance/post_training_float16_quant
- title: "Quantization specification"
path: /lite/performance/quantization_spec


@ -7,7 +7,7 @@ These performance benchmark numbers were generated with the
[Android TFLite benchmark binary](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark)
and the [iOS benchmark app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/ios).
# Android performance benchmarks
## Android performance benchmarks
For Android benchmarks, the CPU affinity is set to use big cores on the device to
reduce variance (see [details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark#reducing-variance-between-runs-on-android)).
@ -135,7 +135,7 @@ The performance values below are measured on Android 10.
</table>
# iOS benchmarks
## iOS benchmarks
To run iOS benchmarks, the
[benchmark app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/ios)


@ -2,8 +2,8 @@
Mobile and embedded devices have limited computational resources, so it is
important to keep your application resource efficient. We have compiled a list
of best practices and strategies that you can use to optimize your model and
application when using TensorFlow Lite.
of best practices and strategies that you can use to improve your TensorFlow
Lite model performance.
## Choose the best model for the task
@ -23,13 +23,20 @@ One example of models optimized for mobile devices are
vision applications. [Hosted models](../models/hosted.md) lists several other
models that have been optimized specifically for mobile and embedded devices.
You can retrain the listed models on your own dataset by using transfer learning. Check out our transfer learning tutorial for
[image classification](https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0) and
[object detection](https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193).
You can retrain the listed models on your own dataset by using transfer
learning. Check out our transfer learning tutorial for
[image classification](https://colab.sandbox.google.com/github/tensorflow/examples/blob/master/tensorflow_examples/lite/model_maker/demo/image_classification.ipynb)
and
[object detection](https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193).
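The linked image classification tutorial uses TF Lite Model Maker; as a rough
sketch of the same idea with plain Keras (the dataset directory, image size,
number of classes, and training settings below are placeholders, not values
from the tutorial), retraining and converting might look like:

```python
import tensorflow as tf

# Placeholder dataset: a directory with one subfolder of images per class.
train_data = tf.keras.preprocessing.image_dataset_from_directory(
    "my_dataset/", image_size=(224, 224), batch_size=32)

# Reuse a mobile-friendly backbone and train only a small classification head.
# Note: MobileNetV2 expects inputs scaled to [-1, 1]; add preprocessing to match.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 = number of classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_data, epochs=5)

# Convert the retrained model to TensorFlow Lite.
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
```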
## Profile your model
Once you have selected a candidate model that is right for your task, it is a good practice to profile and benchmark your model. TensorFlow Lite [benchmarking tool](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark) has a built-in profiler that shows per operator profiling statistics. This can help in understanding performance bottlenecks and which operators dominate the computation time.
Once you have selected a candidate model that is right for your task, it is a
good practice to profile and benchmark your model. The TensorFlow Lite
[benchmarking tool](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark)
has a built-in profiler that shows per-operator profiling statistics. This can
help in understanding performance bottlenecks and which operators dominate the
computation time.
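Before reaching for the full benchmark binary, a quick latency check from
Python can already be informative. A minimal sketch (the model path is a
placeholder, and random input is used only to exercise the interpreter):

```python
import time

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]

# Random data just exercises the model; use real inputs for meaningful numbers.
dummy_input = np.random.random_sample(input_details["shape"]).astype(
    input_details["dtype"])
interpreter.set_tensor(input_details["index"], dummy_input)
interpreter.invoke()  # warm-up run

runs = 50
start = time.perf_counter()
for _ in range(runs):
  interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print("Average inference time: %.2f ms" % elapsed_ms)
```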
## Profile and optimize operators in the graph
@ -43,23 +50,12 @@ operator is executed. Check out our
## Optimize your model
Model compression aims to create smaller models that are generally faster and
more energy efficient, so that they can be deployed on mobile devices.
Model optimization aims to create smaller models that are generally faster and
more energy efficient, so that they can be deployed on mobile devices. There are
multiple optimization techniques supported by TensorFlow Lite, such as
quantization.
### Quantization
If your model uses floating-point weights or activations, then it may be
possible to reduce the size of model up to ~4x by using quantization, which
effectively turns the float weights to 8-bit. There are two flavors of
quantization: [post-training quantization](post_training_quantization.md) and
[quantized training](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md){:.external}.
The former does not require model re-training, but, in rare cases, may have
accuracy loss. When accuracy loss is beyond acceptable thresholds, quantized
training should be used instead.
We strongly recommend running benchmarks to make sure that the accuracy is not
impacted during model compression. Check out our
[model optimization docs](model_optimization.md) for details.
Check out our [model optimization docs](model_optimization.md) for details.
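As one example, post-training quantization of the weights only needs a couple
of lines at conversion time. A minimal sketch, assuming a SavedModel directory
(the path is a placeholder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model/")
# Quantize weights to 8 bits at rest; activations stay in float.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
  f.write(tflite_quant_model)
```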
## Tweak the number of threads
@ -87,7 +83,14 @@ the Java API is a lot faster if ByteBuffers are used as
[inputs](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java#L175).
## Profile your application with platform specific tools
Platform specific tools like [Android profiler](https://developer.android.com/studio/profile/android-profiler) and [Instruments](https://help.apple.com/instruments/mac/current/) provide a wealth of profiling information that can be used to debug your app. Sometimes the performance bug may be not in the model but in parts of application code that interact with the model. Make sure to familiarize yourself with platform specific profiling tools and best practices for your platform.
Platform specific tools like
[Android profiler](https://developer.android.com/studio/profile/android-profiler)
and [Instruments](https://help.apple.com/instruments/mac/current/) provide a
wealth of profiling information that can be used to debug your app. Sometimes
the performance bug may not be in the model but in parts of the application code
that interact with the model. Make sure to familiarize yourself with platform
specific profiling tools and best practices for your platform.
## Evaluate whether your model benefits from using hardware accelerators available on the device
@ -102,18 +105,23 @@ interpreter execution. TensorFlow Lite can use delegates by:
efficiency of your model. To enable the Neural Networks API, call
[UseNNAPI](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/interpreter.h#L343)
on the interpreter instance.
* A binary-only GPU delegate has been released for Android and iOS, using
OpenGL and Metal, respectively. To try them out, see the
[GPU delegate tutorial](gpu.md) and [documentation](gpu_advanced.md).
* The GPU delegate is available on Android and iOS, using OpenGL/OpenCL and Metal,
respectively. To try them out, see the [GPU delegate tutorial](gpu.md) and
[documentation](gpu_advanced.md).
* The Hexagon delegate is available on Android. It leverages the Qualcomm Hexagon
DSP if it is available on the device. See the
[Hexagon delegate tutorial](hexagon_delegate.md) for more information.
* It is possible to create your own delegate if you have access to
non-standard hardware. See [TensorFlow Lite delegates](delegates.md) for
more information.
Be aware that some accelerators work better for different types of models. It is
important to benchmark each delegate to see if it is a good choice for your
application. For example, if you have a very small model, it may not be worth
delegating the model to either the NN API or the GPU. Conversely, accelerators
are a great choice for large models that have high arithmetic intensity.
Be aware that some accelerators work better for different types of models. Some
delegates only support float models or models optimized in a specific way. It is
important to [benchmark](benchmarks.md) each delegate to see if it is a good
choice for your application. For example, if you have a very small model, it may
not be worth delegating the model to either the NN API or the GPU. Conversely,
accelerators are a great choice for large models that have high arithmetic
intensity.
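For reference, a delegate can also be attached from the Python interpreter,
which is handy for quick experiments on a desktop or embedded Linux board. This
is only a sketch: the shared library name below is a placeholder, and on Android
and iOS the NNAPI, GPU, and Hexagon delegates are normally enabled through the
Java, Swift/Objective-C, or C++ interpreter options instead.

```python
import tensorflow as tf

# Load a delegate shared library (placeholder name) and hand it to the interpreter.
delegate = tf.lite.experimental.load_delegate("libsome_delegate.so")
interpreter = tf.lite.Interpreter(model_path="model.tflite",
                                  experimental_delegates=[delegate])
interpreter.allocate_tensors()
```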
## Need more help


@ -1,7 +1,6 @@
# TensorFlow Lite delegates
_Note: Delegate API is still experimental and is subject to change._
Note: Delegate API is still experimental and is subject to change.
## What is a TensorFlow Lite delegate?


@ -1,39 +1,102 @@
# Model optimization
Edge devices often have limited memory or computational power. Various
optimizations can be applied to models so that they can be run within these
constraints. In addition, some optimizations allow the use of specialized
hardware for accelerated inference.
TensorFlow Lite and the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization)
provide tools to minimize the complexity of optimizing inference.
Inference efficiency is particularly important for edge devices, such as mobile
and Internet of Things (IoT) devices. Such devices have many restrictions on
processing, memory, power consumption, and storage for models. Furthermore,
model optimization unlocks the processing power of fixed-point hardware and
next-generation hardware accelerators.
It's recommended that you consider model optimization during your application
development process. This document outlines some best practices for optimizing
TensorFlow models for deployment to edge hardware.
## Model quantization
## Why models should be optimized
Quantizing deep neural networks uses techniques that allow for reduced precision
representations of weights and, optionally, activations for both storage and
computation. Quantization provides several benefits:
There are several main ways model optimization can help with application
development.
* Support on existing CPU platforms.
* Quantization of activations reduces memory access costs for reading and storing intermediate activations.
* Many CPU and hardware accelerator implementations provide SIMD instruction capabilities, which are especially beneficial for quantization.
### Size reduction
TensorFlow Lite provides several levels of support for quantization.
Some forms of optimization can be used to reduce the size of a model. Smaller
models have the following benefits:
* Tensorflow Lite [post-training quantization](post_training_quantization.md)
quantizes weights and activations post training easily.
* [Quantization-aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize){:.external}
allows for training of networks that can be quantized with minimal accuracy
drop; this is only available for a subset of convolutional neural network
architectures.
- **Smaller storage size:** Smaller models occupy less storage space on your
users' devices. For example, an Android app using a smaller model will take
up less storage space on a user's mobile device.
- **Smaller download size:** Smaller models require less time and bandwidth to
download to users' devices.
- **Less memory usage:** Smaller models use less RAM when they are run, which
frees up memory for other parts of your application to use, and can
translate to better performance and stability.
### Latency and accuracy results
Quantization can reduce the size of a model in all of these cases, potentially
at the expense of some accuracy. Pruning can reduce the size of a model for
download by making it more easily compressible.
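A quick way to see the compressibility effect is to compare the gzip-compressed
size of a model file before and after optimization. The file names below are
placeholders:

```python
import gzip
import os

for path in ["model_baseline.tflite", "model_pruned.tflite"]:
  raw = open(path, "rb").read()
  print("%s: %d bytes on disk, %d bytes gzip-compressed"
        % (path, os.path.getsize(path), len(gzip.compress(raw))))
```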
### Latency reduction
*Latency* is the amount of time it takes to run a single inference with a given
model. Some forms of optimization can reduce the amount of computation required
to run inference using a model, resulting in lower latency. Latency can also
have an impact on power consumption.
Currently, quantization can be used to reduce latency by simplifying the
calculations that occur during inference, potentially at the expense of some
accuracy.
### Accelerator compatibility
Some hardware accelerators, such as the
[Edge TPU](https://cloud.google.com/edge-tpu/), can run inference extremely fast
with models that have been correctly optimized.
Generally, these types of devices require models to be quantized in a specific
way. See each hardware accelerator's documentation to learn more about their
requirements.
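For example, integer-only accelerators generally expect a fully integer
quantized model. A sketch of that conversion (paths are placeholders, and random
calibration data is used only to keep the example self-contained; a real
representative dataset should come from your training or evaluation data):

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
  # Replace with a few hundred real samples shaped like the model's input.
  for _ in range(100):
    yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force integer-only operations, inputs, and outputs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()
```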
## Trade-offs
Optimizations can potentially result in changes in model accuracy, which must be
considered during the application development process.
The accuracy changes depend on the individual model being optimized, and are
difficult to predict ahead of time. Generally, models that are optimized for
size or latency will lose a small amount of accuracy. Depending on your
application, this may or may not impact your users' experience. In rare cases,
certain models may gain some accuracy as a result of the optimization process.
## Types of optimization
TensorFlow Lite currently supports optimization via quantization and pruning.
These are part of the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization),
which provides resources for model optimization techniques that are compatible
with TensorFlow Lite.
### Quantization
[Quantization](https://www.tensorflow.org/model_optimization/guide/quantization)
works by reducing the precision of the numbers used to represent a model's
parameters, which by default are 32-bit floating point numbers. This results in
a smaller model size and faster computation.
The following types of quantization are available in TensorFlow Lite:
Technique | Data requirements | Size reduction | Accuracy | Supported hardware
-------------------------------------------------------------------------------------------------------------- | -------------------------------- | -------------- | --------------------------- | ------------------
[Post-training float16 quantization](post_training_float16_quant.ipynb) | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU
[Post-training weight quantization](post_training_quant.ipynb) | No data | Up to 75% | Accuracy loss | CPU
[Post-training integer quantization](post_training_integer_quant.ipynb) | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, EdgeTPU, Hexagon DSP
[Quantization-aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize) | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, EdgeTPU, Hexagon DSP
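As an illustration, the float16 variant only requires setting the supported
types on the converter. A minimal sketch, assuming a SavedModel directory (the
path is a placeholder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Store weights as float16, roughly halving the model size.
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()
```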
Below are the latency and accuracy results for post-training quantization and
quantization-aware training on a few models. All latency numbers are measured on
Pixel&nbsp;2 devices using a single big core. As the toolkit improves, so will the numbers here:
Pixel 2 devices using a single big core CPU. As the toolkit improves, so will
the numbers here:
<figure>
<table>
@ -61,7 +124,17 @@ Pixel&nbsp;2 devices using a single big core. As the toolkit improves, so will t
</figcaption>
</figure>
## Choice of tool
### Pruning
[Pruning](https://www.tensorflow.org/model_optimization/guide/pruning) works by
removing parameters within a model that have only a minor impact on its
predictions. Pruned models are the same size on disk, and have the same runtime
latency, but can be compressed more effectively. This makes pruning a useful
technique for reducing model download size.
In the future, TensorFlow Lite will provide latency reduction for pruned models.
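A rough sketch of what pruning looks like with the Model Optimization Toolkit
(`model` and `train_data` are placeholders for an existing Keras model and
dataset, and the sparsity schedule is just an example):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the model so that low-magnitude weights are gradually zeroed out.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5,
        begin_step=0, end_step=1000))

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
pruned_model.fit(train_data, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before converting; the zeroed weights make the
# resulting .tflite file compress much better (for example with gzip).
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
tflite_model = tf.lite.TFLiteConverter.from_keras_model(final_model).convert()
```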
## Development workflow
As a starting point, check if the models in
[hosted models](../guide/hosted_models.md) can work for your application. If
@ -76,3 +149,6 @@ is the better option. See additional optimization techniques under the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization).
Note: Quantization-aware training supports a subset of convolutional neural network architectures.
If you want to further reduce your model size, you can try [pruning](#pruning)
prior to quantizing your models.


@ -1,6 +1,10 @@
# TensorFlow Lite 8-bit quantization specification
### Specification summary
The following document outlines the specification for TensorFlow Lite's 8-bit
quantization scheme. This is intended to assist hardware developers in providing
hardware support for inference with quantized TensorFlow Lite models.
## Specification summary
We are providing a specification, and we can only provide some guarantees on
behaviour if the spec is followed. We also understand different hardware may
@ -27,14 +31,14 @@ Note: In the past our quantized tooling used per-tensor, asymmetric, `uint8`
quantization. New tooling, reference kernels, and optimized kernels for 8-bit
quantization will use this spec.
### Signed integer vs unsigned integer
## Signed integer vs unsigned integer
TensorFlow Lite quantization will primarily prioritize tooling and kernels for
`int8` quantization for 8-bit. This is for the convenience of symmetric
quantization being represented by zero-point equal to 0. Additionally many
backends have additional optimizations for `int8xint8` accumulation.
### Per-axis vs per-tensor
## Per-axis vs per-tensor
Per-tensor quantization means that there will be one scale and/or zero-point per
entire tensor. Per-axis quantization means that there will be one scale and/or
@ -56,7 +60,7 @@ without performance implications. This has large improvements to accuracy.
TFLite has per-axis support for a growing number of operations. At the time of
this document support exists for Conv2d and DepthwiseConv2d.
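To make the difference concrete, here is a small NumPy sketch (toy values, not
taken from the spec) of symmetric weight quantization done per-tensor versus
per-axis:

```python
import numpy as np

# Toy weights: 3 output channels with very different value ranges.
weights = np.array([[0.9, -1.2, 0.3, 0.1],
                    [0.02, -0.01, 0.03, 0.0],
                    [5.0, -4.0, 2.5, 1.0]], dtype=np.float32)

# Per-tensor: one scale for everything; the small-valued channel loses most of
# its resolution because the scale is dominated by the largest channel.
per_tensor_scale = np.abs(weights).max() / 127.0
q_per_tensor = np.round(weights / per_tensor_scale).astype(np.int8)

# Per-axis: one scale per output channel (quantized_dimension = 0), so each
# channel uses the full int8 range.
per_axis_scales = np.abs(weights).max(axis=1) / 127.0
q_per_axis = np.round(weights / per_axis_scales[:, None]).astype(np.int8)
```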
### Symmetric vs asymmetric
## Symmetric vs asymmetric
Activations are asymmetric: they can have their zero-point anywhere within the
signed `int8` range `[-128, 127]`. Many activations are asymmetric in nature and
@ -95,7 +99,7 @@ The \\(\sum_{i=0}^{n} q_{a}^{(i)} z_b\\) term needs to be computed every inferen
since the activation changes every inference. By enforcing weights to be
symmetric we can remove the cost of this term.
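Spelling this out (a restatement of the reasoning above, with \\(z_a, z_b\\) the
zero-points and \\(q_{a}^{(i)}, q_{b}^{(i)}\\) the quantized activation and
weight values), the quantized dot product expands to

$$
\sum_{i=0}^{n} (q_{a}^{(i)} - z_a)(q_{b}^{(i)} - z_b) =
\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)}
- z_b \sum_{i=0}^{n} q_{a}^{(i)}
- z_a \sum_{i=0}^{n} q_{b}^{(i)}
+ (n + 1)\, z_a z_b
$$

Setting \\(z_b = 0\\) (symmetric weights) removes the term that would otherwise
have to be recomputed every inference because the activations \\(q_{a}^{(i)}\\)
change, while the remaining \\(z_a \sum_{i} q_{b}^{(i)}\\) term depends only on
the fixed weights and can be folded into the bias ahead of time.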
### int8 quantized operator specifications
## int8 quantized operator specifications
Below we describe the quantization requirements for our int8 tflite kernels: