161 lines
5.6 KiB
Markdown
161 lines
5.6 KiB
Markdown
# TFLite on GPU
|
|
|
|
TensorFlow Lite (TFLite) supports several hardware accelerators. This document
|
|
describes how to use the GPU backend using the TFLite delegate APIs on Android
|
|
and iOS.
|
|
|
|
GPUs are designed to have high throughput for massively parallelizable
|
|
workloads. Thus, they are well-suited for deep neural nets which consists of a
|
|
huge number of operators, each working on some input tensor(s) that can be
|
|
easily divided into smaller workloads and carried out in parallel, typically
|
|
resulting in lower latency. In the best scenario, inference on the GPU may now
|
|
run fast enough and now become suitable for real-time applications if it was not
|
|
before.
|
|
|
|
GPUs do their computation with 16-bit or 32-bit floating point numbers and do
|
|
not require quantization for optimal performance unlike the CPUs. If
|
|
quantization of your neural network was not an option due to lower accuracy
|
|
caused by lost precision, such concern can be discarded when running deep neural
|
|
net models on the GPU.
|
|
|
|
Another benefit that comes with GPU inference is its power efficiency. GPUs
|
|
carry out the computations in a very efficient and optimized way, so that they
|
|
consume less power and generate less heat than when the same task is run on the
|
|
CPUs.
|
|
|
|
TFLite on GPU supports the following ops in 16-bit and 32-bit float precision:
|
|
|
|
* `ADD v1`
|
|
* `AVERAGE_POOL_2D v1`
|
|
* `CONCATENATION v1`
|
|
* `CONV_2D v1`
|
|
* `DEPTHWISE_CONV_2D v1-2`
|
|
* `EXP v1`
|
|
* `FULLY_CONNECTED v1`
|
|
* `LOGISTIC v1`
|
|
* `LSTM v2 (Basic LSTM only)`
|
|
* `MAX_POOL_2D v1`
|
|
* `MAXIMUM v1`
|
|
* `MINIMUM v1`
|
|
* `MUL v1`
|
|
* `PAD v1`
|
|
* `PRELU v1`
|
|
* `RELU v1`
|
|
* `RELU6 v1`
|
|
* `RESHAPE v1`
|
|
* `RESIZE_BILINEAR v1-3`
|
|
* `SOFTMAX v1`
|
|
* `STRIDED_SLICE v1`
|
|
* `SUB v1`
|
|
* `TRANSPOSE_CONV v1`
|
|
|
|
## Basic Usage
|
|
|
|
Using TFLite on GPU is as simple as getting the GPU delegate via
|
|
`TfLiteGpuDelegateV2Create()` and then passing it to
|
|
`Interpreter::ModifyGraphWithDelegate()` instead of calling
|
|
`Interpreter::AllocateTensors()`:
|
|
|
|
```c++
|
|
////////
|
|
// Set up interpreter.
|
|
auto model = FlatBufferModel::BuildFromFile(model_path);
|
|
ops::builtin::BuiltinOpResolver op_resolver;
|
|
std::unique_ptr<Interpreter> interpreter;
|
|
InterpreterBuilder(*model, op_resolver)(&interpreter);
|
|
|
|
////////
|
|
// NEW: Prepare GPU delegate.
|
|
auto* delegate = TfLiteGpuDelegateV2Create(/*default options=*/nullptr);
|
|
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return;
|
|
|
|
////////
|
|
// Run inference.
|
|
WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
|
|
if (interpreter->Invoke() != kTfLiteOk) return;
|
|
ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));
|
|
|
|
////////
|
|
// Clean up.
|
|
TfLiteGpuDelegateV2Delete(delegate);
|
|
```
|
|
|
|
*IMPORTANT:* When calling `Interpreter::ModifyGraphWithDelegate()` or
|
|
`Interpreter::Invoke()`, the caller must have a `EGLContext` in the current
|
|
thread and `Interpreter::Invoke()` must be called from the same `EGLContext`.
|
|
If such `EGLContext` does not exist, the delegate will internally create one,
|
|
but then the developer must ensure that `Interpreter::Invoke()` is always called
|
|
from the same thread `Interpreter::ModifyGraphWithDelegate()` was called.
|
|
|
|
## Building and Runtime
|
|
|
|
TFLite GPU backend uses OpenGL ES 3.1 compute shaders or OpenCL.
|
|
|
|
```sh
|
|
bazel build --config android_arm64 //path/to/your:project
|
|
```
|
|
|
|
Metal shaders are used for iOS, which were introduced with iOS 8. Thus,
|
|
compilation flags should look like:
|
|
|
|
```sh
|
|
bazel build --config ios_arm64 //path/to/your:project
|
|
```
|
|
|
|
## Advanced Usage: Delegate Options
|
|
|
|
There are GPU options that can be set and passed on to
|
|
`TfLiteGpuDelegateCreate()`. When option is set to `nullptr` as shown in the
|
|
Basic Usage, it translates to:
|
|
|
|
```c++
|
|
const TfLiteGpuDelegateOptionsV2 kDefaultOptions =
|
|
TfLiteGpuDelegateOptionsV2Default();
|
|
```
|
|
|
|
Similar for `NewTfLiteMetalDelegate()`:
|
|
|
|
```c++
|
|
const TfLiteMetalDelegateOptions kDefaultOptions = {
|
|
.precision_loss_allowed = 0, // false
|
|
.wait_type = TFLITE_METAL_WAIT_TYPE_SLEEP,
|
|
};
|
|
```
|
|
|
|
While it is convenient to just supply `nullptr`, it is recommended to explicitly
|
|
set the options to avoid any unexpected artifacts in case default values are
|
|
changed.
|
|
|
|
*IMPORTANT:* Note that the default option does not allow precision loss, and
|
|
thus may not be the fastest. For faster execution, you may want to set
|
|
`precision_loss_allowed` to `1` for FP16 execution.
|
|
|
|
## Tips and Tricks
|
|
|
|
* Some operations that are trivial on CPU side may be high cost in GPU land.
|
|
One class of such operation is various forms of reshape operations (including
|
|
`BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, etc.). If those ops
|
|
are inserted into the network just for the network architect's logical
|
|
thinking, it is worth removing them for performance.
|
|
|
|
* On GPU, tensor data is sliced into 4-channels. Thus, a computation on a
|
|
tensor of shape `[B, H, W, 5]` will perform about the same on a tensor of
|
|
shape `[B, H, W, 8]`, but significantly worse than `[B, H, W, 4]`.
|
|
|
|
* In that sense, if the camera hardware supports image frames in RGBA, feeding
|
|
that 4-channel input is significantly faster as a memory copy (from 3-channel
|
|
RGB to 4-channel RGBX) can be avoided.
|
|
|
|
* For performance [best practices](https://www.tensorflow.org/lite/performance/best_practices), do not hesitate to re-train your classifier with
|
|
mobile-optimized network architecture. That is a significant part of
|
|
optimization for on-device inference.
|
|
|
|
## Publication
|
|
|
|
* [On-Device Neural Net Inference with Mobile GPUs](https://arxiv.org/abs/1907.01989)
|
|
* Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan
|
|
Shieh, Fabio Riccardi, Raman Sarokin, Andrei Kulik, and Matthias
|
|
Grundmann
|
|
* CVPR Workshop
|
|
[Efficient Deep Learning for Computer Vision (ECV2019)](https://sites.google.com/corp/view/ecv2019)
|