Updates GPU delegate documentation with experimental quant support
PiperOrigin-RevId: 312165090 Change-Id: I8fb624f71101fce6a379ed24f6002f8f4b60245d
parent 3c54ef5ab9 · commit 4001e3dad3
@@ -31,7 +31,7 @@ models.
 For a step-by-step tutorial, watch the
 [GPU Delegate for Android](https://youtu.be/Xkhgre8r5G0) video.

-Note: This requires OpenGL ES 3.1 or higher.
+Note: This requires OpenCL or OpenGL ES (3.1 or higher).

 #### Step 1. Clone the TensorFlow source code and open it in Android Studio

@@ -1,9 +1,9 @@
 # TensorFlow Lite on GPU

 [TensorFlow Lite](https://www.tensorflow.org/mobile/tflite/) supports several
 hardware accelerators. This document describes how to use the GPU backend using
-the TensorFlow Lite delegate APIs on Android (requires OpenGL ES 3.1 or higher)
-and iOS (requires iOS 8 or later).
+the TensorFlow Lite delegate APIs on Android (requires OpenCL or OpenGL ES 3.1
+and higher) and iOS (requires iOS 8 or later).

 ## Benefits of GPU Acceleration

@@ -35,25 +35,33 @@ power and generating less heat than the same task run on a CPU.
 TensorFlow Lite on GPU supports the following ops in 16-bit and 32-bit float
 precision:

-* `ADD v1`
-* `AVERAGE_POOL_2D v1`
-* `CONCATENATION v1`
-* `CONV_2D v1`
-* `DEPTHWISE_CONV_2D v1-2`
-* `FULLY_CONNECTED v1`
-* `LOGISTIC v1`
-* `MAX_POOL_2D v1`
-* `MUL v1`
-* `PAD v1`
-* `PRELU v1`
-* `RELU v1`
-* `RELU6 v1`
-* `RESHAPE v1`
-* `RESIZE_BILINEAR v1`
-* `SOFTMAX v1`
-* `STRIDED_SLICE v1`
-* `SUB v1`
-* `TRANSPOSE_CONV v1`
+* `ADD`
+* `AVERAGE_POOL_2D`
+* `CONCATENATION`
+* `CONV_2D`
+* `DEPTHWISE_CONV_2D v1-2`
+* `EXP`
+* `FULLY_CONNECTED`
+* `LOGISTIC`
+* `LSTM v2 (Basic LSTM only)`
+* `MAX_POOL_2D`
+* `MAXIMUM`
+* `MINIMUM`
+* `MUL`
+* `PAD`
+* `PRELU`
+* `RELU`
+* `RELU6`
+* `RESHAPE`
+* `RESIZE_BILINEAR v1-3`
+* `SOFTMAX`
+* `STRIDED_SLICE`
+* `SUB`
+* `TRANSPOSE_CONV`
+
+By default, all ops are only supported at version 1. Enabling the
+[experimental quantization support](gpu_advanced.md#running-quantized-models-experimental-android-only)
+allows the appropriate versions; for example, ADD v2.

 ## Basic Usage

@@ -82,8 +90,8 @@ delegate.close();
 ### Android (C/C++)

 For C/C++ usage of TensorFlow Lite GPU on Android, the GPU delegate can be
-created with `TfLiteGpuDelegateCreate()` and destroyed with
-`TfLiteGpuDelegateDelete()`.
+created with `TfLiteGpuDelegateV2Create()` and destroyed with
+`TfLiteGpuDelegateV2Delete()`.

 ```c++
 // Set up interpreter.
@@ -94,15 +102,7 @@ std::unique_ptr<Interpreter> interpreter;
 InterpreterBuilder(*model, op_resolver)(&interpreter);

 // NEW: Prepare GPU delegate.
-const TfLiteGpuDelegateOptions options = {
-  .metadata = NULL,
-  .compile_options = {
-    .precision_loss_allowed = 1,  // FP16
-    .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
-    .dynamic_batch_enabled = 0,   // Not fully functional yet
-  },
-};
-auto* delegate = TfLiteGpuDelegateCreate(&options);
+auto* delegate = TfLiteGpuDelegateV2Create(/*default options=*/nullptr);
 if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

 // Run inference.
@@ -111,9 +111,13 @@ if (interpreter->Invoke() != kTfLiteOk) return false;
 ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));

 // NEW: Clean up.
-TfLiteGpuDelegateDelete(delegate);
+TfLiteGpuDelegateV2Delete(delegate);
 ```

+Take a look at `TfLiteGpuDelegateOptionsV2` to create a delegate instance with
+custom options. You can initialize the default options with
+`TfLiteGpuDelegateOptionsV2Default()` and then modify them as necessary.
+
 TFLite GPU for Android C/C++ uses the [Bazel](https://bazel.io) build system.
 The delegate can be built, for example, using the following command:

@@ -165,6 +169,43 @@ called.

 ## Advanced Usage

+### Running quantized models (Experimental, Android only)
+
+The GPU delegate already supports
+[float16 quantized](https://www.tensorflow.org/lite/performance/post_training_float16_quant)
+models. There is experimental support on Android to run 8-bit quantized models
+as well. This includes all flavors of quantization:
+
+* Models trained with
+  [quantization-aware training](https://www.tensorflow.org/lite/convert/quantization)
+* [Post-training dynamic-range quantization](https://www.tensorflow.org/lite/performance/post_training_quant)
+* [Post-training full-integer quantization](https://www.tensorflow.org/lite/performance/post_training_integer_quant)
+
+To optimize performance, use models that have floating-point input and output
+tensors.
+
+This feature can be enabled using delegate options as follows:
+
+**C++ API**
+
+```c++
+// NEW: Prepare custom options with the feature enabled.
+TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
+options.experimental_flags |= TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_QUANT;
+
+auto* delegate = TfLiteGpuDelegateV2Create(&options);
+if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;
+```
+
+**Java API**
+
+```java
+// NEW: Prepare GPU delegate with the feature turned on.
+GpuDelegate delegate = new GpuDelegate(new GpuDelegate.Options().setQuantizedModelsAllowed(true));
+
+Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
+```
+
 ### Delegate Options for iOS

 `NewGpuDelegate()` accepts a `struct` of options.
@@ -210,7 +251,7 @@ While it is convenient to use `nullptr`, we recommend that you explicitly set
 the options, to avoid any unexpected behavior if default values are changed in
 the future.

-### Input/Output Buffers
+### Input/Output Buffers (iOS only)

 To do computation on the GPU, data must be made available to the GPU. This often
 requires performing a memory copy. It is desirable not to cross the CPU/GPU
@@ -229,80 +270,10 @@ To achieve best performance, TensorFlow Lite makes it possible for users to
 directly read from and write to the TensorFlow hardware buffer and bypass
 avoidable memory copies.

-#### Android
-
-Assuming the image input is in the GPU memory, it must first be converted to an
-OpenGL Shader Storage Buffer Object (SSBO). You can associate a TfLiteTensor to
-a user-prepared SSBO with `Interpreter.bindGlBufferToTensor()`. Note that
-`Interpreter.bindGlBufferToTensor()` must be called before
-`Interpreter.modifyGraphWithDelegate()`.
-
-```java
-// Ensure a valid EGL rendering context.
-EGLContext eglContext = eglGetCurrentContext();
-if (eglContext.equals(EGL_NO_CONTEXT)) return false;
-
-// Create an SSBO.
-int[] id = new int[1];
-glGenBuffers(id.length, id, 0);
-glBindBuffer(GL_SHADER_STORAGE_BUFFER, id[0]);
-glBufferData(GL_SHADER_STORAGE_BUFFER, inputSize, null, GL_STREAM_COPY);
-glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0); // unbind
-int inputSsboId = id[0];
-
-// Create interpreter.
-Interpreter interpreter = new Interpreter(tfliteModel);
-Tensor inputTensor = interpreter.getInputTensor(0);
-GpuDelegate gpuDelegate = new GpuDelegate();
-// The buffer must be bound before the delegate is installed.
-gpuDelegate.bindGlBufferToTensor(inputTensor, inputSsboId);
-interpreter.modifyGraphWithDelegate(gpuDelegate);
-
-// Run inference; the null input argument indicates use of the bound buffer for input.
-fillSsboWithCameraImageTexture(inputSsboId);
-float[] outputArray = new float[outputSize];
-interpreter.runInference(null, outputArray);
-```
-
-A similar approach can be applied to the output tensor. In that case,
-`Interpreter.Options.setAllowBufferHandleOutput(true)` should be passed on, to
-disable the default copying of the network's output from GPU memory to CPU
-memory.
-
-```java
-// Ensure a valid EGL rendering context.
-EGLContext eglContext = eglGetCurrentContext();
-if (eglContext.equals(EGL_NO_CONTEXT)) return false;
-
-// Create an SSBO.
-int[] id = new int[1];
-glGenBuffers(id.length, id, 0);
-glBindBuffer(GL_SHADER_STORAGE_BUFFER, id[0]);
-glBufferData(GL_SHADER_STORAGE_BUFFER, outputSize, null, GL_STREAM_COPY);
-glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0); // unbind
-int outputSsboId = id[0];
-
-// Create interpreter.
-Interpreter.Options options = (new Interpreter.Options()).setAllowBufferHandleOutput(true);
-Interpreter interpreter = new Interpreter(tfliteModel, options);
-Tensor outputTensor = interpreter.getOutputTensor(0);
-GpuDelegate gpuDelegate = new GpuDelegate();
-// The buffer must be bound before the delegate is installed.
-gpuDelegate.bindGlBufferToTensor(outputTensor, outputSsboId);
-interpreter.modifyGraphWithDelegate(gpuDelegate);
-
-// Run inference; the null output argument indicates use of the bound buffer for output.
-ByteBuffer input = getCameraImageByteBuffer();
-interpreter.runInference(input, null);
-renderOutputSsbo(outputSsboId);
-```
-
-#### iOS
-
 Assuming the image input is in GPU memory, it must first be converted to a
 `MTLBuffer` object for Metal. You can associate a TfLiteTensor to a
-user-prepared `MTLBuffer` with `BindMetalBufferToTensor()`. Note that
-`BindMetalBufferToTensor()` must be called before
+user-prepared `MTLBuffer` with `TFLGpuDelegateBindMetalBufferToTensor()`. Note
+that `TFLGpuDelegateBindMetalBufferToTensor()` must be called before
 `Interpreter::ModifyGraphWithDelegate()`. Additionally, the inference output is,
 by default, copied from GPU memory to CPU memory. This behavior can be turned
 off by calling `Interpreter::SetAllowBufferHandleOutput(true)` during
@@ -312,8 +283,8 @@ initialization.
 // Prepare GPU delegate.
 auto* delegate = NewGpuDelegate(nullptr);
 interpreter->SetAllowBufferHandleOutput(true);  // disable default gpu->cpu copy
-if (!BindMetalBufferToTensor(delegate, interpreter->inputs()[0], user_provided_input_buffer)) return false;
-if (!BindMetalBufferToTensor(delegate, interpreter->outputs()[0], user_provided_output_buffer)) return false;
+if (!TFLGpuDelegateBindMetalBufferToTensor(delegate, interpreter->inputs()[0], user_provided_input_buffer)) return false;
+if (!TFLGpuDelegateBindMetalBufferToTensor(delegate, interpreter->outputs()[0], user_provided_output_buffer)) return false;
 if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

 // Run inference.
@@ -89,9 +89,9 @@ The following types of quantization are available in TensorFlow Lite:
 Technique | Data requirements | Size reduction | Accuracy | Supported hardware
 ------------------------------------------------------------------------------------------------------- | -------------------------------- | -------------- | --------------------------- | ------------------
 [Post-training float16 quantization](post_training_float16_quant.ipynb) | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU
-[Post-training dynamic range quantization](post_training_quant.ipynb) | No data | Up to 75% | Accuracy loss | CPU
-[Post-training integer quantization](post_training_integer_quant.ipynb) | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, EdgeTPU, Hexagon DSP
-[Quantization-aware training](http://www.tensorflow.org/model_optimization/guide/quantization/training) | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, EdgeTPU, Hexagon DSP
+[Post-training dynamic range quantization](post_training_quant.ipynb) | No data | Up to 75% | Accuracy loss | CPU, GPU (Android)
+[Post-training integer quantization](post_training_integer_quant.ipynb) | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP
+[Quantization-aware training](http://www.tensorflow.org/model_optimization/guide/quantization/training) | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP

 Below are the latency and accuracy results for post-training quantization and
 quantization-aware training on a few models. All latency numbers are measured on