Update delegates intro page with better details on support & tooling. Also add an 'implementing delegates' page that will be polished later with simpler instructions on delegate authoring.
PiperOrigin-RevId: 338165590 Change-Id: Ifd6e0fd261e157c41321f7dacc2fb16f40bd7769
This commit is contained in:
parent
2638bb9920
commit
19c8b34112
@@ -139,7 +139,6 @@ upper_tabs:
         path: /lite/performance/measurement
       - title: "Delegates"
         path: /lite/performance/delegates
-        status: experimental
       - title: "GPU delegate"
         path: /lite/performance/gpu
       - title: "Advanced GPU"
@@ -152,6 +151,9 @@ upper_tabs:
       - title: "Core ML delegate"
        path: /lite/performance/coreml_delegate
         status: experimental
+      - title: "Implementing a delegate"
+        path: /lite/performance/implementing_delegate
+        status: experimental
       - heading: "Optimize a model"
       - title: "Overview"
@@ -1,31 +1,61 @@
-# TensorFlow Lite delegates
-
-Note: Delegate API is still experimental and is subject to change.
-
-## What is a TensorFlow Lite delegate?
-
-A TensorFlow Lite delegate is a way to delegate part or all of graph execution
-to another executor.
-
-## Why should I use delegates?
-
-Running inference on compute-heavy machine learning models on mobile devices is
-resource demanding due to the devices' limited processing and power.
-
-Instead of relying on the CPU, some devices have hardware accelerators, such as
-GPU or DSP, that allows for better performance and higher energy efficiency.
-
-## Using the built-in delegates
-
-TensorFlow Lite provides the following delegates for hardware acceleration:
-
-*   **GPU delegate for cross platform acceleration** - The GPU delegate can be
-    used on both Android and iOS. It is optimized to run 32-bit and 16-bit float
-    based models where a GPU is available. It also supports 8-bit quantized
-    models and provides GPU performance on par with their float versions. For
-    details on the GPU delegate, see [TensorFlow Lite on GPU](gpu_advanced.md).
-    For step-by-step tutorials on using the GPU delegate with Android and iOS,
-    see [TensorFlow Lite GPU Delegate Tutorial](gpu.md).
+# TensorFlow Lite Delegates
+
+## Introduction
+
+**Delegates** enable hardware acceleration of TensorFlow Lite models by
+leveraging on-device accelerators such as the GPU and
+[Digital Signal Processor (DSP)](https://en.wikipedia.org/wiki/Digital_signal_processor).
+
+By default, TensorFlow Lite utilizes CPU kernels that are optimized for the
+[ARM Neon](https://developer.arm.com/documentation/dht0002/a/Introducing-NEON/NEON-architecture-overview/NEON-instructions)
+instruction set. However, the CPU is a multi-purpose processor that isn't
+necessarily optimized for the heavy arithmetic typically found in Machine
+Learning models (for example, the matrix math involved in convolution and dense
+layers).
+
+On the other hand, most modern mobile phones contain chips that are better at
+handling these heavy operations. Utilizing them for neural network operations
+provides huge benefits in terms of latency and power efficiency. For example,
+GPUs can provide up to a
+[5x speedup](https://blog.tensorflow.org/2020/08/faster-mobile-gpu-inference-with-opencl.html)
+in latency, while the
+[Qualcomm® Hexagon DSP](https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor)
+has been shown to reduce power consumption by up to 75% in our experiments.
+
+Each of these accelerators has associated APIs that enable custom computations,
+such as [OpenCL](https://www.khronos.org/opencl/) or
+[OpenGL ES](https://www.khronos.org/opengles/) for mobile GPU and the
+[Qualcomm® Hexagon SDK](https://developer.qualcomm.com/software/hexagon-dsp-sdk)
+for DSP. Typically, you would have to write a lot of custom code to run a neural
+network through these interfaces. Things get even more complicated when you
+consider that each accelerator has its pros & cons and cannot execute every
+operation in a neural network. TensorFlow Lite's Delegate API solves this
+problem by acting as a bridge between the TFLite runtime and these lower-level
+APIs.
+
+![]()
+
+## Choosing a Delegate
+
+TensorFlow Lite supports multiple delegates, each of which is optimized for
+certain platform(s) and particular types of models. Usually, there will be
+multiple delegates applicable to your use-case, depending on two major criteria:
+the *Platform* (Android or iOS?) you target, and the *Model-type*
+(floating-point or quantized?) that you are trying to accelerate.
+
+### Delegates by Platform
+
+#### Cross-platform (Android & iOS)
+
+*   **GPU delegate** - The GPU delegate can be used on both Android and iOS. It
+    is optimized to run 32-bit and 16-bit float based models where a GPU is
+    available. It also supports 8-bit quantized models and provides GPU
+    performance on par with their float versions. For details on the GPU
+    delegate, see [TensorFlow Lite on GPU](gpu_advanced.md). For step-by-step
+    tutorials on using the GPU delegate with Android and iOS, see
+    [TensorFlow Lite GPU Delegate Tutorial](gpu.md).
+
+#### Android
+
 *   **NNAPI delegate for newer Android devices** - The NNAPI delegate can be
     used to accelerate models on Android devices with GPU, DSP and / or NPU
     available. It is available in Android 8.1 (API 27+) or higher. For an
@@ -33,210 +63,188 @@ TensorFlow Lite provides the following delegates for hardware acceleration:
     practices, see [TensorFlow Lite NNAPI delegate](nnapi.md).
 *   **Hexagon delegate for older Android devices** - The Hexagon delegate can be
     used to accelerate models on Android devices with Qualcomm Hexagon DSP. It
-    can be used on devices older version of Android OS that does not fully
-    support NNAPI. See [TensorFlow Lite Hexagon delegate](hexagon_delegate.md)
-    for more detail.
+    can be used on devices running older versions of Android that do not support
+    NNAPI. See [TensorFlow Lite Hexagon delegate](hexagon_delegate.md) for more
+    detail.
 
+#### iOS
+
 *   **Core ML delegate for newer iPhones and iPads** - For newer iPhones and
     iPads where Neural Engine is available, you can use Core ML delegate to
-    accelerate inference for 32-bit float based models. Neural Engine is
-    available Apple mobile devices with A12 SoC or higher. For an overview of
-    the Core ML delegate and step-by-step instructions, see
+    accelerate inference for 32-bit or 16-bit floating-point models. Neural
+    Engine is available on Apple mobile devices with A12 SoC or higher. For an
+    overview of the Core ML delegate and step-by-step instructions, see
     [TensorFlow Lite Core ML delegate](coreml_delegate.md).
 
-## How do delegates work?
-
-Let's say we have a simple model graph such as the following:
-
-![]()
-
-If a delegate was provided for specific operations, then TensorFlow Lite will
-split the graph into multiple subgraphs where each subgraph will be handled by a
-delegate.
-
-Let's assume that a delegate, `MyDelegate`, has a faster implementation for
-Conv2D and Mean operations. The resulting main graph will be updated to look
-like below.
-
-![]()
-
-Each subgraph that is handled by a delegate will be replaced with a node that
-evaluates the subgraph on its invoked call.
-
-Depending on the model, the final graph can end up with one node, which means
-that all of the graphs were delegated or multiple nodes handled the subgraphs.
-In general, you don't want to have multiple subgraphs handled by the delegate,
-since each time you switch from delegate to the main graph, there is an overhead
-for passing the results from the subgraph to the main graph. It's not always
-safe to share memory.
-
-## How to add a delegate
-
-_Note that the API used below is experimental and is subject to change._
-
-Based on the previous section, to add a delegate, we need to do the following:
-
-1.  Define a kernel node that is responsible for evaluating the delegate
-    subgraph.
-1.  Create an instance of
-    [TfLiteDelegate](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/c/common.h#L611),
-    which is responsible for registering the kernel node and claiming the nodes
-    that the delegate can execute.
-
-To see it in code, let's define a delegate and call it `MyDelegate`, which can
-execute Conv2D and Mean operations faster.
-
-```c++
-#include "tensorflow/lite/util.h"
-#include "tensorflow/lite/builtin_ops.h"
-#include "tensorflow/lite/context_util.h"
-
-// This is where the execution of the operations or whole graph happens.
-// The class below has an empty implementation just as a guideline
-// on the structure.
-class MyDelegate {
- public:
-  // Returns true if my delegate can handle this type of op.
-  static bool SupportedOp(const TfLiteRegistration* registration) {
-    switch (registration->builtin_code) {
-      case kTfLiteBuiltinConv2d:
-      case kTfLiteBuiltinMean:
-        return true;
-      default:
-        return false;
-    }
-  }
-
-  // Any initialization code needed
-  bool Init() {}
-  // Any preparation work needed (e.g. allocate buffers)
-  bool Prepare(TfLiteContext* context, TfLiteNode* node) {}
-  // Actual running of the delegate subgraph.
-  bool Invoke(TfLiteContext* context, TfLiteNode* node) {}
-  // ... Add any other methods needed.
-};
-
-// Create the TfLiteRegistration for the Kernel node which will replace
-// the subgraph in the main TfLite graph.
-TfLiteRegistration GetMyDelegateNodeRegistration() {
-  // This is the registration for the Delegate Node that gets added to
-  // the TFLite graph instead of the subgraph it replaces.
-  // It is treated as an OP node. But in our case
-  // Init will initialize the delegate.
-  // Invoke will run the delegate graph.
-  // Prepare for preparing the delegate.
-  // Free for any cleaning needed by the delegate.
-  TfLiteRegistration kernel_registration;
-  kernel_registration.builtin_code = kTfLiteBuiltinDelegate;
-  kernel_registration.custom_name = "MyDelegate";
-  kernel_registration.free = [](TfLiteContext* context, void* buffer) -> void {
-    delete reinterpret_cast<MyDelegate*>(buffer);
-  };
-  kernel_registration.init = [](TfLiteContext* context, const char* buffer,
-                                size_t) -> void* {
-    // In the node init phase, initialize MyDelegate instance
-    const TfLiteDelegateParams* delegate_params =
-        reinterpret_cast<const TfLiteDelegateParams*>(buffer);
-    MyDelegate* my_delegate = new MyDelegate;
-    if (!my_delegate->Init(context, params)) {
-      return nullptr;
-    }
-    return my_delegate;
-  };
-  kernel_registration.invoke = [](TfLiteContext* context,
-                                  TfLiteNode* node) -> TfLiteStatus {
-    MyDelegate* kernel = reinterpret_cast<MyDelegate*>(node->user_data);
-    return kernel->Invoke(context, node);
-  };
-  kernel_registration.prepare = [](TfLiteContext* context,
-                                   TfLiteNode* node) -> TfLiteStatus {
-    MyDelegate* kernel = reinterpret_cast<MyDelegate*>(node->user_data);
-    return kernel->Prepare(context, node);
-  };
-
-  return kernel_registration;
-}
-
-// TfLiteDelegate methods
-
-TfLiteStatus DelegatePrepare(TfLiteContext* context, TfLiteDelegate* delegate) {
-  // Claim all nodes that can be evaluated by the delegate and ask the
-  // framework to update the graph with delegate kernel instead.
-  std::vector<int> supported_nodes;
-  TfLiteIntArray* plan;
-  TF_LITE_ENSURE_STATUS(context->GetExecutionPlan(context, &plan));
-  TfLiteNode* node;
-  TfLiteRegistration* registration;
-  for (int node_index : TfLiteIntArrayView(plan)) {
-    TF_LITE_ENSURE_STATUS(context->GetNodeAndRegistration(
-        context, node_index, &node, &registration));
-    if (MyDelegate::SupportedOp(registration)) {
-      supported_nodes.push_back(node_index);
-    }
-  }
-  TfLiteRegistration my_delegate_kernel_registration =
-      GetMyDelegateNodeRegistration();
-
-  // This call split the graphs into subgraphs, for subgraphs that can be
-  // handled by the delegate, it will replace it with a
-  // 'my_delegate_kernel_registration'
-  TfLiteIntArray* supported_nodes_int_array =
-      ::tflite::ConvertVectorToTfLiteIntArray(supported_nodes);
-  auto status = context->ReplaceNodeSubsetsWithDelegateKernels(
-      context, my_delegate_kernel_registration,
-      supported_nodes_int_array, delegate);
-  TfLiteIntArrayFree(supported_nodes_int_array);
-  return status
-}
-
-void FreeBufferHandle(TfLiteContext* context, TfLiteDelegate* delegate,
-                      TfLiteBufferHandle* handle) {
-  // Do any cleanups.
-}
-
-TfLiteStatus CopyToBufferHandle(TfLiteContext* context,
-                                TfLiteDelegate* delegate,
-                                TfLiteBufferHandle buffer_handle,
-                                TfLiteTensor* tensor) {
-  // Copies data from tensor to delegate buffer if needed.
-  return kTfLiteOk;
-}
-
-TfLiteStatus CopyFromBufferHandle(TfLiteContext* context,
-                                  TfLiteDelegate* delegate,
-                                  TfLiteBufferHandle buffer_handle,
-                                  TfLiteTensor* tensor) {
-  // Copies the data from delegate buffer into the tensor raw memory.
-  return kTfLiteOk;
-}
-
-// Caller takes ownership of the returned pointer.
-TfLiteDelegate* CreateMyDelegate() {
-  TfLiteDelegate* delegate = new TfLiteDelegate;
-
-  delegate->data_ = nullptr;
-  delegate->flags = kTfLiteDelegateFlagsNone;
-  delegate->Prepare = &DelegatePrepare;
-  // This cannot be null.
-  delegate->CopyFromBufferHandle = &CopyFromBufferHandle;
-  // This can be null.
-  delegate->CopyToBufferHandle = &CopyToBufferHandle;
-  // This can be null.
-  delegate->FreeBufferHandle = &FreeBufferHandle;
-
-  return delegate;
-}
-
-// To add the delegate you need to call
-
-auto* my_delegate = CreateMyDelegate();
-if (interpreter->ModifyGraphWithDelegate(my_delegate) !=
-    kTfLiteOk) {
-  // Handle error
-} else {
-  interpreter->Invoke();
-}
-...
-// Don't forget to delete your delegate
-delete my_delegate;
-```
+### Delegates by model type
+
+Each accelerator is designed with a certain bit-width of data in mind. If you
+provide a floating-point model to a delegate that only supports 8-bit quantized
+operations (such as the [Hexagon delegate](hexagon_delegate.md)), it will reject
+all its operations and the model will run entirely on the CPU. To avoid such
+surprises, the table below provides an overview of delegate support based on
+model type:
+
+**Model Type**                                                                                            | **GPU** | **NNAPI** | **Hexagon** | **CoreML**
+--------------------------------------------------------------------------------------------------------- | ------- | --------- | ----------- | ----------
+Floating-point (32 bit)                                                                                   | Yes     | Yes       | No          | Yes
+[Post-training float16 quantization](post_training_float16_quant.ipynb)                                  | Yes     | No        | No          | Yes
+[Post-training dynamic range quantization](post_training_quant.ipynb)                                     | Yes     | Yes       | No          | No
+[Post-training integer quantization](post_training_integer_quant.ipynb)                                   | Yes     | Yes       | Yes         | No
+[Quantization-aware training](http://www.tensorflow.org/model_optimization/guide/quantization/training)  | Yes     | Yes       | Yes         | No
+
+### Validating performance
+
+The information in this section acts as a rough guideline for shortlisting the
+delegates that could improve your application. However, it is important to note
+that each delegate has a pre-defined set of operations it supports, and may
+perform differently depending on the model and device; for example, the
+[NNAPI delegate](nnapi.md) may choose to use Google's Edge-TPU on a Pixel phone
+while utilizing a DSP on another device. Therefore, it is usually recommended
+that you perform some benchmarking to gauge how useful a delegate is for your
+needs. This also helps justify the binary size increase associated with
+attaching a delegate to the TensorFlow Lite runtime.
+
+TensorFlow Lite has extensive performance and accuracy-evaluation tooling that
+can empower developers to be confident in using delegates in their application.
+These tools are discussed in the next section.
+
+## Tools for Evaluation
+
+### Latency & memory footprint
+
+TensorFlow Lite's
+[benchmark tool](https://www.tensorflow.org/lite/performance/measurement) can be
+used with suitable parameters to estimate model performance, including average
+inference latency, initialization overhead, memory footprint, etc. This tool
+supports multiple flags to figure out the best delegate configuration for your
+model. For instance, `--gpu_backend=gl` can be specified with `--use_gpu` to
+measure GPU execution with OpenGL. The complete list of supported delegate
+parameters is defined in the
+[detailed documentation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar).
+
+Here's an example run for a quantized model with GPU via `adb`:
+
+```
+adb shell /data/local/tmp/benchmark_model \
+  --graph=/data/local/tmp/mobilenet_v1_224_quant.tflite \
+  --use_gpu=true
+```
+
+You can download a pre-built version of this tool for Android (64-bit ARM
+architecture)
+[here](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_benchmark_model.apk)
+([more details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/android)).
+
+### Accuracy & correctness
+
+Delegates usually perform computations at a different precision than their CPU
+counterparts. As a result, there is a (usually minor) accuracy tradeoff
+associated with utilizing a delegate for hardware acceleration. Note that this
+isn't *always* true; for example, since the GPU uses floating-point precision to
+run quantized models, there might be a slight precision improvement (for example,
+<1% Top-5 improvement in ILSVRC image classification).
+
+TensorFlow Lite has two types of tooling to measure how accurately a delegate
+behaves for a given model: *Task-Based* and *Task-Agnostic*. All the tools
+described in this section support the
+[advanced delegation parameters](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar)
+used by the benchmarking tool from the previous section. Note that the
+sub-sections below focus on *delegate evaluation* (Does the delegate perform the
+same as the CPU?) rather than model evaluation (Is the model itself good for the
+task?).
+
+#### Task-Based Evaluation
+
+TensorFlow Lite has tools to evaluate correctness on two image-based tasks:
+
+*   [ILSVRC 2012](http://image-net.org/challenges/LSVRC/2012/) (Image
+    Classification) with
+    [top-K accuracy](https://en.wikipedia.org/wiki/Evaluation_measures_\(information_retrieval\)#Precision_at_K)
+*   [COCO Object Detection (w/ bounding boxes)](https://cocodataset.org/#detection-2020)
+    with
+    [mean Average Precision (mAP)](https://en.wikipedia.org/wiki/Evaluation_measures_\(information_retrieval\)#Mean_average_precision)
+
+Prebuilt binaries of these tools (Android, 64-bit ARM architecture), along with
+documentation, can be found here:
+
+*   [ImageNet Image Classification](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_imagenet_image_classification)
+    ([More details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/imagenet_image_classification))
+*   [COCO Object Detection](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_coco_object_detection)
+    ([More details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/coco_object_detection))
+
+The example below demonstrates
+[image classification evaluation](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/imagenet_image_classification)
+with NNAPI utilizing Google's Edge-TPU on a Pixel 4:
+
+```
+adb shell /data/local/tmp/run_eval \
+  --model_file=/data/local/tmp/mobilenet_quant_v1_224.tflite \
+  --ground_truth_images_path=/data/local/tmp/ilsvrc_images \
+  --ground_truth_labels=/data/local/tmp/ilsvrc_validation_labels.txt \
+  --model_output_labels=/data/local/tmp/model_output_labels.txt \
+  --output_file_path=/data/local/tmp/accuracy_output.txt \
+  --use_nnapi=true \
+  --nnapi_accelerator_name=google-edgetpu \
+  --num_images=0  # Run on all images.
+```
+
+The expected output is a list of Top-K metrics from 1 to 10:
+
+```
+Top-1 Accuracy: 0.733333
+Top-2 Accuracy: 0.826667
+Top-3 Accuracy: 0.856667
+Top-4 Accuracy: 0.87
+Top-5 Accuracy: 0.89
+Top-6 Accuracy: 0.903333
+Top-7 Accuracy: 0.906667
+Top-8 Accuracy: 0.913333
+Top-9 Accuracy: 0.92
+Top-10 Accuracy: 0.923333
+```
+
+#### Task-Agnostic Evaluation
+
+For tasks where there isn't an established on-device evaluation tool, or if you
+are experimenting with custom models, TensorFlow Lite has the
+[Inference Diff](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/inference_diff)
+tool (an Android, 64-bit ARM architecture binary is available
+[here](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_inference_diff)).
+
+Inference Diff compares TensorFlow Lite execution (in terms of latency &
+output-value deviation) in two settings:
+
+*   Single-threaded CPU Inference
+*   User-defined Inference - defined by
+    [these parameters](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar)
+
+To do so, the tool generates random Gaussian data and passes it through two
+TFLite Interpreters - one running single-threaded CPU kernels, and the other
+parametrized by the user's arguments.
+
+It measures the latency of both, as well as the absolute difference between the
+output tensors from each Interpreter, on a per-element basis.
+
+For a model with a single output tensor, the output might look like this:
+
+```
+Num evaluation runs: 50
+Reference run latency: avg=84364.2(us), std_dev=12525(us)
+Test run latency: avg=7281.64(us), std_dev=2089(us)
+OutputDiff[0]: avg_error=1.96277e-05, std_dev=6.95767e-06
+```
+
+This means that for the output tensor at index `0`, the elements from the CPU
+output differ from the delegate output by an average of `1.96e-05`.
+
+Note that interpreting these numbers requires deeper knowledge of the model, and
+what each output tensor signifies. If it's a simple regression that determines
+some sort of score or embedding, the difference should be low (otherwise it's an
+error with the delegate). However, outputs like the 'detection class' one from
+SSD models are a little harder to interpret. For example, it might show a
+difference using this tool, but that may not mean something is really wrong with
+the delegate: consider two (fake) classes: "TV (ID: 10)", "Monitor (ID: 20)" - if
+a delegate is slightly off the golden truth and shows monitor instead of TV, the
+output diff for this tensor might be something as high as 20-10 = 10.
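For reference while reviewing the page above: attaching a shortlisted delegate at runtime follows the same pattern for all of the delegates discussed. The snippet below is a minimal sketch, not part of the committed page; it assumes the GPU delegate's C++ API (`TfLiteGpuDelegateV2Create` from `tensorflow/lite/delegates/gpu/delegate.h`) and a hypothetical model path, and exact option fields may vary between TensorFlow Lite versions.

```c++
#include <memory>

#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // "model.tflite" is a placeholder path.
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  // Create the delegate with default options and hand it to the interpreter.
  TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
  TfLiteDelegate* gpu_delegate = TfLiteGpuDelegateV2Create(&options);
  if (interpreter->ModifyGraphWithDelegate(gpu_delegate) != kTfLiteOk) {
    // Unsupported ops stay on the CPU; a hard failure here means the
    // delegate could not be applied at all.
  }
  interpreter->AllocateTensors();

  // ... fill input tensors, then run inference through the delegated graph.
  interpreter->Invoke();

  // The delegate must outlive the interpreter, so destroy it last.
  interpreter.reset();
  TfLiteGpuDelegateV2Delete(gpu_delegate);
  return 0;
}
```

The same `ModifyGraphWithDelegate` call is used for the NNAPI, Hexagon, and Core ML delegates; only the creation and deletion functions differ per delegate.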
BIN tensorflow/lite/g3doc/performance/images/delegate_runtime.png (new file; 34 KiB)

tensorflow/lite/g3doc/performance/implementing_delegate.md (new file; 171 lines)
@@ -0,0 +1,171 @@
# Implementing a Delegate

Note: The API used below is experimental and is subject to change.

Follow the steps below to add a delegate:

1.  Define a kernel node that is responsible for evaluating the delegate
    subgraph.
1.  Create an instance of
    [TfLiteDelegate](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/c/common.h#L611),
    which is responsible for registering the kernel node and claiming the nodes
    that the delegate can execute.

To see it in code, define a delegate, `MyDelegate`, that can execute Conv2D and
Mean ops faster.

```c++
#include "tensorflow/lite/util.h"
#include "tensorflow/lite/builtin_ops.h"
#include "tensorflow/lite/context_util.h"

// This is where the execution of the operations or whole graph happens.
// The class below has an empty implementation just as a guideline
// on the structure.
class MyDelegate {
 public:
  // Returns true if MyDelegate can handle this type of op.
  static bool SupportedOp(const TfLiteRegistration* registration) {
    switch (registration->builtin_code) {
      case kTfLiteBuiltinConv2d:
      case kTfLiteBuiltinMean:
        return true;
      default:
        return false;
    }
  }

  // Any initialization code needed
  bool Init(TfLiteContext* context, const TfLiteDelegateParams* params) {
    return true;
  }
  // Any preparation work needed (e.g. allocate buffers)
  TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
    return kTfLiteOk;
  }
  // Actual running of the delegate subgraph.
  TfLiteStatus Invoke(TfLiteContext* context, TfLiteNode* node) {
    return kTfLiteOk;
  }
  // ... Add any other methods needed.
};

// Create the TfLiteRegistration for the Kernel node which will replace
// the subgraph in the main TfLite graph.
TfLiteRegistration GetMyDelegateNodeRegistration() {
  // This is the registration for the Delegate Node that gets added to
  // the TFLite graph instead of the subgraph it replaces.
  // It is treated as an OP node. But in this case
  // Init initializes the delegate.
  // Invoke runs the delegate graph.
  // Prepare prepares the delegate.
  // Free performs any memory cleanup needed by the delegate.
  TfLiteRegistration kernel_registration;
  kernel_registration.builtin_code = kTfLiteBuiltinDelegate;
  kernel_registration.custom_name = "MyDelegate";
  kernel_registration.free = [](TfLiteContext* context, void* buffer) -> void {
    delete reinterpret_cast<MyDelegate*>(buffer);
  };
  kernel_registration.init = [](TfLiteContext* context, const char* buffer,
                                size_t) -> void* {
    // In the node init phase, initialize the MyDelegate instance from the
    // TfLiteDelegateParams describing the subgraph being replaced.
    const TfLiteDelegateParams* delegate_params =
        reinterpret_cast<const TfLiteDelegateParams*>(buffer);
    MyDelegate* my_delegate = new MyDelegate;
    if (!my_delegate->Init(context, delegate_params)) {
      delete my_delegate;
      return nullptr;
    }
    return my_delegate;
  };
  kernel_registration.invoke = [](TfLiteContext* context,
                                  TfLiteNode* node) -> TfLiteStatus {
    MyDelegate* kernel = reinterpret_cast<MyDelegate*>(node->user_data);
    return kernel->Invoke(context, node);
  };
  kernel_registration.prepare = [](TfLiteContext* context,
                                   TfLiteNode* node) -> TfLiteStatus {
    MyDelegate* kernel = reinterpret_cast<MyDelegate*>(node->user_data);
    return kernel->Prepare(context, node);
  };

  return kernel_registration;
}

// TfLiteDelegate methods

TfLiteStatus DelegatePrepare(TfLiteContext* context, TfLiteDelegate* delegate) {
  // Claim all nodes that can be evaluated by the delegate and ask the
  // framework to update the graph with the delegate kernel instead.
  std::vector<int> supported_nodes;
  TfLiteIntArray* plan;
  TF_LITE_ENSURE_STATUS(context->GetExecutionPlan(context, &plan));
  TfLiteNode* node;
  TfLiteRegistration* registration;
  for (int node_index : TfLiteIntArrayView(plan)) {
    TF_LITE_ENSURE_STATUS(context->GetNodeAndRegistration(
        context, node_index, &node, &registration));
    if (MyDelegate::SupportedOp(registration)) {
      supported_nodes.push_back(node_index);
    }
  }
  TfLiteRegistration my_delegate_kernel_registration =
      GetMyDelegateNodeRegistration();

  // This call splits the graph into subgraphs; each subgraph that can be
  // handled by the delegate is replaced with a node that uses
  // 'my_delegate_kernel_registration'.
  TfLiteIntArray* supported_nodes_int_array =
      ::tflite::ConvertVectorToTfLiteIntArray(supported_nodes);
  auto status = context->ReplaceNodeSubsetsWithDelegateKernels(
      context, my_delegate_kernel_registration,
      supported_nodes_int_array, delegate);
  TfLiteIntArrayFree(supported_nodes_int_array);
  return status;
}

void FreeBufferHandle(TfLiteContext* context, TfLiteDelegate* delegate,
                      TfLiteBufferHandle* handle) {
  // Do any cleanups.
}

TfLiteStatus CopyToBufferHandle(TfLiteContext* context,
                                TfLiteDelegate* delegate,
                                TfLiteBufferHandle buffer_handle,
                                TfLiteTensor* tensor) {
  // Copies data from tensor to delegate buffer if needed.
  return kTfLiteOk;
}

TfLiteStatus CopyFromBufferHandle(TfLiteContext* context,
                                  TfLiteDelegate* delegate,
                                  TfLiteBufferHandle buffer_handle,
                                  TfLiteTensor* tensor) {
  // Copies the data from delegate buffer into the tensor raw memory.
  return kTfLiteOk;
}

// Caller takes ownership of the returned pointer.
TfLiteDelegate* CreateMyDelegate() {
  TfLiteDelegate* delegate = new TfLiteDelegate;

  delegate->data_ = nullptr;
  delegate->flags = kTfLiteDelegateFlagsNone;
  delegate->Prepare = &DelegatePrepare;
  // This cannot be null.
  delegate->CopyFromBufferHandle = &CopyFromBufferHandle;
  // This can be null.
  delegate->CopyToBufferHandle = &CopyToBufferHandle;
  // This can be null.
  delegate->FreeBufferHandle = &FreeBufferHandle;

  return delegate;
}

// To add the delegate you need to call

auto* my_delegate = CreateMyDelegate();
if (interpreter->ModifyGraphWithDelegate(my_delegate) !=
    kTfLiteOk) {
  // Handle error
} else {
  interpreter->Invoke();
}
...
// Don't forget to delete your delegate.
delete my_delegate;
```
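The skeleton above intentionally leaves `Init`, `Prepare`, and `Invoke` mostly empty. The following is a minimal, illustrative sketch (not part of the committed page) of what those methods might do, assuming the tensor indices are taken from the `TfLiteDelegateParams` passed to the node's `init` callback; the copy-through "computation" is purely a placeholder for a real accelerator dispatch.

```c++
#include <cstring>
#include <vector>

#include "tensorflow/lite/c/common.h"

// Variant of the MyDelegate class above with the methods filled in.
class MyDelegateKernelSketch {
 public:
  bool Init(TfLiteContext* context, const TfLiteDelegateParams* params) {
    // Remember which tensors feed and leave the replaced subgraph.
    for (int i = 0; i < params->input_tensors->size; ++i)
      inputs_.push_back(params->input_tensors->data[i]);
    for (int i = 0; i < params->output_tensors->size; ++i)
      outputs_.push_back(params->output_tensors->data[i]);
    return true;
  }

  TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
    // Validate shapes and allocate any scratch buffers here.
    return kTfLiteOk;
  }

  TfLiteStatus Invoke(TfLiteContext* context, TfLiteNode* node) {
    // Placeholder: copy the first input to the first output. A real delegate
    // would hand the captured subgraph to its accelerator here.
    const TfLiteTensor& in = context->tensors[inputs_[0]];
    TfLiteTensor& out = context->tensors[outputs_[0]];
    if (in.bytes != out.bytes) return kTfLiteError;
    std::memcpy(out.data.raw, in.data.raw, in.bytes);
    return kTfLiteOk;
  }

 private:
  std::vector<int> inputs_;
  std::vector<int> outputs_;
};
```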