Merge pull request #47227 from advaitjain:opt-kernel-doc
PiperOrigin-RevId: 358268296
Change-Id: I6666ca8e2dbe3bccbef6141a7d86ab29cd397563

commit 498026c3a5
@@ -65,6 +65,7 @@ project, we have additional documentation in the [docs](docs/) folder.
 * [Benchmarks](benchmarks/README.md)
 * [Profiling](docs/profiling.md)
 * [Memory Management](docs/memory_management.md)
+* [Optimized Kernel Implementations](docs/optimized_kernel_implementations.md)
 * [New Platform Support](docs/new_platform_support.md)
 * [Software Emulation with Renode](docs/renode.md)
 * [Pre-allocated tensors](docs/preallocated_tensors.md)
tensorflow/lite/micro/docs/optimized_kernel_implementations.md (new file, 200 lines)
@@ -0,0 +1,200 @@
<!-- mdformat off(b/169948621#comment2) -->

<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->
<!--ts-->

*   [Summary](#summary)
*   [High-Level Steps](#high-level-steps)
    *   [Why not Optimize the Reference Kernels](#why-not-optimize-the-reference-kernels)
*   [Software Architecture](#software-architecture)
    *   [Hardware-specific NN library](#hardware-specific-nn-library)
    *   [Optimized Kernels](#optimized-kernels)
    *   [Build System Integration](#build-system-integration)
*   [Testing and Continuous Integration](#testing-and-continuous-integration)

<!-- Added by: advaitjain, at: Wed 17 Feb 2021 02:14:16 PM PST -->

<!--te-->
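The table of contents above can be regenerated with the gh-md-toc script
referenced in the comment at the top of this file. A minimal sketch, assuming
the script has already been downloaded per the linked instructions:

```bash
# Update the TOC between the <!--ts--> and <!--te--> markers in place
# (assumes gh-md-toc has been downloaded and made executable):
./gh-md-toc --insert tensorflow/lite/micro/docs/optimized_kernel_implementations.md
```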
# Summary

This guide describes the recommended high-level architecture and steps for
adding hardware-specific optimized kernels to TfLite Micro.

The goal of these optimizations, and of the process we recommend for getting
them merged into the TfLite Micro codebase, is a measurable and documented
performance improvement on a benchmark of interest.

Once the optimizations are merged they will, of course, be used for more than
that benchmark, but the context for why they were added remains important.
# High-Level Steps

1. Pick a benchmark whose performance you would like to measure and improve.

   * Existing benchmarks are in the [benchmarks directory](../benchmarks).
   * If none of the existing benchmarks capture your use case, then please
     create a GitHub issue or start a thread on micro@tensorflow.org to figure
     out how to add a new benchmark.
   * If adding a publicly-available benchmark to the TFLM codebase is
     determined to be infeasible, then a fall-back would be to use an internal
     benchmark and document the benefits of the optimizations in the PR
     descriptions.
   * Adding optimized code without any associated benchmark will need very
     strong justification and will most likely not be permitted.

1. Do the groundwork and architecture needed to be able to add optimizations
   for your target (more details in the
   [software architecture](#software-architecture) section).

1. Create one pull request for each optimized kernel, with the PR description
   clearly stating the commands that were used to measure the performance
   improvement (example commands follow this list).

   * This context is important even if the toolchain is proprietary and there
     are currently only a small number of users.
   * See [this PR](https://github.com/tensorflow/tensorflow/pull/47098) as an
     example.
   * At a minimum, the latency with and without the particular optimized
     kernel should be documented.
     [Additional context](https://github.com/tensorflow/tensorflow/pull/46746)
     may also be desirable.
   * Here is some
     [general guidance](https://testing.googleblog.com/2017/09/code-health-providing-context-with.html)
     on writing
     [good PR descriptions](https://google.github.io/eng-practices/review/developer/cl-descriptions.html).
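For example, one way to capture the before/after numbers for a PR description
is to run the same benchmark with and without the optimized kernel directory.
This is only a sketch: it assumes the keyword benchmark target from the
[benchmarks directory](../benchmarks), and `<target>` and `<optimize_dir>` are
placeholders for your platform and optimized kernel directory.

```bash
# Baseline latency with the portable reference kernels (sketch; the benchmark
# target is an assumption and may differ for your use case):
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> run_keyword_benchmark

# Latency with the hardware-specific optimized kernels enabled:
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> run_keyword_benchmark
```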
## Why Not Optimize the Portable Reference Kernels?

We would like to explicitly point out (as others have) that the reference
kernel implementations are not performant and that there are plenty of
opportunities to speed them up. This is by design: the reference kernels are
meant to be a shared starting point that is then optimized in target-specific
kernel implementations.

Two previous discussions on this topic are in
[PR #42477](https://github.com/tensorflow/tensorflow/pull/42477) and
[PR #45227](https://github.com/tensorflow/tensorflow/pull/45227).

Our current point of view is that while optimizing shared reference code in a
portable manner is attractive, we are making an explicit choice not to go down
that path and to instead rely on target-specific optimized implementations.
The TFLM codebase has a growing list of optimized kernel implementations, and
we are investing in making the process of adding new implementations smoother.
# Software Architecture

The optimized kernel architecture is composed of the following three modules:

1. Hardware-specific NN library
1. Optimized Kernels
1. Build System Integration
## Hardware-specific NN library

This library uses knowledge of the hardware and compiler to implement the
underlying operations. Examples of this are
[CMSIS-NN](https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN) from
ARM and [NNLib](https://github.com/foss-xtensa/nnlib-hifi4) from Cadence.

The benefits of having this API separation are:

1. The NN library does not need to follow the style guide of the rest of the
   TFLM code.
1. Releases of the NN library can be made independently of TFLM.
1. The same NN library can be used and tested independently of TFLM.
1. The maintainers of the NN library have full control over the development
   process that they would like to follow.
## Optimized Kernels

These will be (hopefully thin) wrappers that act as the glue between TFLM and
the NN library.

The goal here is to delegate as much work as possible to the NN library while
still allowing the two APIs (TFLM and the NN library) to remain independent of
each other. If this separation causes a performance degradation (for example,
unnecessary memory copies), we can evaluate it on a case-by-case basis.

This code will be reviewed and merged in the TFLM GitHub repository and must
follow the development style of the TFLM codebase.

Some amount of refactoring of the existing code may be needed to ensure that
code is suitably shared between the reference and optimized kernels. There is
currently no fixed recipe for this refactoring, and we will evaluate it on a
case-by-case basis during PR review.

For example, to add an optimized `fully_connected` implementation for the
Xtensa Fusion F1, the steps were (example commands for exercising the kernel
test follow this list):

* [PR 1](https://github.com/tensorflow/tensorflow/pull/45464): refactor for
  reference fallbacks and establish a baseline latency.
* [PR 2](https://github.com/tensorflow/tensorflow/pull/46242): refactor to
  share code between the reference and optimized kernels.
* [PR 3](https://github.com/tensorflow/tensorflow/pull/46411): add the code
  needed to use the optimized NN library and document the latency improvement.
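Throughout a refactoring sequence like the one above, the kernel tests are the
safety net. Below is a minimal sketch of exercising a single kernel with and
without the optimized implementation; the exact test target name is an
assumption based on the kernel test naming convention and may differ.

```bash
# fully_connected kernel test against the reference implementation (sketch;
# the test target name is assumed and may differ in your checkout):
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> test_kernel_fully_connected_test

# The same test with the optimized kernels; the results are expected to match:
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> test_kernel_fully_connected_test
```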
## Build System Integration

This module is the least well-defined, but we strongly recommend the
following:

1. A single target makefile.inc for all the architectures that you would like
   to support, along with an optional target-specific
   [system_setup.cc](../arduino/system_setup.cc). See
   [cortex_m_generic_makefile.inc](../tools/make/targets/cortex_m_generic_makefile.inc)
   and [xtensa_makefile.inc](../tools/make/targets/xtensa_makefile.inc) as
   examples.

1. A single `ext_libs.inc` (and associated scripts) that downloads any
   external dependencies (including the NN library); a sketch of such a
   download script follows this list. For example:

   * [cmsis_nn.inc](../tools/make/ext_libs/cmsis_nn.inc) and
     [cmsis_download.sh](../tools/make/ext_libs/cmsis_download.sh)
   * [xtensa.inc](../tools/make/ext_libs/xtensa.inc) and
     [xtensa_download.sh](../tools/make/ext_libs/xtensa_download.sh)

1. The optimized kernels will then live in a kernels subdirectory (e.g.
   [kernels/cmsis_nn](../kernels/cmsis_nn) and
   [kernels/xtensa](../kernels/xtensa)).
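As a rough illustration of the second item, a download script typically
fetches a pinned release of the NN library into a downloads directory and
verifies its checksum before unpacking. The sketch below is hypothetical: the
URL, checksum, and directory layout are placeholders rather than the contents
of the actual cmsis_download.sh or xtensa_download.sh.

```bash
#!/bin/bash
# Hypothetical download helper (URL, checksum, and paths are placeholders; see
# cmsis_download.sh and xtensa_download.sh for the real scripts).
set -e

DOWNLOADS_DIR=${1}
LIBRARY_URL="https://example.com/nn_library/archive/v1.0.0.zip"
LIBRARY_MD5="0123456789abcdef0123456789abcdef"
LIBRARY_DIR="${DOWNLOADS_DIR}/nn_library"

if [ -d "${LIBRARY_DIR}" ]; then
  echo "${LIBRARY_DIR} already exists, skipping the download."
else
  TEMP_FILE="$(mktemp)"
  wget "${LIBRARY_URL}" -O "${TEMP_FILE}"
  # Fail if the archive does not match the pinned checksum.
  echo "${LIBRARY_MD5}  ${TEMP_FILE}" | md5sum -c -
  mkdir -p "${LIBRARY_DIR}"
  unzip -q "${TEMP_FILE}" -d "${LIBRARY_DIR}"
  rm "${TEMP_FILE}"
fi
```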
Two development workflows that the TFLM team would like to encourage and
support:

1. Export a static library and headers into a target-specific development
   environment (a sketch of this workflow follows the list).

   * Build a static libtensorflow-microlite.a using the TFLM makefile with:
     `make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target>
     OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite`
   * Use the static library and any TFLM headers as part of the overall
     application (with its own build system).

1. Integrate TFLM with an IDE:

   * This has historically been done using the TFLM Makefile's support for
     project generation.
   * However, given the learning curve and high maintenance overhead, we are
     moving away from supporting project generation via the Makefile and are
     encouraging future IDE integrations to be done outside of the TFLM
     Makefiles.
   * The TFLM team is currently working through the details on this topic.
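As an illustration of the first workflow, the generated static library is then
consumed by the application's own build. Everything below is a placeholder
sketch; the toolchain, include paths, and location of the generated
libtensorflow-microlite.a depend on your build configuration.

```bash
# Build the static library with the optimized kernels (as described above):
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite

# Then link it into the application with the application's own toolchain
# (compiler, include paths, and library path are illustrative placeholders):
<toolchain>-g++ my_app.cc \
  -I<path-to-tflm-source-tree> -I<path-to-flatbuffers-include> \
  <path-to-generated-lib>/libtensorflow-microlite.a -o my_app
```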
## Testing and Continuous Integration

The kernel tests are the primary method of ensuring that the optimized kernel
implementations are accurate.

Currently, most of the tests require the optimized implementations to be
bit-exact with the quantized reference implementations. We can revisit this
requirement if it ends up imposing a high cost on latency.
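One way to check this bit-exactness locally is to run the full set of unit
tests both with and without the optimized kernels. A minimal sketch, assuming
the standard TFLM Makefile `test` target and placeholder values for the target
and optimized kernel directory:

```bash
# All unit tests against the portable reference kernels:
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> test

# The same tests with the optimized kernel implementations enabled:
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> test
```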
We strongly encourage optimized kernel implementations to have an associated
continuous build that runs all of the unit tests and publishes a build badge
to the
[TFLM community supported builds](../README.md#community-supported-builds)
table. Running the unit tests once a day is often a good place to start.