Merge pull request #47227 from advaitjain:opt-kernel-doc
PiperOrigin-RevId: 358268296
Change-Id: I6666ca8e2dbe3bccbef6141a7d86ab29cd397563

commit 498026c3a5
@@ -65,6 +65,7 @@ project, we have additional documentation in the [docs](docs/) folder.
 * [Benchmarks](benchmarks/README.md)
 * [Profiling](docs/profiling.md)
 * [Memory Management](docs/memory_management.md)
+* [Optimized Kernel Implementations](docs/optimized_kernel_implementations.md)
 * [New Platform Support](docs/new_platform_support.md)
 * [Software Emulation with Renode](docs/renode.md)
 * [Pre-allocated tensors](docs/preallocated_tensors.md)
tensorflow/lite/micro/docs/optimized_kernel_implementations.md (new file, 200 lines)
@@ -0,0 +1,200 @@
<!-- mdformat off(b/169948621#comment2) -->

<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->
<!--ts-->

*   [Summary](#summary)
*   [High-Level Steps](#high-level-steps)
    *   [Why not Optimize the Reference Kernels](#why-not-optimize-the-reference-kernels)
*   [Software Architecture](#software-architecture)
    *   [Hardware-specific NN library](#hardware-specific-nn-library)
    *   [Optimized Kernels](#optimized-kernels)
    *   [Build System Integration](#build-system-integration)
*   [Testing and Continuous Integration](#testing-and-continuous-integration)

<!-- Added by: advaitjain, at: Wed 17 Feb 2021 02:14:16 PM PST -->

<!--te-->
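The table of contents above can be regenerated with the gh-md-toc script
referenced in the comment at the top of this file. A minimal sketch, assuming
the script has already been downloaded per the linked instructions:

```bash
# Update the TOC between the <!--ts--> and <!--te--> markers in place
# (assumes gh-md-toc has been downloaded and made executable):
./gh-md-toc --insert tensorflow/lite/micro/docs/optimized_kernel_implementations.md
```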
# Summary

This guide describes the recommended high-level architecture and steps for
adding hardware-specific optimized kernels to TfLite Micro.

The goal of these optimizations, and of the process we recommend for getting
them merged into the TfLite Micro codebase, is a measurable and documented
performance improvement on a benchmark of interest.

Once the optimizations are merged they will, of course, be used for more than
that benchmark, but the context for why they were added remains important.
# High-Level Steps

1. Pick a benchmark whose performance you would like to measure and improve.

   * Existing benchmarks are in the [benchmarks directory](../benchmarks).
   * If none of the existing benchmarks capture your use case, then please
     create a GitHub issue or start a thread on micro@tensorflow.org to figure
     out how to add a new benchmark.
   * If adding a publicly-available benchmark to the TFLM codebase is
     determined to be infeasible, then a fall-back would be to use an internal
     benchmark and document the benefits of the optimizations in the PR
     descriptions.
   * Adding optimized code without any associated benchmark will need very
     strong justification and will most likely not be permitted.

1. Do the groundwork and architecture needed to be able to add optimizations
   for your target (more details in the
   [software architecture](#software-architecture) section).

1. Create one pull request for each optimized kernel, with the PR description
   clearly stating the commands that were used to measure the performance
   improvement (example commands follow this list).

   * This context is important even if the toolchain is proprietary and there
     are currently only a small number of users.
   * See [this PR](https://github.com/tensorflow/tensorflow/pull/47098) as an
     example.
   * At a minimum, the latency with and without the particular optimized
     kernel should be documented.
     [Additional context](https://github.com/tensorflow/tensorflow/pull/46746)
     may also be desirable.
   * Here is some
     [general guidance](https://testing.googleblog.com/2017/09/code-health-providing-context-with.html)
     on writing
     [good PR descriptions](https://google.github.io/eng-practices/review/developer/cl-descriptions.html).
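For example, one way to capture the before/after numbers for a PR description
is to run the same benchmark with and without the optimized kernel directory.
This is only a sketch: it assumes the keyword benchmark target from the
[benchmarks directory](../benchmarks), and `<target>` and `<optimize_dir>` are
placeholders for your platform and optimized kernel directory.

```bash
# Baseline latency with the portable reference kernels (sketch; the benchmark
# target is an assumption and may differ for your use case):
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> run_keyword_benchmark

# Latency with the hardware-specific optimized kernels enabled:
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> run_keyword_benchmark
```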
## Why Not Optimize the Portable Reference Kernels?

We would like to explicitly point out (as others have) that the reference
kernel implementations are not performant and that there are plenty of
opportunities to speed them up. This is by design: the reference kernels are
meant to be a shared starting point that is then optimized in target-specific
kernel implementations.

Two previous discussions on this topic are in
[PR #42477](https://github.com/tensorflow/tensorflow/pull/42477) and
[PR #45227](https://github.com/tensorflow/tensorflow/pull/45227).

Our current point of view is that while optimizing shared reference code in a
portable manner is attractive, we are making an explicit choice not to go down
that path and to instead rely on target-specific optimized implementations.
The TFLM codebase has a growing list of optimized kernel implementations, and
we are investing in making the process of adding new implementations smoother.
# Software Architecture

The optimized kernel architecture is composed of the following three modules:

1. Hardware-specific NN library
1. Optimized Kernels
1. Build System Integration
## Hardware-specific NN library

This library uses knowledge of the hardware and compiler to implement the
underlying operations. Examples of this are
[CMSIS-NN](https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN) from
ARM and [NNLib](https://github.com/foss-xtensa/nnlib-hifi4) from Cadence.

The benefits of having this API separation are:

1. The NN library does not need to follow the style guide of the rest of the
   TFLM code.
1. Releases of the NN library can be made independently of TFLM.
1. The same NN library can be used and tested independently of TFLM.
1. The maintainers of the NN library have full control over the development
   process that they would like to follow.
## Optimized Kernels

These will be (hopefully thin) wrappers that act as the glue between TFLM and
the NN library.

The goal here is to delegate as much work as possible to the NN library while
still allowing the two APIs (TFLM and the NN library) to remain independent of
each other. If this separation causes a performance degradation (for example,
unnecessary memory copies), we can evaluate it on a case-by-case basis.

This code will be reviewed and merged in the TFLM GitHub repository and must
follow the development style of the TFLM codebase.

Some amount of refactoring of the existing code may be needed to ensure that
code is suitably shared between the reference and optimized kernels. There is
currently no fixed recipe for this refactoring, and we will evaluate it on a
case-by-case basis during PR review.

For example, to add an optimized `fully_connected` implementation for the
Xtensa Fusion F1, the steps were (example commands for exercising the kernel
test follow this list):

* [PR 1](https://github.com/tensorflow/tensorflow/pull/45464): refactor for
  reference fallbacks and establish a baseline latency.
* [PR 2](https://github.com/tensorflow/tensorflow/pull/46242): refactor to
  share code between the reference and optimized kernels.
* [PR 3](https://github.com/tensorflow/tensorflow/pull/46411): add the code
  needed to use the optimized NN library and document the latency improvement.
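Throughout a refactoring sequence like the one above, the kernel tests are the
safety net. Below is a minimal sketch of exercising a single kernel with and
without the optimized implementation; the exact test target name is an
assumption based on the kernel test naming convention and may differ.

```bash
# fully_connected kernel test against the reference implementation (sketch;
# the test target name is assumed and may differ in your checkout):
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> test_kernel_fully_connected_test

# The same test with the optimized kernels; the results are expected to match:
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> test_kernel_fully_connected_test
```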
## Build System Integration

This module is the least well-defined, but we strongly recommend the
following:

1. A single target makefile.inc for all the architectures that you would like
   to support, along with an optional target-specific
   [system_setup.cc](../arduino/system_setup.cc). See
   [cortex_m_generic_makefile.inc](../tools/make/targets/cortex_m_generic_makefile.inc)
   and [xtensa_makefile.inc](../tools/make/targets/xtensa_makefile.inc) as
   examples.

1. A single `ext_libs.inc` (and associated scripts) that downloads any
   external dependencies (including the NN library); a sketch of such a
   download script follows this list. For example:

   * [cmsis_nn.inc](../tools/make/ext_libs/cmsis_nn.inc) and
     [cmsis_download.sh](../tools/make/ext_libs/cmsis_download.sh)
   * [xtensa.inc](../tools/make/ext_libs/xtensa.inc) and
     [xtensa_download.sh](../tools/make/ext_libs/xtensa_download.sh)

1. The optimized kernels will then live in a kernels subdirectory (e.g.
   [kernels/cmsis_nn](../kernels/cmsis_nn) and
   [kernels/xtensa](../kernels/xtensa)).
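As a rough illustration of the second item, a download script typically
fetches a pinned release of the NN library into a downloads directory and
verifies its checksum before unpacking. The sketch below is hypothetical: the
URL, checksum, and directory layout are placeholders rather than the contents
of the actual cmsis_download.sh or xtensa_download.sh.

```bash
#!/bin/bash
# Hypothetical download helper (URL, checksum, and paths are placeholders; see
# cmsis_download.sh and xtensa_download.sh for the real scripts).
set -e

DOWNLOADS_DIR=${1}
LIBRARY_URL="https://example.com/nn_library/archive/v1.0.0.zip"
LIBRARY_MD5="0123456789abcdef0123456789abcdef"
LIBRARY_DIR="${DOWNLOADS_DIR}/nn_library"

if [ -d "${LIBRARY_DIR}" ]; then
  echo "${LIBRARY_DIR} already exists, skipping the download."
else
  TEMP_FILE="$(mktemp)"
  wget "${LIBRARY_URL}" -O "${TEMP_FILE}"
  # Fail if the archive does not match the pinned checksum.
  echo "${LIBRARY_MD5}  ${TEMP_FILE}" | md5sum -c -
  mkdir -p "${LIBRARY_DIR}"
  unzip -q "${TEMP_FILE}" -d "${LIBRARY_DIR}"
  rm "${TEMP_FILE}"
fi
```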
Two development workflows that the TFLM team would like to encourage and
support:

1. Export a static library and headers into a target-specific development
   environment (a sketch of this workflow follows the list).

   * Build a static libtensorflow-microlite.a using the TFLM makefile with:
     `make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target>
     OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite`
   * Use the static library and any TFLM headers as part of the overall
     application (with its own build system).

1. Integrate TFLM with an IDE:

   * This has historically been done using the TFLM Makefile's support for
     project generation.
   * However, given the learning curve and high maintenance overhead, we are
     moving away from supporting project generation via the Makefile and are
     encouraging future IDE integrations to be done outside of the TFLM
     Makefiles.
   * The TFLM team is currently working through the details on this topic.
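As an illustration of the first workflow, the generated static library is then
consumed by the application's own build. Everything below is a placeholder
sketch; the toolchain, include paths, and location of the generated
libtensorflow-microlite.a depend on your build configuration.

```bash
# Build the static library with the optimized kernels (as described above):
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite

# Then link it into the application with the application's own toolchain
# (compiler, include paths, and library path are illustrative placeholders):
<toolchain>-g++ my_app.cc \
  -I<path-to-tflm-source-tree> -I<path-to-flatbuffers-include> \
  <path-to-generated-lib>/libtensorflow-microlite.a -o my_app
```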
## Testing and Continuous Integration

The kernel tests are the primary method of ensuring that the optimized kernel
implementations are accurate.

Currently, most of the tests require the optimized implementations to be
bit-exact with the quantized reference implementations. We can revisit this
requirement if it ends up imposing a high cost on latency.
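One way to check this bit-exactness locally is to run the full set of unit
tests both with and without the optimized kernels. A minimal sketch, assuming
the standard TFLM Makefile `test` target and placeholder values for the target
and optimized kernel directory:

```bash
# All unit tests against the portable reference kernels:
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> test

# The same tests with the optimized kernel implementations enabled:
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> test
```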
We strongly encourage optimized kernel implementations to have an associated
continuous build that runs all of the unit tests and publishes a build badge
to the
[TFLM community supported builds](../README.md#community-supported-builds)
table. Running the unit tests once a day is often a good place to start.