Merge pull request #43801 from advaitjain:doc-reorganize

PiperOrigin-RevId: 335748303
Change-Id: Ibac76ae1c02066aa80e59f9b4a9c85f0aae160ae
This commit is contained in:
TensorFlower Gardener 2020-10-06 16:53:29 -07:00
commit 4119a2b2b1
5 changed files with 516 additions and 406 deletions


@ -5,7 +5,6 @@ https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
<!--ts-->
* [Resources](#resources)
* [Contributing Guidelines](#contributing-guidelines)
* [General Pull Request Guidelines](#general-pull-request-guidelines)
* [Guidelines for Specific Contribution Categories](#guidelines-for-specific-contribution-categories)
@ -20,25 +19,10 @@ https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
* [Reviewer notes](#reviewer-notes)
* [Python notes](#python-notes)
<!-- Added by: advaitjain, at: Tue 08 Sep 2020 04:00:31 PM PDT -->
<!-- Added by: advaitjain, at: Mon 05 Oct 2020 02:38:02 PM PDT -->
<!--te-->
# Resources
A
[TF Lite Micro Github issue](https://github.com/tensorflow/tensorflow/issues/new?labels=comp%3Amicro&template=70-tflite-micro-issue.md)
should be the primary method of getting in touch with the TensorFlow Lite Micro
(TFLM) team.
The following resources may also be useful:
1. SIG Micro [email group](https://groups.google.com/a/tensorflow.org/g/micro)
and
[monthly meetings](http://doc/1YHq9rmhrOUdcZnrEnVCWvd87s2wQbq4z17HbeRl-DBc).
1. SIG Micro [gitter chat room](https://gitter.im/tensorflow/sig-micro).
# Contributing Guidelines
We look forward to your contributions to the TensorFlow Lite Micro codebase and


@ -1,112 +0,0 @@
# Memory Management in TensorFlow Lite Micro
This document outlines how memory is managed internally by TensorFlow Lite Micro (TFLM) today. It outlines the "online" allocation strategy used by the default TFLM APIs for loading a model into a shared tensor arena.
## Tensor Arena
The main "working" space for TFLM allocations is inside a single `char` or `int8_t` buffer. This buffer can be managed by passing it directly into a `tflite::MicroInterpreter` constructor or through a `tflite::MicroAllocator` instance that can be passed into a `tflite::MicroInterpreter` constructor. Internally, the `tflite::MicroAllocator` classifies allocations into 3 different sections:
* **Head** - non-persistent allocations.
* **Temporary** - short term "scoped" allocations.
* **Tail** - persistent allocations.
The illustration below represents typical allocations in TFLM:
```
--------------------------------------------------------------------------------
|                        |                      |                              |
|          HEAD          |<--   TEMPORARY    -->|             TAIL             |
|                        |                      |                              |
--------------------------------------------------------------------------------
* Lowest Address                                               Highest Address *
```
### Head Section
This non-persistent section typically holds shared Tensor buffers. This section does not allocate small iterative chunks; it can only be set to a specific length for the entire section.
The allocation length of this section is managed by the `tflite::GreedyMemoryPlanner`. That memory planner looks at the entire graph of a model and tries to reuse as many buffers as possible to create the smallest length for the head. The Tensor buffers for this section can be accessed via a `TfLiteEvalTensor` or `TfLiteTensor` instance on the `tflite::MicroInterpreter`.
### Temporary Section
This section is used to allocate "scoped" or short-term, non-guaranteed buffers. Allocations from this section start from the current end address of the head section and grow towards the tail section. An allocation chain can be reset (and must be reset before adjusting the head) and moves the current allocation start address back to the end of the head section.
TFLM currently uses these allocations for scoped allocations of large C structs or scratch memory that is expected to be valid for at least the lifetime of a method call.
### Tail Section
This section holds all persistent allocations used by TFLM. This section contains many random sized allocations and grows towards the end of the head section. Allocations in this section come from a variety of areas inside of TFLM. TFLM provides a [recording API](#Recording-Memory-APIs) to assist with auditing the contents of this section.
## Recording Memory APIs
TFLM provides simple APIs for auditing memory usage in the shared tensor arena. These APIs are opt-in and require some additional memory overhead and a working debug logging implementation [(reference implementation)](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/debug_log.cc).
A typical bare-bones TFLM interpreter setup looks as such:
```c++
// Buffer for the tensor arena:
constexpr size_t tensor_arena_size = 2048;
uint8_t tensor_arena[tensor_arena_size];
// Interpreter using the shared tensor arena above:
tflite::MicroInterpreter interpreter(
tflite::GetModel(my_model_data), ops_resolver,
tensor_arena, tensor_arena_size, error_reporter);
// Invoke one time which will allocate internals:
if (interpreter.Invoke() != kTfLiteOk) {
TF_LITE_REPORT_ERROR(error_reporter, "Exception during invoke()!");
}
```
The recording APIs can be used by including the `RecordingMicroInterpreter` class (`recording_micro_interpreter.h`) and replacing `tflite::MicroInterpreter` with `tflite::RecordingMicroInterpreter`. The same call to `Invoke()` is performed, but an additional call to `PrintAllocations()` will output detailed allocation logging:
```c++
// Add an include to the recording API:
#include "recording_micro_interpreter.h"
// Simply change the class name from 'MicroInterpreter' to 'RecordingMicroInterpreter':
tflite::RecordingMicroInterpreter interpreter(
tflite::GetModel(my_model_data), ops_resolver,
tensor_arena, tensor_arena_size, error_reporter);
// Invoke one time which will allocate internals:
if (interpreter.Invoke() != kTfLiteOk) {
TF_LITE_REPORT_ERROR(error_reporter, "Exception during invoke()!");
}
// Print out detailed allocation information:
interpreter.PrintAllocations();
```
The output of this call will look something similar to this (output from the [memory_arena_threshold_test](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/memory_arena_threshold_test.cc#L205)):
```sh
[RecordingMicroAllocator] Arena allocation total 9568 bytes
[RecordingMicroAllocator] Arena allocation head 7744 bytes
[RecordingMicroAllocator] Arena allocation tail 1824 bytes
[RecordingMicroAllocator] 'TfLiteEvalTensor data' used 360 bytes with alignment overhead (requested 360 bytes for 15 allocations)
[RecordingMicroAllocator] 'Persistent TfLiteTensor data' used 0 bytes with alignment overhead (requested 0 bytes for 0 tensors)
[RecordingMicroAllocator] 'Persistent TfLiteTensor quantization data' used 0 bytes with alignment overhead (requested 0 bytes for 0 allocations)
[RecordingMicroAllocator] 'TfLiteTensor variable buffer data' used 0 bytes with alignment overhead (requested 0 bytes for 0 allocations)
[RecordingMicroAllocator] 'NodeAndRegistration struct' used 392 bytes with alignment overhead (requested 392 bytes for 7 NodeAndRegistration structs)
[RecordingMicroAllocator] 'Operator runtime data' used 136 bytes with alignment overhead (requested 136 bytes for 5 OpData structs)
```
### Allocation Section Details
More information about each recorded allocation section:
* 'TfLiteEvalTensor data'
* C struct that holds the data type, dimension, and a pointer to the buffer representing the Tensor.
* 'Persistent TfLiteTensor data'
* C struct that holds more information than a `TfLiteEvalTensor` struct in the graph.
* Allocations in this bucket will only show up when accessing tensors from the accessors on `tflite::MicroInterpreter`.
* 'Persistent TfLiteTensor quantization data'
* Length of persistent quantization data assigned to persistent `TfLiteTensor` structs.
* Allocations in this bucket will only show up when accessing tensors from the accessors on `tflite::MicroInterpreter`.
* 'TfLiteTensor variable buffer data'
* Length of buffer data from a variable tensor (retains data throughout calls to `Invoke()`).
* 'NodeAndRegistration struct'
* C struct that holds a `TfLiteRegistration` and `TfLiteNode` struct instance.
* Each operator in a model will contain one `NodeAndRegistration` struct.
* 'Operator runtime data'
* Persistent allocations of data cached by TFLM kernels (e.g. quantization params, multipliers, etc).


@ -1,291 +1,50 @@
<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->
<!--ts-->
* [TensorFlow Lite for Microcontrollers](#tensorflow-lite-for-microcontrollers)
* [Getting Help and Involved](#getting-help-and-involved)
* [Additional Documentation for the TFLM Internals](#additional-documentation-for-the-tflm-internals)
<!-- Added by: advaitjain, at: Mon 05 Oct 2020 02:37:34 PM PDT -->
<!--te-->
# TensorFlow Lite for Microcontrollers
TensorFlow Lite for Microcontrollers is a port of TensorFlow Lite designed to run
machine learning models on microcontrollers and other devices with only kilobytes
of memory.
TensorFlow Lite for Microcontrollers is a port of TensorFlow Lite designed to
run machine learning models on microcontrollers and other devices with only
kilobytes of memory.
To learn how to use the framework, visit the developer documentation at
[tensorflow.org/lite/microcontrollers](https://www.tensorflow.org/lite/microcontrollers).
## Porting to a new platform
# Getting Help and Involved
The remainder of this document provides guidance on porting TensorFlow Lite for
Microcontrollers to new platforms. You should read the
[developer documentation](https://www.tensorflow.org/lite/microcontrollers)
first.
A
[TF Lite Micro Github issue](https://github.com/tensorflow/tensorflow/issues/new?labels=comp%3Amicro&template=70-tflite-micro-issue.md)
should be the primary method of getting in touch with the TensorFlow Lite Micro
(TFLM) team.
### Requirements
The following resources may also be useful:
Since the core neural network operations are pure arithmetic, and don't require
any I/O or other system-specific functionality, the code doesn't have to have
many dependencies. We've tried to enforce this, so that it's as easy as possible
to get TensorFlow Lite Micro running even on 'bare metal' systems without an OS.
Here are the core requirements that a platform needs to run the framework:
1. SIG Micro [email group](https://groups.google.com/a/tensorflow.org/g/micro)
and
[monthly meetings](http://doc/1YHq9rmhrOUdcZnrEnVCWvd87s2wQbq4z17HbeRl-DBc).
- C/C++ compiler capable of C++11 compatibility. This is probably the most
restrictive of the requirements, since C++11 is not as widely adopted in the
embedded world as it is elsewhere. We made the decision to require it since
one of the main goals of TFL Micro is to share as much code as possible with
the wider TensorFlow codebase, and since that relies on C++11 features, we
need compatibility to achieve it. We only use a small, sane, subset of C++
though, so don't worry about having to deal with template metaprogramming or
similar challenges!
1. SIG Micro [gitter chat room](https://gitter.im/tensorflow/sig-micro).
- Debug logging. The core network operations don't need any I/O functions, but
to be able to run tests and tell if they've worked as expected, the
framework needs some way to write out a string to some kind of debug
console. This will vary from system to system, for example on Linux it could
just be `fprintf(stderr, debug_string)` whereas an embedded device might
write the string out to a specified UART. As long as there's some mechanism
for outputting debug strings, you should be able to use TFL Micro on that
platform.
If you are interested in contributing code to TensorFlow Lite for
Microcontrollers then please read our [contributions guide](CONTRIBUTING.md).
- Math library. The C standard `libm.a` library is needed to handle some of
the mathematical operations used to calculate neural network results.
# Additional Documentation
- Global variable initialization. We do use a pattern of relying on global
variables being set before `main()` is run in some places, so you'll need to
make sure your compiler toolchain supports this.
For developers that are interested in more details of the internals of the
project, we have additional documentation in the [docs](docs/) folder.
And that's it! You may be wondering about some other common requirements that
are needed by a lot of non-embedded software, so here's a brief list of things
that aren't necessary to get started with TFL Micro on a new platform:
- Operating system. Since the only platform-specific function we need is
`DebugLog()`, there's no requirement for any kind of Posix or similar
functionality around files, processes, or threads.
- C or C++ standard libraries. The framework tries to avoid relying on any
standard library functions that require linker-time support. This includes
things like string functions, but still allows us to use headers like
`stdtypes.h` which typically just define constants and typedefs.
Unfortunately this distinction isn't officially defined by any standard, so
it's possible that different toolchains may decide to require linked code
even for the subset we use, but in practice we've found it's usually a
pretty obvious decision and stable over platforms and toolchains.
- Dynamic memory allocation. All the TFL Micro code avoids dynamic memory
allocation, instead relying on local variables on the stack in most cases,
or global variables for a few situations. These are all fixed-size, which
can mean some compile-time configuration to ensure there's enough space for
particular networks, but does avoid any need for a heap and the
implementation of `malloc`/`new` on a platform.
- Floating point. Eight-bit integer arithmetic is enough for inference on many
networks, so if a model sticks to these kind of quantized operations, no
floating point instructions should be required or executed by the framework.
### Getting started
We recommend that you start trying to compile and run one of the simplest tests
in the framework as your first step. The full TensorFlow codebase can seem
overwhelming to work with at first, so instead you can begin with a collection
of self-contained project folders that only include the source files needed for
a particular test or executable. You can find a set of pre-generated projects
[here](https://drive.google.com/open?id=1cawEQAkqquK_SO4crReDYqf_v7yAwOY8).
As mentioned above, the one function you will need to implement for a completely
new platform is debug logging. If your device is just a variation on an existing
platform you may be able to reuse code that's already been written. To
understand what's available, begin with the default reference implementation at
[tensorflow/lite/micro/debug_log.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/debug_log.cc),
which uses fprintf and stderr. If your platform has this level of support for
the C standard library in its toolchain, then you can just reuse this.
Otherwise, you'll need to do some research into how your platform and device can
communicate logging statements to the outside world. As another example, take a
look at
[the Mbed version of `DebugLog()`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/mbed/debug_log.cc),
which creates a UART object and uses it to output strings to the host's console
if it's connected.
Begin by navigating to the micro_error_reporter_test folder in the pregenerated
projects you downloaded. Inside here, you'll see a set of folders containing all
the source code you need. If you look through them, you should find a total of
around 60 C or C++ files that compiled together will create the test executable.
There's an example makefile in the directory that lists all of the source files
and include paths for the headers. If you're building on a Linux or MacOS host
system, you may just be able to reuse that same makefile to cross-compile for
your system, as long as you swap out the `CC` and `CXX` variables from their
defaults, to point to your cross compiler instead (for example
`arm-none-eabi-gcc` or `riscv64-unknown-elf-gcc`). Otherwise, set up a project
in the build system you are using. It should hopefully be fairly
straightforward, since all of the source files in the folder need to be
compiled, so on many IDEs you can just drag the whole lot in. Then you need to
make sure that C++11 compatibility is turned on, and that the right include
paths (as mentioned in the makefile) have been added.
You'll see the default `DebugLog()` implementation in
'tensorflow/lite/micro/debug_log.cc' inside the
micro_error_reporter_test folder. Modify that file to add the right
implementation for your platform, and then you should be able to build the set
of files into an executable. Transfer that executable to your target device (for
example by flashing it), and then try running it. You should see output that
looks something like this:
```
Number: 42
Badly-formed format string
Another badly-formed format string
~~~ALL TESTS PASSED~~~
```
If not, you'll need to debug what went wrong, but hopefully with this small
starting project it should be manageable.
### Troubleshooting
When we've been porting to new platforms, it's often been hard to figure out
some of the fundamentals like linker settings and other toolchain setup flags.
If you are having trouble, see if you can find a simple example program for your
platform, like one that just blinks an LED. If you're able to build and run that
successfully, then start to swap in parts of the TF Lite Micro codebase to that
working project, taking it a step at a time and ensuring it's still working
after every change. For example, a first step might be to paste in your
`DebugLog()` implementation and call `DebugLog("Hello World!")` from the main
function.
Another common problem on embedded platforms is the stack size being too small.
Mbed defaults to 4KB for the main thread's stack, which is too small for most
models since TensorFlow Lite allocates buffers and other data structures that
require more memory. The exact size will depend on which model you're running,
but try increasing it if you are running into strange corruption issues that
might be related to stack overwriting.
### Optimizing for your platform
The default reference implementations in TensorFlow Lite Micro are written to be
portable and easy to understand, not fast, so you'll want to replace performance
critical parts of the code with versions specifically tailored to your
architecture. The framework has been designed with this in mind, and we hope the
combination of small modules and many tests makes it as straightforward as
possible to swap in your own code a piece at a time, ensuring you have a working
version at every step. To write specialized implementations for a platform, it's
useful to understand how optional components are handled inside the build
system.
### Code module organization
We have adopted a system of small modules with platform-specific implementations
to help with portability. Every module is just a standard `.h` header file
containing the interface (either functions or a class), with an accompanying
reference implementation in a `.cc` with the same name. The source file
implements all of the code that's declared in the header. If you have a
specialized implementation, you can create a folder in the same directory as the
header and reference source, name it after your platform, and put your
implementation in a `.cc` file inside that folder. We've already seen one
example of this, where the Mbed and Bluepill versions of `DebugLog()` are inside
[mbed](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/mbed)
and
[bluepill](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/bluepill)
folders, children of the
[same directory](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite//micro)
where the stdio-based
[`debug_log.cc`](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/debug_log.cc)
reference implementation is found.
The advantage of this approach is that we can automatically pick specialized
implementations based on the current build target, without having to manually
edit build files for every new platform. It allows incremental optimizations
from an always-working foundation, without cluttering the reference
implementations with a lot of variants.
To see why we're doing this, it's worth looking at the alternatives. TensorFlow
Lite has traditionally used preprocessor macros to separate out some
platform-specific code within particular files, for example:
```
#ifndef USE_NEON
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#define USE_NEON
#include <arm_neon.h>
#endif
```
There's also a tradition in gemmlowp of using file suffixes to indicate
platform-specific versions of particular headers, with kernel_neon.h being
included by kernel.h if `USE_NEON` is defined. As a third variation, kernels are
separated out using a directory structure, with
tensorflow/lite/kernels/internal/reference containing portable implementations,
and tensorflow/lite/kernels/internal/optimized holding versions optimized for
NEON on Arm platforms.
These approaches are hard to extend to multiple platforms. Using macros means
that platform-specific code is scattered throughout files in a hard-to-find way,
and can make following the control flow difficult since you need to understand
the macro state to trace it. For example, I temporarily introduced a bug that
disabled NEON optimizations for some kernels when I removed
tensorflow/lite/kernels/internal/common.h from their includes, without realizing
it was where USE_NEON was defined!
It's also tough to port to different build systems, since figuring out the right
combination of macros to use can be hard, especially since some of them are
automatically defined by the compiler, and others are only set by build scripts,
often across multiple rules.
The approach we are using extends the file system approach that we use for
kernel implementations, but with some specific conventions:
- For each module in TensorFlow Lite, there will be a parent directory that
contains tests, interface headers used by other modules, and portable
implementations of each part.
- Portable means that the code doesn't include code from any libraries except
flatbuffers, or other TF Lite modules. You can include a limited subset of
standard C or C++ headers, but you can't use any functions that require
linking against those libraries, including fprintf, etc. You can link
against functions in the standard math library, in <math.h>.
- Specialized implementations are held inside subfolders of the parent
directory, named after the platform or library that they depend on. So, for
example if you had my_module/foo.cc, a version that used RISC-V extensions
would live in my_module/riscv/foo.cc. If you had a version that used the
CMSIS library, it should be in my_module/cmsis/foo.cc.
- These specialized implementations should completely replace the top-level
implementations. If this involves too much code duplication, the top-level
implementation should be split into smaller files, so only the
platform-specific code needs to be replaced.
- There is a convention about how build systems pick the right implementation
file. There will be an ordered list of 'tags' defining the preferred
implementations, and to generate the right list of source files, each module
will be examined in turn. If a subfolder with a tag's name contains a .cc
file with the same base name as one in the parent folder, then it will
replace the parent folder's version in the list of build files. If there are
multiple subfolders with matching tags and file names, then the tag that's
latest in the ordered list will be chosen. This allows us to express “I'd
like generically-optimized fixed point if it's available, but I'd prefer
something using the CMSIS library” using the list 'fixed_point cmsis'. These
tags are passed in as `TAGS="<foo>"` on the command line when you use the
main Makefile to build.
- There is an implicit “reference” tag at the start of every list, so that
it's possible to support directory structures like the current
tensorflow/kernels/internal where portable implementations are held in a
“reference” folder that's a sibling to the NEON-optimized folder.
- The headers for each unit in a module should remain platform-agnostic, and
be the same for all implementations. Private headers inside a sub-folder can
be used as needed, but shouldn't be referred to by any portable code at the
top level.
- Tests should be at the parent level, with no platform-specific code.
- No platform-specific macros or #ifdefs should be used in any portable code.
The implementation of these rules is handled inside the Makefile, with a
[`specialize` function](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/tools/make/helper_functions.inc#L42)
that takes a list of reference source file paths as an input, and returns the
equivalent list with specialized versions of those files swapped in if they
exist.
### Implementing more optimizations
Clearly, getting debug logging support is only the beginning of the work you'll
need to do on a particular platform. It's very likely that you'll want to
optimize the core deep learning operations that take up the most time when
running models you care about. The good news is that the process for providing
optimized implementations is the same as the one you just went through to
provide your own logging. You'll need to identify parts of the code that are
bottlenecks, and then add specialized implementations in their own folders.
These don't need to be platform specific, they can also be broken out by which
library they rely on for example. [Here's where we do that for the CMSIS
implementation of integer fast-fourier
transforms](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/simple_features/simple_features_generator.cc).
This more complex case shows that you can also add helper source files alongside
the main implementation, as long as you
[mention them in the platform-specific makefile](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/CMSIS/Makefile.inc).
You can also do things like update the list of libraries that need to be linked
in, or add include paths to required headers.
* [Benchmarks](benchmarks/README.md)
* [Memory Management](docs/memory_management.md)
* [New Platform Support](docs/new_platform_support.md)


@ -0,0 +1,173 @@
<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->
<!--ts-->
* [Memory Management in TensorFlow Lite Micro](#memory-management-in-tensorflow-lite-micro)
* [Tensor Arena](#tensor-arena)
* [Head Section](#head-section)
* [Temporary Section](#temporary-section)
* [Tail Section](#tail-section)
* [Recording Memory APIs](#recording-memory-apis)
* [Allocation Section Details](#allocation-section-details)
<!-- Added by: advaitjain, at: Mon 05 Oct 2020 02:21:02 PM PDT -->
<!--te-->
# Memory Management in TensorFlow Lite Micro
This document outlines how memory is managed internally by TensorFlow Lite Micro
(TFLM) today. It outlines the "online" allocation strategy used by the default
TFLM APIs for loading a model into a shared tensor arena.
## Tensor Arena
The main "working" space for TFLM allocations is inside a single `char` or
`int8_t` buffer. This buffer can be managed by passing it directly into a
`tflite::MicroInterpreter` constructor or through a `tflite::MicroAllocator`
instance that can be passed into a `tflite::MicroInterpreter` constructor.
Internally, the `tflite::MicroAllocator` classifies allocations into 3 different
sections:
* **Head** - non-persistent allocations.
* **Temporary** - short term "scoped" allocations.
* **Tail** - persistent allocations.
The illustration below represents typical allocations in TFLM:
```
--------------------------------------------------------------------------------
|                        |                      |                              |
|          HEAD          |<--   TEMPORARY    -->|             TAIL             |
|                        |                      |                              |
--------------------------------------------------------------------------------
* Lowest Address                                               Highest Address *
```
### Head Section
This non-persistent section typically holds shared Tensor buffers. This section
does not allocate small iterative chunks; it can only be set to a specific
length for the entire section.
The allocation length of this section is managed by the
`tflite::GreedyMemoryPlanner`. That memory planner looks at the entire graph of
a model and tries to reuse as many buffers as possible to create the smallest
length for the head. The Tensor buffers for this section can be accessed via a
`TfLiteEvalTensor` or `TfLiteTensor` instance on the `tflite::MicroInterpreter`.
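As a brief illustration, the following sketch shows how a head-section Tensor
buffer can be reached through those accessors. It assumes the `interpreter` and
`error_reporter` objects from the setup snippet shown later in the
[Recording Memory APIs](#recording-memory-apis) section:

```c++
// Plan the head section and fetch the first input tensor. AllocateTensors()
// and input() are tflite::MicroInterpreter methods; the surrounding
// interpreter/error_reporter objects are assumed from the setup snippet below.
if (interpreter.AllocateTensors() != kTfLiteOk) {
  TF_LITE_REPORT_ERROR(error_reporter, "AllocateTensors() failed");
}
TfLiteTensor* input = interpreter.input(0);
// For a quantized model, input->data.int8 points into the head of the arena.
input->data.int8[0] = 0;
```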
### Temporary Section
This section is used to allocate "scoped" or short-term, non-guaranteed buffers.
Allocations from this section start from the current end address of the head
section and grow towards the tail section. An allocation chain can be reset (and
must be reset before adjusting the head) and moves the current allocation start
address back to the end of the head section.
TFLM currently uses these allocations for scoped allocations of large C structs
or scratch memory that is expected to be valid for at least the lifetime of a
method call.
### Tail Section
This section holds all persistent allocations used by TFLM. This section
contains many random sized allocations and grows towards the end of the head
section. Allocations in this section come from a variety of areas inside of
TFLM. TFLM provides a [recording API](#Recording-Memory-APIs) to assist with
auditing the contents of this section.
## Recording Memory APIs
TFLM provides simple APIs for auditing memory usage in the shared tensor arena.
These APIs are opt-in and require some additional memory overhead and a working
debug logging implementation
[(reference implementation)](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/debug_log.cc).
A typical bare-bones TFLM interpreter setup looks as such:
```c++
// Buffer for the tensor arena:
constexpr size_t tensor_arena_size = 2048;
uint8_t tensor_arena[tensor_arena_size];
// Interpreter using the shared tensor arena above:
tflite::MicroInterpreter interpreter(
tflite::GetModel(my_model_data), ops_resolver,
tensor_arena, tensor_arena_size, error_reporter);
// Invoke one time which will allocate internals:
if (interpreter.Invoke() != kTfLiteOk) {
TF_LITE_REPORT_ERROR(error_reporter, "Exception during invoke()!");
}
```
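The snippet above assumes a few objects declared elsewhere (`my_model_data`,
`ops_resolver`, `error_reporter`). A minimal sketch of those declarations,
assuming the corresponding headers from the TFLM source tree:

```c++
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"

// Error reporter that routes TF_LITE_REPORT_ERROR output to DebugLog().
tflite::MicroErrorReporter micro_error_reporter;
tflite::ErrorReporter* error_reporter = &micro_error_reporter;

// Registers every built-in kernel; a MicroMutableOpResolver listing only the
// ops used by the model is the smaller option for production builds.
tflite::AllOpsResolver ops_resolver;

// my_model_data is the model flatbuffer, typically a generated C array.
```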
The recording APIs can be used by including the `RecordingMicroInterpreter`
class (`recording_micro_interpreter.h`) and replacing `tflite::MicroInterpreter`
with `tflite::RecordingMicroInterpreter`. The same call to `Invoke()` is
performed, but an additional call to `PrintAllocations()` will output detailed
allocation logging:
```c++
// Add an include to the recording API:
#include "recording_micro_interpreter.h"
// Simply change the class name from 'MicroInterpreter' to 'RecordingMicroInterpreter':
tflite::RecordingMicroInterpreter interpreter(
tflite::GetModel(my_model_data), ops_resolver,
tensor_arena, tensor_arena_size, error_reporter);
// Invoke one time which will allocate internals:
if (interpreter.Invoke() != kTfLiteOk) {
TF_LITE_REPORT_ERROR(error_reporter, "Exception during invoke()!");
}
// Print out detailed allocation information:
interpreter.PrintAllocations();
```
The output of this call will look something similar to this (output from the
[memory_arena_threshold_test](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/memory_arena_threshold_test.cc#L205)):
```sh
[RecordingMicroAllocator] Arena allocation total 9568 bytes
[RecordingMicroAllocator] Arena allocation head 7744 bytes
[RecordingMicroAllocator] Arena allocation tail 1824 bytes
[RecordingMicroAllocator] 'TfLiteEvalTensor data' used 360 bytes with alignment overhead (requested 360 bytes for 15 allocations)
[RecordingMicroAllocator] 'Persistent TfLiteTensor data' used 0 bytes with alignment overhead (requested 0 bytes for 0 tensors)
[RecordingMicroAllocator] 'Persistent TfLiteTensor quantization data' used 0 bytes with alignment overhead (requested 0 bytes for 0 allocations)
[RecordingMicroAllocator] 'TfLiteTensor variable buffer data' used 0 bytes with alignment overhead (requested 0 bytes for 0 allocations)
[RecordingMicroAllocator] 'NodeAndRegistration struct' used 392 bytes with alignment overhead (requested 392 bytes for 7 NodeAndRegistration structs)
[RecordingMicroAllocator] 'Operator runtime data' used 136 bytes with alignment overhead (requested 136 bytes for 5 OpData structs)
```
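The recorded totals can also be inspected programmatically instead of (or in
addition to) printing them. A hedged sketch, assuming the
`RecordingMicroInterpreter` from the snippet above (the accessor, enum, and
field names follow the recording API but may shift between TFLM versions):

```c++
// Query a single recorded allocation bucket rather than printing everything.
const tflite::RecordingMicroAllocator& recording_allocator =
    interpreter.GetMicroAllocator();
const tflite::RecordedAllocation eval_tensor_data =
    recording_allocator.GetRecordedAllocation(
        tflite::RecordedAllocationType::kTfLiteEvalTensorData);
TF_LITE_REPORT_ERROR(error_reporter, "TfLiteEvalTensor data used %d bytes",
                     static_cast<int>(eval_tensor_data.used_bytes));
```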
### Allocation Section Details
More information about each recorded allocation section:
* 'TfLiteEvalTensor data'
* C struct that holds the data type, dimension, and a pointer to the
buffer representing the Tensor.
* 'Persistent TfLiteTensor data'
* C struct that holds more information than a `TfLiteEvalTensor` struct in
the graph.
* Allocations in this bucket will only show up when accessing tensors from
the accessors on `tflite::MicroInterpreter`.
* 'Persistent TfLiteTensor quantization data'
* Length of persistent quantization data assigned to persistent
`TfLiteTensor` structs.
* Allocations in this bucket will only show up when accessing tensors from
the accessors on `tflite::MicroInterpreter`.
* 'TfLiteTensor variable buffer data'
* Length of buffer data from a variable tensor (retains data throughout
calls to `Invoke()`).
* 'NodeAndRegistration struct'
* C struct that holds a `TfLiteRegistration` and `TfLiteNode` struct
instance.
* Each operator in a model will contain one `NodeAndRegistration` struct.
* 'Operator runtime data'
* Persistent allocations of data cached by TFLM kernels (e.g. quantization
params, multipliers, etc).


@ -0,0 +1,306 @@
<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->
<!--ts-->
* [Porting to a new platform](#porting-to-a-new-platform)
* [Requirements](#requirements)
* [Getting started](#getting-started)
* [Troubleshooting](#troubleshooting)
* [Optimizing for your platform](#optimizing-for-your-platform)
* [Code module organization](#code-module-organization)
* [Implementing more optimizations](#implementing-more-optimizations)
<!-- Added by: advaitjain, at: Mon 05 Oct 2020 02:36:46 PM PDT -->
<!--te-->
***Please note that we are currently pausing the acceptance of new platforms***. Please
see our [contributions guide](../CONTRIBUTING.md) for more details and context.
Parts of the documentation below will likely change as we start accepting new
platform support again.
# Porting to a new platform
The remainder of this document provides guidance on porting TensorFlow Lite for
Microcontrollers to new platforms. You should read the
[developer documentation](https://www.tensorflow.org/lite/microcontrollers)
first.
## Requirements
Since the core neural network operations are pure arithmetic, and don't require
any I/O or other system-specific functionality, the code doesn't have to have
many dependencies. We've tried to enforce this, so that it's as easy as possible
to get TensorFlow Lite Micro running even on 'bare metal' systems without an OS.
Here are the core requirements that a platform needs to run the framework:
- C/C++ compiler capable of C++11 compatibility. This is probably the most
restrictive of the requirements, since C++11 is not as widely adopted in the
embedded world as it is elsewhere. We made the decision to require it since
one of the main goals of TFL Micro is to share as much code as possible with
the wider TensorFlow codebase, and since that relies on C++11 features, we
need compatibility to achieve it. We only use a small, sane, subset of C++
though, so don't worry about having to deal with template metaprogramming or
similar challenges!
- Debug logging. The core network operations don't need any I/O functions, but
to be able to run tests and tell if they've worked as expected, the
framework needs some way to write out a string to some kind of debug
console. This will vary from system to system, for example on Linux it could
just be `fprintf(stderr, debug_string)` whereas an embedded device might
write the string out to a specified UART. As long as there's some mechanism
for outputting debug strings, you should be able to use TFL Micro on that
platform.
- Math library. The C standard `libm.a` library is needed to handle some of
the mathematical operations used to calculate neural network results.
- Global variable initialization. We do use a pattern of relying on global
variables being set before `main()` is run in some places, so you'll need to
make sure your compiler toolchain supports this.
And that's it! You may be wondering about some other common requirements that
are needed by a lot of non-embedded software, so here's a brief list of things
that aren't necessary to get started with TFL Micro on a new platform:
- Operating system. Since the only platform-specific function we need is
`DebugLog()`, there's no requirement for any kind of Posix or similar
functionality around files, processes, or threads.
- C or C++ standard libraries. The framework tries to avoid relying on any
standard library functions that require linker-time support. This includes
things like string functions, but still allows us to use headers like
`stdtypes.h` which typically just define constants and typedefs.
Unfortunately this distinction isn't officially defined by any standard, so
it's possible that different toolchains may decide to require linked code
even for the subset we use, but in practice we've found it's usually a
pretty obvious decision and stable over platforms and toolchains.
- Dynamic memory allocation. All the TFL Micro code avoids dynamic memory
allocation, instead relying on local variables on the stack in most cases,
or global variables for a few situations. These are all fixed-size, which
can mean some compile-time configuration to ensure there's enough space for
particular networks, but does avoid any need for a heap and the
implementation of `malloc`/`new` on a platform.
- Floating point. Eight-bit integer arithmetic is enough for inference on many
networks, so if a model sticks to these kind of quantized operations, no
floating point instructions should be required or executed by the framework.
## Getting started
We recommend that you start trying to compile and run one of the simplest tests
in the framework as your first step. The full TensorFlow codebase can seem
overwhelming to work with at first, so instead you can begin with a collection
of self-contained project folders that only include the source files needed for
a particular test or executable. You can find a set of pre-generated projects
[here](https://drive.google.com/open?id=1cawEQAkqquK_SO4crReDYqf_v7yAwOY8).
As mentioned above, the one function you will need to implement for a completely
new platform is debug logging. If your device is just a variation on an existing
platform you may be able to reuse code that's already been written. To
understand what's available, begin with the default reference implementation at
[tensorflow/lite/micro/debug_log.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/debug_log.cc),
which uses fprintf and stderr. If your platform has this level of support for
the C standard library in its toolchain, then you can just reuse this.
Otherwise, you'll need to do some research into how your platform and device can
communicate logging statements to the outside world. As another example, take a
look at
[the Mbed version of `DebugLog()`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/mbed/debug_log.cc),
which creates a UART object and uses it to output strings to the host's console
if it's connected.
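As a rough sketch, a bare-metal implementation of that hook might look like the
following, where `MyBoardUartWrite()` stands in for whatever serial or console
routine your platform actually provides:

```c++
#include "tensorflow/lite/micro/debug_log.h"

// Hypothetical platform routine that writes a NUL-terminated string out over a
// UART; replace it with your board's own console API.
extern void MyBoardUartWrite(const char* s);

// TFLM routes all of its debug output through this single C-linkage hook.
extern "C" void DebugLog(const char* s) { MyBoardUartWrite(s); }
```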
Begin by navigating to the micro_error_reporter_test folder in the pregenerated
projects you downloaded. Inside here, you'll see a set of folders containing all
the source code you need. If you look through them, you should find a total of
around 60 C or C++ files that compiled together will create the test executable.
There's an example makefile in the directory that lists all of the source files
and include paths for the headers. If you're building on a Linux or MacOS host
system, you may just be able to reuse that same makefile to cross-compile for
your system, as long as you swap out the `CC` and `CXX` variables from their
defaults, to point to your cross compiler instead (for example
`arm-none-eabi-gcc` or `riscv64-unknown-elf-gcc`). Otherwise, set up a project
in the build system you are using. It should hopefully be fairly
straightforward, since all of the source files in the folder need to be
compiled, so on many IDEs you can just drag the whole lot in. Then you need to
make sure that C++11 compatibility is turned on, and that the right include
paths (as mentioned in the makefile) have been added.
You'll see the default `DebugLog()` implementation in
'tensorflow/lite/micro/debug_log.cc' inside the micro_error_reporter_test
folder. Modify that file to add the right implementation for your platform, and
then you should be able to build the set of files into an executable. Transfer
that executable to your target device (for example by flashing it), and then try
running it. You should see output that looks something like this:
```
Number: 42
Badly-formed format string
Another badly-formed format string
~~~ALL TESTS PASSED~~~
```
If not, you'll need to debug what went wrong, but hopefully with this small
starting project it should be manageable.
## Troubleshooting
When we've been porting to new platforms, it's often been hard to figure out
some of the fundamentals like linker settings and other toolchain setup flags.
If you are having trouble, see if you can find a simple example program for your
platform, like one that just blinks an LED. If you're able to build and run that
successfully, then start to swap in parts of the TF Lite Micro codebase to that
working project, taking it a step at a time and ensuring it's still working
after every change. For example, a first step might be to paste in your
`DebugLog()` implementation and call `DebugLog("Hello World!")` from the main
function.
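That first step can be as small as the following sketch, built with the same
toolchain settings as your LED-blink example:

```c++
#include "tensorflow/lite/micro/debug_log.h"

int main(int argc, char* argv[]) {
  // If this string reaches your console/UART, logging is wired up correctly.
  DebugLog("Hello World!\n");
  return 0;
}
```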
Another common problem on embedded platforms is the stack size being too small.
Mbed defaults to 4KB for the main thread's stack, which is too small for most
models since TensorFlow Lite allocates buffers and other data structures that
require more memory. The exact size will depend on which model you're running,
but try increasing it if you are running into strange corruption issues that
might be related to stack overwriting.
## Optimizing for your platform
The default reference implementations in TensorFlow Lite Micro are written to be
portable and easy to understand, not fast, so you'll want to replace performance
critical parts of the code with versions specifically tailored to your
architecture. The framework has been designed with this in mind, and we hope the
combination of small modules and many tests makes it as straightforward as
possible to swap in your own code a piece at a time, ensuring you have a working
version at every step. To write specialized implementations for a platform, it's
useful to understand how optional components are handled inside the build
system.
## Code module organization
We have adopted a system of small modules with platform-specific implementations
to help with portability. Every module is just a standard `.h` header file
containing the interface (either functions or a class), with an accompanying
reference implementation in a `.cc` with the same name. The source file
implements all of the code that's declared in the header. If you have a
specialized implementation, you can create a folder in the same directory as the
header and reference source, name it after your platform, and put your
implementation in a `.cc` file inside that folder. We've already seen one
example of this, where the Mbed and Bluepill versions of `DebugLog()` are inside
[mbed](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/mbed)
and
[bluepill](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/bluepill)
folders, children of the
[same directory](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite//micro)
where the stdio-based
[`debug_log.cc`](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/debug_log.cc)
reference implementation is found.
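As a purely hypothetical illustration of that layout, a module header
`my_module/audio_timer.h` might look like this:

```c++
// my_module/audio_timer.h -- platform-agnostic interface (hypothetical module).
#ifndef MY_MODULE_AUDIO_TIMER_H_
#define MY_MODULE_AUDIO_TIMER_H_

#include <cstdint>

// Returns a monotonically increasing tick count for audio timestamping.
int32_t AudioTimerTicks();

#endif  // MY_MODULE_AUDIO_TIMER_H_
```

The portable reference implementation would live in `my_module/audio_timer.cc`,
and an Mbed-specific version in `my_module/mbed/audio_timer.cc` would replace it
when building for Mbed, with no `#ifdef`s in the portable code.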
The advantage of this approach is that we can automatically pick specialized
implementations based on the current build target, without having to manually
edit build files for every new platform. It allows incremental optimizations
from an always-working foundation, without cluttering the reference
implementations with a lot of variants.
To see why we're doing this, it's worth looking at the alternatives. TensorFlow
Lite has traditionally used preprocessor macros to separate out some
platform-specific code within particular files, for example:
```
#ifndef USE_NEON
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#define USE_NEON
#include <arm_neon.h>
#endif
```
There's also a tradition in gemmlowp of using file suffixes to indicate
platform-specific versions of particular headers, with kernel_neon.h being
included by kernel.h if `USE_NEON` is defined. As a third variation, kernels are
separated out using a directory structure, with
tensorflow/lite/kernels/internal/reference containing portable implementations,
and tensorflow/lite/kernels/internal/optimized holding versions optimized for
NEON on Arm platforms.
These approaches are hard to extend to multiple platforms. Using macros means
that platform-specific code is scattered throughout files in a hard-to-find way,
and can make following the control flow difficult since you need to understand
the macro state to trace it. For example, I temporarily introduced a bug that
disabled NEON optimizations for some kernels when I removed
tensorflow/lite/kernels/internal/common.h from their includes, without realizing
it was where USE_NEON was defined!
It's also tough to port to different build systems, since figuring out the right
combination of macros to use can be hard, especially since some of them are
automatically defined by the compiler, and others are only set by build scripts,
often across multiple rules.
The approach we are using extends the file system approach that we use for
kernel implementations, but with some specific conventions:
- For each module in TensorFlow Lite, there will be a parent directory that
contains tests, interface headers used by other modules, and portable
implementations of each part.
- Portable means that the code doesn't include code from any libraries except
flatbuffers, or other TF Lite modules. You can include a limited subset of
standard C or C++ headers, but you can't use any functions that require
linking against those libraries, including fprintf, etc. You can link
against functions in the standard math library, in <math.h>.
- Specialized implementations are held inside subfolders of the parent
directory, named after the platform or library that they depend on. So, for
example if you had my_module/foo.cc, a version that used RISC-V extensions
would live in my_module/riscv/foo.cc. If you had a version that used the
CMSIS library, it should be in my_module/cmsis/foo.cc.
- These specialized implementations should completely replace the top-level
implementations. If this involves too much code duplication, the top-level
implementation should be split into smaller files, so only the
platform-specific code needs to be replaced.
- There is a convention about how build systems pick the right implementation
file. There will be an ordered list of 'tags' defining the preferred
implementations, and to generate the right list of source files, each module
will be examined in turn. If a subfolder with a tag's name contains a .cc
file with the same base name as one in the parent folder, then it will
replace the parent folder's version in the list of build files. If there are
multiple subfolders with matching tags and file names, then the tag that's
latest in the ordered list will be chosen. This allows us to express “I'd
like generically-optimized fixed point if it's available, but I'd prefer
something using the CMSIS library” using the list 'fixed_point cmsis'. These
tags are passed in as `TAGS="<foo>"` on the command line when you use the
main Makefile to build.
- There is an implicit “reference” tag at the start of every list, so that
it's possible to support directory structures like the current
tensorflow/kernels/internal where portable implementations are held in a
“reference” folder that's a sibling to the NEON-optimized folder.
- The headers for each unit in a module should remain platform-agnostic, and
be the same for all implementations. Private headers inside a sub-folder can
be used as needed, but shouldn't be referred to by any portable code at the
top level.
- Tests should be at the parent level, with no platform-specific code.
- No platform-specific macros or #ifdefs should be used in any portable code.
The implementation of these rules is handled inside the Makefile, with a
[`specialize` function](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/tools/make/helper_functions.inc#L42)
that takes a list of reference source file paths as an input, and returns the
equivalent list with specialized versions of those files swapped in if they
exist.
## Implementing more optimizations
Clearly, getting debug logging support is only the beginning of the work you'll
need to do on a particular platform. It's very likely that you'll want to
optimize the core deep learning operations that take up the most time when
running models you care about. The good news is that the process for providing
optimized implementations is the same as the one you just went through to
provide your own logging. You'll need to identify parts of the code that are
bottlenecks, and then add specialized implementations in their own folders.
These don't need to be platform specific, they can also be broken out by which
library they rely on for example. [Here's where we do that for the CMSIS
implementation of integer fast-fourier
transforms](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/simple_features/simple_features_generator.cc).
This more complex case shows that you can also add helper source files alongside
the main implementation, as long as you
[mention them in the platform-specific makefile](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/CMSIS/Makefile.inc).
You can also do things like update the list of libraries that need to be linked
in, or add include paths to required headers.