TensorFlow Lite for Microcontrollers

TensorFlow Lite for Microcontrollers is a port of TensorFlow Lite designed to run machine learning models on microcontrollers and other devices with only kilobytes of memory.

To learn how to use the framework, visit the developer documentation at tensorflow.org/lite/microcontrollers.

Porting to a new platform

The remainder of this document provides guidance on porting TensorFlow Lite for Microcontrollers to new platforms. You should read the developer documentation first.

Requirements

Since the core neural network operations are pure arithmetic and don't require any I/O or other system-specific functionality, the code has very few dependencies. We've tried to enforce this, so that it's as easy as possible to get TensorFlow Lite Micro running even on 'bare metal' systems without an OS. Here are the core requirements that a platform needs to run the framework:

  • C/C++ compiler with C++11 support. This is probably the most restrictive of the requirements, since C++11 is not as widely adopted in the embedded world as it is elsewhere. We made the decision to require it since one of the main goals of TFL Micro is to share as much code as possible with the wider TensorFlow codebase, and since that relies on C++11 features, we need compatibility to achieve it. We only use a small, sane subset of C++ though, so don't worry about having to deal with template metaprogramming or similar challenges!

  • Debug logging. The core network operations don't need any I/O functions, but to be able to run tests and tell if they've worked as expected, the framework needs some way to write out a string to some kind of debug console. This will vary from system to system: on Linux, for example, it could just be fprintf(stderr, debug_string), whereas an embedded device might write the string out to a specified UART. As long as there's some mechanism for outputting debug strings, you should be able to use TFL Micro on that platform.

  • Math library. The C standard libm.a library is needed to handle some of the mathematical operations used to calculate neural network results.

  • Global variable initialization. In some places we rely on global variables being initialized before main() is run, so you'll need to make sure your compiler toolchain supports this.

And that's it! You may be wondering about some other common requirements that are needed by a lot of non-embedded software, so here's a brief list of things that aren't necessary to get started with TFL Micro on a new platform:

  • Operating system. Since the only platform-specific function we need is DebugLog(), there's no requirement for any kind of POSIX or similar functionality around files, processes, or threads.

  • C or C++ standard libraries. The framework tries to avoid relying on any standard library functions that require linker-time support. This includes things like string functions, but still allows us to use headers like stdint.h, which typically just define constants and typedefs. Unfortunately this distinction isn't officially defined by any standard, so it's possible that different toolchains may decide to require linked code even for the subset we use, but in practice we've found it's usually a pretty obvious decision and stable across platforms and toolchains.

  • Dynamic memory allocation. All the TFL Micro code avoids dynamic memory allocation, instead relying on local variables on the stack in most cases, or global variables for a few situations. These are all fixed-size, which can mean some compile-time configuration to ensure there's enough space for particular networks, but does avoid any need for a heap and an implementation of malloc/new on a platform (see the sketch after this list).

  • Floating point. Eight-bit integer arithmetic is enough for inference on many networks, so if a model sticks to these kinds of quantized operations, no floating point instructions should be required or executed by the framework.
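
As a concrete illustration of the fixed-size allocation pattern described in the list above, here is a minimal sketch (the buffer name, size, and alignment are placeholders, not part of the framework): working memory for a model is typically carved out of a single statically allocated buffer whose size is tuned at compile time.

#include <cstdint>

// Sketch: a fixed-size, statically allocated working buffer. All tensor and
// scratch allocations come out of this arena, so the platform never needs a
// heap or an implementation of malloc/new.
constexpr int kArenaSize = 10 * 1024;  // tuned per model at compile time
alignas(16) static uint8_t g_arena[kArenaSize];  // example alignment only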

Getting started

We recommend that you start trying to compile and run one of the simplest tests in the framework as your first step. The full TensorFlow codebase can seem overwhelming to work with at first, so instead you can begin with a collection of self-contained project folders that only include the source files needed for a particular test or executable. You can find a set of pre-generated projects here.

As mentioned above, the one function you will need to implement for a completely new platform is debug logging. If your device is just a variation on an existing platform you may be able to reuse code that's already been written. To understand what's available, begin with the default reference implementation at tensorflow/lite/micro/debug_log.cc, which uses fprintf and stderr. If your platform has this level of support for the C standard library in its toolchain, then you can just reuse this. Otherwise, you'll need to do some research into how your platform and device can communicate logging statements to the outside world. As another example, take a look at the Mbed version of DebugLog(), which creates a UART object and uses it to output strings to the host's console if it's connected.
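
For reference, the stdio-based version amounts to a single function along the lines of the sketch below (paraphrased here rather than copied verbatim from debug_log.cc); a bare-metal port replaces the body with whatever output mechanism the device provides, such as a UART write loop. The uart_send_byte() call in the comment is hypothetical.

#include <cstdio>

// Platform hook used by the framework for all debug output.
// Reference behavior: forward the string to stderr via the C library.
extern "C" void DebugLog(const char* s) { fprintf(stderr, "%s", s); }

// A bare-metal variant might instead push each character to a UART:
//   extern "C" void DebugLog(const char* s) {
//     while (*s) { uart_send_byte(*s++); }
//   }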

Begin by navigating to the micro_error_reporter_test folder in the pregenerated projects you downloaded. Inside it, you'll see a set of folders containing all the source code you need. If you look through them, you should find a total of around 60 C or C++ files that, compiled together, will create the test executable. There's an example makefile in the directory that lists all of the source files and include paths for the headers. If you're building on a Linux or macOS host system, you may be able to reuse that same makefile to cross-compile for your system, as long as you swap out the CC and CXX variables from their defaults to point to your cross compiler instead (for example arm-none-eabi-gcc or riscv64-unknown-elf-gcc). Otherwise, set up a project in the build system you are using. This should be fairly straightforward, since all of the source files in the folder need to be compiled, so on many IDEs you can just drag the whole lot in. Then you need to make sure that C++11 compatibility is turned on, and that the right include paths (as mentioned in the makefile) have been added.

You'll see the default DebugLog() implementation in 'tensorflow/lite/micro/debug_log.cc' inside the micro_error_reporter_test folder. Modify that file to add the right implementation for your platform, and then you should be able to build the set of files into an executable. Transfer that executable to your target device (for example by flashing it), and then try running it. You should see output that looks something like this:

Number: 42
Badly-formed format string
Another  badly-formed  format string
~~~ALL TESTS PASSED~~~

If not, you'll need to debug what went wrong, but hopefully with this small starting project it should be manageable.

Troubleshooting

When we've been porting to new platforms, it's often been hard to figure out some of the fundamentals like linker settings and other toolchain setup flags. If you are having trouble, see if you can find a simple example program for your platform, like one that just blinks an LED. If you're able to build and run that successfully, then start to swap in parts of the TF Lite Micro codebase to that working project, taking it a step at a time and ensuring it's still working after every change. For example, a first step might be to paste in your DebugLog() implementation and call DebugLog("Hello World!") from the main function.
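
A first smoke test along those lines can be as small as the sketch below; everything other than the DebugLog() hook itself is example scaffolding.

#include "tensorflow/lite/micro/debug_log.h"

// Minimal check that the platform's DebugLog() implementation works before
// pulling in the rest of the TF Lite Micro sources.
int main() {
  DebugLog("Hello World!");
  return 0;
}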

Another common problem on embedded platforms is the stack size being too small. Mbed defaults to 4KB for the main thread's stack, which is too small for most models since TensorFlow Lite allocates buffers and other data structures that require more memory. The exact size will depend on which model you're running, but try increasing it if you are running into strange corruption issues that might be related to stack overwriting.

Optimizing for your platform

The default reference implementations in TensorFlow Lite Micro are written to be portable and easy to understand, not fast, so you'll want to replace performance-critical parts of the code with versions specifically tailored to your architecture. The framework has been designed with this in mind, and we hope the combination of small modules and many tests makes it as straightforward as possible to swap in your own code a piece at a time, ensuring you have a working version at every step. To write specialized implementations for a platform, it's useful to understand how optional components are handled inside the build system.

Code module organization

We have adopted a system of small modules with platform-specific implementations to help with portability. Every module is just a standard .h header file containing the interface (either functions or a class), with an accompanying reference implementation in a .cc with the same name. The source file implements all of the code that's declared in the header. If you have a specialized implementation, you can create a folder in the same directory as the header and reference source, name it after your platform, and put your implementation in a .cc file inside that folder. We've already seen one example of this, where the Mbed and Bluepill versions of DebugLog() are inside mbed and bluepill folders, children of the same directory where the stdio-based debug_log.cc reference implementation is found.
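
For the debug logging example just mentioned, the resulting layout looks roughly like this:

tensorflow/lite/micro/
  debug_log.h            interface shared by all implementations
  debug_log.cc           portable reference implementation (stdio)
  mbed/debug_log.cc      Mbed-specific implementation
  bluepill/debug_log.cc  Bluepill-specific implementation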

The advantage of this approach is that we can automatically pick specialized implementations based on the current build target, without having to manually edit build files for every new platform. It allows incremental optimizations from an always-working foundation, without cluttering the reference implementations with a lot of variants.

To see why we're doing this, it's worth looking at the alternatives. TensorFlow Lite has traditionally used preprocessor macros to separate out some platform-specific code within particular files, for example:

#ifndef USE_NEON
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#define USE_NEON
#include <arm_neon.h>
#endif
#endif  // USE_NEON

There's also a tradition in gemmlowp of using file suffixes to indicate platform-specific versions of particular headers, with kernel_neon.h being included by kernel.h if USE_NEON is defined. As a third variation, kernels are separated out using a directory structure, with tensorflow/lite/kernels/internal/reference containing portable implementations, and tensorflow/lite/kernels/internal/optimized holding versions optimized for NEON on Arm platforms.
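
In that suffix-based convention, the dispatch lives in the umbrella header, along the lines of this sketch (the name of the portable header here is illustrative):

// kernel.h (sketch): pick the NEON or portable version at compile time.
#ifdef USE_NEON
#include "kernel_neon.h"
#else
#include "kernel_portable.h"
#endif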

These approaches are hard to extend to multiple platforms. Using macros means that platform-specific code is scattered throughout files in a hard-to-find way, and can make following the control flow difficult since you need to understand the macro state to trace it. For example, I temporarily introduced a bug that disabled NEON optimizations for some kernels when I removed tensorflow/lite/kernels/internal/common.h from their includes, without realizing it was where USE_NEON was defined!

It's also tough to port to different build systems, since figuring out the right combination of macros to use can be hard, especially since some of them are automatically defined by the compiler, and others are only set by build scripts, often across multiple rules.

The approach we are using extends the file system approach that we use for kernel implementations, but with some specific conventions:

  • For each module in TensorFlow Lite, there will be a parent directory that contains tests, interface headers used by other modules, and portable implementations of each part.
  • Portable means that the code doesn't include code from any libraries except flatbuffers, or other TF Lite modules. You can include a limited subset of standard C or C++ headers, but you can't use any functions that require linking against those libraries, including fprintf, etc. You can link against functions in the standard math library, in <math.h>.
  • Specialized implementations are held inside subfolders of the parent directory, named after the platform or library that they depend on. So, for example, if you had my_module/foo.cc, a version that used RISC-V extensions would live in my_module/riscv/foo.cc. If you had a version that used the CMSIS library, it should be in my_module/cmsis/foo.cc (see the sketch after this list).
  • These specialized implementations should completely replace the top-level implementations. If this involves too much code duplication, the top-level implementation should be split into smaller files, so only the platform-specific code needs to be replaced.
  • There is a convention about how build systems pick the right implementation file. There will be an ordered list of 'tags' defining the preferred implementations, and to generate the right list of source files, each module will be examined in turn. If a subfolder with a tag's name contains a .cc file with the same base name as one in the parent folder, then it will replace the parent folder's version in the list of build files. If there are multiple subfolders with matching tags and file names, then the tag that's latest in the ordered list will be chosen. This allows us to express "I'd like generically-optimized fixed point if it's available, but I'd prefer something using the CMSIS library" using the list 'fixed_point cmsis'. These tags are passed in as TAGS="<foo>" on the command line when you use the main Makefile to build.
  • There is an implicit "reference" tag at the start of every list, so that it's possible to support directory structures like the current tensorflow/kernels/internal, where portable implementations are held in a "reference" folder that's a sibling to the NEON-optimized folder.
  • The headers for each unit in a module should remain platform-agnostic, and be the same for all implementations. Private headers inside a sub-folder can be used as needed, but shouldn't be referred to by any portable code at the top level.
  • Tests should be at the parent level, with no platform-specific code.
  • No platform-specific macros or #ifdefs should be used in any portable code.
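
Putting those conventions together, a hypothetical module (the my_module and Foo names below are illustrative, following the example in the list) might be laid out like this:

// my_module/foo.h -- platform-agnostic interface, identical for every build.
int Foo(int x);

// my_module/foo.cc -- portable reference implementation.
int Foo(int x) { return x * 2; }

// my_module/riscv/foo.cc -- drop-in replacement picked up when the 'riscv'
// tag is supplied; it must implement the same interface as the reference file.
int Foo(int x) {
  // A RISC-V specific version of the same computation would go here.
  return x * 2;
}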

The implementation of these rules is handled inside the Makefile, with a specialize function that takes a list of reference source file paths as an input, and returns the equivalent list with specialized versions of those files swapped in if they exist.
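
The actual implementation is a Makefile function rather than C++, but the selection rule it applies is roughly the following sketch (names and signature are illustrative):

#include <string>
#include <vector>

// For each reference source file, the last tag in the ordered list that
// provides a file with the same base name wins; otherwise the reference
// version in the parent directory is kept.
std::string Specialize(const std::string& parent_dir,
                       const std::string& base_name,
                       const std::vector<std::string>& tags,
                       bool (*path_exists)(const std::string&)) {
  std::string chosen = parent_dir + "/" + base_name;
  for (const std::string& tag : tags) {  // later tags take precedence
    const std::string candidate = parent_dir + "/" + tag + "/" + base_name;
    if (path_exists(candidate)) chosen = candidate;
  }
  return chosen;
}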

Implementing more optimizations

Clearly, getting debug logging support is only the beginning of the work you'll need to do on a particular platform. It's very likely that you'll want to optimize the core deep learning operations that take up the most time when running models you care about. The good news is that the process for providing optimized implementations is the same as the one you just went through to provide your own logging. You'll need to identify parts of the code that are bottlenecks, and then add specialized implementations in their own folders. These don't need to be platform-specific; they can also be broken out by which library they rely on, for example. Here's where we do that for the CMSIS implementation of integer fast Fourier transforms. This more complex case shows that you can also add helper source files alongside the main implementation, as long as you mention them in the platform-specific makefile. You can also do things like update the list of libraries that need to be linked in, or add include paths to required headers.