Audio "frontend" TensorFlow operations for feature generation

The most common module in audio processing pipelines is feature generation (also called the frontend). It receives raw audio input and produces filter banks (a vector of values).

More specifically, the audio signal (optionally) goes through a pre-emphasis filter; it is then sliced into (overlapping) frames and a window function is applied to each frame; afterwards, we take a Fourier transform of each frame (more precisely, a short-time Fourier transform) and calculate the power spectrum; and we subsequently compute the filter banks.
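
As an illustration of these stages, here is a floating-point NumPy sketch; the library itself uses a fixed-point implementation, and the pre-emphasis coefficient, window function, and filterbank construction below are generic textbook choices rather than the Op's exact ones:

```python
import numpy as np

def mel_filterbank(num_channels, fft_bins, sample_rate, low_hz=125.0, high_hz=7500.0):
    """Mel-spaced triangular filters as a (num_channels, fft_bins) matrix."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), num_channels + 2)
    bins = np.floor((fft_bins - 1) * 2 * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((num_channels, fft_bins))
    for ch in range(num_channels):
        left, center, right = bins[ch], bins[ch + 1], bins[ch + 2]
        fb[ch, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[ch, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def frontend_sketch(audio, sample_rate=16000, window_ms=25, step_ms=10,
                    pre_emphasis=0.97, num_channels=40):
    # Optional pre-emphasis filter: y[t] = x[t] - a * x[t - 1].
    audio = np.append(audio[0], audio[1:] - pre_emphasis * audio[:-1])
    # Slice into overlapping frames and apply a window function to each.
    size, step = sample_rate * window_ms // 1000, sample_rate * step_ms // 1000
    num_frames = 1 + (len(audio) - size) // step
    frames = np.stack([audio[i * step:i * step + size] for i in range(num_frames)])
    frames = frames * np.hanning(size)
    # Short-time Fourier transform of each frame, then the power spectrum.
    power = np.abs(np.fft.rfft(frames)) ** 2
    # Filter banks: one energy value per mel channel per frame (log-scaled).
    banks = power @ mel_filterbank(num_channels, power.shape[1], sample_rate).T
    return np.log(banks + 1e-6)
```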

## Operations

Here we provide implementations of both a TensorFlow and a TensorFlow Lite operation that encapsulate the functionality of the audio frontend.

Both frontend Ops receive audio data and produce as many unstacked frames (filterbanks) as the audio passed in allows, according to the configuration.

The processing uses a lightweight library to perform:

  1. A slicing window function
  2. Short-time FFTs
  3. Filterbank calculations
  4. Noise reduction
  5. Auto Gain Control
  6. Logarithmic scaling

Please refer to the Op's documentation for details on the different configuration parameters.
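
As a minimal sketch, the TensorFlow Op can be invoked through the wrapper in `python/ops` (the parameter values here are illustrative, and the defaults may differ across versions):

```python
import tensorflow as tf
from tensorflow.lite.experimental.microfrontend.python.ops import (
    audio_microfrontend_op as frontend_op)

# 200ms of 16kHz int16 audio (3200 samples of silence, for illustration).
audio = tf.zeros([3200], dtype=tf.int16)

# One row of filterbank features per complete 25ms window, stepped by 10ms.
filterbanks = frontend_op.audio_microfrontend(
    audio, sample_rate=16000, window_size=25, window_step=10, num_channels=32)
print(filterbanks.shape)  # (18, 32)
```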

However, it is important to clarify the contract of the Ops:

A frontend Op will produce as many unstacked frames as possible with the given audio input.

This means:

  1. The output is a rank-2 Tensor, where the rows correspond to the sequence/time dimension and the columns to the feature dimension.
  2. The Op is expected to receive the right input (in terms of both its position in the audio stream and its amount) as needed to produce the expected output.
  3. Thus, any logic to slice, cache, or otherwise rearrange the input and/or output of the operation must be handled externally in the graph.

For example, a 200ms audio input will produce an output tensor of shape [18, num_channels] when configured with `window_size=25ms` and `window_step=10ms`. The reason is that, on reaching the 180ms point in the audio, there is not enough audio left to construct a complete window.
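
A quick check of that arithmetic:

```python
audio_ms, window_size, window_step = 200, 25, 10
# A complete window starting at t needs audio through t + window_size.
complete_frames = (audio_ms - window_size) // window_step + 1
print(complete_frames)  # 18; the last complete window starts at 170ms
```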

For both functional and efficiency reasons, we provide the following functionality related to input processing:

**Padding.** A boolean flag `zero_padding` that indicates whether to pad the audio with zeros such that we generate output frames based purely on the `window_step`. This means that, in the example above, we would generate a tensor of shape [20, num_channels] by adding enough zeros that we step over all the available audio and can still create complete windows (part of those windows will just be zeros; in the example above, frames 19 and 20 will have the equivalent of 5ms and 15ms of zeros, respectively).
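
The padded frame count and the zeros in the final frames work out as follows:

```python
audio_ms, window_size, window_step = 200, 25, 10
padded_frames = -(-audio_ms // window_step)  # ceil(200 / 10) = 20
for frame in (19, 20):
    start = (frame - 1) * window_step
    print(frame, start + window_size - audio_ms)  # frame 19: 5ms, frame 20: 15ms of zeros
```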

**Striding.** An integer `frame_stride` that indicates the striding step used to generate the output tensor, thus determining the second dimension. In the example above, with `frame_stride=3`, the output tensor would have a shape of [6, 120] when `zero_padding` is set to false, and [7, 120] when it is set to true.
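
The strided row counts follow from the unstrided frame counts above:

```python
import math
frames_without_padding, frames_with_padding, frame_stride = 18, 20, 3
print(math.ceil(frames_without_padding / frame_stride))  # 6
print(math.ceil(frames_with_padding / frame_stride))     # 7
```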