Audio "frontend" TensorFlow operations for feature generation
The module most commonly shared by audio processing pipelines is feature generation (also called the frontend). It receives raw audio input and produces filter banks (a vector of values).

More specifically, the audio signal goes through a pre-emphasis filter (optionally); it is then sliced into (overlapping) frames, and a window function is applied to each frame; afterwards, we take a Fourier transform of each frame (more specifically, a Short-Time Fourier Transform) and calculate the power spectrum; and finally we compute the filter banks.
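To make the pipeline concrete, the following is a minimal NumPy sketch of those stages: pre-emphasis, framing and windowing, STFT power spectrum, and filter banks. The Hann window, the pre-emphasis coefficient, and the uniformly spaced triangular filters are illustrative assumptions, not the library's exact implementation (which has its own filterbank design; see the Ops below).

```python
import numpy as np

def frontend(audio, sample_rate=16000, window_ms=25, step_ms=10,
             num_channels=40, preemphasis=0.97):
    """audio: 1-D NumPy array of samples. Returns [num_frames, num_channels]."""
    # Optional pre-emphasis filter: y[t] = x[t] - a * x[t-1].
    audio = np.append(audio[0], audio[1:] - preemphasis * audio[:-1])

    # Slice into overlapping frames and apply a window function to each.
    win = int(sample_rate * window_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    num_frames = 1 + (len(audio) - win) // step
    frames = np.stack([audio[i * step:i * step + win]
                       for i in range(num_frames)])
    frames *= np.hanning(win)

    # Short-Time Fourier Transform and per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Filter banks: triangular filters over the power spectrum
    # (mel spacing is the usual choice; uniform spacing here for brevity).
    bins = np.linspace(0, power.shape[1] - 1, num_channels + 2).astype(int)
    bank = np.zeros((num_channels, power.shape[1]))
    for c in range(num_channels):
        l, m, r = bins[c], bins[c + 1], bins[c + 2]
        bank[c, l:m] = np.linspace(0, 1, m - l, endpoint=False)
        bank[c, m:r] = np.linspace(1, 0, r - m, endpoint=False)
    return power @ bank.T
```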
## Operations
Here we provide implementations of both a TensorFlow and a TensorFlow Lite operation that encapsulate the functionality of the audio frontend.

Both frontend Ops receive audio data and produce as many unstacked frames (filterbanks) as the amount of audio passed in allows, according to the configuration.
The processing uses a lightweight library to perform:
- A slicing window function
- Short-time FFTs
- Filterbank calculations
- Noise reduction
- Auto Gain Control
- Logarithmic scaling
Please refer to the Op's documentation for details on the different configuration parameters.
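As an illustration, a minimal sketch of invoking the frontend through its Python wrapper follows. The module path and parameter names below reflect the accompanying Python bindings as we understand them; exact names and defaults should be taken from the Op's documentation.

```python
import tensorflow as tf
from tensorflow.lite.experimental.microfrontend.python.ops import (
    audio_microfrontend_op as frontend_op)

# One second of silence at 16kHz; the Op consumes 16-bit PCM samples.
audio = tf.zeros([16000], dtype=tf.int16)

# Produce one filterbank row per complete 25ms window, stepping by 10ms.
filterbanks = frontend_op.audio_microfrontend(
    audio,
    sample_rate=16000,
    window_size=25,   # ms
    window_step=10,   # ms
    num_channels=40)
# filterbanks: rank-2 Tensor of shape [frames, num_channels]
```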
However, it is important to clarify the contract of the Ops:
A frontend Op will produce as many unstacked frames as possible with the given audio input.
This means:
- The output is a rank-2 Tensor, where each row corresponds to the sequence/time dimension, and each column to the feature dimension.
- It is expected that the Op receives the right input (in terms of its position in the audio stream, and its amount), as needed to produce the expected output.
- Thus, any logic to slice, cache, or otherwise rearrange the input and/or output of the operation must be handled externally in the graph.
For example, a 200ms audio input will produce an output tensor of shape `[18, num_channels]` when configured with `window_size=25ms` and `window_step=10ms`. The reason is that upon reaching 180ms into the audio, there is not enough audio left to construct a complete window.
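This frame count follows directly from the window arithmetic:

```python
# Number of complete 25ms windows in 200ms of audio, stepping by 10ms.
audio_ms, window_size, window_step = 200, 25, 10
frames = 1 + (audio_ms - window_size) // window_step  # 1 + 175 // 10 = 18
```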
For both functional and efficiency reasons, we provide the following functionality related to input processing:
1. Padding. A boolean flag `zero_padding` that indicates whether to pad the audio with zeros such that we generate output frames based on `window_step`. This means that, in the example above, we would generate a tensor of shape `[20, num_channels]` by adding enough zeros so that we step over all the available audio while still creating complete windows (parts of some windows will simply contain zeros; in the example above, frames 19 and 20 will contain the equivalent of 5ms and 15ms of zeros respectively). See the shape arithmetic sketched after this list.
2. Striding. An integer `frame_stride` that indicates the striding step used to generate the output tensor, thus determining the second dimension. In the example above, with `frame_stride=3`, the output tensor would have a shape of `[6, 120]` when `zero_padding` is set to false, and `[7, 120]` when `zero_padding` is set to true.
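The following sketch reproduces that shape arithmetic from the frame counts derived above. Treating the stride as stacking `frame_stride` frames of `num_channels=40` features into each output row is an assumption inferred from the quoted `[6, 120]` and `[7, 120]` shapes, not a documented configuration.

```python
import math

# Frame counts from the 200ms example: 18 without padding, 20 with padding.
frames, padded_frames = 18, 20

# Striding yields fewer output rows; each row is assumed to stack
# frame_stride frames of num_channels features (num_channels=40 assumed).
frame_stride, num_channels = 3, 40
rows = math.ceil(frames / frame_stride)                # 6  (zero_padding=false)
rows_padded = math.ceil(padded_frames / frame_stride)  # 7  (zero_padding=true)
features = frame_stride * num_channels                 # 120
```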