Adds missing docs.

PiperOrigin-RevId: 300405855
Change-Id: Iaece469885db17d9a59627f790299358b57bc7d1
Raziel Alvarez 2020-03-11 14:21:57 -07:00 committed by TensorFlower Gardener
parent 604ebadb89
commit 819f1bd4e1
4 changed files with 141 additions and 9 deletions


@ -0,0 +1,75 @@
# Audio "frontend" TensorFlow operations for feature generation
Feature generation (also called the frontend) is the most common module in
audio processing pipelines. It receives raw audio input, and produces filter
banks (a vector of values).

More specifically, the audio signal optionally goes through a pre-emphasis
filter; it is then sliced into (overlapping) frames and a window function is
applied to each frame; afterwards, we take a Fourier transform of each frame
(more specifically a Short-Time Fourier Transform) and calculate the power
spectrum; and finally we compute the filter banks.
## Operations
Here we provide implementations of both TensorFlow and TensorFlow Lite
operations that encapsulate the functionality of the audio frontend.
Both frontend Ops receive audio data and produce as many unstacked frames
(filter banks) as the input audio allows, according to the configuration.
The processing uses a lightweight library to perform:
1. A slicing window function
2. Short-time FFTs
3. Filterbank calculations
4. Noise reduction
5. Auto Gain Control
6. Logarithmic scaling
Please refer to the Op's documentation for details on the different
configuration parameters.
However, it is important to clarify the contract of the Ops:
> *A frontend OP will produce as many unstacked frames as possible with the
> given audio input.*
This means:
1. The output is a rank-2 Tensor, where each row corresponds to the
sequence/time dimension, and each column to the feature dimension.
2. It is expected that the Op will receive the right input (in terms of
positioning in the audio stream, and the amount), as needed to produce the
expected output.
3. Thus, any logic to slice, cache, or otherwise rearrange the input and/or
output of the operation must be handled externally in the graph.
For example, a 200ms audio input will produce an output tensor of shape
`[18, num_channels]` when configured with a `window_size=25ms` and
`window_step=10ms`. The reason is that beyond the 180ms point in the audio
there's not enough audio left to construct a complete window.
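The frame arithmetic above can be sketched as a small helper
(`NumCompleteFrames` is illustrative only, not part of the Op's API):

```c++
// Number of complete windows that fit in the audio without padding:
// windows start every `window_step_ms` and must end within the audio.
static int NumCompleteFrames(int audio_ms, int window_size_ms,
                             int window_step_ms) {
  if (audio_ms < window_size_ms) return 0;
  return 1 + (audio_ms - window_size_ms) / window_step_ms;
}
```

With `audio_ms=200`, `window_size_ms=25`, and `window_step_ms=10`, this
yields the 18 frames quoted above.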
Due to both functional and efficiency reasons, we provide the following
functionality related to input processing:
**Padding.** A boolean flag `zero_padding` that indicates whether to pad the
audio with zeros so that output frames are generated based on the
`window_step` alone. In the example above, this would produce a tensor of
shape `[20, num_channels]` by adding enough zeros to step over all the
available audio while still creating complete windows of audio (some windows
will be partly zeros; in the example above, frames 19 and 20 will contain the
equivalent of 5ms and 15ms of zeros, respectively).
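The padded counts can be sketched the same way (again illustrative helpers,
not the Op's API):

```c++
// With zero padding we step over all available audio: one frame per window
// start that falls inside the audio, i.e. ceil(audio / step).
static int NumPaddedFrames(int audio_ms, int window_step_ms) {
  return (audio_ms + window_step_ms - 1) / window_step_ms;
}

// Milliseconds of zeros needed to complete the window starting at start_ms.
static int ZeroTailMs(int audio_ms, int start_ms, int window_size_ms) {
  int end_ms = start_ms + window_size_ms;
  return end_ms > audio_ms ? end_ms - audio_ms : 0;
}
```

For the example above, `NumPaddedFrames(200, 10)` gives 20 frames, and the
windows starting at 180ms and 190ms need 5ms and 15ms of zeros, respectively.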
<!-- TODO
Stacking. An integer that indicates how many contiguous frames to stack in the output tensor's first dimension, such that the tensor is shaped [-1, stack_size * num_channels]. For example, if the stack_size is 3, the example above would produce an output tensor shaped [18, 120] if padding is false, and [20, 120] if padding is set to true.
-->
**Striding.** An integer `frame_stride` that indicates the striding step used to
generate the output tensor, thus determining the second dimension. In the
example above, with a `frame_stride=3`, the output tensor would have a shape of
`[6, 120]` when `zero_padding` is set to false, and `[7, 120]` when
`zero_padding` is set to true.
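The shapes quoted above are consistent with one output row per `frame_stride`
input frames, each row holding `frame_stride` stacked frames of
`num_channels` features (here assuming `num_channels=40`, which is only
illustrative):

```c++
// Rows: one output row per stride step over the input frames, i.e.
// ceil(num_frames / frame_stride).
static int StridedRows(int num_frames, int frame_stride) {
  return (num_frames + frame_stride - 1) / frame_stride;
}

// Columns: frame_stride stacked frames of num_channels features each.
static int StridedCols(int frame_stride, int num_channels) {
  return frame_stride * num_channels;
}
```

With 18 unpadded frames this gives `[6, 120]`, and with 20 padded frames
`[7, 120]`, matching the example.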
<!-- TODO
Note we would not expect the striding step to be larger than the stack_size
(should we enforce that?).
-->


@ -1,9 +0,0 @@
The binary frontend_main shows sample usage of the frontend, printing out
coefficients when it has processed enough data.
The binary frontend_memmap_main shows a sample usage of how to avoid all the
init code in your runtime, by first running "frontend_generate_memmap" to
create a header/source file that uses a baked in frontend state. This command
could be automated as part of your build process, or you can just use the output
directly.


@ -0,0 +1,65 @@
# Audio "frontend" library for feature generation
A feature generation library (also called frontend) that receives raw audio
input, and produces filter banks (a vector of values).

The raw audio input is expected to be 16-bit PCM samples, with a configurable
sample rate. More specifically, the audio signal optionally goes through a
pre-emphasis filter; it is then sliced into (potentially overlapping) frames
and a window function is applied to each frame; afterwards, we take a Fourier
transform of each frame (more specifically a Short-Time Fourier Transform)
and calculate the power spectrum; and finally we compute the filter banks.
By default the library is configured with a set of defaults for the
different processing tasks, applied via the following function from
frontend_util.c:
```c++
void FrontendFillConfigWithDefaults(struct FrontendConfig* config)
```
A single invocation looks like:
```c++
struct FrontendConfig frontend_config;
FrontendFillConfigWithDefaults(&frontend_config);

int sample_rate = 16000;
struct FrontendState frontend_state;
FrontendPopulateState(&frontend_config, &frontend_state, sample_rate);

int16_t* audio_data = ...;  // PCM audio samples at 16KHz.
size_t audio_size = ...;    // Number of audio samples.
size_t num_samples_read;    // How many samples were processed.
struct FrontendOutput output = FrontendProcessSamples(
    &frontend_state, audio_data, audio_size, &num_samples_read);

for (size_t i = 0; i < output.size; ++i) {
  printf("%d ", output.values[i]);  // Print the feature vector.
}
```
Note that in the above example the frontend consumes as many samples as it
needs from the audio data to produce a single feature vector (according to
the frontend configuration). If not enough samples are available to generate
a feature vector, the returned size will be 0 and the values pointer will be
`NULL`.
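Because of this, callers typically feed audio in a loop, advancing by
`num_samples_read` after each call. The sketch below demonstrates only the
loop pattern: the structs are stand-ins and `StubProcessSamples` replaces the
real `FrontendProcessSamples` (it consumes one 10ms step of 16KHz audio per
call, which is not the real library's behavior):

```c++
#include <stddef.h>
#include <stdint.h>

// Stand-ins for the real frontend types, just to keep this sketch
// self-contained; the real definitions live in the frontend library headers.
struct FrontendOutput {
  const uint16_t* values;
  size_t size;
};
struct FrontendState {
  size_t samples_per_step;
};

// Stub with the same calling shape as FrontendProcessSamples(): consumes one
// window step's worth of samples per call and reports how many were read.
static struct FrontendOutput StubProcessSamples(struct FrontendState* state,
                                                const int16_t* audio,
                                                size_t num_samples,
                                                size_t* num_samples_read) {
  static const uint16_t kFakeFeatures[40] = {0};
  struct FrontendOutput output = {NULL, 0};
  if (num_samples >= state->samples_per_step) {
    *num_samples_read = state->samples_per_step;
    output.values = kFakeFeatures;
    output.size = 40;
  } else {
    *num_samples_read = num_samples;  // Remainder buffered for a later call.
  }
  return output;
}

// Streams `audio_size` samples through the stub and counts emitted frames.
int FramesProduced(size_t audio_size) {
  static int16_t audio[4000] = {0};
  struct FrontendState state = {160};  // 10ms step at a 16KHz sample rate.
  const int16_t* data = audio;
  int frames = 0;
  while (audio_size > 0) {
    size_t num_samples_read = 0;
    struct FrontendOutput output =
        StubProcessSamples(&state, data, audio_size, &num_samples_read);
    if (num_samples_read == 0) break;  // Nothing consumed; avoid spinning.
    data += num_samples_read;
    audio_size -= num_samples_read;
    if (output.values != NULL) ++frames;  // One complete feature vector.
  }
  return frames;
}
```

Against the real library, the same loop shape applies: advance the pointer by
`num_samples_read` and stop (or wait for more audio) when `values` is `NULL`.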
An example of how to use the frontend is provided in frontend_main.cc and its
binary frontend_main. This example expects a path to a file containing `int16`
PCM samples at a sample rate of 16KHz, and upon execution will print out the
coefficients according to the frontend default configuration.
## Extra features
Extra features of this frontend library include a noise reduction module, as
well as a gain control module.
**Noise cancellation.** Removes stationary noise from each channel of the
signal using a low pass filter.

**Gain control.** A dynamic compression based on a novel automatic gain
control, replacing the widely used static compression (such as log or root).
Disabled by default.
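Assuming the config struct mirrors the state structs in this library's
headers (`pcan_gain_control.enable_pcan` below is an assumption inferred from
the `PcanGainControlState` struct; verify the field names against frontend.h
in your checkout), enabling the gain control after filling defaults might
look like:

```c++
struct FrontendConfig config;
FrontendFillConfigWithDefaults(&config);
// Assumed field names; check struct FrontendConfig in frontend.h.
config.pcan_gain_control.enable_pcan = 1;  // Turn on the (default-off) AGC.
```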
## Memory map
The binary frontend_memmap_main shows a sample usage of how to avoid all the
initialization code in your application, by first running
"frontend_generate_memmap" to create a header/source file that uses a baked in
frontend state. This command could be automated as part of your build process,
or you can just use the output directly.


@ -25,6 +25,7 @@ limitations under the License.
extern "C" {
#endif
// Details at https://research.google/pubs/pub45911.pdf
struct PcanGainControlState {
int enable_pcan;
uint32_t* noise_estimate;