Merge pull request #2558 from mozilla/readthedocs
Updated old docs in prep for 0.6.0
Commit 0ab348878c
Introduction
============

The aim of this project is to create a simple, open, and ubiquitous speech
recognition engine. Simple, in that the engine should not require server-class
hardware to execute. Open, in that the code and models are released under the
Mozilla Public License. Ubiquitous, in that the engine should run on many
platforms and have bindings to many different languages.

The architecture of the engine was originally motivated by that presented in
`Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_.
However, the engine currently differs in many respects from the engine it was
originally motivated by. The core of the engine is a recurrent neural network (RNN)
trained to ingest speech spectrograms and generate English text transcriptions.

Let a single utterance :math:`x` and label :math:`y` be sampled from a training set

.. math::

    S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}.

Each utterance, :math:`x^{(i)}`, is a time-series of length :math:`T^{(i)}`
where every time-slice is a vector of audio features,
:math:`x^{(i)}_t` where :math:`t=1,\ldots,T^{(i)}`.
We use MFCCs as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
in the audio frame at time :math:`t`. The goal of our RNN is to convert an input
sequence :math:`x` into a sequence of character probabilities for the transcription
:math:`y`, with :math:`\hat{y}_t = \mathbb{P}(c_t \mid x)`,
where for English :math:`c_t \in \{a, b, c, \ldots, z, space, apostrophe, blank\}`.
(The significance of :math:`blank` will be explained below.)

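For concreteness, here is a minimal sketch of how such a feature matrix might be
computed with the ``python_speech_features`` package (the file name and parameter
choices here are illustrative assumptions, not the project's exact pipeline):

.. code:: python

    import scipy.io.wavfile as wav
    from python_speech_features import mfcc

    # Read an utterance and compute 26 MFCC features per time-slice.
    sample_rate, audio = wav.read("utterance.wav")  # hypothetical input file
    features = mfcc(audio, samplerate=sample_rate, numcep=26)
    # features[t, p] corresponds to x^{(i)}_{t,p}: the p-th MFCC feature
    # of the frame at time t; features.shape[0] is the utterance length T^{(i)}.
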
Our RNN model is composed of :math:`5` layers of hidden units.
For an input :math:`x`, the hidden units at layer :math:`l` are denoted :math:`h^{(l)}` with the
convention that :math:`h^{(0)}` is the input. The first three layers are not recurrent.
For the first layer, at each time :math:`t`, the output depends on the MFCC frame
:math:`x_t` along with a context of :math:`C` frames on each side.
(We use :math:`C = 9` for our experiments.)
The remaining non-recurrent layers operate on independent data for each time step.
Thus, for each time :math:`t`, the first :math:`3` layers are computed by:

.. math::

    h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})

where :math:`g(z) = \min\{\max\{0, z\}, 20\}` is a clipped rectified-linear (ReLU)
activation function and :math:`W^{(l)}`, :math:`b^{(l)}` are the weight matrix and bias
parameters for layer :math:`l`.
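
As a small concrete illustration, one such non-recurrent layer can be computed
as follows (a NumPy sketch with arbitrarily chosen layer sizes, not the
project's implementation):

.. code:: python

    import numpy as np

    def g(z, cap=20.0):
        """Clipped rectified-linear activation: min(max(0, z), 20)."""
        return np.minimum(np.maximum(0.0, z), cap)

    # h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)}) for one time-slice t
    n_prev, n_units = 494, 2048                  # illustrative sizes only
    W = np.random.randn(n_units, n_prev) * 0.01  # W^{(l)}
    b = np.zeros(n_units)                        # b^{(l)}
    h_prev_t = np.random.randn(n_prev)           # h^{(l-1)}_t
    h_t = g(W @ h_prev_t + b)                    # h^{(l)}_t
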
The fourth layer is a recurrent
layer `[1] <https://en.wikipedia.org/wiki/Recurrent_neural_network>`_.
This layer includes a set of hidden units with forward recurrence,
:math:`h^{(f)}`:

.. math::

    h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})

Note that :math:`h^{(f)}` must be computed sequentially from :math:`t = 1` to :math:`t = T^{(i)}`
for the :math:`i`-th utterance.

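To make the sequential dependence concrete, here is a NumPy sketch of the
forward recurrence (shapes are again arbitrary, illustrative choices):

.. code:: python

    import numpy as np

    g = lambda z: np.minimum(np.maximum(0.0, z), 20.0)  # clipped ReLU, as above

    T, n3, n4 = 50, 2048, 2048                   # illustrative sizes
    W4 = np.random.randn(n4, n3) * 0.01          # W^{(4)}
    Wf_r = np.random.randn(n4, n4) * 0.01        # W^{(f)}_r
    b4 = np.zeros(n4)                            # b^{(4)}
    h3 = np.random.randn(T, n3)                  # h^{(3)}_t for all t
    h_f = np.zeros((T, n4))
    prev = np.zeros(n4)                          # initial state h^{(f)}_0
    for t in range(T):                           # must run in order:
        prev = g(W4 @ h3[t] + Wf_r @ prev + b4)  # h_f[t] depends on h_f[t-1]
        h_f[t] = prev
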
The fifth (non-recurrent) layer takes the forward units as inputs

.. math::

    h^{(5)} = g(W^{(5)} h^{(f)} + b^{(5)}).

The output layer consists of standard logits that correspond to the predicted
character probabilities for each time slice :math:`t` and character :math:`k`
in the alphabet:

.. math::

    h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k

where :math:`b^{(6)}_k` denotes the :math:`k`-th bias and :math:`(W^{(6)} h^{(5)}_t)_k` the :math:`k`-th
element of the matrix product.

Once we have computed a prediction for :math:`\hat{y}_{t,k}`, we compute the CTC loss
`[2] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ :math:`\cal{L}(\hat{y}, y)`
to measure the error in prediction. (The CTC loss requires the :math:`blank` above
to indicate transitions between characters.) During training, we can evaluate the gradient
:math:`\nabla \cal{L}(\hat{y}, y)` with respect to the network outputs given the
ground-truth character sequence :math:`y`. From this point, computing the gradient
with respect to all of the model parameters may be done via back-propagation
through the rest of the network. We use the Adam method for training
`[3] <http://arxiv.org/abs/1412.6980>`_.

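For illustration only, the CTC loss for a batch of logits can be computed along
these lines (this sketch assumes TensorFlow 2's ``tf.nn.ctc_loss`` and made-up
shapes; it is not the project's actual training code):

.. code:: python

    import tensorflow as tf

    max_time, batch_size, n_classes = 50, 1, 29  # 26 letters + space + apostrophe + blank
    logits = tf.random.normal([max_time, batch_size, n_classes])  # time-major \hat{y}
    labels = tf.constant([[7, 4, 11, 11, 14]])   # "hello" as 0-indexed letters
    loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=tf.constant([5]),
        logit_length=tf.constant([max_time]),
        logits_time_major=True,
        blank_index=n_classes - 1,               # blank is the last class here
    )
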
The complete RNN model is illustrated in the figure below.

.. image:: ../images/rnn_fig-624x598.png
    :alt: DeepSpeech RNN

Geometric Constants
===================

This section describes several constants related to the geometry of the network.

n_input
-------
Each time-slice of a speech sample is represented by a vector of MFCC
features. We will make the number of MFCC features
dependent upon the sample rate of the data set. Generically, if the sample rate
is 8kHz we use 13 features. If the sample rate is 16kHz we use 26 features.
We capture the dimension of these vectors, equivalently the number of MFCC
features, in the variable ``n_input``. By default ``n_input`` is 26.

n_context
---------
As previously mentioned, the RNN is not simply fed the MFCC features of a given
time-slice. It is fed, in addition, a context of :math:`C` frames on
either side of the frame in question. The number of frames in this context is
captured in the variable ``n_context``. By default ``n_context`` is 9.

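As a rough sketch of this windowing (NumPy, with zero-padding at the utterance
boundaries; the real input pipeline may differ), the first layer's input can be
assembled as follows:

.. code:: python

    import numpy as np

    n_input, n_context, T = 26, 9, 100            # defaults described above
    features = np.zeros((T, n_input))             # placeholder MFCC matrix
    padded = np.pad(features, ((n_context, n_context), (0, 0)))
    width = 2 * n_context + 1                     # 19 frames per window
    windows = np.stack([padded[t:t + width].reshape(-1) for t in range(T)])
    assert windows.shape == (T, width * n_input)  # (100, 494)
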
Next we will introduce constants that specify the geometry of some of the
non-recurrent layers of the network. We do this by simply specifying the number
of units in each.

n_hidden_1, n_hidden_2, n_hidden_5
----------------------------------

``n_hidden_1`` is the number of units in the first layer, ``n_hidden_2`` the number
of units in the second, and ``n_hidden_5`` the number in the fifth. We haven't
forgotten about the third or sixth layer. We will define their unit count below.

The recurrent layer consists of an LSTM RNN that works "forward in time":

.. image:: ../images/LSTM3-chain.png
    :alt: Image shows a diagram of a recurrent neural network with LSTM cells, with arrows depicting the flow of data from earlier time steps to later timesteps within the RNN.

The dimension of the cell state, the upper line connecting subsequent LSTM units,
is independent of the input dimension.

n_cell_dim
----------
We capture the dimension of the cell state in the variable ``n_cell_dim``.

n_hidden_3
----------
The number of units in the third layer, which feeds in to the LSTM, is
determined by ``n_cell_dim`` as follows

.. code:: python

    n_hidden_3 = n_cell_dim

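For reference, a forward-in-time LSTM layer of this size could be instantiated
like so (a Keras-based sketch under assumed shapes, not necessarily how the
project builds its graph):

.. code:: python

    import tensorflow as tf

    n_cell_dim = 2048                            # illustrative value
    # One LSTM RNN working "forward in time" over the utterance
    lstm = tf.keras.layers.LSTM(units=n_cell_dim, return_sequences=True)
    h3 = tf.random.normal([1, 50, n_cell_dim])   # [batch, time, n_hidden_3]
    h_f = lstm(h3)                               # shape [1, 50, n_cell_dim]
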
n_hidden_6
----------
The variable ``n_hidden_6`` will hold the number of characters in the target
language plus one, for the :math:`blank`.
For English it is the cardinality of the set

.. math::

    \{a, b, c, \ldots, z, space, apostrophe, blank\}

we referred to earlier.

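Concretely, for English:

.. code:: python

    # 26 letters + space + apostrophe + blank
    n_hidden_6 = 26 + 1 + 1 + 1   # = 29
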