Merge pull request #2558 from mozilla/readthedocs

Updated old docs in prep for 0.6.0
Kelly Davis 2019-12-02 15:02:55 +01:00 committed by GitHub
commit 0ab348878c
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 36 additions and 55 deletions

@@ -1,9 +1,16 @@
 Introduction
 ============

-In this project we will reproduce the results of
+The aim of this project is to create a simple, open, and ubiquitous speech
+recognition engine. Simple, in that the engine should not require server-class
+hardware to execute. Open, in that the code and models are released under the
+Mozilla Public License. Ubiquitous, in that the engine should run on many
+platforms and have bindings to many different languages.
+
+The architecture of the engine was originally motivated by that presented in
 `Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_.
-The core of the system is a bidirectional recurrent neural network (BRNN)
+However, the engine currently differs in many respects from the engine it was
+originally motivated by. The core of the engine is a recurrent neural network (RNN)
 trained to ingest speech spectrograms and generate English text transcriptions.

 Let a single utterance :math:`x` and label :math:`y` be sampled from a training set
@@ -14,19 +21,19 @@ Let a single utterance :math:`x` and label :math:`y` be sampled from a training
 Each utterance, :math:`x^{(i)}` is a time-series of length :math:`T^{(i)}`
 where every time-slice is a vector of audio features,
 :math:`x^{(i)}_t` where :math:`t=1,\ldots,T^{(i)}`.
-We use MFCC as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
-in the audio frame at time :math:`t`. The goal of our BRNN is to convert an input
+We use MFCC's as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
+in the audio frame at time :math:`t`. The goal of our RNN is to convert an input
 sequence :math:`x` into a sequence of character probabilities for the transcription
 :math:`y`, with :math:`\hat{y}_t =\mathbb{P}(c_t \mid x)`,
-where :math:`c_t \in \{a,b,c, . . . , z, space, apostrophe, blank\}`.
+where for English :math:`c_t \in \{a,b,c, . . . , z, space, apostrophe, blank\}`.
 (The significance of :math:`blank` will be explained below.)

-Our BRNN model is composed of :math:`5` layers of hidden units.
+Our RNN model is composed of :math:`5` layers of hidden units.
 For an input :math:`x`, the hidden units at layer :math:`l` are denoted :math:`h^{(l)}` with the
 convention that :math:`h^{(0)}` is the input. The first three layers are not recurrent.
 For the first layer, at each time :math:`t`, the output depends on the MFCC frame
 :math:`x_t` along with a context of :math:`C` frames on each side.
-(We typically use :math:`C \in \{5, 7, 9\}` for our experiments.)
+(We use :math:`C = 9` for our experiments.)
 The remaining non-recurrent layers operate on independent data for each time step.
 Thus, for each time :math:`t`, the first :math:`3` layers are computed by:
@@ -35,28 +42,24 @@ Thus, for each time :math:`t`, the first :math:`3` layers are computed by:

 where :math:`g(z) = \min\{\max\{0, z\}, 20\}` is a clipped rectified-linear (ReLu)
 activation function and :math:`W^{(l)}`, :math:`b^{(l)}` are the weight matrix and bias
-parameters for layer :math:`l`. The fourth layer is a bidirectional recurrent
-layer `[1] <http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf>`_.
-This layer includes two sets of hidden units: a set with forward recurrence,
-:math:`h^{(f)}`, and a set with backward recurrence :math:`h^{(b)}`:
+parameters for layer :math:`l`. The fourth layer is a recurrent
+layer `[1] <https://en.wikipedia.org/wiki/Recurrent_neural_network>`_.
+This layer includes a set of hidden units with forward recurrence,
+:math:`h^{(f)}`:

 .. math::
     h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})
-
-    h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})

 Note that :math:`h^{(f)}` must be computed sequentially from :math:`t = 1` to :math:`t = T^{(i)}`
-for the :math:`i`-th utterance, while the units :math:`h^{(b)}` must be computed
-sequentially in reverse from :math:`t = T^{(i)}` to :math:`t = 1`.
+for the :math:`i`-th utterance.

-The fifth (non-recurrent) layer takes both the forward and backward units as inputs
+The fifth (non-recurrent) layer takes the forward units as inputs

 .. math::
-    h^{(5)} = g(W^{(5)} h^{(4)} + b^{(5)})
+    h^{(5)} = g(W^{(5)} h^{(f)} + b^{(5)}).

-where :math:`h^{(4)} = h^{(f)} + h^{(b)}`. The output layer are standard logits that
-correspond to the predicted character probabilities for each time slice :math:`t` and
-character :math:`k` in the alphabet:
+The output layer is standard logits that correspond to the predicted character probabilities
+for each time slice :math:`t` and character :math:`k` in the alphabet:

 .. math::
     h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k
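
Reviewer note: the forward-only recurrence that this hunk keeps can be sketched in plain Python. This illustrates the formula for ``h^{(f)}_t`` exactly as written in the doc and is not the project's implementation; the geometry document describes the recurrent layer as an LSTM. The helper names ``clipped_relu``, ``matvec`` and ``forward_recurrence``, and the zero initial state, are assumptions made for the example.

```python
def clipped_relu(z, cap=20.0):
    """g(z) = min(max(0, z), 20): the clipped rectified-linear activation."""
    return min(max(0.0, z), cap)


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward_recurrence(h3, W4, Wr, b4):
    """Compute h_f[t] = g(W4 @ h3[t] + Wr @ h_f[t-1] + b4) for t = 1..T.

    h3 is a list of T layer-3 output vectors. The initial recurrent state
    is assumed to be all zeros (an assumption; the doc does not state it).
    """
    h_prev = [0.0] * len(b4)
    outputs = []
    for h3_t in h3:  # must run in order: each step needs the previous state
        a = matvec(W4, h3_t)
        r = matvec(Wr, h_prev)
        h_prev = [clipped_relu(ai + ri + bi) for ai, ri, bi in zip(a, r, b4)]
        outputs.append(h_prev)
    return outputs


# Tiny one-unit example: the third step saturates at the clip value of 20.
states = forward_recurrence(
    h3=[[1.0], [1.0], [30.0]], W4=[[1.0]], Wr=[[0.5]], b4=[0.0]
)
```

With the backward units gone, each state depends only on earlier time steps, so the loop can run while audio is still arriving.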
@@ -66,14 +69,15 @@ element of the matrix product.

 Once we have computed a prediction for :math:`\hat{y}_{t,k}`, we compute the CTC loss
 `[2] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ :math:`\cal{L}(\hat{y}, y)`
-to measure the error in prediction. During training, we can evaluate the gradient
+to measure the error in prediction. (The CTC loss requires the :math:`blank` above
+to indicate transitions between characters.) During training, we can evaluate the gradient
 :math:`\nabla \cal{L}(\hat{y}, y)` with respect to the network outputs given the
 ground-truth character sequence :math:`y`. From this point, computing the gradient
 with respect to all of the model parameters may be done via back-propagation
 through the rest of the network. We use the Adam method for training
 `[3] <http://arxiv.org/abs/1412.6980>`_.

-The complete BRNN model is illustrated in the figure below.
+The complete RNN model is illustrated in the figure below.

-.. image:: ../images/rnn_fig-624x548.png
+.. image:: ../images/rnn_fig-624x598.png
     :alt: DeepSpeech BRNN
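
Reviewer note: the new parenthetical about :math:`blank` is easier to follow with a concrete collapse step. The sketch below shows the standard CTC mapping from a per-frame label path to a transcription: merge consecutive repeats, then drop blanks. It illustrates only the role of the blank symbol and is not the decoder this project uses. The ``BLANK`` placeholder character and the function name ``ctc_collapse`` are hypothetical.

```python
from itertools import groupby

BLANK = "_"  # hypothetical placeholder for the CTC blank symbol


def ctc_collapse(path):
    """Map a per-frame label path to text: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


# The blank between the two "l" runs is what keeps the double letter.
text = ctc_collapse("hh_e_ll_lo")
```

Without a blank between them, two consecutive identical characters would merge into one, which is why the alphabet needs the extra symbol.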

@@ -3,13 +3,6 @@ Geometric Constants

 This is about several constants related to the geometry of the network.

-n_steps
--------
-The network views each speech sample as a sequence of time-slices :math:`x^{(i)}_t` of
-length :math:`T^{(i)}`. As the speech samples vary in length, we know that :math:`T^{(i)}`
-need not equal :math:`T^{(j)}` for :math:`i \ne j`. For each batch, BRNN in TensorFlow needs
-to know ``n_steps`` which is the maximum :math:`T^{(i)}` for the batch.
-
 n_input
 -------
 Each of the at maximum ``n_steps`` vectors is a vector of MFCC features of a
@@ -17,14 +10,14 @@ time-slice of the speech sample. We will make the number of MFCC features
 dependent upon the sample rate of the data set. Generically, if the sample rate
 is 8kHz we use 13 features. If the sample rate is 16kHz we use 26 features...
 We capture the dimension of these vectors, equivalently the number of MFCC
-features, in the variable ``n_input``.
+features, in the variable ``n_input``. By default ``n_input`` is 26.

 n_context
 ---------
-As previously mentioned, the BRNN is not simply fed the MFCC features of a given
-time-slice. It is fed, in addition, a context of :math:`C \in \{5, 7, 9\}` frames on
+As previously mentioned, the RNN is not simply fed the MFCC features of a given
+time-slice. It is fed, in addition, a context of :math:`C` frames on
 either side of the frame in question. The number of frames in this context is
-captured in the variable ``n_context``.
+captured in the variable ``n_context``. By default ``n_context`` is 9.

 Next we will introduce constants that specify the geometry of some of the
 non-recurrent layers of the network. We do this by simply specifying the number
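
Reviewer note: with the new defaults (``n_input`` = 26, ``n_context`` = 9), each time step fed to the first layer spans 2 * 9 + 1 = 19 frames, or 19 * 26 = 494 values. The sketch below builds such windows in plain Python. Zero-padding at the utterance edges and the function name ``context_windows`` are assumptions made for illustration; the docs in this diff do not say how edge frames are handled.

```python
def context_windows(frames, n_context):
    """Return, for each frame, that frame plus n_context frames on each side,
    flattened into one vector. Edges are zero-padded (an assumption)."""
    n_input = len(frames[0])
    pad = [0.0] * n_input
    padded = [pad] * n_context + list(frames) + [pad] * n_context
    windows = []
    for t in range(len(frames)):
        window = padded[t : t + 2 * n_context + 1]
        windows.append([value for frame in window for value in frame])
    return windows


# Small case that is easy to check by hand.
small = context_windows([[1.0], [2.0], [3.0]], n_context=1)

# Default geometry from the docs: 26 features, 9 context frames per side.
n_input, n_context = 26, 9
utterance = [[0.0] * n_input for _ in range(50)]
first_layer_width = len(context_windows(utterance, n_context)[0])
```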
@@ -36,20 +29,13 @@ n_hidden_1, n_hidden_2, n_hidden_5
 of units in the second, and ``n_hidden_5`` the number in the fifth. We haven't
 forgotten about the third or sixth layer. We will define their unit count below.

-A LSTM BRNN consists of a pair of LSTM RNN's.
-One LSTM RNN that works "forward in time":
+The RNN consists of an LSTM RNN that works "forward in time":

 .. image:: ../images/LSTM3-chain.png
     :alt: Image shows a diagram of a recurrent neural network with LSTM cells, with arrows depicting the flow of data from earlier time steps to later timesteps within the RNN.

-and a second LSTM RNN that works "backwards in time":
-
-.. image:: ../images/LSTM3-chain-backwards.png
-    :alt: Image shows a diagram of a recurrent neural network with LSTM cells, this time with data flowing from later time steps to earlier timesteps within the RNN.
-
 The dimension of the cell state, the upper line connecting subsequent LSTM units,
-is independent of the input dimension and the same for both the forward and
-backward LSTM RNN.
+is independent of the input dimension.

 n_cell_dim
 ----------
@@ -63,11 +49,11 @@ determined by ``n_cell_dim`` as follows

 .. code:: python

-    n_hidden_3 = 2 * n_cell_dim
+    n_hidden_3 = n_cell_dim

-n_character
+n_hidden_6
 -----------
-The variable ``n_character`` will hold the number of characters in the target
+The variable ``n_hidden_6`` will hold the number of characters in the target
 language plus one, for the :math:`blank`.

 For English it is the cardinality of the set
@@ -75,12 +61,3 @@ For English it is the cardinality of the set
     \{a,b,c, . . . , z, space, apostrophe, blank\}

 we referred to earlier.
-
-n_hidden_6
-----------
-The number of units in the sixth layer is determined by ``n_character`` as follows:
-
-.. code:: python
-
-    n_hidden_6 = n_character
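
Reviewer note: after this change ``n_hidden_6`` directly holds the output-layer width, so the English value can be checked by counting the set quoted in the doc. The variable ``alphabet`` below is a hypothetical name used only for this check.

```python
import string

# a..z, space and apostrophe, as listed in the doc for English.
alphabet = list(string.ascii_lowercase) + [" ", "'"]

# One extra output unit for the CTC blank.
n_hidden_6 = len(alphabet) + 1
```

For English this gives 29 output units: 26 letters, space, apostrophe and blank.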

Binary file not shown (before: 221 KiB).

Binary file not shown (before: 182 KiB).

BIN
images/rnn_fig-624x598.png Normal file

Binary file not shown (after: 117 KiB).