Updating Introduction
This commit is contained in:
parent b6bc46f3fb
commit 7d96540d66
@@ -1,9 +1,16 @@
Introduction
============

The aim of this project is to create a simple, open, and ubiquitous speech
recognition engine. Simple, in that the engine should not require server-class
hardware to execute. Open, in that the code and models are released under the
Mozilla Public License. Ubiquitous, in that the engine should run on many
platforms and have bindings to many different languages.

The architecture of the engine was originally motivated by that presented in
`Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_.
However, the engine currently differs in many respects from the engine described
in that paper. The core of the engine is a recurrent neural network (RNN)
trained to ingest speech spectrograms and generate English text transcriptions.

Let a single utterance :math:`x` and label :math:`y` be sampled from a training set
@@ -14,19 +21,19 @@ Let a single utterance :math:`x` and label :math:`y` be sampled from a training
Each utterance, :math:`x^{(i)}`, is a time-series of length :math:`T^{(i)}`
where every time-slice is a vector of audio features,
:math:`x^{(i)}_t` where :math:`t=1,\ldots,T^{(i)}`.
We use MFCCs as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
in the audio frame at time :math:`t`. The goal of our RNN is to convert an input
sequence :math:`x` into a sequence of character probabilities for the transcription
:math:`y`, with :math:`\hat{y}_t = \mathbb{P}(c_t \mid x)`,
where for English :math:`c_t \in \{a,b,c,\ldots,z, space, apostrophe, blank\}`.
(The significance of :math:`blank` will be explained below.)
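
For concreteness, the feature matrix of a single utterance can be produced with
an off-the-shelf MFCC implementation. The following is a minimal sketch assuming
the ``librosa`` package and an arbitrary choice of 26 coefficients; neither is
prescribed by the engine itself.

.. code-block:: python

    # Sketch: build the feature matrix x[t, p] for one utterance.
    # librosa and n_mfcc=26 are illustrative assumptions, not project settings.
    import librosa

    audio, sample_rate = librosa.load("utterance.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=26)
    x = mfcc.T  # shape (T, 26): x[t, p] is the p-th MFCC feature of frame t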

Our RNN model is composed of :math:`5` layers of hidden units.
For an input :math:`x`, the hidden units at layer :math:`l` are denoted :math:`h^{(l)}` with the
convention that :math:`h^{(0)}` is the input. The first three layers are not recurrent.
For the first layer, at each time :math:`t`, the output depends on the MFCC frame
:math:`x_t` along with a context of :math:`C` frames on each side.
(We use :math:`C = 9` for our experiments.)
The remaining non-recurrent layers operate on independent data for each time step.
Thus, for each time :math:`t`, the first :math:`3` layers are computed by:
@@ -35,28 +42,24 @@ Thus, for each time :math:`t`, the first :math:`3` layers are computed by:
where :math:`g(z) = \min\{\max\{0, z\}, 20\}` is a clipped rectified-linear (ReLU)
activation function and :math:`W^{(l)}`, :math:`b^{(l)}` are the weight matrix and bias
parameters for layer :math:`l`. The fourth layer is a recurrent
layer `[1] <https://en.wikipedia.org/wiki/Recurrent_neural_network>`_.
This layer includes a set of hidden units with forward recurrence,
:math:`h^{(f)}`:

.. math::
    h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})

Note that :math:`h^{(f)}` must be computed sequentially from :math:`t = 1` to :math:`t = T^{(i)}`
for the :math:`i`-th utterance.
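
To make the computation concrete, here is a minimal NumPy sketch of the clipped
ReLU, the context window feeding layer :math:`1`, and the sequential forward
recurrence of layer :math:`4`. All shapes, sizes, and random weights are
illustrative assumptions, not the engine's actual parameters.

.. code-block:: python

    import numpy as np

    def g(z):
        # clipped rectified-linear unit: min(max(0, z), 20)
        return np.minimum(np.maximum(0.0, z), 20.0)

    T, p, C, n = 100, 26, 9, 128        # frames, features, context, units
    x = np.random.randn(T, p)           # feature matrix of one utterance
    xpad = np.pad(x, ((C, C), (0, 0)))  # zero-pad so edge frames have context

    # Layer 1 sees frame t plus C frames of context on each side.
    h0 = np.stack([xpad[t:t + 2 * C + 1].ravel() for t in range(T)])
    W1, b1 = np.random.randn(n, h0.shape[1]), np.zeros(n)
    h1 = g(h0 @ W1.T + b1)              # layers 2 and 3 follow the same pattern

    # Layer 4: forward recurrence, necessarily evaluated in order t = 1..T.
    W4, Wr, b4 = np.random.randn(n, n), np.random.randn(n, n), np.zeros(n)
    hf = np.zeros((T, n))
    for t in range(T):
        prev = hf[t - 1] if t > 0 else np.zeros(n)
        hf[t] = g(W4 @ h1[t] + Wr @ prev + b4)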

The fifth (non-recurrent) layer takes the forward units as input:

.. math::
    h^{(5)} = g(W^{(5)} h^{(f)} + b^{(5)})

The output layer consists of standard logits that correspond to the predicted
character probabilities for each time slice :math:`t` and character :math:`k` in
the alphabet:

.. math::
    h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k
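
Continuing the sketch above (same illustrative shapes), layer :math:`5` and the
output logits reduce to two more matrix products, with a softmax turning the
logits into per-frame character probabilities; the alphabet size of 29 follows
from the character set given above (26 letters plus space, apostrophe, and blank).

.. code-block:: python

    K = 29                               # a-z, space, apostrophe, blank
    W5, b5 = np.random.randn(n, n), np.zeros(n)
    h5 = g(hf @ W5.T + b5)               # fifth layer on the forward units

    W6, b6 = np.random.randn(K, n), np.zeros(K)
    h6 = h5 @ W6.T + b6                  # logits, shape (T, K)

    # Softmax per time slice: probs[t, k] approximates P(c_t = k | x).
    probs = np.exp(h6 - h6.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)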
@@ -66,14 +69,15 @@ element of the matrix product.
Once we have computed a prediction for :math:`\hat{y}_{t,k}`, we compute the CTC loss
`[2] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ :math:`\cal{L}(\hat{y}, y)`
to measure the error in prediction. (The CTC loss requires the :math:`blank` above
to indicate transitions between characters.) During training, we can evaluate the gradient
:math:`\nabla \cal{L}(\hat{y}, y)` with respect to the network outputs given the
ground-truth character sequence :math:`y`. From this point, computing the gradient
with respect to all of the model parameters may be done via back-propagation
through the rest of the network. We use the Adam method for training
`[3] <http://arxiv.org/abs/1412.6980>`_.
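
A minimal sketch of this training step, assuming a TensorFlow 2.x
implementation (the model object, shapes, and hyperparameters here are
illustrative, not the engine's actual configuration):

.. code-block:: python

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

    def train_step(model, features, labels, label_lens, logit_lens):
        # features: (batch, T, num_features); labels: dense (batch, max_len)
        with tf.GradientTape() as tape:
            logits = model(features)              # (batch, T, alphabet_size)
            loss = tf.reduce_mean(tf.nn.ctc_loss(
                labels=labels,
                logits=logits,
                label_length=label_lens,
                logit_length=logit_lens,
                logits_time_major=False,
                blank_index=-1))                  # blank as the last class
        # Gradient of the CTC loss, back-propagated through the network,
        # then applied with Adam.
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss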

The complete RNN model is illustrated in the figure below.

.. image:: ../images/rnn_fig-624x598.png
    :alt: DeepSpeech RNN

BIN images/rnn_fig-624x598.png (new file; binary not shown)