Convert documentation to Sphinx RST

This commit is contained in:
Reuben Morais 2017-02-02 22:57:58 -02:00
parent 0a1d3d49ca
commit 56eee9adaf
4 changed files with 141 additions and 113 deletions

View File

@ -1,70 +1,76 @@
Introduction
============

In this project we will reproduce the results of
`Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_.
The core of the system is a bidirectional recurrent neural network (BRNN)
trained to ingest speech spectrograms and generate English text transcriptions.

Let a single utterance :math:`x` and label :math:`y` be sampled from a training set
:math:`S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}`.
Each utterance :math:`x^{(i)}` is a time-series of length :math:`T^{(i)}`
where every time-slice is a vector of audio features,
:math:`x^{(i)}_t`, where :math:`t = 1, \ldots, T^{(i)}`.
We use MFCCs as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
in the audio frame at time :math:`t`. The goal of our BRNN is to convert an input
sequence :math:`x` into a sequence of character probabilities for the transcription
:math:`y`, with :math:`\hat{y}_t = \mathbb{P}(c_t \mid x)`,
where :math:`c_t \in \{a, b, c, \ldots, z, space, apostrophe, blank\}`.
(The significance of :math:`blank` will be explained below.)

Our BRNN model is composed of :math:`5` layers of hidden units.
For an input :math:`x`, the hidden units at layer :math:`l` are denoted :math:`h^{(l)}`, with the
convention that :math:`h^{(0)}` is the input. The first three layers are not recurrent.
For the first layer, at each time :math:`t`, the output depends on the MFCC frame
:math:`x_t` along with a context of :math:`C` frames on each side.
(We typically use :math:`C \in \{5, 7, 9\}` for our experiments.)
The remaining non-recurrent layers operate on independent data for each time step.
Thus, for each time :math:`t`, the first :math:`3` layers are computed by:

.. math::
    h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})

where :math:`g(z) = \min\{\max\{0, z\}, 20\}` is a clipped rectified-linear (ReLU)
activation function and :math:`W^{(l)}`, :math:`b^{(l)}` are the weight matrix and bias
parameters for layer :math:`l`.
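As a concrete illustration, here is a minimal NumPy sketch of one such non-recurrent layer; the sizes and the names ``W_l``, ``b_l``, and ``h_prev`` are hypothetical placeholders, not variables from the actual implementation.

.. code:: python

    import numpy as np

    def clipped_relu(z, clip=20.0):
        # g(z) = min(max(0, z), 20)
        return np.minimum(np.maximum(0.0, z), clip)

    # Hypothetical sizes: 494 input features (n_input * (2*C + 1) with n_input=26, C=9)
    # feeding a layer of 2048 hidden units.
    rng = np.random.default_rng(0)
    h_prev = rng.standard_normal(494)           # h^(l-1)_t for one time slice
    W_l = rng.standard_normal((2048, 494))      # W^(l)
    b_l = np.zeros(2048)                        # b^(l)

    h_l = clipped_relu(W_l @ h_prev + b_l)      # h^(l)_t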
The fourth layer is a bidirectional recurrent
layer `[1] <http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf>`_.
This layer includes two sets of hidden units: a set with forward recurrence,
:math:`h^{(f)}`, and a set with backward recurrence, :math:`h^{(b)}`:

.. math::
    h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})

    h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})

Note that :math:`h^{(f)}` must be computed sequentially from :math:`t = 1` to :math:`t = T^{(i)}`
for the :math:`i`-th utterance, while the units :math:`h^{(b)}` must be computed
sequentially in reverse, from :math:`t = T^{(i)}` to :math:`t = 1`.
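To make the direction of the two passes concrete, below is a simplified NumPy sketch of these recurrences with toy sizes. It uses the plain recurrent units of the equations above; the actual layer uses LSTM cells, as described in the notes on geometric constants.

.. code:: python

    import numpy as np

    # Toy sizes, chosen only for illustration.
    T, n_hidden_3, n_cell_dim = 6, 8, 4
    rng = np.random.default_rng(0)

    def g(z):
        return np.minimum(np.maximum(0.0, z), 20.0)

    h3 = rng.standard_normal((T, n_hidden_3))               # h^(3)_t for t = 1..T
    W4 = rng.standard_normal((n_cell_dim, n_hidden_3))      # W^(4)
    Wf_r = rng.standard_normal((n_cell_dim, n_cell_dim))    # W^(f)_r
    Wb_r = rng.standard_normal((n_cell_dim, n_cell_dim))    # W^(b)_r
    b4 = np.zeros(n_cell_dim)                               # b^(4)

    hf = np.zeros((T, n_cell_dim))   # forward units, filled from t = 1 to t = T
    hb = np.zeros((T, n_cell_dim))   # backward units, filled from t = T to t = 1

    for t in range(T):
        prev = hf[t - 1] if t > 0 else np.zeros(n_cell_dim)
        hf[t] = g(W4 @ h3[t] + Wf_r @ prev + b4)

    for t in reversed(range(T)):
        nxt = hb[t + 1] if t < T - 1 else np.zeros(n_cell_dim)
        hb[t] = g(W4 @ h3[t] + Wb_r @ nxt + b4)

    h4 = hf + hb   # h^(4) = h^(f) + h^(b), fed to the fifth layer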
The fifth (non-recurrent) layer takes both the forward and backward units as inputs:

.. math::
    h^{(5)} = g(W^{(5)} h^{(4)} + b^{(5)})

where :math:`h^{(4)} = h^{(f)} + h^{(b)}`. The output layer consists of standard logits that
correspond to the predicted character probabilities for each time slice :math:`t` and
character :math:`k` in the alphabet:

.. math::
    h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k

Here :math:`b^{(6)}_k` denotes the :math:`k`-th bias and :math:`(W^{(6)} h^{(5)}_t)_k` the :math:`k`-th
element of the matrix product.
Once we have computed a prediction for :math:`\hat{y}_{t,k}`, we compute the CTC loss
`[2] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ :math:`\cal{L}(\hat{y}, y)`
to measure the error in prediction. During training, we can evaluate the gradient
:math:`\nabla \cal{L}(\hat{y}, y)` with respect to the network outputs given the
ground-truth character sequence :math:`y`. From this point, computing the gradient
with respect to all of the model parameters may be done via back-propagation
through the rest of the network. We use the Adam method for training
`[3] <http://arxiv.org/abs/1412.6980>`_.
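For orientation, a minimal TensorFlow 1.x-style sketch of this training objective might look as follows; the shapes and label values are toy placeholders, and this is not the project's actual training code.

.. code:: python

    import tensorflow as tf

    # Toy shapes: 1 utterance, 50 time steps, 29 output characters.
    n_steps, batch_size, n_character = 50, 1, 29

    # Stand-in for the sixth-layer logits, time-major as tf.nn.ctc_loss expects:
    # [n_steps, batch_size, n_character].
    logits = tf.Variable(tf.random_normal([n_steps, batch_size, n_character]))

    # Ground-truth transcription as a SparseTensor of character indices (toy values).
    labels = tf.SparseTensor(indices=[[0, 0], [0, 1], [0, 2]],
                             values=[7, 4, 24],
                             dense_shape=[batch_size, 3])
    sequence_length = tf.fill([batch_size], n_steps)

    # CTC loss averaged over the batch, minimized with the Adam method.
    loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, sequence_length))
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)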
The complete BRNN model is illustrated in the figure below.

.. image:: ../images/rnn_fig-624x548.png
    :alt: DeepSpeech BRNN

View File

@ -1,69 +1,86 @@
Geometric Constants
===================

This is about several constants related to the geometry of the network.

n_steps
-------

The network views each speech sample as a sequence of time-slices :math:`x^{(i)}_t` of
length :math:`T^{(i)}`. As the speech samples vary in length, we know that :math:`T^{(i)}`
need not equal :math:`T^{(j)}` for :math:`i \ne j`. For each batch, the BRNN in TensorFlow needs
to know ``n_steps``, which is the maximum :math:`T^{(i)}` for the batch.
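For example, ``n_steps`` could be derived for a batch as sketched below; the utterance lengths and the zero-padding step are illustrative assumptions, not the project's actual batching code.

.. code:: python

    import numpy as np

    n_input = 26
    # Hypothetical batch of three utterances with different numbers of time-slices.
    batch = [np.zeros((73, n_input)), np.zeros((100, n_input)), np.zeros((88, n_input))]

    n_steps = max(x.shape[0] for x in batch)   # the maximum T^(i) for this batch

    # Zero-pad the shorter utterances up to n_steps so the batch forms one dense tensor.
    padded = np.stack([np.pad(x, ((0, n_steps - x.shape[0]), (0, 0)), mode="constant")
                       for x in batch])
    print(padded.shape)   # (3, 100, 26)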
n_input
-------

Each of the at most ``n_steps`` vectors is a vector of MFCC features of a
time-slice of the speech sample. We will make the number of MFCC features
dependent upon the sample rate of the data set. Generically, if the sample rate
is 8kHz we use 13 features; if the sample rate is 16kHz we use 26 features.
We capture the dimension of these vectors, equivalently the number of MFCC
features, in the variable ``n_input``.
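As a sketch, the feature extraction could look like the following, assuming the third-party ``python_speech_features`` package and a synthetic one-second audio signal; the package choice and parameters are assumptions for illustration.

.. code:: python

    import numpy as np
    from python_speech_features import mfcc   # assumed third-party package

    sample_rate = 16000
    n_input = 26 if sample_rate >= 16000 else 13   # the rule of thumb described above

    # One second of synthetic audio standing in for a real speech sample.
    audio = (np.random.randn(sample_rate) * 1000).astype(np.int16)

    features = mfcc(audio, samplerate=sample_rate, numcep=n_input)
    print(features.shape)   # (number of time-slices, n_input)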
n_context
---------

As previously mentioned, the BRNN is not simply fed the MFCC features of a given
time-slice. It is fed, in addition, a context of :math:`C \in \{5, 7, 9\}` frames on
either side of the frame in question. The number of frames in this context is
captured in the variable ``n_context``.
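A rough NumPy sketch of assembling this windowed input for a single time-slice is shown below; the padding scheme and shapes are illustrative assumptions.

.. code:: python

    import numpy as np

    n_input, n_context, T = 26, 9, 100
    features = np.zeros((T, n_input))   # MFCC features of one utterance

    # Pad with n_context empty frames on each side so every time-slice has full context.
    padded = np.pad(features, ((n_context, n_context), (0, 0)), mode="constant")

    # Input for time-slice t: the frame itself plus n_context frames on either side,
    # flattened into a single vector of length n_input * (2 * n_context + 1).
    t = 0
    window = padded[t : t + 2 * n_context + 1].reshape(-1)
    print(window.shape)   # (494,)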
Next we will introduce constants that specify the geometry of some of the
non-recurrent layers of the network. We do this by simply specifying the number
of units in each of the layers.

n_hidden_1, n_hidden_2, n_hidden_5
----------------------------------

``n_hidden_1`` is the number of units in the first layer, ``n_hidden_2`` the number
of units in the second, and ``n_hidden_5`` the number in the fifth. We haven't
forgotten about the third or sixth layer. We will define their unit counts below.
An LSTM BRNN consists of a pair of LSTM RNNs. One LSTM RNN works "forward in time":

.. image:: ../images/LSTM3-chain.png
    :alt: Image shows a diagram of a recurrent neural network with LSTM cells, with arrows depicting the flow of data from earlier time steps to later time steps within the RNN.

and a second LSTM RNN works "backwards in time":

.. image:: ../images/LSTM3-chain-backwards.png
    :alt: Image shows a diagram of a recurrent neural network with LSTM cells, this time with data flowing from later time steps to earlier time steps within the RNN.

The dimension of the cell state, the upper line connecting subsequent LSTM units,
is independent of the input dimension and the same for both the forward and
backward LSTM RNNs.
n_cell_dim
----------

Hence, we are free to choose the dimension of this cell state independently of the
input dimension. We capture the cell state dimension in the variable ``n_cell_dim``.
n_hidden_3
----------

The number of units in the third layer, which feeds in to the LSTM, is
determined by ``n_cell_dim`` as follows:

.. code:: python

    n_hidden_3 = 2 * n_cell_dim
n_character
-----------

The variable ``n_character`` will hold the number of characters in the target
language plus one, for the :math:`blank`.
For English it is the cardinality of the set

.. math::
    \{a, b, c, \ldots, z, space, apostrophe, blank\}

we referred to earlier.
n_hidden_6
----------

The number of units in the sixth layer is determined by ``n_character`` as follows:

.. code:: python

    n_hidden_6 = n_character

View File

@ -1,14 +1,16 @@
Parallel Optimization
=====================

This is how we implement optimization of the DeepSpeech model across GPUs on a
single host. Parallel optimization can take on various forms. For example,
one can use asynchronous updates of the model, synchronous updates of the model,
or some combination of the two.

Asynchronous Parallel Optimization
----------------------------------

In asynchronous parallel optimization, for example, one places the model
initially in CPU memory. Then each of the :math:`G` GPUs obtains a mini-batch of data
along with the current model parameters. Using this mini-batch each GPU then
computes the gradients for all model parameters and sends these gradients back
to the CPU when the GPU is done with its mini-batch. The CPU then asynchronously
@ -17,33 +19,36 @@ updates the model parameters whenever it receives a set of gradients from a GPU.
Asynchronous parallel optimization has several advantages and several
disadvantages. One large advantage is throughput. No GPU will ever be waiting
idle. When a GPU is done processing a mini-batch, it can immediately obtain the
next mini-batch to process. It never has to wait on other GPUs to finish their
mini-batch. However, this means that the model updates will also be asynchronous,
which can cause problems.

For example, one may have model parameters :math:`W` on the CPU and send mini-batch
:math:`n` to GPU 1 and mini-batch :math:`n+1` to GPU 2. As processing is asynchronous,
GPU 2 may finish before GPU 1 and thus update the CPU's model parameters :math:`W`
with its gradients :math:`\Delta W_{n+1}(W)`, where the subscript :math:`n+1` identifies the
mini-batch and the argument :math:`W` the location at which the gradient was evaluated.
This results in the new model parameters

.. math::
    W + \Delta W_{n+1}(W).

Next GPU 1 could finish with its mini-batch and update the parameters to

.. math::
    W + \Delta W_{n+1}(W) + \Delta W_{n}(W).

The problem with this is that :math:`\Delta W_{n}(W)` is evaluated at :math:`W` and not at
:math:`W + \Delta W_{n+1}(W)`. Hence, the direction of the gradient :math:`\Delta W_{n}(W)`
is slightly incorrect, as it is evaluated at the wrong location. This can be
counteracted through synchronous updates of the model, but those are also problematic.
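A toy one-dimensional example, purely for illustration, shows the effect of this stale gradient; the quadratic loss and learning rate are made up.

.. code:: python

    # Toy 1-D example: loss L(w) = 0.5 * w**2, so the gradient at w is simply w.
    def grad(w):
        return w

    w, lr = 1.0, 0.1

    # Both GPUs read the same parameters w before either finishes its mini-batch.
    g_n = grad(w)        # gradient for mini-batch n,   evaluated at w
    g_n1 = grad(w)       # gradient for mini-batch n+1, evaluated at w

    # Asynchronous updates: the second gradient applied is stale -- it was
    # computed at w, not at the already-updated parameters.
    w_async = w - lr * g_n1 - lr * g_n

    # Fully sequential updates re-evaluate the gradient at the new point.
    w_sync = w - lr * grad(w)
    w_sync = w_sync - lr * grad(w_sync)

    print(w_async, w_sync)   # 0.8 vs. 0.81: the stale direction overshoots slightly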
Synchronous Optimization
------------------------

Synchronous optimization solves the problem we saw above. In synchronous
optimization, one places the model initially in CPU memory. Then one of the :math:`G`
GPUs is given a mini-batch of data along with the current model parameters.
Using the mini-batch the GPU computes the gradients for all model parameters and
sends the gradients back to the CPU. The CPU then updates the model parameters
and starts the process of sending out the next mini-batch.
@ -51,50 +56,50 @@ and starts the process of sending out the next mini-batch.
As one can readily see, synchronous optimization does not have the problem we
found in the last section, that of incorrect gradients. However, synchronous
optimization can only make use of a single GPU at a time. So, when we have a
multi-GPU setup, :math:`G > 1`, all but one of the GPUs will remain idle, which is
unacceptable. However, there is a third alternative which combines the
advantages of asynchronous and synchronous optimization.
Hybrid Parallel Optimization
----------------------------

Hybrid parallel optimization combines most of the benefits of asynchronous and
synchronous optimization. It allows for multiple GPUs to be used, but does not
suffer from the incorrect gradient problem exhibited by asynchronous
optimization.

In hybrid parallel optimization one places the model initially in CPU memory.
Then, as in asynchronous optimization, each of the :math:`G` GPUs obtains a
mini-batch of data along with the current model parameters. Using the mini-batch,
each of the GPUs then computes the gradients for all model parameters and sends
these gradients back to the CPU. Now, in contrast to asynchronous optimization,
the CPU waits until each GPU is finished with its mini-batch, then takes the mean
of all the gradients from the :math:`G` GPUs and updates the model with this mean
gradient.

.. image:: ../images/Parallelism.png
    :alt: Image shows a diagram with arrows displaying the flow of information between devices during training. A CPU device sends weights and gradients to one or more GPU devices, which run an optimization step and then return the new parameters to the CPU, which averages them and starts a new training iteration.
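The CPU-side step of this scheme, waiting for all GPUs and averaging their gradients before a single update, can be sketched in a few lines of NumPy; the values and the ``mean_gradient`` helper are toy placeholders, not the project's implementation.

.. code:: python

    import numpy as np

    def mean_gradient(per_gpu_gradients):
        # Average the gradients computed independently by the G GPUs; each entry
        # has the same shape as the model parameters W.
        return np.mean(per_gpu_gradients, axis=0)

    # Toy example: parameters W and gradients from G = 2 GPUs.
    W = np.zeros(3)
    grads = [np.array([0.1, -0.2, 0.3]), np.array([0.3, 0.0, -0.1])]
    learning_rate = 0.001

    # One hybrid update step: wait for all GPUs, average, then update once on the CPU.
    W -= learning_rate * mean_gradient(grads)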
Hybrid parallel optimization has several advantages and few disadvantages. As in
asynchronous parallel optimization, hybrid parallel optimization allows one
to use multiple GPUs in parallel. Furthermore, unlike asynchronous parallel
optimization, the incorrect gradient problem is not present here. In fact,
hybrid parallel optimization performs as if one is working with a single
mini-batch which is :math:`G` times the size of a mini-batch handled by a single GPU.

However, hybrid parallel optimization is not perfect. If one GPU is slower than
all the others in completing its mini-batch, all other GPUs will have to sit
idle until this straggler finishes with its mini-batch. This hurts throughput.
But, if all GPUs are of the same make and model, this problem should be
minimized.

So, relatively speaking, hybrid parallel optimization has more
advantages and fewer disadvantages than both asynchronous and
synchronous optimization. So we will, for our work, use this hybrid model.
Adam Optimization
-----------------

In contrast to
`Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_,
in which `Nesterov's Accelerated Gradient Descent <www.cs.toronto.edu/~fritz/absps/momentum.pdf>`_
was used, we will use the Adam method for optimization `[3] <http://arxiv.org/abs/1412.6980>`_,
because, generally, it requires less fine-tuning.

Binary file not shown.
