diff --git a/doc/DeepSpeech.rst b/doc/DeepSpeech.rst
index c907d950..b4fa7ebd 100644
--- a/doc/DeepSpeech.rst
+++ b/doc/DeepSpeech.rst
@@ -1,9 +1,16 @@
 Introduction
 ============
 
-In this project we will reproduce the results of
+The aim of this project is to create a simple, open, and ubiquitous speech
+recognition engine. Simple, in that the engine should not require server-class
+hardware to execute. Open, in that the code and models are released under the
+Mozilla Public License. Ubiquitous, in that the engine should run on many
+platforms and have bindings to many different languages.
+
+The architecture of the engine was originally motivated by that presented in
 `Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_.
-The core of the system is a bidirectional recurrent neural network (BRNN)
+However, the engine currently differs in many respects from the engine it was
+originally motivated by. The core of the engine is a recurrent neural network (RNN)
 trained to ingest speech spectrograms and generate English text transcriptions.
 
 Let a single utterance :math:`x` and label :math:`y` be sampled from a training set
@@ -14,19 +21,19 @@ Let a single utterance :math:`x` and label :math:`y` be sampled from a training
 Each utterance, :math:`x^{(i)}` is a time-series of length :math:`T^{(i)}`
 where every time-slice is a vector of audio features,
 :math:`x^{(i)}_t` where :math:`t=1,\ldots,T^{(i)}`.
-We use MFCC as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
-in the audio frame at time :math:`t`. The goal of our BRNN is to convert an input
+We use MFCC coefficients as our features; so :math:`x^{(i)}_{t,p}` denotes the :math:`p`-th MFCC feature
+in the audio frame at time :math:`t`. The goal of our RNN is to convert an input
 sequence :math:`x` into a sequence of character probabilities for the transcription
 :math:`y`, with :math:`\hat{y}_t =\mathbb{P}(c_t \mid x)`,
-where :math:`c_t \in \{a,b,c, . . . , z, space, apostrophe, blank\}`.
+where for English :math:`c_t \in \{a,b,c, . . . , z, space, apostrophe, blank\}`.
 (The significance of :math:`blank` will be explained below.)
 
-Our BRNN model is composed of :math:`5` layers of hidden units.
+Our RNN model is composed of :math:`5` layers of hidden units.
 For an input :math:`x`, the hidden units at layer :math:`l` are denoted :math:`h^{(l)}` with the
 convention that :math:`h^{(0)}` is the input. The first three layers are not recurrent.
 For the first layer, at each time :math:`t`, the output depends on the MFCC frame
 :math:`x_t` along with a context of :math:`C` frames on each side.
-(We typically use :math:`C \in \{5, 7, 9\}` for our experiments.)
+(We use :math:`C = 9` for our experiments.)
 The remaining non-recurrent layers operate on independent data for each time step.
 Thus, for each time :math:`t`, the first :math:`3` layers are computed by:
 
@@ -35,28 +42,24 @@ Thus, for each time :math:`t`, the first :math:`3` layers are computed by:
 
 where :math:`g(z) = \min\{\max\{0, z\}, 20\}` is a clipped rectified-linear (ReLu)
 activation function and :math:`W^{(l)}`, :math:`b^{(l)}` are the weight matrix and bias
-parameters for layer :math:`l`. The fourth layer is a bidirectional recurrent
-layer `[1] <http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf>`_.
-This layer includes two sets of hidden units: a set with forward recurrence,
-:math:`h^{(f)}`, and a set with backward recurrence :math:`h^{(b)}`:
+parameters for layer :math:`l`. The fourth layer is a recurrent
+layer `[1] <http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf>`_.
+This layer includes a set of hidden units with forward recurrence,
+:math:`h^{(f)}`:
 
 .. math::
     h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})
 
-    h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})
-
 Note that :math:`h^{(f)}` must be computed sequentially from :math:`t = 1` to :math:`t = T^{(i)}`
-for the :math:`i`-th utterance, while the units :math:`h^{(b)}` must be computed
-sequentially in reverse from :math:`t = T^{(i)}` to :math:`t = 1`.
+for the :math:`i`-th utterance.
 
-The fifth (non-recurrent) layer takes both the forward and backward units as inputs
+The fifth (non-recurrent) layer takes the forward units as inputs
 
 .. math::
-    h^{(5)} = g(W^{(5)} h^{(4)} + b^{(5)})
+    h^{(5)} = g(W^{(5)} h^{(f)} + b^{(5)}).
 
-where :math:`h^{(4)} = h^{(f)} + h^{(b)}`. The output layer are standard logits that
-correspond to the predicted character probabilities for each time slice :math:`t` and
-character :math:`k` in the alphabet:
+The output layer is standard logits that correspond to the predicted character probabilities
+for each time slice :math:`t` and character :math:`k` in the alphabet:
 
 .. math::
     h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k
@@ -66,14 +69,15 @@ element of the matrix product.
 
 Once we have computed a prediction for :math:`\hat{y}_{t,k}`, we compute the CTC loss
 `[2] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ :math:`\cal{L}(\hat{y}, y)`
-to measure the error in prediction. During training, we can evaluate the gradient
+to measure the error in prediction. (The CTC loss requires the :math:`blank` above
+to indicate transitions between characters.) During training, we can evaluate the gradient
 :math:`\nabla \cal{L}(\hat{y}, y)` with respect to the network outputs given the
 ground-truth character sequence :math:`y`. From this point, computing the gradient
 with respect to all of the model parameters may be done via back-propagation
 through the rest of the network. We use the Adam method for training
 `[3] <http://arxiv.org/abs/1412.6980>`_.
 
-The complete BRNN model is illustrated in the figure below.
+The complete RNN model is illustrated in the figure below.
 
-.. image:: ../images/rnn_fig-624x548.png
+.. image:: ../images/rnn_fig-624x598.png
     :alt: DeepSpeech BRNN
diff --git a/images/rnn_fig-624x548.png b/images/rnn_fig-624x548.png
deleted file mode 100644
index 0f288bad..00000000
Binary files a/images/rnn_fig-624x548.png and /dev/null differ
diff --git a/images/rnn_fig-624x598.png b/images/rnn_fig-624x598.png
new file mode 100644
index 00000000..ecc79322
Binary files /dev/null and b/images/rnn_fig-624x598.png differ
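
The revised section fully specifies the network's forward pass, so it can be
traced end to end. What follows is a minimal NumPy sketch of that computation,
assuming :math:`C = 9` context frames; it is an illustration of the equations
above, not the project's actual TensorFlow implementation, and every parameter
name in it (``W1`` through ``W6``, ``Wr``) is hypothetical.

.. code-block:: python

    import numpy as np

    def clipped_relu(z):
        # g(z) = min(max(0, z), 20), the clipped ReLU used in every hidden layer.
        return np.minimum(np.maximum(0.0, z), 20.0)

    def forward(x, params, context=9):
        """Run one utterance of MFCC frames x, shape (T, F), through the five
        layers described above; returns (T, K) logits over the alphabet."""
        T, _ = x.shape
        # Layer 1 sees each frame plus `context` frames on each side, i.e. a
        # window of 2C + 1 frames concatenated into a single input vector.
        padded = np.pad(x, ((context, context), (0, 0)))
        window = 2 * context + 1
        h0 = np.stack([padded[t:t + window].ravel() for t in range(T)])

        h1 = clipped_relu(h0 @ params["W1"] + params["b1"])
        h2 = clipped_relu(h1 @ params["W2"] + params["b2"])
        h3 = clipped_relu(h2 @ params["W3"] + params["b3"])

        # Layer 4: forward-only recurrence, necessarily computed in order
        # from t = 1 to t = T for each utterance.
        hf = np.zeros((T, params["b4"].shape[0]))
        prev = np.zeros_like(params["b4"])
        for t in range(T):
            prev = clipped_relu(h3[t] @ params["W4"]
                                + prev @ params["Wr"] + params["b4"])
            hf[t] = prev

        h5 = clipped_relu(hf @ params["W5"] + params["b5"])
        # Output layer: raw per-timestep logits; the softmax is folded into
        # the CTC loss during training.
        return h5 @ params["W6"] + params["b6"]

The document's equations multiply column vectors on the left
(:math:`W^{(l)} h^{(l-1)}_t`); the sketch uses the equivalent row-vector
convention ``h @ W`` that is natural in NumPy.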
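The last hunk's description of training also maps directly onto library code.
As a hedged sketch only (DeepSpeech's real training loop is more involved),
TensorFlow's built-in CTC loss can be evaluated, and differentiated with
respect to the network outputs, like this; the batch size, utterance length,
and 29-symbol alphabet indexing (a through z, space, apostrophe, with blank
last) are assumptions made for the example.

.. code-block:: python

    import tensorflow as tf

    batch, T, num_classes = 8, 200, 29  # 28 characters + CTC blank (assumed sizes)
    logits = tf.random.normal([T, batch, num_classes])  # stand-in for h^{(6)}, time-major

    # Dense integer labels: "hello" -> h=7, e=4, l=11, l=11, o=14 with a=0 ... z=25.
    labels = tf.constant([[7, 4, 11, 11, 14]] * batch)

    with tf.GradientTape() as tape:
        tape.watch(logits)
        # CTC sums over every alignment of the label with the T frames, using
        # the blank symbol to separate repeated characters.
        loss = tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels,
            logits=logits,
            label_length=tf.fill([batch], 5),
            logit_length=tf.fill([batch], T),
            logits_time_major=True,
            blank_index=num_classes - 1))

    # The gradient with respect to the network outputs; back-propagation through
    # the rest of the network and an Adam update (reference [3]) proceed from here.
    grad_wrt_outputs = tape.gradient(loss, logits)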