diff --git a/DeepSpeech.ipynb b/DeepSpeech.ipynb
index cc814355..a6fde0c6 100644
--- a/DeepSpeech.ipynb
+++ b/DeepSpeech.ipynb
@@ -13,13 +13,13 @@
    "source": [
     "In this notebook we will reproduce the results of [Deep Speech: Scaling up end-to-end speech recognition](http://arxiv.org/abs/1412.5567). The core of the system is a bidirectional recurrent neural network (BRNN) trained to ingest speech spectrograms and generate English text transcriptions.\n",
     "\n",
-    " Let a single utterance $x$ and label $y$ be sampled from a training set $S = \\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . .\\}$. Each utterance, $x^{(i)}$ is a time-series of length $T^{(i)}$ where every time-slice is a vector of audio features, $x^{(i)}_t$ where $t=1,\\ldots,T^{(i)}$. We use spectrograms as our features; so $x^{(i)}_{t,p}$ denotes the power of the $p$-th frequency bin in the audio frame at time $t$. The goal of our BRNN is to convert an input sequence $x$ into a sequence of character probabilities for the transcription $y$, with $\\hat{y}_t =\\mathbb{P}(c_t \\mid x)$, where $c_t \\in \\{a,b,c, . . . , z, space, apostrophe, blank\\}$. (The significance of $blank$ will be explained below.)\n",
+    " Let a single utterance $x$ and label $y$ be sampled from a training set $S = \\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \\ldots\\}$. Each utterance, $x^{(i)}$, is a time-series of length $T^{(i)}$ where every time-slice is a vector of audio features, $x^{(i)}_t$ where $t=1,\\ldots,T^{(i)}$. We use MFCCs (Mel-frequency cepstral coefficients) as our features; so $x^{(i)}_{t,p}$ denotes the $p$-th MFCC feature in the audio frame at time $t$. The goal of our BRNN is to convert an input sequence $x$ into a sequence of character probabilities for the transcription $y$, with $\\hat{y}_t = \\mathbb{P}(c_t \\mid x)$, where $c_t \\in \\{a,b,c,\\ldots,z, space, apostrophe, blank\\}$. (The significance of $blank$ will be explained below.)\n",
     "\n",
-    "Our BRNN model is composed of $5$ layers of hidden units. For an input $x$, the hidden units at layer $l$ are denoted $h^{(l)}$ with the convention that $h^{(0)}$ is the input. The first three layers are not recurrent. For the first layer, at each time $t$, the output depends on the spectrogram frame $x_t$ along with a context of $C$ frames on each side. (We typically use $C \\in \\{5, 7, 9\\}$ for our experiments.) The remaining non-recurrent layers operate on independent data for each time step. Thus, for each time $t$, the first $3$ layers are computed by:\n",
+    "Our BRNN model is composed of $5$ layers of hidden units. For an input $x$, the hidden units at layer $l$ are denoted $h^{(l)}$ with the convention that $h^{(0)}$ is the input. The first three layers are not recurrent. For the first layer, at each time $t$, the output depends on the MFCC frame $x_t$ along with a context of $C$ frames on each side. (We typically use $C \\in \\{5, 7, 9\\}$ for our experiments.) The remaining non-recurrent layers operate on independent data for each time step. Thus, for each time $t$, the first $3$ layers are computed by:\n",
     "\n",
     "$$h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})$$\n",
     "\n",
-    "where $g(z) = \\min\\{\\max\\{0, z\\}, 20\\}$ is the clipped rectified-linear (ReLu) activation function and $W^{(l)}$, $b^{(l)}$ are the weight matrix and bias parameters for layer $l$. The fourth layer is a bidirectional recurrent layer[[1](http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf)]. This layer includes two sets of hidden units: a set with forward recurrence, $h^{(f)}$, and a set with backward recurrence $h^{(b)}$:\n",
+    "where $g(z) = \\min\\{\\max\\{0, z\\}, 20\\}$ is a clipped rectified-linear (ReLU) activation function and $W^{(l)}$, $b^{(l)}$ are the weight matrix and bias parameters for layer $l$. The fourth layer is a bidirectional recurrent layer[[1](http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf)]. This layer includes two sets of hidden units: a set with forward recurrence, $h^{(f)}$, and a set with backward recurrence $h^{(b)}$:\n",
     "\n",
     "$$h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})$$\n",
     "$$h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})$$\n",
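+    "\n",
+    "For concreteness, the clipped ReLU $g$ can be written out directly. This is a minimal NumPy sketch, and `clipped_relu` is an illustrative name rather than a function used by the model code below:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "def clipped_relu(z, clip=20.0):\n",
+    "    # g(z) = min(max(0, z), 20): rectify, then clip to keep activations bounded\n",
+    "    return np.minimum(np.maximum(z, 0.0), clip)\n",
+    "\n",
+    "clipped_relu(np.array([-1.5, 3.0, 42.0]))  # -> array([  0.,   3.,  20.])\n",
+    "```\n",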
@@ -37,7 +37,7 @@
     "\n",
     "Here $b^{(6)}_k$ denotes the $k$-th bias and $(W^{(6)} h^{(5)}_t)_k$ the $k$-th element of the matrix product.\n",
     "\n",
-    "Once we have computed a prediction for $\\mathbb{P}(c_t = k \\mid x)$, we compute the CTC loss[[2]](http://www.cs.toronto.edu/~graves/preprint.pdf) $\\cal{L}(\\hat{y}, y)$ to measure the error in prediction. During training, we can evaluate the gradient $\\nabla \\cal{L}(\\hat{y}, y)$ with respect to the network outputs given the ground-truth character sequence $y$. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use the Adam method for training[[3](http://arxiv.org/abs/1412.6980)].\n",
+    "Once we have computed a prediction for $\\hat{y}_{t,k}$, we compute the CTC loss[[2]](http://www.cs.toronto.edu/~graves/preprint.pdf) $\\cal{L}(\\hat{y}, y)$ to measure the error in prediction. During training, we can evaluate the gradient $\\nabla \\cal{L}(\\hat{y}, y)$ with respect to the network outputs given the ground-truth character sequence $y$. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use the Adam method for training[[3](http://arxiv.org/abs/1412.6980)].\n",
     "\n",
     "The complete BRNN model is illustrated in the figure below.\n",
     "\n",
@@ -100,9 +100,8 @@
    },
    "outputs": [],
    "source": [
-    "import tensorflow as tf\n",
-    "from tensorflow.python.framework.constant_op import constant\n",
-    "import numpy as np"
+    "import numpy as np\n",
+    "import tensorflow as tf"
    ]
   },
   {
@@ -132,6 +131,9 @@
    "outputs": [],
    "source": [
     "learning_rate = 0.001 # TODO: Determine a reasonable value for this\n",
+    "beta1 = 0.9 # TODO: Determine a reasonable value for this\n",
+    "beta2 = 0.999 # TODO: Determine a reasonable value for this\n",
+    "epsilon = 1e-8 # TODO: Determine a reasonable value for this\n",
     "training_iters = 100000 # TODO: Determine a reasonable value for this\n",
     "batch_size = 128 # TODO: Determine a reasonable value for this\n",
     "display_step = 10 # TODO: Determine a reasonable value for this"
    ]
   },
@@ -154,7 +156,7 @@
    },
    "outputs": [],
    "source": [
-    "dropout_rate = 0.05"
+    "dropout_rate = 0.05 # TODO: Validate this is a reasonable value"
    ]
   },
   {
@@ -206,7 +208,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Each of the `n_steps` vectors is the Fourier transform of a time-slice of the speech sample. The number of \"bins\" of this Fourier transform is dependent upon the sample rate of the data set. Generically, if the sample rate is 8kHz we use 80bins. If the sample rate is 16kHz we use 160bins... We capture the dimension of these vectors, equivalently the number of bins in the Fourier transform, in the variable `n_input`"
+    "Each of the `n_steps` vectors contains the MFCC features of a time-slice of the speech sample. We will make the number of MFCC features dependent upon the sample rate of the data set. Generically, if the sample rate is 8kHz we use 13 features. If the sample rate is 16kHz we use 26 features... We capture the dimension of these vectors, equivalently the number of MFCC features, in the variable `n_input`.\n",
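+    "\n",
+    "One possible way to compute such features is sketched below. It is a sketch only: it assumes the `python_speech_features` package, and `sample.wav` is a placeholder path rather than a file from our data set:\n",
+    "\n",
+    "```python\n",
+    "import scipy.io.wavfile as wav\n",
+    "from python_speech_features import mfcc\n",
+    "\n",
+    "fs, audio = wav.read('sample.wav')  # 'sample.wav' is a placeholder path\n",
+    "numcep = 26 if fs == 16000 else 13  # 26 features at 16kHz, 13 at 8kHz\n",
+    "features = mfcc(audio, samplerate=fs, numcep=numcep)  # shape: [n_steps, numcep]\n",
+    "```"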
    ]
   },
   {
@@ -217,14 +219,14 @@
    },
    "outputs": [],
    "source": [
-    "n_input = 160 # TODO: Determine this programatically from the sample rate"
+    "n_input = 26 # TODO: Determine this programmatically from the sample rate"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "As previously mentioned, the BRNN is not simply fed the Fourier transform of a given time-slice. It is fed, in addition, a context of $C \\in \\{5, 7, 9\\}$ frames on either side of the frame in question. The number of frames in this context is captured in the variable `n_context`"
+    "As previously mentioned, the BRNN is not simply fed the MFCC features of a given time-slice. It is fed, in addition, a context of $C \\in \\{5, 7, 9\\}$ frames on either side of the frame in question. The number of frames in this context is captured in the variable `n_context`"
    ]
   },
   {
@@ -392,36 +394,22 @@
    "outputs": [],
    "source": [
     "x = tf.placeholder(\"float\", [None, n_steps, n_input + 2*n_input*n_context])\n",
-    "y = tf.placeholder(\"string\", [None, 1])"
+    "y = tf.sparse_placeholder(tf.int32)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "As `y` represents the text transcript of each element in a batch, it is of type \"string\" and has shape `[None, 1]` where the `None` dimension corresponds to the number of elements in the batch.\n",
+    "The placeholder `y` represents the text transcript of each element in a batch. `y` is of type \"SparseTensor\", as required by the CTC algorithm. The details of how the text transcripts are encoded into a \"SparseTensor\" will be presented below.\n",
     "\n",
-    "The placeholder `x` is a place holder for the the speech spectrograms along with their prefix and postfix contexts for each element in a batch. As it represents a spectrogram, its type is \"float\". The `None` dimension of its shape\n",
+    "The placeholder `x` holds the speech features along with their prefix and postfix contexts for each element in a batch. As it represents MFCC features, its type is \"float\". The `None` dimension of its shape\n",
     "\n",
     "```python\n",
     "[None, n_steps, n_input + 2*n_input*n_context]\n",
     "```\n",
     "\n",
-    "has the same meaning as the `None` dimension in the shape of `y`. The `n_steps` dimension of its shape indicates the number of time-slices in the sequence. Finally, the `n_input + 2*n_input*n_context` dimension of its shape indicates the number of bins in Fourier transform `n_input` along with the number of bins in the prefix-context `n_input*n_context` and postfix-contex `n_input*n_context`.\n",
-    "\n",
-    "The next placeholders we introduce `istate_fw` and `istate_bw` correspond to the initial states and cells of the forward and backward LSTM networks. As both of these are floats of dimension `n_cell_dim`, we define `istate_fw` and `istate_bw` as follows"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "istate_fw = (tf.placeholder(\"float\", [None, n_cell_dim]), tf.placeholder(\"float\", [None, n_cell_dim]))\n",
-    "istate_bw = (tf.placeholder(\"float\", [None, n_cell_dim]), tf.placeholder(\"float\", [None, n_cell_dim]))"
+    "is a stand-in for the batch size. The `n_steps` dimension of its shape indicates the number of time-slices in the sequence. Finally, the `n_input + 2*n_input*n_context` dimension of its shape indicates the number of MFCC features `n_input` along with the number of MFCC features in the prefix-context `n_input*n_context` and postfix-context `n_input*n_context`.\n",
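+    "\n",
+    "As a preview of that encoding, a batch of transcripts can be packed into the `(indices, values, shape)` triple that a \"SparseTensor\" feed expects. The helper below is a minimal sketch; its name and its character mapping (space to 0, 'a' through 'z' to 1 through 26) are illustrative assumptions, not the encoding fixed later:\n",
+    "\n",
+    "```python\n",
+    "def texts_to_sparse_tuple(texts):\n",
+    "    # Collect the coordinates and integer label of every character in the batch\n",
+    "    indices, values = [], []\n",
+    "    for batch_i, text in enumerate(texts):\n",
+    "        for time_i, ch in enumerate(text):\n",
+    "            indices.append([batch_i, time_i])\n",
+    "            values.append(0 if ch == ' ' else ord(ch) - ord('a') + 1)\n",
+    "    shape = [len(texts), max(len(text) for text in texts)]\n",
+    "    return np.array(indices), np.array(values, dtype=np.int32), np.array(shape)\n",
+    "\n",
+    "# Such a triple can be fed directly, e.g. feed_dict={y: texts_to_sparse_tuple(['hello world'])}\n",
+    "```"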
    ]
   },
   {
@@ -433,7 +421,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 15,
    "metadata": {
     "collapsed": true
    },
@@ -455,7 +443,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 16,
    "metadata": {
     "collapsed": false
    },
@@ -483,18 +471,18 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next we introduce a utility function `BiRNN` that can take our placeholders `x`, `istate_fw`, and `istate_bw` along with the dictionaries `weights` and `biases` and add all the apropos operators to our default graph."
+    "Next we introduce a utility function `BiRNN` that can take the placeholder `x` along with the dictionaries `weights` and `biases` and add all the appropriate operators to our default graph."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 17,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
-    "def BiRNN(_X, _istate_fw, _istate_bw, _weights, _biases):\n",
+    "def BiRNN(_X, _weights, _biases):\n",
     "    # Input shape: [batch_size, n_steps, n_input + 2*n_input*n_context]\n",
     "    _X = tf.transpose(_X, [1, 0, 2])  # Permute n_steps and batch_size\n",
     "    # Reshape to prepare input for first layer\n",
@@ -523,8 +511,7 @@
     "    outputs, output_state_fw, output_state_bw = tf.nn.bidirectional_rnn(cell_fw=lstm_fw_cell,\n",
     "                                                                        cell_bw=lstm_bw_cell,\n",
     "                                                                        inputs=layer_3,\n",
-    "                                                                        initial_state_fw=_istate_fw,\n",
-    "                                                                        initial_state_bw=_istate_bw)\n",
+    "                                                                        dtype=tf.float32)\n",
     "\n",
     "    # Reshape outputs from a list of n_steps tensors each of shape [batch_size, 2*n_cell_dim]\n",
     "    # to a single tensor of shape [n_steps*batch_size, 2*n_cell_dim]\n",
@@ -552,7 +539,7 @@
    "source": [
     "The first few lines of the function `BiRNN`\n",
     "```python\n",
-    "def BiRNN(_X, _istate_fw, _istate_bw, _weights, _biases):\n",
+    "def BiRNN(_X, _weights, _biases):\n",
     "    # Input shape: [batch_size, n_steps, n_input + 2*n_input*n_context]\n",
     "    _X = tf.transpose(_X, [1, 0, 2])  # Permute n_steps and batch_size\n",
     "    # Reshape to prepare input for first layer\n",
@@ -603,12 +590,11 @@
     "    outputs, output_state_fw, output_state_bw = tf.nn.bidirectional_rnn(cell_fw=lstm_fw_cell,\n",
     "                                                                        cell_bw=lstm_bw_cell,\n",
     "                                                                        inputs=layer_3,\n",
-    "                                                                        initial_state_fw=_istate_fw,\n",
-    "                                                                        initial_state_bw=_istate_bw)\n",
+    "                                                                        dtype=tf.float32)\n",
     "```\n",
     "feeds `layer_3` to the LSTM BRNN cell and obtains the LSTM BRNN output.\n",
     "\n",
-    "The next lines convert `outputs` from a list of rank two tensors into a rank two tensor in preparation for passing it to the next neural network layer \n",
+    "The next lines convert `outputs` from a list of rank two tensors into a single rank two tensor in preparation for passing it to the next neural network layer:\n",
     "```python\n",
     "    # Reshape outputs from a list of n_steps tensors each of shape [batch_size, 2*n_cell_dim]\n",
     "    # to a single tensor of shape [n_steps*batch_size, 2*n_cell_dim]\n",
@@ -650,13 +636,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 18,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
-    "layer_6 = BiRNN(x, istate_fw, istate_bw, weights, biases)"
+    "layer_6 = BiRNN(x, weights, biases)"
    ]
   },
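+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To preview where `layer_6` is headed, the sketch below wires it to the CTC loss and the Adam optimizer. It is illustrative only: it assumes a TensorFlow build that provides `tf.nn.ctc_loss`, a hypothetical placeholder `seq_length` for the per-utterance sequence lengths, and an alphabet size `n_character` of 29 (26 letters, space, apostrophe, and the CTC blank):\n",
+    "\n",
+    "```python\n",
+    "n_character = 29  # assumed alphabet size, including the CTC blank\n",
+    "seq_length = tf.placeholder(tf.int32, [None])  # hypothetical per-utterance lengths\n",
+    "\n",
+    "# ctc_loss expects time-major logits of shape [n_steps, batch_size, n_character]\n",
+    "logits = tf.reshape(layer_6, [n_steps, -1, n_character])\n",
+    "avg_loss = tf.reduce_mean(tf.nn.ctc_loss(logits, y, seq_length))\n",
+    "\n",
+    "# Adam uses the hyperparameters defined near the top of the notebook\n",
+    "optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,\n",
+    "                                   beta1=beta1,\n",
+    "                                   beta2=beta2,\n",
+    "                                   epsilon=epsilon).minimize(avg_loss)\n",
+    "```"
+   ]
+  },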
"```python\n", " # Reshape outputs from a list of n_steps tensors each of shape [batch_size, 2*n_cell_dim]\n", " # to a single tensor of shape [n_steps*batch_size, 2*n_cell_dim]\n", @@ -650,13 +636,13 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ - "layer_6 = BiRNN(x, istate_fw, istate_bw, weights, biases)" + "layer_6 = BiRNN(x, weights, biases)" ] }, { @@ -1223,7 +1209,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 19, "metadata": { "collapsed": true },