Merge pull request #1821 from JRMeyer/docs
This commit is contained in:
parent
3409fde4a0
commit
91c5f90f3c
237
doc/AUGMENTATION.rst
Normal file
@@ -0,0 +1,237 @@
.. _training-data-augmentation:

Training Data Augmentation
==========================

This document is an overview of the augmentation techniques available for training with STT.

Training data augmentations can help STT models better transcribe new speech at deployment time. The basic intuition behind data augmentation is the following: by distorting, modifying, or adding to your existing audio data, you can create a training set many times larger than what you started with. If you use a larger training data set to train an STT model, you force the model to learn more generalizable characteristics of speech, making `overfitting <https://en.wikipedia.org/wiki/Overfitting>`_ more difficult. If you can't find a larger data set of speech, you can create one with data augmentation.

We have implemented a pre-processing pipeline with various augmentation techniques on audio data (i.e. raw ``PCM`` and spectrograms).

Each audio file in your training data will be potentially affected by the sequence of augmentations you specify. Whether or not an augmentation will *actually* get applied to a given audio file is determined by the augmentation's probability value. For example, a probability value of ``p=0.1`` means the corresponding augmentation has a 10% chance of being applied to a given audio file. This also means that augmentations are not mutually exclusive on a per-audio-file basis.

The ``--augment`` flag uses a common syntax for all augmentation types:

.. code-block::

  --augment augmentation_type1[param1=value1,param2=value2,...] --augment augmentation_type2[param1=value1,param2=value2,...] ...

For example, for the ``overlay`` augmentation:

.. code-block::

  python3 train.py --augment overlay[p=0.1,source=/path/to/audio.sdb,snr=20.0] ...

In the documentation below, whenever a value is specified as ``<float-range>`` or ``<int-range>``, it supports one of the following formats:

* ``<value>``: A constant (int or float) value.

* ``<value>~<r>``: A center value with a randomization radius around it. E.g. ``1.2~0.4`` will result in picking a uniformly random value between 0.8 and 1.6 on each sample augmentation.

* ``<start>:<end>``: The value will range from ``<start>`` at the beginning of the training to ``<end>`` at the end of the training. E.g. ``-0.2:1.2`` (float) or ``2000:4000`` (int).

* ``<start>:<end>~<r>``: Combination of the two previous cases with a ranging center value. E.g. ``4:6~2`` would pick values between 2 and 6 at the beginning of the training and between 4 and 8 at the end of the training.

Ranges specified with integer limits will only assume integer (rounded) values.
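
The following sketch combines the notations above; the augmentation values are chosen arbitrarily for illustration and are not a recommended configuration:

.. code-block:: bash

  python3 train.py \
    --augment volume[p=0.5,dbfs=-35.0~5.0] \
    --augment frequency_mask[p=0.5,n=1:3,size=2] \
    ...

Here ``dbfs=-35.0~5.0`` picks a random target volume between -40 and -30 dBFS for each augmented sample, ``n=1:3`` moves from 1 masked interval at the start of training to 3 at the end, and ``size=2`` stays constant.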

.. warning::
    When feature caching is enabled, by default the cache has no expiration limit and will be used for the entire training run. This will cause these augmentations to only be performed once during the first epoch, with the result being reused for subsequent epochs. This would not only hinder value ranges from reaching their intended final values, but could also lead to unintended over-fitting. In this case, the flag ``--cache_for_epochs N`` (with N > 1) should be used to periodically invalidate the cache after every N epochs and thus allow samples to be re-augmented in new ways and with current range-values.
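
For example, a run that enables feature caching might invalidate the cache every few epochs so that cached samples get re-augmented; the cache path and epoch counts below are placeholders:

.. code-block:: bash

  python3 train.py \
    --feature_cache /tmp/stt_feature_cache \
    --cache_for_epochs 5 \
    --augment pitch[p=0.1,pitch=1~0.2] \
    ...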

Every augmentation targets a certain representation of the sample - in this documentation these representations are referred to as *domains*.
Augmentations are applied in the following order:

1. **sample** domain: The sample has just been loaded and its waveform is represented as a NumPy array. For implementation reasons these augmentations are the only ones that can be "simulated" through ``bin/play.py``.

2. **signal** domain: The sample waveform is represented as a tensor.

3. **spectrogram** domain: The sample spectrogram is represented as a tensor.

4. **features** domain: The sample's mel spectrogram features are represented as a tensor.

Within a single domain, augmentations are applied in the same order as they appear in the command-line.
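
For example, in the following sketch both augmentations act on the spectrogram domain, so ``pitch`` is applied before ``tempo`` because it appears first on the command line (the probability and range values are placeholders):

.. code-block:: bash

  python3 train.py \
    --augment pitch[p=0.2,pitch=1~0.1] \
    --augment tempo[p=0.2,factor=1~0.25] \
    ...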


Sample domain augmentations
---------------------------

**Overlay augmentation** ``--augment overlay[p=<float>,source=<str>,snr=<float-range>,layers=<int-range>]``
  Layers another audio source (multiple times) onto augmented samples.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **source**: path to the sample collection to use for augmenting (\*.sdb or \*.csv file). It will be repeated if there are not enough samples left.

  * **snr**: signal-to-noise ratio in dB - positive values lower the volume of the overlay in relation to the sample

  * **layers**: number of layers added onto the sample (e.g. 10 layers of speech to get a "cocktail-party effect"). A layer is just a sample of the same duration as the sample to augment. It gets stitched together from as many source samples as required.
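
  As a sketch of the "cocktail-party" use case mentioned above (the noise collection path and parameter values are placeholders):

  .. code-block:: bash

    python3 train.py --augment overlay[p=0.3,source=/data/noise.sdb,layers=10,snr=15.0~5.0] ...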


**Reverb augmentation** ``--augment reverb[p=<float>,delay=<float-range>,decay=<float-range>]``
  Adds simplified (no all-pass filters) `Schroeder reverberation <https://ccrma.stanford.edu/~jos/pasp/Schroeder_Reverberators.html>`_ to the augmented samples.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **delay**: time delay in ms for the first signal reflection - higher values widen the perceived "room"

  * **decay**: sound decay in dB per reflection - higher values result in a less reflective perceived "room"


**Resample augmentation** ``--augment resample[p=<float>,rate=<int-range>]``
  Resamples augmented samples to another sample rate and then resamples back to the original sample rate.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **rate**: sample rate to resample to


**Codec augmentation** ``--augment codec[p=<float>,bitrate=<int-range>]``
  Compresses and then decompresses augmented samples using the lossy Opus audio codec.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **bitrate**: bitrate used during compression


**Volume augmentation** ``--augment volume[p=<float>,dbfs=<float-range>]``
  Measures and levels augmented samples to a target dBFS value.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **dbfs**: target volume in dBFS (the default value of 3.0103 will normalize min and max amplitudes to -1.0/1.0)

Spectrogram domain augmentations
--------------------------------

**Pitch augmentation** ``--augment pitch[p=<float>,pitch=<float-range>]``
  Scales the spectrogram on the frequency axis and thus changes pitch.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **pitch**: pitch factor by which the frequency axis is scaled (e.g. a value of 2.0 will raise audio frequency by one octave)


**Tempo augmentation** ``--augment tempo[p=<float>,factor=<float-range>]``
  Scales the spectrogram on the time axis and thus changes playback tempo.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **factor**: speed factor by which the time axis is stretched or shrunk (e.g. a value of 2.0 will double playback tempo)


**Warp augmentation** ``--augment warp[p=<float>,nt=<int-range>,nf=<int-range>,wt=<float-range>,wf=<float-range>]``
  Applies a non-linear image warp to the spectrogram. This is achieved by randomly shifting a grid of equally distributed warp points along the time and frequency axes.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **nt**: number of equally distributed warp grid lines along the time axis of the spectrogram (excluding the edges)

  * **nf**: number of equally distributed warp grid lines along the frequency axis of the spectrogram (excluding the edges)

  * **wt**: standard deviation of the random shift applied to warp points along the time axis (0.0 = no warp, 1.0 = half the distance to the neighbour point)

  * **wf**: standard deviation of the random shift applied to warp points along the frequency axis (0.0 = no warp, 1.0 = half the distance to the neighbour point)


**Frequency mask augmentation** ``--augment frequency_mask[p=<float>,n=<int-range>,size=<int-range>]``
  Sets frequency intervals within the augmented samples to zero (silence) at random frequencies. See the `SpecAugment paper <https://arxiv.org/abs/1904.08779>`_ for more details.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **n**: number of intervals to mask

  * **size**: number of frequency bands to mask per interval

Multi domain augmentations
--------------------------

**Time mask augmentation** ``--augment time_mask[p=<float>,n=<int-range>,size=<float-range>,domain=<domain>]``
  Sets time intervals within the augmented samples to zero (silence) at random positions.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **n**: number of intervals to set to zero

  * **size**: duration of intervals in ms

  * **domain**: data representation to apply augmentation to - "signal", "features" or "spectrogram" (default)


**Dropout augmentation** ``--augment dropout[p=<float>,rate=<float-range>,domain=<domain>]``
  Zeros random data points of the targeted data representation.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **rate**: dropout rate ranging from 0.0 for no dropout to 1.0 for 100% dropout

  * **domain**: data representation to apply augmentation to - "signal", "features" or "spectrogram" (default)


**Add augmentation** ``--augment add[p=<float>,stddev=<float-range>,domain=<domain>]``
  Adds random values picked from a normal distribution (with a mean of 0.0) to all data points of the targeted data representation.

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **stddev**: standard deviation of the normal distribution to pick values from

  * **domain**: data representation to apply augmentation to - "signal", "features" (default) or "spectrogram"


**Multiply augmentation** ``--augment multiply[p=<float>,stddev=<float-range>,domain=<domain>]``
  Multiplies all data points of the targeted data representation with random values picked from a normal distribution (with a mean of 1.0).

  * **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method

  * **stddev**: standard deviation of the normal distribution to pick values from

  * **domain**: data representation to apply augmentation to - "signal", "features" (default) or "spectrogram"


Example training with all augmentations:

.. code-block:: bash

  python -u train.py \
    --train_files "train.sdb" \
    --feature_cache ./feature.cache \
    --cache_for_epochs 10 \
    --epochs 100 \
    --augment overlay[p=0.5,source=noise.sdb,layers=1,snr=50:20~10] \
    --augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
    --augment resample[p=0.1,rate=12000:8000~4000] \
    --augment codec[p=0.1,bitrate=48000:16000] \
    --augment volume[p=0.1,dbfs=-10:-40] \
    --augment pitch[p=0.1,pitch=1~0.2] \
    --augment tempo[p=0.1,factor=1~0.5] \
    --augment warp[p=0.1,nt=4,nf=1,wt=0.5:1.0,wf=0.1:0.2] \
    --augment frequency_mask[p=0.1,n=1:3,size=1:5] \
    --augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40] \
    --augment dropout[p=0.1,rate=0.05] \
    --augment add[p=0.1,domain=signal,stddev=0~0.5] \
    --augment multiply[p=0.1,domain=features,stddev=0~0.5] \
    [...]


The ``bin/play.py`` and ``bin/data_set_tool.py`` tools also support ``--augment`` parameters (for sample domain augmentations) and can be used for experimenting with different configurations or creating augmented data sets.

Example of playing all samples with reverberation and maximized volume:

.. code-block:: bash

  bin/play.py --augment reverb[p=0.1,delay=50.0,decay=2.0] --augment volume --random test.sdb

Example simulation of the codec augmentation of a wav-file, first at the beginning and then at the end of an epoch:

.. code-block:: bash

  bin/play.py --augment codec[p=0.1,bitrate=48000:16000] --clock 0.0 test.wav
  bin/play.py --augment codec[p=0.1,bitrate=48000:16000] --clock 1.0 test.wav

Example of creating a pre-augmented test set:

.. code-block:: bash

  bin/data_set_tool.py \
    --augment overlay[source=noise.sdb,layers=1,snr=20~10] \
    --augment resample[rate=12000:8000~4000] \
    test.sdb test-augmented.sdb

8
doc/BYTE_OUTPUT_MODE.rst
Normal file
@@ -0,0 +1,8 @@
.. _byte-output-mode:

Training in byte output mode
============================

🐸STT includes a ``byte output mode`` which can be useful when working with languages with very large alphabets, such as Mandarin Chinese.

This training mode is experimental, and has only been used for Mandarin Chinese.
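
The mode is enabled with the ``--bytes_output_mode`` flag at training and export time (see the decoder documentation for details). A minimal sketch of a training invocation, with placeholder file names:

.. code-block:: bash

  python3 train.py \
    --train_files zh_train.csv \
    --dev_files zh_dev.csv \
    --test_files zh_test.csv \
    --bytes_output_mode
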
10
doc/CHECKPOINTING.rst
Normal file
@@ -0,0 +1,10 @@
.. _checkpointing:

Checkpointing
=============

Checkpoints are representations of the parameters of a neural network. During training, model parameters are continually updated, and checkpoints allow graceful interruption of a training run without data loss. If you interrupt a training run for any reason, you can pick up where you left off by using the checkpoints as a starting place. This is the exact same logic behind :ref:`model fine-tuning <transfer-learning>`.

Checkpointing occurs at a configurable time interval. Resuming from checkpoints happens automatically by re-starting training with the same ``--checkpoint_dir`` as the former run. Alternatively, you can specify more fine-grained options with ``--load_checkpoint_dir`` and ``--save_checkpoint_dir``, which specify separate locations to use for loading and saving checkpoints respectively.
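
For instance, to resume or fine-tune from an existing checkpoint directory while saving new checkpoints somewhere else, a run could look like this (paths and file names are placeholders):

.. code-block:: bash

  $ python3 train.py \
      --load_checkpoint_dir path/to/existing/checkpoints \
      --save_checkpoint_dir path/to/new/checkpoints \
      --train_files train.csv \
      --dev_files dev.csv \
      --test_files test.csv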

Be aware that checkpoints are only valid for the same model geometry from which they were generated. If you experience error messages that certain ``Tensors`` have incompatible dimensions, you might be trying to use checkpoints with an incompatible architecture.

44
doc/COMMON_VOICE_DATA.rst
Normal file
@@ -0,0 +1,44 @@
.. _common-voice-data:

Common Voice training data
==========================

This document gives some information about using Common Voice data with STT. If you're in need of training data, the Common Voice corpus is a good place to start.

Common Voice consists of voice data that was donated through Mozilla's `Common Voice <https://commonvoice.mozilla.org/>`_ initiative. You can download the data sets for various languages `here <https://commonvoice.mozilla.org/data>`_.

After you download and extract a data set for one language, you'll find the following contents:

* ``.tsv`` files, containing metadata such as text transcripts
* ``.mp3`` audio files, located in the ``clips`` directory

🐸STT cannot directly work with Common Voice data, so you should run our importer script ``bin/import_cv2.py`` to format the data correctly:

.. code-block:: bash

  bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/common-voice/archive

Providing a filter alphabet is optional. This alphabet is used to exclude all audio files whose transcripts contain characters not in the specified alphabet. Running the importer with ``-h`` will show you additional options.

The importer will create a new ``WAV`` file for every ``MP3`` file in the ``clips`` directory. The importer will also create the following ``CSV`` files:

* ``clips/train.csv``
* ``clips/dev.csv``
* ``clips/test.csv``

The CSV files contain the following fields:

* ``wav_filename`` - path to the audio file; may be absolute or relative. Our importer produces relative paths
* ``wav_filesize`` - sample size given in bytes, used for sorting the data before training. Expects an integer
* ``transcript`` - transcription target for the sample

To use Common Voice data for training, validation and testing, you should pass the ``CSV`` filenames to ``train.py`` via ``--train_files``, ``--dev_files``, and ``--test_files``.

For example, if you downloaded, extracted, and imported the French language data from Common Voice, you will have a new local directory named ``fr``. You can train STT with this new French data as follows:

.. code-block:: bash

  $ python3 train.py \
      --train_files fr/clips/train.csv \
      --dev_files fr/clips/dev.csv \
      --test_files fr/clips/test.csv

6
doc/DATASET_IMPORTERS.rst
Normal file
@@ -0,0 +1,6 @@
.. _data-importers:

Data Importers
==============

We supply importer scripts for various publicly available speech data sets. You can find these scripts in the STT repo under the ``bin/`` directory, but not all of the data sets are free. See ``bin/import_librivox.py`` for an example of how to import and preprocess a large, free dataset for training with 🐸STT.
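
As with the Common Voice importer described above, each importer script documents its own options; a quick way to inspect them (a sketch, assuming the script uses standard argument parsing) is:

.. code-block:: bash

  python3 bin/import_librivox.py --help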

@@ -4,31 +4,31 @@ CTC beam search decoder
=======================

Introduction
^^^^^^^^^^^^
------------

🐸STT uses the `Connectionist Temporal Classification <http://www.cs.toronto.edu/~graves/icml_2006.pdf>`_ loss function. For an excellent explanation of CTC and its usage, see this Distill article: `Sequence Modeling with CTC <https://distill.pub/2017/ctc/>`_. This document assumes the reader is familiar with the concepts described in that article, and describes 🐸STT specific behaviors that developers building systems with 🐸STT should know to avoid problems.

Note: Documentation for the tooling for creating custom scorer packages is available in :ref:`scorer-scripts`.
Note: Documentation for the tooling for creating custom scorer packages is available in :ref:`language-model`.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in `BCP 14 <https://tools.ietf.org/html/bcp14>`_ when, and only when, they appear in all capitals, as shown here.


External scorer
^^^^^^^^^^^^^^^
---------------

🐸STT clients support OPTIONAL use of an external language model to improve the accuracy of the predicted transcripts. In the code, command line parameters, and documentation, this is referred to as a "scorer". The scorer is used to compute the likelihood (also called a score, hence the name "scorer") of sequences of words or characters in the output, to guide the decoder towards more likely results. This improves accuracy significantly.

The use of an external scorer is fully optional. When an external scorer is not specified, 🐸STT still uses a beam search decoding algorithm, but without any outside scoring.

Currently, the 🐸STT external scorer is implemented with `KenLM <https://kheafield.com/code/kenlm/>`_, plus some tooling to package the necessary files and metadata into a single ``.scorer`` package. The tooling lives in ``data/lm/``. The scripts included in ``data/lm/`` can be used and modified to build your own language model based on your particular use case or language. See :ref:`scorer-scripts` for more details on how to reproduce our scorer file as well as create your own.
Currently, the 🐸STT external scorer is implemented with `KenLM <https://kheafield.com/code/kenlm/>`_, plus some tooling to package the necessary files and metadata into a single ``.scorer`` package. The tooling lives in ``data/lm/``. The scripts included in ``data/lm/`` can be used and modified to build your own language model based on your particular use case or language. See :ref:`language-model` for more details on how to reproduce our scorer file as well as create your own.

The scripts are geared towards replicating the language model files we release as part of `STT model releases <https://github.com/coqui-ai/STT/releases/latest>`_, but modifying them to use different datasets or language model construction parameters should be simple.


Decoding modes
^^^^^^^^^^^^^^
--------------

🐸STT currently supports two modes of operation with significant differences at both training and decoding time. Note that Bytes output mode is experimental and has not been tested for languages other than Chinese Mandarin.
🐸STT currently supports two modes of operation with significant differences at both training and decoding time. Note that ``bytes output`` mode is experimental and has not been tested for languages other than Chinese Mandarin.


Default mode (alphabet based)
@@ -44,9 +44,9 @@ Bytes output mode

In bytes output mode the model predicts UTF-8 bytes directly instead of letters from an alphabet file. This idea was proposed in the paper `Bytes Are All You Need <https://arxiv.org/abs/1811.09021>`_. This mode is enabled with the ``--bytes_output_mode`` flag at training and export time. At training time, the alphabet file is not used. Instead, the model is forced to have 256 labels, with labels 0-254 corresponding to UTF-8 byte values 1-255, and label 255 is used for the CTC blank symbol. If using an external scorer at decoding time, it MUST be built according to the instructions that follow.

Bytes output mode can be useful for languages with very large alphabets, such as Mandarin written with Simplified Chinese characters. It may also be useful for building multi-language models, or as a base for transfer learning. Currently these cases are untested and unsupported. Note that bytes output mode makes assumptions that hold for Mandarin written with Simplified Chinese characters and may not hold for other languages.
Byte output mode can be useful for languages with very large alphabets, such as Mandarin written with Simplified Chinese characters. It may also be useful for building multi-language models, or as a base for transfer learning. Currently these cases are untested and unsupported. Note that bytes output mode makes assumptions that hold for Mandarin written with Simplified Chinese characters and may not hold for other languages.

UTF-8 scorers are character based (more specifically, Unicode codepoint based), but the way they are used is similar to a word based scorer where each "word" is a sequence of UTF-8 bytes representing a single Unicode codepoint. This means that the input text used to create UTF-8 scorers should contain space separated Unicode codepoints. For example, the following input text:
UTF-8 byte mode language models are character based (more specifically, Unicode codepoint based), but the way they are used is similar to a word based scorer where each "word" is a sequence of UTF-8 bytes representing a single Unicode codepoint. This means that the input text used to create ``byte output mode`` scorers should contain space separated Unicode codepoints. For example, the following input text:

``早 上 好``

@@ -58,13 +58,13 @@ corresponds to the following three "words", or UTF-8 byte sequences:

At decoding time, the scorer is queried every time a Unicode codepoint is predicted, instead of when a space character is predicted. From the language modeling perspective, this is a character based model. From the implementation perspective, this is a word based model, because each character is composed of multiple labels.

**Acoustic models trained with ``--bytes_output_mode`` MUST NOT be used with an alphabet based scorer. Conversely, acoustic models trained with an alphabet file MUST NOT be used with a UTF-8 scorer.**
**Acoustic models trained with ``--bytes_output_mode`` MUST NOT be used with an alphabet based scorer. Conversely, acoustic models trained with an alphabet file MUST NOT be used with a UTF-8 byte output mode scorer.**

UTF-8 scorers can be built by using an input corpus with space separated codepoints. If your corpus only contains single codepoints separated by spaces, ``generate_scorer_package`` should automatically enable bytes output mode, and it should print the message "Looks like a character based model."
``byte output mode`` language models can be built by using an input corpus with space separated codepoints. If your corpus only contains single codepoints separated by spaces, ``generate_scorer_package`` should automatically enable bytes output mode, and it should print the message "Looks like a character based model."

If the message "Doesn't look like a character based model." is printed, you should double check your inputs to make sure it only contains single codepoints separated by spaces. Bytes output mode can be forced by specifying the ``--force_bytes_output_mode`` flag when running ``generate_scorer_package``, but it is NOT RECOMMENDED.

See :ref:`scorer-scripts` for more details on using ``generate_scorer_package``.
See :ref:`language-model` for more details on using ``generate_scorer_package``.

Because KenLM uses spaces as a word separator, the resulting language model will not include space characters in it. If you wish to use bytes output mode but still model spaces, you need to replace spaces in the input corpus with a different character **before** converting it to space separated codepoints. For example:

@@ -34,7 +34,9 @@ You can find pre-trained models ready for deployment on the 🐸STT `releases pa

In every 🐸STT official release, there are several kinds of model files provided. For the acoustic model there are two file extensions: ``.pbmm`` and ``.tflite``. Files ending in ``.pbmm`` are compatible with clients and language bindings built against the standard TensorFlow runtime. ``.pbmm`` files are also compatible with CUDA enabled clients and language bindings. Files ending in ``.tflite``, on the other hand, are only compatible with clients and language bindings built against the `TensorFlow Lite runtime <https://www.tensorflow.org/lite/>`_. TFLite models are optimized for size and performance on low-power devices. You can find a full list of supported platforms and TensorFlow runtimes at :ref:`supported-platforms-deployment`.

For language models, there is only one file extension: ``.scorer``. Language models can run on any supported device, regardless of TensorFlow runtime. You can read more about language models with regard to :ref:`the decoding process <decoder-docs>` and :ref:`how scorers are generated <scorer-scripts>`.
For language models, there is only one file extension: ``.scorer``. Language models can run on any supported device, regardless of TensorFlow runtime. You can read more about language models with regard to :ref:`the decoding process <decoder-docs>` and :ref:`how scorers are generated <language-model>`.

.. _model-data-match:

How will a model perform on my data?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -51,9 +53,9 @@ If you take a 🐸STT model trained on English, and pass Spanish into it, you sh

An acoustic model (i.e. ``.pbmm`` or ``.tflite``) has "learned" how to transcribe a certain language, and the model probably understands some accents better than others. In addition to languages and accents, acoustic models are sensitive to the style of speech, the topic of speech, and the demographics of the person speaking. The language model (``.scorer``) has been trained on text alone. As such, the language model is sensitive to how well the topic and style of speech matches that of the text used in training. The 🐸STT `release notes <https://github.com/coqui-ai/STT/releases/tag/v0.9.3>`_ include detailed information on the data used to train the models. If the data used for training the off-the-shelf models does not align with your intended use case, it may be necessary to adapt or train new models in order to improve transcription on your data.

Training your own language model is often a good way to improve transcription on your audio. The process and tools used to generate a language model are described in :ref:`scorer-scripts` and general information can be found in :ref:`decoder-docs`. Generating a scorer from a constrained topic dataset is a quick process and can bring significant accuracy improvements if your audio is from a specific topic.
Training your own language model is often a good way to improve transcription on your audio. The process and tools used to generate a language model are described in :ref:`language-model` and general information can be found in :ref:`decoder-docs`. Generating a scorer from a constrained topic dataset is a quick process and can bring significant accuracy improvements if your audio is from a specific topic.

Acoustic model training is described in :ref:`training-docs`. Fine tuning an off-the-shelf acoustic model to your own data can be a good way to improve performance. See the :ref:`fine tuning and transfer learning sections <training-fine-tuning>` for more information.
Acoustic model training is described in :ref:`intro-training-docs`. Fine tuning an off-the-shelf acoustic model to your own data can be a good way to improve performance. See the :ref:`fine tuning and transfer learning sections <training-fine-tuning>` for more information.

Model compatibility
^^^^^^^^^^^^^^^^^^^
@@ -175,7 +177,7 @@ See the :ref:`TypeScript client <js-api-example>` for an example of how to use t
Installing bindings from source
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If pre-built binaries aren't available for your system, you'll need to install them from scratch. Follow the :ref:`native client build and installation instructions <native-build-client>`.
If pre-built binaries aren't available for your system, you'll need to install them from scratch. Follow the :ref:`native client build and installation instructions <build-native-client>`.

Dockerfile for building from source
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

35
doc/Dockerfile.docs
Normal file
@@ -0,0 +1,35 @@
# Build with docker build . -f doc/Dockerfile.docs --output doc-html
# Built documentation will be placed in the doc-html folder.

# Ubuntu Focal for Python >= 3.7
FROM ubuntu:focal-20210325 as build
ENV DEBIAN_FRONTEND="noninteractive"
RUN apt-get update -y && apt-get install -y --no-install-recommends \
        doxygen \
        git \
        npm \
        make \
        python3-pip \
        python3-venv \
        && rm -rf /var/lib/apt/lists/*

# Setup virtualenv
ENV VIRTUAL_ENV=/tmp/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install -U pip setuptools wheel

# First just requirements so we don't re-run pip install every time
COPY doc/requirements.txt /code/doc/requirements.txt
COPY doc/Makefile /code/doc/Makefile
WORKDIR /code/doc
RUN make python-reqs

# Then rest of code for docs build
COPY . /code
WORKDIR /code/doc
RUN make html

# Output target only containing built files
FROM scratch as docs
COPY --from=build /code/doc/.build/html /

@@ -14,7 +14,7 @@ Creating a model instance and loading model
   :end-before: sphinx-doc: csharp_ref_model_stop

Deploying trained model
--------------------
-----------------------

.. literalinclude:: ../native_client/dotnet/STTConsole/Program.cs
   :language: csharp

56
doc/EXPORTING_MODELS.rst
Normal file
@@ -0,0 +1,56 @@
.. _exporting-checkpoints:

Exporting a model for deployment
================================

After you train an STT model, your model will be stored on disk as a :ref:`checkpoint file <checkpointing>`. Model checkpoints are useful for resuming training at a later date, but they are not the correct format for deploying a model into production. The best model format for deployment is a protobuf file.

This document explains how to export model checkpoints as a protobuf file.

How to export a model
---------------------

The simplest way to export STT model checkpoints for deployment is via ``train.py`` and the ``--export_dir`` flag.

.. code-block:: bash

  $ python3 train.py \
      --checkpoint_dir path/to/existing/model/checkpoints \
      --export_dir where/to/export/new/protobuf

However, you may want to export a model for small devices or for more efficient memory usage. In this case, follow the steps below.

Exporting as memory-mapped
^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the protobuf exported by ``train.py`` will be loaded in memory every time the model is deployed. This results in extra loading time and memory consumption. Creating a memory-mapped protobuf file will avoid these issues.

First, export your checkpoints to a protobuf with ``train.py``:

.. code-block:: bash

  $ python3 train.py \
      --checkpoint_dir path/to/existing/model/checkpoints \
      --export_dir where/to/export/new/protobuf

Second, convert the protobuf to a memory-mapped protobuf with ``convert_graphdef_memmapped_format``:

.. code-block::

  $ convert_graphdef_memmapped_format \
      --in_graph=output_graph.pb \
      --out_graph=output_graph.pbmm

``convert_graphdef_memmapped_format`` is a dedicated tool to convert regular protobuf files to memory-mapped protobufs. You can find this tool pre-compiled on the STT `release page <https://github.com/coqui-ai/STT/releases>`_. You should download and decompress ``convert_graphdef_memmapped_format`` before use. Upon a successful conversion, ``convert_graphdef_memmapped_format`` will report conversion of a non-zero number of nodes.

Exporting for small devices
^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to deploy an STT model on a small device, you might consider exporting the model with `TensorFlow Lite <https://www.tensorflow.org/lite>`_ support. Export STT model checkpoints for TensorFlow Lite via ``train.py`` and the ``--export_tflite`` flag.

.. code-block:: bash

  $ python3 train.py \
      --checkpoint_dir path/to/existing/model/checkpoints \
      --export_dir where/to/export/new/protobuf \
      --export_tflite

@@ -1,7 +1,9 @@
Geometric Constants
===================
.. _model-geometry:

This is about several constants related to the geometry of the network.
Model Geometry
==============

This document explains several constants related to the geometry of an STT model.

n_input
-------

@@ -14,7 +14,7 @@ Creating a model instance and loading model
   :end-before: sphinx-doc: java_ref_model_stop

Deploying trained model
--------------------
-----------------------

.. literalinclude:: ../native_client/java/app/src/main/java/ai/coqui/sttexampleapp/STTActivity.java
   :language: java

@@ -1,4 +1,4 @@
.. _scorer-scripts:
.. _language-model:

How to Train a Language Model
=============================
@@ -45,7 +45,7 @@ Train the Language Model

Assuming you found and formatted a text corpus, the next step is to use that text to train a KenLM language model with ``data/lm/generate_lm.py``.

Before training the language model, you should first familiarize yourself with the `KenLM toolkit <https://kheafield.com/code/kenlm/>`_. Most of the options exposed by the ``generate_lm.py`` script are simply forwarded to KenLM options of the same name, so you should read the KenLM documentation in order to fully understand their behavior.
For more custom use cases, you might familiarize yourself with the `KenLM toolkit <https://kheafield.com/code/kenlm/>`_. Most of the options exposed by the ``generate_lm.py`` script are simply forwarded to KenLM options of the same name, so you should read the `KenLM documentation <https://kheafield.com/code/kenlm/estimation/>`_ in order to fully understand their behavior.

.. code-block:: bash

18
doc/MIXED_PRECISION.rst
Normal file
@@ -0,0 +1,18 @@
.. _automatic-mixed-precision:

Automatic Mixed Precision
=========================

Training with `automatic mixed precision <https://medium.com/tensorflow/automatic-mixed-precision-in-tensorflow-for-faster-ai-training-on-nvidia-gpus-6033234b2540>`_ is available when training STT on a GPU.

Mixed precision training makes use of both ``FP32`` and ``FP16`` precisions where appropriate. ``FP16`` operations can leverage the Tensor cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput. Mixed precision training often allows larger batch sizes. Automatic mixed precision training can be enabled by including the flag ``--automatic_mixed_precision`` at training time:

.. code-block:: bash

  $ python3 train.py \
      --train_files train.csv \
      --dev_files dev.csv \
      --test_files test.csv \
      --automatic_mixed_precision

On a Volta generation V100 GPU, automatic mixed precision can speed up 🐸STT training and evaluation by approximately 30% to 40%.

17
doc/Makefile
@@ -8,29 +8,24 @@ SPHINXPROJ = "Coqui STT"
SOURCEDIR = .
BUILDDIR = .build

PIP_INSTALL ?= pip3 install --user
PIP_INSTALL ?= python3 -m pip install

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help pip3 Makefile
.PHONY: help python-reqs

pip3:
	$(PIP_INSTALL) -r ../taskcluster/docs-requirements.txt
python-reqs: requirements.txt
	$(PIP_INSTALL) -r requirements.txt

submodule:
	git submodule update --init --remote -- ../doc/examples
	git submodule update --init --remote -- examples || true

# Add submodule update dependency to Sphinx's "html" target
html: Makefile submodule pip3
html: Makefile submodule python-reqs
	@PATH=$$HOME/.local/bin:`pwd`/../node_modules/.bin/:$$PATH \
	$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

dist: html
	cd $(BUILDDIR)/html/ && zip -r9 ../../html.zip *

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile pip3
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

@@ -16,7 +16,7 @@ Creating a model instance and loading model
   :end-before: sphinx-doc: js_ref_model_stop

Deploying trained model
--------------------
-----------------------

.. literalinclude:: ../native_client/javascript/client.ts
   :language: javascript

53
doc/PARALLEL_OPTIMIZATION.rst
Normal file
@@ -0,0 +1,53 @@
.. _parallel-training-optimization:

Parallel Training Optimization
==============================

We use hybrid optimization to train the 🐸STT model across multiple GPUs on a single host. Parallel optimization can take on various forms. For example, one can use asynchronous updates of the model, synchronous updates of the model, or some combination of the two.

Asynchronous Parallel Optimization
----------------------------------

In asynchronous parallel optimization, for example, one places the model initially in CPU memory. Then each of the :math:`G` GPUs obtains a mini-batch of data along with the current model parameters. Using this mini-batch each GPU then computes the gradients for all model parameters and sends these gradients back to the CPU when the GPU is done with its mini-batch. The CPU then asynchronously updates the model parameters whenever it receives a set of gradients from a GPU.

Asynchronous parallel optimization has several advantages and several disadvantages. One large advantage is throughput. No GPU will ever be waiting idle. When a GPU is done processing a mini-batch, it can immediately obtain the next mini-batch to process. It never has to wait on other GPUs to finish their mini-batch. However, this means that the model updates will also be asynchronous, which can cause problems.

For example, one may have model parameters :math:`W` on the CPU and send mini-batch :math:`n` to GPU 1 and send mini-batch :math:`n+1` to GPU 2. As processing is asynchronous, GPU 2 may finish before GPU 1 and thus update the CPU's model parameters :math:`W` with its gradients :math:`\Delta W_{n+1}(W)`, where the subscript :math:`n+1` identifies the mini-batch and the argument :math:`W` the location at which the gradient was evaluated.

This results in the new model parameters:

.. math::
    W + \Delta W_{n+1}(W).

Next GPU 1 could finish with its mini-batch and update the parameters to

.. math::
    W + \Delta W_{n+1}(W) + \Delta W_{n}(W).

The problem with this is that :math:`\Delta W_{n}(W)` is evaluated at :math:`W` and not :math:`W + \Delta W_{n+1}(W)`. Hence, the direction of the gradient :math:`\Delta W_{n}(W)` is slightly incorrect, as it is evaluated at the wrong location. This can be counteracted through synchronous updates of the model, but that is also problematic.

Synchronous Optimization
------------------------

Synchronous optimization solves the problem we saw above. In synchronous optimization, one places the model initially in CPU memory. Then one of the :math:`G` GPUs is given a mini-batch of data along with the current model parameters. Using the mini-batch the GPU computes the gradients for all model parameters and sends the gradients back to the CPU. The CPU then updates the model parameters and starts the process of sending out the next mini-batch.

As one can readily see, synchronous optimization does not have the problem we found in the last section, that of incorrect gradients. However, synchronous optimization can only make use of a single GPU at a time. So, when we have a multi-GPU setup, :math:`G > 1`, all but one of the GPUs will remain idle, which is unacceptable. However, there is a third alternative which combines the advantages of asynchronous and synchronous optimization.

Hybrid Parallel Optimization
----------------------------

Hybrid parallel optimization combines most of the benefits of asynchronous and synchronous optimization. It allows for multiple GPUs to be used, but does not suffer from the incorrect gradient problem exhibited by asynchronous optimization.

In hybrid parallel optimization one places the model initially in CPU memory. Then, as in asynchronous optimization, each of the :math:`G` GPUs obtains a mini-batch of data along with the current model parameters. Using the mini-batch each of the GPUs then computes the gradients for all model parameters and sends these gradients back to the CPU. Now, in contrast to asynchronous optimization, the CPU waits until each GPU is finished with its mini-batch, then takes the mean of all the gradients from the :math:`G` GPUs and updates the model with this mean gradient.

.. image:: ../images/Parallelism.png
    :alt: Image shows a diagram with arrows displaying the flow of information between devices during training. A CPU device sends weights and gradients to one or more GPU devices, which run an optimization step and then return the new parameters to the CPU, which averages them and starts a new training iteration.

Hybrid parallel optimization has several advantages and few disadvantages. As in asynchronous parallel optimization, hybrid parallel optimization allows for one to use multiple GPUs in parallel. Furthermore, unlike asynchronous parallel optimization, the incorrect gradient problem is not present here. In fact, hybrid parallel optimization performs as if one is working with a single mini-batch which is :math:`G` times the size of a mini-batch handled by a single GPU. However, hybrid parallel optimization is not perfect. If one GPU is slower than all the others in completing its mini-batch, all other GPUs will have to sit idle until this straggler finishes with its mini-batch. This hurts throughput. But, if all GPUs are of the same make and model, this problem should be minimized.

So, relatively speaking, hybrid parallel optimization seems to have more advantages and fewer disadvantages than both asynchronous and synchronous optimization. So, we will, for our work, use this hybrid model.

Adam Optimization
-----------------

In contrast to `Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_, in which `Nesterov’s Accelerated Gradient Descent <https://www.cs.toronto.edu/~fritz/absps/momentum.pdf>`_ was used, we use the `Adam <http://arxiv.org/abs/1412.6980>`_ method for optimization, because in our experience the latter requires less fine-tuning.

@@ -1,105 +0,0 @@
Parallel Optimization
=====================

This is how we implement train the 🐸STT model across GPUs on a single host. Parallel optimization can take on various forms. For example one can use asynchronous updates of the model, synchronous updates of the model, or some combination of the two.

Asynchronous Parallel Optimization
----------------------------------

In asynchronous parallel optimization, for example, one places the model initially in CPU memory. Then each of the :math:`G` GPUs obtains a mini-batch of data along with the current model parameters. Using this mini-batch each GPU then computes the gradients for all model parameters and sends these gradients back to the CPU when the GPU is done with its mini-batch. The CPU then asynchronously updates the model parameters whenever it receives a set of gradients from a GPU.

Asynchronous parallel optimization has several advantages and several disadvantages. One large advantage is throughput. No GPU will ever be waiting idle. When a GPU is done processing a mini-batch, it can immediately obtain the next mini-batch to process. It never has to wait on other GPUs to finish their mini-batch. However, this means that the model updates will also be asynchronous which can have problems.

For example, one may have model parameters :math:`W` on the CPU and send mini-batch :math:`n` to GPU 1 and send mini-batch :math:`n+1` to GPU 2. As processing is asynchronous, GPU 2 may finish before GPU 1 and thus update the CPU's model parameters :math:`W` with its gradients :math:`\Delta W_{n+1}(W)`, where the subscript :math:`n+1` identifies the mini-batch and the argument :math:`W` the location at which the gradient was evaluated. This results in the new model parameters

.. math::
    W + \Delta W_{n+1}(W).

Next GPU 1 could finish with its mini-batch and update the parameters to

.. math::
    W + \Delta W_{n+1}(W) + \Delta W_{n}(W).

The problem with this is that :math:`\Delta W_{n}(W)` is evaluated at :math:`W` and not :math:`W + \Delta W_{n+1}(W)`. Hence, the direction of the gradient :math:`\Delta W_{n}(W)` is slightly incorrect as it is evaluated at the wrong location. This can be counteracted through synchronous updates of model, but this is also problematic.

Synchronous Optimization
------------------------

Synchronous optimization solves the problem we saw above. In synchronous optimization, one places the model initially in CPU memory. Then one of the `G` GPUs is given a mini-batch of data along with the current model parameters. Using the mini-batch the GPU computes the gradients for all model parameters and sends the gradients back to the CPU. The CPU then updates the model parameters and starts the process of sending out the next mini-batch.

As on can readily see, synchronous optimization does not have the problem we found in the last section, that of incorrect gradients. However, synchronous optimization can only make use of a single GPU at a time. So, when we have a multi-GPU setup, :math:`G > 1`, all but one of the GPUs will remain idle, which is unacceptable. However, there is a third alternative which is combines the advantages of asynchronous and synchronous optimization.

Hybrid Parallel Optimization
----------------------------

Hybrid parallel optimization combines most of the benefits of asynchronous and synchronous optimization. It allows for multiple GPUs to be used, but does not suffer from the incorrect gradient problem exhibited by asynchronous optimization.

In hybrid parallel optimization one places the model initially in CPU memory. Then, as in asynchronous optimization, each of the :math:`G` GPUs obtains a mini-batch of data along with the current model parameters. Using the mini-batch each of the GPUs then computes the gradients for all model parameters and sends these gradients back to the CPU. Now, in contrast to asynchronous optimization, the CPU waits until each GPU is finished with its mini-batch then takes the mean of all the gradients from the :math:`G` GPUs and updates the model with this mean gradient.

.. image:: ../images/Parallelism.png
    :alt: Image shows a diagram with arrows displaying the flow of information between devices during training. A CPU device sends weights and gradients to one or more GPU devices, which run an optimization step and then return the new parameters to the CPU, which averages them and starts a new training iteration.

Hybrid parallel optimization has several advantages and few disadvantages. As in asynchronous parallel optimization, hybrid parallel optimization allows for one to use multiple GPUs in parallel. Furthermore, unlike asynchronous parallel optimization, the incorrect gradient problem is not present here. In fact, hybrid parallel optimization performs as if one is working with a single mini-batch which is :math:`G` times the size of a mini-batch handled by a single GPU. However, hybrid parallel optimization is not perfect. If one GPU is slower than all the others in completing its mini-batch, all other GPUs will have to sit idle until this straggler finishes with its mini-batch. This hurts throughput. But, if all GPUs are of the same make and model, this problem should be minimized.

So, relatively speaking, hybrid parallel optimization seems the have more advantages and fewer disadvantages as compared to both asynchronous and synchronous optimization. So, we will, for our work, use this hybrid model.

Adam Optimization
-----------------

In contrast to `Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_, in which `Nesterov’s Accelerated Gradient Descent <www.cs.toronto.edu/~fritz/absps/momentum.pdf>`_ was used, we will use the Adam method for optimization `[3] <http://arxiv.org/abs/1412.6980>`_, because, generally, it requires less fine-tuning.

@@ -16,7 +16,7 @@ Creating a model instance and loading model
   :end-before: sphinx-doc: python_ref_model_stop

Deploying trained model
--------------------
-----------------------

.. literalinclude:: ../native_client/python/client.py
   :language: python

534
doc/TRAINING.rst
534
doc/TRAINING.rst
@ -1,534 +0,0 @@
|
||||
.. _training-docs:
|
||||
|
||||
Training
|
||||
========
|
||||
|
||||
.. _cuda-training-deps:
|
||||
|
||||
Prerequisites for training a model
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* `Python 3.6 <https://www.python.org/>`_
|
||||
* Mac or Linux environment
|
||||
* CUDA 10.0 / CuDNN v7.6 per `Dockerfile <https://hub.docker.com/layers/tensorflow/tensorflow/1.15.4-gpu-py3/images/sha256-a5255ae38bcce7c7610816c778244309f8b8d1576e2c0023c685c011392958d7?context=explore>`_.
|
||||
|
||||
Getting the training code
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Clone the latest released stable branch from Github (e.g. 0.9.3, check `here <https://github.com/coqui-ai/STT/releases>`_):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone --branch v0.9.3 https://github.com/coqui-ai/STT
|
||||
|
||||
If you plan on committing code or you want to report bugs, please use the master branch.
|
||||
|
||||
Creating a virtual environment
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Throughout the documentation we assume you are using **virtualenv** to manage your Python environments. This setup is the one used and recommended by the project authors and is the easiest way to make sure you won't run into environment issues. If you're using **Anaconda, Miniconda or Mamba**, first read the instructions at :ref:`training-with-conda` and then continue from the installation step below.
|
||||
|
||||
In creating a virtual environment you will create a directory containing a ``python3`` binary and everything needed to run 🐸STT. You can use whatever directory you want. For the purpose of the documentation, we will rely on ``$HOME/tmp/coqui-stt-train-venv``. You can create it using this command:
|
||||
|
||||
.. code-block::
|
||||
|
||||
$ python3 -m venv $HOME/tmp/coqui-stt-train-venv/
|
||||
|
||||
Once this command completes successfully, the environment will be ready to be activated.
|
||||
|
||||
Activating the environment
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Each time you need to work with 🐸STT, you have to *activate* this virtual environment. This is done with this simple command:
|
||||
|
||||
.. code-block::
|
||||
|
||||
$ source $HOME/tmp/coqui-stt-train-venv/bin/activate
|
||||
|
||||
Installing Coqui STT Training Code and its dependencies
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Install the required dependencies using ``pip3``\ :
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
cd STT
|
||||
pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0
|
||||
pip3 install --upgrade -e .
|
||||
|
||||
Remember to re-run the last ``pip3 install`` command above when you update the training code (for example by pulling new changes), in order to update any dependencies.
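
As an illustrative sequence (adjust the directory to wherever you cloned STT), updating an existing checkout and its dependencies could look like this:

.. code-block:: bash

   cd STT
   git pull
   pip3 install --upgrade -e .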
|
||||
|
||||
The ``webrtcvad`` Python package might require you to ensure you have proper tooling to build Python modules:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
sudo apt-get install python3-dev
|
||||
|
||||
Recommendations
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
If you have a capable (NVIDIA, at least 8GB of VRAM) GPU, it is highly recommended to install TensorFlow with GPU support. Training will be significantly faster than using the CPU. To enable GPU support, you can do:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip3 uninstall tensorflow
|
||||
pip3 install 'tensorflow-gpu==1.15.4'
|
||||
|
||||
Please ensure you have the required `CUDA dependency <https://www.tensorflow.org/install/source#gpu>`_ and/or :ref:`Prerequisites <cuda-training-deps>`.
|
||||
|
||||
Some users have reported the following failure during training:
|
||||
|
||||
.. code-block::
|
||||
|
||||
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
|
||||
[[{{node tower_0/conv1d/Conv2D}}]]
|
||||
|
||||
Setting the ``TF_FORCE_GPU_ALLOW_GROWTH`` environment variable to ``true`` seems to help in such cases. This could also be due to an incorrect version of libcudnn. Double check your versions with the :ref:`TensorFlow 1.15 documentation <cuda-training-deps>`.
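
For example, you could export the variable before starting training (the data flags below are placeholders for your own CSV files):

.. code-block:: bash

   export TF_FORCE_GPU_ALLOW_GROWTH=true
   python3 train.py --train_files train.csv --dev_files dev.csv --test_files test.csv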
|
||||
|
||||
Basic Dockerfile for training
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
We provide ``Dockerfile.train`` to automatically set up a basic training environment in Docker. This should ensure that you'll re-use the upstream Python 3 TensorFlow GPU-enabled Docker image. The image can be used with ``FROM ghcr.io/coqui-ai/stt-train``.
|
||||
|
||||
Common Voice training data
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The Common Voice corpus consists of voice samples that were donated through Mozilla's `Common Voice <https://voice.mozilla.org/>`_ Initiative.
|
||||
You can download individual CommonVoice v2.0 language data sets from `here <https://voice.mozilla.org/data>`_.
|
||||
After extraction of such a data set, you'll find the following contents:
|
||||
|
||||
|
||||
* the ``*.tsv`` files output by CorporaCreator for the downloaded language
|
||||
* the mp3 audio files they reference in a ``clips`` sub-directory.
|
||||
|
||||
To bring this data into a form that 🐸STT understands, you have to run the CommonVoice v2.0 importer (\ ``bin/import_cv2.py``\ ):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/language/archive
|
||||
|
||||
Providing a filter alphabet is optional. It will exclude all samples whose transcripts contain characters not in the specified alphabet.
|
||||
Running the importer with ``-h`` will show you some additional options.
|
||||
|
||||
Once the import is done, the ``clips`` sub-directory will contain an additional ``.wav`` file for each required ``.mp3``.
|
||||
It will also add the following ``.csv`` files:
|
||||
|
||||
* ``clips/train.csv``
|
||||
* ``clips/dev.csv``
|
||||
* ``clips/test.csv``
|
||||
|
||||
The CSV files comprise the following fields:
|
||||
|
||||
* ``wav_filename`` - path of the sample, either absolute or relative. Here, the importer produces relative paths.
|
||||
* ``wav_filesize`` - sample size in bytes, used for sorting the data before training. Expects an integer.
|
||||
* ``transcript`` - transcription target for the sample.
|
||||
|
||||
To use Common Voice data for training, validation and testing, you pass (comma-separated combinations of) their filenames to the ``--train_files``\ , ``--dev_files`` and ``--test_files`` parameters of ``train.py``.
|
||||
|
||||
If, for example, Common Voice language ``en`` was extracted to ``../data/CV/en/``\ , ``train.py`` could be called like this:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python3 train.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv
|
||||
|
||||
Training a model
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
The central (Python) script is ``train.py`` in the project's root directory. For its list of command line options, you can call:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python3 train.py --helpfull
|
||||
|
||||
To get the output of this in a slightly better-formatted way, you can also look at the flag definitions in :ref:`training-flags`.
|
||||
|
||||
For executing pre-configured training scenarios, there is a collection of convenience scripts in the ``bin`` folder. Most of them are named after the corpora they are configured for. Keep in mind that most speech corpora are *very large*, on the order of tens of gigabytes, and some aren't free. Downloading and preprocessing them can take a very long time, and training on them without a fast GPU (GTX 10 series or newer recommended) takes even longer.
|
||||
|
||||
**If you experience GPU OOM errors while training, try reducing the batch size with the ``--train_batch_size``\ , ``--dev_batch_size`` and ``--test_batch_size`` parameters.**
|
||||
|
||||
As a simple first example you can open a terminal, change to the directory of the 🐸STT checkout, activate the virtualenv created above, and run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
./bin/run-ldc93s1.sh
|
||||
|
||||
This script will train on a small sample dataset composed of just a single audio file, the sample file for the `TIMIT Acoustic-Phonetic Continuous Speech Corpus <https://catalog.ldc.upenn.edu/LDC93S1>`_, which can be overfitted on a GPU in a few minutes for demonstration purposes. From here, you can alter any variables with regard to what dataset is used, how many training iterations are run and the default values of the network parameters.
|
||||
|
||||
Also feel free to pass additional (or overriding) ``train.py`` parameters to these scripts, as shown in the example below. Then, just run the script to train the modified network.
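
As an illustration (the flag values are arbitrary and only meant to show the mechanism), overriding a couple of parameters when invoking the LDC93S1 script could look like this:

.. code-block:: bash

   ./bin/run-ldc93s1.sh --epochs 5 --learning_rate 0.0001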
|
||||
|
||||
Each dataset has a corresponding importer script in ``bin/`` that can be used to download (if it's freely available) and preprocess the dataset. See ``bin/import_librivox.py`` for an example of how to import and preprocess a large dataset for training with 🐸STT.
|
||||
|
||||
Some importers might require additional code to properly handle your locale-specific requirements. Such handling is dealt with via the ``--validate_label_locale`` flag, which allows you to source an out-of-tree Python script that defines a ``validate_label`` function. Please refer to ``util/importers.py`` for an example implementation of that function.
|
||||
If you don't provide this argument, the default ``validate_label`` function will be used. It is only intended for the English language, so you might have consistency issues in your data for other languages.
|
||||
|
||||
For example, in order to use a custom validation function that disallows any sample with "a" in its transcript, and lower cases everything else, you could put the following code in a file called ``my_validation.py`` and then use ``--validate_label_locale my_validation.py``:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def validate_label(label):
|
||||
if 'a' in label: # disallow labels with 'a'
|
||||
return None
|
||||
return label.lower() # lower case valid labels
|
||||
|
||||
If you've run the old importers (in ``util/importers/``\ ), they could have removed source files that are needed for the new importers to run. In that case, simply remove the extracted folders and let the importer extract and process the dataset from scratch, and things should work.
|
||||
|
||||
Training with automatic mixed precision
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Automatic Mixed Precision (AMP) training on GPU for TensorFlow has recently been `introduced <https://medium.com/tensorflow/automatic-mixed-precision-in-tensorflow-for-faster-ai-training-on-nvidia-gpus-6033234b2540>`_.
|
||||
|
||||
Mixed precision training makes use of both FP32 and FP16 precisions where appropriate. FP16 operations can leverage the Tensor cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput. Mixed precision training also often allows larger batch sizes. Automatic mixed precision training can be enabled by including the flag ``--automatic_mixed_precision`` at training time:
|
||||
|
||||
.. code-block:: bash

   python3 train.py --train_files ./train.csv --dev_files ./dev.csv --test_files ./test.csv --automatic_mixed_precision
|
||||
|
||||
On a Volta generation V100 GPU, automatic mixed precision speeds up 🐸STT training and evaluation by ~30%-40%.
|
||||
|
||||
Checkpointing
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
During training of a model, so-called checkpoints will get stored on disk. This takes place at a configurable time interval. The purpose of checkpoints is to allow interruption (also in the case of some unexpected failure) and later continuation of training without losing hours of training time. Resuming from checkpoints happens automatically by simply (re)starting training with the same ``--checkpoint_dir`` as the former run. Alternatively, you can specify finer-grained options with ``--load_checkpoint_dir`` and ``--save_checkpoint_dir``, which specify separate locations to use for loading and saving checkpoints respectively. If not specified, these flags use the same value as ``--checkpoint_dir``, i.e. load from and save to the same directory.
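
For example, a run that loads the latest checkpoint from one directory and saves new checkpoints to another could be started like this (paths and data flags are placeholders):

.. code-block:: bash

   python3 train.py \
       --train_files train.csv --dev_files dev.csv --test_files test.csv \
       --load_checkpoint_dir path/to/existing/checkpoints \
       --save_checkpoint_dir path/to/new/checkpoints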
|
||||
|
||||
Be aware, however, that checkpoints are only valid for the same model geometry they were generated from. In other words: if you see error messages about certain ``Tensors`` having incompatible dimensions, this is most likely due to an incompatible model change. The usual way out is to wipe all checkpoint files in the checkpoint directory, or to change the directory, before starting the training.
|
||||
|
||||
Exporting a model for deployment
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If the ``--export_dir`` parameter is provided, a model will be exported to this directory during training.
|
||||
Refer to the :ref:`usage instructions <usage-docs>` for information on running a client that can use the exported model.
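
A hypothetical invocation that trains and then exports the resulting model could look like this (paths and CSV names are placeholders):

.. code-block:: bash

   python3 train.py \
       --train_files train.csv --dev_files dev.csv --test_files test.csv \
       --checkpoint_dir path/to/checkpoint/folder \
       --export_dir path/to/export/folder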
|
||||
|
||||
Exporting a model for TFLite
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If you want to experiment with the TF Lite engine, you need to export a model that is compatible with it; to do so, use the ``--export_tflite`` flag. If you already have a trained model, you can re-export it for TFLite by running ``train.py`` again, specifying the same ``--checkpoint_dir`` that you used for training and passing ``--export_tflite --export_dir /model/export/destination``. If you changed the alphabet, you also need to add the ``--alphabet_config_path my-new-language-alphabet.txt`` flag.
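
Following the description above, re-exporting an existing checkpoint for TF Lite might look like this (the checkpoint and export paths are placeholders):

.. code-block:: bash

   python3 train.py \
       --checkpoint_dir path/to/checkpoint/folder \
       --export_tflite \
       --export_dir /model/export/destination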
|
||||
|
||||
Making a mmap-able model for deployment
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The ``output_graph.pb`` model file generated in the above step will be loaded into memory when deployed.
|
||||
This will result in extra loading time and memory consumption. One way to avoid this is to directly read data from the disk.
|
||||
|
||||
TensorFlow has tooling to achieve this: it requires building the target ``//tensorflow/contrib/util:convert_graphdef_memmapped_format`` (binaries are produced by our TaskCluster for some systems, including Linux/amd64 and macOS/amd64). Use the ``util/taskcluster.py`` tool to download it:
|
||||
|
||||
.. code-block::
|
||||
|
||||
$ python3 util/taskcluster.py --source tensorflow --artifact convert_graphdef_memmapped_format --branch r1.15 --target .
|
||||
|
||||
Producing a mmap-able model is as simple as:
|
||||
|
||||
.. code-block::
|
||||
|
||||
$ convert_graphdef_memmapped_format --in_graph=output_graph.pb --out_graph=output_graph.pbmm
|
||||
|
||||
Upon a successful run, it should report the conversion of a non-zero number of nodes. If it reports converting ``0`` nodes, something is wrong: make sure your model is a frozen one and that you have not applied any incompatible changes (this includes ``quantize_weights``\ ).
|
||||
|
||||
Continuing training from a release model
|
||||
----------------------------------------
|
||||
There are currently two supported approaches to make use of a pre-trained 🐸STT model: fine-tuning or transfer-learning. Choosing which one to use is a simple decision that depends on your target dataset. Does your data use the same alphabet as the release model? If "Yes": fine-tune. If "No": use transfer-learning.
|
||||
|
||||
If your own data uses the *exact* same alphabet as the English release model (i.e. `a-z` plus `'`) then the release model's output layer will match your data, and you can just fine-tune the existing parameters. However, if you want to use a new alphabet (e.g. Cyrillic `а`, `б`, `д`), the output layer of a release 🐸STT model will *not* match your data. In this case, you should use transfer-learning (i.e. remove the trained model's output layer and reinitialize a new output layer that matches your target character set).
|
||||
|
||||
N.B. - If you have access to a pre-trained model which uses UTF-8 bytes at the output layer, you can always fine-tune, because any alphabet should be encodable as UTF-8.
|
||||
|
||||
.. _training-fine-tuning:
|
||||
|
||||
Fine-Tuning (same alphabet)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If you'd like to use one of the pre-trained models to bootstrap your training process (fine tuning), you can do so by using the ``--checkpoint_dir`` flag in ``train.py``. Specify the path where you downloaded the checkpoint from the release, and training will resume from the pre-trained model.
|
||||
|
||||
For example, if you want to fine tune the entire graph using your own data in ``my-train.csv``\ , ``my-dev.csv`` and ``my-test.csv``\ , for three epochs, you can run something like the following, tuning the hyperparameters as needed:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
mkdir fine_tuning_checkpoints
|
||||
python3 train.py --n_hidden 2048 --checkpoint_dir path/to/checkpoint/folder --epochs 3 --train_files my-train.csv --dev_files my-dev.csv --test_files my-test.csv --learning_rate 0.0001
|
||||
|
||||
Notes about the release checkpoints: the released models were trained with ``--n_hidden 2048``\ , so you need to use that same value when initializing from the release models. Since v0.6.0, the release models are also trained with ``--train_cudnn``\ , so you'll need to specify that as well. If you don't have a CUDA-compatible GPU, you can work around this by using the ``--load_cudnn`` flag. Use ``--helpfull`` to get more information on how the flags work.
|
||||
|
||||
You also cannot use ``--automatic_mixed_precision`` when loading release checkpoints, as they do not use automatic mixed precision training.
|
||||
|
||||
If you try to load a release model without following these steps, you'll get an error similar to this:
|
||||
|
||||
.. code-block::
|
||||
|
||||
E Tried to load a CuDNN RNN checkpoint but there were more missing variables than just the Adam moment tensors.
|
||||
|
||||
|
||||
Transfer-Learning (new alphabet)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If you want to continue training an alphabet-based 🐸STT model (i.e. not a UTF-8 model) on a new language, or if you just want to add new characters to your custom alphabet, you will probably want to use transfer-learning instead of fine-tuning. If you're starting with a pre-trained UTF-8 model -- even if your data comes from a different language or uses a different alphabet -- the model will be able to predict your new transcripts, and you should use fine-tuning instead.
|
||||
|
||||
In a nutshell, 🐸STT's transfer-learning allows you to remove certain layers from a pre-trained model, initialize new layers for your target data, stitch together the old and new layers, and update all layers via gradient descent. You will remove the pre-trained output layer (and optionally more layers) and reinitialize parameters to fit your target alphabet. The simplest case of transfer-learning is when you remove just the output layer.
|
||||
|
||||
In 🐸STT's implementation of transfer-learning, all removed layers will be contiguous, starting from the output layer. The key flag you will want to experiment with is ``--drop_source_layers``. This flag accepts an integer from ``1`` to ``5`` and allows you to specify how many layers you want to remove from the pre-trained model. For example, if you supply ``--drop_source_layers 3``, you will drop the last three layers of the pre-trained model: the output layer, penultimate layer, and LSTM layer. All dropped layers will be reinitialized, and (crucially) the output layer will be defined to match your supplied target alphabet.
|
||||
|
||||
You need to specify the location of the pre-trained model with ``--load_checkpoint_dir`` and define where your new model checkpoints will be saved with ``--save_checkpoint_dir``. You need to specify how many layers to remove (aka "drop") from the pre-trained model: ``--drop_source_layers``. You also need to supply your new alphabet file using the standard ``--alphabet_config_path`` (remember, using a new alphabet is the whole reason you want to use transfer-learning).
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python3 train.py \
|
||||
--drop_source_layers 1 \
|
||||
--alphabet_config_path my-new-language-alphabet.txt \
|
||||
--save_checkpoint_dir path/to/output-checkpoint/folder \
|
||||
--load_checkpoint_dir path/to/release-checkpoint/folder \
|
||||
--train_files my-new-language-train.csv \
|
||||
--dev_files my-new-language-dev.csv \
|
||||
--test_files my-new-language-test.csv
|
||||
|
||||
UTF-8 mode
|
||||
^^^^^^^^^^
|
||||
|
||||
🐸STT includes a UTF-8 operating mode which can be useful to model languages with very large alphabets, such as Chinese Mandarin. For details on how it works and how to use it, see :ref:`decoder-docs`.
|
||||
|
||||
|
||||
.. _training-data-augmentation:
|
||||
|
||||
Augmentation
|
||||
^^^^^^^^^^^^
|
||||
|
||||
Augmentation is a useful technique for better generalization of machine learning models. Thus, a pre-processing pipeline with various augmentation techniques on raw PCM and spectrograms has been implemented and can be used while training the model. Following are the available augmentation techniques that can be enabled at training time by using the corresponding flags in the command line.
|
||||
|
||||
Each sample of the training data will get treated by every specified augmentation in the given order. However, whether an augmentation will actually get applied to a sample is decided by chance, based on the augmentation's probability value. For example, a value of ``p=0.1`` would apply the corresponding augmentation to just 10% of all samples. This also means that augmentations are not mutually exclusive on a per-sample basis.
|
||||
|
||||
The ``--augment`` flag uses a common syntax for all augmentation types:
|
||||
|
||||
.. code-block::
|
||||
|
||||
--augment augmentation_type1[param1=value1,param2=value2,...] --augment augmentation_type2[param1=value1,param2=value2,...] ...
|
||||
|
||||
For example, for the ``overlay`` augmentation:
|
||||
|
||||
.. code-block::
|
||||
|
||||
python3 train.py --augment overlay[p=0.1,source=/path/to/audio.sdb,snr=20.0] ...
|
||||
|
||||
|
||||
In the documentation below, whenever a value is specified as ``<float-range>`` or ``<int-range>``, it supports one of the following formats:
|
||||
|
||||
* ``<value>``: A constant (int or float) value.
|
||||
|
||||
* ``<value>~<r>``: A center value with a randomization radius around it. E.g. ``1.2~0.4`` will result in picking a uniformly random value between 0.8 and 1.6 for each augmented sample.
|
||||
|
||||
* ``<start>:<end>``: The value will range from `<start>` at the beginning of the training to `<end>` at the end of the training. E.g. ``-0.2:1.2`` (float) or ``2000:4000`` (int)
|
||||
|
||||
* ``<start>:<end>~<r>``: Combination of the two previous cases with a ranging center value. E.g. ``4:6~2`` would at the beginning of the training pick values between 2 and 6 and at the end of the training between 4 and 8.
|
||||
|
||||
Ranges specified with integer limits will only assume integer (rounded) values.
|
||||
|
||||
.. warning::
|
||||
When feature caching is enabled, by default the cache has no expiration limit and will be used for the entire training run. This will cause these augmentations to only be performed once during the first epoch and the result will be reused for subsequent epochs. This would not only hinder value ranges from reaching their intended final values, but could also lead to unintended over-fitting. In this case flag ``--cache_for_epochs N`` (with N > 1) should be used to periodically invalidate the cache after every N epochs and thus allow samples to be re-augmented in new ways and with current range-values.
|
||||
|
||||
Every augmentation targets a certain representation of the sample - in this documentation these representations are referred to as *domains*.
|
||||
Augmentations are applied in the following order:
|
||||
|
||||
1. **sample** domain: The sample just got loaded and its waveform is represented as a NumPy array. For implementation reasons these augmentations are the only ones that can be "simulated" through ``bin/play.py``.
|
||||
|
||||
2. **signal** domain: The sample waveform is represented as a tensor.
|
||||
|
||||
3. **spectrogram** domain: The sample spectrogram is represented as a tensor.
|
||||
|
||||
4. **features** domain: The sample's mel spectrogram features are represented as a tensor.
|
||||
|
||||
Within a single domain, augmentations are applied in the same order as they appear in the command-line.
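
For instance (an illustrative command line, not a recommended configuration), the following applies ``volume`` before ``reverb``, because both act on the sample domain and ``volume`` appears first:

.. code-block:: bash

   python3 train.py --augment volume[p=1.0,dbfs=-3.0] --augment reverb[p=1.0,delay=50.0,decay=2.0] [...]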
|
||||
|
||||
|
||||
Sample domain augmentations
|
||||
---------------------------
|
||||
|
||||
**Overlay augmentation** ``--augment overlay[p=<float>,source=<str>,snr=<float-range>,layers=<int-range>]``
|
||||
Layers another audio source (multiple times) onto augmented samples.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **source**: path to the sample collection to use for augmenting (\*.sdb or \*.csv file). It will be repeated if there are not enough samples left.
|
||||
|
||||
* **snr**: signal-to-noise ratio in dB - positive values lower the volume of the overlay in relation to the sample
|
||||
|
||||
* **layers**: number of layers added onto the sample (e.g. 10 layers of speech to get "cocktail-party effect"). A layer is just a sample of the same duration as the sample to augment. It gets stitched together from as many source samples as required.
|
||||
|
||||
|
||||
**Reverb augmentation** ``--augment reverb[p=<float>,delay=<float-range>,decay=<float-range>]``
|
||||
Adds simplified (no all-pass filters) `Schroeder reverberation <https://ccrma.stanford.edu/~jos/pasp/Schroeder_Reverberators.html>`_ to the augmented samples.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **delay**: time delay in ms for the first signal reflection - higher values widen the perceived "room"
|
||||
|
||||
* **decay**: sound decay in dB per reflection - higher values will result in a less reflective perceived "room"
|
||||
|
||||
|
||||
**Resample augmentation** ``--augment resample[p=<float>,rate=<int-range>]``
|
||||
Resamples augmented samples to another sample rate and then resamples back to the original sample rate.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **rate**: sample-rate to re-sample to
|
||||
|
||||
|
||||
**Codec augmentation** ``--augment codec[p=<float>,bitrate=<int-range>]``
|
||||
Compresses and then decompresses augmented samples using the lossy Opus audio codec.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **bitrate**: bitrate used during compression
|
||||
|
||||
|
||||
**Volume augmentation** ``--augment volume[p=<float>,dbfs=<float-range>]``
|
||||
Measures and levels augmented samples to a target dBFS value.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **dbfs** : target volume in dBFS (default value of 3.0103 will normalize min and max amplitudes to -1.0/1.0)
|
||||
|
||||
Spectrogram domain augmentations
|
||||
--------------------------------
|
||||
|
||||
**Pitch augmentation** ``--augment pitch[p=<float>,pitch=<float-range>]``
|
||||
Scales spectrogram on frequency axis and thus changes pitch.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **pitch**: pitch factor by which the frequency axis is scaled (e.g. a value of 2.0 will raise audio frequency by one octave)
|
||||
|
||||
|
||||
**Tempo augmentation** ``--augment tempo[p=<float>,factor=<float-range>]``
|
||||
Scales spectrogram on time axis and thus changes playback tempo.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **factor**: speed factor by which the time axis is stretched or shrunk (e.g. a value of 2.0 will double playback tempo)
|
||||
|
||||
|
||||
**Warp augmentation** ``--augment warp[p=<float>,nt=<int-range>,nf=<int-range>,wt=<float-range>,wf=<float-range>]``
|
||||
Applies a non-linear image warp to the spectrogram. This is achieved by randomly shifting a grid of equally distributed warp points along the time and frequency axes.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **nt**: number of equally distributed warp grid lines along time axis of the spectrogram (excluding the edges)
|
||||
|
||||
* **nf**: number of equally distributed warp grid lines along frequency axis of the spectrogram (excluding the edges)
|
||||
|
||||
* **wt**: standard deviation of the random shift applied to warp points along time axis (0.0 = no warp, 1.0 = half the distance to the neighbour point)
|
||||
|
||||
* **wf**: standard deviation of the random shift applied to warp points along frequency axis (0.0 = no warp, 1.0 = half the distance to the neighbour point)
|
||||
|
||||
|
||||
**Frequency mask augmentation** ``--augment frequency_mask[p=<float>,n=<int-range>,size=<int-range>]``
|
||||
Sets frequency-intervals within the augmented samples to zero (silence) at random frequencies. See the SpecAugment paper for more details - https://arxiv.org/abs/1904.08779
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **n**: number of intervals to mask
|
||||
|
||||
* **size**: number of frequency bands to mask per interval
|
||||
|
||||
Multi domain augmentations
|
||||
--------------------------
|
||||
|
||||
**Time mask augmentation** ``--augment time_mask[p=<float>,n=<int-range>,size=<float-range>,domain=<domain>]``
|
||||
Sets time-intervals within the augmented samples to zero (silence) at random positions.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **n**: number of intervals to set to zero
|
||||
|
||||
* **size**: duration of intervals in ms
|
||||
|
||||
* **domain**: data representation to apply augmentation to - "signal", "features" or "spectrogram" (default)
|
||||
|
||||
|
||||
**Dropout augmentation** ``--augment dropout[p=<float>,rate=<float-range>,domain=<domain>]``
|
||||
Zeros random data points of the targeted data representation.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **rate**: dropout rate ranging from 0.0 for no dropout to 1.0 for 100% dropout
|
||||
|
||||
* **domain**: data representation to apply augmentation to - "signal", "features" or "spectrogram" (default)
|
||||
|
||||
|
||||
**Add augmentation** ``--augment add[p=<float>,stddev=<float-range>,domain=<domain>]``
|
||||
Adds random values picked from a normal distribution (with a mean of 0.0) to all data points of the targeted data representation.
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **stddev**: standard deviation of the normal distribution to pick values from
|
||||
|
||||
* **domain**: data representation to apply augmentation to - "signal", "features" (default) or "spectrogram"
|
||||
|
||||
|
||||
**Multiply augmentation** ``--augment multiply[p=<float>,stddev=<float-range>,domain=<domain>]``
|
||||
Multiplies all data points of the targeted data representation with random values picked from a normal distribution (with a mean of 1.0).
|
||||
|
||||
* **p**: probability value between 0.0 (never) and 1.0 (always) if a given sample gets augmented by this method
|
||||
|
||||
* **stddev**: standard deviation of the normal distribution to pick values from
|
||||
|
||||
* **domain**: data representation to apply augmentation to - "signal", "features" (default) or "spectrogram"
|
||||
|
||||
|
||||
Example training with all augmentations:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python -u train.py \
|
||||
--train_files "train.sdb" \
|
||||
--feature_cache ./feature.cache \
|
||||
--cache_for_epochs 10 \
|
||||
--epochs 100 \
|
||||
--augment overlay[p=0.5,source=noise.sdb,layers=1,snr=50:20~10] \
|
||||
--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
|
||||
--augment resample[p=0.1,rate=12000:8000~4000] \
|
||||
--augment codec[p=0.1,bitrate=48000:16000] \
|
||||
--augment volume[p=0.1,dbfs=-10:-40] \
|
||||
--augment pitch[p=0.1,pitch=1~0.2] \
|
||||
--augment tempo[p=0.1,factor=1~0.5] \
|
||||
--augment warp[p=0.1,nt=4,nf=1,wt=0.5:1.0,wf=0.1:0.2] \
|
||||
--augment frequency_mask[p=0.1,n=1:3,size=1:5] \
|
||||
--augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40] \
|
||||
--augment dropout[p=0.1,rate=0.05] \
|
||||
--augment add[p=0.1,domain=signal,stddev=0~0.5] \
|
||||
--augment multiply[p=0.1,domain=features,stddev=0~0.5] \
|
||||
[...]
|
||||
|
||||
|
||||
The ``bin/play.py`` and ``bin/data_set_tool.py`` tools also support ``--augment`` parameters (for sample domain augmentations) and can be used for experimenting with different configurations or creating augmented data sets.
|
||||
|
||||
Example of playing all samples with reverberation and maximized volume:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
bin/play.py --augment reverb[p=0.1,delay=50.0,decay=2.0] --augment volume --random test.sdb
|
||||
|
||||
Example simulation of the codec augmentation of a wav-file first at the beginning and then at the end of an epoch:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
bin/play.py --augment codec[p=0.1,bitrate=48000:16000] --clock 0.0 test.wav
|
||||
bin/play.py --augment codec[p=0.1,bitrate=48000:16000] --clock 1.0 test.wav
|
||||
|
||||
Example of creating a pre-augmented test set:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
bin/data_set_tool.py \
|
||||
--augment overlay[source=noise.sdb,layers=1,snr=20~10] \
|
||||
--augment resample[rate=12000:8000~4000] \
|
||||
test.sdb test-augmented.sdb
|
||||
|
||||
.. _training-with-conda:
|
||||
|
||||
Training from an Anaconda or miniconda environment
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Keep in mind that none of the core authors use Anaconda or miniconda, so this setup is not guaranteed to work. If you experience problems, try using a non-conda setup first. We're happy to accept pull requests fixing any incompatibilities with conda setups, but we will not offer any support ourselves beyond reviewing pull requests.
|
||||
|
||||
To prevent common problems, make sure you **always use a separate environment when setting things up for training**:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
(base) $ conda create -n coqui-stt python=3.7
|
||||
(base) $ conda activate coqui-stt
|
19
doc/TRAINING_ADVANCED.rst
Normal file
19
doc/TRAINING_ADVANCED.rst
Normal file
@ -0,0 +1,19 @@
|
||||
.. _advanced-training-docs:
|
||||
|
||||
Training: Advanced Topics
|
||||
=========================
|
||||
|
||||
This document contains more advanced topics with regard to training models with STT. If you'd prefer a lighter introduction, please refer to :ref:`Training: Quickstart<intro-training-docs>`.
|
||||
|
||||
|
||||
1. :ref:`training-flags`
|
||||
2. :ref:`transfer-learning`
|
||||
3. :ref:`automatic-mixed-precision`
|
||||
4. :ref:`checkpointing`
|
||||
5. :ref:`common-voice-data`
|
||||
6. :ref:`training-data-augmentation`
|
||||
7. :ref:`exporting-checkpoints`
|
||||
8. :ref:`model-geometry`
|
||||
9. :ref:`parallel-training-optimization`
|
||||
10. :ref:`data-importers`
|
||||
11. :ref:`byte-output-mode`
|
167
doc/TRAINING_INTRO.rst
Normal file
167
doc/TRAINING_INTRO.rst
Normal file
@ -0,0 +1,167 @@
|
||||
.. _intro-training-docs:
|
||||
|
||||
Training: Quickstart
|
||||
=====================
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
This document is a quickstart guide to training an 🐸STT model using your own speech data. For more in-depth training documentation, you should refer to :ref:`Advanced Training Topics <advanced-training-docs>`.
|
||||
|
||||
Training a model using your own audio can lead to better transcriptions compared to an off-the-shelf 🐸STT model. If your speech data differs significantly from the data we used in training, training your own model (or fine-tuning one of ours) may lead to large improvements in transcription quality. You can read about how speech characteristics interact with transcription accuracy :ref:`here <model-data-match>`.
|
||||
|
||||
Dockerfile Setup
|
||||
----------------
|
||||
|
||||
We suggest you use our Docker image as a base for training. You can download and run the image in a container:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ docker pull ghcr.io/coqui-ai/stt-train
|
||||
$ docker run -it ghcr.io/coqui-ai/stt-train:latest
|
||||
|
||||
Alternatively you can build it from source using ``Dockerfile.train``, and run the locally built version in a container:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ git clone --recurse-submodules https://github.com/coqui-ai/STT
|
||||
$ cd STT
|
||||
$ docker build -f Dockerfile.train . -t stt-train:latest
|
||||
$ docker run -it stt-train:latest
|
||||
|
||||
You can read more about working with Dockerfiles in the `official documentation <https://docs.docker.com/engine/reference/builder/>`_.
|
||||
|
||||
Manual Setup
|
||||
------------
|
||||
|
||||
If you don't want to use our Dockerfile template, you will need to manually install STT in order to train a model.
|
||||
|
||||
.. _training-deps:
|
||||
|
||||
Prerequisites
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
* `Python 3.6 <https://www.python.org/>`_
|
||||
* Mac or Linux environment (training on Windows is *not* currently supported)
|
||||
* CUDA 10.0 and CuDNN v7.6
|
||||
|
||||
Download
|
||||
^^^^^^^^
|
||||
|
||||
We recommend that you clone the STT repo from the latest stable release branch on Github (e.g. ``v0.9.3``). You can find all 🐸STT releases `here <https://github.com/coqui-ai/STT/releases>`_.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ git clone --branch v0.9.3 --depth 1 https://github.com/coqui-ai/STT
|
||||
|
||||
Installation
|
||||
^^^^^^^^^^^^
|
||||
|
||||
Installing STT and its dependencies is much easier with a virtual environment.
|
||||
|
||||
Set up Virtual Environment
|
||||
"""""""""""""""""""""""""""
|
||||
|
||||
We recommend Python's built-in `venv <https://docs.python.org/3/library/venv.html>`_ module to manage your Python environment.
|
||||
|
||||
Set up your Python virtual environment, and name it ``coqui-stt-train-venv``:
|
||||
|
||||
.. code-block::
|
||||
|
||||
$ python3 -m venv coqui-stt-train-venv
|
||||
|
||||
Activate the virtual environment:
|
||||
|
||||
.. code-block::
|
||||
|
||||
$ source coqui-stt-train-venv/bin/activate
|
||||
|
||||
Setup with a ``conda`` virtual environment (Anaconda, Miniconda, or Mamba) is not guaranteed to work. Nevertheless, we're happy to review pull requests which fix any incompatibilities you encounter.
|
||||
|
||||
Install Dependencies and STT
|
||||
""""""""""""""""""""""""""""
|
||||
|
||||
Now that we have cloned the STT repo from Github and set up a virtual environment with ``venv``, we can install STT and its dependencies. We recommend Python's built-in `pip <https://pip.pypa.io/en/stable/quickstart/>`_ module for installation:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ cd STT
|
||||
$ python3 -m pip install --upgrade pip wheel setuptools
|
||||
$ python3 -m pip install --upgrade -e .
|
||||
|
||||
The ``webrtcvad`` package may additionally require ``python3-dev``:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ sudo apt-get install python3-dev
|
||||
|
||||
If you have an NVIDIA GPU, it is highly recommended to install TensorFlow with GPU support. Training will be significantly faster than using the CPU.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ python3 -m pip uninstall tensorflow
|
||||
$ python3 -m pip install 'tensorflow-gpu==1.15.4'
|
||||
|
||||
Please ensure you have the required `CUDA dependency <https://www.tensorflow.org/install/source#gpu>`_ and :ref:`prerequisites <training-deps>`.
|
||||
|
||||
Verify Install
|
||||
""""""""""""""
|
||||
|
||||
To verify that your installation was successful, run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ ./bin/run-ldc93s1.sh
|
||||
|
||||
This script will train a model on a single audio file. If the script exits successfully, your STT training setup is ready. Congratulations!
|
||||
|
||||
Training on your own Data
|
||||
-------------------------
|
||||
|
||||
Whether you used our Dockerfile template or you set up your own environment, the central STT training script is ``train.py``. For a list of command line options, use the ``--helpfull`` flag:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ cd STT
|
||||
$ python3 train.py --helpfull
|
||||
|
||||
Training Data
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
There are two kinds of data needed to train an STT model:
|
||||
|
||||
1. audio clips
|
||||
2. text transcripts
|
||||
|
||||
Data Format
|
||||
"""""""""""
|
||||
|
||||
Audio data is expected to be stored as WAV, sampled at 16kHz, and mono-channel. There are no hard requirements for the length of individual audio files, but in our experience, training is most successful when WAV files range from 5 to 20 seconds in length. Your training data should match as closely as possible the kind of speech you expect at deployment. You can read more about the significant characteristics of speech with regard to STT :ref:`here <model-data-match>`.
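
If your recordings are in another format or sample rate, they need to be converted first. As one example (assuming you have `SoX <http://sox.sourceforge.net/>`_ installed; other tools work just as well), converting a clip to 16 kHz, 16-bit, mono WAV could look like this:

.. code-block:: bash

   sox input-clip.mp3 -r 16000 -b 16 -c 1 output-clip.wav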
|
||||
|
||||
Text transcripts should be formatted exactly as the transcripts you expect your model to produce at deployment. If you want your model to produce capital letters, your transcripts should include capital letters. If you want your model to produce punctuation, your transcripts should include punctuation. Keep in mind that the more characters you include in your transcripts, the more difficult the task becomes for your model. STT models learn from experience, and if there are very few examples in the training data, the model will have a hard time learning rare characters (e.g. the "ï" in "naïve").
|
||||
|
||||
CSV file format
|
||||
"""""""""""""""
|
||||
|
||||
The audio and transcripts used in training are passed to ``train.py`` via CSV files. You should supply CSV files for training (``train.csv``), development (``dev.csv``), and testing (``test.csv``). The CSV files should contain three columns:
|
||||
|
||||
1. ``wav_filename`` - the path to a WAV file on your machine
|
||||
2. ``wav_filesize`` - the number of bytes in the WAV file
|
||||
3. ``transcript`` - the text transcript of the WAV file
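
As an illustration, a tiny ``train.csv`` could be created like this (the file names, sizes, and transcripts below are made up; use the real values for your own clips):

.. code-block:: bash

   # Write a minimal example CSV with a header row and two entries.
   cat > train.csv <<'EOF'
   wav_filename,wav_filesize,transcript
   /data/clips/sample-0001.wav,163244,this is an example transcript
   /data/clips/sample-0002.wav,201644,speech recognition needs matching text
   EOF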
|
||||
|
||||
Start Training
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
After you've successfully installed STT and have access to data, you can start a training run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ cd STT
|
||||
$ python3 train.py --train_files train.csv --dev_files dev.csv --test_files test.csv
|
||||
|
||||
Next Steps
|
||||
----------
|
||||
|
||||
You will want to customize the settings of ``train.py`` to work better with your data and your hardware. You should review the :ref:`command-line training flags <training-flags>`, and experiment with different settings.
|
||||
|
||||
For more in-depth training documentation, you should refer to the :ref:`Advanced Training Topics <advanced-training-docs>` section.
|
58
doc/TRANSFER_LEARNING.rst
Normal file
58
doc/TRANSFER_LEARNING.rst
Normal file
@ -0,0 +1,58 @@
|
||||
.. _transfer-learning:
|
||||
|
||||
Bootstrap from a pre-trained model
|
||||
==================================
|
||||
|
||||
If you don't have thousands of hours of training data, you will probably find that bootstrapping from a pre-trained model is a critical step in training a production-ready STT model. Even if you do have thousands of hours of data, you will find that bootstrapping from a pre-trained model can significantly decrease training time. Unless you want to experiment with new neural architectures, you probably want to bootstrap from a pre-trained model.
|
||||
|
||||
There are currently two supported approaches to bootstrapping from a pre-trained 🐸STT model: fine-tuning or transfer-learning. Choosing which one to use depends on your target dataset. Does your data use the same alphabet as the release model? If "Yes", then you fine-tune. If "No", then you use transfer-learning.
|
||||
|
||||
If your own data uses the *exact* same alphabet as the English release model (i.e. ``a-z`` plus ``'``) then the release model's output layer will match your data, and you can just fine-tune the existing parameters. However, if you want to use a new alphabet (e.g. Cyrillic ``а``, ``б``, ``д``), the output layer of an English model will *not* match your data. In this case, you should use transfer-learning (i.e. reinitialize a new output layer that matches your target character set).
|
||||
|
||||
.. _training-fine-tuning:
|
||||
|
||||
Fine-Tuning (same alphabet)
|
||||
---------------------------
|
||||
|
||||
You can fine-tune pre-trained model checkpoints by using the ``--checkpoint_dir`` flag in ``train.py``. Specify the path to the checkpoints, and training will resume from the pre-trained model.
|
||||
|
||||
For example, if you want to fine tune existing checkpoints to your own data in ``my-train.csv``, ``my-dev.csv``, and ``my-test.csv``, you can do the following:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ python3 train.py \
|
||||
--checkpoint_dir path/to/checkpoint/folder \
|
||||
--train_files my-train.csv \
|
||||
--dev_files my-dev.csv \
|
||||
--test_files my-test.csv
|
||||
|
||||
Transfer-Learning (new alphabet)
|
||||
--------------------------------
|
||||
|
||||
If you want to bootstrap from an alphabet-based 🐸STT model but your text transcripts contain characters not found in the source model, then you will want to use transfer-learning instead of fine-tuning. If you want to bootstrap from a pre-trained UTF-8 ``byte output mode`` model - even if your data comes from a different language or uses a different alphabet - the model will be able to predict your new transcripts, and you should use fine-tuning instead.
|
||||
|
||||
🐸STT's transfer-learning allows you to remove certain layers from a pre-trained model, initialize new layers for your target data, stitch together the old and new layers, and update all layers via gradient descent. Transfer-learning always removes the output layer of the pre-trained model in order to fit your new target alphabet. The simplest case of transfer-learning is when you only remove the output layer.
|
||||
|
||||
In 🐸STT's implementation of transfer-learning, all removed layers must be contiguous and include the output layer. The flag to control the number of layers you remove from the source model is ``--drop_source_layers``. This flag accepts an integer from ``1`` to ``5``, specifying how many layers to remove from the pre-trained model. For example, with ``--drop_source_layers 3`` and a 🐸STT off-the-shelf model, you will drop the last three layers of the model: the output layer, penultimate layer, and LSTM layer. All dropped layers will be reinitialized, and (crucially) the output layer will be defined to match your supplied target alphabet.
|
||||
|
||||
You need to specify the location of the pre-trained model with ``--load_checkpoint_dir`` and define where your new model checkpoints will be saved with ``--save_checkpoint_dir``. You need to specify how many layers to remove (aka "drop") from the pre-trained model: ``--drop_source_layers``. You also need to supply your new alphabet file using the standard ``--alphabet_config_path`` (remember, using a new alphabet is the whole reason you want to use transfer-learning).
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python3 train.py \
|
||||
--drop_source_layers 1 \
|
||||
--alphabet_config_path my-alphabet.txt \
|
||||
--save_checkpoint_dir path/to/output-checkpoint/folder \
|
||||
--load_checkpoint_dir path/to/input-checkpoint/folder \
|
||||
--train_files my-new-language-train.csv \
|
||||
--dev_files my-new-language-dev.csv \
|
||||
--test_files my-new-language-test.csv
|
||||
|
||||
Bootstrapping from Coqui STT release checkpoints
|
||||
------------------------------------------------
|
||||
|
||||
Currently, 🐸STT release models are trained with ``--n_hidden 2048``, so you need to use that same value when initializing from the release models. Release models are also trained with ``--train_cudnn``, so you'll need to specify that as well. If you don't have a CUDA-compatible GPU, you can work around this by using the ``--load_cudnn`` flag.
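
For example, fine-tuning from a downloaded release checkpoint might look like the following (paths and CSV names are placeholders; swap ``--train_cudnn`` for ``--load_cudnn`` if you do not have a CUDA-capable GPU):

.. code-block:: bash

   python3 train.py \
       --n_hidden 2048 \
       --train_cudnn \
       --checkpoint_dir path/to/release/checkpoint/folder \
       --train_files my-train.csv \
       --dev_files my-dev.csv \
       --test_files my-test.csv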
|
||||
|
||||
You cannot use ``--automatic_mixed_precision`` when loading release checkpoints, as they do not use automatic mixed precision training.
|
||||
|
||||
If you try to load a release model without following these steps, you may experience run-time errors.
|
@ -17,7 +17,7 @@ Coqui STT
|
||||
|
||||
DEPLOYMENT
|
||||
|
||||
TRAINING
|
||||
TRAINING_INTRO
|
||||
|
||||
BUILDING
|
||||
|
||||
@ -87,7 +87,7 @@ The fastest way to deploy a pre-trained 🐸STT model is with `pip` with Python
|
||||
|
||||
Scorer
|
||||
|
||||
.. include:: ../SUPPORT.rst
|
||||
.. include:: SUPPORT.rst
|
||||
|
||||
Indices and tables
|
||||
==================
|
||||
|