Merge branch 'playbook-into-docs'

This commit is contained in:
Reuben Morais 2021-03-30 19:39:27 +02:00
commit c78af058a5
18 changed files with 1707 additions and 5 deletions


@ -76,7 +76,8 @@ extensions = [
'sphinx.ext.viewcode', 'sphinx.ext.viewcode',
'sphinx_js', 'sphinx_js',
'sphinx_csharp', 'sphinx_csharp',
'breathe' 'breathe',
'recommonmark',
] ]
@ -112,7 +113,7 @@ language = None
# List of patterns, relative to source directory, that match files and # List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files. # directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path # This patterns also effect to html_static_path and html_extra_path
exclude_patterns = ['.build', 'Thumbs.db', '.DS_Store', 'node_modules'] exclude_patterns = ['.build', 'Thumbs.db', '.DS_Store', 'node_modules', 'examples']
# The name of the Pygments (syntax highlighting) style to use. # The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx' pygments_style = 'sphinx'


@ -76,7 +76,7 @@ The fastest way to deploy a pre-trained 🐸STT model is with `pip` with Python
DotNet-Examples DotNet-Examples
Java-Examples Java-Examples
HotWordBoosting-Examples HotWordBoosting-Examples
Contributed-Examples Contributed-Examples
@ -85,10 +85,16 @@ The fastest way to deploy a pre-trained 🐸STT model is with `pip` with Python
:maxdepth: 1 :maxdepth: 1
:caption: Language Model :caption: Language Model
Scorer LANGUAGE_MODEL
.. include:: SUPPORT.rst .. include:: SUPPORT.rst
.. toctree::
:maxdepth: 1
:caption: STT Playbook
playbook/README
Indices and tables Indices and tables
================== ==================

37
doc/playbook/ABOUT.md Executable file

@ -0,0 +1,37 @@
[Home](README.md) | [Previous - Introduction](INTRO.md) | [Next - Formatting your training data](DATA_FORMATTING.md)
# About Coqui STT
## Contents
- [About Coqui STT](#about-coqui-stt)
* [Contents](#contents)
* [What does Coqui STT do?](#what-does-coqui-stt-do-)
* [How does Coqui STT work?](#how-does-coqui-stt-work-)
* [How is Coqui STT implemented?](#how-is-coqui-stt-implemented-)
## What does Coqui STT do?
🐸STT is a tool for automatically transcribing spoken audio. 🐸STT takes digital audio as input and returns a "most likely" text transcript of that audio.
🐸STT is an implementation of the DeepSpeech algorithm developed by Baidu and presented in this research paper:
> Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger R, Satheesh S, Sengupta S, Coates A., & Ng, A. Y. (2014). Deep speech: Scaling up end-to-end speech recognition. [arXiv preprint arXiv:1412.5567](https://arxiv.org/pdf/1412.5567).
🐸STT can be used for two key activities related to speech recognition - _training_ and _inference_. Speech recognition _inference_ - the process of converting spoken audio to written text - relies on a _trained model_. With appropriate hardware (a GPU), 🐸STT can train a model using a set of voice data, known as a _corpus_. Then, _inference_ or _recognition_ can be performed using the trained model. 🐸STT includes several pre-trained models.
**This Playbook is focused on helping you train your own model.**
## How does Coqui STT work?
🐸STT takes a stream of audio as input, and converts that stream of audio into a sequence of characters in the designated alphabet. This conversion is made possible by two basic steps: First, the audio is converted into a sequence of probabilities over characters in the alphabet. Secondly, this sequence of probabilities is converted into a sequence of characters.
The first step is made possible by a [Deep Neural Network](https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks), and the second step is made possible by an [N-gram](https://en.wikipedia.org/wiki/N-gram) language model. The neural network is trained on audio and corresponding text transcripts, and the N-gram language model is trained on a text corpus (which is often different from the text transcripts of the audio). The neural model is trained to predict the text from speech, and the language model is trained to predict text from preceding text. At a very high level, you can think of the first part (the acoustic model) as a phonetic transcriber, and the second part (the language model) as a spelling and grammar checker.
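The second step, turning per-frame probabilities into characters, can be sketched with a toy example. This is only an illustration under stated assumptions: it uses greedy (best-path) decoding rather than the scorer-guided search 🐸STT performs, and the four-character alphabet, the extra "blank" index and the probability values are all invented for the demo.

```python
# Toy illustration only: greedy decoding over made-up probabilities.
ALPHABET = [" ", "a", "b", "c"]   # hypothetical 4-character alphabet
BLANK = len(ALPHABET)             # assumed: one extra "blank" index after the alphabet


def greedy_decode(frames):
    """Pick the most likely index per frame, collapse repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frames]
    chars = []
    previous = None
    for index in best:
        if index != previous and index != BLANK:
            chars.append(ALPHABET[index])
        previous = index
    return "".join(chars)


# One row per audio frame: probabilities for " ", "a", "b", "c", blank.
frames = [
    [0.0, 0.1, 0.1, 0.7, 0.1],  # most likely: c
    [0.0, 0.1, 0.1, 0.7, 0.1],  # c again (repeat, collapsed)
    [0.0, 0.1, 0.1, 0.1, 0.7],  # blank
    [0.0, 0.8, 0.1, 0.0, 0.1],  # a
    [0.1, 0.1, 0.7, 0.0, 0.1],  # b
]
print(greedy_decode(frames))  # prints "cab"
```

A language model improves on this by preferring character sequences that form likely words, instead of trusting each frame in isolation.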
## How is Coqui STT implemented?
The core of 🐸STT is written in C++, but it has bindings to Python, .NET, Java, JavaScript, and community-based bindings for Golang, Rust, Vlang, and NIM-lang.
---
[Home](README.md) | [Previous - Introduction](INTRO.md) | [Next - Formatting your training data](DATA_FORMATTING.md)

42
doc/playbook/ALPHABET.md Normal file

@ -0,0 +1,42 @@
[Home](README.md) | [Previous - Scorer - language model for determining which words occur together ](SCORER.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)
# The alphabet.txt file
## Contents
- [The alphabet.txt file](#the-alphabettxt-file)
* [Contents](#contents)
* [What is alphabet.txt ?](#what-is-alphabettxt--)
* [How does the Glue work?](#how-does-the-glue-work-)
+ [How to diagnose mis-matched alphabets?](#how-to-diagnose-mis-matched-alphabets-)
* [Common alphabet.txt related errors](#common-alphabettxt-related-errors)
This tiny text file is easy to overlook, but it is very important. The *exact same* alphabet must be used to train both the acoustic model and the language model. This alphabet.txt is the glue that holds the language model and the acoustic model together.
## What is alphabet.txt ?
Let's take a look at the English [alphabet.txt](https://github.com/coqui-ai/STT/blob/master/data/alphabet.txt) which was used to train the release 🐸STT models. If you were to ask a native English speaker to write down the alphabet, this `alphabet.txt` isn't what they would write. *The `alphabet.txt` file contains all characters used in a language which are necessary for writing*. Looking at the English alphabet file, the first character is the space `" "`. We need spaces to separate words when writing. Following the space, we find all the familiar letters of the alphabet which children learn in school. Finally, we find the apostrophe `"'"`. The apostrophe is needed for writing contractions, which are very common in English. The apostrophe can distinguish words like "we're" and "were", which have different pronunciations. Not all languages need spaces, and not all languages need apostrophes. Creating the alphabet for a new language takes some research. Two people creating an alphabet file for the same language may disagree, and no one is objectively right. The best alphabet will depend on the target application and the available training data. You may notice that the `alphabet.txt` file released with 🐸STT for English does not contain any characters with accents, even though they do occur sometimes in English. The off-the-shelf 🐸STT model cannot produce words like "naïvely" or "résumé", and this was a design decision. We could make an alphabet that contains every possible character for every possible loan-word into English, but then we would need training data for all those new characters.
## How does the Glue work?
Quite simply, `alphabet.txt` helps 🐸STT make a lookup table, and at run-time that lookup table is used instead of the characters themselves. For the English example, the 🐸STT acoustic model doesn't have any idea what the letter 'a' is, but it does know what index '1' is. The `alphabet.txt` file tells us that the index '1' for the acoustic model corresponds to the letter 'a', so we can make sense of the output. If the indices for the acoustic model and language model don't match, then the acoustic model might hear an 'a', but the language model interprets it instead as 'b'. This mis-match is sneaky, and if the alphabets used for the acoustic model and language model are similar, but slightly off, this is a hard problem to diagnose. If you used different `alphabet.txt` files, you may not get any run-time error messages, but the output transcriptions will make no sense.
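The lookup-table idea can be sketched in a few lines of Python. This is a simplified illustration, not the 🐸STT implementation: `build_lookup` is a hypothetical helper, and the two four-character alphabets are invented so that the effect of a mismatch is easy to see.

```python
def build_lookup(characters):
    """Return (char -> index, index -> char) tables for an ordered alphabet."""
    index_to_char = dict(enumerate(characters))
    char_to_index = {ch: i for i, ch in index_to_char.items()}
    return char_to_index, index_to_char


# Hypothetical alphabets: same characters, but 'a' and 'b' are swapped.
acoustic_alphabet = [" ", "a", "b", "c"]
scorer_alphabet = [" ", "b", "a", "c"]

encode, _ = build_lookup(acoustic_alphabet)
_, decode_wrong = build_lookup(scorer_alphabet)

# The acoustic side only ever sees indices, never characters.
indices = [encode[ch] for ch in "a cab"]
# Interpreting those indices with the other alphabet silently garbles the text.
misread = "".join(decode_wrong[i] for i in indices)

print(indices)  # [1, 0, 3, 1, 2]
print(misread)  # b cba
```

Note that nothing raises an error here: every index is valid in both tables, which is why mismatched alphabets are hard to spot.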
### How to diagnose mis-matched alphabets?
If you think you used different alphabets to create a [language model and an acoustic model](AM_vs_LM.md), try decoding _without_ the scorer. If the output is reasonable when you decode without a scorer, but _not_ reasonable when you decode the same audio with a scorer, then you could have mis-matched alphabets. Usually the easiest way to fix this is to re-compile the scorer with the correct alphabet.
[Read more information on building a language model (scorer)](SCORER.md).
## Common alphabet.txt related errors
One of the most common errors occurs when there is a character in the corpus that is not in the `alphabet.txt` file. You need to include the missing character in the `alphabet.txt` file.
```
File "/STT/training/coqui_stt_training/util/text.py", line 18, in text_to_char_array
.format(transcript, context, list(ch for ch in transcript if not alphabet.CanEncodeSingle(ch))))
ValueError: Alphabet cannot encode transcript "panggil ambulan" while processing sample "persistent-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_19338419.wav", check that your alphabet contains all characters in the training corpus. Missing characters are: [''].
```
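Before starting a long training run, it can help to scan your transcripts for characters that the alphabet cannot encode. The sketch below is one way to do that, assuming the transcripts and alphabet are already loaded into Python; `missing_characters` is a hypothetical helper and is not part of 🐸STT.

```python
def missing_characters(transcripts, alphabet):
    """Return the sorted characters that appear in transcripts but not in alphabet."""
    allowed = set(alphabet)
    return sorted({ch for text in transcripts for ch in text if ch not in allowed})


# Assumed alphabet: space, lowercase a-z and apostrophe, as described for English.
alphabet = list(" abcdefghijklmnopqrstuvwxyz'")
transcripts = ["panggil ambulan", "it's 8 o'clock!"]

print(missing_characters(transcripts, alphabet))  # ['!', '8']
```

Each reported character must either be added to `alphabet.txt` or removed from the transcripts.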
---
[Home](README.md) | [Previous - Scorer - language model for determining which words occur together ](SCORER.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)

21
doc/playbook/AM_vs_LM.md Normal file

@ -0,0 +1,21 @@
[Home](README.md) | [Previous - Scorer - language model for determining which words occur together](SCORER.md) | [Next - Setting up your Coqui STT training environment](ENVIRONMENT.md)
# Acoustic model vs. Language model
## Contents
- [Acoustic model vs. Language model](#acoustic-model-vs-language-model)
* [Contents](#contents)
* [Training](#training)
At runtime, 🐸STT is made up of two main parts: (1) the acoustic model and (2) the language model. The acoustic model takes audio as input and converts it to a probability over characters in the alphabet. The language model helps to turn these probabilities into words of coherent language. The language model (also known as the scorer) assigns probabilities to words and phrases based on statistics from training data. The language model knows that "I read a book" is much more probable than "I red a book", even though they may sound identical to the acoustic model.
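To make the "I read a book" example concrete, here is a toy bigram model in Python. It is a sketch under heavy assumptions: the four-sentence corpus is invented, add-one smoothing is used for simplicity, and none of this reflects how the real scorer is built.

```python
import math
from collections import Counter

# Invented miniature text corpus.
corpus = [
    "i read a book",
    "i read a paper",
    "she read a book",
    "a red car",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()   # <s> marks the sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocabulary_size = len(unigrams)


def log_probability(sentence):
    """Sum of log bigram probabilities, with add-one smoothing."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for previous, current in zip(words, words[1:]):
        # Unseen word pairs get a small, non-zero probability.
        numerator = bigrams[(previous, current)] + 1
        denominator = unigrams[previous] + vocabulary_size
        total += math.log(numerator / denominator)
    return total


print(log_probability("i read a book") > log_probability("i red a book"))  # True
```

The model has seen "i read" and "read a" but never "i red" or "red a", so the first sentence scores higher even though both may sound the same.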
## Training
The acoustic model is a neural network trained with TensorFlow, and the training data is a corpus of speech and transcripts.
The language model is an n-gram model trained with KenLM, and the training data is a corpus of text.
---
[Home](README.md) | [Previous - Scorer - language model for determining which words occur together](SCORER.md) | [Next - Setting up your Coqui STT training environment](ENVIRONMENT.md)


@ -0,0 +1,115 @@
[Home](README.md) | [Previous - About Coqui STT](ABOUT.md) | [Next - The alphabet.txt file](ALPHABET.md)
# Formatting your training data for Coqui STT
## Contents
- [Formatting your training data for Coqui STT](#formatting-your-training-data-for-coqui-stt)
* [Contents](#contents)
* [Collecting data](#collecting-data)
* [Preparing your data for training](#preparing-your-data-for-training)
+ [Data from Common Voice](#data-from-common-voice)
* [Importers](#importers)
🐸STT expects audio files to be WAV format, mono-channel, and with a 16kHz sampling rate.
For training, testing, and development, you need to feed 🐸STT CSV files which contain three columns: `wav_filename,wav_filesize,transcript`. The `wav_filesize` (i.e. number of bytes) is used to group together audio of similar lengths for efficient batching.
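As a rough sketch of what such a CSV looks like, the Python below checks that a clip is mono and 16kHz and then writes the three columns. The helper names, the temporary directory and the silent demo clip are all assumptions made for illustration; the importers that ship with 🐸STT do this work for real corpora.

```python
import csv
import os
import tempfile
import wave


def check_wav(path):
    """Return True if the file is a mono WAV with a 16 kHz sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000


def write_manifest(csv_path, samples):
    """Write (wav_path, transcript) pairs as wav_filename,wav_filesize,transcript."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in samples:
            if not check_wav(wav_path):
                raise ValueError(f"{wav_path} is not a 16 kHz mono WAV file")
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])


# Demo: one second of 16-bit silence in a temporary directory.
workdir = tempfile.mkdtemp()
clip = os.path.join(workdir, "clip.wav")
with wave.open(clip, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 2 bytes per sample (a demo choice)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

manifest = os.path.join(workdir, "train.csv")
write_manifest(manifest, [(clip, "panggil ambulan")])
```

The file size is taken straight from the filesystem with `os.path.getsize`, so it is the number of bytes on disk, header included.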
## Collecting data
This Playbook is focused on _training_ a speech recognition model, rather than on _collecting_ the data that is required for an accurate model. However, a good model starts with data.
* Ensure that your voice clips are 10-20 seconds in length. If they are longer or shorter than this, your model will be less accurate.
* Ensure that every character in your transcription of a voice clip is in your [alphabet.txt](ALPHABET.md) file.
* Ensure that your voice clips exhibit the same sort of diversity you expect to encounter in your runtime audio. This means a diversity of accents, genders, background noise and so on.
* Ensure that your voice clips are created using microphones similar to those you expect in your runtime audio. For example, if you expect to deploy your model on Android mobile phones, ensure that your training data is generated from Android mobile phones.
* Ensure that the phrasing on which your voice clips are generated covers the phrases you expect to encounter in your runtime audio.
### Punctuation and numbers
If you are collecting data that will be used to train a speech model, then you should remove punctuation marks such as dashes, tick marks, quote marks and so on. These will often be confused, and can hinder training an accurate model.
Numbers should be written in full (i.e. as a [cardinal](https://en.wikipedia.org/wiki/Cardinal_numeral)) - that is, as `eight` rather than `8`.
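A minimal sketch of this clean-up step is shown below. It assumes an English-style alphabet of lowercase letters, spaces and apostrophes, and it lowercases the text; both are assumptions for the example. Rather than guessing how a number should be spelled out, the hypothetical `clean_transcript` helper rejects transcripts containing digits so they can be fixed by hand.

```python
import re

DIGITS = re.compile(r"\d")
# Anything other than lowercase letters, apostrophes and spaces is replaced.
NOT_ALLOWED = re.compile(r"[^a-z' ]")


def clean_transcript(text):
    """Lowercase, strip punctuation and collapse whitespace; reject digits."""
    lowered = text.lower()
    if DIGITS.search(lowered):
        raise ValueError(f"write numbers in full before training: {text!r}")
    stripped = NOT_ALLOWED.sub(" ", lowered)
    return " ".join(stripped.split())


print(clean_transcript('He said: "We\'re late" - again!'))  # he said we're late again
```

For other languages, the allowed character set would need to match that language's `alphabet.txt`.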
## Preparing your data for training
### Data from Common Voice
If you are using data from Common Voice for training a model, you will need to prepare it as [outlined in the 🐸STT documentation](https://stt.readthedocs.io/en/latest/TRAINING.html#common-voice-training-data).
In this example we will prepare the Indonesian dataset for training, but you can use any language from Common Voice that you prefer. We've chosen Indonesian as it has the same [orthographic alphabet](ALPHABET.md) as English, which means we don't have to use a different `alphabet.txt` file for training; we can use the default.
---
This example assumes you have already set up a Docker [environment](ENVIRONMENT.md) for [training](TRAINING.md). If you have not yet set up your Docker environment, we suggest you pause here and do this first.
---
First, [download the dataset from Common Voice](https://commonvoice.mozilla.org/en/datasets), and extract the archive into your `stt-data` directory. This makes it available to your Docker container through a _bind mount_. Start your 🐸STT Docker container with the `stt-data` directory as a _bind mount_ (this is covered in the [environment](ENVIRONMENT.md) section).
Your CV corpus data should be available from within the Docker container.
```
root@3de3afbe5d6f:/STT# ls stt-data/cv-corpus-6.1-2020-12-11/id/
clips invalidated.tsv reported.tsv train.tsv
dev.tsv other.tsv test.tsv validated.tsv
```
The `ghcr.io/coqui-ai/stt-train` Docker image _does not_ come with `sox`, which is a package used for processing Common Voice data. We need to install `sox` first.
```
root@4b39be3b0ffc:/STT# apt-get -y update && apt-get install -y sox
```
Next, we will run the Common Voice importer that ships with 🐸STT.
```
root@3de3afbe5d6f:/STT# bin/import_cv2.py stt-data/cv-corpus-6.1-2020-12-11/id
```
This will process all the CV data into the `clips` directory, and it can now be used [for training](TRAINING.md).
## Importers
🐸STT ships with several scripts which act as _importers_ - preparing a corpus of data for training by 🐸STT.
If you want to create importers for a new language, or a new corpus, you will need to fork the 🐸STT repository, then add support for the new language and/or corpus by creating an _importer_ for that language/corpus.
The existing importer scripts are a good starting point for creating your own importers.
They are located in the `bin` directory of the 🐸STT repo:
```
root@3de3afbe5d6f:/STT# ls bin | grep import
import_aidatatang.py
import_aishell.py
import_ccpmf.py
import_cv.py
import_cv2.py
import_fisher.py
import_freestmandarin.py
import_gram_vaani.py
import_ldc93s1.py
import_librivox.py
import_lingua_libre.py
import_m-ailabs.py
import_magicdata.py
import_primewords.py
import_slr57.py
import_swb.py
import_swc.py
import_ted.py
import_timit.py
import_ts.py
import_tuda.py
import_vctk.py
import_voxforge.py
```
The importer scripts ensure that the `.wav` files and corresponding transcriptions are in the `.csv` format expected by 🐸STT.
---
[Home](README.md) | [Previous - About Coqui STT](ABOUT.md) | [Next - The alphabet.txt file](ALPHABET.md)


@ -0,0 +1,37 @@
[Home](README.md) | [Previous - Introduction](INTRO.md) | [Next - Formatting your training data](DATA_FORMATTING.md)
# About DeepSpeech
## Contents
- [About DeepSpeech](#about-deepspeech)
* [Contents](#contents)
* [What does DeepSpeech do?](#what-does-deepspeech-do-)
* [How does DeepSpeech work?](#how-does-deepspeech-work-)
* [How is DeepSpeech implemented?](#how-is-deepspeech-implemented-)
## What does DeepSpeech do?
DeepSpeech is a tool for automatically transcribing spoken audio. DeepSpeech takes digital audio as input and returns a "most likely" text transcript of that audio.
DeepSpeech is an implementation of the DeepSpeech algorithm developed by Baidu and presented in this research paper:
> Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger R, Satheesh S, Sengupta S, Coates A., & Ng, A. Y. (2014). Deep speech: Scaling up end-to-end speech recognition. [arXiv preprint arXiv:1412.5567](https://arxiv.org/pdf/1412.5567).
DeepSpeech can be used for two key activities related to speech recognition - _training_ and _inference_. Speech recognition _inference_ - the process of converting spoken audio to written text - relies on a _trained model_. With appropriate hardware (a GPU), DeepSpeech can train a model using a set of voice data, known as a _corpus_. Then, _inference_ or _recognition_ can be performed using the trained model. DeepSpeech includes several pre-trained models.
**This Playbook is focused on helping you train your own model.**
## How does DeepSpeech work?
DeepSpeech takes a stream of audio as input, and converts that stream of audio into a sequence of characters in the designated alphabet. This conversion is made possible by two basic steps: First, the audio is converted into a sequence of probabilities over characters in the alphabet. Secondly, this sequence of probabilities is converted into a sequence of characters.
The first step is made possible by a [Deep Neural Network](https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks), and the second step is made possible by an [N-gram](https://en.wikipedia.org/wiki/N-gram) language model. The neural network is trained on audio and corresponding text transcripts, and the N-gram language model is trained on a text corpus (which is often different from the text transcripts of the audio). The neural model is trained to predict the text from speech, and the language model is trained to predict text from preceding text. At a very high level, you can think of the first part (the acoustic model) as a phonetic transcriber, and the second part (the language model) as a spelling and grammar checker.
## How is DeepSpeech implemented?
The core of DeepSpeech is written in C++, but it has bindings to Python, .NET, Java, JavaScript, and community-based bindings for Golang, Rust, Vlang, and NIM-lang.
---
[Home](README.md) | [Previous - Introduction](INTRO.md) | [Next - Formatting your training data](DATA_FORMATTING.md)

114
doc/playbook/DEPLOYMENT.md Normal file

@ -0,0 +1,114 @@
[Home](README.md) | [Previous - Testing and evaluating your trained model](TESTING.md) | [Next - Real life examples of using Coqui STT](EXAMPLES.md)
# Deployment
## Contents
- [Deployment](#deployment)
* [Contents](#contents)
* [Protocol buffer and memory mappable file formats](#protocol-buffer-and-memory-mappable-file-formats)
* [Exporting a memory mappable protocol buffer file with `graphdef`](#exporting-a-memory-mappable-protocol-buffer-file-with--graphdef-)
* [Exporting a tflite model](#exporting-a-tflite-model)
Now that you have [trained](TRAINING.md) and [evaluated](TESTING.md) your model, you are ready to use it for _inference_ - where spoken phrases - _utterances_ - are assessed by your trained model and a text _transcription_ provided.
There are some things to be aware of during this stage of the process.
## Protocol buffer and memory mappable file formats
By default, 🐸STT will export the trained model as a `.pb` file, such as:
```
$ sudo ls -las volumes/stt-data/_data/exported-model
4 drwxr-xr-x 2 root root 4096 Feb 1 22:13 .
4 drwxr-xr-x 6 root root 4096 Feb 1 22:23 ..
4 -rwxr-xr-x 1 root root 1586 Feb 1 22:13 author_model_0.0.1.md
184488 -rwxr-xr-x 1 root root 188915369 Feb 1 22:13 output_graph.pb
```
A `.pb` file is a [protocol buffer](https://en.wikipedia.org/wiki/Protocol_Buffers) file. Protocol buffer is a widely used file format for trained models, but it has a significant downside: it is not _memory mappable_. [Memory mappable](https://en.wikipedia.org/wiki/Memory-mapped_file) files can be referenced by the operating system using a _file descriptor_, and they consume far less memory than non-memory-mappable files. Protocol buffer files also tend to be much larger than memory-mappable files.
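As a rough illustration of what memory mapping means in practice, the Python sketch below maps a throwaway file and reads a small slice of it without reading the whole file into a Python object. The file name and contents are invented for the demo, and this is not how 🐸STT itself loads models.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a large model file: 1 MiB of padding plus a marker.
path = os.path.join(tempfile.mkdtemp(), "fake_model.bin")
with open(path, "wb") as handle:
    handle.write(b"\x00" * (1024 * 1024) + b"weights")

with open(path, "rb") as handle:
    # Length 0 maps the whole file, addressed through the file descriptor.
    with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        size = len(mapped)
        tail = mapped[size - 7:size]   # only this slice is copied into Python

print(size, tail)
```

The operating system decides which parts of the mapped file are actually held in memory, which is what makes this approach attractive for large model files.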
Most inference libraries, such as TensorFlow, require a memory-mappable format.
There are two formats in particular that you should be familiar with.
## Exporting a memory mappable protocol buffer file with `graphdef`
Using the `graphdef` tool which is built in to TensorFlow (but deprecated in TensorFlow 2.3), you can export a memory-mappable protocol buffer file using the following command:
```
convert_graphdef_memmapped_format --in_graph=output_graph.pb --out_graph=output_graph.pbmm
```
where `--in_graph` is a path to your `.pb` file and `--out_graph` is a path to the exported memory-mappable protocol buffer file.
```
root@12a4ee8ce1ed:/STT# ./convert_graphdef_memmapped_format \
--in_graph="persistent-data/exported-model/output_graph.pb" \
--out_graph="persistent-data/exported-model/output_graph.pbmm"
2021-02-03 21:13:09.516709: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 134217728 exceeds 10% of system memory.
2021-02-03 21:13:09.647395: I tensorflow/contrib/util/convert_graphdef_memmapped_format_lib.cc:171] Converted 7 nodes
```
For [more information on creating a memory-mappable protocol buffer file, consult the documentation](https://stt.readthedocs.io/en/latest/TRAINING.html#exporting-a-model-for-inference).
***Be aware that this file format is likely to be deprecated in the future. We strongly recommend the use of `tflite`.***
## Exporting a tflite model
The `tflite` engine ([more information on tflite](https://www.tensorflow.org/lite/)) is designed to allow inference on mobile, IoT and embedded devices. If you have _not_ yet trained a model, and you want to export a model compatible with `tflite`, you will need to use the `--export_tflite` flag with the `train.py` script. For example:
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--export_dir stt-data/exported-model \
--export_tflite
```
If you have _already_ trained a model, and wish to export to `tflite` format, you can re-export it by specifying the same `checkpoint_dir` that you used for training, and by passing the `--export_tflite` parameter.
Here is an example:
```
python3 train.py \
--checkpoint_dir persistent-data/checkpoints \
--export_dir persistent-data/exported-model \
--export_tflite
I Loading best validating checkpoint from persistent-data/checkpoints-1feb2021-id/best_dev-34064
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
I Models exported at persistent-data/exported-model
I Model metadata file saved to persistent-data/exported-model/author_model_0.0.1.md. Before submitting the exported model for publishing make sure all information in the metadata file is correct, and complete the URL fields.
root@0913858a2868:/STT/persistent-data/exported-model# ls -las
total 415220
4 drwxr-xr-x 2 root root 4096 Feb 3 22:42 .
4 drwxr-xr-x 7 root root 4096 Feb 3 21:54 ..
4 -rwxr-xr-x 1 root root 1582 Feb 3 22:42 author_model_0.0.1.md
184488 -rwxr-xr-x 1 root root 188915369 Feb 1 11:13 output_graph.pb
184496 -rw-r--r-- 1 root root 188916323 Feb 3 21:13 output_graph.pbmm
46224 -rw-r--r-- 1 root root 47332112 Feb 3 22:42 output_graph.tflite
```
For more information on exporting a `tflite` model, [please consult the documentation](https://stt.readthedocs.io/en/latest/TRAINING.html#exporting-a-model-for-inference).
---
[Home](README.md) | [Previous - Testing and evaluating your trained model](TESTING.md) | [Next - Real life examples of using Coqui STT](EXAMPLES.md)

299
doc/playbook/ENVIRONMENT.md Normal file

@ -0,0 +1,299 @@
[Home](README.md) | [Previous - Acoustic Model and Language Model](AM_vs_LM.md) | [Next - Training your model](TRAINING.md)
# Setting up your environment for training using Coqui STT
## Contents
- [Setting up your environment for training using Coqui STT](#setting-up-your-environment-for-training-using-coqui-stt)
* [Contents](#contents)
* [Installing dependencies for working with GPUs under Docker](#installing-dependencies-for-working-with-gpus-under-docker)
+ [GPU drivers](#gpu-drivers)
* [What is Docker and why is it recommended for training a model with Coqui STT?](#what-is-docker-and-why-is-it-recommended-for-training-a-model-with-coqui-stt-)
* [Install Docker](#install-docker)
+ [Ensure that you create a `docker` group and that you add yourself to this group](#ensure-that-you-create-a--docker--group-and-that-you-add-yourself-to-this-group)
+ [Install the `nvidia-container-toolkit`](#install-the--nvidia-container-toolkit-)
* [Pulling down a pre-built Coqui STT Docker image](#pulling-down-a-pre-built-coqui-stt-docker-image)
+ [Testing the image by creating a container and running a script](#testing-the-image-by-creating-a-container-and-running-a-script)
* [Setting up a bind mount to store persistent data](#setting-up-a-bind-mount-to-store-persistent-data)
* [Extending the base `stt-train` Docker image for your needs](#extending-the-base--stt-train--docker-image-for-your-needs)
This section of the Playbook assumes you are comfortable installing 🐸STT and using it with a pre-trained model, and that you are comfortable setting up a Python _virtual environment_.
Here, we provide information on setting up a Docker environment for training your own speech recognition model using 🐸STT. We also cover dependencies Docker has for NVIDIA GPUs, so that you can use your GPU(s) for training a model.
---
***Do not train using only CPU(s)***
This Playbook assumes that you will be using NVIDIA GPU(s). Training a 🐸STT speech recognition model on CPU(s) only will take a _very, very, very_ long time. Do not train on your CPU(s).
---
## Installing dependencies for working with GPUs under Docker
Before we install Docker, we are going to make sure that we have all the Ubuntu Linux dependencies required for working with NVIDIA GPUs and Docker.
---
***Non-NVIDIA GPUs***
Although non-NVIDIA GPUs exist, they are currently rare, and we do not aim to support them in this Playbook.
---
### GPU drivers
By default, your machine should already have GPU drivers installed. A good way to check is with the `nvidia-smi` tool. If your drivers are installed correctly, `nvidia-smi` will report the driver version and CUDA version.
```
$ nvidia-smi
Sat Jan 9 11:48:50 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 On | N/A |
| N/A 70C P0 27W / N/A | 766MiB / 6069MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================| | |
+-----------------------------------------------------------------------------+
```
If your drivers are _not_ installed correctly, you will likely see this warning:
```
$ nvidia-smi
Command 'nvidia-smi' not found, but can be installed with:
sudo apt install nvidia-utils-440 # version 440.100-0ubuntu0.20.04.1, or
sudo apt install nvidia-340 # version 340.108-0ubuntu2
sudo apt install nvidia-utils-435 # version 435.21-0ubuntu7
sudo apt install nvidia-utils-390 # version 390.141-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450 # version 450.102.04-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450-server # version 450.80.02-0ubuntu0.20.04.3
sudo apt install nvidia-utils-460 # version 460.32.03-0ubuntu0.20.04.1
sudo apt install nvidia-utils-418-server # version 418.152.00-0ubuntu0.20.04.1
sudo apt install nvidia-utils-440-server # version 440.95.01-0ubuntu0.20.04.1
```
[Follow this guide](https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-18-04-bionic-beaver-linux) to install your GPU drivers.
Once you've installed your drivers, use `nvidia-smi` to prove that they are installed correctly.
_Note that you may need to restart your host after installing the GPU drivers._
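If you prefer to script this check, for example as a guard at the start of a training job, a minimal Python sketch could look like the following. `gpu_driver_report` is a hypothetical helper; it assumes only that `nvidia-smi` is on your `PATH` when the drivers are installed, and it returns `None` otherwise.

```python
import shutil
import subprocess


def gpu_driver_report():
    """Return nvidia-smi's GPU name and driver listing, or None if unavailable."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


report = gpu_driver_report()
if report is None:
    print("nvidia-smi not found or failed; check your GPU drivers")
else:
    print(report)
```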
Ideally, you should not be running any other processes on your GPU(s) before you start training.
Next, we will install the utility `nvtop` so that you can monitor the performance of your GPU(s). We will also use `nvtop` to prove that Docker is able to use your GPU(s) later in this document.
```
$ sudo apt install nvtop
```
_Note that you may need to restart your host after installing `nvtop`._
If you run `nvtop` you will see a graph similar to this:
![Screenshot of nvtop](images/nvtop.png "Screenshot of nvtop")
You are now ready to install Docker.
## What is Docker and why is it recommended for training a model with Coqui STT?
[Docker](https://www.docker.com/why-docker) is virtualization software that allows a consistent collection of software, dependencies and environments to be packaged into a _container_ which is then run on a host, or many hosts. It is one way to manage the many software dependencies which are required for training a model with 🐸STT, particularly if using an NVIDIA GPU.
## Install Docker
First, you must install Docker on your host. Follow the [instructions on the Docker website](https://docs.docker.com/engine/install/ubuntu/).
### Ensure that you create a `docker` group and that you add yourself to this group
Once you have installed Docker, be sure to follow the [post-installation](https://docs.docker.com/engine/install/linux-postinstall/) steps. These include setting up a `docker` group and adding your user account to this group. If you do not follow this step, you will need to use `sudo` with every Docker command, and this can have unexpected results.
---
If you try to use `docker` commands and constantly receive permission warnings, it's likely that you have forgotten this step.
---
### Install the `nvidia-container-toolkit`
Next, we need to install `nvidia-container-toolkit`. This is necessary to allow Docker to be able to access the GPU(s) on your machine for training.
First, add the repository for your distribution, following the instructions on the [NVIDIA Docker GitHub page](https://nvidia.github.io/nvidia-docker/). For example:
```
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
```
Next, install `nvidia-container-toolkit`:
```
$ sudo apt-get install -y nvidia-container-toolkit
```
## Pulling down a pre-built Coqui STT Docker image
Once you have installed Docker and the `nvidia-container-toolkit`, you are ready to obtain a Docker _image_. Although it's [possible to build your own Docker image from scratch](), we're going to use a pre-built 🐸STT training image which is hosted on the GitHub Container Registry (`ghcr.io`). Once the image is pulled down, you can then create a Docker _container_ from the image to perform training.
As you become more proficient with using 🐸STT, you can use the pre-built Docker image as the basis for your own images.
**Running this command will download several gigabytes of data. Do not run it if you are on a limited or metered internet connection.**
```
$ docker pull ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.4
v0.10.0-alpha.4: Pulling from coqui-ai/stt-train
Digest: sha256:0f8ee9208874a925618e527f1d06ea9065dd09c700972cba740884e7e7e4cd17
Status: Image is up to date for ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.4
ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.4
```
<!-- FIXME uncomment once we have CI publishing of these images:
If you do not wish to use the `v0.9.3` 🐸STT image, [a list of previous images is available](https://github.com/orgs/coqui-ai/packages/container/stt-train/versions).
You will now see the `ghcr.io/coqui-ai/stt-train` image when you run the command `docker image ls`:
```
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/coqui-ai/stt-train v0.10.0-alpha.4 d145cb0930ea 37 minutes ago 5.12GB
``` -->
### Testing the image by creating a container and running a script
Now that you have your Docker image pulled down, you can create a _container_ from the image. Here, we're going to create a container and run a simple test to make sure that the image is working correctly.
_Note that you can refer to Docker images by `id` (such as `7cdc0bb1fe2a`), or by the image's name and `tag`. Here, we will be using the image name and `tag`, i.e. `ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.4`._
```
$ docker run -it --name stt-test --entrypoint /bin/bash ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.4
```
The `--entrypoint` option passed to `docker run` tells Docker to run `/bin/bash` (i.e. a shell) after creating the container.
This command assumes that `/bin/bash` will be invoked as the `root` user. This is necessary, as the Docker container needs to make changes to the filesystem. If you use the `-u $(id -u):$(id -g)` switches, you will tell Docker to invoke `/bin/bash` as the current user of the host that is running the Docker container. You will likely encounter `permission denied` errors while running training.
When you run the above command, you should see the following prompt:
```
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...
root@d14b2d062526:/STT#
```
In a separate terminal, you can see that you now have a Docker container running:
```
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d14b2d062526 7cdc0bb1fe2a "/bin/bash" About a minute ago Up About a minute compassionate_rhodes
```
🐸STT includes a number of convenience scripts in the `bin` directory. They are named for the corpus they are configured for. To ensure that your Docker environment is functioning correctly, run one of these scripts (in the terminal session where your container is running).
```
root@d14b2d062526:/STT# ./bin/run-ldc93s1.sh
```
This will train on a single audio file for 200 epochs.
We've now proved that the image is working correctly.
## Setting up a bind mount to store persistent data
Now that we have a Docker image pulled down, we can create a _container_ from the image, and do training from within the container.
However, Docker containers are not _persistent_. This means that if the host on which the container is running reboots, or there is a fatal error within the container, all the results stored _within_ the container will be lost. We need to set up _persistent storage_ so that the checkpoints and exported model are stored _outside_ the container.
To do this we create a [bind mount](https://docs.docker.com/storage/bind-mounts/) for Docker. A _bind mount_ allows Docker to store files externally to the container, on your local filesystem.
First, stop and remove the container we created above.
```
$ docker rm -f stt-test
```
Next, we will create a new container, except this time we will also create a _bind mount_ so that it can store persistent data.
First, we create a directory on our local system for the bind mount.
```
$ mkdir stt-data
```
Next, we create a container and instruct it to use a bind mount to the directory.
```
$ docker run -it \
--entrypoint /bin/bash \
--name stt-train \
--gpus all \
--mount type=bind,source="$(pwd)"/stt-data,target=/STT/stt-data \
7cdc0bb1fe2a
```
We also pass the `--gpus all` parameter here to instruct Docker to use all available GPUs. If you need to restrict the use of GPUs, then please consult the [Docker documentation](https://docs.docker.com/config/containers/resource_constraints/). You can also restrict the amount of memory or CPU(s) that the Docker container consumes. This might be useful if you need to use the host that you're training on _at the same time_ as the training is occurring, or if you're on a shared host or cluster (for example at a university).
From within the container, the `stt-data` directory will now be available:
```
root@e964b1e5a60c:/STT# ls | grep stt-data
stt-data
```
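If you start containers often, it can help to assemble the `docker run` command programmatically. The following is a minimal sketch, not part of 🐸STT: the helper name `docker_run_command` is hypothetical, and it only uses the flags shown above. It resolves the bind mount source to an absolute path, which Docker requires for bind mounts.

```python
import shlex
from pathlib import Path


def docker_run_command(image, data_dir="stt-data", name="stt-train"):
    """Return the `docker run` argument list used in this section.

    `docker_run_command` is a hypothetical helper for illustration only.
    """
    # Bind mounts need an absolute path on the host.
    source = Path(data_dir).resolve()
    target = f"/STT/{Path(data_dir).name}"
    return [
        "docker", "run", "-it",
        "--entrypoint", "/bin/bash",
        "--name", name,
        "--gpus", "all",
        "--mount", f"type=bind,source={source},target={target}",
        image,
    ]


# Print a copy-and-paste friendly command line.
command = docker_run_command("ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.4")
print(shlex.join(command))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run` on the host.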
You are now ready to begin [training](TRAINING.md) your model.
## Extending the base `stt-train` Docker image for your needs
As you become more comfortable training speech recognition models with 🐸STT, you may wish to extend the base Docker image. You can do this using the `FROM` instruction in a `Dockerfile`, for example:
```
# Custom Dockerfile for training models using 🐸STT
# Get the latest 🐸STT image
FROM ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.4
# Install nano editor
RUN apt-get -y update && apt-get install -y nano
# Install sox for inference and for processing Common Voice data
RUN apt-get -y update && apt-get install -y sox
```
You can then use `docker build` with this `Dockerfile` to build your own custom Docker image.
---
[Home](README.md) | [Previous - Acoustic Model and Language Model](AM_vs_LM.md) | [Next - Training your model](TRAINING.md)

47
doc/playbook/EXAMPLES.md Normal file
View File

@ -0,0 +1,47 @@
[Home](README.md) | [Previous - Deploying your model](DEPLOYMENT.md)
# Example applications of Coqui STT
## Contents
- [Example applications of Coqui STT](#example-applications-of-coqui-stt)
* [Contents](#contents)
* [Coqui STT worked examples repository](#coqui-stt-worked-examples-repository)
* [Other suggestions for integrating Coqui STT](#other-suggestions-for-integrating-coqui-stt)
+ [A stand-alone transcription tool](#a-stand-alone-transcription-tool)
+ [Key Word Search in spoken audio](#key-word-search-in-spoken-audio)
+ [An interface to a voice-controlled application](#an-interface-to-a-voice-controlled-application)
## Coqui STT worked examples repository
There is a repository of examples of using 🐸STT for several use cases, including sample code, in the [🐸STT examples](https://github.com/coqui-ai/STT-examples/) repository.
The examples here include:
* [Android microphone streaming and transcription](https://github.com/coqui-ai/STT-examples/tree/r0.9/android_mic_streaming)
* [🐸STT running in an Electron app using ReactJS](https://github.com/coqui-ai/STT-examples/tree/r0.9/electron)
## Other suggestions for integrating Coqui STT
There are many other possibilities for incorporating speech recognition into your projects using 🐸STT.
### A stand-alone transcription tool
Accurate human-created transcriptions require someone who has been professionally trained, and their time is expensive. High quality transcription of audio may take up to 10 hours of transcription time per one hour of audio. With 🐸STT, you could increase transcriber productivity with a human-in-the-loop approach, in which 🐸STT generates a first-pass transcription, and the transcriber fixes any errors.
### Key Word Search in spoken audio
Key Word Search in audio is a simple task, but it takes considerable time. Given a collection of 100 hours of audio, if you only want to find all the instances of the word "coronavirus", instead of paying a professional transcriber, you might just pay someone to listen to all the audio and note when the word "coronavirus" was spoken. Nevertheless, you will still be paying for human time relative to the amount of audio you wish to search.
A better approach would be to modify 🐸STT to listen better for words of interest (e.g. "coronavirus") and run 🐸STT over all audio in parallel. Afterwards, a human may verify that the identified segments of audio contain the words of interest. This is another human-in-the-loop example, which makes the humans considerably more time-efficient.
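As a rough sketch of the search step, suppose the recogniser gives you each transcribed character together with the time at which it was spoken. The `(character, start_time)` pair format and the `find_keyword` helper below are assumptions made for illustration; adapt them to whatever metadata your transcription step actually returns.

```python
def find_keyword(tokens, keyword):
    """Return the start time (in seconds) of each occurrence of `keyword`.

    `tokens` is a list of (character, start_time) pairs; this format is an
    assumption for the example, not a 🐸STT data structure.
    """
    hits = []
    word, word_start = "", None
    # The extra trailing space flushes the final word.
    for char, start in list(tokens) + [(" ", None)]:
        if char == " ":
            if word.lower() == keyword.lower():
                hits.append(word_start)
            word, word_start = "", None
        else:
            if word_start is None:
                word_start = start
            word += char
    return hits


# Hypothetical character timings for the transcript "the coronavirus".
text = "the coronavirus"
tokens = [(char, round(0.1 * i, 1)) for i, char in enumerate(text)]
print(find_keyword(tokens, "coronavirus"))
```

A human reviewer then only needs to listen to the audio around each returned timestamp.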
### An interface to a voice-controlled application
Another example application of 🐸STT is as an interface to a voice-controlled application. This is an instance where a successful application (e.g. a digital assistant or smart speaker) cannot contain a human-in-the-loop. As such, 🐸STT is not making human-time more efficient, but rather, 🐸STT is enabling technologies which were previously not possible.
One example of a voice-controlled application using 🐸STT is the [voice add-on for WebThings.IO](https://github.com/WebThingsIO/voice-addon).
---
[Home](README.md) | [Previous - Deploying your model](DEPLOYMENT.md)

54
doc/playbook/INTRO.md Normal file
View File

@ -0,0 +1,54 @@
[Home](README.md) | [Next - About Coqui STT](ABOUT.md)
# Introduction
## Contents
- [Introduction](#introduction)
* [Contents](#contents)
* [Is this guide for you?](#is-this-guide-for-you-)
* [Setting expectations](#setting-expectations)
* [Setting up for success](#setting-up-for-success)
* [Checklist for success](#checklist-for-success)
## Is this guide for you?
You're probably here because you're interested in automatic speech recognition (ASR) - the process of converting phrases spoken by humans into written form. There have been significant advances in speech recognition in recent years, driven both by new deep learning algorithms, and by advances in hardware capable of the large volume of computations required by those algorithms. Several new tools are available to assist developers with both training speech recognition models and using those models for inference - 🐸STT being one of them.
If you're trying to get 🐸STT working for your application, your data, or a new language, you've come to the right place! You can easily download a pre-trained 🐸STT model for English, but it might not work for you out of the box. No worries, with a little tweaking you can get 🐸STT working for most anything!
Specifically, this guide will help you create a working 🐸STT model for a new language. Along the way, you will learn some best practices for speech recognition and data wrangling.
## Setting expectations
You might think that speech recognition is solved for English, and as such, with a little work you can solve speech recognition for a new language. This is false for two reasons. Firstly, speech recognition is far from solved for English, and secondly it is unlikely you will be able to create something that works as well as a pre-trained English 🐸STT model unless you have a few thousand hours of data.
Nevertheless, with some tips and tricks, you can still create useful voice technology for a non-English language!
## Setting up for success
Speech recognition is a _statistical_ process. Speech recognition models are _trained_ on large amounts of voice data, using statistical techniques to "learn" associations between sounds, in the form of `.wav` files, and characters, that are found in an alphabet. Because speech recognition is statistical, it does not have "bugs" in the sense that computer code has bugs; instead, anomalies or biases in the data used for a speech recognition model mean that the resulting model will likely exhibit those biases.
Speech recognition still requires trial and error - with the data that is used to train a model, the language model or scorer that is used to form words from characters, and with specific training settings. "Debugging" speech recognition models means finding ways to make the data, the alphabet and the scorer more _accurate_. That is, making them mirror as closely as possible the real-world conditions in which the speech recognition model will be used. If your speech recognition model will be used to help transcribe geographic place names, then your voice data and your scorer need to cover those place names.
The success of any voice technology depends on a constellation of factors, and the accuracy of your speech recognizer is just one factor. To the extent that an existing voice technology works, it works because the creators have eliminated sources of failure. Think about one of the oldest working voice technologies: spoken digit recognition. When you call a bank you might hear a recording like this: "Say ONE to learn about credit cards, say TWO to learn about debit cards, or say ZERO to speak to a representative". These systems usually work well, but you might not know that if you answer with anything other than a single digit, the system will completely fail to understand you. Spoken digit recognition systems are set up for success because they've re-formulated an open-ended transcription problem as a simple classification problem. In this case, as long as the system is able to distinguish spoken digits from one another, it will succeed.
We will talk about ways in which you can constrain the search space of a problem and bias a model towards a set of words that you actually care about. If you want to make a useful digit recognizer, it doesn't matter if your model has an 85% Word Error Rate (WER) when transcribing the nightly news. All that matters is your model can correctly identify spoken digits. It is key to align what you care about with what you are measuring.
If you have ever used a speech technology and it worked flawlessly, the creators of the product set themselves up for success. This is what you must also do in your application.
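To make "measure what you care about" concrete, here is a minimal sketch of how Word Error Rate can be computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function name `word_error_rate` is our own and this is not the evaluation code used by 🐸STT; the point is that you should compute it on utterances from your use case (spoken digits, say) rather than on unrelated audio.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("one two three", "one too three"))
```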
## Checklist for success
To help set you up for success, we've included a checklist below.
- [ ] Have a clear understanding of the intended _use case_. What phrases will be used in the use case that you want to recognise?
- [ ] Gather as many audio samples as possible, and ensure that they cover all the phrases expected in the use case. Remember, you will need hundreds of hours of audio data for large vocabulary speech recognition.
- [ ] The language model (scorer) needs to include every word that will be expected to be spoken in your intended use case.
- [ ] The language model (scorer) should _exclude_ any words that are _not_ expected to be spoken in your intended use case, to constrain the model.
- [ ] If your intended use case will have background noise, then your voice data should have background noise.
- [ ] If your intended use case will need to recognise particular accents, then your voice data should contain those accents.
- [ ] You will need access to a Linux host with an NVIDIA GPU, and you should be comfortable operating in a `bash` environment.
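
The two scorer items in the checklist can be checked mechanically. The sketch below is illustrative only: `missing_words` is a hypothetical helper, and the simple regular expression used for tokenising will need adjusting for your language. It lists every word that appears in your expected phrases but not in the text you plan to build the scorer from.

```python
import re


def tokenize(text):
    """Lower-case a sentence and split it into words (a simplification)."""
    return re.findall(r"[\w'-]+", text.lower())


def missing_words(use_case_phrases, scorer_sentences):
    """Return words needed by the use case that the scorer text lacks."""
    vocabulary = set()
    for sentence in scorer_sentences:
        vocabulary.update(tokenize(sentence))
    needed = set()
    for phrase in use_case_phrases:
        needed.update(tokenize(phrase))
    return sorted(needed - vocabulary)


print(missing_words(["say one", "say zero"], ["Say one or two."]))
```

Swapping the two arguments gives the opposite check: words in the scorer text that your use case never needs.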
---
[Home](README.md) | [Next - About Coqui STT](ABOUT.md)

3
doc/playbook/LICENSE.md Normal file
View File

@ -0,0 +1,3 @@
Unless otherwise indicated, the text of documents in this collection is available under the Creative Commons Attribution Share-Alike 3.0 Unported license, or any later version.
[https://creativecommons.org/licenses/by-sa/3.0/](https://creativecommons.org/licenses/by-sa/3.0/)

84
doc/playbook/README.md Normal file
View File

@ -0,0 +1,84 @@
# Coqui STT Playbook
A crash course on training speech recognition models using 🐸STT.
## Quick links
* [STT on GitHub](https://github.com/coqui-ai/STT)
* [STT documentation on ReadTheDocs](https://stt.readthedocs.io/en/latest/)
* [STT discussions on GitHub](https://github.com/coqui-ai/STT/discussions)
* [Common Voice Datasets](https://commonvoice.mozilla.org/en/datasets)
* [How to install Docker](https://docs.docker.com/engine/install/)
## [Introduction](INTRO.md)
Start here. This section will set your expectations for what you can achieve with the STT Playbook, and the prerequisites you'll need to start to train your own speech recognition models.
## [About Coqui STT](ABOUT.md)
Once you know what you can achieve with the STT Playbook, this section provides an overview of STT itself, its component parts, and how it differs from other speech recognition engines you may have used in the past.
## [Formatting your training data](DATA_FORMATTING.md)
Before you can train a model, you will need to collect and format your _corpus_ of data. This section provides an overview of the data format required for STT, and walks through an example in prepping a dataset from Common Voice.
## [The alphabet.txt file](ALPHABET.md)
If you are training a model that uses a different alphabet to English, for example a language with diacritical marks, then you will need to modify the `alphabet.txt` file.
## [Building your own scorer](SCORER.md)
Learn what the scorer does, and how you can go about building your own.
## [Acoustic model and language model](AM_vs_LM.md)
Learn about the differences between STT's _acoustic_ model and _language_ model and how they combine to provide end to end speech recognition.
## [Setting up your training environment](ENVIRONMENT.md)
This section walks you through building a Docker image, and spawning STT in a Docker container with persistent storage. This approach avoids the complexities of dependencies such as `tensorflow`.
## [Training a model](TRAINING.md)
Once you have your training data formatted, and your training environment established, this section will show you how to train a model, and provide guidance for overcoming common pitfalls.
## [Testing a model](TESTING.md)
Once you've trained a model, you will need to validate that it works for the context it's been designed for. This section walks you through this process.
## [Deploying your model](DEPLOYMENT.md)
Once trained and tested, your model can be deployed. This section provides an overview of how you can deploy your model.
## [Applying STT to real world problems](EXAMPLES.md)
This section covers specific use cases where STT can be applied to real world problems, such as transcription, keyword searching and voice controlled applications.
---
## Introductory courses on machine learning
Providing an introduction to machine learning is beyond the scope of this PlayBook. However, having an understanding of machine learning and deep learning concepts will aid your efforts in training speech recognition models with STT.
Here, we've linked to several resources that you may find helpful; they're listed in the order we recommend reading them in.
* [Digital Ocean's introductory machine learning tutorial](https://www.digitalocean.com/community/tutorials/an-introduction-to-machine-learning) provides an overview of different types of machine learning. The diagrams in this tutorial are a great way of explaining key concepts.
* [Google's machine learning crash course](https://developers.google.com/machine-learning/crash-course/ml-intro) provides a gentle introduction to the main concepts of machine learning, including _gradient descent_, _learning rate_, _training, test and validation sets_ and _overfitting_.
* If machine learning is something that sparks your interest, then you may enjoy [the MIT Open Learning Library's Introduction to Machine Learning course](https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/course/), a 13-week college-level course covering perceptrons, neural networks, support vector machines and convolutional neural networks.
---
## How you can help provide feedback on the STT PlayBook
You can help to make the STT PlayBook even better by providing feedback [via a GitHub Issue](https://github.com/coqui-ai/STT-playbook/issues):
* Please _try these instructions_, particularly for building a Docker image and running a Docker container, on multiple distributions of Linux so that we can identify corner cases.
* Please _contribute your tacit knowledge_ - such as:
- common errors encountered in data formatting, environment setup, training and validation
- techniques or approaches for improving the scorer, the alphabet file, or the Word Error Rate (WER) and Character Error Rate (CER)
- case studies of the work you or your organisation have been doing, showing your approaches to data validation, training or evaluation.
* Please identify errors in text - with many eyes, bugs are shallow :-)

310
doc/playbook/SCORER.md Normal file
View File

@ -0,0 +1,310 @@
[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)
# Scorer - language model for determining which words occur together
## Contents
- [Scorer - language model for determining which words occur together](#scorer---language-model-for-determining-which-words-occur-together)
* [Contents](#contents)
+ [What is a scorer?](#what-is-a-scorer-)
+ [Building your own scorer](#building-your-own-scorer)
- [Preparing the text file](#preparing-the-text-file)
- [Using `lm_optimizer.py` to generate values for the parameters `--default_alpha` and `--default_beta` that are used by the `generate_scorer_package` script](#using--lm-optimizerpy--to-generate-values-for-the-parameters----default-alpha--and----default-beta--that-are-used-by-the--generate-scorer-package--script)
* [Additional parameters for `lm_optimizer.py`](#additional-parameters-for--lm-optimizerpy-)
- [Using `generate_lm.py` to create `lm.binary` and `vocab-500000.txt` files](#using--generate-lmpy--to-create--lmbinary--and--vocab-500000txt--files)
- [Generating a `kenlm.scorer` file from `generate_scorer_package`](#generating-a--kenlmscorer--file-from--generate-scorer-package-)
- [Using the scorer file in model training](#using-the-scorer-file-in-model-training)
### What is a scorer?
A scorer is a _language model_ and it is used by 🐸STT to improve the accuracy of transcription. A _language model_ predicts which words are more likely to follow each other. For example, the word `chicken` might be frequently followed by the words `nuggets`, `soup` or `rissoles`, but is unlikely to be followed by the word `purple`. The scorer identifies probabilities of words occurring together.
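The idea can be illustrated with a toy bigram model that simply counts which word follows which. This is a deliberately simplified sketch with names of our own choosing; the real scorer is built with KenLM and uses longer, smoothed n-grams, so treat the numbers here as illustrative only.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            counts[first][second] += 1
    return counts


def next_word_probability(counts, first, second):
    """Estimate P(second | first) from raw counts, without smoothing."""
    followers = counts.get(first, Counter())
    total = sum(followers.values())
    return followers[second] / total if total else 0.0


# A tiny made-up corpus.
corpus = [
    "chicken nuggets",
    "chicken soup",
    "chicken nuggets and chips",
    "purple rain",
]
counts = train_bigrams(corpus)
print(next_word_probability(counts, "chicken", "nuggets"))
print(next_word_probability(counts, "chicken", "purple"))
```

In this corpus `nuggets` follows `chicken` in two of three cases, while `purple` never does, so the model prefers `chicken nuggets`.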
The default scorer used by 🐸STT is trained on the LibriSpeech dataset. The LibriSpeech dataset is based on [LibriVox](https://librivox.org/) - an open collection of out-of-copyright and public domain works.
You may need to build your own scorer - your own _language model_ if:
* You are training 🐸STT in another language
* You are training a speech recognition model for a particular domain - such as technical words, medical transcription, agricultural terms and so on
* You want to improve the accuracy of transcription
**🐸STT supports the _optional_ use of an external scorer - if you're not sure if you need to build your own scorer, stick with the built-in one to begin with**.
### Building your own scorer
_This section assumes that you are using a Docker image and container for training, as outlined in the [environment](ENVIRONMENT.md) section. If you are not using the Docker image, then some of the scripts such as `generate_lm.py` will not be available in your environment._
_This section assumes that you have already trained a model and have a set of **checkpoints** for that model. See the section on [training](TRAINING.md) for more information on **checkpoints**._
🐸STT uses an algorithm called [_connectionist temporal classification_](https://distill.pub/2017/ctc/) or CTC for short, to map between _input_ sequences of audio and _output_ sequences of characters. The mapping between _inputs_ and _outputs_ is called an _alignment_. The alignment between _inputs_ and _outputs_ is not one-to-one; many _inputs_ may make up an _output_. CTC is therefore a _probabilistic_ algorithm. This means that for each _input_ there are many possible _outputs_ that can be selected. A process called _beam search_ is used to identify the possible _outputs_ and select the one with the highest probability. A [language model](AM_vs_LM.md) or _scorer_ helps the _beam search_ algorithm select the best _output_ value. This is why building your own _scorer_ is necessary for training a model on a narrow domain - otherwise the _beam search_ algorithm would probably select the wrong _output_.
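To show how many _inputs_ collapse into one _output_, here is a sketch of the simplest CTC decoding strategy: greedy (best-path) decoding. It is not the beam search with a scorer that 🐸STT uses; it picks the most likely symbol for each audio frame, merges consecutive repeats and drops the special _blank_ symbol. The function name, the two-letter alphabet and the probabilities are all made up for the example.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank_index):
    """Greedy CTC decoding: best symbol per frame, merge repeats, drop blanks."""
    best_per_frame = [
        max(range(len(probs)), key=probs.__getitem__) for probs in frame_probs
    ]
    output = []
    previous = None
    for index in best_per_frame:
        # Emit a character only when the symbol changes and is not the blank.
        if index != previous and index != blank_index:
            output.append(alphabet[index])
        previous = index
    return "".join(output)


# Columns are P("a"), P("b"), P(blank) for each of five audio frames.
frames = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a (repeat, merged with the previous frame)
    [0.1, 0.1, 0.8],  # blank (separates the two a's)
    [0.6, 0.3, 0.1],  # a
    [0.1, 0.8, 0.1],  # b
]
print(ctc_greedy_decode(frames, alphabet=["a", "b"], blank_index=2))
```

Five frames decode to the three characters `aab`. Beam search improves on this by keeping several candidate outputs alive at once, which is where the scorer's word probabilities come in.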
The default _scorer_ used with 🐸STT is trained on Librivox. It's a general model. But let's say that you want to train a speech recognition model for agriculture. If you have the phrase `tomatoes are ...`, a general scorer might identify `red` as the most likely next word - but an agricultural model might identify `ready` as the most likely next word.
The _scorer_ is only used during the _test_ stage of [training](TRAINING.md) (rather than at the _train_ or _validate_ stages) because this is where the _beam search_ decoder determines which words are formed from the identified characters.
The process for building your own _scorer_ has the following steps:
1. Having, or preparing, a text file (in `.txt` or `.txt.gz` format), with one phrase or word on each line. If you are training a speech recognition model for a particular _domain_ - such as technical words, medical transcription, agricultural terms etc, then they should appear in the text file. The text file is used by the `generate_lm.py` script.
2. Using the `lm_optimizer.py` with your dataset (your `.csv` files) and a set of _checkpoints_ to find optimal values of `--default_alpha` and `--default_beta`. The `--default_alpha` and `--default_beta` parameters are used by the `generate_scorer_package` script to assign initial _weights_ to sequences of words.
3. Using the `generate_lm.py` script which is distributed with 🐸STT, along with the text file, to create two files, called `lm.binary` and `vocab-500000.txt`.
4. Downloading the prebuilt `native_client` from the 🐸STT repository on GitHub, and using the `generate_scorer_package` to create a `kenlm.scorer` file.
5. Using the `kenlm.scorer` file as the _external_scorer_ passed to `train.py`, and used for the _test_ phase. The `scorer` does not impact training; it is used for calculating `word error rate` (covered more in [testing](TESTING.md)).
In the following example we will create a custom external scorer file for Bahasa Indonesia (BCP47: `id-ID`).
#### Preparing the text file
This is straightforward. In this example, we will use a file called `indonesian-sentences.txt`. This file should contain phrases that you wish to prioritize recognising. For example, you may want to recognise place names, digits or medical phrases - and you will include these phrases in the `.txt` file.
_These phrases should not be copied from `test.tsv`, `train.tsv` or `validated.tsv` as you will bias the resultant model._
```
~/stt-data$ ls cv-corpus-6.1-2020-12-11/id
total 6288
4 drwxr-xr-x 3 root root 4096 Feb 24 19:01 ./
4 drwxr-xr-x 4 root root 4096 Feb 11 07:09 ../
1600 drwxr-xr-x 2 root root 1638400 Feb 9 10:43 clips/
396 -rwxr-xr-x 1 root root 401601 Feb 9 10:43 dev.tsv
104 -rwxr-xr-x 1 root root 103332 Feb 9 10:43 invalidated.tsv
1448 -rwxr-xr-x 1 root root 1481571 Feb 9 10:43 other.tsv
28 -rwxr-xr-x 1 root root 26394 Feb 9 10:43 reported.tsv
392 -rwxr-xr-x 1 root root 399790 Feb 9 10:43 test.tsv
456 -rwxr-xr-x 1 root root 465258 Feb 9 10:43 train.tsv
1848 -rwxr-xr-x 1 root root 1889606 Feb 9 10:43 validated.tsv
```
The `indonesian-sentences.txt` file is stored on the local filesystem in the `stt-data` directory so that the Docker container can access it.
```
~/stt-data$ ls | grep indonesian-sentences
476 -rw-rw-r-- 1 root root 483481 Feb 24 19:02 indonesian-sentences.txt
```
The `indonesian-sentences.txt` file is formatted with one phrase per line, eg:
```
Kamar adik laki-laki saya lebih sempit daripada kamar saya.
Ayah akan membunuhku.
Ini pulpen.
Akira pandai bermain tenis.
Dia keluar dari ruangan tanpa mengatakan sepatah kata pun.
Besok Anda akan bertemu dengan siapa.
Aku mengerti maksudmu.
Tolong lepas jasmu.
```
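One way to assemble such a file while honouring the warning above is to drop blank lines, duplicates, and any sentence that also appears in your held-out data. This is a sketch under our own assumptions: `build_scorer_text` and `normalise` are hypothetical helpers, and matching is done on lower-cased, whitespace-normalised text only.

```python
def normalise(sentence):
    """Lower-case and collapse whitespace so near-identical lines match."""
    return " ".join(sentence.lower().split())


def build_scorer_text(candidates, held_out):
    """Return candidate lines without blanks, duplicates or held-out sentences."""
    excluded = {normalise(sentence) for sentence in held_out}
    seen = set()
    lines = []
    for sentence in candidates:
        key = normalise(sentence)
        if not key or key in excluded or key in seen:
            continue
        seen.add(key)
        lines.append(sentence.strip())
    return lines


candidates = ["Ini pulpen.", "ini  pulpen.", "Aku mengerti maksudmu.", ""]
held_out = ["Aku mengerti maksudmu."]
print(build_scorer_text(candidates, held_out))
```

The returned lines can then be written one per line to `indonesian-sentences.txt`.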
#### Using `lm_optimizer.py` to generate values for the parameters `--default_alpha` and `--default_beta` that are used by the `generate_scorer_package` script
The `lm_optimizer.py` script is located in the `STT` directory if you have set up your [environment](ENVIRONMENT.md) as outlined in the PlayBook.
```
root@57e6bf4eeb1c:/STT# ls | grep lm_optimizer.py
lm_optimizer.py
```
This script takes a set of test data (`--test_files`), and a `--checkpoint_dir` parameter and determines the optimal `--default_alpha` and `--default_beta` values.
Call `lm_optimizer.py` and pass it the `--test_files` and a `--checkpoint_dir` directory.
```
root@57e6bf4eeb1c:/STT# python3 lm_optimizer.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints
```
In general, any change to _geometry_ - the shape of the neural network - needs to be reflected here, otherwise the _checkpoint_ will fail to load. It's always a good idea to record the parameters you used to train a model. For example, if you trained your model with a `--n_hidden` value that is different to the default (`1024`), you should pass the same `--n_hidden` value to `lm_optimizer.py`, for example:
```
root@57e6bf4eeb1c:/STT# python3 lm_optimizer.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--n_hidden 4
```
`lm_optimizer.py` will create a new _study_.
```
[I 2021-03-05 02:04:23,041] A new study created in memory with name: no-name-38c8e8cb-0cc2-4f53-af0e-7a7bd3bc5159
```
It will then run _testing_ and output a trial score.
```
[I 2021-03-02 12:48:15,336] Trial 0 finished with value: 1.0 and parameters: {'lm_alpha': 1.0381777700987271, 'lm_beta': 0.02094605391055826}. Best is trial 0 with value: 1.0.
```
By default, `lm_optimizer.py` will run `6` trials, and identify the trial with the best parameters.
```
[I 2021-03-02 17:50:00,662] Trial 6 finished with value: 1.0 and parameters: {'lm_alpha': 3.1660260368070423, 'lm_beta': 4.7438794403688735}. Best is trial 0 with value: 1.0.
```
The optimal parameters `--default_alpha` and `--default_beta` are now known, and can be used with `generate_scorer_package`. In this case, the optimal settings are:
```
--default_alpha 1.0381777700987271
--default_beta 0.02094605391055826
```
because `Trial 0` was the best trial.
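If you run many trials, reading the best parameters off the log by hand becomes tedious. The sketch below parses log lines in the format shown above. It assumes that a lower trial value is better and that ties go to the earliest trial, which matches the `Best is trial 0` message in the sample output; `best_trial` is our own helper, not part of `lm_optimizer.py`.

```python
import ast
import re

TRIAL_PATTERN = re.compile(
    r"Trial (\d+) finished with value: ([\d.eE+-]+) and parameters: (\{.*?\})"
)


def best_trial(log_lines):
    """Return (trial_number, value, parameters) for the lowest-valued trial."""
    best = None
    for line in log_lines:
        match = TRIAL_PATTERN.search(line)
        if not match:
            continue
        number = int(match.group(1))
        value = float(match.group(2))
        parameters = ast.literal_eval(match.group(3))
        # Strict comparison keeps the earliest trial when values are tied.
        if best is None or value < best[1]:
            best = (number, value, parameters)
    return best


log = [
    "[I 2021-03-02 12:48:15,336] Trial 0 finished with value: 1.0 and parameters: "
    "{'lm_alpha': 1.0381777700987271, 'lm_beta': 0.02094605391055826}. "
    "Best is trial 0 with value: 1.0.",
    "[I 2021-03-02 17:50:00,662] Trial 6 finished with value: 1.0 and parameters: "
    "{'lm_alpha': 3.1660260368070423, 'lm_beta': 4.7438794403688735}. "
    "Best is trial 0 with value: 1.0.",
]
number, value, parameters = best_trial(log)
print(f"--default_alpha {parameters['lm_alpha']}")
print(f"--default_beta {parameters['lm_beta']}")
```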
##### Additional parameters for `lm_optimizer.py`
There are additional parameters that may be useful.
**Please be aware that these parameters may increase processing time significantly - even to a few days - depending on your hardware.**
* `--n_trials` specifies how many trials `lm_optimizer.py` should run to find the optimal values of `--default_alpha` and `--default_beta`. The default is `6`. You may wish to reduce `--n_trials`.
* `--lm_alpha_max` specifies a maximum bound for `--default_alpha`. The default is `0.931289039105002`. You may wish to reduce `--lm_alpha_max`.
* `--lm_beta_max` specifies a maximum bound for `--default_beta`. The default is `1.1834137581510284`. You may wish to reduce `--lm_beta_max`.
For example:
```
root@57e6bf4eeb1c:/STT# python3 lm_optimizer.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--n_hidden 4 \
--n_trials 3 \
--lm_alpha_max 0.92 \
--lm_beta_max 1.05
```
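When many trials are run, the best one can be picked out of the log with a short script rather than by eye. The following is a minimal sketch and not part of 🐸STT: the names `TRIAL_RE` and `best_trial` are hypothetical, the log lines are copied from the output shown earlier, and it assumes that a lower trial value is better because the value is an error rate.

```python
import ast
import re

# Matches the "Trial N finished ..." lines in the log output shown earlier.
TRIAL_RE = re.compile(
    r"Trial (\d+) finished with value: ([0-9.eE+-]+) "
    r"and parameters: (\{.*?\})"
)


def best_trial(log_text):
    """Return (trial_number, value, params) for the lowest-scoring trial."""
    best = None
    for match in TRIAL_RE.finditer(log_text):
        number = int(match.group(1))
        value = float(match.group(2))
        params = ast.literal_eval(match.group(3))
        # Assumption: lower is better. Strict "<" keeps the earliest trial on ties.
        if best is None or value < best[1]:
            best = (number, value, params)
    return best


LOG = """
[I 2021-03-02 12:48:15,336] Trial 0 finished with value: 1.0 and parameters: {'lm_alpha': 1.0381777700987271, 'lm_beta': 0.02094605391055826}. Best is trial 0 with value: 1.0.
[I 2021-03-02 17:50:00,662] Trial 6 finished with value: 1.0 and parameters: {'lm_alpha': 3.1660260368070423, 'lm_beta': 4.7438794403688735}. Best is trial 0 with value: 1.0.
"""

number, value, params = best_trial(LOG)
print(f"Best trial: {number}")
print(f"--default_alpha {params['lm_alpha']}")
print(f"--default_beta {params['lm_beta']}")
```

The log already reports `Best is trial 0` on every line, so the script is mainly useful for turning that trial's parameters into flags for `generate_scorer_package`.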
#### Using `generate_lm.py` to create `lm.binary` and `vocab-500000.txt` files
We then use the `generate_lm.py` script that comes with 🐸STT to create a [_trie file_](https://en.wikipedia.org/wiki/Trie). The _trie file_ represents associations between words, so that when speech is transcribed, words that are more closely associated with each other are more likely to be output by 🐸STT.
The _trie file_ is produced using a software package called [KenLM](https://kheafield.com/code/kenlm/). KenLM is designed to create large language models that are able to be filtered and queried easily.
First, create a directory in the `stt-data` directory to store your `lm.binary` and `vocab-500000.txt` files:
```
stt-data$ mkdir indonesian-scorer
```
Then, use the `generate_lm.py` script as follows:
```
cd data/lm
python3 generate_lm.py \
--input_txt /STT/stt-data/indonesian-sentences.txt \
--output_dir /STT/stt-data/indonesian-scorer \
--top_k 500000 --kenlm_bins /STT/native_client/kenlm/build/bin/ \
--arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie
```
_Note: the `/STT/native_client/kenlm/build/bin/` is the path to the binary files for `kenlm`. If you are using the Docker image and container (explained on the [environment page of the PlayBook](ENVIRONMENT.md)), then `/STT/native_client/kenlm/build/bin/` is the correct path to use. If you are not using the Docker environment, your path may vary._
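The `--top_k 500000` option and the `vocab-500000.txt` file name suggest that the vocabulary is limited to the most frequent words in the input text. As a rough illustration of that idea only, here is a sketch that counts words and keeps the `k` most common. It is an assumption about the general technique, not the code in `generate_lm.py`; the function name `top_k_words` and the sample sentences are made up for the example.

```python
from collections import Counter


def top_k_words(sentences, k):
    """Return the k most frequent whitespace-separated words, most common first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return [word for word, _ in counts.most_common(k)]


# Made-up sentences, standing in for the lines of indonesian-sentences.txt.
sentences = [
    "hari ini hujan",
    "hari ini cerah",
    "hari libur",
]
vocab = top_k_words(sentences, k=2)
print(vocab)
```

With a real corpus, a large `k` such as `500000` will usually keep every word, as in the run in this section, where only 6021 unique words were read from the vocabulary file.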
You should now have a `lm.binary` and `vocab-500000.txt` file in your `indonesian-scorer` directory:
```
stt-data$ ls indonesian-scorer/
total 1184
4 drwxrwxr-x 2 root root 4096 Feb 25 23:13 ./
4 drwxrwxr-x 5 root root 4096 Feb 26 09:24 ../
488 -rw-r--r-- 1 root root 499594 Feb 24 19:05 lm.binary
52 -rw-r--r-- 1 root root 51178 Feb 24 19:05 vocab-500000.txt
```
#### Generating a `kenlm.scorer` file from `generate_scorer_package`
Next, we need to install the `native_client` package, which contains the `generate_scorer_package`. This is _not_ pre-built into the 🐸STT Docker image.
The `generate_scorer_package`, once installed via the `native_client` package, is usable on _all platforms_ supported by 🐸STT. This is so that developers can generate scorers _on-device_, such as on an Android device, or Raspberry Pi 3.
To install `generate_scorer_package`, first download the relevant `native_client` package from the [🐸STT GitHub releases page](https://github.com/coqui-ai/STT/releases/tag/v0.9.3) into the `data/lm` directory. The Docker image uses Ubuntu Linux, so you should use either the `native_client.amd64.cuda.linux.tar.xz` package if you are using `cuda` or the `native_client.amd64.cpu.linux.tar.xz` package if not.
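The easiest way to download the package and extract it is to pipe `curl` into `tar`, using `curl -L [URL] | tar -Jxvf -`. The `-L` option tells `curl` to follow the redirect that GitHub issues for release downloads, and the trailing `-` tells `tar` to read the archive from standard input:
```
root@dcb62aada58b:/STT/data/lm# curl -L https://github.com/coqui-ai/STT/releases/download/v0.9.3/native_client.amd64.cuda.linux.tar.xz | tar -Jxvf -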
libstt.so
generate_scorer_package
LICENSE
stt
coqui-stt.h
README.coqui
```
You can now generate a `kenlm` scorer file.
```
root@dcb62aada58b:/STT/data/lm# ./generate_scorer_package \
--alphabet ../alphabet.txt \
--lm ../../stt-data/indonesian-scorer/lm.binary \
--vocab ../../stt-data/indonesian-scorer/vocab-500000.txt \
--package kenlm-indonesian.scorer \
--default_alpha 0.931289039105002 \
--default_beta 1.1834137581510284
6021 unique words read from vocabulary file.
Doesn't look like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Package created in kenlm-indonesian.scorer.
```
The message `Doesn't look like a character based (Bytes Are All You Need) model.` is _not_ an error.
If you receive the error message:
```
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Error: Cant parse scorer file, invalid header. Try updating your scorer file.
Error loading language model file: Invalid magic in trie header.
```
then you should add the parameter `--force_bytes_output_mode` when calling `generate_scorer_package`. This error most often occurs when working with languages whose [alphabets](ALPHABET.md) contain a large number of characters, such as Mandarin. `--force_bytes_output_mode` forces the _decoder_ to predict `UTF-8` bytes instead of characters. For more information, [please see the 🐸STT documentation](https://stt.readthedocs.io/en/master/Decoder.html#bytes-output-mode). For example:
```
root@dcb62aada58b:/STT/data/lm# ./generate_scorer_package \
--alphabet ../alphabet.txt \
--lm ../../stt-data/indonesian-scorer/lm.binary \
--vocab ../../stt-data/indonesian-scorer/vocab-500000.txt \
--package kenlm-indonesian.scorer \
--default_alpha 0.931289039105002 \
--default_beta 1.1834137581510284 \
--force_bytes_output_mode True
```
The `kenlm-indonesian.scorer` file is stored in the `/STT/data/lm` directory within the Docker container. Copy it to the `stt-data` directory.
```
root@dcb62aada58b:/STT/data/lm# cp kenlm-indonesian.scorer ../../stt-data/indonesian-scorer/
```
```
root@dcb62aada58b:/STT/stt-data/indonesian-scorer# ls -las
total 1820
4 drwxrwxr-x 2 1000 1000 4096 Feb 26 21:56 .
4 drwxrwxr-x 5 1000 1000 4096 Feb 25 22:24 ..
636 -rw-r--r-- 1 root root 648000 Feb 26 21:56 kenlm-indonesian.scorer
488 -rw-r--r-- 1 root root 499594 Feb 24 08:05 lm.binary
52 -rw-r--r-- 1 root root 51178 Feb 24 08:05 vocab-500000.txt
```
#### Using the scorer file during the test phase of training
You now have your own scorer file that can be used during the test phase of the model training process using the `--scorer` parameter.
For example:
```
python3 train.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints-newscorer-id \
--export_dir stt-data/exported-model-newscorer-id \
--n_hidden 2048 \
--scorer stt-data/indonesian-scorer/kenlm-indonesian.scorer
```
For more information on scorer files, refer to the [🐸STT documentation](https://stt.readthedocs.io/en/latest/Scorer.html).
---
[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)

163
doc/playbook/TESTING.md Normal file

@ -0,0 +1,163 @@
[Home](README.md) | [Previous - Training your model](TRAINING.md) | [Next - Deploying your model](DEPLOYMENT.md)
# Testing and evaluating your trained model
## Contents
- [Testing and evaluating your trained model](#testing-and-evaluating-your-trained-model)
* [Contents](#contents)
* [Gathering training information](#gathering-training-information)
* [Word Error Rate, Character Error Rate, loss and model performance](#word-error-rate--character-error-rate--loss-and-model-performance)
* [Acoustic model and language model working together](#acoustic-model-and-language-model-working-together)
* [Heuristics](#heuristics)
* [Fine tuning and transfer learning](#fine-tuning-and-transfer-learning)
_This section of the PlayBook covers testing your trained model and setup before [deployment](DEPLOYMENT.md). If you need to test the 🐸STT source code itself, please consult the source code tests._
Let's say that you've already trained an acoustic model and a language model (a [scorer](SCORER.md)). Congratulations! But before you [deploy](DEPLOYMENT.md) your setup, you will need to evaluate how well it will work in practice - on your intended use case.
We're talking here about a _setup_ rather than a trained _model_ on purpose, because several factors influence how well a _setup_ performs in real life, and you need to keep all of them in mind. The acoustic model and language model work with each other to turn speech into text, and there are lots of ways (for example, decoding hyperparameter settings) in which you can combine those two models.
## Gathering training information
When you invoked `train.py` in the [training](TRAINING.md) section, and trained a model, the training would have finished by printing out a set of WER and CER metrics. It would have looked like this:
```
Testing model on stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv
Test epoch | Steps: 1844 | Elapsed Time: 0:51:11
Test on stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv - WER: 1.000000, CER: 0.824103, loss: 104.989326
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.873786, loss: 317.729767
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_23819387.wav
- src: "kami percaya bahwa perdamaian dari koeksistensi dua sistem sosial yang berbeda sepenuhnya bisa terwujud"
- res: "aaaaaaaaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.851485, loss: 295.564240
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_19748999.wav
- src: "jika anda mencari informasi tentang pergerakan esperanto di indonesia silakan kunjungi halaman webnya"
- res: "aaaaaaaaaaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.875000, loss: 283.844696
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_23819383.wav
- src: "indah memiliki standar hidup yang tinggi tidak heran dia dikenal sebagai orang yang perfeksionis"
- res: "aaaaaaaaaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.818182, loss: 276.511597
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_24015532.wav
- src: "selain itu bahasa gaul juga menciptakan kosakata baru yang terbentuk melalui kaidah kaidah tertentu"
- res: "aaaaaaaaaaaaaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.820000, loss: 269.262909
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_24015257.wav
- src: "berbagai bahasa daerah dan bahasa asing menjadi bahasa serapan dan kemudian menjadi bahasa indonesia"
- res: "aaaaaaaaaaaaaaaaaa"
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.800000, loss: 97.870811
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_20954705.wav
- src: "pemandangan dari hotel sangat indah"
- res: "aaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.941176, loss: 97.848030
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_20387916.wav
- src: "hari ini hujan turun rintik rintik"
- res: "aaaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.800000, loss: 97.800034
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_20879262.wav
- src: "berapa biaya sewa untuk ruangan ini"
- res: "aaaaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.705882, loss: 97.773476
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_19611909.wav
- src: "saya bukan gay tapi pacar saya gay"
- res: "aaaaaaaaaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.806452, loss: 97.725914
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_24018261.wav
- src: "selamat datang di san fransisco"
- res: "aaaaaaaaaaa"
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.800000, loss: 25.830986
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22546523.wav
- src: "tidak"
- res: "aaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.333333, loss: 25.499653
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22185104.wav
- src: "nol"
- res: "aaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.800000, loss: 23.874924
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22546522.wav
- src: "empat"
- res: "aaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 22.441967
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22528020.wav
- src: "tiga"
- res: "aaaa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 21.356133
- wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22412536.wav
- src: "lima"
- res: "aaaa"
--------------------------------------------------------------------------------
```
_Note: the WER and CER on this output example are both poor because a custom scorer for the language hasn't been built yet._
If you didn't keep the training information, you can re-run just the _testing_ part of training, as long as you stored _checkpoints_ while training, by using the following command:
```
root@9d052f0c3dcf:/STT# python3 train.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints
```
By passing just the `--test_files` parameter and the `--checkpoint_dir` parameter, `train.py` will re-run testing. Note that this command will fail if you don't have _checkpoints_ stored.
## Word Error Rate, Character Error Rate, loss and model performance
During acoustic model [training](TRAINING.md) with TensorFlow, you hopefully saw the training and validation _loss_ go down over time. At the end of training, 🐸STT would have printed scores for your model called the _Word Error Rate (WER)_ and _Character Error Rate (CER)_.
The WER measures how many _word_-level errors 🐸STT made relative to the number of words in the reference transcript, so a lower value is better. It is generally a measure of how well the language model (scorer) is operating. The CER measures how many _character_-level errors 🐸STT made relative to the number of characters in the reference transcript, and is generally a measure of how well the acoustic model is operating, along with an [alphabet](ALPHABET.md) file.
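To make the two scores concrete, here is a minimal sketch that computes them as an edit distance divided by the length of the reference text. This is a common definition and it reproduces the figures in the test output shown earlier (for example, `src: "nol"` and `res: "aaaa"` give a CER of 1.333333), but the function names are illustrative and this is not the 🐸STT implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


print(f"WER: {wer('nol', 'aaaa'):.6f}, CER: {cer('nol', 'aaaa'):.6f}")
print(f"WER: {wer('tiga', 'aaaa'):.6f}, CER: {cer('tiga', 'aaaa'):.6f}")
```

Because the number of edits can be larger than the reference text is long, both scores can exceed `1.0`, as the `nol` example shows.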
WER and CER are the typical scores reported for speech recognition models, but their usefulness will vary a lot, depending on the use case of your _setup_. You should not take the WER as the "be-all" metric for the performance of your _setup_.
Often, the data in your _test_ `.csv` file will be different to the data your model will be asked to perform inference on when it is deployed. It is the performance of your _setup_ at runtime - in a real life context - that is most important.
## Acoustic model and language model working together
Remember, the acoustic model and language model work together to produce your transcript. You might have an acoustic model that seems to perform abysmally, but if you combine it with the right language model, you experience amazing near-perfect accuracy. How is this possible?
The _acoustic model_ is where the majority of training time is spent. The job of the _acoustic model_ is to use the 🐸STT algorithm - a _sequence to sequence_ algorithm - to learn which acoustic signals correspond to which _letters_ (as specified in the `alphabet.txt` file). How accurately it does this is measured by the _character error rate (CER)_.
In many languages though, words that sound the same are spelled differently. These are called [homophones](https://en.wikipedia.org/wiki/Homophone). For example, the words `their`, `they're` and `there` are all pronounced similarly in English, but are spelled differently.
The _language model_ seeks to overcome this challenge. The _language model_, packaged in a [scorer](SCORER.md), predicts which words are likely to follow each other in a sequence. This is known as [n-gram modelling](https://en.wikipedia.org/wiki/N-gram). For example, the words `nugget`, `wings` and `salad` are more likely to occur after the word `chicken` than after the word `ticket`, even though the words `chicken` and `ticket` have similar sounds.
The _acoustic model_ and the _language model_ work together to provide better overall accuracy.
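The `chicken` and `ticket` example can be sketched with a toy bigram model, which is the simplest form of n-gram model. This is only an illustration of the idea: the corpus is made up, the function names are hypothetical, and a real language model built with KenLM uses longer n-grams and smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for first, second in zip(words, words[1:]):
            following[first][second] += 1
    return following


def probability(following, first, second):
    """Estimate P(second | first) from raw counts, with no smoothing."""
    total = sum(following[first].values())
    return following[first][second] / total if total else 0.0


corpus = [  # made-up sentences, for illustration only
    "i ordered a chicken salad",
    "she ordered chicken wings",
    "he bought a ticket yesterday",
    "we ate a chicken nugget",
]
model = train_bigrams(corpus)
print(probability(model, "chicken", "salad"))
print(probability(model, "ticket", "salad"))
```

If the acoustic model is unsure whether it heard `chicken` or `ticket` before `salad`, a model like this gives the decoder a reason to prefer `chicken`.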
## Heuristics
In general, if you have a low CER - that is, your _characters_ are being detected accurately in your _acoustic model_, but you have a high WER - that is, the _words_ are not being detected accurately, this indicates that you should retrain your _language model_ ([scorer](SCORER.md)).
Conversely, if you have a high CER, and a low WER, this indicates that your _acoustic model_ may require fine-tuning.
## Fine tuning and transfer learning
_Fine tuning_ and _transfer learning_ are two processes used to improve the accuracy of an _acoustic model_. _Fine tuning_ is where the same [alphabet.txt](ALPHABET.md) file is used, and training continues from a set of _checkpoints_ from another model. In _transfer learning_, the alphabet layer is removed from the neural network, which allows a model for one language to be trained starting from a model for another language. In general, this works best between languages that have a similar vocabulary and/or structure. For example, transferring from English to French will generally work better than from English to Hindi, because English and French are more similar to each other.
For more information on [fine tuning in 🐸STT, please consult the documentation](https://stt.readthedocs.io/en/latest/TRAINING.html#fine-tuning-same-alphabet).
For more information on [transfer learning in 🐸STT, please consult the documentation](https://stt.readthedocs.io/en/latest/TRAINING.html#transfer-learning-new-alphabet).
---
[Home](README.md) | [Previous - Training your model](TRAINING.md) | [Next - Deploying your model](DEPLOYMENT.md)

368
doc/playbook/TRAINING.md Normal file

@ -0,0 +1,368 @@
[Home](README.md) | [Previous - Setting up your Coqui STT training environment](ENVIRONMENT.md) | [Next - Testing and evaluating your trained model](TESTING.md)
# Training a Coqui STT model
## Contents
- [Training a Coqui STT model](#training-a-coqui-stt-model)
* [Contents](#contents)
* [Making training files available to the Docker container](#making-training-files-available-to-the-docker-container)
* [Running training](#running-training)
+ [Specifying checkpoint directories so that you can restart training from a checkpoint](#specifying-checkpoint-directories-so-that-you-can-restart-training-from-a-checkpoint)
- [Advanced checkpoint configuration](#advanced-checkpoint-configuration)
* [How checkpoints are stored](#how-checkpoints-are-stored)
* [Managing disk space and checkpoints](#managing-disk-space-and-checkpoints)
* [Different checkpoints for loading and saving](#different-checkpoints-for-loading-and-saving)
+ [Specifying the directory that the trained model should be exported to](#specifying-the-directory-that-the-trained-model-should-be-exported-to)
* [Other useful parameters that can be passed to `train.py`](#other-useful-parameters-that-can-be-passed-to--trainpy-)
+ [`n_hidden` parameter](#-n-hidden--parameter)
+ [Reduce learning rate on plateau (RLROP)](#reduce-learning-rate-on-plateau--rlrop-)
+ [Early stopping](#early-stopping)
+ [Dropout rate](#dropout-rate)
* [Steps and epochs](#steps-and-epochs)
* [Advanced training options](#advanced-training-options)
* [Monitoring GPU use with `nvtop`](#monitoring-gpu-use-with--nvtop-)
* [Possible errors](#possible-errors)
+ [`Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.` error when training](#-failed-to-get-convolution-algorithm-this-is-probably-because-cudnn-failed-to-initialize--so-try-looking-to-see-if-a-warning-log-message-was-printed-above--error-when-training)
## Making training files available to the Docker container
Before we can train a model, we need to make the training data available to the Docker container. The training data was previously prepared in the [instructions for formatting data](DATA_FORMATTING.md). Copy or extract them to the directory you specified in your _bind mount_. This will make them available to the Docker container.
```
$ cd stt-data
$ ls cv-corpus-6.1-2020-12-11/
total 12
4 drwxr-xr-x 3 kathyreid kathyreid 4096 Feb 9 10:42 ./
4 drwxrwxr-x 7 kathyreid kathyreid 4096 Feb 9 10:43 ../
4 drwxr-xr-x 3 kathyreid kathyreid 4096 Feb 9 10:43 id/
```
We're now ready to begin training.
## Running training
We're going to walk through some of the key parameters you can use with `train.py`.
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv
```
**Do not run this yet**
The options `--train_files`, `--dev_files` and `--test_files` take a path to the relevant data, which was prepared in the section on [data formatting](DATA_FORMATTING.md).
### Specifying checkpoint directories so that you can restart training from a checkpoint
As you are training your model, 🐸STT will store _checkpoints_ to disk. If training is interrupted, you can restart it from the latest _checkpoint_, saving hours of training time.
Because we have our [training environment](ENVIRONMENT.md) configured to use Docker, we must ensure that our checkpoint directories are stored in the directory used by the _bind mount_, so that they _persist_ in the event of failure.
To specify checkpoint directories, use the `--checkpoint_dir` parameter with `train.py`:
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints
```
**Do not run this yet**
#### Advanced checkpoint configuration
##### How checkpoints are stored
_Checkpoints_ are stored as [TensorFlow `tf.Variable` objects](https://www.tensorflow.org/guide/checkpoint). This is a binary file format; that is, you won't be able to read it with a text editor. The _checkpoint_ stores all the weights and biases of the current state of the _neural network_ as training progresses.
_Checkpoints_ are named by the total number of steps completed. For example, if you train for 100 epochs at 2000 steps per epoch, then the final _checkpoint_ will be named `200000`.
```
~/stt-data/checkpoints-true-id$ ls
total 1053716
4 drwxr-xr-x 2 root root 4096 Feb 24 14:17 ./
4 drwxrwxr-x 5 root root 4096 Feb 24 13:18 ../
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:11 best_dev-12774.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:11 best_dev-12774.index
1236 -rw-r--r-- 1 root root 1262944 Feb 24 14:11 best_dev-12774.meta
4 -rw-r--r-- 1 root root 85 Feb 24 14:11 best_dev_checkpoint
4 -rw-r--r-- 1 root root 247 Feb 24 14:17 checkpoint
4 -rw-r--r-- 1 root root 3888 Feb 24 13:18 flags.txt
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:09 train-12774.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:09 train-12774.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:09 train-12774.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:13 train-14903.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:13 train-14903.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:13 train-14903.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:17 train-17032.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:17 train-17032.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:17 train-17032.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:01 train-19161.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:01 train-19161.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:01 train-19161.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:05 train-21290.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:05 train-21290.index
```
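Because the step count is part of each file name, the most recent _checkpoint_ can be found by parsing the names in the directory. The following is a small sketch based only on the file names in the listing above; the function name `checkpoint_steps` is hypothetical and the naming pattern is inferred from that listing rather than from the 🐸STT source.

```python
import re


def checkpoint_steps(filenames, prefix="train"):
    """Return the sorted step counts found in names like 'train-12774.index'."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)\.index$")
    steps = set()
    for name in filenames:
        match = pattern.match(name)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)


# File names taken from the directory listing above.
listing = [
    "best_dev-12774.index", "checkpoint", "flags.txt",
    "train-12774.index", "train-14903.index", "train-17032.index",
    "train-19161.index", "train-21290.index",
]
steps = checkpoint_steps(listing)
print("Latest training checkpoint:", steps[-1])
print("Best dev checkpoint:", checkpoint_steps(listing, prefix="best_dev"))
```

In this listing the training _checkpoints_ are 2129 steps apart, and five of them are kept.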
##### Managing disk space and checkpoints
_Checkpoints_ can consume a lot of disk space, so you may wish to configure how often a _checkpoint_ is written to disk, and how many _checkpoints_ are stored.
* `--checkpoint_secs` specifies the time interval for storing a _checkpoint_. The default is `600`, or every ten minutes. You may wish to increase this if you have limited disk space.
* `--max_to_keep` specifies how many _checkpoints_ to keep. The default is `5`. You may wish to decrease this if you have limited disk space.
In this example we will store a _checkpoint_ every 30 minutes, and keep only 3 _checkpoints_.
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--checkpoint_secs 1800 \
--max_to_keep 3
```
**Do not run this yet**
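To get a feel for how much disk space `--max_to_keep` saves, the file sizes from the directory listing in the previous section can be plugged into a rough estimate. This is a sketch under stated assumptions: the sizes are those of the example model and will differ for other values of `--n_hidden`, one extra `best_dev` _checkpoint_ is assumed to be kept as in that listing, and the function name `checkpoint_disk_bytes` is hypothetical.

```python
# Sizes in bytes of one checkpoint's files, copied from the earlier listing.
DATA_BYTES = 178557296   # train-12774.data-00000-of-00001
INDEX_BYTES = 1469       # train-12774.index
META_BYTES = 1262938     # train-12774.meta


def checkpoint_disk_bytes(max_to_keep, best_dev_copies=1):
    """Estimate disk use for `max_to_keep` checkpoints plus best_dev copies."""
    per_checkpoint = DATA_BYTES + INDEX_BYTES + META_BYTES
    return per_checkpoint * (max_to_keep + best_dev_copies)


for keep in (5, 3):
    gib = checkpoint_disk_bytes(keep) / 1024 ** 3
    print(f"--max_to_keep {keep}: about {gib:.2f} GiB")
```

The estimate scales linearly, so halving `--max_to_keep` roughly halves the space taken by training _checkpoints_.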
##### Different checkpoints for loading and saving
In some cases, you may wish to _load_ _checkpoints_ from one location, but _save_ _checkpoints_ to another location - for example if you are doing fine tuning or transfer learning.
* `--load_checkpoint_dir` specifies the directory to load _checkpoints_ from.
* `--save_checkpoint_dir` specifies the directory to save _checkpoints_ to.
In this example we will load _checkpoints_ from one directory, and save new _checkpoints_ to a different directory.
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--load_checkpoint_dir stt-data/checkpoints-to-train-from \
--save_checkpoint_dir stt-data/checkpoints-to-save-to
```
**Do not run this yet**
### Specifying the directory that the trained model should be exported to
Again, because we have our [training environment](ENVIRONMENT.md) configured to use Docker, we must ensure that our trained model is stored in the directory used by the _bind mount_, so that it _persists_ in the event of failure of the Docker container.
To specify where the trained model should be saved, use the `--export_dir` parameter with `train.py`:
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--export_dir stt-data/exported-model
```
**You can run this command to start training**
## Other useful parameters that can be passed to `train.py`
_For a full list of parameters that can be passed to `train.py`, please [consult the documentation](https://stt.readthedocs.io/en/latest/Flags.html#training-flags)._
`train.py` has many parameters - too many to cover in an introductory PlayBook. Here are some of the commonly used parameters that are useful to explore as you begin to train speech recognition models with 🐸STT.
### `n_hidden` parameter
Neural networks work through a series of _layers_. Usually there is an _input layer_, which takes an input - in this case an audio recording; a series of _hidden layers_, which identify features of the input; and an _output layer_, which makes a prediction - in this case a character.
With large datasets, you need wide _hidden layers_ - layers with many units - to arrive at an accurate trained model. With smaller datasets, often called _toy corpora_ or _toy datasets_, the _hidden layers_ don't need to be as wide.
If you are learning how to train using 🐸STT, and are working with a small dataset, you will save time by reducing the value of `--n_hidden`. This sets the width of the _hidden layers_ - the number of units in each layer - rather than the number of layers. A smaller value both reduces the amount of computing resources consumed during training, and makes training a model much faster.
The `--n_hidden` parameter has a default value of `2048`.
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--export_dir stt-data/exported-model \
--n_hidden 64
```
### Reduce learning rate on plateau (RLROP)
In neural networks, the _learning rate_ is the rate at which the neural network makes adjustments to the predictions it generates. The accuracy of predictions is measured using the _loss_. The lower the _loss_, the lower the difference between the neural network's predictions, and actual known values. If training is effective, _loss_ will reduce over time. A neural network that has a _loss_ of `0` has perfect prediction.
If the _learning rate_ is too low, predictions will take a long time to align with actual targets. If the learning rate is too high, predictions will overshoot actual targets. The _learning rate_ has to aim for a balance between _exploration and exploitation_.
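The effect of the _learning rate_ can be seen on a problem much simpler than speech recognition. The sketch below minimises the function `f(x) = x**2`, whose minimum is at `0`, using plain gradient descent. It is a generic illustration, not 🐸STT code, and the three learning rates are arbitrary values chosen to show each behaviour.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimise f(x) = x**2 and return the final x; the gradient is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


too_low = gradient_descent(0.001)   # barely moves away from the start
balanced = gradient_descent(0.3)    # ends very close to the minimum at 0
too_high = gradient_descent(1.1)    # overshoots further on every step

print(f"too low:  {too_low:.4f}")
print(f"balanced: {balanced:.10f}")
print(f"too high: {too_high:.4f}")
```

With the highest rate, each step jumps over the minimum and lands further away than it started, which is the overshooting described above.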
If loss is not reducing over time, then the training is said to have _plateaued_ - that is, the adjustments to the predictions are not reducing _loss_. By adjusting the _learning rate_, and other parameters, we may escape the _plateau_ and continue to decrease _loss_.
* The `--reduce_lr_on_plateau` parameter instructs `train.py` to automatically reduce the _learning rate_ if a _plateau_ is detected. By default, this is `false`.
* The `--plateau_epochs` parameter specifies the number of epochs of training during which there is no reduction in loss that should be considered a _plateau_. The default value is `10`.
* The `--plateau_reduction` parameter specifies a multiplicative factor that is applied to the current learning rate if a _plateau_ is detected. This number **must** be less than `1`, otherwise the learning rate will not be reduced. The default value is `0.1`.
An example of training with these parameters would be:
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--export_dir stt-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08
```
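The way these three parameters interact can be sketched as a small simulation. This is a simplified illustration of the idea, not the logic in `train.py`: the function name `learning_rate_schedule` is hypothetical, the losses are made up, and the real implementation may differ in detail (for example, in how it counts epochs after a reduction).

```python
def learning_rate_schedule(losses, learning_rate,
                           plateau_epochs=10, plateau_reduction=0.1):
    """Return the learning rate used in each epoch, given per-epoch losses."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    history = []
    for loss in losses:
        history.append(learning_rate)
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= plateau_epochs:
            # Plateau detected: shrink the learning rate and start counting again.
            learning_rate *= plateau_reduction
            epochs_without_improvement = 0
    return history


# Made-up validation losses that stop improving after the second epoch.
losses = [5.0, 4.0, 4.0, 4.0, 3.9]
history = learning_rate_schedule(
    losses, learning_rate=0.001, plateau_epochs=2, plateau_reduction=0.5)
print(history)
```

Here the loss fails to improve for two epochs in a row, so the learning rate is halved for the final epoch.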
### Early stopping
If training is not resulting in a reduction of _loss_ over time, you can pass parameters to `train.py` that will stop training. This is called _early stopping_ and is useful if you are using cloud compute resources, or shared resources, and can't monitor the training continuously.
* The `--early_stop` parameter enables early stopping. It is set to `false` by default.
* The `--es_epochs` parameter takes an integer: the number of epochs with no improvement after which training will be stopped. It is set to `25` by default, which applies if `--early_stop` is set to `true` but `--es_epochs` is omitted.
* The `--es_min_delta` parameter is the minimum change in _loss_ per epoch that qualifies as an improvement. By default it is set to `0.05`.
An example of training with these parameters would be:
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--export_dir stt-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08 \
--early_stop true \
--es_epochs 10 \
--es_min_delta 0.06
```
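The way `--es_epochs` and `--es_min_delta` work together can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the `train.py` implementation: the function name `early_stop_epoch` and the input list of per-epoch losses are hypothetical, and the sketch assumes an epoch only counts as an improvement if it beats the best loss so far by more than `es_min_delta`.

```python
def early_stop_epoch(losses, es_epochs=25, es_min_delta=0.05):
    """Return the index of the epoch at which training would stop, or None.

    Hypothetical helper for illustration only.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if best_loss - loss > es_min_delta:
            # The loss dropped by more than the minimum delta: a real improvement.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= es_epochs:
                return epoch
    return None


# Made-up losses: after epoch 1 the loss only creeps down by 0.01-0.02 per epoch.
losses = [10.0, 9.0, 8.98, 8.97, 8.96, 8.0]
stop_at = early_stop_epoch(losses, es_epochs=3, es_min_delta=0.06)
print(stop_at)
```

Note that in this sketch training stops before the final, much better epoch is ever reached. This is the trade-off with a small `--es_epochs` value: it saves compute, but can end training just before a late improvement.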
### Dropout rate
In machine learning, one of the risks during training is that of [_overfitting_](https://en.wikipedia.org/wiki/Overfitting). _Overfitting_ is where training creates a model that does not _generalize_ well. That is, it _fits_ to only the set of data on which it is trained. During inference, new data is not recognised accurately.
_Dropout_ is a technical approach to reduce _overfitting_. In _dropout_, nodes are randomly removed from the neural network created during training. This simulates the effect of more diverse data, and is a computationally cheap way of reducing _overfitting_, and improving the _generalizability_ of the model.
_Dropout_ can be set for any layer of a neural network. The parameter that has the most effect for 🐸STT training is `--dropout_rate`, which controls the feedforward layers of the neural network. To see the full set of _dropout parameters_, consult the 🐸STT documentation.
* The `--dropout_rate` parameter specifies the proportion of nodes that should be dropped from the neural network during training. The default value is `0.05`. However, if you are training on anything less than thousands of hours of voice data, you will find a value of `0.3` to `0.4` works better to prevent overfitting.
An example of training with this parameter would be:
```
python3 train.py \
--train_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--export_dir stt-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08 \
--early_stop true \
--es_epochs 10 \
--es_min_delta 0.06 \
--dropout_rate 0.3
```
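To illustrate what a _dropout rate_ of `0.3` means in practice, the sketch below applies the commonly used "inverted dropout" technique to a list of numbers standing in for one layer's outputs. This is a generic illustration under stated assumptions, not the 🐸STT or TensorFlow implementation; the function name `apply_dropout` and the all-ones example layer are made up for this example.

```python
import random


def apply_dropout(activations, dropout_rate, rng):
    """Zero each value with probability dropout_rate and rescale the survivors.

    Hypothetical helper showing generic inverted dropout.
    """
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in the range [0, 1)")
    keep_prob = 1.0 - dropout_rate
    # Survivors are divided by keep_prob so the expected sum of the layer is unchanged.
    return [a / keep_prob if rng.random() >= dropout_rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
layer_output = [1.0] * 10000
dropped = apply_dropout(layer_output, dropout_rate=0.3, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
print(f"fraction of nodes dropped: {fraction_zeroed:.3f}")
```

Roughly 30% of the values are zeroed on each call, and a different random subset is chosen every time, which is what discourages the network from relying too heavily on any single node.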
## Steps and epochs
In training, a _step_ is one [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) update of the model; that is, one attempt to move towards the lowest, or minimal, _loss_. The amount of processing done in one _step_ depends on the _batch size_. By default, `train.py` has a _batch size_ of `1`. That is, it processes one audio file in each _step_.
An _epoch_ is one full cycle through the training data. That is, if you have 1000 files listed in your `train.tsv` file, then you will expect to process 1000 _steps_ per epoch (assuming a _batch size_ of `1`).
To find out how many _steps_ to expect in each _epoch_, you can count the number of lines in your `train.tsv` file:
```
~/stt-data/cv-corpus-6.1-2020-12-11/id$ wc -l train.tsv
2131 train.tsv
```
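In this case there would be about `2131` _steps_ per _epoch_ at a _batch size_ of `1`. Note that `wc -l` counts every line, including a header row if the file has one.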
* `--epochs` specifies how many _epochs_ to train. It has a default of `75`, which would be appropriate for training tens to hundreds of hours of audio. If you have thousands of hours of audio, you may wish to increase the number of _epochs_ to around 150-300.
* `--train_batch_size`, `--dev_batch_size`, `--test_batch_size` all specify the _batch size_ per _step_. These all have a default value of `1`. Increasing the _batch size_ increases the amount of memory required to process the _step_; you need to be aware of this before increasing the _batch size_.
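The relationship between the number of training files, the _batch size_ and the number of _steps_ can be worked out with a little arithmetic. The sketch below assumes that a final, partially filled batch still counts as one _step_ (hence the rounding up); the function names `steps_per_epoch` and `total_steps` are hypothetical and are not part of `train.py`.

```python
import math


def steps_per_epoch(num_samples, batch_size=1):
    """Number of steps needed to see every sample once.

    Assumes the last, partially filled batch is still processed as a step.
    """
    return math.ceil(num_samples / batch_size)


def total_steps(num_samples, epochs=75, batch_size=1):
    """Number of steps across a whole training run."""
    return epochs * steps_per_epoch(num_samples, batch_size)


# Using the line count from the example above and the default of 75 epochs.
print(steps_per_epoch(2131))                # batch size of 1
print(steps_per_epoch(2131, batch_size=32))
print(total_steps(2131, epochs=75))
```

A larger _batch size_ means fewer _steps_ per _epoch_, at the cost of more memory per _step_.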
## Advanced training options
Advanced training options are available, such as _feature cache_ and _augmentation_. They are beyond the scope of this PlayBook, but you can [read more about them in the 🐸STT documentation](https://stt.readthedocs.io/en/latest/TRAINING.html#augmentation).
For a full list of parameters that can be passed to the `train.py` file, [please consult the 🐸STT documentation](https://stt.readthedocs.io/en/latest/Flags.html#training-flags).
## Monitoring GPU use with `nvtop`
In a separate terminal (i.e. not from the session where you have the Docker container open), run the command `nvtop`. You should see the `train.py` process consuming all available GPUs.
If you _do not_ see the GPU(s) being heavily utilised, you may be training only on your CPUs and you should double check your [environment](ENVIRONMENT.md).
## Possible errors
### `Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.` error when training
_You can safely skip this section if you have not encountered this error._
There have been several reports of an error similar to the one below when training is initiated. Anecdotal evidence suggests that the error is more likely to be encountered if you are training using an RTX-model GPU.
The error will look like this:
```
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d}}]]
[[concat/concat/_99]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d}}]]
0 successful operations.
0 derived errors ignored.
```
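To work around this error, you will need to let TensorFlow grow its GPU memory allocation on demand, which is what the `TF_FORCE_GPU_ALLOW_GROWTH` setting does.
This is done in the file
`STT/training/coqui_stt_training/util/config.py`
and you should edit it as below, so that `allow_growth` is set to `True` on the session configuration: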
```
root@687a2e3516d7:/STT/training/coqui_stt_training/util# nano config.py
...
# Standard session configuration that'll be used for all new sessions.
c.session_config = tfv1.ConfigProto(allow_soft_placement=True, log_device$
inter_op_parallelism_threads=FLAGS.in$
intra_op_parallelism_threads=FLAGS.in$
gpu_options=tfv1.GPUOptions(allow_gro$
# Set TF_FORCE_GPU_ALLOW_GROWTH to work around cuDNN error on RTX GPUs
c.session_config.gpu_options.allow_growth=True
```
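TensorFlow also documents an environment variable named `TF_FORCE_GPU_ALLOW_GROWTH`, so an alternative to editing `config.py` may be to set that variable for the training process. The following is a minimal sketch only, assuming that your TensorFlow version honours the variable and that training is started with `python3 train.py` as in the examples above. The helper names `env_with_gpu_growth` and `launch_training` are hypothetical and do not exist in 🐸STT.

```python
import os
import subprocess


def env_with_gpu_growth(base_env=None):
    """Return a copy of the environment with GPU memory growth requested.

    Hypothetical helper; assumes TensorFlow reads TF_FORCE_GPU_ALLOW_GROWTH.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def launch_training(train_args):
    """Start train.py with the modified environment (not called in this example)."""
    command = ["python3", "train.py", *train_args]
    return subprocess.run(command, env=env_with_gpu_growth(), check=True)


# Show the effect on a small stand-in environment.
example_env = env_with_gpu_growth({"PATH": "/usr/bin"})
print(example_env["TF_FORCE_GPU_ALLOW_GROWTH"])
```

If the error persists with the variable set, fall back to the `config.py` edit shown above.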
---
[Home](README.md) | [Previous - Setting up your Coqui STT training environment](ENVIRONMENT.md) | [Next - Testing and evaluating your trained model](TESTING.md)

View File

@ -5,4 +5,5 @@ sphinx-js==3.1
 furo==2021.2.28b28
 pygments==2.6.1
 #FIXME: switch to stable after C# changes have been merged: https://github.com/djungelorm/sphinx-csharp/pull/8
-git+https://github.com/rogerbarton/sphinx-csharp.git
+git+https://github.com/reuben/sphinx-csharp.git@9dc6202f558e3d3fa14ec7f5f1e36a8e66e6d622
+recommonmark==0.7.1