STT/doc/playbook/SCORER.md
2021-10-04 16:30:39 +02:00

311 lines
17 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)
# Scorer - language model for determining which words occur together
## Contents
- [Scorer - language model for determining which words occur together](#scorer---language-model-for-determining-which-words-occur-together)
* [Contents](#contents)
+ [What is a scorer?](#what-is-a-scorer-)
+ [Building your own scorer](#building-your-own-scorer)
- [Preparing the text file](#preparing-the-text-file)
- [Using `lm_optimizer.py` to generate values for the parameters `--default_alpha` and `--default_beta` that are used by the `generate_scorer_package` script](#using--lm-optimizerpy--to-generate-values-for-the-parameters----default-alpha--and----default-beta--that-are-used-by-the--generate-scorer-package--script)
* [Additional parameters for `lm_optimizer.py`](#additional-parameters-for--lm-optimizerpy-)
- [Using `generate_lm.py` to create `lm.binary` and `vocab-500000.txt` files](#using--generate-lmpy--to-create--lmbinary--and--vocab-500000txt--files)
- [Generating a `kenlm.scorer` file from `generate_scorer_package`](#generating-a--kenlmscorer--file-from--generate-scorer-package-)
- [Using the scorer file in model training](#using-the-scorer-file-in-model-training)
### What is a scorer?
A scorer is a _language model_ and it is used by 🐸STT to improve the accuracy of transcription. A _language model_ predicts which words are more likely to follow each other. For example, the word `chicken` might be frequently followed by the words `nuggets`, `soup` or `rissoles`, but is unlikely to be followed by the word `purple`. The scorer identifies probabilities of words occurring together.
The default scorer used by 🐸STT is trained on the LibriSpeech dataset. The LibriSpeech dataset is based on [LibriVox](https://librivox.org/) - an open collection of out-of-copyright and public domain works.
You may need to build your own scorer - your own _language model_ if:
* You are training 🐸STT in another language
* You are training a speech recognition model for a particular domain - such as technical words, medical transcription, agricultural terms and so on
* If you want to improve the accuracy of transcription
**🐸STT supports the _optional_ use of an external scorer - if you're not sure if you need to build your own scorer, stick with the built-in one to begin with**.
### Building your own scorer
_This section assumes that you are using a Docker image and container for training, as outlined in the [environment](ENVIRONMENT.md) section. If you are not using the Docker image, then some of the scripts such as `generate_lm.py` will not be available in your environment._
_This section assumes that you have already trained a model and have a set of **checkpoints** for that model. See the section on [training](TRAINING.md) for more information on **checkpoints**._
🐸STT uses an algorithm called [_connectionist temporal classification_](https://distill.pub/2017/ctc/) or CTC for short, to map between _input_ sequences of audio and _output_ sequences of characters. The mapping between _inputs_ and _outputs_ is called an _alignment_. The alignment between _inputs_ and _outputs_ is not one-to-one; many _inputs_ may make up an _output_. CTC is therefore a _probabilistic_ algorithm. This means that for each _input_ there are many possible _outputs_ that can be selected. A process call _beam search_ is used to identify the possible _outputs_ and select the one with the highest probability. A [language model](AM_vs_LM.md) or _scorer_ helps the _beam search_ algorithm select the most optimal _output_ value. This is why building your own _scorer_ is necessary for training a model on a narrow domain - otherwise the _beam search_ algorithm would probably select the wrong _output_.
The default _scorer_ used with 🐸STT is trained on Librivox. It's a general model. But let's say that you want to train a speech recognition model for agriculture. If you have the phrase `tomatoes are ...`, a general scorer might identify `red` as the most likely next word - but an agricultural model might identify `ready` as the most likely next word.
The _scorer_ is only used during the _test_ stage of [training](TRAINING.md) (rather than at the _train_ or _validate_ stages) because this is where the _beam search_ decdoder determines which words are formed from the identified characters.
The process for building your own _scorer_ has the following steps:
1. Having, or preparing, a text file (in `.txt` or `.txt.gz` format), with one phrase or word on each line. If you are training a speech recognition model for a particular _domain_ - such as technical words, medical transcription, agricultural terms etc, then they should appear in the text file. The text file is used by the `generate_lm.py` script.
2. Using the `lm_optimizer.py` with your dataset (your `.csv` files) and a set of _checkpoints_ to find optimal values of `--default_alpha` and `--default_beta`. The `--default_alpha` and `--default_beta` parameters are used by the `generate_scorer_package` script to assign initial _weights_ to sequences of words.
3. Using the `generate_lm.py` script which is distributed with 🐸STT, along with the text file, to create two files, called `lm.binary` and `vocab-500000.txt`.
4. Downloading the prebuilt `native_client` from the 🐸STT repository on GitHub, and using the `generate_scorer_package` to create a `kenlm.scorer` file.
5. Using the `kenlm.scorer` file as the _external_scorer_ passed to `train.py`, and used for the _test_ phase. The `scorer` does not impact training; it is used for calculating `word error rate` (covered more in [testing](TESTING.md)).
In the following example we will create a custom external scorer file for Bahasa Indonesia (BCP47: `id-ID`).
#### Preparing the text file
This is straightforward. In this example, we will use a file called `indonesian-sentences.txt`. This file should contain phrases that you wish to prioritize recognising. For example, you may want to recognise place names, digits or medical phrases - and you will include these phrases in the `.txt` file.
_These phrases should not be copied from `test.tsv`, `train.tsv` or `validated.tsv` as you will bias the resultant model._
```
~/stt-data$ ls cv-corpus-6.1-2020-12-11/id
total 6288
4 drwxr-xr-x 3 root root 4096 Feb 24 19:01 ./
4 drwxr-xr-x 4 root root 4096 Feb 11 07:09 ../
1600 drwxr-xr-x 2 root root 1638400 Feb 9 10:43 clips/
396 -rwxr-xr-x 1 root root 401601 Feb 9 10:43 dev.tsv
104 -rwxr-xr-x 1 root root 103332 Feb 9 10:43 invalidated.tsv
1448 -rwxr-xr-x 1 root root 1481571 Feb 9 10:43 other.tsv
28 -rwxr-xr-x 1 root root 26394 Feb 9 10:43 reported.tsv
392 -rwxr-xr-x 1 root root 399790 Feb 9 10:43 test.tsv
456 -rwxr-xr-x 1 root root 465258 Feb 9 10:43 train.tsv
1848 -rwxr-xr-x 1 root root 1889606 Feb 9 10:43 validated.tsv
```
The `indonesian-sentences.txt` file is stored on the local filesystem in the `stt-data` directory so that the Docker container can access it.
```
~/stt-data$ ls | grep indonesian-sentences
476 -rw-rw-r-- 1 root root 483481 Feb 24 19:02 indonesian-sentences.txt
```
The `indonesian-sentences.txt` file is formatted with one phrase per line, eg:
```
Kamar adik laki-laki saya lebih sempit daripada kamar saya.
Ayah akan membunuhku.
Ini pulpen.
Akira pandai bermain tenis.
Dia keluar dari ruangan tanpa mengatakan sepatah kata pun.
Besok Anda akan bertemu dengan siapa.
Aku mengerti maksudmu.
Tolong lepas jasmu.
```
#### Using `lm_optimizer.py` to generate values for the parameters `--default_alpha` and `--default_beta` that are used by the `generate_scorer_package` script
The `lm_optimizer.py` script is located in the `STT` directory if you have set up your [environment][ENVIRONMENT.md] as outlined in the PlayBook.
```
root@57e6bf4eeb1c:/STT# ls | grep lm_optimizer.py
lm_optimizer.py
```
This script takes a set of test data (`--test_files`), and a `--checkpoint_dir` parameter and determines the optimal `--default_alpha` and `--default_beta` values.
Call `lm_optimizer.py` and pass it the `--test_files` and a `--checkpoint_dir` directory.
```
root@57e6bf4eeb1c:/STT# python3 lm_optimizer.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints
```
In general, any change to _geometry_ - the shape of the neural network - needs to be reflected here, otherwise the _checkpoint_ will fail to load. It's always a good idea to record the parameters you used to train a model. For example, if you trained your model with a `--n_hidden` value that is different to the default (`1024`), you should pass the same `--n_hidden` value to `lm_optimizer.py`, i.e:
```
root@57e6bf4eeb1c:/STT# python3 lm_optimizer.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--n_hidden 4
```
`lm_optimizer.py` will create a new _study_.
```
[I 2021-03-05 02:04:23,041] A new study created in memory with name: no-name-38c8e8cb-0cc2-4f53-af0e-7a7bd3bc5159
```
It will then run _testing_ and output a trial score.
```
[I 2021-03-02 12:48:15,336] Trial 0 finished with value: 1.0 and parameters: {'lm_alpha': 1.0381777700987271, 'lm_beta': 0.02094605391055826}. Best is trial 0 with value: 1.0.
```
By default, `lm_optimizer.py` will run `6` trials, and identify the trial with the most optimal parameters.
```
[I 2021-03-02 17:50:00,662] Trial 6 finished with value: 1.0 and parameters: {'lm_alpha': 3.1660260368070423, 'lm_beta': 4.7438794403688735}. Best is trial 0 with value: 1.0.
```
The optimal parameters `--default_alpha` and `--default_beta` are now known, and can be used with `generate_scorer_package`. In this case, the optimal settings are:
```
--default_alpha 1.0381777700987271
--default_beta 0.02094605391055826
```
because `Trial 0` was the best trial.
##### Additional parameters for `lm_optimizer.py`
There are additional parameters that may be useful.
**Please be aware that these parameters may increase processing time significantly - even to a few days - depending on your hardware.**
* `--n_trials` specifies how many trials `lm_optimizer.py` should run to find the optimal values of `--default_alpha` and `--default_beta`. The default is `6`. You may wish to reduce `--n_trials`.
* `--lm_alpha_max` specifies a maximum bound for `--default_alpha`. The default is `0.931289039105002`. You may wish to reduce `--lm_alpha_max`.
* `--lm_beta_max` specifies a maximum bound for `--default_beta`. The default is `1.1834137581510284`. You may wish to reduce `--lm_beta_max`.
For example:
```
root@57e6bf4eeb1c:/STT# python3 lm_optimizer.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints \
--n_hidden 4 \
--n_trials 3 \
--lm_alpha_max 0.92 \
--lm_beta_max 1.05
```
#### Using `generate_lm.py` to create `lm.binary` and `vocab-500000.txt` files
We then use `generate_lm.py` script that comes with 🐸STT to create a [_trie file_](https://en.wikipedia.org/wiki/Trie). The _trie file_ represents associations between words, so that during training, words that are more closely associated together are more likely to be transcribed by 🐸STT.
The _trie file_ is produced using a software package called [KenLM](https://kheafield.com/code/kenlm/). KenLM is designed to create large language models that are able to be filtered and queried easily.
First, create a directory in `stt-data` directory to store your `lm.binary` and `vocab-500000.txt` files:
```
stt-data$ mkdir indonesian-scorer
```
Then, use the `generate_lm.py` script as follows:
```
cd data/lm
python3 generate_lm.py \
--input_txt /STT/stt-data/indonesian-sentences.txt \
--output_dir /STT/stt-data/indonesian-scorer \
--top_k 500000 --kenlm_bins /STT/native_client/kenlm/build/bin/ \
--arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie
```
_Note: the `/STT/native_client/kenlm/build/bin/` is the path to the binary files for `kenlm`. If you are using the Docker image and container (explained on the [environment page of the PlayBook](ENVIRONMENT.md)), then `/STT/native_client/kenlm/build/bin/` is the correct path to use. If you are not using the Docker environment, your path may vary._
You should now have a `lm.binary` and `vocab-500000.txt` file in your `indonesian-scorer` directory:
```
stt-data$ ls indonesian-scorer/
total 1184
4 drwxrwxr-x 2 root root 4096 Feb 25 23:13 ./
4 drwxrwxr-x 5 root root 4096 Feb 26 09:24 ../
488 -rw-r--r-- 1 root root 499594 Feb 24 19:05 lm.binary
52 -rw-r--r-- 1 root root 51178 Feb 24 19:05 vocab-500000.txt
```
#### Generating a `kenlm.scorer` file from `generate_scorer_package`
Next, we need to install the `native_client` package, which contains the `generate_scorer_package`. This is _not_ pre-built into the 🐸STT Docker image.
The `generate_scorer_package`, once installed via the `native client` package, is usable on _all platforms_ supported by 🐸STT. This is so that developers can generate scorers _on-device_, such as on an Android device, or Raspberry Pi 3.
To install `generate_scorer_package`, first download the relevant `native client` package from the [🐸STT GitHub releases page](https://github.com/coqui-ai/STT/releases/latest) into the `data/lm` directory. The Docker image uses Ubuntu Linux, so you should use either the `native_client.amd64.cuda.linux.tar.xz` package if you are using `cuda` or the `native_client.amd64.cpu.linux.tar.xz` package if not.
The easiest way to download the package and extract it is using `curl -L [URL] | tar -Jxvf [FILENAME]`:
```
root@dcb62aada58b:/STT/data/lm# curl -L https://github.com/coqui-ai/STT/releases/download/v1.0.0/native_client.tflite.Linux.tar.xz | tar -Jxvf -
libstt.so
generate_scorer_package
LICENSE
stt
coqui-stt.h
README.coqui
```
You can now generate a KenLM scorer file.
```
root@dcb62aada58b:/STT/data/lm# ./generate_scorer_package \
--alphabet ../alphabet.txt \
--lm ../../stt-data/indonesian-scorer/lm.binary
--vocab ../../stt-data/indonesian-scorer/vocab-500000.txt \
--package kenlm-indonesian.scorer \
--default_alpha 0.931289039105002 \
--default_beta 1.1834137581510284
6021 unique words read from vocabulary file.
Doesn't look like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Package created in kenlm-indonesian.scorer.
```
The message `Doesn't look like a character based (Bytes Are All You Need) model.` is _not_ an error.
If you receive the error message:
```
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Error: Cant parse scorer file, invalid header. Try updating your scorer file.
Error loading language model file: Invalid magic in trie header.
```
then you should add the parameter `--force_bytes_output_mode` when calling `generate_scorer_package`. This error most usually occurs when training languages that use [alphabets](ALPHABET.md) that contain a large number of characters, such as Mandarin. `--force_bytes_output_mode` forces the _decoder_ to predict `UTF-8` bytes instead of characters. For more information, [please see the 🐸STT documentation](https://stt.readthedocs.io/en/master/Decoder.html#bytes-output-mode). For example:
```
root@dcb62aada58b:/STT/data/lm# ./generate_scorer_package \
--alphabet ../alphabet.txt \
--lm ../../stt-data/indonesian-scorer/lm.binary
--vocab ../../stt-data/indonesian-scorer/vocab-500000.txt \
--package kenlm-indonesian.scorer \
--default_alpha 0.931289039105002 \
--default_beta 1.1834137581510284 \
--force_bytes_output_mode True
```
The `kenlm-indonesian.scorer` file is stored in the `/STT/data/lm` directory within the Docker container. Copy it to the `stt-data` directory.
```
root@dcb62aada58b:/STT/data/lm# cp kenlm-indonesian.scorer ../../stt-data/indonesian-scorer/
```
```
root@dcb62aada58b:/STT/stt-data/indonesian-scorer# ls -las
total 1820
4 drwxrwxr-x 2 1000 1000 4096 Feb 26 21:56 .
4 drwxrwxr-x 5 1000 1000 4096 Feb 25 22:24 ..
636 -rw-r--r-- 1 root root 648000 Feb 26 21:56 kenlm-indonesian.scorer
488 -rw-r--r-- 1 root root 499594 Feb 24 08:05 lm.binary
52 -rw-r--r-- 1 root root 51178 Feb 24 08:05 vocab-500000.txt
```
#### Using the scorer file during the test phase of training
You now have your own scorer file that can be used during the test phase of model training process using the `--scorer` parameter.
For example:
```
python3 train.py \
--test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir stt-data/checkpoints-newscorer-id \
--export_dir stt-data/exported-model-newscorer-id \
--n_hidden 2048 \
--scorer stt-data/indonesian-scorer/kenlm.scorer
```
For more information on scorer files, refer to the [🐸STT documentation](https://stt.readthedocs.io/en/latest/Scorer.html).
---
[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)