Merge pull request #1807 from JRMeyer/docs

Overhaul the language model docs + include in ToC
This commit is contained in:
Josh Meyer 2021-03-24 11:58:49 -04:00 committed by GitHub
commit 653ce25a7c
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 65 additions and 40 deletions


@ -1,64 +1,83 @@
.. _scorer-scripts:

How to Train a Language Model
=============================

Introduction
------------

This document explains how to train and package a language model for deployment.

You will usually want to deploy a language model in production. A good language model will improve transcription accuracy by correcting predictable spelling and grammatical mistakes. If you can predict what kind of speech your 🐸STT will encounter, you can make great gains in accuracy with a custom language model.

For example, if you want to transcribe university lectures on biology, you should train a language model on text related to biology. With this biology-specific language model, 🐸STT will be able to better recognize rare, hard-to-spell words like "cytokinesis".

How to train a model
--------------------

There are three steps to deploying a new language model for 🐸STT:

1. Identify and format text data for training
2. Train a `KenLM <https://github.com/kpu/kenlm>`_ language model using ``data/lm/generate_lm.py``
3. Package the model for deployment with ``generate_scorer_package``
Find Training Data
^^^^^^^^^^^^^^^^^^

Language models are trained from text, and the more similar that text is to the speech your 🐸STT system encounters at run-time, the better 🐸STT will perform for you.

For example, if you would like to transcribe the nightly news, then transcripts of past nightly news programs will be your best training data. If you'd like to transcribe an audio book, then the exact text of that book will create the best possible language model. If you want to put 🐸STT on a smart speaker, your training text corpus should include all the commands you make available to the user, such as "turn off the music" or "set an alarm for 5 minutes". If you can't predict the kind of speech 🐸STT will hear at run-time, then you should try to gather as much text as possible in your target language (e.g. Spanish).

Once you have identified text that is appropriate for your application, you should save it in a single file with one sentence per line. This text should not contain anything that a person wouldn't say, such as markup language.

Our release language model for English is generated from the LibriSpeech normalized training text, available `here <http://www.openslr.org/11>`_.

You can download the text with the following command:
.. code-block:: bash

   cd data/lm
   wget http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
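The "one sentence per line, no markup" formatting described above can be sketched with standard Unix tools. This is only an illustration; ``corpus_raw.txt`` and ``corpus_clean.txt`` are hypothetical file names, not part of the 🐸STT tooling:

```shell
# Hypothetical input: one sentence wrapped in HTML-style markup.
printf '<p>Mitosis ends with Cytokinesis.</p>\n' > corpus_raw.txt

# Strip the markup, lowercase the text, and trim whitespace so the
# file contains only plain text that a person would actually say.
sed -e 's/<[^>]*>//g' corpus_raw.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
  > corpus_clean.txt

cat corpus_clean.txt   # -> mitosis ends with cytokinesis.
```

Real corpora usually need more aggressive cleanup (numbers, punctuation, abbreviations), but the principle is the same: the output should contain nothing a speaker wouldn't say aloud.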
This LibriSpeech training text is around 4GB uncompressed, which should give you an idea of the size of a corpus needed for general speech recognition. For more constrained use cases with smaller vocabularies, you don't need as much data, but you should still try to gather as much as you can.

Train the Language Model
^^^^^^^^^^^^^^^^^^^^^^^^

Assuming you found and formatted a text corpus, the next step is to use that text to train a KenLM language model with ``data/lm/generate_lm.py``.

Before training the language model, you should first familiarize yourself with the `KenLM toolkit <https://kheafield.com/code/kenlm/>`_. Most of the options exposed by the ``generate_lm.py`` script are simply forwarded to KenLM options of the same name, so you should read the KenLM documentation in order to fully understand their behavior.

If you are using a container created from ``Dockerfile.build``, you can pass ``--kenlm_bins /STT/native_client/kenlm/build/bin/`` to the script. Otherwise you have to build `KenLM <https://github.com/kpu/kenlm>`_ first and then pass its build directory to the script.
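If you are not using the Docker image, the usual CMake flow for building the KenLM binaries is sketched below; check the KenLM README for your platform's exact dependencies before relying on it:

```shell
# Build the KenLM tools (lmplz, build_binary, ...) from source.
git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build && cd build
cmake ..
make -j "$(nproc)"
# The binaries now live in kenlm/build/bin/ -- this is the
# directory to pass to generate_lm.py as --kenlm_bins.
```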
.. code-block:: bash

   python3 generate_lm.py \
     --input_txt librispeech-lm-norm.txt.gz \
     --output_dir . \
     --top_k 500000 \
     --kenlm_bins path/to/kenlm/build/bin/ \
     --arpa_order 5 \
     --max_arpa_memory "85%" \
     --arpa_prune "0|0|1" \
     --binary_a_bits 255 \
     --binary_q_bits 8 \
     --binary_type trie

``generate_lm.py`` will save the new language model as two files on disk: ``lm.binary`` and ``vocab-500000.txt``.
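To see what the ``--top_k`` flag is doing in miniature, here is a toy sketch that keeps only the 2 most frequent words of a three-line corpus, where the real script keeps the top 500,000 words (the ``toy_*`` file names are illustrative, not part of the tooling):

```shell
printf 'the cat sat\nthe cat ran\nthe dog\n' > toy_corpus.txt

# Count word frequencies, keep the 2 most frequent words, and write
# them out one per line -- a miniature vocab-500000.txt.
tr ' ' '\n' < toy_corpus.txt \
  | sort | uniq -c | sort -rn \
  | head -n 2 \
  | awk '{print $2}' > toy_vocab.txt

cat toy_vocab.txt   # -> "the" (3 occurrences), then "cat" (2)
```

Words outside the kept vocabulary can still be emitted by the acoustic model, but the scorer will not favor them.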

Package the Language Model for Deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, we package the trained KenLM model for deployment with ``generate_scorer_package``. You should always use one of the pre-built release binaries for ``generate_scorer_package``, but if for some reason you need to compile it yourself, please refer to :ref:`build-generate-scorer-package`.

Package the language model for deployment with ``generate_scorer_package`` as such:
.. code-block:: bash

   ./generate_scorer_package \
     --alphabet ../alphabet.txt \
     --lm lm.binary \
     --vocab vocab-500000.txt \
     --package kenlm.scorer \
     --default_alpha 0.931289039105002 \
     --default_beta 1.1834137581510284

The ``--default_alpha`` and ``--default_beta`` parameters shown above were found with the ``lm_optimizer.py`` Python script.

Note that we have a :github:`lm_optimizer.py script <lm_optimizer.py>` which can be used to find good default values for alpha and beta. To use it, you must first generate a package with any value set for the default alpha and beta flags. For this step, it doesn't matter what values you use, as they'll be overridden by ``lm_optimizer.py`` later. Then, use ``lm_optimizer.py`` with this scorer file to find good alpha and beta values. Finally, use ``generate_scorer_package`` again, this time with the new values.
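This search-then-repackage loop can be sketched as follows. The ``generate_scorer_package`` flags are the ones shown earlier, but the sketch is illustrative: consult ``python3 lm_optimizer.py --help`` for its actual arguments, and substitute the alpha/beta values it reports for the placeholders here:

```shell
# 1. Build a provisional scorer; these alpha/beta values are
#    throwaway placeholders that the optimizer will override.
./generate_scorer_package \
  --alphabet ../alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package provisional.scorer \
  --default_alpha 1.0 \
  --default_beta 1.0

# 2. Search for good alpha and beta values using the provisional
#    scorer (see the script's --help for its required arguments).
python3 lm_optimizer.py

# 3. Re-package with the best values reported by the optimizer.
BEST_ALPHA=0.93   # replace with the value lm_optimizer.py reports
BEST_BETA=1.18    # replace with the value lm_optimizer.py reports
./generate_scorer_package \
  --alphabet ../alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package kenlm.scorer \
  --default_alpha "$BEST_ALPHA" \
  --default_beta "$BEST_BETA"
```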


@ -81,6 +81,12 @@ The fastest way to deploy a pre-trained 🐸STT model is with `pip` with Python
Contributed-Examples

.. toctree::
   :maxdepth: 1
   :caption: Language Model

   Scorer

.. include:: ../SUPPORT.rst

Indices and tables