Merge pull request #1951 from coqui-ai/docs-pass

Documentation cleanup pass to match recent changes

commit 6635668eb3
@@ -21,7 +21,7 @@ For example, for the ``overlay`` augmentation:

.. code-block::

    python3 train.py --augment overlay[p=0.1,source=/path/to/audio.sdb,snr=20.0] ...
    python -m coqui_stt_training.train --augment "overlay[p=0.1,source=/path/to/audio.sdb,snr=20.0]" ...

In the documentation below, whenever a value is specified as ``<float-range>`` or ``<int-range>``, it supports one of the following formats:
@@ -55,7 +55,7 @@ Within a single domain, augmentations are applied in the same order as they appear

Sample domain augmentations
---------------------------

**Overlay augmentation** ``--augment overlay[p=<float>,source=<str>,snr=<float-range>,layers=<int-range>]``
**Overlay augmentation** ``--augment "overlay[p=<float>,source=<str>,snr=<float-range>,layers=<int-range>]"``

Layers another audio source (multiple times) onto augmented samples.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -67,7 +67,7 @@ Sample domain augmentations

* **layers**: number of layers added onto the sample (e.g. 10 layers of speech to get a "cocktail-party effect"). A layer is just a sample of the same duration as the sample to augment. It gets stitched together from as many source samples as required.

**Reverb augmentation** ``--augment reverb[p=<float>,delay=<float-range>,decay=<float-range>]``
**Reverb augmentation** ``--augment "reverb[p=<float>,delay=<float-range>,decay=<float-range>]"``

Adds simplified (no all-pass filters) `Schroeder reverberation <https://ccrma.stanford.edu/~jos/pasp/Schroeder_Reverberators.html>`_ to the augmented samples.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -77,7 +77,7 @@ Sample domain augmentations

* **decay**: sound decay in dB per reflection - higher values will result in a less reflective perceived "room"

**Resample augmentation** ``--augment resample[p=<float>,rate=<int-range>]``
**Resample augmentation** ``--augment "resample[p=<float>,rate=<int-range>]"``

Resamples augmented samples to another sample rate and then resamples back to the original sample rate.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -85,7 +85,7 @@ Sample domain augmentations

* **rate**: sample-rate to re-sample to

**Codec augmentation** ``--augment codec[p=<float>,bitrate=<int-range>]``
**Codec augmentation** ``--augment "codec[p=<float>,bitrate=<int-range>]"``

Compresses and then decompresses augmented samples using the lossy Opus audio codec.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -93,7 +93,7 @@ Sample domain augmentations

* **bitrate**: bitrate used during compression

**Volume augmentation** ``--augment volume[p=<float>,dbfs=<float-range>]``
**Volume augmentation** ``--augment "volume[p=<float>,dbfs=<float-range>]"``

Measures and levels augmented samples to a target dBFS value.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -103,7 +103,7 @@ Sample domain augmentations

Spectrogram domain augmentations
--------------------------------

**Pitch augmentation** ``--augment pitch[p=<float>,pitch=<float-range>]``
**Pitch augmentation** ``--augment "pitch[p=<float>,pitch=<float-range>]"``

Scales the spectrogram on the frequency axis and thus changes pitch.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -111,7 +111,7 @@ Spectrogram domain augmentations

* **pitch**: pitch factor by which the frequency axis is scaled (e.g. a value of 2.0 will raise audio frequency by one octave)

**Tempo augmentation** ``--augment tempo[p=<float>,factor=<float-range>]``
**Tempo augmentation** ``--augment "tempo[p=<float>,factor=<float-range>]"``

Scales the spectrogram on the time axis and thus changes playback tempo.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -119,7 +119,7 @@ Spectrogram domain augmentations

* **factor**: speed factor by which the time axis is stretched or shrunken (e.g. a value of 2.0 will double playback tempo)

**Warp augmentation** ``--augment warp[p=<float>,nt=<int-range>,nf=<int-range>,wt=<float-range>,wf=<float-range>]``
**Warp augmentation** ``--augment "warp[p=<float>,nt=<int-range>,nf=<int-range>,wt=<float-range>,wf=<float-range>]"``

Applies a non-linear image warp to the spectrogram. This is achieved by randomly shifting a grid of equally distributed warp points along the time and frequency axes.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -133,7 +133,7 @@ Spectrogram domain augmentations

* **wf**: standard deviation of the random shift applied to warp points along the frequency axis (0.0 = no warp, 1.0 = half the distance to the neighbour point)

**Frequency mask augmentation** ``--augment frequency_mask[p=<float>,n=<int-range>,size=<int-range>]``
**Frequency mask augmentation** ``--augment "frequency_mask[p=<float>,n=<int-range>,size=<int-range>]"``

Sets frequency intervals within the augmented samples to zero (silence) at random frequencies. See the SpecAugment paper for more details - https://arxiv.org/abs/1904.08779

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
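The masking operation itself is simple to picture. As a purely illustrative sketch (not the project's implementation, and with made-up parameter defaults), zeroing random frequency bands of a spectrogram could look like this:

.. code-block:: python

    import numpy as np

    def frequency_mask(spectrogram, n=2, size=3, rng=None):
        """Illustrative only: zero out `n` random frequency bands of `size` bins each.

        Assumes a (time, frequency) array; the real augmentation's parameter
        handling and tensor layout may differ.
        """
        rng = rng or np.random.default_rng()
        masked = spectrogram.copy()
        num_bins = masked.shape[-1]
        for _ in range(n):
            start = int(rng.integers(0, max(1, num_bins - size)))
            masked[:, start:start + size] = 0.0
        return masked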
@@ -145,7 +145,7 @@ Spectrogram domain augmentations

Multi domain augmentations
--------------------------

**Time mask augmentation** ``--augment time_mask[p=<float>,n=<int-range>,size=<float-range>,domain=<domain>]``
**Time mask augmentation** ``--augment "time_mask[p=<float>,n=<int-range>,size=<float-range>,domain=<domain>]"``

Sets time intervals within the augmented samples to zero (silence) at random positions.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -157,7 +157,7 @@ Multi domain augmentations

* **domain**: data representation to apply augmentation to - "signal", "features" or "spectrogram" (default)

**Dropout augmentation** ``--augment dropout[p=<float>,rate=<float-range>,domain=<domain>]``
**Dropout augmentation** ``--augment "dropout[p=<float>,rate=<float-range>,domain=<domain>]"``

Zeros random data points of the targeted data representation.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -167,7 +167,7 @@ Multi domain augmentations

* **domain**: data representation to apply augmentation to - "signal", "features" or "spectrogram" (default)

**Add augmentation** ``--augment add[p=<float>,stddev=<float-range>,domain=<domain>]``
**Add augmentation** ``--augment "add[p=<float>,stddev=<float-range>,domain=<domain>]"``

Adds random values picked from a normal distribution (with a mean of 0.0) to all data points of the targeted data representation.

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
@@ -177,7 +177,7 @@ Multi domain augmentations

* **domain**: data representation to apply augmentation to - "signal", "features" (default) or "spectrogram"

**Multiply augmentation** ``--augment multiply[p=<float>,stddev=<float-range>,domain=<domain>]``
**Multiply augmentation** ``--augment "multiply[p=<float>,stddev=<float-range>,domain=<domain>]"``

Multiplies all data points of the targeted data representation with random values picked from a normal distribution (with a mean of 1.0).

* **p**: probability value between 0.0 (never) and 1.0 (always) that a given sample gets augmented by this method
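In both cases the operation boils down to one line of array math. A minimal sketch of the idea (illustrative only, not the project's code; the real augmentations also honour ``p``, the range specs and the ``domain`` setting):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng()

    def add_noise(data, stddev):
        # "add": offsets drawn from a normal distribution with mean 0.0
        return data + rng.normal(loc=0.0, scale=stddev, size=data.shape)

    def multiply_noise(data, stddev):
        # "multiply": factors drawn from a normal distribution with mean 1.0
        return data * rng.normal(loc=1.0, scale=stddev, size=data.shape)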
@@ -191,24 +191,22 @@ Example training with all augmentations:

.. code-block:: bash

    python -u train.py \
    python -m coqui_stt_training.train \
        --train_files "train.sdb" \
        --feature_cache ./feature.cache \
        --cache_for_epochs 10 \
        --epochs 100 \
        --augment overlay[p=0.5,source=noise.sdb,layers=1,snr=50:20~10] \
        --augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
        --augment resample[p=0.1,rate=12000:8000~4000] \
        --augment codec[p=0.1,bitrate=48000:16000] \
        --augment volume[p=0.1,dbfs=-10:-40] \
        --augment pitch[p=0.1,pitch=1~0.2] \
        --augment tempo[p=0.1,factor=1~0.5] \
        --augment warp[p=0.1,nt=4,nf=1,wt=0.5:1.0,wf=0.1:0.2] \
        --augment frequency_mask[p=0.1,n=1:3,size=1:5] \
        --augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40] \
        --augment dropout[p=0.1,rate=0.05] \
        --augment add[p=0.1,domain=signal,stddev=0~0.5] \
        --augment multiply[p=0.1,domain=features,stddev=0~0.5] \
        --augment "overlay[p=0.5,source=noise.sdb,layers=1,snr=50:20~10]" \
        --augment "reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0]" \
        --augment "resample[p=0.1,rate=12000:8000~4000]" \
        --augment "codec[p=0.1,bitrate=48000:16000]" \
        --augment "volume[p=0.1,dbfs=-10:-40]" \
        --augment "pitch[p=0.1,pitch=1~0.2]" \
        --augment "tempo[p=0.1,factor=1~0.5]" \
        --augment "warp[p=0.1,nt=4,nf=1,wt=0.5:1.0,wf=0.1:0.2]" \
        --augment "frequency_mask[p=0.1,n=1:3,size=1:5]" \
        --augment "time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]" \
        --augment "dropout[p=0.1,rate=0.05]" \
        --augment "add[p=0.1,domain=signal,stddev=0~0.5]" \
        --augment "multiply[p=0.1,domain=features,stddev=0~0.5]" \
        [...]
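The range notation used above (for example ``snr=50:20~10`` or ``pitch=1~0.2``) combines a start value, an optional end value and an optional randomization radius; the ``--clock`` examples further down suggest that the start/end pair is interpolated over training progress. As a hedged, purely illustrative sketch of that notation (not the project's actual parser):

.. code-block:: python

    import random

    def sample_range(spec, clock=0.0):
        """Illustrative only: turn '12000', '1~0.2', '48000:16000' or '50:20~10'
        into a concrete value. `clock` is assumed to be training progress in [0, 1]."""
        base, _, radius = spec.partition("~")
        if ":" in base:
            start, end = (float(x) for x in base.split(":"))
            value = start + (end - start) * clock
        else:
            value = float(base)
        r = float(radius) if radius else 0.0
        return random.uniform(value - r, value + r)

    # e.g. sample_range("50:20~10", clock=0.5) -> a value around 35, +/- 10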
@@ -218,20 +216,20 @@ Example of playing all samples with reverberation and maximized volume:

.. code-block:: bash

    bin/play.py --augment reverb[p=0.1,delay=50.0,decay=2.0] --augment volume --random test.sdb
    bin/play.py --augment "reverb[p=0.1,delay=50.0,decay=2.0]" --augment volume --random test.sdb

Example simulation of the codec augmentation of a wav-file, first at the beginning and then at the end of an epoch:

.. code-block:: bash

    bin/play.py --augment codec[p=0.1,bitrate=48000:16000] --clock 0.0 test.wav
    bin/play.py --augment codec[p=0.1,bitrate=48000:16000] --clock 1.0 test.wav
    bin/play.py --augment "codec[p=0.1,bitrate=48000:16000]" --clock 0.0 test.wav
    bin/play.py --augment "codec[p=0.1,bitrate=48000:16000]" --clock 1.0 test.wav

Example of creating a pre-augmented test set:

.. code-block:: bash

    bin/data_set_tool.py \
        --augment overlay[source=noise.sdb,layers=1,snr=20~10] \
        --augment resample[rate=12000:8000~4000] \
        --augment "overlay[source=noise.sdb,layers=1,snr=20~10]" \
        --augment "resample[rate=12000:8000~4000]" \
        test.sdb test-augmented.sdb
@@ -1,8 +0,0 @@

.. _byte-output-mode:

Training in byte output mode
=============================

🐸STT includes a ``byte output mode`` which can be useful when working with languages with very large alphabets, such as Mandarin Chinese.

This training mode is experimental, and has only been used for Mandarin Chinese.
@@ -32,13 +32,13 @@ The CSV files contain the following fields:

* ``wav_filesize`` - sample size given in bytes, used for sorting the data before training. Expects an integer
* ``transcript`` - transcription target for the sample

To use Common Voice data for training, validation and testing, you should pass the ``CSV`` filenames to ``train.py`` via ``--train_files``, ``--dev_files``, ``--test_files``.
To use Common Voice data for training, validation and testing, you should pass the ``CSV`` filenames via ``--train_files``, ``--dev_files``, ``--test_files``.

For example, if you downloaded, extracted, and imported the French language data from Common Voice, you will have a new local directory named ``fr``. You can train STT with this new French data as such:

.. code-block:: bash

    $ python3 train.py \
        --train_files fr/clips/train.csv \
        --dev_files fr/clips/dev.csv \
        --test_files fr/clips/test.csv
    $ python -m coqui_stt_training.train \
        --train_files fr/clips/train.csv \
        --dev_files fr/clips/dev.csv \
        --test_files fr/clips/test.csv
@@ -10,31 +10,26 @@ Introduction

Deployment is the process of feeding audio (speech) into a trained 🐸STT model and receiving text (transcription) as output. In practice you probably want to use two models for deployment: an audio model and a text model. The audio model (a.k.a. the acoustic model) is a deep neural network which converts audio into text. The text model (a.k.a. the language model / scorer) returns the likelihood of a string of text. If the acoustic model makes spelling or grammatical mistakes, the language model can help correct them.

You can deploy 🐸STT models either via a command-line client or a language binding. 🐸 provides three language bindings and one command line client. There also exist several community-maintained clients and language bindings, which are listed `further down in this README <#third-party-bindings>`_.

*Note that 🐸STT currently only provides packages for CPU deployment with Python 3.5 or higher on Linux. We're working to get the rest of our usually supported packages back up and running as soon as possible.*
You can deploy 🐸STT models either via a command-line client or a language binding.

* :ref:`The Python package + language binding <py-usage>`
* :ref:`The Node.JS package + language binding <nodejs-usage>`
* :ref:`The command-line client <cli-usage>`
* :ref:`The native C API <c-usage>`
* :ref:`The Node.JS package + language binding <nodejs-usage>`
* :ref:`The .NET client + language binding <build-native-client-dotnet>`
.. _download-models:

Download trained Coqui STT models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can find pre-trained models ready for deployment on the 🐸STT `releases page <https://github.com/coqui-ai/STT/releases>`_. You can also download the latest acoustic model (``.pbmm``) and language model (``.scorer``) from the command line as such:
You can find pre-trained models ready for deployment on the 🐸STT `releases page <https://github.com/coqui-ai/STT/releases>`_. You can also download the latest acoustic model (``.tflite``) and language model (``.scorer``) from the command line as such:

.. code-block:: bash

    wget https://github.com/coqui-ai/STT/releases/download/v0.9.3/coqui-stt-0.9.3-models.pbmm
    wget https://github.com/coqui-ai/STT/releases/download/v0.9.3/coqui-stt-0.9.3-models.tflite
    wget https://github.com/coqui-ai/STT/releases/download/v0.9.3/coqui-stt-0.9.3-models.scorer

In every 🐸STT official release, there are several kinds of model files provided. For the acoustic model there are two file extensions: ``.pbmm`` and ``.tflite``. Files ending in ``.pbmm`` are compatible with clients and language bindings built against the standard TensorFlow runtime. ``.pbmm`` files are also compatible with CUDA enabled clients and language bindings. Files ending in ``.tflite``, on the other hand, are only compatible with clients and language bindings built against the `TensorFlow Lite runtime <https://www.tensorflow.org/lite/>`_. TFLite models are optimized for size and performance on low-power devices. You can find a full list of supported platforms and TensorFlow runtimes at :ref:`supported-platforms-deployment`.

For language models, there is only one file extension: ``.scorer``. Language models can run on any supported device, regardless of TensorFlow runtime. You can read more about language models with regard to :ref:`the decoding process <decoder-docs>` and :ref:`how scorers are generated <language-model>`.
In every 🐸STT official release, there are different model files provided. The acoustic model uses the ``.tflite`` extension. Language models use the extension ``.scorer``. You can read more about language models with regard to :ref:`the decoding process <decoder-docs>` and :ref:`how scorers are generated <language-model>`.
.. _model-data-match:

@@ -51,7 +46,7 @@ How well a 🐸STT model transcribes your audio will depend on a lot of things.

If you take a 🐸STT model trained on English, and pass Spanish into it, you should expect the model to perform horribly. Imagine you have a friend who only speaks English, and you ask her to make Spanish subtitles for a Spanish film; you wouldn't expect to get good subtitles. This is an extreme example, but it helps to form an intuition for what to expect from 🐸STT models. Imagine that the 🐸STT models are like people who speak a certain language with a certain accent, and then think about what would happen if you asked that person to transcribe your audio.

An acoustic model (i.e. ``.pbmm`` or ``.tflite``) has "learned" how to transcribe a certain language, and the model probably understands some accents better than others. In addition to languages and accents, acoustic models are sensitive to the style of speech, the topic of speech, and the demographics of the person speaking. The language model (``.scorer``) has been trained on text alone. As such, the language model is sensitive to how well the topic and style of speech matches that of the text used in training. The 🐸STT `release notes <https://github.com/coqui-ai/STT/releases/tag/v0.9.3>`_ include detailed information on the data used to train the models. If the data used for training the off-the-shelf models does not align with your intended use case, it may be necessary to adapt or train new models in order to improve transcription on your data.
An acoustic model (i.e. ``.tflite`` file) has "learned" how to transcribe a certain language, and the model probably understands some accents better than others. In addition to languages and accents, acoustic models are sensitive to the style of speech, the topic of speech, and the demographics of the person speaking. The language model (``.scorer``) has been trained on text alone. As such, the language model is sensitive to how well the topic and style of speech matches that of the text used in training. The 🐸STT `release notes <https://github.com/coqui-ai/STT/releases/tag/v0.9.3>`_ include detailed information on the data used to train the models. If the data used for training the off-the-shelf models does not align with your intended use case, it may be necessary to adapt or train new models in order to improve transcription on your data.

Training your own language model is often a good way to improve transcription on your audio. The process and tools used to generate a language model are described in :ref:`language-model` and general information can be found in :ref:`decoder-docs`. Generating a scorer from a constrained topic dataset is a quick process and can bring significant accuracy improvements if your audio is from a specific topic.
@@ -67,7 +62,7 @@ Model compatibility

Using the Python package
^^^^^^^^^^^^^^^^^^^^^^^^

Pre-built binaries for deploying a trained model can be installed with ``pip``. It is highly recommended that you use Python 3.5 or higher in a virtual environment. Both `pip <https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/#installing-pip>`_ and `venv <https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment>`_ are included in normal Python 3 installations.
Pre-built binaries for deploying a trained model can be installed with ``pip``. It is highly recommended that you use Python 3.6 or higher in a virtual environment. Both `pip <https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/#installing-pip>`_ and `venv <https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment>`_ are included in normal Python 3 installations.

When you create a new Python virtual environment, you create a directory containing a ``python`` binary and everything needed to run 🐸STT. For the purpose of this documentation, we will use ``$HOME/coqui-stt-venv``, but you can use whatever directory you like.

@@ -87,7 +82,7 @@ After your environment has been activated, you can use ``pip`` to install ``stt``

.. code-block::

    (coqui-stt-venv)$ python3 -m pip install -U pip && python3 -m pip install stt
    (coqui-stt-venv)$ python -m pip install -U pip && python -m pip install stt

After installation has finished, you can call ``stt`` from the command-line.
@@ -95,7 +90,7 @@ The following command assumes you :ref:`downloaded the pre-trained models <download-models>`:

.. code-block:: bash

    (coqui-stt-venv)$ stt --model stt-0.9.3-models.pbmm --scorer stt-0.9.3-models.scorer --audio my_audio_file.wav
    (coqui-stt-venv)$ stt --model stt-0.9.3-models.tflite --scorer stt-0.9.3-models.scorer --audio my_audio_file.wav

See :ref:`the Python client <py-api-example>` for an example of how to use the package programmatically.
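For a quick orientation before reading the full client, a minimal sketch of programmatic use might look like the following. It assumes the model files downloaded above and a 16-bit mono WAV recorded at the model's sample rate; the file names are the ones used in this section.

.. code-block:: python

    import wave

    import numpy as np
    from stt import Model

    model = Model("stt-0.9.3-models.tflite")
    model.enableExternalScorer("stt-0.9.3-models.scorer")

    with wave.open("my_audio_file.wav", "rb") as wav:
        # The audio must match the model's expected sample rate (see model.sampleRate()).
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    print(model.stt(audio))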
@@ -103,45 +98,10 @@ See :ref:`the Python client <py-api-example>` for an example of how to use the package programmatically.

.. code-block::

    (coqui-stt-venv)$ python3 -m pip install -U pip && python3 -m pip install stt-gpu
    (coqui-stt-venv)$ python -m pip install -U pip && python -m pip install stt-gpu

See the `release notes <https://github.com/coqui-ai/STT/releases>`_ to find which GPUs are supported. Please ensure you have the required `CUDA dependency <#cuda-dependency>`_.

.. _cli-usage:

Using the command-line client
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To download the pre-built binaries for the ``stt`` command-line (compiled C++) client, use ``util/taskcluster.py``\ :

.. code-block:: bash

    python3 util/taskcluster.py --target .

or if you're on macOS:

.. code-block:: bash

    python3 util/taskcluster.py --arch osx --target .

Also, if you need binaries different from the current main branch, like ``v0.2.0-alpha.6``\ , you can use ``--branch``\ :

.. code-block:: bash

    python3 util/taskcluster.py --branch "v0.2.0-alpha.6" --target "."

The script ``taskcluster.py`` will download ``native_client.tar.xz`` (which includes the ``stt`` binary and associated libraries) and extract it into the current folder. ``taskcluster.py`` will download binaries for Linux/x86_64 by default, but you can override that behavior with the ``--arch`` parameter. See the help info with ``python3 util/taskcluster.py -h`` for more details. Specific branches of 🐸STT or TensorFlow can be specified as well.

Alternatively you may manually download the ``native_client.tar.xz`` from the `releases page <https://github.com/coqui-ai/STT/releases>`_.

Assuming you have :ref:`downloaded the pre-trained models <download-models>`, you can use the client as such:

.. code-block:: bash

    ./stt --model coqui-stt-0.9.3-models.pbmm --scorer coqui-stt-0.9.3-models.scorer --audio audio_input.wav

See the help output with ``./stt -h`` for more details.

.. _nodejs-usage:

Using the Node.JS / Electron.JS package
@@ -173,6 +133,20 @@ See the `release notes <https://github.com/coqui-ai/STT/releases>`_ to find which GPUs are supported.

See the :ref:`TypeScript client <js-api-example>` for an example of how to use the bindings programmatically.

.. _cli-usage:

Using the command-line client
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The pre-built binaries for the ``stt`` command-line (compiled C++) client are available in the ``native_client.tar.xz`` archive for your desired platform. You can download the archive from our `releases page <https://github.com/coqui-ai/STT/releases>`_.

Assuming you have :ref:`downloaded the pre-trained models <download-models>`, you can use the client as such:

.. code-block:: bash

    ./stt --model coqui-stt-0.9.3-models.tflite --scorer coqui-stt-0.9.3-models.scorer --audio audio_input.wav

See the help output with ``./stt -h`` for more details.

Installing bindings from source
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -209,11 +183,8 @@ CUDA Dependency

The GPU capable builds (Python, NodeJS, C++, etc) depend on CUDA 10.1 and CuDNN v7.6.

.. _cuda-inference-deps:

.. toctree::
   :maxdepth: 1
   :caption: Supported Platforms

   SUPPORTED_PLATFORMS
@@ -3,54 +3,17 @@

Exporting a model for deployment
================================

After you train a STT model, your model will be stored on disk as a :ref:`checkpoint file <checkpointing>`. Model checkpoints are useful for resuming training at a later date, but they are not the correct format for deploying a model into production. The best model format for deployment is a protobuf file.
After you train a STT model, your model will be stored on disk as a :ref:`checkpoint file <checkpointing>`. Model checkpoints are useful for resuming training at a later date, but they are not the correct format for deploying a model into production. The model format for deployment is a TFLite file.

This document explains how to export model checkpoints as a protobuf file.
This document explains how to export model checkpoints as a TFLite file.

How to export a model
---------------------

The simplest way to export STT model checkpoints for deployment is via ``train.py`` and the ``--export_dir`` flag.
You can export STT model checkpoints for deployment by using the export script and the ``--export_dir`` flag.

.. code-block:: bash

    $ python3 train.py \
        --checkpoint_dir path/to/existing/model/checkpoints \
        --export_dir where/to/export/new/protobuf

However, you may want to export a model for small devices or for more efficient memory usage. In this case, follow the steps below.

Exporting as memory-mapped
^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the protobuf exported by ``train.py`` will be loaded in memory every time the model is deployed. This results in extra loading time and memory consumption. Creating a memory-mapped protobuf file will avoid these issues.

First, export your checkpoints to a protobuf with ``train.py``:

.. code-block:: bash

    $ python3 train.py \
        --checkpoint_dir path/to/existing/model/checkpoints \
        --export_dir where/to/export/new/protobuf

Second, convert the protobuf to a memory-mapped protobuf with ``convert_graphdef_memmapped_format``:

.. code-block::

    $ convert_graphdef_memmapped_format \
        --in_graph=output_graph.pb \
        --out_graph=output_graph.pbmm

``convert_graphdef_memmapped_format`` is a dedicated tool to convert regular protobuf files to memory-mapped protobufs. You can find this tool pre-compiled on the STT `release page <https://github.com/coqui-ai/STT/releases>`_. You should download and decompress ``convert_graphdef_memmapped_format`` before use. Upon a successful conversion ``convert_graphdef_memmapped_format`` will report conversion of a non-zero number of nodes.

Exporting for small devices
^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to deploy a STT model on a small device, you might consider exporting the model with `Tensorflow Lite <https://www.tensorflow.org/lite>`_ support. Export STT model checkpoints for Tensorflow Lite via ``train.py`` and the ``--export_tflite`` flag.

.. code-block:: bash

    $ python3 train.py \
        --checkpoint_dir path/to/existing/model/checkpoints \
        --export_dir where/to/export/new/protobuf \
        --export_tflite
    $ python3 -m coqui_stt_training.export \
        --checkpoint_dir path/to/existing/model/checkpoints \
        --export_dir where/to/export/model
@@ -49,7 +49,7 @@ For more custom use cases, you might familiarize yourself with the `KenLM toolkit

.. code-block:: bash

    python3 generate_lm.py \
    python generate_lm.py \
        --input_txt librispeech-lm-norm.txt.gz \
        --output_dir . \
        --top_k 500000 \
@@ -5,14 +5,14 @@ Automatic Mixed Precision

Training with `automatic mixed precision <https://medium.com/tensorflow/automatic-mixed-precision-in-tensorflow-for-faster-ai-training-on-nvidia-gpus-6033234b2540>`_ is available when training STT on a GPU.

Mixed precision training makes use of both ``FP32`` and ``FP16`` precisions where appropriate. ``FP16`` operations can leverage the Tensor cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput. Mixed precision training often allows larger batch sizes. Automatic mixed precision training can be enabled by including the flag `--automatic_mixed_precision` at training time:
Mixed precision training makes use of both ``FP32`` and ``FP16`` precisions where appropriate. ``FP16`` operations can leverage the Tensor cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput. Mixed precision training often allows larger batch sizes. Automatic mixed precision training can be enabled by including the flag ``--automatic_mixed_precision true`` at training time:

.. code-block:: bash

    $ python3 train.py \
    $ python -m coqui_stt_training.train \
        --train_files train.csv \
        --dev_files dev.csv \
        --test_files test.csv \
        --automatic_mixed_precision
        --dev_files dev.csv \
        --test_files test.csv \
        --automatic_mixed_precision true

On a Volta generation V100 GPU, automatic mixed precision can speed up 🐸STT training and evaluation by approximately 30% to 40%.
@@ -5,15 +5,25 @@ Training: Advanced Topics

This document contains more advanced topics with regard to training models with STT. If you'd prefer a lighter introduction, please refer to :ref:`Training: Quickstart<intro-training-docs>`.

.. toctree::
   :maxdepth: 1

1. :ref:`training-flags`
2. :ref:`transfer-learning`
3. :ref:`automatic-mixed-precision`
4. :ref:`checkpointing`
5. :ref:`common-voice-data`
6. :ref:`training-data-augmentation`
7. :ref:`exporting-checkpoints`
8. :ref:`model-geometry`
9. :ref:`parallel-training-optimization`
10. :ref:`data-importers`
11. :ref:`byte-output-mode`
   TRAINING_FLAGS
   TRANSFER_LEARNING
   MIXED_PRECISION
   CHECKPOINTING
   COMMON_VOICE_DATA
   AUGMENTATION
   EXPORTING_MODELS
   Geometry
   PARALLLEL_OPTIMIZATION
   DATASET_IMPORTERS
@@ -3,14 +3,12 @@

Command-line flags for the training scripts
===========================================

Below you can find the definition of all command-line flags supported by the training scripts. This includes ``train.py``, ``evaluate.py``, ``evaluate_tflite.py``, ``transcribe.py`` and ``lm_optimizer.py``.
Below you can find the definition of all command-line flags supported by the training modules. This includes the modules ``coqui_stt_training.train``, ``coqui_stt_training.evaluate``, ``coqui_stt_training.export``, ``coqui_stt_training.training_graph_inference``, and the scripts ``evaluate_tflite.py``, ``transcribe.py`` and ``lm_optimizer.py``.

Flags
-----

.. literalinclude:: ../training/coqui_stt_training/util/config.py
   :language: python
   :linenos:
   :lineno-match:
   :start-after: sphinx-doc: training_ref_flags_start
   :end-before: sphinx-doc: training_ref_flags_end
@@ -41,18 +41,18 @@ If you don't want to use our Dockerfile template, you will need to manually install

Prerequisites
^^^^^^^^^^^^^

* `Python 3.6 <https://www.python.org/>`_
* `Python 3.6, 3.7 or 3.8 <https://www.python.org/>`_
* Mac or Linux environment (training on Windows is *not* currently supported)
* CUDA 10.0 and CuDNN v7.6

Download
^^^^^^^^

We recommend that you clone the STT repo from the latest stable release branch on GitHub (e.g. ``v0.9.3``). You can find all 🐸STT releases `here <https://github.com/coqui-ai/STT/releases>`_.
Clone the STT repo from GitHub:

.. code-block:: bash

    $ git clone --branch v0.9.3 --depth 1 https://github.com/coqui-ai/STT
    $ git clone https://github.com/coqui-ai/STT

Installation
^^^^^^^^^^^^
@@ -86,23 +86,17 @@ Now that we have cloned the STT repo from Github and setup a virtual environment

.. code-block:: bash

    $ cd STT
    $ python3 -m pip install --upgrade pip wheel setuptools
    $ python3 -m pip install --upgrade -e .

The ``webrtcvad`` package may additionally require ``python3-dev``:

.. code-block:: bash

    $ sudo apt-get install python3-dev
    $ python -m pip install --upgrade pip wheel setuptools
    $ python -m pip install --upgrade -e .

If you have an NVIDIA GPU, it is highly recommended to install TensorFlow with GPU support. Training will be significantly faster than using the CPU.

.. code-block:: bash

    $ python3 -m pip uninstall tensorflow
    $ python3 -m pip install 'tensorflow-gpu==1.15.4'
    $ python -m pip uninstall tensorflow
    $ python -m pip install 'tensorflow-gpu==1.15.4'

Please ensure you have the required `CUDA dependency <https://www.tensorflow.org/install/source#gpu>`_ and :ref:`prerequisites <training-deps>`.
Please ensure you have the required :ref:`prerequisites <training-deps>` and a working CUDA installation with the versions listed above.

Verify Install
""""""""""""""
@@ -118,12 +112,12 @@ This script will train a model on a single audio file. If the script exits successfully

Training on your own Data
-------------------------

Whether you used our Dockerfile template or you set up your own environment, the central STT training script is ``train.py``. For a list of command line options, use the ``--helpfull`` flag:
Whether you used our Dockerfile template or you set up your own environment, the central STT training module is ``python -m coqui_stt_training.train``. For a list of command line options, use the ``--help`` flag:

.. code-block:: bash

    $ cd STT
    $ python3 train.py --helpfull
    $ python -m coqui_stt_training.train --help

Training Data
^^^^^^^^^^^^^
@@ -143,12 +137,18 @@ Text transcripts should be formatted exactly as the transcripts you expect your

CSV file format
"""""""""""""""

The audio and transcripts used in training are passed to ``train.py`` via CSV files. You should supply CSV files for training (``train.csv``), development (``dev.csv``), and testing (``test.csv``). The CSV files should contain three columns:
The audio and transcripts used in training are specified via CSV files. You should supply CSV files for training (``train.csv``), validation (``dev.csv``), and testing (``test.csv``). The CSV files should contain three columns:

1. ``wav_filename`` - the path to a WAV file on your machine
2. ``wav_filesize`` - the number of bytes in the WAV file
3. ``transcript`` - the text transcript of the WAV file
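For illustration only (hypothetical file names and transcripts), such a CSV could be generated with a few lines of Python:

.. code-block:: python

    import csv
    import os

    # Hypothetical clips; replace with your own WAV paths and transcripts.
    clips = [
        ("clips/sample-000001.wav", "the quick brown fox"),
        ("clips/sample-000002.wav", "jumped over the lazy dog"),
    ]

    with open("train.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in clips:
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])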
Alternatively, if you don't have pre-defined splits for training, validation and testing, you can use the ``--auto_input_dataset`` flag to automatically split a single CSV into subsets and generate an alphabet automatically:

.. code-block:: bash

    $ python -m coqui_stt_training.train --auto_input_dataset samples.csv

Start Training
^^^^^^^^^^^^^^

@@ -157,11 +157,11 @@ After you've successfully installed STT and have access to data, you can start a training run:

.. code-block:: bash

    $ cd STT
    $ python3 train.py --train_files train.csv --dev_files dev.csv --test_files test.csv
    $ python -m coqui_stt_training.train --train_files train.csv --dev_files dev.csv --test_files test.csv

Next Steps
----------

You will want to customize the settings of ``train.py`` to work better with your data and your hardware. You should review the :ref:`command-line training flags <training-flags>`, and experiment with different settings.
You will want to customize the training settings to work better with your data and your hardware. You should review the :ref:`command-line training flags <training-flags>`, and experiment with different settings.

For more in-depth training documentation, you should refer to the :ref:`Advanced Training Topics <advanced-training-docs>` section.
@@ -14,17 +14,17 @@ If your own data uses the *exact* same alphabet as the English release model (i.e. ``a-z`` plus ``'``):

Fine-Tuning (same alphabet)
---------------------------

You can fine-tune pre-trained model checkpoints by using the ``--checkpoint_dir`` flag in ``train.py``. Specify the path to the checkpoints, and training will resume from the pre-trained model.
You can fine-tune pre-trained model checkpoints by using the ``--checkpoint_dir`` flag. Specify the path to the checkpoints, and training will resume from the pre-trained model.

For example, if you want to fine-tune existing checkpoints to your own data in ``my-train.csv``, ``my-dev.csv``, and ``my-test.csv``, you can do the following:

.. code-block:: bash

    $ python3 train.py \
        --checkpoint_dir path/to/checkpoint/folder \
        --train_files my-train.csv \
        --dev_files my-dev.csv \
        --test_files my_test.csv
    $ python -m coqui_stt_training.train \
        --checkpoint_dir path/to/checkpoint/folder \
        --train_files my-train.csv \
        --dev_files my-dev.csv \
        --test_files my_test.csv

Transfer-Learning (new alphabet)
--------------------------------

@@ -39,12 +39,12 @@ You need to specify the location of the pre-trained model with ``--load_checkpoint_dir``:

.. code-block:: bash

    python3 train.py \
    python -m coqui_stt_training.train \
        --drop_source_layers 1 \
        --alphabet_config_path my-alphabet.txt \
        --save_checkpoint_dir path/to/output-checkpoint/folder \
        --load_checkpoint_dir path/to/input-checkpoint/folder \
        --train_files my-new-language-train.csv \
        --dev_files my-new-language-dev.csv \
        --test_files my-new-language-test.csv
@@ -136,10 +136,14 @@ add_module_names = False

#
html_theme = "furo"

html_css_files = [
    "custom.css",
]

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = [".static"]
html_static_path = ["static"]


# -- Options for HTMLHelp output ------------------------------------------
@@ -35,8 +35,8 @@ The fastest way to deploy a pre-trained 🐸STT model is with `pip` with Python

    $ source venv-stt/bin/activate

    # Install 🐸STT
    $ python3 -m pip install -U pip
    $ python3 -m pip install stt
    $ python -m pip install -U pip
    $ python -m pip install stt

    # Download 🐸's pre-trained English models
    $ curl -LO https://github.com/coqui-ai/STT/releases/download/v0.9.3/coqui-stt-0.9.3-models.pbmm

doc/static/custom.css (new file)
@@ -0,0 +1,3 @@

#flags pre {
    white-space: pre-wrap;
}

train.py
@@ -3,6 +3,10 @@

from __future__ import absolute_import, division, print_function

if __name__ == "__main__":
    print(
        "Using the top level train.py script is deprecated and will be removed "
        "in a future release. Instead use: python -m coqui_stt_training.train"
    )
    try:
        from coqui_stt_training import train as stt_train
    except ImportError:

@@ -191,7 +191,7 @@ def package_zip():

    log_info("Exported packaged model {}".format(archive))


def main(_):
def main():
    initialize_globals_from_cli()

    if not Config.export_dir:
@@ -662,20 +662,24 @@ def main():

    def deprecated_msg(prefix):
        return (
            f"{prefix} Using the training script as a generic driver for all training "
            f"{prefix} Using the training module as a generic driver for all training "
            "related functionality is deprecated and will be removed soon. Use "
            "the specific scripts: train.py/evaluate.py/export.py/training_graph_inference.py."
            "the specific modules: \n"
            " python -m coqui_stt_training.train\n"
            " python -m coqui_stt_training.evaluate\n"
            " python -m coqui_stt_training.export\n"
            " python -m coqui_stt_training.training_graph_inference"
        )

    if Config.train_files:
        train()
    else:
        log_warn(deprecated_msg("Calling training script without --train_files."))
        log_warn(deprecated_msg("Calling training module without --train_files."))

    if Config.test_files:
        log_warn(
            deprecated_msg(
                "Specifying --test_files when calling train.py script. Use evaluate.py."
                "Specifying --test_files when calling train module. Use python -m coqui_stt_training.evaluate"
            )
        )
        evaluate.test()

@@ -683,7 +687,7 @@ def main():

    if Config.export_dir:
        log_warn(
            deprecated_msg(
                "Specifying --export_dir when calling train.py script. Use export.py."
                "Specifying --export_dir when calling train module. Use python -m coqui_stt_training.export"
            )
        )
        export.export()

@@ -691,7 +695,7 @@ def main():

    if Config.one_shot_infer:
        log_warn(
            deprecated_msg(
                "Specifying --one_shot_infer when calling train.py script. Use training_graph_inference.py."
                "Specifying --one_shot_infer when calling train module. Use python -m coqui_stt_training.training_graph_inference"
            )
        )
        traning_graph_inference.do_single_file_inference(Config.one_shot_infer)
@@ -201,10 +201,10 @@ class _SttConfig(Coqpit):

            self.alphabet = alphabet
        else:
            raise RuntimeError(
                "Missing --alphabet_config_path flag. Couldn't find an alphabet file\n"
                "alongside checkpoint, and input datasets are not fully specified\n"
                "(--train_files, --dev_files, --test_files), so can't generate an alphabet.\n"
                "Either specify an alphabet file or fully specify the dataset, so one will\n"
                "Missing --alphabet_config_path flag. Couldn't find an alphabet file "
                "alongside checkpoint, and input datasets are not fully specified "
                "(--train_files, --dev_files, --test_files), so can't generate an alphabet. "
                "Either specify an alphabet file or fully specify the dataset, so one will "
                "be generated automatically."
            )