From 0a11a8293e3cccc696b9bcc2b7e395d1a8010eae Mon Sep 17 00:00:00 2001
From: Alexandre Lissy
Date: Tue, 31 Mar 2020 14:14:27 +0200
Subject: [PATCH] Mention validate_label_locale in training doc

Fixes #2865
---
 doc/TRAINING.rst | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/doc/TRAINING.rst b/doc/TRAINING.rst
index a58ba6bf..823d7ff1 100644
--- a/doc/TRAINING.rst
+++ b/doc/TRAINING.rst
@@ -141,6 +141,18 @@ Feel also free to pass additional (or overriding) ``DeepSpeech.py`` parameters t
 
 Each dataset has a corresponding importer script in ``bin/`` that can be used to download (if it's freely available) and preprocess the dataset. See ``bin/import_librivox.py`` for an example of how to import and preprocess a large dataset for training with DeepSpeech.
 
+Some importers might require additional code to properly handle your locale-specific requirements. This is done with the ``--validate_label_locale`` flag, which lets you source an out-of-tree Python script that defines a ``validate_label`` function. Please refer to ``util/importers.py`` for an example implementation of that function.
+If you don't provide this argument, the default ``validate_label`` function will be used. It is only intended for the English language, so you might have consistency issues in your data for other languages.
+
+For example, in order to use a custom validation function that disallows any sample with "a" in its transcript, and lowercases everything else, you could put the following code in a file called ``my_validation.py`` and then use ``--validate_label_locale my_validation.py``:
+
+.. code-block:: python
+
+    def validate_label(label):
+        if 'a' in label: # disallow labels with 'a'
+            return None
+        return label.lower() # lower case valid labels
+
 If you've run the old importers (in ``util/importers/``\ ), they could have removed source files that are needed for the new importers to run. In that case, simply remove the extracted folders and let the importer extract and process the dataset from scratch, and things should work.
 
 Training with automatic mixed precision