.. _common-voice-data:

Common Voice training data
==========================


This document describes how to use Common Voice data with 🐸STT. If you need training data, the Common Voice corpus is a good place to start.

Common Voice consists of voice data donated through Mozilla's `Common Voice <https://commonvoice.mozilla.org/>`_ initiative. You can download data sets for various languages from the `Common Voice datasets page <https://commonvoice.mozilla.org/data>`_.

After you download and extract a data set for one language, you'll find the following contents:

* ``.tsv`` files, containing metadata such as text transcripts
* ``.mp3`` audio files, located in the ``clips`` directory

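
Before running any conversion, it can help to look at what the metadata files contain. The following is a minimal sketch, not part of 🐸STT, that reads clip file names and transcripts from a tab-separated file. The column names ``path`` and ``sentence`` are assumptions about the Common Voice layout, so check the header row of your own ``.tsv`` files. The sketch writes a tiny stand-in file instead of reading a real one.

.. code-block:: python

   import csv
   import tempfile
   from pathlib import Path


   def read_transcripts(tsv_path, path_column="path", text_column="sentence"):
       """Return (clip file name, transcript) pairs from a tab-separated file.

       The default column names are assumptions; adjust them to match the
       header row of the file you are reading.
       """
       with open(tsv_path, newline="", encoding="utf-8") as handle:
           # QUOTE_NONE keeps quotation marks in transcripts as literal text.
           reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
           return [(row[path_column], row[text_column]) for row in reader]


   # Tiny stand-in for a real metadata file, written to a temporary directory.
   with tempfile.TemporaryDirectory() as tmp:
       demo_tsv = Path(tmp) / "demo.tsv"
       demo_tsv.write_text(
           "path\tsentence\n"
           "clip_0001.mp3\tBonjour tout le monde\n"
           "clip_0002.mp3\tMerci beaucoup\n",
           encoding="utf-8",
       )
       pairs = read_transcripts(demo_tsv)

   for clip, sentence in pairs:
       print(clip, "->", sentence)

To inspect real data, point ``read_transcripts`` at one of the extracted ``.tsv`` files.
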

🐸STT cannot work with Common Voice data directly, so run the importer script ``bin/import_cv2.py`` to convert the data into the expected format:

.. code-block:: bash

   bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/common-voice/archive

Providing a filter alphabet is optional. When one is provided, the importer excludes every audio file whose transcript contains characters not in the specified alphabet. Run the importer with ``-h`` to see additional options.

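
The filtering rule described above can be illustrated with a short sketch. This is not the importer's code. It assumes an alphabet file with one character per line in which lines starting with ``#`` are comments, and the function names ``load_alphabet`` and ``transcript_is_allowed`` are hypothetical.

.. code-block:: python

   def load_alphabet(lines):
       """Collect allowed characters, assuming one character per line."""
       allowed = set()
       for line in lines:
           entry = line.rstrip("\r\n")
           # Assumption: "#" starts a comment line; blank lines are skipped.
           if entry == "" or entry.startswith("#"):
               continue
           allowed.add(entry)
       return allowed


   def transcript_is_allowed(transcript, allowed):
       """Return True only if every character of the transcript is allowed."""
       return all(char in allowed for char in transcript)


   # Demo alphabet: a space plus the letters a, b, and c.
   alphabet = load_alphabet(["# demo alphabet\n", " \n", "a\n", "b\n", "c\n"])
   transcripts = ["a cab", "café", "abc"]
   kept = [text for text in transcripts if transcript_is_allowed(text, alphabet)]
   print(kept)

In this sketch a single out-of-alphabet character is enough to drop the whole sample, which matches the behavior described for ``--filter_alphabet``.
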

The importer creates a new ``WAV`` file for every ``MP3`` file in the ``clips`` directory. It also creates the following ``CSV`` files:

* ``clips/train.csv``
* ``clips/dev.csv``
* ``clips/test.csv``

The CSV files contain the following fields:

* ``wav_filename`` - path to the audio file, either absolute or relative. The importer produces relative paths.
* ``wav_filesize`` - size of the sample in bytes (an integer), used for sorting the data before training.
* ``transcript`` - transcription target for the sample.

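
As a quick sanity check on an imported data set, the sketch below reads a CSV with the three fields listed above, converts ``wav_filesize`` to an integer, and sorts the samples by size. It only illustrates the file layout and is not how the training code loads data. The helper name ``load_samples`` and the sample rows are made up for the example.

.. code-block:: python

   import csv
   import io

   EXPECTED_FIELDS = ["wav_filename", "wav_filesize", "transcript"]


   def load_samples(handle):
       """Read samples from an open CSV file and sort them by file size."""
       reader = csv.DictReader(handle)
       header = reader.fieldnames or []
       missing = [name for name in EXPECTED_FIELDS if name not in header]
       if missing:
           raise ValueError(f"missing CSV fields: {missing}")
       samples = []
       for row in reader:
           # wav_filesize is expected to be an integer number of bytes.
           row["wav_filesize"] = int(row["wav_filesize"])
           samples.append(row)
       return sorted(samples, key=lambda sample: sample["wav_filesize"])


   # In-memory stand-in for a file such as clips/train.csv.
   demo_csv = io.StringIO(
       "wav_filename,wav_filesize,transcript\n"
       "clip_b.wav,204800,merci beaucoup\n"
       "clip_a.wav,102400,bonjour\n"
   )
   samples = load_samples(demo_csv)
   for sample in samples:
       print(sample["wav_filename"], sample["wav_filesize"])

For a real file, pass a handle opened with ``open(path, newline="", encoding="utf-8")`` instead of the ``io.StringIO`` object.
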

To use Common Voice data for training, validation, and testing, pass the ``CSV`` filenames via ``--train_files``, ``--dev_files``, and ``--test_files``.

For example, if you downloaded, extracted, and imported the French language data from Common Voice, you will have a new local directory named ``fr``. You can then train 🐸STT on the French data as follows:

.. code-block:: bash

   python -m coqui_stt_training.train \
       --train_files fr/clips/train.csv \
       --dev_files fr/clips/dev.csv \
       --test_files fr/clips/test.csv