diff --git a/Dockerfile.train.jupyter b/Dockerfile.train.jupyter
index 71f3e466..a11d28b2 100644
--- a/Dockerfile.train.jupyter
+++ b/Dockerfile.train.jupyter
@@ -1,13 +1,13 @@
 # This is a Dockerfile useful for training models with Coqui STT in Jupyter notebooks
 
-FROM ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.9
+FROM ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.10
+
+COPY notebooks /code/notebooks
+WORKDIR /code/notebooks
 
 RUN python3 -m pip install --no-cache-dir jupyter jupyter_http_over_ws
 RUN jupyter serverextension enable --py jupyter_http_over_ws
 
-RUN mv /code /home/STT
-WORKDIR /home
-
 EXPOSE 8888
 
-CMD ["bash", "-c", "jupyter notebook --notebook-dir=/home --ip 0.0.0.0 --no-browser --allow-root"]
+CMD ["bash", "-c", "jupyter notebook --notebook-dir=/code/notebooks --ip 0.0.0.0 --no-browser --allow-root"]
diff --git a/notebooks/train-ldc.ipynb b/notebooks/train-ldc.ipynb
new file mode 100644
index 00000000..895785dd
--- /dev/null
+++ b/notebooks/train-ldc.ipynb
@@ -0,0 +1,253 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f79d99ef",
+   "metadata": {},
+   "source": [
+    "# Train your first 🐸 STT model 💫\n",
+    "\n",
+    "👋 Hello and welcome to Coqui (🐸) STT \n",
+    "\n",
+    "The goal of this notebook is to show you a **typical workflow** for **training** and **testing** an STT model with 🐸.\n",
+    "\n",
+    "Let's train a very small model on a very small amount of data so we can iterate quickly.\n",
+    "\n",
+    "In this notebook, we will:\n",
+    "\n",
+    "1. Download data and format it for 🐸 STT.\n",
+    "2. Configure the training and testing runs.\n",
+    "3. Train a new model.\n",
+    "4. Test the model and display its performance.\n",
+    "\n",
+    "So, let's jump right in!\n",
+    "\n",
+    "*PS - If you just want a working, off-the-shelf model, check out the [🐸 Model Zoo](https://www.coqui.ai/models)*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "be5fe49c",
+   "metadata": {},
+   "source": [
+    "## ✅ Download & format sample data for English\n",
+    "\n",
+    "**First things first**: we need some data.\n",
+    "\n",
+    "We're training a Speech-to-Text model, so we need some _speech_ and we need some _text_. Specifically, we want _transcribed speech_. Let's download an English audio file and its transcript and then format them for 🐸 STT.\n",
+    "\n",
+    "🐸 STT expects to find information about your data in a CSV file, where each line contains:\n",
+    "\n",
+    "1. the **path** to an audio file\n",
+    "2. the **size** of that audio file\n",
+    "3. the **transcript** of that audio file.\n",
+    "\n",
+    "Formatting the audio and transcript isn't too difficult in this case. We define a custom data importer called `download_sample_data()` which does all the work. If you have a custom dataset, you will probably want to write a custom data importer.\n",
+    "\n",
+    "**Second things second**: we want an alphabet. The output layer of a typical* 🐸 STT model represents letters in the alphabet, and you should specify this alphabet before training. Let's download an English alphabet from Coqui and use that.\n",
+    "\n",
+    "_*If you are working with languages with large character sets (e.g. Chinese), you can set `bytes_output_mode=True` instead of supplying an `alphabet.txt` file. In this case, the output layer of the STT model will correspond to individual UTF-8 bytes instead of individual characters._"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "53945462",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "### Download sample data\n",
+    "import os\n",
+    "import pandas\n",
+    "from coqui_stt_training.util.downloader import maybe_download\n",
+    "\n",
+    "def download_sample_data():\n",
+    "    data_dir=\"english/\"\n",
+    "    # Download data + alphabet\n",
+    "    audio_file = maybe_download(\"LDC93S1.wav\", data_dir, \"https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav\")\n",
+    "    transcript_file = maybe_download(\"LDC93S1.txt\", data_dir, \"https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.txt\")\n",
+    "    alphabet = maybe_download(\"alphabet.txt\", data_dir, \"https://raw.githubusercontent.com/coqui-ai/STT/main/data/alphabet.txt\")\n",
+    "    # Format data\n",
+    "    with open(transcript_file, \"r\") as fin:\n",
+    "        transcript = \" \".join(fin.read().strip().lower().split(\" \")[2:]).replace(\".\", \"\")\n",
+    "    df = pandas.DataFrame(data=[(os.path.abspath(audio_file), os.path.getsize(audio_file), transcript)],\n",
+    "                          columns=[\"wav_filename\", \"wav_filesize\", \"transcript\"])\n",
+    "    # Save formatted CSV\n",
+    "    df.to_csv(os.path.join(data_dir, \"ldc93s1.csv\"), index=False)\n",
+    "\n",
+    "# Download and format data\n",
+    "download_sample_data()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "96e8b708",
+   "metadata": {},
+   "source": [
+    "### Take a look at the data (*Optional*)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa2aec77",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "csv_file = open(\"english/ldc93s1.csv\", \"r\")\n",
+    "print(csv_file.read())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6c046277",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "alphabet_file = open(\"english/alphabet.txt\", \"r\")\n",
+    "print(alphabet_file.read())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d9dfac21",
+   "metadata": {},
+   "source": [
+    "## ✅ Configure & set hyperparameters\n",
+    "\n",
+    "Coqui STT comes with a long list of hyperparameters you can tweak. We've set default values, but you will often want to set your own. You can use `initialize_globals_from_args()` to do this.\n",
+    "\n",
+    "You must **always** configure the paths to your data, and you must **always** configure your alphabet. Additionally, here we show how you can specify the size of hidden layers (`n_hidden`), the number of epochs to train for (`epochs`), and how to initialize a new model from scratch (`load_train=\"init\"`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d264fdec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.util.config import initialize_globals_from_args\n",
+    "\n",
+    "initialize_globals_from_args(\n",
+    "    alphabet_config_path=\"english/alphabet.txt\",\n",
+    "    train_files=[\"english/ldc93s1.csv\"],\n",
+    "    dev_files=[\"english/ldc93s1.csv\"],\n",
+    "    test_files=[\"english/ldc93s1.csv\"],\n",
+    "    load_train=\"init\",\n",
+    "    n_hidden=100,\n",
+    "    epochs=200,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "799c1425",
+   "metadata": {},
+   "source": [
+    "### View all Config settings (*Optional*)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "03b33d2b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.util.config import Config\n",
+    "\n",
+    "# Take a peek at the entire Config\n",
+    "print(Config.to_json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae82fd75",
+   "metadata": {},
+   "source": [
+    "## ✅ Train a new model\n",
+    "\n",
+    "Let's kick off a training run 🚀🚀🚀 (using the configuration you set above).\n",
+    "\n",
+    "This notebook should work on either a GPU or a CPU. However, if you're running this on _multiple_ GPUs, we want to use only one, because the sample dataset (one audio file) is too small to split across multiple GPUs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "550a504e",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.train import train, early_training_checks\n",
+    "import tensorflow.compat.v1 as tfv1\n",
+    "\n",
+    "# use maximum one GPU\n",
+    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
+    "\n",
+    "early_training_checks()\n",
+    "\n",
+    "tfv1.reset_default_graph()\n",
+    "train()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9f6dc959",
+   "metadata": {},
+   "source": [
+    "## ✅ Test the model\n",
+    "\n",
+    "We made it! 🙌\n",
+    "\n",
+    "Let's kick off the testing run, which displays performance metrics.\n",
+    "\n",
+    "We're committing the cardinal sin of ML 😈 (aka testing on our training data), so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇\n",
+    "\n",
+    "You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.\n",
+    "\n",
+    "When you start training your own models, make sure your testing data doesn't include your training data 😅"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dd42bc7a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.train import test\n",
+    "\n",
+    "tfv1.reset_default_graph()\n",
+    "test()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
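The notebook above notes that a custom dataset usually needs its own importer. As a rough sketch outside this patch, here is what one might look like: the `wav_filename,wav_filesize,transcript` CSV columns come from the notebook itself, while the `make_custom_csv()` name, the `my_dataset/` directory, and the one-`.txt`-transcript-per-`.wav` layout are assumptions made purely for illustration.

```python
# Hypothetical custom importer sketch (assumed layout: my_dataset/clip.wav + my_dataset/clip.txt).
# Only the CSV column names are taken from the notebook above; everything else is illustrative.
import glob
import os

import pandas


def make_custom_csv(data_dir="my_dataset/", csv_name="custom.csv"):
    rows = []
    for wav_path in sorted(glob.glob(os.path.join(data_dir, "*.wav"))):
        # Assume each clip has a transcript file with the same basename.
        txt_path = os.path.splitext(wav_path)[0] + ".txt"
        with open(txt_path, "r") as fin:
            # Lower-case so the transcript only uses characters from alphabet.txt,
            # mirroring what download_sample_data() does for the LDC sample.
            transcript = fin.read().strip().lower()
        rows.append((os.path.abspath(wav_path), os.path.getsize(wav_path), transcript))
    df = pandas.DataFrame(rows, columns=["wav_filename", "wav_filesize", "transcript"])
    df.to_csv(os.path.join(data_dir, csv_name), index=False)


make_custom_csv()
```

The resulting CSV can then be passed to `initialize_globals_from_args()` via `train_files`, `dev_files`, and `test_files` in place of `english/ldc93s1.csv`.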