Add first Jupyter notebook

2021-07-23 12:12:02 -04:00 · 9f7fda14cb
commit 9f7fda14cb
parent 59e32556a4
2 changed files with 258 additions and 5 deletions
--- a/Dockerfile.train.jupyter
+++ b/Dockerfile.train.jupyter
@@ -1,13 +1,13 @@
 # This is a Dockerfile useful for training models with Coqui STT in Jupyter notebooks

-FROM ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.9
+FROM ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.10
+
+COPY notebooks /code/notebooks
+WORKDIR /code/notebooks

 RUN python3 -m pip install --no-cache-dir jupyter jupyter_http_over_ws
 RUN jupyter serverextension enable --py jupyter_http_over_ws

-RUN mv /code /home/STT
-WORKDIR /home
-
 EXPOSE 8888

-CMD ["bash", "-c", "jupyter notebook --notebook-dir=/home --ip 0.0.0.0 --no-browser --allow-root"]
+CMD ["bash", "-c", "jupyter notebook --notebook-dir=/code/notebooks --ip 0.0.0.0 --no-browser --allow-root"]
--- a/notebooks/train-ldc.ipynb
+++ b/notebooks/train-ldc.ipynb
@@ -0,0 +1,253 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f79d99ef",
+   "metadata": {},
+   "source": [
+    "# Train your first 🐸 STT model 💫\n",
+    "\n",
+    "👋 Hello and welcome to Coqui (🐸) STT \n",
+    "\n",
+    "The goal of this notebook is to show you a **typical workflow** for **training** and **testing** an STT model with 🐸.\n",
+    "\n",
+    "Let's train a very small model on a very small amount of data so we can iterate quickly.\n",
+    "\n",
+    "In this notebook, we will:\n",
+    "\n",
+    "1. Download data and format it for 🐸 STT.\n",
+    "2. Configure the training and testing runs.\n",
+    "3. Train a new model.\n",
+    "4. Test the model and display its performance.\n",
+    "\n",
+    "So, let's jump right in!\n",
+    "\n",
+    "*PS - If you just want a working, off-the-shelf model, check out the [🐸 Model Zoo](https://www.coqui.ai/models)*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "be5fe49c",
+   "metadata": {},
+   "source": [
+    "## ✅ Download & format sample data for English\n",
+    "\n",
+    "**First things first**: we need some data.\n",
+    "\n",
+    "We're training a Speech-to-Text model, so we need some _speech_ and we need some _text_. Specifically, we want _transcribed speech_. Let's download an English audio file and its transcript and then format them for 🐸 STT. \n",
+    "\n",
+    "🐸 STT expects to find information about your data in a CSV file, where each line contains:\n",
+    "\n",
+    "1. the **path** to an audio file\n",
+    "2. the **size** of that audio file\n",
+    "3. the **transcript** of that audio file.\n",
+    "\n",
+    "Formatting the audio and transcript isn't too difficult in this case. We define a custom data importer called `download_sample_data()` which does all the work. If you have a custom dataset, you will probably want to write a custom data importer.\n",
+    "\n",
+    "**Second things second**: we want an alphabet. The output layer of a typical* 🐸 STT model represents letters in the alphabet, and you should specify this alphabet before training. Let's download an English alphabet from Coqui and use that.\n",
+    "\n",
+    "_*If you are working with languages with large character sets (e.g. Chinese), you can set `bytes_output_mode=True` instead of supplying an `alphabet.txt` file. In this case, the output layer of the STT model will correspond to individual UTF-8 bytes instead of individual characters._"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "53945462",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "### Download sample data\n",
+    "import os\n",
+    "import pandas\n",
+    "from coqui_stt_training.util.downloader import maybe_download\n",
+    "\n",
+    "def download_sample_data():\n",
+    "    data_dir=\"english/\"\n",
+    "    # Download data + alphabet\n",
+    "    audio_file = maybe_download(\"LDC93S1.wav\", data_dir, \"https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav\")\n",
+    "    transcript_file = maybe_download(\"LDC93S1.txt\", data_dir, \"https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.txt\")\n",
+    "    alphabet = maybe_download(\"alphabet.txt\", data_dir, \"https://raw.githubusercontent.com/coqui-ai/STT/main/data/alphabet.txt\")\n",
+    "    # Format data\n",
+    "    with open(transcript_file, \"r\") as fin:\n",
+    "        transcript = \" \".join(fin.read().strip().lower().split(\" \")[2:]).replace(\".\", \"\")\n",
+    "    df = pandas.DataFrame(data=[(os.path.abspath(audio_file), os.path.getsize(audio_file), transcript)],\n",
+    "                          columns=[\"wav_filename\", \"wav_filesize\", \"transcript\"])\n",
+    "    # Save formatted CSV \n",
+    "    df.to_csv(os.path.join(data_dir, \"ldc93s1.csv\"), index=False)\n",
+    "\n",
+    "# Download and format data\n",
+    "download_sample_data()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "96e8b708",
+   "metadata": {},
+   "source": [
+    "### Take a look at the data (*Optional*)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa2aec77",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"english/ldc93s1.csv\", \"r\") as csv_file:\n",
+    "    print(csv_file.read())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6c046277",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"english/alphabet.txt\", \"r\") as alphabet_file:\n",
+    "    print(alphabet_file.read())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d9dfac21",
+   "metadata": {},
+   "source": [
+    "## ✅ Configure & set hyperparameters\n",
+    "\n",
+    "Coqui STT comes with a long list of hyperparameters you can tweak. We've set default values, but you will often want to set your own. You can use `initialize_globals_from_args()` to do this. \n",
+    "\n",
+    "You must **always** configure the paths to your data, and you must **always** configure your alphabet. Additionally, here we show how to specify the size of the hidden layers (`n_hidden`) and the number of epochs to train for (`epochs`), and how to initialize a new model from scratch (`load_train=\"init\"`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d264fdec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.util.config import initialize_globals_from_args\n",
+    "\n",
+    "initialize_globals_from_args(\n",
+    "    alphabet_config_path=\"english/alphabet.txt\",\n",
+    "    train_files=[\"english/ldc93s1.csv\"],\n",
+    "    dev_files=[\"english/ldc93s1.csv\"],\n",
+    "    test_files=[\"english/ldc93s1.csv\"],\n",
+    "    load_train=\"init\",\n",
+    "    n_hidden=100,\n",
+    "    epochs=200,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "799c1425",
+   "metadata": {},
+   "source": [
+    "### View all Config settings (*Optional*) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "03b33d2b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.util.config import Config\n",
+    "\n",
+    "# Take a peek at the entire Config\n",
+    "print(Config.to_json())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae82fd75",
+   "metadata": {},
+   "source": [
+    "## ✅ Train a new model\n",
+    "\n",
+    "Let's kick off a training run 🚀🚀🚀 (using the configuration you set above).\n",
+    "\n",
+    "This notebook should work on either a GPU or a CPU. However, if you're running it on a machine with _multiple_ GPUs, we want to use only one of them, because the sample dataset (one audio file) is too small to split across multiple GPUs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "550a504e",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.train import train, early_training_checks\n",
+    "import tensorflow.compat.v1 as tfv1\n",
+    "\n",
+    "# Use at most one GPU\n",
+    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
+    "\n",
+    "early_training_checks()\n",
+    "\n",
+    "tfv1.reset_default_graph()\n",
+    "train()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9f6dc959",
+   "metadata": {},
+   "source": [
+    "## ✅ Test the model\n",
+    "\n",
+    "We made it! 🙌\n",
+    "\n",
+    "Let's kick off the testing run, which displays performance metrics.\n",
+    "\n",
+    "We're committing the cardinal sin of ML 😈 (i.e., testing on our training data), so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇\n",
+    "\n",
+    "You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.\n",
+    "\n",
+    "When you start training your own models, make sure your testing data doesn't include your training data 😅"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dd42bc7a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from coqui_stt_training.train import test\n",
+    "\n",
+    "tfv1.reset_default_graph()\n",
+    "test()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
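
For a dataset with more than one recording, the three-column CSV that the notebook's `download_sample_data()` builds (path, size, transcript) could be produced along these lines. This is a minimal sketch, not Coqui's importer: `normalize_transcript` and `write_manifest` are hypothetical helper names, the demo file is a 16-byte placeholder rather than real audio, and the transcript line is assumed to start with two numeric tokens, as the `[2:]` slice in the notebook implies. It uses only the Python standard library.

```python
import csv
import os
import tempfile


def normalize_transcript(raw_line):
    """Drop the two leading tokens, lowercase the rest, and strip periods.

    Mirrors the one-line cleanup in the notebook; the assumption that the
    first two tokens are not part of the sentence comes from its [2:] slice.
    """
    words = raw_line.strip().lower().split(" ")[2:]
    return " ".join(words).replace(".", "")


def write_manifest(rows, csv_path):
    """Write (wav_path, transcript) pairs with the notebook's three columns."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in rows:
            writer.writerow(
                [os.path.abspath(wav_path), os.path.getsize(wav_path), transcript]
            )


# Demo: a placeholder file stands in for real audio (it is NOT a valid WAV).
demo_dir = tempfile.mkdtemp()
demo_wav = os.path.join(demo_dir, "sample.wav")
with open(demo_wav, "wb") as f:
    f.write(b"\x00" * 16)

demo_transcript = normalize_transcript("0 100 Hello there world.")
demo_csv = os.path.join(demo_dir, "manifest.csv")
write_manifest([(demo_wav, demo_transcript)], demo_csv)

# Read the manifest back to inspect the header and the single data row.
with open(demo_csv, newline="", encoding="utf-8") as fin:
    demo_rows = list(csv.reader(fin))
print(demo_rows[0])
print(demo_rows[1])
```

Using the `csv` module instead of pandas keeps the sketch dependency-free. The notebook's pandas approach writes the same three columns.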