Merge pull request #1966 from JRMeyer/cv-notebook

Python notebook for training on Common Voice
2021-09-16 06:51:30 -04:00 · 2021-09-16 06:51:30 -04:00 · df67678220
commit df67678220
parent fd719ac013 7cbe879fc6
6 changed files with 313 additions and 15 deletions
--- a/.github/workflows/build-and-test.yml
+++ b/.github/workflows/build-and-test.yml
@@ -670,6 +670,22 @@ jobs:
          bitrate: ${{ matrix.bitrate }}
          model-kind: ${{ matrix.models }}
        timeout-minutes: 5
+  python-notebooks-tests:
+    name: "Lin|Python notebook tests"
+    runs-on: ubuntu-20.04
+    if: ${{ github.event_name == 'pull_request' }}
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 1
+      - uses: actions/setup-python@v2
+        with:
+          python-version: 3.7
+      - name: Run python notebooks
+        run: |
+          sudo apt-get install -y --no-install-recommends libopusfile0 libopus-dev libopusfile-dev
+          python -m pip install jupyter
+          ./ci_scripts/notebook-tests.sh
  training-basic-tests:
    name: "Lin|Basic training tests"
    runs-on: ubuntu-20.04
--- a/ci_scripts/notebook-tests.sh
+++ b/ci_scripts/notebook-tests.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+set -xe
+
+source "$(dirname "$0")/all-vars.sh"
+source "$(dirname "$0")/all-utils.sh"
+
+set -o pipefail
+pip install --upgrade pip setuptools wheel | cat
+pip install --upgrade . | cat
+set +o pipefail
+
+for python_notebook in ./notebooks/*.ipynb; do
+    time jupyter nbconvert --to notebook --execute "$python_notebook"
+done
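
Reviewer sketch (not part of this commit): the loop in `ci_scripts/notebook-tests.sh` can be mirrored in Python when a shell is not convenient. This is a minimal sketch under stated assumptions: `nbconvert_command` and `run_notebooks` are hypothetical helper names, and `dry_run` is an illustrative parameter added so the command list can be inspected without Jupyter installed. Only the `jupyter nbconvert --to notebook --execute` invocation is taken from the script above.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook: Path) -> list:
    """Build the argv that executes one notebook, matching the CI script's call."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", str(notebook)]


def run_notebooks(notebook_dir: str = "./notebooks", dry_run: bool = False) -> list:
    """Execute every *.ipynb file in notebook_dir and return the commands used.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    notebooks = sorted(Path(notebook_dir).glob("*.ipynb"))
    commands = [nbconvert_command(nb) for nb in notebooks]
    for argv in commands:
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status,
            # which stops the loop at the first failing notebook (like `set -e`).
            subprocess.run(argv, check=True)
    return commands
```

Calling `run_notebooks("./notebooks", dry_run=True)` returns one command per notebook, which can be handy for checking which files the CI job would pick up.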
--- a/notebooks/README.md
+++ b/notebooks/README.md
@@ -1,4 +1,7 @@
 # Python Notebooks for 🐸 STT

-1. Train a new Speech-to-Text model from scratch [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/coqui-ai/STT/blob/main/notebooks/train-your-first-coqui-STT-model.ipynb)
-2. Transfer learning (English --> Russian) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/coqui-ai/STT/blob/main/notebooks/easy-transfer-learning.ipynb)
+| Notebook title | Language(s) | Link to Colab |
+|----------------|---------------|-------------|
+| Train your first 🐸 STT model | English | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/coqui-ai/STT/blob/main/notebooks/train_your_first_coqui_STT_model.ipynb) |
+| Easy transfer learning | English --> Russian | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/coqui-ai/STT/blob/main/notebooks/easy_transfer_learning.ipynb) |
+| Train a model with Common Voice | Serbian | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/coqui-ai/STT/blob/main/notebooks/train_with_common_voice.ipynb) |
--- a/notebooks/easy_transfer_learning.ipynb
+++ b/notebooks/easy_transfer_learning.ipynb
@@ -34,9 +34,9 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "## Install Coqui STT if you need to\n",
-    "!git clone --depth 1 https://github.com/coqui-ai/STT.git\n",
-    "!cd STT; pip install -U pip wheel setuptools; pip install ."
+    "## Install Coqui STT\n",
+    "! pip install -U pip\n",
+    "! pip install coqui_stt_training"
   ]
  },
  {
@@ -147,7 +147,7 @@
    "    alphabet_config_path=\"russian/alphabet.txt\",\n",
    "    train_files=[\"russian/ru.csv\"],\n",
    "    dev_files=[\"russian/ru.csv\"],\n",
-    "    epochs=200,\n",
+    "    epochs=100,\n",
    "    load_cudnn=True,\n",
    ")"
   ]
--- a/notebooks/train_with_common_voice.ipynb
+++ b/notebooks/train_with_common_voice.ipynb
@@ -0,0 +1,265 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 5,
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.8.5"
+    },
+    "colab": {
+      "name": "train-with-common-voice-data.ipynb",
+      "private_outputs": true,
+      "provenance": [],
+      "collapsed_sections": []
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "f79d99ef"
+      },
+      "source": [
+        "# Train a 🐸 STT model with Common Voice data 💫\n",
+        "\n",
+        "👋 Hello and welcome to Coqui (🐸) STT \n",
+        "\n",
+        "This notebook shows a **typical workflow** for **training** and **testing** a 🐸 STT model on data from Common Voice.\n",
+        "\n",
+        "In this notebook, we will:\n",
+        "\n",
+        "1. Download Common Voice data (pre-formatted for 🐸 STT)\n",
+        "2. Configure the training and testing runs\n",
+        "3. Train a new model\n",
+        "4. Test the model and display its performance\n",
+        "\n",
+        "So, let's jump right in!\n",
+        "\n",
+        "*PS - If you just want a working, off-the-shelf model, check out the [🐸 Model Zoo](https://www.coqui.ai/models)*"
+      ],
+      "id": "f79d99ef"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "fa2aec78"
+      },
+      "source": [
+        "## Install Coqui STT\n",
+        "! pip install -U pip\n",
+        "! pip install coqui_stt_training\n",
+        "## Install opus tools\n",
+        "! apt-get install -y libopusfile0 libopus-dev libopusfile-dev"
+      ],
+      "id": "fa2aec78",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "be5fe49c"
+      },
+      "source": [
+        "## ✅ Download & format sample data for Serbian\n",
+        "\n",
+        "**First things first**: we need some data.\n",
+        "\n",
+        "We're training a Speech-to-Text model, so we want _speech_ and we want _text_. Specifically, we want _transcribed speech_. Let's download some audio and transcripts.\n",
+        "\n",
+        "To focus on model training, we have already formatted the Common Voice data for you. You will find `train.csv`, `dev.csv`, and `test.csv` in the data directory.\n",
+        "\n",
+        "Let's download some data for Serbian 😊\n"
+      ],
+      "id": "be5fe49c"
+    },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "608d203f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "### Download pre-formatted Common Voice data\n",
+    "import os\n",
+    "import tarfile\n",
+    "from coqui_stt_training.util.downloader import maybe_download\n",
+    "\n",
+    "def download_preformatted_data():\n",
+    "    if not os.path.exists(\"serbian/sr-data\"):\n",
+    "        maybe_download(\"sr-data.tar\", \"serbian/\", \"https://coqui-ai-public-data.s3.amazonaws.com/cv/7.0/sr-data.tar\")\n",
+    "        print('\\nExtracting data...')\n",
+    "        tar = tarfile.open(\"serbian/sr-data.tar\", mode=\"r:\")\n",
+    "        tar.extractall(\"serbian/\")\n",
+    "        tar.close()\n",
+    "        print('\\nFinished extracting data...')\n",
+    "    else:\n",
+    "        print('Found data - not extracting.')\n",
+    "\n",
+    "# Download + extract Common Voice data\n",
+    "download_preformatted_data()"
+   ]
+  },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "96e8b708"
+      },
+      "source": [
+        "### 👀 Take a look at the data"
+      ],
+      "id": "96e8b708"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "fa2aec77"
+      },
+      "source": [
+        "! ls serbian/sr-data\n",
+        "! wc -l serbian/sr-data/*.csv"
+      ],
+      "id": "fa2aec77",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "d9dfac21"
+      },
+      "source": [
+        "## ✅ Configure & set hyperparameters\n",
+        "\n",
+        "Coqui STT comes with a long list of hyperparameters you can tweak. We've set default values, but you will often want to set your own. You can use `initialize_globals_from_args()` to do this. \n",
+        "\n",
+        "You must **always** configure the paths to your data, and you must **always** configure your alphabet. Additionally, here we show how you can specify the size of hidden layers (`n_hidden`) and the number of epochs to train for (`epochs`), and how to initialize a new model from scratch (`load_train=\"init\"`).\n",
+        "\n",
+        "If you're training on a GPU, you can uncomment the (larger) training batch sizes for faster training."
+      ],
+      "id": "d9dfac21"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "d264fdec"
+      },
+      "source": [
+        "from coqui_stt_training.util.config import initialize_globals_from_args\n",
+        "\n",
+        "initialize_globals_from_args(\n",
+        "    train_files=[\"serbian/sr-data/train.csv\"],\n",
+        "    dev_files=[\"serbian/sr-data/dev.csv\"],\n",
+        "    test_files=[\"serbian/sr-data/test.csv\"],\n",
+        "    checkpoint_dir=\"serbian/checkpoints/\",\n",
+        "    load_train=\"init\",\n",
+        "    n_hidden=200,\n",
+        "    epochs=1,\n",
+        "    beam_width=1,\n",
+        "    #train_batch_size=128,\n",
+        "    #dev_batch_size=128,\n",
+        "    #test_batch_size=128,\n",
+        ")"
+      ],
+      "id": "d264fdec",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "799c1425"
+      },
+      "source": [
+        "### 👀 View all config settings"
+      ],
+      "id": "799c1425"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "03b33d2b"
+      },
+      "source": [
+        "from coqui_stt_training.util.config import Config\n",
+        "\n",
+        "print(Config.to_json())"
+      ],
+      "id": "03b33d2b",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ae82fd75"
+      },
+      "source": [
+        "## ✅ Train a new model\n",
+        "\n",
+        "Let's kick off a training run 🚀🚀🚀 (using the configuration you set above)."
+      ],
+      "id": "ae82fd75"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "scrolled": true,
+        "id": "550a504e"
+      },
+      "source": [
+        "from coqui_stt_training.train import train\n",
+        "\n",
+        "train()"
+      ],
+      "id": "550a504e",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "9f6dc959"
+      },
+      "source": [
+        "## ✅ Test the model\n",
+        "\n",
+        "We made it! 🙌\n",
+        "\n",
+        "Let's kick off the testing run, which displays performance metrics.\n",
+        "\n",
+        "The settings we used here are for demonstration purposes, so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇\n",
+        "\n",
+        "You can still train a much more accurate model by finding better hyperparameters, so go for it 💪"
+      ],
+      "id": "9f6dc959"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "dd42bc7a"
+      },
+      "source": [
+        "from coqui_stt_training.evaluate import test\n",
+        "\n",
+        "test()"
+      ],
+      "id": "dd42bc7a",
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
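
Reviewer sketch (not part of this commit): the "extract only if the data is missing" step in the notebook's `download_preformatted_data()` cell could also be written with a context manager, so the archive is closed even if extraction fails. This is a minimal sketch under stated assumptions: `extract_once` is a hypothetical helper, not part of `coqui_stt_training`, and the download itself (`maybe_download`) is deliberately left out.

```python
import os
import tarfile


def extract_once(archive_path: str, dest_dir: str, marker: str) -> bool:
    """Extract archive_path into dest_dir unless dest_dir/marker already exists.

    Returns True if extraction happened and False if it was skipped.
    """
    if os.path.exists(os.path.join(dest_dir, marker)):
        return False
    # "r:" opens an uncompressed tar only, as in the notebook cell.
    # The with-block closes the archive even if extractall() raises.
    with tarfile.open(archive_path, mode="r:") as tar:
        tar.extractall(dest_dir)
    return True
```

In the notebook's terms, this would correspond to something like `extract_once("serbian/sr-data.tar", "serbian/", "sr-data")` after the archive has been downloaded.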
--- a/notebooks/train_your_first_coqui_STT_model.ipynb
+++ b/notebooks/train_your_first_coqui_STT_model.ipynb
@@ -32,9 +32,9 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "## Install Coqui STT if you need to\n",
-    "!git clone --depth 1 https://github.com/coqui-ai/STT.git\n",
-    "!cd STT; pip install -U pip wheel setuptools; pip install ."
+    "## Install Coqui STT\n",
+    "! pip install -U pip\n",
+    "! pip install coqui_stt_training"
   ]
  },
  {
@@ -54,9 +54,9 @@
    "2. the **size** of that audio file\n",
    "3. the **transcript** of that audio file.\n",
    "\n",
-    "Formatting the audio and transcript isn't too difficult in this case. We define a custom data importer called `download_sample_data()` which does all the work. If you have a custom dataset, you will probably want to write a custom data importer.\n",
+    "Formatting the audio and transcript isn't too difficult in this case. We define `download_sample_data()` which does all the work. If you have a custom dataset, you will want to write a custom data importer.\n",
    "\n",
-    "**Second things second**: we want an alphabet. The output layer of a typical* 🐸 STT model represents letters in the alphabet, and you should specify this alphabet before training. Let's download an English alphabet from Coqui and use that.\n",
+    "**Second things second**: we want an alphabet. The output layer of a typical* 🐸 STT model represents letters in the alphabet. Let's download an English alphabet from Coqui and use that.\n",
    "\n",
    "*_If you are working with languages with large character sets (e.g. Chinese), you can set `bytes_output_mode=True` instead of supplying an `alphabet.txt` file. In this case, the output layer of the STT model will correspond to individual UTF-8 bytes instead of individual characters._"
   ]
@@ -98,7 +98,7 @@
   "id": "96e8b708",
   "metadata": {},
   "source": [
-    "### Take a look at the data (*Optional* )"
+    "### 👀 Take a look at the data"
   ]
  },
  {
@@ -150,8 +150,8 @@
    "    dev_files=[\"english/ldc93s1.csv\"],\n",
    "    test_files=[\"english/ldc93s1.csv\"],\n",
    "    load_train=\"init\",\n",
-    "    n_hidden=100,\n",
-    "    epochs=200,\n",
+    "    n_hidden=200,\n",
+    "    epochs=100,\n",
    ")"
   ]
  },
@@ -160,7 +160,7 @@
   "id": "799c1425",
   "metadata": {},
   "source": [
-    "### View all Config settings (*Optional*) "
+    "### 👀 View all Config settings"
   ]
  },
  {