Add first Jupyter notebook

parent 59e32556a4
commit 9f7fda14cb
@@ -1,13 +1,13 @@
 # This is a Dockerfile useful for training models with Coqui STT in Jupyter notebooks
 
-FROM ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.9
+FROM ghcr.io/coqui-ai/stt-train:v0.10.0-alpha.10
 
+COPY notebooks /code/notebooks
+WORKDIR /code/notebooks
+
 RUN python3 -m pip install --no-cache-dir jupyter jupyter_http_over_ws
 RUN jupyter serverextension enable --py jupyter_http_over_ws
 
-RUN mv /code /home/STT
-WORKDIR /home
-
 EXPOSE 8888
 
-CMD ["bash", "-c", "jupyter notebook --notebook-dir=/home --ip 0.0.0.0 --no-browser --allow-root"]
+CMD ["bash", "-c", "jupyter notebook --notebook-dir=/code/notebooks --ip 0.0.0.0 --no-browser --allow-root"]
253 notebooks/train-ldc.ipynb Normal file

@@ -0,0 +1,253 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f79d99ef",
   "metadata": {},
   "source": [
    "# Train your first 🐸 STT model 💫\n",
    "\n",
    "👋 Hello and welcome to Coqui (🐸) STT\n",
    "\n",
    "The goal of this notebook is to show you a **typical workflow** for **training** and **testing** an STT model with 🐸.\n",
    "\n",
    "Let's train a very small model on a very small amount of data so we can iterate quickly.\n",
    "\n",
    "In this notebook, we will:\n",
    "\n",
    "1. Download data and format it for 🐸 STT.\n",
    "2. Configure the training and testing runs.\n",
    "3. Train a new model.\n",
    "4. Test the model and display its performance.\n",
    "\n",
    "So, let's jump right in!\n",
    "\n",
    "*PS - If you just want a working, off-the-shelf model, check out the [🐸 Model Zoo](https://www.coqui.ai/models)*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be5fe49c",
   "metadata": {},
   "source": [
    "## ✅ Download & format sample data for English\n",
    "\n",
    "**First things first**: we need some data.\n",
    "\n",
    "We're training a Speech-to-Text model, so we need some _speech_ and we need some _text_. Specifically, we want _transcribed speech_. Let's download an English audio file and its transcript and then format them for 🐸 STT.\n",
    "\n",
    "🐸 STT expects to find information about your data in a CSV file, where each line contains:\n",
    "\n",
    "1. the **path** to an audio file\n",
    "2. the **size** of that audio file\n",
    "3. the **transcript** of that audio file.\n",
    "\n",
    "Formatting the audio and transcript isn't too difficult in this case. We define a custom data importer called `download_sample_data()` which does all the work. If you have a custom dataset, you will probably want to write a custom data importer; there's a sketch of one after the download cell below.\n",
    "\n",
    "**Second things second**: we want an alphabet. The output layer of a typical* 🐸 STT model represents letters in the alphabet, and you should specify this alphabet before training. Let's download an English alphabet from Coqui and use that.\n",
    "\n",
    "_*If you are working with languages with large character sets (e.g. Chinese), you can set `bytes_output_mode=True` instead of supplying an `alphabet.txt` file. In this case, the output layer of the STT model will correspond to individual UTF-8 bytes instead of individual characters (there's a sketch of this configuration after the config cell below)._"
   ]
  },
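  {
   "cell_type": "markdown",
   "id": "a17c3e2b",
   "metadata": {},
   "source": [
    "For concreteness, here's roughly what the finished CSV looks like: a header row plus one row per audio clip. (The path and file size below are illustrative, not the exact values you'll get.)\n",
    "\n",
    "```\n",
    "wav_filename,wav_filesize,transcript\n",
    "/code/notebooks/english/LDC93S1.wav,93638,she had your dark suit in greasy wash water all year\n",
    "```"
   ]
  },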
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53945462",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "### Download sample data\n",
    "import os\n",
    "import pandas\n",
    "from coqui_stt_training.util.downloader import maybe_download\n",
    "\n",
    "def download_sample_data():\n",
    "    data_dir = \"english/\"\n",
    "    # Download data + alphabet\n",
    "    audio_file = maybe_download(\"LDC93S1.wav\", data_dir, \"https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav\")\n",
    "    transcript_file = maybe_download(\"LDC93S1.txt\", data_dir, \"https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.txt\")\n",
    "    alphabet = maybe_download(\"alphabet.txt\", data_dir, \"https://raw.githubusercontent.com/coqui-ai/STT/main/data/alphabet.txt\")\n",
    "    # Format data\n",
    "    with open(transcript_file, \"r\") as fin:\n",
    "        transcript = \" \".join(fin.read().strip().lower().split(\" \")[2:]).replace(\".\", \"\")\n",
    "    df = pandas.DataFrame(data=[(os.path.abspath(audio_file), os.path.getsize(audio_file), transcript)],\n",
    "                          columns=[\"wav_filename\", \"wav_filesize\", \"transcript\"])\n",
    "    # Save formatted CSV\n",
    "    df.to_csv(os.path.join(data_dir, \"ldc93s1.csv\"), index=False)\n",
    "\n",
    "# Download and format data\n",
    "download_sample_data()"
   ]
  },
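  {
   "cell_type": "markdown",
   "id": "b28d4f3c",
   "metadata": {},
   "source": [
    "As promised above, here's a minimal sketch of a custom data importer. It is **not** part of 🐸 STT itself, and it assumes a hypothetical layout where each `foo.wav` has a matching `foo.txt` transcript next to it. The point is just that any importer should end up at the same place: a CSV with `wav_filename`, `wav_filesize`, and `transcript` columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c39e5a4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Sketch: custom importer for a hypothetical wav + txt directory layout\n",
    "import glob\n",
    "import os\n",
    "import pandas\n",
    "\n",
    "def import_wav_txt_pairs(data_dir, csv_path):\n",
    "    rows = []\n",
    "    for wav in sorted(glob.glob(os.path.join(data_dir, \"*.wav\"))):\n",
    "        txt = os.path.splitext(wav)[0] + \".txt\"\n",
    "        if not os.path.exists(txt):\n",
    "            continue  # skip clips without a transcript\n",
    "        with open(txt, \"r\") as fin:\n",
    "            transcript = fin.read().strip().lower()\n",
    "        rows.append((os.path.abspath(wav), os.path.getsize(wav), transcript))\n",
    "    df = pandas.DataFrame(rows, columns=[\"wav_filename\", \"wav_filesize\", \"transcript\"])\n",
    "    df.to_csv(csv_path, index=False)\n",
    "\n",
    "# Example call (only if you actually have such a directory):\n",
    "# import_wav_txt_pairs(\"my_data/\", \"my_data/train.csv\")"
   ]
  },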
  {
   "cell_type": "markdown",
   "id": "96e8b708",
   "metadata": {},
   "source": [
    "### Take a look at the data (*Optional*)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa2aec77",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"english/ldc93s1.csv\", \"r\") as csv_file:\n",
    "    print(csv_file.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c046277",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"english/alphabet.txt\", \"r\") as alphabet_file:\n",
    "    print(alphabet_file.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9dfac21",
   "metadata": {},
   "source": [
    "## ✅ Configure & set hyperparameters\n",
    "\n",
    "Coqui STT comes with a long list of hyperparameters you can tweak. We've set default values, but you will often want to set your own. You can use `initialize_globals_from_args()` to do this.\n",
    "\n",
    "You must **always** configure the paths to your data, and you must **always** configure your alphabet. Additionally, here we show how you can specify the size of hidden layers (`n_hidden`) and the number of epochs to train for (`epochs`), and how to initialize a new model from scratch (`load_train=\"init\"`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d264fdec",
   "metadata": {},
   "outputs": [],
   "source": [
    "from coqui_stt_training.util.config import initialize_globals_from_args\n",
    "\n",
    "initialize_globals_from_args(\n",
    "    alphabet_config_path=\"english/alphabet.txt\",\n",
    "    train_files=[\"english/ldc93s1.csv\"],\n",
    "    dev_files=[\"english/ldc93s1.csv\"],\n",
    "    test_files=[\"english/ldc93s1.csv\"],\n",
    "    load_train=\"init\",\n",
    "    n_hidden=100,\n",
    "    epochs=200,\n",
    ")"
   ]
  },
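  {
   "cell_type": "markdown",
   "id": "d4af6b5e",
   "metadata": {},
   "source": [
    "As a side note, here's what the alphabet-free configuration mentioned earlier might look like. This is a sketch, not a tested recipe: `bytes_output_mode` is the flag named above, but double-check it against your 🐸 STT version. It's left commented out so it doesn't overwrite the configuration we just set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5ba7c6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: byte output mode instead of an alphabet file.\n",
    "# With bytes_output_mode=True the output layer predicts UTF-8 bytes,\n",
    "# so no alphabet_config_path is needed.\n",
    "#\n",
    "# initialize_globals_from_args(\n",
    "#     train_files=[\"english/ldc93s1.csv\"],\n",
    "#     dev_files=[\"english/ldc93s1.csv\"],\n",
    "#     test_files=[\"english/ldc93s1.csv\"],\n",
    "#     bytes_output_mode=True,\n",
    "#     load_train=\"init\",\n",
    "#     n_hidden=100,\n",
    "#     epochs=200,\n",
    "# )"
   ]
  },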
  {
   "cell_type": "markdown",
   "id": "799c1425",
   "metadata": {},
   "source": [
    "### View all Config settings (*Optional*)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03b33d2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from coqui_stt_training.util.config import Config\n",
    "\n",
    "# Take a peek at the entire Config\n",
    "print(Config.to_json())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae82fd75",
   "metadata": {},
   "source": [
    "## ✅ Train a new model\n",
    "\n",
    "Let's kick off a training run 🚀🚀🚀 (using the configuration you set above).\n",
    "\n",
    "This notebook should work on either a GPU or a CPU. However, if you're running this on _multiple_ GPUs, we restrict training to a single GPU, because the sample dataset (one audio file) is too small to split across multiple GPUs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "550a504e",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from coqui_stt_training.train import train, early_training_checks\n",
    "import tensorflow.compat.v1 as tfv1\n",
    "\n",
    "# use at most one GPU\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
    "\n",
    "early_training_checks()\n",
    "\n",
    "tfv1.reset_default_graph()\n",
    "train()"
   ]
  },
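  {
   "cell_type": "markdown",
   "id": "f6cb8d7a",
   "metadata": {},
   "source": [
    "One more aside before testing: `load_train=\"init\"` above built a fresh model. If you want to pick a run back up later, the usual approach is to point the config at a checkpoint directory and load the newest checkpoint instead. The sketch below is commented out so it doesn't disturb the test run; `checkpoint_dir` and `load_train=\"last\"` are standard 🐸 STT config flags, but verify them against your version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7dc9e8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: resume training from the newest checkpoint in checkpoint_dir.\n",
    "#\n",
    "# initialize_globals_from_args(\n",
    "#     alphabet_config_path=\"english/alphabet.txt\",\n",
    "#     train_files=[\"english/ldc93s1.csv\"],\n",
    "#     dev_files=[\"english/ldc93s1.csv\"],\n",
    "#     test_files=[\"english/ldc93s1.csv\"],\n",
    "#     checkpoint_dir=\"ckpt_dir\",\n",
    "#     load_train=\"last\",  # \"init\" = fresh model, \"last\" = newest checkpoint\n",
    "#     n_hidden=100,\n",
    "#     epochs=200,\n",
    "# )\n",
    "# tfv1.reset_default_graph()\n",
    "# train()"
   ]
  },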
  {
   "cell_type": "markdown",
   "id": "9f6dc959",
   "metadata": {},
   "source": [
    "## ✅ Test the model\n",
    "\n",
    "We made it! 🙌\n",
    "\n",
    "Let's kick off the testing run, which displays performance metrics.\n",
    "\n",
    "We're committing the cardinal sin of ML 😈 (aka testing on our training data), so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇\n",
    "\n",
    "You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.\n",
    "\n",
    "When you start training your own models, make sure your testing data doesn't include your training data 😅"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd42bc7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from coqui_stt_training.train import test\n",
    "\n",
    "tfv1.reset_default_graph()\n",
    "test()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}