Remove external scorer file and documentation and flag references

2020-07-27 21:09:32 +02:00
commit 2835151951
parent d98bf84b41
4 changed files with 5 additions and 7 deletions
--- a/data/README.rst
+++ b/data/README.rst
@@ -5,7 +5,7 @@ This directory contains language-specific data files. Most importantly, you will

 1. A list of unique characters for the target language (e.g. English) in ``data/alphabet.txt``. After installing the training code, you can check ``python -m deepspeech_training.util.check_characters --help`` for a tool that creates an alphabet file from a list of training CSV files.

-2. A scorer package (``data/lm/kenlm.scorer``) generated with ``generate_scorer_package`` (``native_client/generate_scorer_package.cpp``). The scorer package includes a binary n-gram language model generated with ``data/lm/generate_lm.py``.
+2. A script used to generate a binary n-gram language model: ``data/lm/generate_lm.py``.

 For more information on how to build these resources from scratch, see the ``External scorer scripts`` section on `deepspeech.readthedocs.io <https://deepspeech.readthedocs.io/>`_.

--- a/data/lm/kenlm.scorer
+++ b/data/lm/kenlm.scorer
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d0cf926ab9cab54a8a7d70003b931b2d62ebd9105ed392d1ec9c840029867799
-size 953363776
--- a/doc/BUILDING.rst
+++ b/doc/BUILDING.rst
@@ -282,8 +282,9 @@ Please push DeepSpeech data to ``/sdcard/deepspeech/``\ , including:


 * ``output_graph.tflite`` which is the TF Lite model
-* ``kenlm.scorer``, if you want to use the scorer; please be aware that too big
-  scorer will make the device run out of memory
+* External scorer file (available from one of our releases), if you want to use
+  the scorer; please be aware that too big scorer will make the device run out
+  of memory

 Then, push binaries from ``native_client.tar.xz`` to ``/data/local/tmp/ds``\ :

--- a/training/deepspeech_training/util/flags.py
+++ b/training/deepspeech_training/util/flags.py
@@ -157,7 +157,7 @@ def create_flags():

    f.DEFINE_boolean('utf8', False, 'enable UTF-8 mode. When this is used the model outputs UTF-8 sequences directly rather than using an alphabet mapping.')
    f.DEFINE_string('alphabet_config_path', 'data/alphabet.txt', 'path to the configuration file specifying the alphabet used by the network. See the comment in data/alphabet.txt for a description of the format.')
-    f.DEFINE_string('scorer_path', 'data/lm/kenlm.scorer', 'path to the external scorer file.')
+    f.DEFINE_string('scorer_path', '', 'path to the external scorer file.')
    f.DEFINE_alias('scorer', 'scorer_path')
    f.DEFINE_integer('beam_width', 1024, 'beam width used in the CTC decoder when building candidate transcriptions')
    f.DEFINE_float('lm_alpha', 0.931289039105002, 'the alpha hyperparameter of the CTC decoder. Language Model weight.')
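
Note on the ``scorer_path`` change: the default is now an empty string instead of ``data/lm/kenlm.scorer``, so code that reads the flag presumably has to treat an empty value as "no scorer configured". A minimal sketch of such a guard is shown below. ``resolve_scorer_path`` is a hypothetical helper written for illustration; it is not part of this commit, and how the training code actually handles the empty value is not shown in this diff.

```python
import os
from typing import Optional


def resolve_scorer_path(scorer_path: str) -> Optional[str]:
    """Hypothetical helper: map the --scorer_path flag value to a usable path.

    Returns None when the flag is empty (the new default), meaning that
    decoding should proceed without an external scorer.
    """
    if not scorer_path:
        # Empty string: no scorer was requested.
        return None
    if not os.path.isfile(scorer_path):
        # Fail early with a clear message instead of deep inside the decoder.
        raise FileNotFoundError(f"scorer file not found: {scorer_path}")
    return scorer_path
```

A caller could pass the flag value through this helper and only construct a scorer when the result is not ``None``.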