Language Model Toolkit
The Language Modeling Toolkit can be used to adapt the default Language Model with a text corpus. The use cases for this toolkit are:
- adding custom vocabulary terms to the Language Model
- boosting accuracy for keywords and/or phrases
- adapting the Language Model using a corpus of historical transcripts (this yields the best results)
Preparing the Training Data
To use the LM Toolkit, you need to prepare a "corpus" of text: a text file with one keyword/utterance/phrase per line. For example, here's what your corpus should look like under the following scenarios:
Adding custom vocabulary to the Language Model
Marcus Aurelius
APDEU
Professor Katie Waulter
...
Boosting accuracy for keywords and/or phrases
cancel my account
i want to cancel my account
talk to an agent
...
Adapting the Language Model using a corpus of historical transcripts
Hi, welcome to Verizon my name is Domenic how can I help you?
I need help troubleshooting my internet speeds.
...
The LM Toolkit cleans the text before training, so it's okay if your text contains punctuation and other formatting. If you want to control exactly how the text is cleaned, we recommend cleaning it yourself before training: remove all punctuation, lowercase the text, and convert numbers to their written form (e.g., thirty eight million, eighteen point two, etc.).
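If you do clean the corpus yourself, the lowercasing and punctuation steps can be sketched with standard Unix tools. This is only a minimal sketch, not the cleaning the LM Toolkit itself performs: the function names clean_corpus and digit_lines are hypothetical, it handles ASCII text only, it assumes apostrophes should be dropped rather than replaced with a space, and it does not convert numbers to words. Instead, digit_lines lists the lines that still contain digits so you can rewrite them by hand.

```shell
# Hypothetical helpers, not part of the LM Toolkit.

# clean_corpus: read raw text on stdin, write cleaned text on stdout.
# Steps: lowercase, drop apostrophes (assumption: "it's" becomes "its"),
# turn every other non-alphanumeric character into a space, squeeze
# repeated spaces, trim each line, and delete empty lines.
clean_corpus() {
  tr '[:upper:]' '[:lower:]' \
    | tr -d "'" \
    | tr -c 'a-z0-9\n' ' ' \
    | sed -e 's/  */ /g' -e 's/^ //' -e 's/ $//' -e '/^$/d'
}

# digit_lines: print "line_number:line" for every line on stdin that
# still contains a digit and needs to be spelled out manually.
digit_lines() {
  grep -n '[0-9]'
}
```

For example, clean_corpus < raw.txt > corpus.txt writes a cleaned copy (both file names are placeholders), and digit_lines < corpus.txt shows which lines still need their numbers written out.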
Expected Accuracy Improvements
With a custom Language Model, you can see accuracy improvements anywhere from 1-5% absolute. The size of the improvement depends on how much text data you use to adapt the Language Model, how well that text matches the data you see in the real world, and how different your data is from the data we train our default Language Model with.
Building a Custom Language Model
Log in to the AssemblyAI Quay Repository
docker login -u="<your username>" -p="<your password>" quay.io
Pull the LM Toolkit image
docker pull quay.io/assemblyai-onprem/lm-toolkit:0.1beta
Run the LM Toolkit on your text corpus
All you need to do is mount a text file into the Docker container, and then pass that text file to the ./toolkit command (as shown below) inside the Docker container. The ./toolkit script accepts two arguments: the first is the path to the text file, and the second is the name of the file that the custom language model should be saved to.
docker run -it -v /path/to/corpus.txt:/corpus.txt \
  quay.io/assemblyai-onprem/lm-toolkit:0.1beta \
  ./toolkit /corpus.txt corpus.txt.toolkit.lm
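Before running the command above, it can help to confirm that the host path you are mounting really is a non-empty file. With -v, Docker typically creates a missing host path as an empty directory, so a typo in the path may only show up later as a confusing error inside the container. The check below is a minimal sketch; check_corpus is a hypothetical helper, not part of the LM Toolkit.

```shell
# Hypothetical pre-flight check, not part of the LM Toolkit.
# Returns 0 and prints the line count if the corpus is a non-empty
# regular file; otherwise prints a message to stderr and returns 1.
check_corpus() {
  corpus=$1
  if [ ! -f "$corpus" ]; then
    echo "corpus not found or not a regular file: $corpus" >&2
    return 1
  fi
  if [ ! -s "$corpus" ]; then
    echo "corpus is empty: $corpus" >&2
    return 1
  fi
  # grep -c '' counts lines, including a final line without a newline.
  count=$(grep -c '' "$corpus")
  echo "$count lines in $corpus"
}
```

You could then chain it with the run step, for example check_corpus /path/to/corpus.txt && docker run ..., so the container only starts when the corpus is in place.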
Download the custom Language Model from the docker container
Once the Language Model has been built, you can docker cp it out of the container.
docker cp <containerid>:/toolkit/corpus.txt.toolkit.lm .
Using the custom Language Model
All you need to do is mount your custom Language Model file into your Stream Model Server or Regular Model Server Docker container, and then tell your Model Server the path to the Language Model to use. For more info on how to configure your Model Server to use your Custom Language Model, view the Async Engine docs or the Stream Model Server docs.