Language Model Toolkit

    This Toolkit is in beta. If you run into any bugs or issues, please contact us over Slack or via support@assemblyai.com

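    The Language Model Toolkit (LM Toolkit) adapts the default Language Model using a text corpus. The use cases for this toolkit are adding custom vocabulary to the Language Model, boosting accuracy for keywords and/or phrases, and adapting the Language Model using a corpus of historical transcripts.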

    Preparing the Training Data

    To use the LM Toolkit, you need to prepare a "corpus" of text: a text file with one keyword, utterance, or phrase per line. For example, here's what your corpus should look like under the following scenarios:

    Adding custom vocabulary to the Language Model

    Marcus Aurelius
    APDEU
    Professor Katie Waulter
    ...

    Boosting accuracy for keywords and/or phrases

    cancel my account
    i want to cancel my account
    talk to an agent
    ...

    Adapting the Language Model using a corpus of historical transcripts

    Hi, welcome to Verizon my name is Domenic how can I help you?
    I need help troubleshooting my internet speeds.
    ...

    The LM Toolkit cleans the text before training, so it's okay if your text contains punctuation and other formatting. If you want full control over how your text is cleaned, we recommend cleaning it yourself before training: remove all punctuation, lowercase the text, and convert numbers to their written form (e.g., seventeen, thirty eight million, eighteen point two).
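
    If you do clean the corpus yourself, the lowercasing and punctuation removal can be sketched with standard POSIX tools. This is a minimal sketch, not the toolkit's own cleaning step: the function name clean_corpus and the file names are hypothetical, and it assumes ASCII English text. It also removes apostrophes (so "don't" becomes "dont") and does not convert numbers to their written form, which still has to be handled separately.

    ```shell
    # Hypothetical helper: lowercase the input and strip punctuation.
    # Reads from stdin, writes to stdout. Digits are left untouched, so
    # numbers still need to be spelled out by hand or with another tool.
    clean_corpus() {
        tr '[:upper:]' '[:lower:]' \
            | tr -d '[:punct:]' \
            | sed -e 's/[[:space:]][[:space:]]*/ /g' -e 's/^ //' -e 's/ $//' \
            | grep -v '^$'
    }

    # Example usage (file names are placeholders):
    # clean_corpus < raw_corpus.txt > corpus.txt
    ```

    Lines that become empty after cleaning, such as a line containing only punctuation, are dropped so the corpus keeps one entry per line.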

    Expected Accuracy Improvements

    Accuracy improvements depend on three factors: the quantity of text data you use to adapt the Language Model, how well your text data matches the data you see in the real world, and how different your data is from the data we train our default Language Model with. With a custom Language Model, you can see accuracy improvements of anywhere from 1% to 5% absolute.

    Building a Custom Language Model

    Log in to the AssemblyAI Quay Repository

    docker login -u="<your username>" -p="<your password>" quay.io

    Pull the LM Toolkit image

    docker pull quay.io/assemblyai-onprem/lm-toolkit:0.1beta

    Run the LM Toolkit on your text corpus

    Mount your text file into the Docker container, then pass it to the ./toolkit command inside the container, as shown below. The ./toolkit script accepts two arguments: the first is the path to the text file, and the second is the name of the file that the custom Language Model should be saved to.

    docker run -it -v /path/to/corpus.txt:/corpus.txt \
        quay.io/assemblyai-onprem/lm-toolkit:0.1beta \
        ./toolkit /corpus.txt corpus.txt.toolkit.lm

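    Download the custom Language Model from the Docker container

    Once the toolkit has finished building the Language Model, use docker cp to copy it out of the container. You can find the container ID with docker ps -a, which also lists containers that have already exited.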

    docker cp <containerid>:/toolkit/corpus.txt.toolkit.lm .
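
    The run and copy steps above can be tied together by giving the container a name, so you don't have to look up the container ID. This is a sketch under assumptions, not an official script: the function name build_lm, the container name lm-toolkit-build, and the DOCKER variable (which lets you substitute echo for a dry run) are all hypothetical, and it assumes the model is written under /toolkit inside the container, as in the docker cp command above.

    ```shell
    # Image tag taken from the pull command above.
    IMAGE="quay.io/assemblyai-onprem/lm-toolkit:0.1beta"

    # Hypothetical helper: build a custom Language Model from a corpus file
    # and copy the result into the current directory.
    #   $1: absolute path to the corpus file on the host
    #   $2: file name for the custom Language Model
    build_lm() {
        corpus="$1"
        lm_name="$2"
        name="lm-toolkit-build"   # hypothetical container name

        # --name gives the container a fixed name instead of a random one.
        # -it mirrors the command above; Docker rejects -t when stdin is not
        # a terminal, so drop it in non-interactive scripts (assuming the
        # toolkit does not need it).
        "${DOCKER:-docker}" run -it --name "$name" \
            -v "$corpus":/corpus.txt \
            "$IMAGE" \
            ./toolkit /corpus.txt "$lm_name" || return 1

        # Assumes the toolkit writes its output under /toolkit.
        "${DOCKER:-docker}" cp "$name":/toolkit/"$lm_name" . || return 1

        # Remove the stopped container so the name can be reused.
        "${DOCKER:-docker}" rm "$name"
    }

    # Example usage:
    # build_lm /path/to/corpus.txt corpus.txt.toolkit.lm
    # Dry run that only prints the docker arguments:
    # DOCKER=echo build_lm /path/to/corpus.txt corpus.txt.toolkit.lm
    ```

    The corpus path should be absolute, because Docker treats a -v source that is not a path as a named volume rather than a bind mount.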

    Using the custom Language Model

    Mount your custom Language Model file into your Stream Model Server or Regular Model Server Docker container, then tell the Model Server the path to the Language Model to use. For more information on configuring your Model Server to use your custom Language Model, see the Async Engine docs or the Stream Model Server docs.