Async Engine

    This guide walks you through setting up the asynchronous speech engine Docker container on your own server. Setup takes about 15 minutes. When you're done, you'll have an endpoint you can send audio files to for transcription.

    Setting up your VPS

    Launch your instance with the specifications below, running Ubuntu Server 18.04 LTS.

    Server Requirements

    GPU    K80 or M60 recommended (consumer-grade cards such as the 1080 Ti will work as well)
    CPU    4 vCPU
    RAM    8 GiB
    HDD    50 GiB

    The GPU is optional: the engine can also be deployed on a CPU-only server.

    Recommended Instance Type

    AWS      g3s.xlarge or p2.xlarge
    GCS      nvidia-tesla-k80
    Azure    NC6

    Install Nvidia Drivers and CUDA (GPU servers only)

    # Add NVIDIA package repositories
    sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
    sudo apt-key adv --fetch-keys
    sudo apt-get update
    sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
    sudo apt-get update
    # Install NVIDIA driver
    sudo apt-get install --no-install-recommends nvidia-driver-410
    sudo apt-mark hold nvidia-driver-410
    # Reboot. Check that GPUs are visible using the command: nvidia-smi
    # Your Driver Version should show as 410.104
    Our Docker images require Nvidia driver version 410.104 exactly. The steps above install this version. The sudo apt-mark hold nvidia-driver-410 command prevents the driver from being upgraded automatically, so make sure you run it.
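
    The version check can also be scripted. The following is a minimal sketch, not part of the engine: the function names are hypothetical, and it assumes nvidia-smi is on the PATH and accepts its standard --query-gpu=driver_version query.

```python
import subprocess
from typing import List

REQUIRED_DRIVER = "410.104"  # driver version stated in this guide


def parse_driver_versions(query_output: str) -> List[str]:
    """Return one driver version string per GPU from nvidia-smi CSV output."""
    return [line.strip() for line in query_output.splitlines() if line.strip()]


def drivers_match(query_output: str, required: str = REQUIRED_DRIVER) -> bool:
    """True only if at least one GPU is listed and every GPU reports `required`."""
    versions = parse_driver_versions(query_output)
    return bool(versions) and all(v == required for v in versions)


def query_driver_versions() -> str:
    """Ask nvidia-smi for the driver version of each GPU (one line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

    On a GPU server, drivers_match(query_driver_versions()) should return True after the reboot. The parsing is kept separate from the subprocess call so it can be checked without a GPU.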

    Install Docker

    Once your instance is online, SSH in and install Docker with the following steps.

    curl -fsSL -o

    To use Docker as a non-root user, add your user to the "docker" group with:

    sudo usermod -aG docker ubuntu
    # you'll need to log out and back in for this to take effect; or you can
    # continue to run the below docker commands as sudo

    Afterwards, confirm that Docker is running with sudo systemctl status docker (the output opens in a pager; press q to exit).

    Install Nvidia-Docker (GPU servers only)

    The Docker containers you'll launch in the below steps need access to the GPU. To give Docker access to the GPU, we need to install Nvidia-Docker.

    # Add the package repositories
    curl -s -L | \
      sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L$distribution/nvidia-docker.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
    # Install nvidia-docker2 and reload the Docker daemon configuration
    sudo apt-get install -y nvidia-docker2
    sudo pkill -SIGHUP dockerd
    # Test nvidia-smi with the latest official CUDA image
    docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

    Install docker-compose

    Run the following commands to install docker-compose, which we'll use to orchestrate the services on the Model Server.

    sudo curl -L "$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
    sudo chmod +x /usr/bin/docker-compose

    Make Nvidia Runtime the Default

    Set Docker to use the Nvidia runtime by default by editing /etc/docker/daemon.json so that it looks like this:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

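    If /etc/docker/daemon.json already contains other settings, it may be safer to merge the Nvidia keys in than to overwrite the file. The following is a minimal sketch under that assumption; the function names are hypothetical, and reading and writing the file (which needs root) is left to the caller.

```python
import json
from typing import Any, Dict


def with_nvidia_default(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a daemon.json dict with nvidia as the default runtime.

    Existing top-level keys and other registered runtimes are preserved.
    """
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["nvidia"] = {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": [],
    }
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    return merged


def render_daemon_json(existing_text: str) -> str:
    """Merge the nvidia settings into existing daemon.json text ('' if no file)."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    return json.dumps(with_nvidia_default(config), indent=4) + "\n"
```

    Calling render_daemon_json("") produces the same settings as the file shown in this section.
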
    Now restart Docker, and test the installation:

    sudo pkill -SIGHUP dockerd
    docker run --rm nvidia/cuda:9.0-base nvidia-smi

    You should see output like this, confirming that the container has access to the GPU:

    | NVIDIA-SMI 410.104              Driver Version: 410.104                   |
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   38C    P8    31W / 149W |      0MiB / 11441MiB |      2%      Default |
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |  No running processes found                                                 |

    Optimizing the GPU

    Optimize the GPU for faster inference. These commands need to be run on every boot, so it's a good idea to add them to a cron job that runs at reboot. The exact commands depend on your GPU.

    sudo nvidia-persistenced
    # for M60 (aws g3s.xlarge) and K80 (aws p2.xlarge) GPUs only
    sudo nvidia-smi --auto-boost-default=0
    # for K80 (aws p2.xlarge) GPUs
    sudo nvidia-smi -ac 2505,875
    # for M60 (aws g3s.xlarge) GPUs
    sudo nvidia-smi -ac 2505,1177
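
    Because the commands depend on the GPU model, a boot script could select them from the GPU name. The following is a minimal sketch: optimization_commands is a hypothetical helper, the clock values are the ones listed above, and the GPU name is assumed to be a string such as "Tesla K80" as printed by nvidia-smi.

```python
from typing import List

# Application clocks (memory MHz, graphics MHz) listed in this guide.
APPLICATION_CLOCKS = {
    "K80": "2505,875",
    "M60": "2505,1177",
}


def optimization_commands(gpu_name: str) -> List[List[str]]:
    """Build the boot-time commands for a GPU name such as 'Tesla K80'.

    Every GPU gets nvidia-persistenced; only K80 and M60 cards also get
    the auto-boost and application-clock settings.
    """
    commands = [["nvidia-persistenced"]]
    for model, clocks in APPLICATION_CLOCKS.items():
        if model in gpu_name.upper():
            commands.append(["nvidia-smi", "--auto-boost-default=0"])
            commands.append(["nvidia-smi", "-ac", clocks])
            break
    return commands
```

    Each returned command still has to be run as root, for example with subprocess.run from a script started by the reboot cron job.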

    Pull the Docker Image

    You'll need a username and password for our Docker image repository to run the following commands. If you don't have these credentials, get in touch with us and we can send them to you. Heads up - the images are a few GB in size!

    Log in to the Image Repository

    docker login -u="<your username>" -p="<your password>"

    Pull the image from the repository

    # current GPU image
    docker pull
    # current CPU image
    docker pull
    See the Release Notes for prior images and the changelog.

    Launch the Docker Containers

    Create a docker-compose.yml file on your server with the following contents:

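    version: '3.5'
    services:
      <api service name>:
        image: ${MODELSERVER_IMAGE}
        command: python
        environment:
          SKIP_DECODER: 1
        expose:
          - 9999
        ports:
          - 9999:9999
        restart: always
        depends_on:
          - redis
        logging:
          driver: "json-file"
          options:
            max-size: "200k"
            max-file: "10"
      <worker service name>:
        image: ${MODELSERVER_IMAGE}
        command: python
        environment:
          LOGLEVEL: 'DEBUG'
        restart: always
        depends_on:
          - redis
        logging:
          driver: "json-file"
          options:
            max-size: "200k"
            max-file: "10"
      redis:
        image: redis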

    Now you're ready to bring up the services:

    # set this env var to the name of the image you pulled in the above step
    export MODELSERVER_IMAGE="<name of the image you pulled>"
    # launch the services
    docker-compose -f docker-compose.yml up -d

    Confirm the containers are up by running docker ps; you should see one running container for each service defined in docker-compose.yml.


    The API is now running on port 9999. Keep track of your instance's IP address for the next step where we test your setup!

    Make sure you set the proper security group/firewall rules so that your server can receive traffic on port 9999!
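
    A quick way to check the firewall rules from another machine is to attempt a TCP connection to port 9999. The following is a minimal sketch using only the standard library; port_is_open is a hypothetical helper, and a True result only shows that the port is reachable, not that the API is healthy.

```python
import socket


def port_is_open(host: str, port: int = 9999, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

    Run port_is_open("<your server ip address>") from the machine that will send transcription requests.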

    Model Server API

    GET /health

    This endpoint will return a 501 if the API is unhealthy, and a 200 when the API is healthy. The API takes ~4-5 minutes to become healthy after launching the containers, while all the models are loaded into CPU/GPU memory.
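
    Since the API needs a few minutes to become healthy, a deployment script can poll /health before sending any audio. The following is a minimal sketch with the standard library; the function names, the 10-minute limit, and the 10-second interval are assumptions, not values defined by the Model Server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_health_status(base_url: str, timeout: float = 5.0) -> int:
    """GET <base_url>/health and return the HTTP status code (0 if unreachable)."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx answer, e.g. the "unhealthy" status
    except (urllib.error.URLError, OSError):
        return 0  # server is not accepting connections yet


def wait_until_healthy(
    base_url: str,
    max_wait: float = 600.0,
    interval: float = 10.0,
    fetch: Callable[[str], int] = fetch_health_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll /health until it returns 200 or `max_wait` seconds have been slept."""
    waited = 0.0
    while True:
        if fetch(base_url) == 200:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

    For example, wait_until_healthy("http://<your server ip address>:9999") returns True once the endpoint answers 200. The fetch and sleep parameters exist only so the loop can be exercised without a running server.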

    POST /transcribe/url

    Download an audio file and transcribe it. Requests to this endpoint are synchronous and stay open until the transcription is complete. A transcription can take anywhere from 10% to 50% of the audio file's duration, depending on your server's hardware specifications and how many files you are transcribing in parallel (see the Processing Speed/Limits section below).
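
    Because the request stays open for the whole transcription, the client timeout has to be sized from the audio duration. The following is a minimal sketch of that arithmetic; the helper names, the 2x safety margin, and the 30-second floor are assumptions.

```python
from typing import Tuple


def processing_time_range(audio_seconds: float) -> Tuple[float, float]:
    """Return (fastest, slowest) expected processing time in seconds.

    Uses the 10% to 50% of audio duration range quoted in this guide.
    """
    return 0.10 * audio_seconds, 0.50 * audio_seconds


def client_timeout(audio_seconds: float, margin: float = 2.0, floor: float = 30.0) -> float:
    """Suggest an HTTP timeout: the slow end of the range times a safety margin."""
    _, slowest = processing_time_range(audio_seconds)
    return max(floor, slowest * margin)
```

    For a one-hour file (3600 seconds) the expected range is roughly 6 to 30 minutes, so this sketch would suggest a 60-minute timeout.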

    JSON Parameters

    Param             Example  Required  Info
    audio_url         ""       Yes       URL for the audio file to be transcribed. This must be a URL that the Model Server can access, for example from an internal CDN or Static File Server.
    dual_channel      false    No        If working with dual channel audio files, set to true to transcribe each channel separately.
    format_text       true     No        Toggle the option to automatically case proper nouns and convert numbers to digits ("seven" -> "7"). Set to false to disable this feature. Enabled by default.
    punctuate         true     No        Toggle the option to automatically add punctuation to the transcription text. Set to false to disable this feature. Enabled by default.
    audio_start_from  8000     No        Seek in your audio file to this time, in milliseconds, before we start transcribing. You're only charged for the duration of audio that is transcribed.
    audio_end_at      20000    No        Stop transcribing your audio file when we reach this time, in milliseconds, in your audio file. You're only charged for the duration of audio that is transcribed.

    Here is a dead simple example with cURL:

    curl --request POST \
        --url 'http://<your server ip address>:9999/transcribe/url' \
        --header 'content-type: application/json' \
        --data '
        {
          "audio_url": ""
        }'

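    The same request can be sent from Python. The following is a minimal sketch with the standard library; build_transcribe_request and transcribe are hypothetical helper names, and the 600-second default timeout is an assumption. Optional keyword arguments are passed through as the JSON parameters from the table above.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_transcribe_request(
    server: str,
    audio_url: str,
    options: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build the POST /transcribe/url request without sending it."""
    payload = {"audio_url": audio_url}
    payload.update(options or {})
    return urllib.request.Request(
        url="http://{}:9999/transcribe/url".format(server),
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def transcribe(server: str, audio_url: str, timeout: float = 600.0, **options: Any) -> Dict[str, Any]:
    """Send the request and return the decoded JSON body (blocks until done)."""
    request = build_transcribe_request(server, audio_url, options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

    For example, transcribe("<your server ip address>", "<audio url>", punctuate=False, audio_start_from=8000) skips the first 8 seconds and disables punctuation. Note that urlopen raises urllib.error.HTTPError for non-2xx status codes.
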
    JSON Response

    The API will respond with the following JSON once the transcription is complete. When setting dual_channel to true, the "utterances" key will contain a list of turn-by-turn utterances, as they appeared in the audio recording. Each object in the "utterances" list contains the channel information (this will be either "1" or "2"). Each word in the "words" array will also contain the channel key, so you can easily tell which channel each utterance/word is from.

    {
        "blob": "You know Demons on TV like that and and for people to expose themselves to being rejected on TV or humiliated by fear factor or.",
        "duration": 12.096009070294784,
        "stereo_decode": false,
        "uid": "t_d2834b75-ccba-496f-80c7-9d7546731d15",
        "utterances": null,
        "words": [
            {
                "confidence": 1.0,
                "end": 440,
                "start": 0,
                "text": "You"
            },
            {
                "confidence": 0.99,
                "end": 580,
                "start": 420,
                "text": "know"
            }
        ]
    }

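    The response can be post-processed with a few lines of Python. The following is a minimal sketch; the function names are hypothetical, and it assumes that word "start" and "end" values are in milliseconds (consistent with the sample, where the whole file lasts about 12 seconds) and that the "channel" key is simply absent on single-channel responses.

```python
from collections import defaultdict
from typing import Any, Dict, List


def words_by_channel(response: Dict[str, Any]) -> Dict[Any, List[str]]:
    """Group word texts by their "channel" key (None when the key is absent)."""
    grouped = defaultdict(list)
    for word in response.get("words") or []:
        grouped[word.get("channel")].append(word["text"])
    return dict(grouped)


def low_confidence_words(response: Dict[str, Any], threshold: float = 0.9) -> List[Dict[str, Any]]:
    """Return words whose confidence is below `threshold`, with times in seconds."""
    return [
        {"text": w["text"], "start_s": w["start"] / 1000.0, "end_s": w["end"] / 1000.0}
        for w in response.get("words") or []
        if w["confidence"] < threshold
    ]
```

    With dual_channel set to true, words_by_channel would return one list of words per channel, keyed by "1" and "2".
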
    If an error occurs while transcribing your audio file, you'll get a response back like this, describing what went wrong:

    {
        "code": 1,
        "description": "Download error to, HTTPSConnectionPool(host='', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f1218275890>: Failed to establish a new connection: [Errno -2] Name or service not known',))",
        "uid": "t_f0929cd9-4f97-4c7d-a23d-8129766ae026"
    }

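    Client code usually wants to turn such a body into an exception. The following is a minimal sketch; TranscriptionError and raise_for_error are hypothetical names, and the rule used to recognize an error body (it has "code" and "description" keys and no "words" key) is an assumption based only on the two samples in this section.

```python
from typing import Any, Dict


class TranscriptionError(Exception):
    """Raised when the Model Server reports that a transcription failed."""

    def __init__(self, code: Any, description: str, uid: Any) -> None:
        super().__init__(
            "transcription {} failed (code {}): {}".format(uid, code, description)
        )
        self.code = code
        self.description = description
        self.uid = uid


def raise_for_error(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return `body` unchanged, or raise if it looks like an error response."""
    if "code" in body and "description" in body and "words" not in body:
        raise TranscriptionError(body["code"], body["description"], body.get("uid"))
    return body
```

    Keeping the uid on the exception makes it easier to match a failure against the server logs.
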
    Processing Speed/Limits

    The Model Server can handle multiple transcription jobs in parallel when an instance with a GPU is used. We recommend the following concurrency limits (i.e., how many files to transcribe at a single time) to keep the processing time under 1x the duration of the audio file.

    GPU Instance  Auto-Punctuation Enabled  Recommended Concurrency Limit
    Yes           Yes                       8-10
    Yes           No                        12-14
    No            Yes                       1
    No            No                        1

    If you are OK with slower processing speeds, you can go over these limits.
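
    One way to stay within a concurrency limit on the client side is a bounded thread pool. The following is a minimal sketch: transcribe_all is a hypothetical helper, transcribe_one stands for whatever function sends a single request to the API, and the default of 8 is the lower end of the range recommended above for a GPU instance with auto-punctuation enabled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def transcribe_all(
    audio_urls: Iterable[str],
    transcribe_one: Callable[[str], T],
    max_concurrency: int = 8,
) -> List[T]:
    """Run `transcribe_one` over every URL with at most `max_concurrency` in flight.

    Results are returned in the same order as `audio_urls`. An exception
    raised by any call is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(transcribe_one, audio_urls))
```

    Threads are sufficient here because each worker spends its time waiting on a synchronous HTTP request rather than using the CPU.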

    Configuring Environment Variables

    You can control a number of settings by setting additional environment variables in your Docker containers.

    Env Var                   Default  Info
    MAX_TRANSCRIPTS_PARALLEL  14       Maximum number of audio files the Model Server will transcribe in parallel. See the Processing Speed/Limits section above for more info. If the Model Server receives more requests than this value, the extra requests will block.
    REJECT_OVER_PARALLEL      True     Whether the Model Server rejects additional requests with HTTP status code 423 while it is already transcribing MAX_TRANSCRIPTS_PARALLEL or more audio files.
    ENABLE_AUTO_PUNCTUATION   True     Set to False to disable all Automatic Punctuation/Casing, regardless of the JSON parameters sent to the API. This prevents the extra models from being loaded, which is useful if you want more concurrency.
    DECODER_BEAM_SIZE         5000     To trade accuracy for speed, you can lower this value to anywhere between 2000 and 5000.
    CUSTOM_LANG_MODEL_PATH    -        If using a Custom Language Model, this should point to the path of your Custom Language Model, mounted into the worker Docker container via a volume. For more instructions see the LM Toolkit docs.
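
    When REJECT_OVER_PARALLEL is enabled, clients should expect HTTP 423 responses under load and retry later. The following is a minimal sketch of a retry loop with exponential backoff; send_with_backoff is a hypothetical helper, send stands for any callable that performs one request and returns (status code, JSON body), and the attempt count and delays are assumptions.

```python
import time
from typing import Any, Callable, Dict, Tuple

Reply = Tuple[int, Dict[str, Any]]


def send_with_backoff(
    send: Callable[[], Reply],
    max_attempts: int = 5,
    first_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Reply:
    """Call `send` until it returns a status other than 423 or attempts run out.

    The delay doubles after every 423 response. The last reply is returned
    as-is, so the caller can still see a final 423.
    """
    delay = first_delay
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 423 or attempt == max_attempts:
            return status, body
        sleep(delay)
        delay *= 2
    raise ValueError("max_attempts must be at least 1")
```

    With the defaults, a request that keeps receiving 423 is attempted five times, waiting 2, 4, 8, and 16 seconds between attempts.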