Real-Time Engine

    This guide will walk you through setting up the real-time speech engine on your own server. Setting up the speech engine takes about 15 minutes. When you're done with this setup, you'll have an endpoint you can stream audio to for real-time transcription via WebSocket.

    Architecture Overview

    Setting up your VPS

    Launch your instance with the below specifications, running Ubuntu Server 18.04 LTS.

    Server Requirements
    GPU | K80 or M60 recommended (consumer-grade cards such as the 1080 Ti will work as well)
    RAM | 8 GiB
    HDD | 50 GiB

    If launching on AWS, GCP, or Azure, we recommend one of the following instance types:

    Recommended Instance Type
    AWS | g3s.xlarge or p2.xlarge
    GCP | nvidia-tesla-k80
    Azure | NC6

    Install Nvidia Drivers and CUDA

    # Add NVIDIA package repositories
    sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
    sudo apt-key adv --fetch-keys
    sudo apt-get update
    sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
    sudo apt-get update
    # Install NVIDIA driver
    sudo apt-get install --no-install-recommends nvidia-driver-410
    sudo apt-mark hold nvidia-driver-410
    # Reboot. Check that GPUs are visible using the command: nvidia-smi
    # Your Driver Version should show as 410.104

    Our Docker images require Nvidia driver version 410.104 exactly. The steps above install that version; the sudo apt-mark hold nvidia-driver-410 command keeps the driver from being upgraded automatically.

    Install Docker

    Once your instance is online, SSH in and install Docker with the following steps.

    curl -fsSL -o

    To use Docker as a non-root user, add your user to the "docker" group with:

    sudo usermod -aG docker ubuntu
    # you'll need to log out and back in for this to take effect; or you can
    # continue to run the below docker commands as sudo

    Afterwards, confirm that Docker is running with sudo systemctl status docker (press q to exit the pager it opens).

    Install Nvidia-Docker

    The Docker containers you'll launch in the below steps need access to the GPU. To give Docker access to the GPU, we need to install Nvidia-Docker.

    # Add the package repositories
    curl -s -L | \
      sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L$distribution/nvidia-docker.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
    # Install nvidia-docker2 and reload the Docker daemon configuration
    sudo apt-get install -y nvidia-docker2
    sudo pkill -SIGHUP dockerd
    # Test nvidia-smi with the latest official CUDA image
    docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

    Install docker-compose

    Run the following commands to install docker-compose, which we'll use to orchestrate the services on the Model Server.

    sudo curl -L "$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
    sudo chmod +x /usr/bin/docker-compose

    Make Nvidia Runtime the Default

    Set Docker to use the Nvidia runtime by default by editing /etc/docker/daemon.json so that it looks like this:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    Now restart Docker and test the installation:

    sudo pkill -SIGHUP dockerd
    docker run --rm nvidia/cuda:9.0-base nvidia-smi

    You should see output like this, confirming that the container has access to the GPU:

    | NVIDIA-SMI 410.104              Driver Version: 410.104                   |
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   38C    P8    31W / 149W |      0MiB / 11441MiB |      2%      Default |
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |  No running processes found                                                 |
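
    Because the Docker images require driver version 410.104, it can be handy to check the version programmatically rather than by eye. The following is a minimal Python sketch, not part of the official tooling. The names driver_version and REQUIRED_DRIVER are illustrative, and the regular expression assumes the header format shown in the sample output above.

```python
import re

# Driver version this guide says the Docker images require.
REQUIRED_DRIVER = "410.104"


def driver_version(nvidia_smi_output):
    """Return the driver version found in nvidia-smi text output, or None."""
    match = re.search(r"Driver Version:\s*([\d.]+)", nvidia_smi_output)
    return match.group(1) if match else None


sample = "| NVIDIA-SMI 410.104              Driver Version: 410.104                   |"
print(driver_version(sample) == REQUIRED_DRIVER)
```

    On a live machine, the text could come from subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout instead of the hard-coded sample line.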

    Optimizing the GPU

    Optimize the GPU for faster inference. These settings do not persist across reboots, so the commands need to be run on every boot; a cron job scheduled with @reboot is a good place for them. The exact commands depend on your GPU.

    sudo nvidia-persistenced
    # for M60 (aws g3s.xlarge) and K80 (aws p2.xlarge) GPUs only
    sudo nvidia-smi --auto-boost-default=0
    # for K80 (aws p2.xlarge) GPUs
    sudo nvidia-smi -ac 2505,875
    # for M60 (aws g3s.xlarge) GPUs
    sudo nvidia-smi -ac 2505,1177

    Pull the Docker Image for the Stream Model Server

    You'll need a username and password for our Docker image repository to run the following commands. If you don't have these credentials, get in touch with us and we can send them to you. Heads up: the images are a few GB in size!

    Log in to the Image Repository

    docker login -u="<your username>" -p="<your password>"

    Pull the Image from the Repository

    # current GPU image
    docker pull

    Launch the Stream Model Server

    Create a docker-compose.yml file on your machine with the following contents:

    version: '3.5'
        command: ./server
          - 9123
          - 9123:9123
        restart: always
          - redis
        command: ./run
        restart: always
          - redis
        image: redis

    Now you're ready to bring up the services:

    # set this env var to the name of the image you pulled in the above step
    # launch the services
    docker-compose -f docker-compose.yml up -d

    Confirm the containers are up by running docker ps. Each service defined in docker-compose.yml should be listed as running.


    Wrapping Up

    The containers are now running, with the server accessible on port 9123. Keep track of your instance's IP address: the WebSocket Server needs the IP address of your Model Server so it knows where to send audio data for inference.

    Make sure you set the proper security group/firewall rules so that your server can receive traffic on port 9123!

    Stream Model Server Environment Variables

    You can control a number of settings by setting additional environment variables in the Stream Model Server Docker containers.

    Env Var | Default | Info
    CUSTOM_LANG_MODEL_PATH | - | If using a Custom Language Model, this should point to the path of your Custom Language Model, mounted into the worker Docker container via a volume. For more instructions see the LM Toolkit docs.
    DECODER_BEAM_SIZE | 600 | For more accurate results, increase this value to between 600 and 1000, at the cost of concurrency/scalability. For more scalability at the cost of accuracy, lower it to between 500 and 600.

    Launching the WebSocket Server

    Pull the Image from the Repository

    # current WebSocket server image
    docker pull

    Launching the Container

    # MODELSERVER_HOSTNAME tells the WebSocket container where to send audio data for inference
    # -p 80:4444 binds the container to port 80 on the host machine
    # the last argument is the name of the image we just pulled in the above step
    sudo docker run -d \
        -e MODELSERVER_HOSTNAME=<your Model Server IP address>:9123 \
        -e MODELSERVER_PROTOCOL=http \
        -p 80:4444 \
        <name of the image you pulled>

    Running sudo docker ps should now show the WebSocket server container running.

    Using the WebSocket API

    Audio Requirements

    If your audio doesn't match these requirements, accuracy from the API will be very bad!

    Communicating with the WebSocket API

    You can open a WebSocket connection to your WebSocket API at the following URL:

    ws://<your instance ip>

    API Parameters

    Param | Default | Info | Required
    downsample | - | If streaming audio sampled above 8 kHz, set this to True. | No
    sampleRate | - | If streaming audio sampled above 8 kHz, set this to the sampling rate of your audio data, e.g. 16000. | No
    text2int | False | Toggles formatting of numbers in their symbol form (e.g. "seventy two" would appear as "72" in the transcription text). Set to True to enable. | No

    For example, setting all 3 URL parameters on your WebSocket API endpoint would look like this:

    ws://<your instance ip>?downsample=True&sampleRate=16000&text2int=True
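
    To avoid hand-assembling the query string, the endpoint can be built from whichever parameters are set. This is a small Python sketch using only the standard library. The helper name build_ws_url and the example address 203.0.113.10 are illustrative and not part of the API.

```python
from urllib.parse import urlencode


def build_ws_url(host, downsample=None, sample_rate=None, text2int=None):
    """Return a ws:// URL for the WebSocket API, adding only the parameters that are set."""
    params = {}
    if downsample is not None:
        # The guide writes boolean values as True/False with a capital letter.
        params["downsample"] = str(bool(downsample))
    if sample_rate is not None:
        params["sampleRate"] = str(int(sample_rate))
    if text2int is not None:
        params["text2int"] = str(bool(text2int))
    query = urlencode(params)
    return f"ws://{host}?{query}" if query else f"ws://{host}"


print(build_ws_url("203.0.113.10", downsample=True, sample_rate=16000, text2int=True))
# ws://203.0.113.10?downsample=True&sampleRate=16000&text2int=True
```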

    The WebSocket API will send you back the following message, indicating it is ready for you to begin sending audio. You must wait for the ready message before starting to stream audio.

    {
        "status": "ready",
        "info": "Websocket ready.",
        "msgId": "0a740e80-dd91-4e92-829f-e6851807c6b0"
    }

    Every message from the WebSocket API will include the "msgId" key that you can use to guarantee messages are unique.
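
    A client needs to hold back audio until the ready message arrives. As a rough Python sketch, the check could look like the following. The helper name is_ready is hypothetical, and the code assumes messages arrive as JSON text with the keys shown above.

```python
import json


def is_ready(raw_message):
    """Return True if a message from the WebSocket API is the ready message."""
    try:
        message = json.loads(raw_message)
    except (TypeError, ValueError):
        # Not JSON at all, so it cannot be the ready message.
        return False
    return isinstance(message, dict) and message.get("status") == "ready"


print(is_ready('{"status": "ready", "info": "Websocket ready.", "msgId": "0a74"}'))
```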

    Streaming Audio

    Once the WebSocket connection is open and you have received the ready message, you can begin streaming audio data to the API. Each message you send over the WebSocket connection should be a chunk of raw audio with a maximum size of 2048 bytes when streaming 8 kHz data, or 4096 bytes when streaming 16 kHz data.
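
    Splitting a buffer of raw audio into messages that respect these size limits can be sketched in Python as follows. The names MAX_CHUNK_BYTES and chunk_audio are illustrative. Only the two sample rates mentioned in this guide are covered, and any other rate is rejected rather than guessed at.

```python
# Maximum message sizes stated in this guide, keyed by sample rate in Hz.
MAX_CHUNK_BYTES = {8000: 2048, 16000: 4096}


def chunk_audio(raw_audio, sample_rate=8000):
    """Yield consecutive slices of raw audio no larger than the documented limit."""
    try:
        limit = MAX_CHUNK_BYTES[sample_rate]
    except KeyError:
        raise ValueError(f"no documented chunk limit for {sample_rate} Hz") from None
    for offset in range(0, len(raw_audio), limit):
        yield raw_audio[offset:offset + limit]


audio = bytes(5000)  # 5000 zero bytes standing in for real audio
print([len(chunk) for chunk in chunk_audio(audio, 8000)])
# [2048, 2048, 904]
```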

    If you send a message larger than these limits, you'll get the following message from the WebSocket API, and then your connection will be closed:

    {
        "status": "rate_limited",
        "info": "The message chunk size must not exceed 2048",
        "msgId": "ce9a9f5e-720c-4ce9-85b2-6439a7b77948"
    }

    Receiving Real-Time Transcripts

    As the WebSocket API receives audio data, it'll start to respond with JSON messages containing the transcription, as well as other metadata like confidence scores. Here is an example of the JSON payload you'll receive:

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "hello world",
        "confidence": 0.86,
        "isFinal": false,
        "words": []
    }

    The text, confidence, and words keys will continue to be updated as the user speaks. Over time, as the WebSocket API gains more context, the transcription will become more accurate, and words sent in prior messages will be updated to be more accurate. For example:

    Message 1 (user starts speaking)

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "yellow",
        "confidence": 0.75,
        "isFinal": false,
        "words": [
            {
                "text": "yellow",
                "confidence": 0.75,
                "start": 1000, # in milliseconds, start time for the word
                "end": 2000, # in milliseconds, end time for the word
                "intermed": true # indicates if the word will be updated or not
            }
        ]
    }

    Message 2 (user finishes first utterance)

    When the "utterance" is over, the "isFinal" key will be set to true and the "text" will be the most accurate transcription found for the utterance. The "text" will always be in its most accurate state when "isFinal" is true.

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "hello world",
        "confidence": 0.90,
        "isFinal": true,
        "words": [
            {
                "text": "hello",
                "confidence": 0.90,
                "start": 1000, # in milliseconds, start time for the word
                "end": 2000, # in milliseconds, end time for the word
                "intermed": false # indicates if the word will be updated or not
            }
        ]
    }

    Message 3 (user starts new utterance)

    Once an utterance is over, the "text", "confidence", and "words" keys will be reset (signaling the start of a new utterance).

    We automatically determine that an utterance is over based on multiple factors, including silence and the words that are spoken. There can be, and usually are, multiple utterances within a WebSocket connection that stays open for a long time.

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "today i'll be",
        "confidence": 0.91,
        "isFinal": false,
        "words": [...]
    }

    Manually Triggering an End-of-Utterance

    Sometimes you'll finish streaming audio before the WebSocket API replies with an "isFinal" true message. To manually tell the API the utterance is over and that you want the final transcription back for the utterance, instead of sending audio data to the WebSocket API you can send a single message "done".
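
    One way to structure a sender so that "done" always follows the last audio chunk is sketched below in Python. The generator name outgoing_messages is hypothetical, and sending "done" as a plain text message is an assumption based on the description above rather than a confirmed detail of the protocol.

```python
def outgoing_messages(chunks, end_marker="done"):
    """Yield each audio chunk in order, then the end-of-utterance message."""
    for chunk in chunks:
        yield chunk
    # Assumption: the marker is sent as a text message, not as audio bytes.
    yield end_marker


print(list(outgoing_messages([b"\x00\x01", b"\x02\x03"])))
# [b'\x00\x01', b'\x02\x03', 'done']
```

    Each yielded item would be handed, in order, to the send call of whichever WebSocket client library you use, with "done" going out last.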

    When the WebSocket API receives this message, it'll mark the utterance as finished and send you back the final, most accurate transcript. You may have to wait up to 1000 milliseconds to get the final message back from the WebSocket API after sending your "done" message.
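
    Whether an utterance ends on its own or is ended manually, a client has to tell interim results from final ones. The following Python sketch keeps the latest partial text and a list of finished utterances, keyed off the "isFinal" flag described in this section. The class name TranscriptTracker is hypothetical, and treating any message without a "text" key as a status message is an assumption.

```python
import json


class TranscriptTracker:
    """Track the current partial utterance and the list of finished ones."""

    def __init__(self):
        self.partial = ""
        self.finals = []

    def handle(self, raw_message):
        message = json.loads(raw_message)
        if "text" not in message:
            return  # assumed to be a status message such as "ready"
        if message.get("isFinal"):
            # The utterance is over: store it and reset the partial text.
            self.finals.append(message["text"])
            self.partial = ""
        else:
            # Interim result: replace, since earlier words may have changed.
            self.partial = message["text"]


tracker = TranscriptTracker()
for payload in (
    {"text": "yellow", "isFinal": False},
    {"text": "hello world", "isFinal": True},
    {"text": "today i'll be", "isFinal": False},
):
    tracker.handle(json.dumps(payload))
print(tracker.finals, tracker.partial)
# ['hello world'] today i'll be
```

    Replacing the partial text on every interim message, rather than appending to it, matches the behavior described earlier in which words from prior messages are revised as more context arrives.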

    WebSocket API Environment Variables

    You can control a number of settings by setting additional environment variables in the WebSocket server Docker container.

    Env Var | Default | Info | Required
    MODELSERVER_HOSTNAME | - | The hostname or IP address of the Model Server. Don't include any protocol in this variable! | Yes
    MODELSERVER_PROTOCOL | "http" | The protocol to use for connecting to the Model Server; can be "http" or "https". | Yes
    INTERMED_PARTIALS | - | By default, the WebSocket Server will respond with an accurate transcript every 4-5 seconds as you stream audio. If you want feedback faster, set this env var to True and the WebSocket Server will respond with interim transcripts for every 600 ms of speech. These interim transcripts will be less accurate, and are updated over time as the user speaks and the model gets more context. | No
    WEBSOCKET_SERVER_LOGLEVEL | "WARNING" | Controls the level of logging for the container. | No
    SSL_CERT_PATH | - | Path to an SSL certificate to make the server start over HTTPS, e.g. "/var/foo/keys/ca.csr". The server reads the file from within the Docker container, so the SSL cert needs to be mounted into the Docker container! | No
    SSL_CERT_KEY | - | Path to an SSL certificate key to make the server start over HTTPS, e.g. "/var/foo/keys/ca.key". The server reads the file from within the Docker container, so the SSL key needs to be mounted into the Docker container! | No

    Concurrency and Scaling Up

    Each VPS/instance (running the Stream Model Server and WebSocket API containers) can handle 6 concurrent real-time streams before latency starts to become too high.

    If you changed the DECODER_BEAM_SIZE env var in the Stream Model Server Settings this number could be either higher or lower.
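
    For rough capacity planning, the per-instance figure translates directly into an instance count. The Python sketch below assumes the default of 6 streams per instance quoted above. The names STREAMS_PER_INSTANCE and instances_needed are illustrative, and the per-instance value should be adjusted if you have measured a different limit for your beam size.

```python
import math

# Concurrent streams per instance quoted in this guide for the default settings.
STREAMS_PER_INSTANCE = 6


def instances_needed(concurrent_streams, per_instance=STREAMS_PER_INSTANCE):
    """Return how many instances are needed to serve the given number of streams."""
    if concurrent_streams < 0 or per_instance <= 0:
        raise ValueError("stream counts must be non-negative and capacity positive")
    return math.ceil(concurrent_streams / per_instance)


print(instances_needed(20))
# 4
```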

    When you're ready to scale up, we recommend putting a load balancer in front of the WebSocket API and the Stream Model Server, and then scaling up or down based on connection count.