Real-time streaming over WebSockets

    Using the WebSocket API, you can transcribe live audio and see transcriptions appear in real time, as the user speaks.

    Audio Requirements

    The API expects raw audio data, sampled at either 8kHz (the default) or 16kHz (see the URL parameters below). If your audio doesn't match these requirements, transcription accuracy will suffer significantly.

    Opening the WebSocket Connection

    You can open a WebSocket connection with the following URL:

    wss://websocket.assemblyai.com/websocket?apiToken=<your-secret-api-token>

    If you are working with 16kHz audio, you must also include the following URL parameters:

    &downsample=True&sampleRate=16000
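
    For example, the full connection URL for 16kHz audio could be assembled like this (a minimal sketch in Python; the token value is a placeholder):

    from urllib.parse import urlencode

    API_TOKEN = "<your-secret-api-token>"  # placeholder, not a real token
    BASE_URL = "wss://websocket.assemblyai.com/websocket"

    # The apiToken parameter is always required.
    params = {"apiToken": API_TOKEN}

    # For 16kHz audio, add the extra parameters described above;
    # omit these two lines when streaming 8kHz audio.
    params.update({"downsample": "True", "sampleRate": "16000"})

    url = BASE_URL + "?" + urlencode(params)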

    The WebSocket API will send you back the following messages, indicating that it is ready for you to begin sending audio. You must wait for the "ready" message before streaming any audio.

    {
        "status": "authenticating",
        "info": "Waiting for authentication.",
        "msgId": "5382f490-4f62-4954-aff8-62287fb3791b"
    }
    
    {
        "status": "ready",
        "info": "Websocket ready.",
        "msgId": "0a740e80-dd91-4e92-829f-e6851807c6b0"
    }
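
    For example, here is a minimal sketch that opens the connection and blocks until the "ready" status arrives, using Python's third-party websockets package (any WebSocket client library will work the same way):

    import asyncio
    import json

    import websockets  # third-party client: pip install websockets

    URL = "wss://websocket.assemblyai.com/websocket?apiToken=<your-secret-api-token>"

    async def main():
        async with websockets.connect(URL) as ws:
            # The API sends "authenticating" first, then "ready".
            # Don't stream any audio until "ready" arrives.
            while True:
                msg = json.loads(await ws.recv())
                if msg["status"] == "ready":
                    break
                if msg["status"] == "denied":
                    raise RuntimeError("Authentication failed: " + msg["info"])
            # ...safe to start streaming audio here (see Streaming Audio below)...

    asyncio.run(main())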

    If authentication fails, you'll get the following message from the WebSocket API, and then the WebSocket connection will be closed:

    {
        "status": "denied",
        "info": "API token missing",
        "msgId": "ce9a9f5e-720c-4ce9-85b2-6439a7b77948"
    }

    Every message from the WebSocket API includes a "msgId" key that you can use to uniquely identify each message.

    Streaming Audio

    Once the WebSocket connection is established and open, you can begin streaming audio data to the API. Each message you send over the WebSocket connection should be a chunk of raw audio with a maximum size of 2048 bytes when streaming 8kHz data, or 4096 bytes when streaming 16kHz data.

    If you send a message containing more audio data than this, you'll receive the following message from the WebSocket API, and then your connection will be closed:

    {
        "status": "rate_limited",
        "info": "The message chunk size must not exceed 2048",
        "msgId": "ce9a9f5e-720c-4ce9-85b2-6439a7b77948"
    }
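
    To stay under these limits, read your raw audio in fixed-size chunks and send one chunk per message. Here is a sketch that continues the connection example above; the filename and the assumption of 16-bit mono samples are for illustration only:

    import asyncio

    CHUNK_SIZE = 2048  # bytes; use 4096 when streaming 16kHz audio

    async def stream_audio(ws, path="speech.raw"):
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                # One raw audio chunk per message, never larger than
                # CHUNK_SIZE, to avoid the rate_limited error above.
                await ws.send(chunk)
                # Pace the upload in roughly real time: 2048 bytes of
                # 8kHz 16-bit mono audio is 1024 samples, ~128 ms.
                await asyncio.sleep(0.128)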

    Receiving Real-Time Transcripts

    As the WebSocket API receives audio data, it'll start to respond with JSON messages containing the transcription, as well as other metadata like confidence scores. Here is an example of the JSON payload you'll receive:

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "hello world",
        "confidence": 0.86,
        "isFinal": false,
        "words": [],
    }
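
    A simple receive loop for these messages might look like the following sketch (reusing the connection from the earlier examples):

    import json

    async def receive_transcripts(ws):
        async for raw in ws:
            msg = json.loads(raw)
            if "text" not in msg:
                continue  # skip status messages that carry no transcript
            # Partial results keep overwriting one another until
            # isFinal flips to true.
            label = "final" if msg["isFinal"] else "partial"
            print("[%s] %s (confidence %.2f)" % (label, msg["text"], msg["confidence"]))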

    The text, confidence, and words keys will continue to be updated as the user speaks. Over time, as the WebSocket API gains more context, the transcription will become more accurate, and words sent in prior messages will be updated to be more accurate.

    For example:

    Message 1 (user starts speaking)

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "yellow",
        "confidence": 0.75,
        "isFinal": false,
        "words": [
            {
                "text": "yellow",
                "confidence": 0.75,
                "start": 1000, # in milliseconds, start time for the word
                "end": 2000, # in milliseconds, end time for the word
                "intermed": true # indicates if the word will be updated or not
            }
    
            ...
        ]
    }

    Message 2 (user finishes first utterance)

    When the "utterance" is over, the "isFinal" key will be set to true and the "text" will be the most accurate transcription found for the utterance. The "text" will always be in its most accurate state when "isFinal" is true.

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "hello world",
        "confidence": 0.90,
        "isFinal": true,
        "words": [
            {
                "text": "hello",
                "confidence": 0.90,
                "start": 1000, # in milliseconds, start time for the word
                "end": 2000, # in milliseconds, end time for the word
                "intermed": false # indicates if the word will be updated or not
            }
    
            ...
        ]
    }

    Message 3 (user starts new utterance)

    Once an utterance is over, the "text", "confidence", and "words" keys will be reset (signaling the start of a new utterance).

    We automatically determine that an utterance is over based on multiple factors, including silence and the words being spoken. There can be, and usually are, multiple utterances within a WebSocket connection that stays open for a long time.

    {
        "msgId": "an19da-mao19a-ma1la0",
        "text": "today i'll be",
        "confidence": 0.91,
        "isFinal": false,
        "words": [...],
    }
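
    Because each "isFinal": true message carries the best transcript for its utterance, one way to assemble a full transcript over a long-lived connection is to collect only the final messages; a sketch:

    import json

    async def collect_utterances(ws):
        utterances = []
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("isFinal"):
                # Keep the most accurate text for each utterance; the
                # next message will begin a fresh utterance.
                utterances.append(msg["text"])
        return utterances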

    Manually Triggering an End-of-Utterance

    Sometimes you'll finish streaming audio before the WebSocket API replies with an "isFinal": true message. To manually tell the API that the utterance is over and that you want the final transcription back, send the single message "done" instead of audio data. When the WebSocket API receives this message, it'll mark the utterance as finished and send you back the final, most accurate transcript. You may have to wait up to 1000 milliseconds after sending "done" before this final message arrives.
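
    For example, here is a sketch of flushing the current utterance and waiting for its final transcript:

    import json

    async def finish_utterance(ws):
        # Tell the API the current utterance is over.
        await ws.send("done")
        # Wait (up to roughly 1000 ms) for the final transcript.
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("isFinal"):
                return msg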