Real-time Transcription

    If you're working with live audio, you can stream your audio data in real time to our secure WebSocket API at wss://api.assemblyai.com/v2/realtime. We stream transcripts back to you with only a few hundred milliseconds of delay.

    Websocat is an easy-to-use CLI for testing WebSocket APIs, and we'll use it in the examples below. You can find more info on Websocat here.

    Establishing a Connection

    To connect with the real-time endpoint, you must use a websocket client and establish a connection with wss://api.assemblyai.com/v2/realtime.

    Authentication is handled via the "authorization" header. The value of this header should be your API token. For example, in websocat:

    $ websocat -H authorization:<API_TOKEN> wss://api.assemblyai.com/v2/realtime
    {"message_type": "SessionBegins", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}

    Session Descriptor Message

    Once your request is authorized and the connection is established, your client will receive a session descriptor message with the following fields:

    | Parameter | Example | Info |
    | --- | --- | --- |
    | message_type | SessionBegins | Describes the message type |
    | session_id | d3e8c537-2f11-494b-b497-e59a434588bd | Unique identifier for the established session. Can be used to re-establish the session |
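
    As a sketch, handling this message in Python might look like the following (this assumes a client that hands you the raw JSON string; the message here is constructed locally for illustration):

```python
import json

# Hypothetical raw message received right after the connection is authorized.
raw = '{"message_type": "SessionBegins", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}'

msg = json.loads(raw)
if msg["message_type"] == "SessionBegins":
    # Keep the session_id around so the session can be resumed if the connection drops.
    session_id = msg["session_id"]
```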

    Sending Audio

    Input Message

    When sending an audio fragment over the WebSocket, include the following parameter.

    | Parameter | Example | Info |
    | --- | --- | --- |
    | audio_data | UklGRtjIAABXQVZFZ… | Raw audio data, base64 encoded. This can be raw data recorded directly from a microphone, or read from a .wav file. |

    Audio Encoding Requirements

    The audio_data field above must comply with a strict encoding, because we don't transcode your data; we send it directly to the model for transcription. You can send the contents of a .wav file to this endpoint, or raw data read directly from a microphone. Either way, you must send your audio over the WebSocket connection strictly in the following format.

    base64 encoding: base64 is a simple way to encode your raw audio data so that it can be included as a JSON parameter in your WebSocket message. Most programming languages have built-in functions for encoding binary data to base64.
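
    For example, in Python, wrapping a chunk of raw audio in the expected JSON message takes only the standard library (the chunk bytes below are placeholder data, not real audio):

```python
import base64
import json

def make_audio_message(raw_audio: bytes) -> str:
    """Wrap a chunk of raw audio bytes in the JSON message the endpoint expects."""
    encoded = base64.b64encode(raw_audio).decode("ascii")
    return json.dumps({"audio_data": encoded})

# Placeholder chunk standing in for PCM audio read from a microphone or .wav file.
chunk = b"\x00\x01\x02\x03"
message = make_audio_message(chunk)
```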

    Example:

    $ websocat -H authorization:<API_TOKEN> wss://api.assemblyai.com/v2/realtime
    {"message_type": "SessionBegins", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}
    {"audio_data": "UklGRtjIAABXQVZFZ..."}
    {"message_type": "PartialTranscript", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd" ...}

    Transcription Response Types

    Our real-time transcription pipeline uses a two-phase transcription strategy, broken into partial and final results.

    Partial Results

    As you send audio data to the API, it will respond with transcriptions within 100-200 milliseconds. The JSON response contains the following keys.

    | Parameter | Example | Info |
    | --- | --- | --- |
    | message_type | PartialTranscript | Describes the type of message |
    | session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription |
    | audio_start | 1200 | Start time of the audio sample relative to session start, in milliseconds |
    | audio_end | 1850 | End time of the audio sample relative to session start, in milliseconds |
    | confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1 |
    | text | "You know Demons on TV like..." | The complete transcription for your audio |
    | words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects with information for each word in the transcription text: the start/end time of the word (in milliseconds) and its confidence score |
    | created | "2019-06-27 22:26:47.048512" | The timestamp for your request |
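
    A minimal sketch of reading these fields in Python (the message below is constructed locally to mirror the table; values are illustrative):

```python
import json

# A truncated PartialTranscript message shaped like the table above.
raw = json.dumps({
    "message_type": "PartialTranscript",
    "audio_start": 1200,
    "audio_end": 1850,
    "confidence": 0.956,
    "text": "You know Demons on TV like...",
    "words": [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}],
})

msg = json.loads(raw)
# audio_start/audio_end are both relative to session start, so their
# difference is the duration of this audio sample in milliseconds.
duration_ms = msg["audio_end"] - msg["audio_start"]
first_word = msg["words"][0]["text"]
```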

    Final Results

    After you've received your partial results, our model will continue to analyze incoming audio and, when ready, will finalize a result with higher accuracy.

    The following fields are provided:

    | Parameter | Example | Info |
    | --- | --- | --- |
    | message_type | FinalTranscript | Describes the type of message |
    | session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription |
    | audio_start | 1200 | Start time of the audio sample relative to session start, in milliseconds |
    | audio_end | 1850 | End time of the audio sample relative to session start, in milliseconds |
    | confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1 |
    | text | "You know Demons on TV like..." | The complete transcription for your audio |
    | words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects with information for each word in the transcription text: the start/end time of the word (in milliseconds) and its confidence score |
    | created | "2019-06-27 22:26:47.048512" | The timestamp for your request |
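
    Because a FinalTranscript supersedes the partial results for the same stretch of audio, a common client pattern is to display partials provisionally and commit only finals. A hypothetical sketch:

```python
def handle_message(msg: dict, transcript: list) -> None:
    """Commit only finalized text; partials may be revised by a later FinalTranscript."""
    if msg["message_type"] == "FinalTranscript":
        transcript.append(msg["text"])
    elif msg["message_type"] == "PartialTranscript":
        pass  # show provisionally in the UI, but don't commit it

transcript = []
handle_message({"message_type": "PartialTranscript", "text": "hello wor"}, transcript)
handle_message({"message_type": "FinalTranscript", "text": "Hello world."}, transcript)
# transcript now holds only the finalized text.
```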

    Reconnecting to an existing session

    Unforeseen outages can sometimes cause your client to lose its connection to our realtime servers. To help you maintain continuity of your transcript stream, the session descriptor message sent at the beginning of a connection contains a session identifier. Using this identifier, you can reconnect to your session and resume processing where you left off: simply reconnect to wss://api.assemblyai.com/v2/realtime/{session_id}. The standard authorization scheme applies here as well.

    Example: Reconnecting to an existing session

    $ websocat -H authorization:<API_TOKEN> wss://api.assemblyai.com/v2/realtime/d3e8c537-2f11-494b-b497-e59a434588bd
    {"message_type": "SessionResumed", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}

    Note: This is an optional flow. In the event that you lose connection, you can alternatively start a new session by reconnecting to wss://api.assemblyai.com/v2/realtime without the session_id. This is a viable option, but you may lose some transcription accuracy for a short period of time.

    Ending a Session

    When you've completed your session, your client should send a JSON message with the following field.

    | Parameter | Example | Info |
    | --- | --- | --- |
    | terminate_session | true | A boolean value indicating that you wish to permanently end your realtime session |


    $ websocat -H authorization:<API_TOKEN> wss://api.assemblyai.com/v2/realtime/d3e8c537-2f11-494b-b497-e59a434588bd
    {"message_type": "SessionBegins", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}
    ...send audio...
    ...receive results...
    {"terminate_session": true}
    {"message_type": "FinalTranscript", ...}
    {"message_type": "SessionTerminated", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}

    Any outstanding final transcripts will be sent to you first. A SessionTerminated message is then sent to confirm that our API has terminated your session. A terminated session cannot be reused.
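
    A client-side shutdown loop might therefore drain the socket until the SessionTerminated confirmation arrives before closing. In this sketch, a plain list of JSON strings stands in for messages read from the socket:

```python
import json

def drain_until_terminated(incoming):
    """Collect any outstanding final transcripts until the server confirms termination.
    `incoming` stands in for the socket: any iterable of raw JSON message strings."""
    finals = []
    for raw in incoming:
        msg = json.loads(raw)
        if msg["message_type"] == "FinalTranscript":
            finals.append(msg["text"])
        elif msg["message_type"] == "SessionTerminated":
            break  # the session is gone; it's now safe to close the connection
    return finals

remaining = [
    '{"message_type": "FinalTranscript", "text": "Goodbye."}',
    '{"message_type": "SessionTerminated", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}',
]
finals = drain_until_terminated(remaining)
```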

    The WebSocket specification provides standard errors. Here's a brief breakdown of them.

    Our API also provides application-level WebSocket errors for well-known scenarios. Here's a breakdown of those.

    Quotas and Limits

    The following limits are imposed to ensure performance and service quality. Please contact us if you'd like to increase these limits.