
Walkthroughs

Authentication

Authentication is handled via the authorization header. Include this header in all of your API requests, with your API token as its value; every endpoint requires it.
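
A minimal sketch in Python, assuming the requests library (YOUR_API_TOKEN is a placeholder for the token from your AssemblyAI dashboard):

import requests

# Attach your API token as the value of the authorization header;
# this same headers dict is reused in the examples below.
headers = {"authorization": "YOUR_API_TOKEN"}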

Submitting Files for Transcription

The AssemblyAI API can transcribe audio/video files that are accessible via a URL. For example, audio files in an S3 bucket, on your server, via the Twilio API, etc.

Local files?

Need to upload files directly to the API? Jump to Uploading Local Files for Transcription below.

In the code sample below, we show how to submit the URL of your audio/video file to the API for transcription. After submitting your POST request, you will get a response that includes an id key and a status key.
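
For example, a minimal sketch in Python using the requests library (the audio URL and token are placeholders):

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"
headers = {"authorization": "YOUR_API_TOKEN"}
json_payload = {"audio_url": "https://example.com/audio.mp3"}  # any URL our servers can reach

response = requests.post(endpoint, json=json_payload, headers=headers)
print(response.json())  # includes the "id" and "status" keys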

The status key shows the status of your transcription. It will start with "queued", and then go to "processing", and finally to "completed".

To check on the status of your transcription, see Getting the Transcription Result below.

Getting the Transcription Result

After you submit an audio file for processing, the "status" key will go from "queued" to "processing" to "completed". You can make a GET request, as shown below, to check for updates on the status of your transcription.

You'll have to make repeated GET requests until the status is "completed" or "error". Once the status is "completed", the JSON response will include the text, words, and other keys populated with the results of your transcription, including the results of any Audio Intelligence features you enabled.
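
A minimal polling sketch in Python, assuming the requests library; TRANSCRIPT_ID is a placeholder for the id returned when you submitted the file, and the 5-second interval is only a suggestion:

import requests
import time

endpoint = "https://api.assemblyai.com/v2/transcript/TRANSCRIPT_ID"
headers = {"authorization": "YOUR_API_TOKEN"}

while True:
    result = requests.get(endpoint, headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(5)  # wait before polling again

print(result.get("text"))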

Specifying a Language

The language_code key can be used to specify the language of the speech in your audio file. For example, English or Spanish. For a full list of supported languages, see the Supported Languages page.

In the code example below, you can see how to submit an audio file to the API for transcription with the language_code key included.
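
A sketch in Python, building on the submission example above ("es" marks the audio as Spanish):

json_payload = {
    "audio_url": "https://example.com/audio.mp3",
    "language_code": "es",  # see the Supported Languages page for valid codes
}
response = requests.post("https://api.assemblyai.com/v2/transcript",
                         json=json_payload, headers=headers)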

If you are unsure of the dominant language spoken in your audio file, you can use our Automatic Language Detection feature to automatically identify the dominant language in your file.

Pro tip

The language_code parameter is optional. If you do not include a language_code parameter in your request, the default value will be en_us.

Uploading Local Files for Transcription

If your audio files aren't accessible via a URL already (like in an S3 bucket, static file server, or via an API like Twilio), you can upload your files directly to the AssemblyAI API.

Once your upload finishes, you'll get back a JSON response that includes an upload_url key. The upload_url points to a private URL, accessible only to AssemblyAI's backend servers, that you can submit for processing via the /v2/transcript endpoint.

Submit your Upload for Transcription

Once your audio file is uploaded, you can submit it for transcription just like any normal audio file. The URL in the upload_url key is what you'll use as the audio_url when Submitting Files for Transcription.

Pro tip

If you're not using our code examples, keep in mind the API expects the upload to be streamed using Chunked Transfer Encoding. Most HTTP libraries have a nice interface for handling this; for example, in Python, the requests library has a simple way to do Chunked Transfer Encoding uploads, as shown below.
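
A minimal upload sketch in Python with the requests library, posting to the /v2/upload endpoint; passing a generator as the data argument makes requests stream the body with Chunked Transfer Encoding (the file path is a placeholder):

import requests

def read_file(filename, chunk_size=5242880):
    # Yield the file in 5 MB chunks; requests streams a generator
    # body using Chunked Transfer Encoding automatically.
    with open(filename, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data

headers = {"authorization": "YOUR_API_TOKEN"}
response = requests.post("https://api.assemblyai.com/v2/upload",
                         headers=headers,
                         data=read_file("/path/to/your/file.mp3"))
print(response.json())  # includes the "upload_url" key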

Heads up

For privacy and security reasons, all uploads are immediately deleted after transcription completes.

Using Webhooks

Instead of polling for the result of your transcription, you can receive a webhook once your transcript is complete, or if there was an error transcribing your audio file.

Specify Your Webhook URL

When submitting an audio file for transcription, you can include the additional parameter webhook_url in your POST request. This must be a URL that can be reached by our backend.
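
For example, a sketch in Python (the webhook URL is a placeholder, and the headers dict comes from the Authentication example above):

json_payload = {
    "audio_url": "https://example.com/audio.mp3",
    "webhook_url": "https://foo.com/webhook",  # must be reachable by our backend
}
response = requests.post("https://api.assemblyai.com/v2/transcript",
                         json=json_payload, headers=headers)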

Receiving the Webhook

You'll receive a webhook when your transcription goes to status "completed", or when your transcription goes to status "error". In either case, a POST request will be made to the webhook URL you supplied. The headers and body will look like this:

headers
---
content-length: 79
accept-encoding: gzip, deflate
accept: */*
user-agent: python-requests/2.21.0
content-type: application/json

request body
---
status: completed
transcript_id: 5552493-16d8-42d8-8feb-c2a16b56f6e8

Once you receive the webhook, you can make a GET request to /v2/transcript/{transcript_id} to fetch the final result of your transcription.

Including Custom Parameters in Your Webhook Request

Oftentimes, you'll want to associate certain metadata with your transcription request, such as a customer ID, and have it passed back to your webhook. The easiest way to do this is to include these parameters in your webhook URL as query parameters, for example:

https://foo.com/webhook?myParam1=foo&myParam2=bar

Then, when you receive the webhook, you can parse these parameters out of the webhook URL.
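
A minimal receiver sketch, assuming a Flask server (Flask is an illustrative choice, not a requirement; the parameter names match the example URL above):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    my_param1 = request.args.get("myParam1")  # "foo"
    my_param2 = request.args.get("myParam2")  # "bar"
    body = request.get_json()  # {"status": ..., "transcript_id": ...}
    # Fetch the full result with GET /v2/transcript/{transcript_id}
    return "", 200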

Failed webhooks and retries

If we get a non-2xx response when we POST to your webhook URL, we'll retry the request 10 times, with a 10-second interval between each retry. After all 10 retries fail, we'll consider the webhook permanently failed.

If we are unable to reach your webhook URL at all (usually due to a timeout or your server being offline), no retries will be attempted.

Real-Time Streaming Transcription

If you're working with live audio streams, you can stream your audio data in real-time to our secure WebSocket API found at wss://api.assemblyai.com/v2/realtime/ws. We will stream transcripts back to you within a few hundred milliseconds, and additionally, revise these transcripts with more accuracy over time as more context arrives.

Open Source Example Code

Here are some open-source examples of our real-time endpoint to help you get started.

Establishing a Websocket Connection

Websocat is an easy-to-use CLI for testing WebSocket APIs, and it's the tool we'll use in the examples that follow.

To connect with the real-time endpoint, you must use a WebSocket client and establish a connection with wss://api.assemblyai.com/v2/realtime/ws.

Authentication

Authentication is handled via the authorization header. The value of this header should be your API token. For example, in websocat:

$ websocat wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000 -H Authorization:<API_TOKEN>
{
    "message_type": "SessionBegins",
    "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd",
    "expires_at": "2021-04-07T11:32:25.300329"
}

If you would like to create a temporary token for in-browser authentication, see Creating Temporary Authentication Tokens below.
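
The same connection in Python, assuming the websocket-client package (an illustrative choice; any WebSocket client works):

import websocket  # pip install websocket-client

ws = websocket.create_connection(
    "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000",
    header={"Authorization": "YOUR_API_TOKEN"},
)
print(ws.recv())  # the "SessionBegins" message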

Required Query Params

This endpoint also requires a query param sample_rate that defines the sample rate of your audio data. For example, in websocat:

$ websocat wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000 -H Authorization:<API_TOKEN>
{
    "message_type": "SessionBegins",
    "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd",
    "expires_at": "2021-04-07T11:32:25.300329"
}

Session Descriptor Message

Once your request is authorized and the connection is established, your client will receive a "SessionBegins" message with the following JSON data:

Parameter | Example | Info
message_type | SessionBegins | Describes the message type.
session_id | d3e8c537-2f11-494b-b497-e59a434588bd | Unique identifier for the established session. Can be used to reestablish the session.
expires_at | 2021-04-07T11:32:25.300329 | Timestamp when this session will expire.

Sending Audio

Input Message

When sending audio over the WebSocket connection, you should send a JSON payload with the following parameters.

Parameter | Example | Info
audio_data | UklGRtjIAABXQVZFZ… | Raw audio data, base64 encoded. This can be the raw data recorded directly from a microphone or read from an audio file.

base64 encoding

base64 encoding is a simple way to encode your raw audio data so that it can be included as a JSON parameter in your websocket message. Most programming languages have very simple built-in functions for encoding binary data to base64.

For example, a message payload would look like this:

{
  "audio_data": "UklGRtjIAABXQVZFZ..."
}
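
A minimal send sketch in Python, assuming the websocket-client connection ws from the earlier example; raw_audio is a placeholder bytes object holding 100 to 2000 ms of audio that meets the requirements below:

import base64
import json

# Base64-encode the raw PCM bytes so they fit in a JSON string field
payload = {"audio_data": base64.b64encode(raw_audio).decode("utf-8")}
ws.send(json.dumps(payload))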

Audio Requirements

The raw audio data in the audio_data field above must comply with a strict encoding format. This is because we don't transcode your data; we send it directly to the model for transcription to reduce latency. Your audio must meet the following requirements:

  • 16-bit Signed Integer PCM encoding
  • A sample rate that matches the value of the sample_rate query param you supply
  • 16-bit Precision
  • Single-channel
  • 100 to 2000 milliseconds of audio per message

Transcription Response Types

Our real-time transcription pipeline uses a two-phase transcription strategy, broken into partial and final results.

Partial Results

As you send audio data to the API, the API will immediately start responding with Partial Results. The following keys will be in the JSON response from the WebSocket API.

Parameter | Example | Info
message_type | PartialTranscript | Describes the type of message.
session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription.
audio_start | 1200 | Start time of audio sample relative to session start, in milliseconds.
audio_end | 1850 | End time of audio sample relative to session start, in milliseconds.
confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1.
text | "You know Demons on TV like..." | The complete transcription for your audio.
words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects with the information for each word in the transcription text, including the start/end time (in milliseconds) and confidence score of each word.
created | "2019-06-27 22:26:47.048512" | The timestamp for your request.

Final Results

After you've received your partial results, our model will continue to analyze incoming audio and, when it detects the end of an "utterance" (usually a pause in speech), it will finalize the results sent to you so far with higher accuracy, as well as add punctuation and casing to the transcription text.

The following keys will be in the JSON response from the WebSocket API when Final Results are sent:

Parameter | Example | Info
message_type | FinalTranscript | Describes the type of message.
session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription.
audio_start | 1200 | Start time of audio sample relative to session start, in milliseconds.
audio_end | 1850 | End time of audio sample relative to session start, in milliseconds.
confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1.
text | "You know Demons on TV like..." | The complete transcription for your audio.
words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects with the information for each word in the transcription text, including the start/end time (in milliseconds) and confidence score of each word.
created | "2019-06-27 22:26:47.048512" | The timestamp for your request.

Reconnecting to An Existing Session

Sometimes unforeseen outages can cause your client to lose its connection to our real-time servers. To help maintain continuity of your transcript stream, the session descriptor message you receive at the beginning of a connection contains a session identifier; using this identifier, you can reconnect to your session and resume processing from where you left off. Simply reconnect to wss://api.assemblyai.com/v2/realtime/ws/{session_id}. The standard authorization scheme applies here as well.

Example: Reconnecting to an existing session

$ websocat wss://api.assemblyai.com/v2/realtime/ws/d3e8c537-2f11-494b-b497-e59a434588bd -H authorization:<API_TOKEN>
{
    "message_type": "SessionResumed",
    "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"
}

This is an optional flow. In the event that you lose connection, you can alternatively start a new session by reconnecting to `wss://api.assemblyai.com/v2/realtime/ws` without the session_id. This is a viable option, but you may lose some transcription accuracy for a short period of time.

Ending a Session

When you've completed your session, your client should send a JSON message with the following field.

Parameter | Example | Info
terminate_session | true | A boolean value to communicate that you wish to end your real-time session forever.
$ websocat wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000 -H authorization:<API_TOKEN>
{
    "message_type": "SessionBegins",
    "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"
}
...send audio...
...receive results...
{"terminate_session": true}
{"message_type": "FinalTranscript", ...}
{"message_type": "SessionTerminated", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}

If you have outstanding final transcripts, they will be sent to you. To finalize the session, a SessionTerminated message is sent to confirm our API has terminated your session. A terminated session cannot be reused.
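
In Python, with the websocket-client connection ws and the json import from the earlier sketches, the terminate message is one line:

ws.send(json.dumps({"terminate_session": True}))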

Closing and Status Codes

The WebSocket specification provides standard close codes. In addition, our API provides application-level WebSocket errors for well-known scenarios. Here's a breakdown of them:

Error Condition | Status Code | Message
auth failed | 4001 | "Not Authorized"
insufficient funds | 4002 | "Insufficient Funds"
free tier user | 4002 | "This feature is paid-only and requires you to add a credit card. Please visit https://app.assemblyai.com/ to add a credit card to your account"
attempt to connect to nonexistent session id | 4004 | "Session not found"
attempt to connect to closed session | 4010 | "Session previously closed"
session expires | 4008 | "Session Expired"
attempt to connect to expired session id | 4008 | "Session Expired"
rate limited | 4029 | "Client sent audio too fast"
session times out | 4031 | "Session idle for too long"
audio too short | 4032 | "Audio duration is too short"
audio too long | 4033 | "Audio duration is too long"
bad json | 4100 | "Endpoint received invalid JSON"
bad schema | 4101 | "Endpoint received a message with an invalid schema"
reconnect attempts exhausted | 1013 | "Temporary server condition forced blocking client's request"

Quotas and Limits

The following limits are imposed to ensure performance and service quality. Please contact us if you'd like to increase these limits.

  • Idle Sessions - Sessions that do not receive audio within 1 minute will be terminated
  • Session Limit - Only 1 session at a time for free-tier users, 32 sessions at a time for paid users
  • Session Uniqueness - Only one WebSocket per session
  • Audio Sampling Rate Limit - Customers must send data in near real-time. If a client sends data faster than 1 second of audio per second for longer than 1 minute, we will terminate the session.

Adding Custom Vocabulary

Developers can also add up to 2500 characters of custom vocabulary to their real-time session by adding the optional query parameter word_boost to the URL. The parameter should map to a JSON-encoded list of strings, as shown in this Python example:

import json
from urllib.parse import urlencode

sample_rate = 16000
word_boost = ["foo", "bar"]
# The list is JSON-encoded so it can travel as a single query parameter
params = {"sample_rate": sample_rate, "word_boost": json.dumps(word_boost)}

url = f"wss://api.assemblyai.com/v2/realtime/ws?{urlencode(params)}"

Creating Temporary Authentication Tokens

In some cases, a developer will need to authenticate on the client-side and won't want to expose their AssemblyAI token. You can do this by sending a POST request to https://api.assemblyai.com/v2/realtime/token with the parameter expires_in: {TTL in seconds}. Below is a quick example in curl.

The `expires_in` parameter must be greater than or equal to 60 seconds.

curl --request POST \
  --url https://api.assemblyai.com/v2/realtime/token \
  --header 'authorization: YOUR_AAI_TOKEN' \
  --header 'content-type: application/json' \
  --data '{"expires_in": 60}'

In response, you will receive the following JSON output:

{
  "token": "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd"
}

A developer can now use this temporary token in the browser to authenticate a new WebSocket session with the following endpoint: wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={New Temp Token}. An example in browser JavaScript follows.

let socket;
const token =
  "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd";

socket = new WebSocket(
  `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token=${token}`
);