
Overview

Welcome to the AssemblyAI documentation. Using the API, you can easily convert speech to text. You can also teach the API to recognize an unlimited number of custom words or phrases relevant to what you're building. For a quick walkthrough of the API, check out the Getting Started section below.

Getting Started

First, we'll teach the API which words or phrases to focus on by creating a Corpus. This makes the API more accurate for the audio we want to transcribe, and teaches the API any custom words, like names, it should know about. Then, we'll send in an audio file to request a Transcript.

Steps

  1. Teach the API your lingo
  2. Turn audio into text
  3. There is no step 3

Prerequisites

You'll need an AssemblyAI API token (shown as your-secret-api-token in the examples below).

Step 1. Teach the API your lingo

Defining a set of common phrases and words custom to your use case makes the API much more accurate. A Corpus is a collection of phrases and words that you'd like the API to focus on or add to the vocabulary.

Replace your-secret-api-token with your API token, then run this curl command to create your first Corpus with some minimal data:

curl --request POST \
    --url 'https://api.assemblyai.com/v1/corpus' \
    --header 'authorization: your-secret-api-token' \
    --data '
    {
      "name": "foobar",
      "phrases": ["chewbacca", "lights on", "lights off"]
    }'

Depending on the quantity of phrases, the API will take 20 to 60 seconds to process, and then you'll get an ID for the new Corpus object:

{
  "corpus": {
    "id": 265,
    "closed_domain": false,
    "name": "foobar"
  }
}

Now you're ready to proceed to the second and final step.

Step 2. Turn audio into text

Run this curl command to request a Transcript for the audio_src_url. The API will download and process the audio found at audio_src_url.

Note: we also reference the corpus_id we created in Step 1. This tells the API to use that specific Corpus when generating a Transcript.

curl --request POST \
    --url 'https://api.assemblyai.com/v1/transcript' \
    --header 'authorization: your-secret-api-token' \
    --data '
    {
      "audio_src_url": "http://www.moviesoundclips.net/download.php?id=3074&ft=mp3",
      "corpus_id": 265
    }'

You'll get a response with the ID of the Transcript and the status of the Transcript request:

{
  "transcript": {
    "id": 40,
    "status": "queued",
    "audio_src_url": "http://www.moviesoundclips.net/download.php?id=3074&ft=mp3",
    "corpus_id": 265,
    "text": null,
    "confidence": null,
    "segments": null
  }
}

Transcript status moves from queued, to processing, to completed. Processing generally takes less than half the duration of the audio.

To get the results, poll for the Transcript ID with GET requests until the status is completed.

curl --request GET \
  --url https://api.assemblyai.com/v1/transcript/40 \
  --header 'authorization: your-secret-api-token'
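The polling step above can be scripted. This is just one sketch, assuming the jq command-line tool is available for parsing the JSON response; poll_status is a hypothetical helper name:

```shell
# Sketch of a polling helper around the GET request above. Assumes the jq
# command-line tool is installed for parsing the JSON response.
poll_status() {
  curl --silent --request GET \
    --url "https://api.assemblyai.com/v1/transcript/$1" \
    --header "authorization: $TOKEN" | jq -r '.transcript.status'
}

# Usage (uncomment with your real token and Transcript ID):
# TOKEN='your-secret-api-token'
# until [ "$(poll_status 40)" = "completed" ]; do sleep 5; done
```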

Your final result should look like this:

{
  "transcript": {
    "status": "completed",
    "confidence": 0.84,
    "created": "2017-11-12T05:00:05.113353Z",
    "text": "is now the united states of vineland",
    "segments": [
      {
        "start": 0.0,
        "confidence": 0.84,
        "end": 3312.0,
        "transcript": "is now the united states of vineland"
      }
    ],
    "audio_src_url": "http://www.moviesoundclips.net/download.php?id=3074&ft=mp3",
    "corpus_id": 265,
    "id": 40
  }
}

You did it!
Up next is a review of core concepts and some best practices for ensuring high quality results. If you want to jump right into the rest of the API Documentation, you can skip ahead to the API Endpoints.

Background

AssemblyAI is an API for customizable speech recognition. Developers and companies use the API for things like transcribing phone calls and building voice powered smart devices.

You can customize the API to recognize an unlimited number of industry-specific words or phrases unique to your product, with no training required. For example, you can recognize thousands of person or product names, or improve recognition of common phrases unique to your product.

We've developed our own, more affordable speech recognition architecture, and are able to pass those savings on to developers by charging a fraction of traditional speech recognition pricing.

For better accuracy, you'll need to make a Corpus, which is a list of example phrases and custom words that you transcribe frequently. You can create as many Corpora as you'd like. This improves the accuracy of the API for your use case. It also teaches the API to recognize any custom words (like person, song, or product names) unique to your use case.

For example, if you are transcribing Tech and Business Podcasts, you could create one Corpus for the Tech Podcasts that has all the technical words/phrases you need to recognize, and another Corpus for the Business Podcasts that has all the business words/phrases you need to recognize. Then, when you transcribe audio, you could select which Corpus to use depending on the type of Podcast.
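The podcast example above could be scripted like this. The corpus IDs and the choose_corpus helper are hypothetical placeholders, not part of the API:

```shell
# Hypothetical sketch: pick which Corpus to use based on the kind of podcast.
# TECH_CORPUS_ID and BIZ_CORPUS_ID stand in for the IDs returned when you
# created each Corpus (e.g. 265 from Step 1).
choose_corpus() {
  case "$1" in
    tech)     echo "$TECH_CORPUS_ID" ;;
    business) echo "$BIZ_CORPUS_ID" ;;
  esac
}

# Usage: pass the chosen ID along in the Transcript request, e.g.
#   curl ... --data "{\"audio_src_url\": \"$URL\", \"corpus_id\": $(choose_corpus tech)}"
```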

The more specific your Corpus is, the more accurate the recognition will be.

Best Practices

Signal Processing

For best results, avoid all signal processing on the audio you send to the API. Doing so will have a negative impact on accuracy in most cases. While the processed audio may sound cleaner to you, our neural network will be confused by it. Specifically, avoid any Background Noise Suppression processing and Automatic Gain Control (AGC).

Far-Field and Noisy Speech

Our model will do its best to account for noise and for echo introduced in far-field settings, but for best results position the microphone as close to the user as possible.

Closed Domain Corpus

If you have a very narrow use case, where your end-users will only ever say a handful of words or commands, you can create a Closed Domain Corpus that will limit the API to recognizing only the words and phrases you need to support. This will greatly improve the accuracy of the results.
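As a sketch, creating such a Corpus might look like the Step 1 call with one extra field. The closed_domain flag appears in the Corpus response in Step 1; we're assuming here that it can also be set at creation time, and the name and phrases are hypothetical:

```shell
# Hypothetical sketch: a closed-domain Corpus limited to a fixed command set.
# The closed_domain field is shown in the Corpus response in Step 1; we assume
# it can also be set when creating the Corpus.
DATA='{
  "name": "smart-light-commands",
  "closed_domain": true,
  "phrases": ["lights on", "lights off"]
}'

# Send it with the same endpoint used in Step 1:
# curl --request POST --url 'https://api.assemblyai.com/v1/corpus' \
#      --header 'authorization: your-secret-api-token' --data "$DATA"
```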