Overview

AssemblyAI helps you turn speech into text. Using the API, you can customize recognition to be more accurate for your use case through a concept we call a Corpus.

Quickstart

In this quickstart, we'll get the API to recognize home automation commands like "turn on the lights" and "set the temperature to 80 degrees."

First, we'll create a Corpus for home automation commands. This makes the API more accurate and teaches it any custom words, such as names, that it should know about. Then, we'll send in an audio file to get a transcript.

Steps

  1. Create a Corpus
  2. Turn audio into text

Step 1. Create a Corpus

We'll create a Corpus to train the API on a few thousand example home automation commands we expect to recognize. First, download home_automation.json. This is the JSON payload we will send to the API.

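Then, copy the curl command below, replacing your-secret-api-token with your API token and /path/to/home_automation.json with the path to the file you downloaded: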

curl --request POST \
     --url 'https://api.assemblyai.com/v1/corpus' \
     --header 'authorization: your-secret-api-token' \
     --data @/path/to/home_automation.json
    

After running the curl command, you'll get an ID for the new Corpus object:

{
  "corpus": {
    "id": 265,
    "status": "training",
    "updated": "2017-12-11T23:15:02.302011Z",
    "build": 0,
    "closed_domain": false,
    "name": "foobar"
  }
}

Keep track of this ID for the next step.
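
If you script this step instead of copying the ID by hand, the response can be parsed as ordinary JSON. The following is a minimal sketch using only Python's standard library. The helper name corpus_id_from_response is hypothetical, and the code assumes the response has exactly the shape of the sample shown above.

```python
import json


def corpus_id_from_response(body: str) -> int:
    """Return the Corpus ID from a create-Corpus response body.

    Assumes the JSON shape of the sample response in this quickstart:
    {"corpus": {"id": ..., "status": ..., ...}}
    """
    corpus = json.loads(body)["corpus"]
    return corpus["id"]


# Sample response copied from this quickstart.
sample = """
{
  "corpus": {
    "id": 265,
    "status": "training",
    "updated": "2017-12-11T23:15:02.302011Z",
    "build": 0,
    "closed_domain": false,
    "name": "foobar"
  }
}
"""

print(corpus_id_from_response(sample))  # prints 265
```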

Step 2. Turn audio into text

Run the curl command below to request a transcript for the audio clip at the audio_src_url in the request body. The API will download and transcribe the audio.

Note: the JSON below references the corpus_id of the Corpus we created in Step 1. This tells the API to use that specific Corpus when generating a transcript. You'll need to change this value to match the ID of the Corpus you just created.

  curl --request POST \
    --url 'https://api.assemblyai.com/v1/transcript' \
    --header 'authorization: your-secret-api-token' \
    --data '
    {
      "audio_src_url": "https://s3-us-west-2.amazonaws.com/blog.assemblyai.com/audio/office_nine_degrees.wav",
      "corpus_id": 265
    }'
  

You'll get a response with the ID of the Transcript and the status of the Transcript request (keep track of this ID):

{
  "transcript": {
    "id": 40,
    "status": "processing",
    "audio_src_url": "https://s3-us-west-2.amazonaws.com/blog.assemblyai.com/audio/office_nine_degrees.wav",
    "corpus_id": 265,
    "text": null,
    "confidence": null,
    "segments": null
  }
}
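
The same request can also be assembled in code. The sketch below builds (but does not send) the POST request from the curl command above, using Python's standard urllib.request module. The function name build_transcript_request is hypothetical; the URL, header, and JSON fields are the ones shown in this step.

```python
import json
import urllib.request

API_URL = "https://api.assemblyai.com/v1/transcript"


def build_transcript_request(
    api_token: str, audio_src_url: str, corpus_id: int
) -> urllib.request.Request:
    """Build, without sending, the POST request shown in Step 2."""
    payload = {"audio_src_url": audio_src_url, "corpus_id": corpus_id}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_token},
        method="POST",
    )


request = build_transcript_request(
    "your-secret-api-token",
    "https://s3-us-west-2.amazonaws.com/blog.assemblyai.com/audio/office_nine_degrees.wav",
    265,  # replace with the ID of your own Corpus
)
```

Passing the request to urllib.request.urlopen sends it; the response body is the JSON shown above.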

Transcript status goes from processing to completed. Processing generally takes less than half the length of the audio.

To get the results, poll for the Transcript ID with GET requests until the status is completed. If you need immediate results, check out the /v1/stream/ endpoint.

Note: replace 40 in the curl command below with the ID of the transcript returned by the API call above.

curl --request GET \
  --url https://api.assemblyai.com/v1/transcript/40 \
  --header 'authorization: your-secret-api-token'

Your final result should look like this:


{
   "transcript":{
      "status": "completed",
      "confidence": 0.92,
      "created": "2017-12-11T23:15:05.235957Z",
      "text": "set the temperature in the office to nine degrees",
      "segments":[
         {
            "start": 0.0,
            "confidence": 0.92,
            "end": 4086.75,
            "transcript": "set the temperature in the office to nine degrees"
         }
      ],
      "audio_src_url": "https://s3-us-west-2.amazonaws.com/blog.assemblyai.com/audio/office_nine_degrees.wav",
      "corpus_id": 265,
      "id": 40
   }
}
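
The polling described above can be sketched as a small loop. This is a minimal illustration, not an official client: the function names, the one-second interval, and the attempt limit are assumptions, and only the two statuses named in this quickstart (processing and completed) are handled.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://api.assemblyai.com/v1/transcript"


def fetch_transcript(transcript_id: int, api_token: str) -> dict:
    """GET /v1/transcript/<id> and return the inner "transcript" object."""
    request = urllib.request.Request(
        f"{BASE_URL}/{transcript_id}",
        headers={"authorization": api_token},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["transcript"]


def wait_for_transcript(
    fetch: Callable[[], dict],
    interval_seconds: float = 1.0,  # assumed polling interval
    max_attempts: int = 60,  # assumed upper bound on polls
) -> dict:
    """Call fetch() until the status is "completed" or attempts run out."""
    for _ in range(max_attempts):
        transcript = fetch()
        if transcript["status"] == "completed":
            return transcript
        time.sleep(interval_seconds)
    raise TimeoutError("transcript was not completed in time")
```

For example, wait_for_transcript(lambda: fetch_transcript(40, "your-secret-api-token"))["text"] would return the transcript text once processing finishes. Taking the fetch step as a parameter keeps the loop easy to exercise without network access.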



You did it!

Up next, we review some core concepts and best practices to make the most of the API. We strongly recommend you read this section. If you want to jump right into the rest of the API Documentation, you can skip ahead to the API Endpoints.

Concepts

Corpus

A Corpus is a large collection of sentences, phrases, and words that you expect to be spoken in your use case. These can be historical transcripts if you have them available. If you don't have historical transcripts, these can be sentences or words you expect to recognize in your use case.

The API learns from the text in your Corpus to customize the speech recognition to be more accurate for your use case. It also adds custom words found in your Corpus, like names, to the vocabulary of the API. This can dramatically improve accuracy, in some cases by up to 10% absolute (i.e., from a 20% error rate to a 10% error rate), compared to generic speech recognition.
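
To make the "absolute" figure concrete: going from a 20% error rate to a 10% error rate is a 10-point absolute reduction, which is also a 50% relative reduction. The short sketch below works through that arithmetic; the function name is hypothetical and the numbers are the ones from the example above.

```python
from typing import Tuple


def error_rate_reduction(baseline: float, customized: float) -> Tuple[float, float]:
    """Return the (absolute, relative) reduction in error rate as fractions."""
    absolute = baseline - customized
    relative = absolute / baseline
    return absolute, relative


absolute, relative = error_rate_reduction(0.20, 0.10)
print(f"absolute: {absolute:.0%}, relative: {relative:.0%}")
# prints "absolute: 10%, relative: 50%"
```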

For example, if you are transcribing phone calls and have historical transcripts for those phone calls, you would create a Corpus with all your historical transcripts. Or, if you are creating a home automation system and don't have historical transcripts, you could come up with a few thousand expected home automation commands and create a Corpus with those examples.

A good Corpus contains at least 10,000 sentences or phrases. Here are some examples of what a Corpus would be for different use cases:

Use Case        Have historical data   Corpus would contain
Phone calls     Yes                    Historical phone call transcripts
Podcasts        Yes                    Historical podcast transcripts
Interviews      Yes                    Historical interview transcripts
Voice commands  No                     Come up with a few thousand example voice commands through scripting, data entry, etc.
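
For the voice commands case, "scripting" can be as simple as expanding a few templates. The sketch below is one possible approach, not a prescribed method: the word lists and templates are made-up examples, and it only produces the sentences themselves. How those sentences are packaged into the JSON payload for the API is not covered here.

```python
from itertools import product

# Hypothetical vocabulary for a home automation use case.
ROOMS = ["office", "kitchen", "bedroom"]
DEVICES = ["lights", "fan"]
TEMPERATURES = ["sixty eight", "seventy two", "eighty"]


def generate_commands() -> list:
    """Expand simple templates into a list of example voice commands."""
    commands = []
    for state, device, room in product(["on", "off"], DEVICES, ROOMS):
        commands.append(f"turn {state} the {device} in the {room}")
    for room, temperature in product(ROOMS, TEMPERATURES):
        commands.append(f"set the temperature in the {room} to {temperature} degrees")
    return commands


commands = generate_commands()
print(len(commands))  # 2*2*3 + 3*3 = 21
print(commands[0])    # turn on the lights in the office
```

Adding more templates and longer word lists grows the number of sentences multiplicatively, which is how a small script can reach a few thousand examples.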

Closed Domain Corpus

If you have a very narrow use case, where your end-users will only ever say a handful of words or commands, you can create a Closed Domain Corpus that will limit the API to recognizing only the words and phrases you need to support. This will greatly improve the accuracy of the results.

Best Practices

Signal Processing

For best results, avoid all signal processing on the audio you send to the API. Signal processing hurts accuracy in most cases: while the processed audio may sound cleaner to you, it can confuse our neural network. Specifically, avoid Background Noise Suppression and Automatic Gain Control (AGC).

Far-Field and Noisy Speech

Our model will do its best to account for noise and for echo introduced by far-field settings, but for best results, position the microphone as close to the user as possible.