If you're working with short audio files (less than 15 seconds), you can send the audio data directly to the
/v2/stream endpoint, which will return a transcript within a few hundred milliseconds, directly in the request-response loop.
The audio data you send to this endpoint has to comply with a strict format. This is because we don't do any transcoding of your data; we send it directly to the model for transcription. You can send the contents of a
.wav file to this endpoint, or raw data read directly from a microphone. Either way, you must record your audio in the following format to use this endpoint:
- 16-bit signed integer PCM encoding (i.e., the encoding used by .wav files)
- 8 kHz sampling rate
- 128 kbps bitrate
- 16-bit precision
- Single channel
- Headerless (i.e., strip any headers from .wav files)
- 15 seconds or less of audio per request
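As a sketch of how to prepare audio that meets these requirements, the following Python helper (hypothetical, not part of the API) reads a .wav file that was already recorded in the correct format, checks each constraint, and returns the headerless PCM bytes. It validates rather than converts, so a file recorded at a different sample rate or bit depth will be rejected instead of transcoded.

```python
import wave

def load_stream_audio(wav_source):
    """Read a .wav file (path or file-like object) and return headerless
    PCM bytes matching the endpoint's required format.

    Hypothetical helper for illustration; the file must already be
    recorded as 16-bit, mono, 8 kHz PCM.
    """
    with wave.open(wav_source, "rb") as wav:
        assert wav.getsampwidth() == 2, "must be 16-bit PCM"
        assert wav.getnchannels() == 1, "must be single channel"
        assert wav.getframerate() == 8000, "must be 8 kHz"
        n_frames = wav.getnframes()
        # 15 seconds at 8,000 frames per second.
        assert n_frames <= 15 * 8000, "15 seconds or less per request"
        # readframes() returns only the raw sample data -- the RIFF/fmt
        # chunks are skipped, which yields the required headerless bytes.
        return wav.readframes(n_frames)
```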
When making a POST request to this endpoint, you should include the following parameters.
| Parameter | Description | Required |
|---|---|---|
| | Raw audio data, base64 encoded. This can be the raw data recorded directly from a microphone, or read from a .wav file. | Yes |
| | This is set to … | |

The audio data must be base64 encoded before it is included in your POST request. Most programming languages have very simple built-in functions for encoding binary data to base64.
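A minimal sketch of the request, using only the Python standard library. The endpoint URL, the authorization header, and the parameter name `audio_data` are assumptions for illustration; substitute the actual values from the parameter table above and your account's documentation.

```python
import base64
import json
import urllib.request

# Assumed URL for illustration; replace with the real API host.
STREAM_URL = "https://api.example.com/v2/stream"

def build_stream_payload(pcm_bytes, param_name="audio_data"):
    """Base64-encode raw PCM bytes and wrap them in a JSON request body.

    "audio_data" is a hypothetical parameter name used for illustration.
    """
    encoded = base64.b64encode(pcm_bytes).decode("ascii")
    return json.dumps({param_name: encoded})

def transcribe(pcm_bytes, api_key):
    """POST the audio payload and return the parsed JSON response."""
    request = urllib.request.Request(
        STREAM_URL,
        data=build_stream_payload(pcm_bytes).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; check your API documentation.
            "Authorization": api_key,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```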
Depending on how much audio data you send, the API will respond within 100 to 750 milliseconds. The following keys will be in the JSON response.
| Key | Description |
|---|---|
| | The unique ID of your transcription. |
| | The status of your transcription. |
| | The confidence score of the entire transcription, between 0 and 1. |
| | The complete transcription of your audio. |
| | An array of objects with the information for each word in the transcription text, including the start/end time (in milliseconds) of the word and the word's confidence score. |
| | The timestamp of your request. |
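To illustrate working with the response, the sketch below pulls the transcript and per-word timings out of the parsed JSON. The key names (`text`, `words`, `start`, `end`) are hypothetical placeholders; map them to the actual keys from the table above.

```python
def summarize_response(response):
    """Return the transcript and per-word timing as (word, start_ms, end_ms).

    Key names here are assumed for illustration, not confirmed by the API.
    """
    words = [(w["text"], w["start"], w["end"]) for w in response["words"]]
    return response["text"], words
```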