Getting speaker labels (Speaker Diarization)

    The API can automatically detect the number of speakers in your audio file and label each word in the transcription text with the speaker who said it.

    Submit an audio file for transcription with Speaker Labels

    Heads up: The Speaker Labels feature is not supported when Dual Channel transcription is turned on. You can enable either Speaker Labels or Dual Channel when submitting a file for transcription, but not both.
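
    As a rough illustration of the submission step, the sketch below builds (but does not send) a request that turns Speaker Labels on. The endpoint URL, the "speaker_labels" and "audio_url" parameter names, and the "authorization" header are assumptions that do not appear in the text above, so check them against the API reference. The helper name build_transcript_request is hypothetical. It also refuses to enable Speaker Labels and Dual Channel together, mirroring the restriction described above.

```python
import json
import urllib.request

# Assumed endpoint; verify against the API reference before use.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, api_key, speaker_labels=True, dual_channel=False):
    """Build, without sending, the POST request that submits a file for transcription."""
    # The two features are mutually exclusive, so fail early on the client side.
    if speaker_labels and dual_channel:
        raise ValueError("Enable either speaker_labels or dual_channel, not both")

    # "audio_url", "speaker_labels" and "dual_channel" are assumed parameter names.
    payload = {"audio_url": audio_url, "speaker_labels": speaker_labels}
    if dual_channel:
        payload["dual_channel"] = True

    return urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


# To actually submit the file (performs a network call):
#     request = build_transcript_request(audio_url, "YOUR_API_KEY")
#     with urllib.request.urlopen(request) as response:
#         transcript = json.load(response)
```

    Keeping the request construction separate from the network call makes the payload easy to inspect before anything is sent.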

    Get the transcription result

    Poll until your audio file has finished processing, then GET the transcription result as usual.
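
    The polling step might look something like the following sketch. It assumes the response carries a "status" field that eventually becomes "completed" (as in the sample response on this page) and that a failed job reports "error"; the latter value is an assumption. The fetch_transcript argument is a hypothetical caller-supplied function that performs the GET request for one transcript id and returns the decoded JSON, which keeps the loop independent of any particular HTTP client.

```python
import time


def wait_for_transcript(fetch_transcript, transcript_id, interval_seconds=3.0, max_attempts=100):
    """Poll until the transcript is finished, then return the decoded JSON dict.

    fetch_transcript is a hypothetical caller-supplied callable: it takes a
    transcript id and returns the decoded JSON response of the GET request.
    """
    for _ in range(max_attempts):
        transcript = fetch_transcript(transcript_id)
        status = transcript.get("status")
        if status == "completed":
            return transcript
        if status == "error":  # assumed failure status
            raise RuntimeError(f"Transcription failed: {transcript.get('error')}")
        # Still queued or processing: wait before asking again.
        time.sleep(interval_seconds)
    raise TimeoutError(f"Transcript {transcript_id} not ready after {max_attempts} attempts")
```

    Bounding the number of attempts avoids polling forever if a job never leaves the processing state.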

    You'll get a JSON response like the one below (the lines starting with # are annotations added for this guide, not part of the actual response). The "utterances" key contains a list of turn-by-turn utterances, in the order they occurred in the audio recording.

    A "turn" is one uninterrupted stretch of speech by a single speaker; a new turn begins each time the speaker changes. For example, Speaker A says "hello" (turn 1) and then Speaker B says "hi" (turn 2).

    {
        "acoustic_model": "assemblyai_default",
        "audio_duration": 150.766167800454,
        "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
        "confidence": 0.922175805047867,
        "dual_channel": false,
        "format_text": true,
        "id": "5552830-d8b1-4e60-a2b4-bdfefb3130b3",
        "language_model": "assemblyai_default",
        "punctuate": true,
        "status": "completed",
        "text": "Hi, I'm joy. Hi, I'm sharon. Do you have kids in school. ...",
        # the "utterances" key below is a list of the turn-by-turn utterances found in the audio
        "utterances": [
            {
                # speakers will be marked as "A" through "Z"
                "speaker": "A",
                "confidence": 0.97,
                "end": 1380,
                "start": 0,
                # the text for the entire speaker "turn"
                "text": "Hi, I'm joy.",
                # the individual words from the speaker "turn"
                "words": [
                    {
                        "speaker": "A",
                        "confidence": 1.0,
                        "end": 320,
                        "start": 0,
                        "text": "Hi,"
                    },
                    ...
                ]
            },
            # the next "turn" by speaker "B" - for example
            {
                "speaker": "B",
                "confidence": 0.94,
                "end": 3260,
                "start": 0,
                "text": "Hi, I'm sharon.",
                "words": [
                    {
                        "speaker": "B",
                        "confidence": 1.0,
                        "end": 480,
                        "start": 0,
                        "text": "Hi,"
                    },
                    ...
                ]
            },
            {
                "speaker": "A",
                "confidence": 0.94,
                "end": 5420,
                "start": 2820,
                "text": "Do you have kids in school.",
                "words": [
                    {
                        "speaker": "A",
                        "confidence": 1.0,
                        "end": 4300,
                        "start": 2820,
                        "text": "Do"
                    },
                    ...
                ]
            },
        ],
        # all of the words found in the audio across all speakers
        "words": [
            {
                "speaker": "A",
                "confidence": 1.0,
                "end": 320,
                "start": 0,
                "text": "Hi,"
            },
            {
                "speaker": "A",
                "confidence": 1.0,
                "end": 720,
                "start": 320,
                "text": "I'm"
            },
            ...
        ]
    }
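
    Once the response has been decoded into a dict, the turn structure can be read out in a few lines. The sketch below uses only the keys shown in the sample response ("utterances", "words", "speaker", "start", "end", "text"); the function names are hypothetical. The first function prints one line per turn, and the second shows how turns relate to the flat "words" list by grouping consecutive words that share a speaker.

```python
from itertools import groupby


def format_turns(transcript):
    """Return one 'Speaker X [start-end]: text' line per utterance."""
    return [
        f"Speaker {u['speaker']} [{u['start']}-{u['end']}]: {u['text']}"
        for u in transcript.get("utterances", [])
    ]


def turns_from_words(words):
    """Group consecutive words by speaker into (speaker, text) turns."""
    return [
        (speaker, " ".join(word["text"] for word in group))
        for speaker, group in groupby(words, key=lambda word: word["speaker"])
    ]
```

    For example, "\n".join(format_turns(transcript)) produces a readable turn-by-turn transcript, and turns_from_words(transcript["words"]) rebuilds the turns from the top-level word list.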