Redact Personally Identifiable Information (PII) from transcript text

    Redact PII from transcripts

    Working with audio data that has sensitive information? You can request a transcription in which Personally Identifiable Information (PII), such as phone numbers and social security numbers, is redacted. All redacted text is replaced with # characters. For example, the phone number 111-2222 becomes ###-#### in the text.
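To make the output format concrete, here is a minimal local sketch of the masking convention described above. It only illustrates what redacted text looks like; it is not how the service detects PII, and `mask_digits` is a hypothetical helper name.

```python
import re


def mask_digits(text: str) -> str:
    """Replace every digit with '#', leaving letters and punctuation intact."""
    return re.sub(r"\d", "#", text)


if __name__ == "__main__":
    # The phone number example from the text above.
    print(mask_digits("111-2222"))
```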

    PII redaction is only supported when text formatting is enabled (which is the default behavior). If you have explicitly disabled text formatting, PII redaction won't work!
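A request with redaction enabled might be built as in the sketch below. The POST endpoint, the `audio_url` and `redact_pii` field names, and the `authorization` header are assumptions that do not appear on this page, so check them against the API reference before relying on them. The sketch builds the request without sending it.

```python
import json
import urllib.request

# Assumed endpoint; this page only shows the /redacted-audio sub-path.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_redaction_request(audio_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for a PII-redacted transcript."""
    payload = {
        "audio_url": audio_url,  # assumed field name
        "redact_pii": True,      # assumed field name
    }
    return urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_redaction_request("https://example.com/call.mp3", "YOUR_API_KEY")
    # Sending needs a real key and network access:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
    print(request.get_method(), request.full_url)
```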

    Specify which types of data to redact

    To tailor redaction to your application, you can select from a set of redaction policies when PII redaction is enabled. Include one or more of these policy names in the redact_pii_policies array of your POST request.

    Policy Name                  Description
    all                          All of the redaction policies below are enabled (applied by default).
    number_sequence              A "lazy" rule that redacts any sequence of 2 or more numbers.
    email_address                Redacts email addresses found in the transcription text.
    date_of_birth                Redacts birthdays found in the transcription text.
    phone_number                 Redacts phone numbers found in the transcription text.
    us_social_security_number    Redacts full 9-digit US Social Security Numbers found in the transcription text.
    credit_card_number           Redacts complete credit card numbers found in the transcription text.
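As a sketch of how policy selection could look in client code, the helper below validates policy names against the table above before building the request body. The `redact_pii_policies` field comes from this page; `audio_url`, `redact_pii`, and the helper name `build_policy_payload` are assumptions for illustration.

```python
from typing import Dict, Iterable

# Policy names copied from the table above.
KNOWN_POLICIES = {
    "all",
    "number_sequence",
    "email_address",
    "date_of_birth",
    "phone_number",
    "us_social_security_number",
    "credit_card_number",
}


def build_policy_payload(audio_url: str, policies: Iterable[str]) -> Dict[str, object]:
    """Return a request body that redacts only the selected kinds of PII."""
    selected = list(policies)
    unknown = sorted(set(selected) - KNOWN_POLICIES)
    if unknown:
        raise ValueError(f"Unknown redaction policies: {unknown}")
    return {
        "audio_url": audio_url,  # assumed field name
        "redact_pii": True,      # assumed field name
        "redact_pii_policies": selected,
    }
```

Validating locally catches a misspelled policy name before the request is sent.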

    Redact PII from audio

    When you request a transcription with PII redacted, you can also request audio redaction. In that case, we mute the parts of your audio where PII numbers are spoken and provide a URL for downloading the redacted audio file.

    Important Considerations

    Submit an audio file for transcription and enable audio redaction
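A request body for this step might look like the sketch below. The `redact_pii_audio` flag is an assumed field name that does not appear on this page, as are `audio_url` and `redact_pii`; `webhook_url` is the parameter mentioned in the next section. Confirm the names in the API reference.

```python
from typing import Dict, Optional


def build_audio_redaction_payload(
    audio_url: str, webhook_url: Optional[str] = None
) -> Dict[str, object]:
    """Return a request body that asks for redacted text and redacted audio."""
    payload: Dict[str, object] = {
        "audio_url": audio_url,    # assumed field name
        "redact_pii": True,        # assumed field name
        "redact_pii_audio": True,  # assumed field name
    }
    if webhook_url is not None:
        # Optional: be notified by POST when the redacted audio is ready.
        payload["webhook_url"] = webhook_url
    return payload
```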

    Get the redacted audio URL

    If a webhook_url was provided in your API request, we will send a POST to your webhook_url when the redacted audio is ready. The POST request headers and JSON body will look like this:

    headers
    ---
    content-length: 79
    accept-encoding: gzip, deflate
    accept: */*
    user-agent: python-requests/2.21.0
    content-type: application/json
    
    params
    ---
    status: 'redacted_audio_ready'
    redacted_audio_url: 'https://link-to-redacted-audio'

    The redacted_audio_url link is only valid for 30 minutes!
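On the receiving side, a webhook handler could extract the link from the JSON body shown above. This is a minimal sketch of the parsing step only; wiring it into a web framework is left out, and `extract_redacted_audio_url` is a hypothetical helper name.

```python
import json
from typing import Optional


def extract_redacted_audio_url(body: bytes) -> Optional[str]:
    """Return the redacted audio URL from a webhook body, or None if not applicable."""
    try:
        data = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return None
    if not isinstance(data, dict) or data.get("status") != "redacted_audio_ready":
        return None
    url = data.get("redacted_audio_url")
    return url if isinstance(url, str) else None
```

Because the link expires after 30 minutes, a handler would typically start the download as soon as this function returns a URL.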

    Retrieving the redacted audio URL directly from the API

    If you can't receive a webhook, you can also make a GET request to the following endpoint to retrieve a URL for your redacted audio file:

    https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio

    This will return the following status codes and responses:

    200 status code (successful)

    {
        "status": "redacted_audio_ready",
        "redacted_audio_url": "https://link-to-redacted-audio"
    }

    Please note that the redacted_audio_url link is only accessible for 30 minutes. If you need access after that, call the endpoint again to get a new link.

    While you can request a new link, the redacted audio file will be purged from our servers after 24 hours. Make sure to download the file and store it in your own storage (for example, your own server or an S3 bucket) within 24 hours.
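Since both the link and the file expire, it makes sense to save a copy as soon as the URL is available. The sketch below streams the file to local disk using only the standard library; `download_redacted_audio` and the destination path are hypothetical, and uploading to your own bucket would be a separate step.

```python
import shutil
import urllib.request
from pathlib import Path


def download_redacted_audio(redacted_audio_url: str, destination: Path) -> Path:
    """Stream the redacted audio file to a local path and return that path."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(redacted_audio_url) as response, open(destination, "wb") as out:
        # Copy in chunks so large audio files are not held in memory.
        shutil.copyfileobj(response, out)
    return destination
```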

    202 status code (pending)

    A 202 status code will be returned if audio redaction is still in progress. Depending on the length of the file, it can take several minutes after the audio file finishes transcribing for the redacted audio file to be created.

    400 status code

    A 400 will be returned if something is wrong with your request or if the redacted audio file is unavailable. You can read more about how to interpret and handle 400 errors in the docs here.