PII (Personally Identifiable Information) Redaction

    Redact PII from transcripts

    Working with audio data that contains sensitive information? You can request a transcription in which Personally Identifiable Information (PII), such as phone numbers and social security numbers, is redacted. All redacted text is replaced with # characters. For example, the phone number 111-2222 becomes ###-#### in the text.

    PII redaction is only supported when text formatting is enabled (which is the default behavior). If you have explicitly disabled text formatting, PII redaction won't work!
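
As a starting point, here is a minimal sketch of a request that enables PII redaction. It assumes a JSON POST to https://api.assemblyai.com/v2/transcript with an `authorization` header and the fields `audio_url` and `redact_pii`; those names, the helper functions, and the placeholder audio URL are assumptions not stated in this section, so check them against the API reference before relying on them.

```python
import json
import urllib.request

# Assumed endpoint for creating a transcript (not stated in this section).
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_redaction_payload(audio_url, policies):
    """Build the JSON body for a transcription request with PII redaction."""
    return {
        "audio_url": audio_url,                 # assumed field name for the audio source
        "redact_pii": True,                     # assumed flag that switches redaction on
        "redact_pii_policies": list(policies),  # policy names, e.g. "phone_number"
    }


def build_request(api_key, payload):
    """Wrap the payload in a POST request object without sending it."""
    return urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "authorization": api_key,           # assumed header name
            "content-type": "application/json",
        },
        method="POST",
    )


payload = build_redaction_payload(
    "https://example.com/call.mp3",             # placeholder audio URL
    ["phone_number", "us_social_security_number"],
)
request = build_request("YOUR_API_KEY", payload)
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```

Building the request separately from sending it makes the payload easy to inspect or log before any audio leaves your system.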

    Specify which types of data to redact

    To tailor redaction to your application, you can choose from a set of redaction policies when PII redaction is enabled. Include any or all of these policy names in the redact_pii_policies array when making your POST request.

    Policy Name                Description
    medical_process            Medical processes, including treatments, procedures, and tests. E.g., "heart surgery", "CT scan".
    medical_condition          A medical condition, including diseases, syndromes, deficits, and disorders. E.g., chronic fatigue syndrome, arrhythmia, depression.
    blood_type                 Blood types
    drug                       Medical drugs, including vitamins and minerals. E.g., Advil, Acetaminophen, Panadol.
    injury                     Human injuries, e.g., "I broke my arm", "I have a sprained wrist". Includes mutations, miscarriages, and dislocations.
    number_sequence            A "lazy" rule that redacts any sequence of two or more numbers.
    email_address              Email addresses
    date_of_birth              Birthdays
    phone_number               Phone numbers
    us_social_security_number  Full 9-digit US Social Security Numbers
    credit_card_number         Complete credit card numbers
    credit_card_expiration     Expiration date of a credit card
    credit_card_cvv            3- or 4-digit CVV (security) code on a credit card
    date                       Dates like "July 3rd, 2020" or "August 8"
    nationality                Origins and nationalities like "American", "Asian", or "Caucasian"
    event                      Names of events
    language                   Language names like "French" or "Spanish"
    location                   Location names like "Antarctica" or "Mexico"
    money_amount               Dollar amounts like "30 dollars" or "$10.99"
    person_name                Names like "Bob" and "Doug Jones"
    person_age                 Age of a person like "75" or "21"
    organization               Organization names like "CNN" or "University of Alaska"
    political_affiliation      US political parties like "Democrat" or "Republican"
    occupation                 Professions/occupations like "scientist" or "doctor"
    religion                   Names of religions like "Judaism" or "Catholic"
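
Because policy names are plain strings, a typo is easy to make. The sketch below checks a requested list against the names in the table before a request is built. The helper name `validate_policies` is hypothetical, and the set simply mirrors the table above, so it would need updating if the service adds or renames policies.

```python
# Policy names copied from the table above.
KNOWN_POLICIES = frozenset({
    "medical_process", "medical_condition", "blood_type", "drug", "injury",
    "number_sequence", "email_address", "date_of_birth", "phone_number",
    "us_social_security_number", "credit_card_number",
    "credit_card_expiration", "credit_card_cvv", "date", "nationality",
    "event", "language", "location", "money_amount", "person_name",
    "person_age", "organization", "political_affiliation", "occupation",
    "religion",
})


def validate_policies(requested):
    """Return the requested names without duplicates (order kept), or raise on unknown names."""
    unknown = sorted(set(requested) - KNOWN_POLICIES)
    if unknown:
        raise ValueError("unknown redaction policies: " + ", ".join(unknown))
    # dict.fromkeys keeps the first occurrence of each name in order.
    return list(dict.fromkeys(requested))


policies = validate_policies(["person_name", "phone_number", "person_name"])
```

The validated list can then be used as the value of redact_pii_policies in the request body.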

    Customize how redacted PII is replaced

    By default, any detected PII is replaced with hash characters (#). For example, the credit card number 1111-2222-3333-4444 is replaced with ####-####-####-####. By including the redact_pii_sub parameter in your POST request, you can customize how the PII is replaced.

    Here are the options for the redact_pii_sub parameter:

    Value        Description
    hash         Detected PII is replaced with hash characters (#). For example, in "I'm calling for John", John is replaced with ####. (Applied by default)
    entity_name  Detected PII is replaced with the associated policy name. For example, John is replaced with [PERSON_NAME]. This is recommended for readability.

    Here is an example of how to specify the redact_pii_sub parameter in your API request:
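
A minimal sketch of such a request body, followed by a local illustration of what the two modes do to a detected span. The `audio_url` and `redact_pii` field names are assumptions not stated in this section, and `substitute` is a hypothetical helper that only mimics the documented output format; it is not the service's implementation.

```python
import re

# The two values documented for redact_pii_sub.
VALID_SUBSTITUTIONS = ("hash", "entity_name")


def build_payload(audio_url, policies, substitution="hash"):
    """Build a request body that selects how redacted PII is replaced."""
    if substitution not in VALID_SUBSTITUTIONS:
        raise ValueError(f"redact_pii_sub must be one of {VALID_SUBSTITUTIONS}")
    return {
        "audio_url": audio_url,        # assumed field name
        "redact_pii": True,            # assumed field name
        "redact_pii_policies": list(policies),
        "redact_pii_sub": substitution,
    }


def substitute(text, start, end, policy, mode="hash"):
    """Mimic the documented replacement formats for one detected span."""
    span = text[start:end]
    if mode == "hash":
        # Letters and digits become '#'; punctuation such as '-' is kept.
        replacement = re.sub(r"[A-Za-z0-9]", "#", span)
    elif mode == "entity_name":
        replacement = "[" + policy.upper() + "]"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return text[:start] + replacement + text[end:]


payload = build_payload("https://example.com/call.mp3", ["person_name"], "entity_name")

text = "I'm calling for John"
start = text.index("John")
hashed = substitute(text, start, start + 4, "person_name", "hash")
named = substitute(text, start, start + 4, "person_name", "entity_name")
```

Here `hashed` keeps the length of the original word, while `named` tells a reader what kind of information was removed, which is why entity_name tends to read better.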

    Redact PII from audio

    When you request a transcription with PII redacted, you can also request audio redaction. In that case, we will mute the parts of your audio where PII numbers are spoken and make a download URL available for the redacted audio file.

    Important Considerations

    Submit an audio file for transcription and enable audio redaction
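
A request along these lines might be built as follows. This is a sketch only: `redact_pii_audio`, `redact_pii`, and `audio_url` are assumed field names that do not appear in this section, while `webhook_url` and `redact_pii_policies` do. The helper name is hypothetical.

```python
def build_audio_redaction_payload(audio_url, policies, webhook_url=None):
    """Build a request body asking for both transcript and audio redaction."""
    payload = {
        "audio_url": audio_url,        # assumed field name
        "redact_pii": True,            # assumed field name
        "redact_pii_audio": True,      # assumed flag for audio redaction
        "redact_pii_policies": list(policies),
    }
    # The webhook is optional; without it, the redacted audio URL
    # has to be fetched with a GET request instead.
    if webhook_url is not None:
        payload["webhook_url"] = webhook_url
    return payload


payload = build_audio_redaction_payload(
    "https://example.com/call.mp3",            # placeholder audio URL
    ["credit_card_number", "credit_card_cvv"],
    webhook_url="https://example.com/hooks/redacted-audio",  # placeholder
)
```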

    Get the redacted audio URL

    If a webhook_url was provided in your API request, we will send a POST to your webhook_url when the redacted audio is ready. The POST request headers and JSON body will look like this:

    headers
    ---
    content-length: 79
    accept-encoding: gzip, deflate
    accept: */*
    user-agent: python-requests/2.21.0
    content-type: application/json
    
    params
    ---
    status: 'redacted_audio_ready'
    redacted_audio_url: 'https://link-to-redacted-audio'

    The redacted_audio_url link is only valid for 30 minutes!
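
On the receiving side, a handler only needs to parse the JSON body and check the status. A minimal sketch, assuming the body carries the two fields shown above; `extract_redacted_audio_url` is a hypothetical helper, and the web framework that receives the POST is left out.

```python
import json


def extract_redacted_audio_url(body):
    """Return the redacted audio URL from a webhook body, or None if it is absent."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        # Not valid JSON (or not a str/bytes value): ignore the notification.
        return None
    if not isinstance(data, dict):
        return None
    if data.get("status") != "redacted_audio_ready":
        return None
    return data.get("redacted_audio_url")


body = b'{"status": "redacted_audio_ready", "redacted_audio_url": "https://link-to-redacted-audio"}'
url = extract_redacted_audio_url(body)
```

Since the link expires quickly, it makes sense to start the download as soon as the handler returns a URL rather than storing the link for later.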

    Retrieving the redacted audio URL directly from the API

    If you can't receive a webhook, you can also make a GET request to the following endpoint to retrieve a URL for your redacted audio file:

    https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio

    This will return the following status codes and responses:

    200 status code (successful)

    {
        "status": "redacted_audio_ready",
        "redacted_audio_url": "https://link-to-redacted-audio"
    }

    Please note that the redacted_audio_url link is only accessible for 30 minutes. If you need access after that, request the endpoint again to get a new link.

    Although you can request a new link, the redacted audio file itself is purged from our servers after 24 hours. Make sure to download the file and store it on your own server, S3 bucket, or similar within 24 hours.

    202 status code (pending)

    A 202 status code is returned if audio redaction is still in progress. Depending on the length of the file, it can take several minutes after the audio file finishes transcribing for the redacted audio file to be created.

    400 status code

    A 400 status code is returned if something is wrong with your request or if the redacted audio file is unavailable. You can read more about how to interpret and handle 400 errors in the API documentation.
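
The three outcomes above map naturally onto a small polling loop. The sketch below keeps the HTTP call behind a caller-supplied function so the logic can be exercised without network access: `fetch` is a hypothetical callable that takes a URL and returns `(status_code, parsed_json)`, and the function names, attempt count, and delay are illustrative assumptions.

```python
import time


class RedactedAudioError(Exception):
    """Raised when the endpoint reports a failure such as a 400."""


def redacted_audio_endpoint(transcript_id):
    """Build the endpoint URL given in the text for one transcript."""
    return f"https://api.assemblyai.com/v2/transcript/{transcript_id}/redacted-audio"


def poll_redacted_audio(fetch, transcript_id, attempts=10,
                        delay_seconds=30.0, sleep=time.sleep):
    """Poll until the redacted audio URL is ready, then return it."""
    url = redacted_audio_endpoint(transcript_id)
    for attempt in range(attempts):
        status_code, body = fetch(url)
        if status_code == 200:
            # Ready: the link is short-lived, so download promptly.
            return body["redacted_audio_url"]
        if status_code == 202:
            # Still processing: wait before the next attempt, if any remain.
            if attempt < attempts - 1:
                sleep(delay_seconds)
            continue
        # 400 or anything unexpected: stop and surface the response.
        raise RedactedAudioError(
            f"request failed with status {status_code}: {body}"
        )
    raise TimeoutError("redacted audio was not ready after polling")
```

In real use, `fetch` would wrap an authenticated GET request; passing `sleep` as a parameter keeps tests fast, since they can substitute a no-op.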