Customizing the API for Accurate Speech Recognition

With the AssemblyAI API, you can define a set of common phrases and words specific to your use case. This makes the API significantly more accurate, especially when dealing with things like echo, accents, and background noise.

Step 1: Overview

A Corpus is a collection of phrases and words that you'd like the API to focus on or add to the vocabulary.

For example, let's say you are building a movie trivia app. Your app asks users questions about actors, and they respond by speaking the correct actor's name.

Your app asks: "Who was the male star of the film Titanic?"
Your user responds: "Leonardo DiCaprio"

Step 2: Create a Corpus

To get high accuracy on actor names, you can create a Corpus containing all of the actor names.

We'll name this Corpus actor-names and add 25 actor names to the phrases array. This makes the API more accurate at recognizing those names, and adds any that aren't already in the API's vocabulary.

# request
curl --request POST \
    --url 'https://api.assemblyai.com/v1/corpus' \
    --header 'authorization: your-secret-api-token' \
    --data '
    {
      "name": "actor-names",
      "phrases": ["Robert Downey Junior", "Leonardo DiCaprio", "Tom Cruise", "Johnny Depp", "George Clooney", "Steve Carell", "Meryl Streep", "Mark Wahlberg", "Brad Pitt", "Jennifer Lawrence", "Ryan Gosling", "Will Smith", "Bradley Cooper", "Channing Tatum", "Emma Stone", "Ben Affleck", "Daniel Craig", "Hugh Jackman", "Christian Bale", "Dwayne Johnson", "Kristen Stewart", "Jamie Foxx", "Anne Hathaway", "Matt Damon", "Sandra Bullock"]
    }'

The API responds with the id of the Corpus we just created:

# response
{
  "corpus": {
    "id": 265,
    "closed_domain": false,
    "name": "actor-names"
  }
}

We'll keep track of this id, since we'll need it when requesting transcripts.
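
If you're scripting these calls, one way to capture the id is with jq. Here's a minimal sketch, assuming jq is installed and your token lives in an ASSEMBLYAI_TOKEN environment variable (our own naming, not part of the API):

# a sketch: create the Corpus and capture its id with jq
CORPUS_ID=$(curl --silent --request POST \
    --url 'https://api.assemblyai.com/v1/corpus' \
    --header "authorization: $ASSEMBLYAI_TOKEN" \
    --data '{"name": "actor-names", "phrases": ["Leonardo DiCaprio", "Tom Cruise"]}' \
  | jq '.corpus.id')

echo "$CORPUS_ID"    # e.g. 265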

In this example we only added 25 names, but a Corpus can contain up to 9,000 phrases.
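
With a list that long, you won't want to hand-write the JSON. Here's a hedged sketch that builds the request body from a newline-delimited file of names (actor-names.txt and corpus-body.json are hypothetical file names; jq is assumed to be installed):

# a sketch: build the request body from one name per line in actor-names.txt
jq --raw-input --null-input \
    '{name: "actor-names", phrases: [inputs | select(length > 0)]}' \
    actor-names.txt > corpus-body.json

# then POST the generated body
curl --request POST \
    --url 'https://api.assemblyai.com/v1/corpus' \
    --header 'authorization: your-secret-api-token' \
    --data @corpus-body.json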

Step 3: Transcribe Audio

Now that we have our Corpus, we can use it when transcribing audio to get more accurate results.

We'll make an API call to transcribe the following audio clip:

# request
curl --request POST \
    --url https://api.assemblyai.com/v1/transcript \
    --header 'authorization: your-secret-api-token' \
    --data '
    {
      "audio_src_url": "http://url.com/to/audio.wav",
      "corpus_id": 265
    }'

The API responds by telling us the transcript we just created is queued:

# response
{
  "transcript": {
    "status": "queued",
    "confidence": null,
    "created": "2017-11-12T05:00:05.113353Z",
    "text": null,
    "segments": null,
    "audio_src_url": "http://url.com/to/audio.wav",
    "corpus_id": 265,
    "id": 40
  }
}

Notice that the corpus_id in the response matches the id of the Corpus we created in Step 2. This confirms the API will use our Corpus to fine-tune the recognition.

Now all we have to do is query for the transcript and check whether it's completed (generating a transcript usually takes 25-50% of the audio's duration):

# request
curl --request GET \
    --url https://api.assemblyai.com/v1/transcript/40 \
    --header 'authorization: your-secret-api-token'
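
If you're polling from a script, a simple loop does the job. A minimal sketch, again assuming jq is installed (the 5-second interval is arbitrary):

# a sketch: poll every 5 seconds until the transcript is completed
# (in a real script you'd also want to bail out if the job fails)
while true; do
  STATUS=$(curl --silent --request GET \
      --url https://api.assemblyai.com/v1/transcript/40 \
      --header 'authorization: your-secret-api-token' \
    | jq --raw-output '.transcript.status')
  echo "status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  sleep 5
done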

Once the transcript is completed, we get back the following response from the API, with the accurate transcript, confidence score, and timestamps:

# response
{
  "transcript": {
    "status": "completed",
    "confidence": 0.91,
    "created": "2017-11-12T05:00:05.113353Z",
    "text": "leonardo dicaprio",
    "segments": [
      {
        "start": 0.0,
        "confidence": 0.91,
        "end": 2970.0,
        "transcript": "leonardo dicaprio"
      }
    ],
    "audio_src_url": "http://url.com/to/audio.wav",
    "corpus_id": 265,
    "id": 40
  }
}
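
From here, the text, confidence score, and per-segment timestamps are easy to pull out. A sketch, assuming you've saved the response to a hypothetical transcript.json file and have jq installed:

# pull the interesting fields out of the completed response
jq --raw-output '.transcript.text' transcript.json    # leonardo dicaprio
jq '.transcript.confidence' transcript.json           # 0.91

# print each segment's start and end times alongside its text
jq --raw-output \
    '.transcript.segments[] | "\(.start)-\(.end): \(.transcript)"' \
    transcript.json                                   # 0-2970: leonardo dicaprio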

Because we used our Corpus, the transcript is accurate and the confidence score is high. If there were noise or echo in the audio, or if the speaker had an accent, our custom Corpus would help mitigate those issues and still give us high accuracy.

Tips

  • If the audio recording were longer, the segments key would contain a timestamped entry for every few seconds of speech. Since the recording we sent in was so short, we only received 1 segment.

  • Also, if we didn't want to poll for the transcript, we could have used the /stream endpoint, which returns a transcript immediately in the response.