We're excited to announce significant accuracy improvements to our speech-to-text API. The API is now significantly more accurate than before on audio of all types (phone calls, podcasts, videos).

As a bonus, we're also announcing major speed improvements to the API. Transcribing a 60-minute audio file now takes as little as 5 minutes (1/12th of the audio duration)*.

Accuracy Improvements

Here's an example of the API transcribing a phone call. Even with poor audio quality, background noise (a baby crying), multiple speakers, and different dialects, the API is able to generate an accurate transcription.

Yes there is something wrong with the school systems to the
budget they do not have enough money in the budget to do this
scoop do schooling the way they should um the budget the schools
are in need of repair and I don't think I get paid enough to do
schooling the way they should be the the the budget keep getting
cat and the teachers end up paying money out of their own
pockets to educate our children. I agree with you completely
they totally do not pay teachers enough I personally don't have
a school age child but from what I hear things or horrendous I
can't our school districts can't even attract enough qualified
teachers the classrooms are so overcrowded it's just it's unreal
a week we just we have so much violence in the school system they
can't afford to pay enough school officials to have someone there
monitor the children like they should be the quality of education
is just it's gone way down even so that was in school the the
student to teacher ratio has gone up tremendously there's just the
teachers are having to spend money out of their own pockets for class
supplies because they're not getting what they need from their
districts and from their school to help the situation they need
to budget more money in the school systems and make that one of
the main...

Here's another transcript, this time of a speech by Barack Obama. It took just 15 seconds to generate the transcript for this audio clip.

Tonight is a particular honor for me because let's face it.
My presence on this stage is pretty unlikely. My father was a
foreign student born and raised in a small village in case.
You grew up hurting go went to school in a tin roof shack.
His father, my grandfather was a cop, a domestic servant
to the british. But my grandfather had larger dreams for
his son through hard work and perseverance. My father got
a scholarship to study in a magical place america that
shone as a beacon of freedom and opportunity to so many
who are comprised in her. My father met my mother she was
born in a town on the other side of the world. I can her
father worked on oil rigs and farms through most of the
depression. The day after pearl harbor, my grandfather
signed up for duty, join patten's army marched across europe
back home. My grandmother raised the baby and went to work
on a bomber assembly line after the war they studied on the
gi bill bought a house through at and later moved. We all
the way to hawaii in search of opportunity, and they too
had big drinks for their daughter, a common dream born or
two can my parents share. Not only an improbable look. They
shared an abiding faith in the possibilities of this nation.

Even with a good amount of echo and background noise, the API is able to generate a high-quality transcription.

Custom Models

Both of the above examples use the API's default speech-to-text models. As a customer, you can go one step further and use the API to train a custom model on your historical data. It takes just a single API call to create a custom model.

Custom models can significantly improve accuracy and add application-specific words (like product or person names) to the vocabulary. This can result in up to a 10% absolute accuracy improvement for your application. That means going from 80% to 90% accuracy!
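To make the "absolute" in that figure concrete, here is a minimal sketch of the arithmetic. The function name `accuracy_change` is purely illustrative and not part of the API; it only assumes that accuracy is expressed as a fraction between 0 and 1, so that the error rate is one minus the accuracy.

```python
from typing import Tuple


def accuracy_change(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute accuracy gain, relative reduction in error rate).

    Illustrative helper, not part of any API. Accuracies are fractions in [0, 1].
    """
    if not (0.0 <= before < 1.0 and 0.0 <= after <= 1.0):
        raise ValueError("expected 0 <= before < 1 and 0 <= after <= 1")
    absolute_gain = after - before
    error_before = 1.0 - before
    error_after = 1.0 - after
    relative_error_reduction = (error_before - error_after) / error_before
    return absolute_gain, relative_error_reduction


# The example from the text: going from 80% to 90% accuracy.
gain, reduction = accuracy_change(0.80, 0.90)
print(f"absolute gain: {gain:.0%}, relative error reduction: {reduction:.0%}")
```

In other words, a 10-point absolute gain from 80% to 90% cuts the number of errors in half, which is why it is so noticeable in practice.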

If you need help creating a custom model, you can join our Slack community or email us at support@assemblyai.com. We're always around and happy to help!

How We Got Here

Building a highly accurate, production-grade speech-to-text API is a challenging task. Not only do you need thousands and thousands of hours of high-quality training data to train a good model, you also need reliable audio processing pipelines and scalable infrastructure.

We've been able to improve our accuracy so much by introducing a new Deep Neural Network (DNN) architecture and by improving our audio processing pipelines. Our new DNN architecture is bigger and faster, and it models speech more accurately than our old DNN architectures. We've also come up with creative ways to grow our training data. For example, through tools like Harvest, we are able to generate large amounts of training data from the Internet. We're also researching ways to generate large volumes of synthetic training data, and plan to share our research in the coming months.

The API is still in limited-access mode. If you would like access to the speech-to-text API, please email support@assemblyai.com.


* In production, we usually transcribe audio in 1/2 its audio duration (i.e., 30 minutes of audio in 15 minutes) due to load, but we are improving our infrastructure to get this time down.
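The two speed figures in this post (1/12th of the audio duration at best, 1/2 under typical production load) can be turned into a rough turnaround estimate. This is only a sketch of that arithmetic: the constant and function names are made up for illustration, and actual processing time will vary with load.

```python
# Illustrative names only; the ratios come from the figures quoted in this post.
BEST_CASE_FACTOR = 1 / 12   # processing time as a fraction of audio duration
TYPICAL_FACTOR = 1 / 2      # usual production figure, due to load


def estimated_minutes(audio_minutes: float, factor: float) -> float:
    """Estimate transcription time in minutes for a given audio length."""
    if audio_minutes < 0 or factor <= 0:
        raise ValueError("audio_minutes must be >= 0 and factor must be > 0")
    return audio_minutes * factor


for minutes in (30, 60):
    best = estimated_minutes(minutes, BEST_CASE_FACTOR)
    typical = estimated_minutes(minutes, TYPICAL_FACTOR)
    print(f"{minutes} min of audio: about {best:.1f} min best case, "
          f"{typical:.1f} min typical")
```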