We're excited to announce significant accuracy improvements to our speech-to-text API. The API now produces more accurate transcripts for audio of all types (phone calls, podcasts, videos).
As a bonus, we're also announcing major speed improvements to the API. Transcribing a 60-minute audio file now takes as little as 5 minutes (1/12th of the audio duration)*.
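To make the speed figures concrete, here is a minimal sketch of the arithmetic. The function name and constants are our own illustration, not part of the API; the two factors come from the numbers quoted in this post (1/12th of the audio duration in the best case, and roughly 1/2 under production load, per the footnote).

```python
from fractions import Fraction

# Hypothetical helper for illustration only; not an API call.
def estimated_transcription_minutes(audio_minutes, speed_factor):
    """Estimate processing time as a fraction of the audio duration."""
    if audio_minutes < 0 or speed_factor <= 0:
        raise ValueError("audio_minutes must be >= 0 and speed_factor must be > 0")
    return audio_minutes * speed_factor

# Factors quoted in this post (exact fractions avoid float rounding).
BEST_CASE_FACTOR = Fraction(1, 12)           # "as little as" figure
TYPICAL_PRODUCTION_FACTOR = Fraction(1, 2)   # figure from the footnote

best_case = estimated_transcription_minutes(60, BEST_CASE_FACTOR)
typical = estimated_transcription_minutes(30, TYPICAL_PRODUCTION_FACTOR)

print(f"60 min of audio, best case: {float(best_case)} min")
print(f"30 min of audio, typical production load: {float(typical)} min")
```

Actual turnaround depends on load, so treat these as rough planning numbers rather than guarantees.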
Here's an example of the API transcribing a phone call. Even with poor audio quality, background noise (a baby crying), multiple speakers, and different dialects, the API is able to generate an accurate transcription.
Yes there is something wrong with the school systems to the budget they do not have enough money in the budget to do this scoop do schooling the way they should um the budget the schools are in need of repair and I don't think I get paid enough to do schooling the way they should be the the the budget keep getting cat and the teachers end up paying money out of their own pockets to educate our children. I agree with you completely they totally do not pay teachers enough I personally don't have a school age child but from what I hear things or horrendous I can't our school districts can't even attract enough qualified teachers the classrooms are so overcrowded it's just it's unreal a week we just we have so much violence in the school system they can't afford to pay enough school officials to have someone there monitor the children like they should be the quality of education is just it's gone way down even so that was in school the the student to teacher ratio has gone up tremendously there's just the teachers are having to spend money out of their own pockets for class supplies because they're not getting what they need from their districts and from their school to help the situation they need to budget more money in the school systems and make that one of the main...
Here's another transcript, this time of a speech by Barack Obama. It took just 15 seconds to generate the transcript for this audio clip.
Tonight is a particular honor for me because let's face it. My presence on this stage is pretty unlikely. My father was a foreign student born and raised in a small village in case. You grew up hurting go went to school in a tin roof shack. His father, my grandfather was a cop, a domestic servant to the british. But my grandfather had larger dreams for his son through hard work and perseverance. My father got a scholarship to study in a magical place america that shone as a beacon of freedom and opportunity to so many who are comprised in her. My father met my mother she was born in a town on the other side of the world. I can her father worked on oil rigs and farms through most of the depression. The day after pearl harbor, my grandfather signed up for duty, join patten's army marched across europe back home. My grandmother raised the baby and went to work on a bomber assembly line after the war they studied on the gi bill bought a house through at and later moved. We all the way to hawaii in search of opportunity, and they too had big drinks for their daughter, a common dream born or two can my parents share. Not only an improbable look. They shared an abiding faith in the possibilities of this nation.
Even with a good amount of echo and background noise, the API is able to generate a high-quality transcription.
Both of the above examples use the API's default speech-to-text models. As a customer, you can go one step further and use the API to train a custom model on your historical data. It takes just a single API call to create a custom model.
Custom models will significantly improve accuracy and add application-specific words to the vocabulary (like product or person names). This can result in an absolute accuracy improvement of up to 10% for your application. For example, that could mean going from 80% to 90% accuracy!
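For readers who want to measure this on their own data, here is a minimal sketch. It assumes accuracy is computed as 1 minus word error rate (WER), a common convention for speech-to-text; this post does not state the exact metric, so treat that as an assumption. The function names and the sample sentences are illustrative only. The sketch also shows why an absolute gain from 80% to 90% is larger than it sounds: it cuts the remaining errors in half.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes simple lowercase, whitespace tokenization (an illustrative choice).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(accuracy_before_pct: float, accuracy_after_pct: float) -> float:
    """Fraction of the original errors removed, given accuracies in percent."""
    error_before = 100 - accuracy_before_pct
    error_after = 100 - accuracy_after_pct
    if error_before <= 0:
        raise ValueError("accuracy_before_pct must be below 100")
    return (error_before - error_after) / error_before


# Made-up sentences for illustration: one substituted word out of six.
wer = word_error_rate(
    "please call me back tomorrow morning",
    "please call me back tomorrow mourning",
)
print(f"WER: {wer:.3f}, accuracy: {1 - wer:.3f}")

# The 80% -> 90% example from the text.
reduction = relative_error_reduction(80, 90)
print(f"Relative error reduction: {reduction:.0%}")
```

Note that WER can exceed 1 when the hypothesis contains many extra words, so accuracy defined this way can dip below zero on very poor transcripts.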
How we got here
Building a highly accurate, production speech-to-text API is a challenging task. Not only do you need many thousands of hours of high-quality training data to train a good model, you also need reliable audio processing pipelines and scalable infrastructure.
We've been able to improve our accuracy so much by introducing a new Deep Neural Network (DNN) architecture and by improving our audio processing pipelines. Our new DNN architecture is bigger and faster, and it models speech more accurately than our old DNN architectures. We've also come up with creative ways to grow our training data. For example, through tools like Harvest, we are able to generate large amounts of training data from the Internet. We're also researching ways to generate large volumes of synthetic training data, and plan to share our research in the coming months.
The API is still in limited-access mode. If you would like access to the speech-to-text API, please email firstname.lastname@example.org.
* In production, we usually transcribe audio in about 1/2 of its duration (i.e., 30 minutes of audio in 15 minutes) due to load, but we are improving our infrastructure to bring this time down.