At AssemblyAI we use very large Deep Neural Network models to power our Speech-to-Text APIs. These models need large amounts of data to learn a good function and generalize well. As a startup, we don’t have access to the user bases or resources that a large company like Google or Microsoft can draw on to compile large data sets. As a result, we’ve had to think creatively about how to build up a large data set to train our models.

Introducing Harvest

Harvest is our proprietary, autonomous framework that crawls the Internet, looks for candidate training data, automatically labels that data, and prepares it for training. It does all this with close to zero error, which allows us to keep a gold standard data set for training highly accurate models.

So what does Harvest do?

In our use case, the training data we need is labeled speech data. Specifically, each unit of training data (a sample) is an audio clip paired with its correct transcription. Our models can then train on this sample and learn from it.
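To make the idea of a sample concrete, here is a minimal sketch of how one might be represented in Python. The class name, field names, and the example file name are assumptions for illustration; this is not how Harvest actually stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One unit of training data: an audio clip and its transcription.

    All field names here are hypothetical, chosen for illustration only.
    """
    audio_path: str    # where the audio clip is stored
    transcript: str    # the text spoken in the clip
    start_sec: float   # clip start within the source recording
    end_sec: float     # clip end within the source recording

    @property
    def duration_sec(self) -> float:
        # Length of the clip in seconds
        return self.end_sec - self.start_sec


# Made-up example values
sample = Sample("talk_001.wav", "thank you all for coming", 12.40, 14.15)
```

Keeping the start and end times alongside the transcript is what lets a sample be checked later for alignment with the audio.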

To automatically collect these samples from the Internet, Harvest crawls the web looking for public (non-licensed) videos and audio files that have closed captions. Think TED talks or open source university courses. It then downloads these video/audio files and closed captions, and prepares the data for training.

Unfortunately, it’s not as simple as just downloading video/audio files with closed captions. Closed captions are mostly meant for readability, and are not precisely timed to the audio. To keep a gold standard data set, the audio and transcript must be perfectly aligned. And to make matters worse, we’ve found empirically that there are a lot of errors in closed captions. These might be due to human error, or to an automatic captioning system that introduces errors of its own.

If we ignored these issues, it would make the function our models have to learn much harder. To get around these issues, Harvest has learned to automatically figure out when to discard bad data. For example, it can automatically determine when closed captions might be incorrect. It will also try to line up the caption with the exact time in the video/audio that the caption was spoken to create better alignments between the transcription and the audio.
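As an illustration of the alignment step, here is a minimal sketch in Python. It assumes word-level timestamps are already available from some other source (for example an existing speech recognizer), which is an assumption on our part for this example and not a description of how Harvest works. The function name `align_caption` and the data layout are hypothetical.

```python
from difflib import SequenceMatcher


def align_caption(caption, timed_words):
    """Estimate when a caption was spoken.

    timed_words is a list of (word, start_sec, end_sec) tuples, assumed
    to come from some external source of word timings (hypothetical).
    Returns (start_sec, end_sec), or None if no caption word matches.
    """
    caption_words = caption.lower().split()
    spoken_words = [word.lower() for word, _, _ in timed_words]

    # Find runs of words shared by the caption and the timed word list
    matcher = SequenceMatcher(a=caption_words, b=spoken_words, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None

    first = blocks[0].b                        # index of first matched word
    last = blocks[-1].b + blocks[-1].size - 1  # index of last matched word
    return timed_words[first][1], timed_words[last][2]


timed = [("so", 0.0, 0.2), ("thank", 0.5, 0.8), ("you", 0.8, 1.0),
         ("all", 1.0, 1.2), ("okay", 1.5, 1.8)]
span = align_caption("Thank you all", timed)
```

A real system would also need to handle punctuation, repeated phrases, and partial matches, all of which this sketch ignores.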

In order for Harvest to decide which captions are incorrect, we’ve developed a proprietary algorithm that Harvest uses to make the determination. The False Positive rate in this system is almost 0%, and we don’t really care about False Negatives (denying a correct transcript) because data on the Internet is abundant, so we can afford to be lazy with our assessment.
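Harvest’s algorithm is proprietary and not described here, but the trade-off above can be illustrated with a generic stand-in. The sketch below assumes a second, independent transcript of the same audio is available (for example from an existing recognizer) and keeps a caption only when the two agree almost exactly. The function names and the 5% threshold are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0

    # Levenshtein distance over words, computed one row at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def keep_caption(caption: str, second_transcript: str,
                 max_wer: float = 0.05) -> bool:
    """Keep a caption only if it nearly matches the second transcript.

    The strict default threshold is a hypothetical value: it rejects
    some correct captions in exchange for rarely accepting a wrong one.
    """
    return word_error_rate(caption, second_transcript) <= max_wer
```

With a threshold this strict, a single substituted word in a short caption is enough to discard it, which is acceptable when candidate data is abundant.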

We’ve also developed architecture to make Harvest just as scalable as a simple web server. Usually, we’ll have dozens of Harvest “clients” running in parallel. In just a few weeks, we are able to generate millions of high quality, labeled audio clips that we can use as training data for our Deep Neural Networks. This is orders of magnitude faster and cheaper than using human workers to do the same task, and more reliable as well (we’ve experimented with human labeling and compared the results).
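The idea of many independent clients working in parallel can be sketched with Python’s standard library. This is only an illustration under stated assumptions: `harvest_one` is a hypothetical placeholder for the real download, align, and filter work, the URLs are made up, and a production deployment would more likely spread clients across many machines than across threads.

```python
from concurrent.futures import ThreadPoolExecutor


def harvest_one(url: str) -> dict:
    """Hypothetical placeholder for one client's work on one URL.

    A real client would download the media and captions, align them,
    and filter out bad samples. Here we only return a dummy record.
    """
    return {"url": url, "accepted": url.endswith(".mp4")}


def run_clients(urls, num_clients: int = 4):
    """Process URLs with a pool of independent workers."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(harvest_one, urls))


results = run_clients([
    "https://example.com/lecture1.mp4",
    "https://example.com/notes.txt",
])
```

Because each URL is handled independently and no state is shared between workers, adding more clients increases throughput in much the same way as adding more stateless web servers.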

What's next

There’s no reason why this system couldn’t also be used to compile training data for other tasks. For example, we are experimenting with building a similar system to bootstrap Image or Text data from the Internet for other projects.

In the near future, we’ll be releasing a small subset (100 hours) of our training data that will be free to use for commercial or non-commercial projects.

As a startup, we’re having to develop a lot of creative technology and efficiencies to deliver cutting edge AI technology on a budget. We’ll be sharing more of our ideas, systems, and open source contributions in the near future. Stay tuned!