Most phone call recordings and audio files are stored in lossy formats such as the very common MP3 format. Compression saves storage space and bandwidth. But, as you can see in the image below, compressing audio discards information, which makes it harder for Speech-to-Text algorithms to transcribe the audio accurately.


We see a lot of compressed audio from customers, especially customers transcribing phone calls with our Speech-to-Text API. When we talk about compression, we mainly mean lowering the bitrate of audio files and saving them in lossy formats like MP3.

In telephony, for example, a standard signal has a bitrate of 64 kbps (kilobits per second). Most phone call recordings we see, though, have bitrates as low as 8 kbps due to compression. To make matters worse, the recordings are saved in MP3 format, which discards even more information to save space.
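To put those bitrates in terms of storage, here is a minimal sketch of the arithmetic. It assumes a constant bitrate and decimal kilobits (1 kbps = 1,000 bits per second); the function name `bytes_per_minute` is only illustrative.

```python
def bytes_per_minute(bitrate_kbps: float) -> int:
    """Return the bytes needed to store one minute of audio at a constant bitrate."""
    bits_per_second = bitrate_kbps * 1000  # assumption: decimal kilobits
    return int(bits_per_second * 60 / 8)   # 60 seconds per minute, 8 bits per byte


standard = bytes_per_minute(64)    # standard telephony bitrate from the text
compressed = bytes_per_minute(8)   # low bitrate typical of compressed recordings

print(f"64 kbps: {standard / 1000:.0f} kB per minute")
print(f" 8 kbps: {compressed / 1000:.0f} kB per minute")
print(f"Reduction: {standard / compressed:.0f}x")
```

Under these assumptions, one minute of audio takes about 480 kB at 64 kbps and about 60 kB at 8 kbps, an eightfold saving. That saving is paid for with information removed from the signal.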

For example, listen to these two recordings to hear the difference in quality between bitrates. If you can't tell the difference right away, try listening with headphones.

64 kbps bitrate:

8 kbps bitrate with MP3 encoding (compressed audio):
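If you want to hear the effect on your own audio, one way to produce a comparable low-bitrate MP3 is to re-encode a file with the `ffmpeg` command-line tool. The sketch below is an assumption about how such a file could be made, not a description of how the recordings above were produced. It assumes `ffmpeg` is installed with an MP3 encoder, and the function names and file paths are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_ffmpeg_command(src: str, dst: str, bitrate_kbps: int = 8,
                         sample_rate_hz: int = 8000) -> List[str]:
    """Build an ffmpeg command that re-encodes src as low-bitrate mono audio."""
    return [
        "ffmpeg",
        "-y",                          # overwrite dst if it already exists
        "-i", src,                     # input file
        "-ac", "1",                    # mono, as in telephony
        "-ar", str(sample_rate_hz),    # output sample rate in Hz
        "-b:a", f"{bitrate_kbps}k",    # target audio bitrate
        dst,                           # output format is inferred from the extension
    ]


def compress(src: str, dst: str, bitrate_kbps: int = 8) -> None:
    """Run ffmpeg to write a compressed copy of src (requires ffmpeg on PATH)."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_ffmpeg_command(src, dst, bitrate_kbps), check=True)
```

Calling `compress("call.wav", "call_8kbps.mp3")` with your own file names would write an 8 kbps MP3 that you can compare against the original by ear.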

Compressed audio is not just harder for you to hear; it is also harder for Speech-to-Text algorithms to transcribe accurately. To solve this, we've developed a novel way to train our Speech-to-Text models so that they transcribe compressed audio just as accurately as uncompressed audio.

For example, here are the transcripts from AssemblyAI and Google for the compressed audio clip above:


This may seem like a small difference, but minor transcription errors can completely change the meaning of a transcript.

By making our Speech-to-Text models robust to low-bitrate, compressed audio, we can offer much higher accuracy on phone call recordings from call centers, sales calls, voicemail, and similar sources.

This is just one of the many things we're doing to make Speech-to-Text easier for developers and companies to access.