June 6, 2018
At GreenKey, our data science team is constantly focused on one question: How do we make a machine recognize speech as well as humans?
Several companies have reported computers outperforming humans at speech recognition, but these tests normally use specific types of audio and don’t involve noisy environments. The fact is that speech recognition engines still struggle to understand the full range of speech as well as humans do.
Instead, many speech recognition engines are trained to perform well on specific content. This could be a speech recognition model for a particular language, noise environment, set of accents, or content type. If you build an array of speech recognition models that are each good at one thing, you can send audio to all of them and then choose the best transcript. Many firms have started embracing this approach with success.
This concept is powerful – leveraging multiple speech recognition engines to get the best transcript. However, the process above is pretty inefficient. At GreenKey, we’re excited to announce the filing of our patent on a method that solves this problem in a better way.
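For contrast, the send-to-every-engine baseline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Engine` type, the idea that each engine returns a confidence score alongside its transcript, and the function name `transcribe_with_all` are all hypothetical and not taken from any particular product.

```python
from typing import Callable, Dict, Tuple

# Hypothetical engine interface: audio bytes in, (transcript, confidence) out.
Engine = Callable[[bytes], Tuple[str, float]]


def transcribe_with_all(audio: bytes, engines: Dict[str, Engine]) -> Tuple[str, str, int]:
    """Send the audio to every engine and keep the highest-confidence transcript.

    Returns (winning engine name, transcript, number of engine calls made).
    """
    if not engines:
        raise ValueError("at least one engine is required")
    # Every engine runs on every piece of audio, so cost grows with len(engines).
    results = {name: engine(audio) for name, engine in engines.items()}
    best = max(results, key=lambda name: results[name][1])
    return best, results[best][0], len(results)
```

The number of engine calls always equals the number of engines, which is the inefficiency in question.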
Let’s say you have a heart issue and you want to see a doctor to get it diagnosed. There are many different types of doctors – generalists, neurologists, oncologists, radiologists, cardiologists. The current state-of-the-art approach to speech recognition is the equivalent of seeing every possible doctor, collecting all of their diagnoses, and choosing the best one. That’s pretty inefficient, right? It requires a lot of time (and money).
So what do people do instead? They decide which doctor to see before they get a diagnosis. That doctor might refer them to a specialist or might be able to take care of the problem directly. Even with a referral to a specialist, they get the best diagnosis after only two doctor’s visits instead of dozens.
This is what GreenKey’s new patent does with speech recognition. Our SwitchBoard analyzes audio in less than a second to determine which transcription engine it thinks will best transcribe a piece of audio. Then, once it gets a transcript back, it figures out whether it should trust that transcript or get a second opinion from a different engine.
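The two-step flow described above (pick one engine first, then decide whether to ask a second) can be illustrated with a minimal Python sketch. It is not GreenKey's implementation: the `rank_engines` callable, the per-transcript confidence score, and the `trust_threshold` value are all assumptions made for the example, since the post does not say how engine selection or trust is computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical engine interface: audio bytes in, (transcript, confidence) out.
Engine = Callable[[bytes], Tuple[str, float]]


@dataclass
class Result:
    transcript: str
    confidence: float
    engines_used: List[str]


def route(
    audio: bytes,
    engines: Dict[str, Engine],
    rank_engines: Callable[[bytes], List[str]],
    trust_threshold: float = 0.8,  # assumed cutoff, chosen only for illustration
) -> Result:
    """Transcribe with the top-ranked engine; consult one more only if needed."""
    ranked = rank_engines(audio)  # best guess first, made before any transcription
    if not ranked:
        raise ValueError("rank_engines returned no candidates")
    first = ranked[0]
    text, conf = engines[first](audio)
    used = [first]
    # Second opinion: only when the first transcript is not trusted.
    if conf < trust_threshold and len(ranked) > 1:
        second = ranked[1]
        text2, conf2 = engines[second](audio)
        used.append(second)
        if conf2 > conf:
            text, conf = text2, conf2
    return Result(text, conf, used)
```

However many engines are registered, this sketch calls at most two of them per piece of audio.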
The result? Accurate transcription from multiple models without sacrificing time and resources. SwitchBoard’s multi-engine approach can reduce error rates by half. The graph below shows how accurate single models are versus the multi-engine approach, using three GreenKey-built models and an extra model from a cloud transcription provider:
It is important to note that even though four models are being used here as choices for SwitchBoard, any segment of audio is transcribed by at most two models.
The ability to leverage multiple engines efficiently is a powerful tool. Voice technology is changing so fast, and there are many innovative companies out there training speech recognition models for a particular domain. Scribe’s SwitchBoard can leverage any GreenKey-built engine or any transcription engine that any company builds, as long as it takes in audio and returns a transcript. This not only means Scribe can deliver the most accurate transcript, it also means that it can transcribe multi-language conversations or conference calls where everyone has a different noise environment.
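The only requirement stated above is that an engine takes in audio and returns a transcript. One way to express that contract in Python is sketched below. The names `TranscriptionEngine`, `CallableEngine`, and `register` are hypothetical and are not part of Scribe or SwitchBoard.

```python
from typing import Callable, Dict, Protocol


class TranscriptionEngine(Protocol):
    """Hypothetical minimal contract: audio bytes in, transcript text out."""

    def transcribe(self, audio: bytes) -> str: ...


class CallableEngine:
    """Adapter that wraps any audio -> text function, such as a third-party client call."""

    def __init__(self, fn: Callable[[bytes], str]) -> None:
        self._fn = fn

    def transcribe(self, audio: bytes) -> str:
        return self._fn(audio)


def register(
    registry: Dict[str, TranscriptionEngine], name: str, engine: TranscriptionEngine
) -> None:
    """Add an engine under a unique name so a router can choose it later."""
    if name in registry:
        raise ValueError(f"engine {name!r} is already registered")
    registry[name] = engine
```

With a contract this narrow, an in-house model and an external cloud service look identical to whatever component chooses between them.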