Amy C. Geojo
August 14, 2018
Voice recognition is only as good as the inputs used to train the technology. That’s why exposing Scribe to hundreds of accents and dialects has been crucial to ensuring its accuracy for transcription services.
We started with this line:
Please call Stella. Ask her to bring these things with her from the store: six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
Of course, if you ask a room full of people to say those words out loud, they will sound different coming out of each of their mouths (for a quick reference on how we all talk differently, check this out). These differences are what we wanted to pinpoint so that Scribe can be as smart as possible – able to say with confidence that, no matter where someone comes from or what their background is, their speech will be understood.
Some background: Automatic Speech Recognition (ASR) systems identify and process human speech. One primary use of such technology is to convert speech to text. This is no mean feat. For starters, there isn’t a 1:1 mapping between phonemes (linguistically meaningful speech sounds) and graphemes (characters comprising an alphabet). For instance, English has (debatably) 44 phonemes but only 26 letters. Moreover, the acoustic signature associated with a phoneme varies due to individual speaker differences and contextual effects, among other reasons.
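The grapheme-to-phoneme mismatch is easy to see with a toy example. The sketch below (not Scribe's internals; the ARPAbet transcriptions are illustrative) shows one spelling cluster, "ough", mapping to several distinct phoneme strings:

```python
# Toy illustration of the phoneme/grapheme mismatch: the same grapheme
# cluster "ough" is realized as different phoneme sequences (ARPAbet
# notation) depending on the word, so no 1:1 mapping exists.
OUGH = {
    "though":  "OW",     # rhymes with "go"
    "through": "UW",     # rhymes with "too"
    "tough":   "AH F",   # rhymes with "puff"
    "cough":   "AO F",   # rhymes with "off"
}

# Four words, four different pronunciations of the same four letters.
distinct = set(OUGH.values())
print(f"'ough' has {len(distinct)} distinct realizations here")  # 4
```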
Still, phonemes are categories, and categories have boundaries. If everyone spoke the same dialect, an ASR system could theoretically use acoustic differences (formant differences, etc.) to distinguish phonemes. (Background reading here and here.)
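To make the idea of acoustic category boundaries concrete, here is a minimal nearest-centroid sketch over the first two formants (F1, F2). The centroid values are approximate averages for American English vowels from the classic Peterson & Barney (1952) measurements; real ASR systems model far richer acoustic features than this:

```python
import math

# Approximate (F1, F2) centroids in Hz for three American English
# vowels (Peterson & Barney, 1952; illustrative values only).
VOWEL_CENTROIDS = {
    "iy (heed)":  (270, 2290),
    "ae (had)":   (660, 1720),
    "uw (who'd)": (300, 870),
}

def classify_vowel(f1, f2):
    """Assign a vowel token to the category with the nearest centroid."""
    return min(VOWEL_CENTROIDS,
               key=lambda v: math.dist((f1, f2), VOWEL_CENTROIDS[v]))

print(classify_vowel(300, 2200))  # falls in the /iy/ region
```

The catch, as the next paragraph notes, is that speaker and dialect variation shifts these centroids, so fixed boundaries like these break down across dialects.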
Dialects pose an additional challenge for ASR systems because the same word can be associated with different phonemes entirely. (Interesting blog from two linguists about vowel shifts.)
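One common way to cope with this is for an ASR lexicon to list multiple pronunciations per word. A hedged sketch, with illustrative ARPAbet entries for a word whose dialectal split is well known:

```python
# Sketch of a multi-pronunciation lexicon entry (illustrative, not
# Scribe's actual lexicon): the same word maps to different phoneme
# sequences in different dialects.
LEXICON = {
    "tomato": [
        ["T", "AH", "M", "EY", "T", "OW"],  # common American realization
        ["T", "AH", "M", "AA", "T", "OW"],  # common British realization
    ],
}

print(len(LEXICON["tomato"]), "pronunciation variants")  # 2
```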
We leveraged data from the Speech Accent Archive, curated by linguists at George Mason University (Weinberger, S., 2013), which contains recordings from hundreds of people speaking the line above.
Considering participants born in the United States:
- The data includes recordings from individuals born in Washington D.C. and all states except Delaware. Of these, 25% are from individuals born in Virginia, California, New York or Pennsylvania.
- Ninety-five percent (342) of participants spoke English from birth. The native languages of the other 18 speakers were: Arabic (3), Mandarin (2), Greek (2), Farsi, French, Kikongo, Korean, Russian, Spanish, Tagalog, Twi, Urdu, Yiddish and Yupik.
Worldwide, 163 countries were represented, with 191 different native languages. A complication (or a wonder of human achievement) is that individuals born in the same country do not necessarily share the same native language. For instance, 11 different native languages appear among the audio recordings from Chinese-born speakers: 23 native speakers of Cantonese and 52 of Mandarin, with the remaining 13 recordings from native speakers of the other 9 languages.
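Tabulations like the one above are straightforward to compute from speaker metadata. A hedged sketch, using a hypothetical record layout (the archive itself publishes per-speaker pages rather than this exact structure):

```python
from collections import Counter

# Hypothetical speaker metadata records; field names are illustrative.
speakers = [
    {"birth_country": "china", "native_language": "mandarin"},
    {"birth_country": "china", "native_language": "cantonese"},
    {"birth_country": "china", "native_language": "mandarin"},
    {"birth_country": "usa",   "native_language": "english"},
]

# Count native languages within each birth country.
by_country = {}
for s in speakers:
    counts = by_country.setdefault(s["birth_country"], Counter())
    counts[s["native_language"]] += 1

print(by_country["china"])  # Counter({'mandarin': 2, 'cantonese': 1})
```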
Training and evaluating Scribe against this data, we have very promising results to report: we are very well covered in 49 of the 50 states (sorry, Oklahoma), and well covered in most countries around the world.
United States Accent Accuracy (English)
World Accent Accuracy (English)
Here’s why this is important: our clients conduct business worldwide. We are dedicated to providing accurate transcriptions and entity discovery for every speaker, no matter where they come from or where they live. To do so, we must develop robust models that are as insensitive to accent-based variance as possible.
And we’re on our way.