> Please call Stella. Ask her to bring these things with her from the store: six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Of course, if you ask a room full of people to all say those words out loud, it will sound different coming out of each of their mouths (for a quick reference on how we all talk differently, check this out). These differences are what we wanted to pinpoint to enable Scribe to be as smart as possible: to be able to say with confidence that no matter where someone comes from, or what their background is, their dialogue will be understood.

Some background: Automatic Speech Recognition (ASR) systems identify and process human speech. One primary use of such technology is to convert speech to text. This is no mean feat. For starters, there isn't a 1:1 mapping between phonemes (linguistically meaningful speech sounds) and graphemes (the characters of an alphabet). For instance, English has (debatably) 44 phonemes but only 26 letters. Moreover, the acoustic signature associated with a phoneme varies due to individual speaker differences and contextual effects, among other reasons. Still, phonemes are categories, and categories have boundaries. If everyone spoke the same dialect, an ASR system could theoretically use acoustic differences (formant differences, etc.) to distinguish phonemes. (Background reading here and here.)

Dialects pose an additional challenge for ASR systems because the same word can be associated with entirely different phonemes. (Interesting blog from two linguists about vowel shifts.)

We leveraged data from the Speech Accent Archive, curated by linguists at George Mason University (Weinberger, S., 2013), which contains recordings from hundreds of people speaking the line above.
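To make the phoneme/grapheme mismatch concrete, here is a minimal sketch using a few words from the passage above, hand-transcribed in ARPAbet-style notation (assumption: the transcriptions reflect General American English; they are illustrative, not drawn from our production system):

```python
# Hand-picked examples showing that letter count and phoneme count diverge.
# Phonemes use ARPAbet-style symbols (e.g. "CH" for the "ch" sound).
TRANSCRIPTIONS = {
    "cheese": ["CH", "IY", "Z"],      # 6 letters, 3 phonemes
    "snake":  ["S", "N", "EY", "K"],  # 5 letters, 4 phonemes
    "scoop":  ["S", "K", "UW", "P"],  # 5 letters, 4 phonemes ("oo" is one vowel)
    "frog":   ["F", "R", "AA", "G"],  # 4 letters, 4 phonemes
}

for word, phones in TRANSCRIPTIONS.items():
    print(f"{word}: {len(word)} letters -> {len(phones)} phonemes {phones}")
```

Note how "cheese" spends six letters on three phonemes, while "frog" is one-to-one: the mapping from spelling to sound is irregular, which is one reason ASR systems cannot simply read letters off the acoustic signal.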
Considering participants born in the United States:
- The data includes recordings from individuals born in Washington, D.C. and every state except Delaware. Of these, 25% are from individuals born in Virginia, California, New York, or Pennsylvania.
- Ninety-five percent (342) of participants spoke English from birth. The native languages of the other 18 speakers were: Arabic (3), Mandarin (2), Greek (2), Farsi, French, Kikongo, Korean, Russian, Spanish, Tagalog, Twi, Urdu, Yiddish, and Yupik.
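The breakdown above can be sanity-checked with a quick sketch (the counts are copied from the list; the variable names are ours):

```python
# Participant counts as reported in the post.
native_english = 342

# Native languages of the remaining U.S.-born speakers.
other_languages = {
    "Arabic": 3, "Mandarin": 2, "Greek": 2, "Farsi": 1, "French": 1,
    "Kikongo": 1, "Korean": 1, "Russian": 1, "Spanish": 1, "Tagalog": 1,
    "Twi": 1, "Urdu": 1, "Yiddish": 1, "Yupik": 1,
}

other = sum(other_languages.values())  # 18 non-native speakers
total = native_english + other         # 360 U.S.-born participants
print(f"{other} non-native speakers; {native_english / total:.0%} spoke English from birth")
```

The 18 non-native speakers plus 342 native speakers give 360 U.S.-born participants, and 342/360 is exactly 95%.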
[Chart: United States Accent Accuracy (English)]

[Chart: World Accent Accuracy (English)]
Here's why this is important: our clients conduct business worldwide. We are dedicated to providing accurate transcriptions and entity discovery for every speaker, no matter where they come from or where they live. To do so, we must develop robust models that are as insensitive as possible to accent-based variation. And we're on our way.
Wishing you a cheerful holiday season and a Happy New Year from the entire team at GreenKey.