Blog: Talk, Don't Type!
Humans have been talking for 100,000 years – and now, with the latest developments in machine learning and computing power, machines are smart enough to listen. Not only can machines recognise speech, they can understand the meaning of it. We are on the cusp of a profound change in behaviour.
GreenKey offers insights into the world of speech recognition and how we are leveraging ASR to digitize voice in the financial markets and to change the way the markets communicate.
The usage of automatic speech recognition (ASR) technology has exploded in the last 2 years in the consumer space off the back of billions of dollars of R&D from the tech giants. There is now a huge opportunity to leverage that technology in a wide range of domain-specific applications. One such domain is the financial markets, which largely run on voice communication.
Speech recognition capabilities have developed exponentially in the past few years. When Siri was first introduced in 2011, it was a fun piece of technology, but I didn’t know many people who used it for anything other than having a laugh. In fact, when my 4-year-old son got his hands on my wife’s new phone in 2011, he was fascinated by Siri and of course wanted to have a go speaking into it. Unfortunately, his mumbled 4-year-old version of ‘What is Siri?’ led to Siri saying, in her clear, bell-like tones, ‘Did you say auto-eroticism?’. As my wife lunged for the phone, Siri continued, ‘Would you like to do a web-search for auto-eroticism?’. Needless to say, Siri was quickly deactivated.
ASR technology has improved significantly since then, and speaking is now up to 4x faster than typing. Two factors are driving this. Factor one is the vast sum of money being spent on ASR by some of the biggest players in technology: Google, Apple, Amazon, Baidu, Microsoft and IBM.
Developments include a more sophisticated version of Apple’s Siri (including Siri being built into Macs as of this fall), Microsoft’s virtual assistant, Cortana, Google Now, a voice search function, and smart home devices including Amazon’s Echo and Google’s Home. What drives this investment is factor two: users’ behavior.
On mobile devices the share of voice queries is even higher, at 20%. These numbers matter because the accuracy of ASR is based on the volume of speech data sets which feed into machine learning. What ASR does, in effect, is predict what you are probably saying, based on what millions of users have actually said in the past. The more data sets there are, the more accurate ASR’s predictions become.
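To make that idea concrete, here is a minimal sketch of prediction-from-past-usage, with a four-line toy corpus standing in for the millions of real utterances (the corpus and phrases are invented for illustration): a bigram model counts which word has followed which, then predicts the most frequent continuation.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the millions of past user utterances.
corpus = [
    "play some music",
    "play some jazz",
    "play the news",
    "set an alarm",
]

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("play"))  # "some" (follows "play" twice, vs "the" once)
```

Real systems use far richer models over vastly more data, but the principle is the same: more observed speech means sharper predictions.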
This new level of sophistication means that ASR is no longer just used for making a hands-free phone call in the car or finding a nearby restaurant. There are many jobs where typing is difficult or impossible, or where the entire job is based on spoken conversations. In these cases, speech recognition technology can transform the way people work. The F-35 was the first U.S. fighter aircraft with an ASR system able to "hear" a pilot's spoken commands, leaving their hands free to control the aircraft. At GreenKey we are doing the same for a sector where voice is fundamental to the conduct of business – traders and brokers.
A Short History of ASR
The first ASR systems were developed in the 1950s by Bell Laboratories and IBM and could only understand digits. In the 1970s, the US Department of Defense started funding research projects looking for sound patterns and developed systems with the vocabulary of a 3-year-old. Through the 1980s and 1990s, ASR turned towards prediction using statistical methods. As computers with faster processors arrived towards the end of the millennium, Dragon launched the first consumer product (Dragon Dictate). I remember gleefully thinking that I could use it to take all my university lecture notes: all I had to do, the thinking went, was to send my Dragon-enabled laptop into class in a friend’s backpack, while I went off swim training. Alas, it wasn’t that good.
The last 5 years have seen a step change driven by advances in deep learning, big data and the availability of faster processing power from cloud computing. We are now seeing an explosion of both voice search apps and personal assistants. Apple introduced its (initially not very) intelligent personal assistant Siri on the iPhone 4S in 2011. Google offered its “Voice Search” app that uses data from billions of search queries to better predict what you're probably saying. In 2014, Amazon launched the Echo, a wireless speaker and voice command device that responds to the name Alexa and can be told to play music, make to-do lists and set alarms. Google has now announced Google Home, a Wi-Fi speaker with a built-in voice assistant to answer questions and control web-enabled devices in your home. Google recently provided open access to its new enterprise-level speech recognition API and leads the market in English, while Baidu’s “Deep Speech 2” has become ubiquitous for Mandarin. A stable of social robots has been developed that can talk to and apparently empathise with us. It is almost time to throw away the keyboard.
According to a Northstar Research study, half of all adults and teenagers use voice search every day (Siri, Google Now or Cortana). Google Now has an error rate of 8% compared to about 25% a few years ago. Recently Google open-sourced its TensorFlow machine learning system which underpins its ASR, and Microsoft followed suit by open-sourcing its Computational Networks Toolkit (CNTK) for the ASR behind its Cortana virtual assistant. This will spur the rapid development of an array of ASR apps.
Image Credit: Internet Trends 2016: Code Conference. Presentation by Mary Meeker, Kleiner, Perkins, Caufield, Byers
How does cutting edge ASR work?
ASR systems are based on two key models:
- Acoustic model: represents the relationship between an audio signal and the phonemes (sounds) or other linguistic units that make up speech. The acoustic model is built by taking audio recordings of speech (split into small consecutive slices or “frames” of milliseconds of audio to be analysed for their frequency content), together with their text transcriptions. Software then creates statistical representations of the sounds that make up each word. The result is a probability distribution over all the phonemes in the model. Leading ASR systems use deep neural networks (DNNs) as the core technology to model the sounds of a language and assess which sound a user is producing at every instant in time. DNNs are capable of being “trained”: modelled on the human neural system, this is a form of machine learning based on representations of the brain’s neural response to specific stimuli, in this case millions of examples of recorded speech. Critically, ASR systems are able to process information blazingly fast, and can work in noisy environments.
- Language model: a probability distribution over sequences of words that estimates the relative likelihood of different phrases. The language model provides context to distinguish between words and phrases that sound similar. For example, the phrase “It's easy to wreck a nice beach" is pronounced almost the same as "It's easy to recognise speech". Given the context of the conversation, the language model enables this ambiguity from the acoustic model to be resolved.
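The way the two models combine can be sketched in a few lines. The scores below are invented stand-ins (a real decoder searches over lattices of per-frame phoneme probabilities): each candidate transcript gets an acoustic log-likelihood, nearly identical here because the two phrases sound almost the same, plus a language-model log-probability reflecting how common the phrase is in conversation, and the recogniser picks the transcript with the best combined score.

```python
# Hypothetical log-scores for two acoustically confusable transcripts.
# Acoustic scores are nearly tied; the language model breaks the tie.
candidates = {
    "it's easy to recognise speech":   {"acoustic": -12.1, "language": -6.0},
    "it's easy to wreck a nice beach": {"acoustic": -12.0, "language": -14.5},
}

def decode(candidates):
    """Pick the transcript maximising acoustic + language log-score."""
    return max(
        candidates,
        key=lambda t: candidates[t]["acoustic"] + candidates[t]["language"],
    )

print(decode(candidates))
```

Working in log-space means the two probabilities multiply, which is exactly the Bayesian combination real recognisers perform.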
To create a highly accurate ASR system requires large amounts of data to train both the acoustic model and the language model. By using domain-specific models, trained with a large volume of training data, it is possible to achieve accuracy greater than 95%.
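Accuracy figures like this are conventionally reported via word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the recognised transcript into the reference, divided by the reference length, so 95% accuracy corresponds roughly to a WER of 5%. A sketch of the standard computation (the example phrases are invented):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    via standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of five -> 20% WER, i.e. 80% accuracy on this utterance.
print(word_error_rate("sell five lots at market", "sell nine lots at market"))
```

In a trading context, note that a single substituted word ("five" vs "nine") can change the meaning of an order entirely, which is why domain-specific training data matters so much.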
Voice in the financial markets
Nearly all the key interactions in the financial markets are done by voice - trades or enquiries related to transactions that are large, complicated or for an illiquid product; important advice and important client interactions. This is simply because voice is the predominant form of human communication and it has qualities that are not matched by other mediums: immediacy, empathy, nuance, instant feedback. If you walk onto a trading floor of a large bank or broker you will see a thousand people, all on the telephone - talking, shouting, communicating. The financial markets run on voice.
As ASR digitizes these voice interactions, traders will become more efficient, vast amounts of data will be harnessed, fewer mistakes will be made, and there will be vastly more transparency and auditability. Regulators will have the same level of transparency and audit trails with voice interactions as with electronic ones, and this will allow market participants to choose the form of communication that best serves their needs. The markets will become safer, cheaper and more efficient.