Blog: Talk, Don't Type!

Humans have been talking for 100,000 years – and now, with the latest developments in machine learning and computing power, machines are smart enough to listen.  Not only can machines recognise speech, they can understand the meaning of it.  We are on the cusp of a profound change in behaviour. 

GreenKey offers insights into the world of speech recognition and how we are leveraging ASR to digitize voice in the financial markets and to change the way the markets communicate. 

Synopsis
The use of automatic speech recognition (ASR) technology has exploded in the last 2 years in the consumer space off the back of billions of dollars of R&D from the tech giants.  There is now a huge opportunity to leverage that technology in a wide range of domain-specific applications.  One such domain is the financial markets, which largely run on voice communication.

Speech recognition capabilities have developed exponentially in the past few years.  When Siri was first introduced in 2011, it was a fun piece of technology, but I didn’t know many people who used it for anything other than having a laugh.  In fact, when my 4-year-old son got his hands on my wife’s new phone in 2011, he was fascinated by Siri and of course wanted to have a go speaking into it.  Unfortunately, his mumbled 4-year-old version of ‘What is Siri?’ led to Siri saying, in her clear, bell-like tones, ‘Did you say auto-eroticism?’.  As my wife lunged for the phone, Siri continued, ‘Would you like to do a web-search for auto-eroticism?’.  Needless to say, Siri was quickly deactivated.

ASR technology has improved significantly since then, and speech input is now as much as four times faster than typing.  Two factors are driving this.  Factor one is the vast sum of money being spent on ASR by some of the biggest players in technology: Google, Apple, Amazon, Baidu, Microsoft and IBM.

Developments include a more sophisticated version of Apple's Siri (including Siri being built into Macs as of this fall), Microsoft's virtual assistant Cortana, Google Now and its voice search function, and smart home devices such as Amazon's Echo and Google's Home.  What drives this investment is factor two: users' behaviour.

Two years ago, in 2014, voice searches as a percentage of the total were still close to zero. Now, of the 3 billion Google searches done each day, 12% are voice searches. That is 360 million voice searches per day.

On mobile devices the percentage is even higher: 20%.  These numbers matter because the accuracy of ASR depends on the volume of speech data feeding its machine learning.  What ASR does, in effect, is predict what you are probably saying, based on what millions of users have actually said in the past.  The more data there is, the more accurate ASR's predictions become.

This new level of sophistication means that ASR is no longer just used for making a hands-free phone call in the car or finding a nearby restaurant.  There are many jobs where typing is difficult or impossible, or where the entire job is based on spoken conversations.  In these cases, speech recognition technology can transform the way people work.  The F-35 was the first U.S. fighter aircraft with an ASR system able to "hear" a pilot's spoken commands, leaving their hands free to control the aircraft.  At GreenKey we are doing the same for a sector where voice is fundamental to the conduct of business – traders and brokers.

A Short History of ASR
The first ASR systems were developed in the 1950s by Bell Laboratories and IBM and could only understand digits.  In the 1970s, the US Department of Defense started funding research projects looking for sound patterns and developed systems with the vocabulary of a 3-year-old.  Through the 1980s and 1990s, ASR turned towards prediction using statistical methods.  As computers with faster processors arrived towards the end of the millennium, Dragon launched the first consumer product (Dragon Dictate).  I remember gleefully thinking that I could use it to take all my university lecture notes: all I had to do, the thinking went, was to send my Dragon-enabled laptop into class in a friend's backpack while I went off swim training.  Alas, it wasn't that good.

The last 5 years have seen a step change driven by advances in deep learning, big data and the availability of faster processing power from cloud computing.  We are now seeing an explosion of both voice search apps and personal assistants.  Apple introduced its (initially not very) intelligent personal assistant Siri on the iPhone 4S in 2011.  Google offered its “Voice Search” app that uses data from billions of search queries to better predict what you're probably saying.  In 2014, Amazon launched the Echo, a wireless speaker and voice command device that responds to the name Alexa and can be told to play music, make to-do lists and set alarms.  Google has now announced Google Home, a Wi-Fi speaker with a built-in voice assistant to answer questions and control web-enabled devices in your home.  Google recently provided open access to its new enterprise-level speech recognition API and leads the market in English, while Baidu’s “Deep Speech 2” has become ubiquitous for Mandarin.  A stable of social robots has been developed that can talk to and apparently empathise with us.  It is almost time to throw away the keyboard.

According to a Northstar Research study, half of all adults and teenagers use voice search every day (Siri, Google Now or Cortana).  Google Now has an error rate of 8% compared to about 25% a few years ago.  Recently Google open-sourced its TensorFlow machine learning system, which underpins its ASR, and Microsoft followed suit by open-sourcing its Computational Network Toolkit (CNTK), the engine behind the ASR in its Cortana virtual assistant.  This will spur the rapid development of an array of ASR apps.

Image credit: Internet Trends 2016 – Code Conference, presentation by Mary Meeker, Kleiner Perkins Caufield & Byers.

How does cutting-edge ASR work?

ASR systems are based on two key models:

  1. Acoustic model: represents the relationship between an audio signal and the phonemes (sounds) or other linguistic units that make up speech.  The acoustic model is built from audio recordings of speech (split into small consecutive slices or "frames" of a few milliseconds, each analysed for its frequency content) together with their text transcriptions.  Software then creates statistical representations of the sounds that make up each word, and the result is a probability distribution over all the phonemes in the model.  Leading ASR systems use deep neural networks (DNNs) as the core technology to model the sounds of a language and assess which sound a user is producing at every instant in time.  DNNs can be "trained": loosely modelled on the human neural system, they learn from examples, in this case millions of recordings of speech.  Critically, ASR systems can process this information blazingly fast and can work in noisy environments.
  2. Language model: a probability distribution over sequences of words that estimates the relative likelihood of different phrases.  The language model provides context to distinguish between words and phrases that sound similar.  For example, the phrase "It's easy to wreck a nice beach" is pronounced almost the same as "It's easy to recognise speech".  Given the context of the conversation, the language model enables this ambiguity from the acoustic model to be resolved (a toy illustration follows this list).
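
To make the idea concrete, here is a minimal sketch of how a language model breaks the tie between two acoustically similar hypotheses.  The bigram probabilities below are invented purely for illustration; a real system learns them from millions of sentences and combines them with the acoustic model's scores.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); values invented for illustration.
bigram_prob = {
    ("easy", "to"): 0.2,
    ("to", "wreck"): 0.0005,
    ("wreck", "a"): 0.3,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
    ("to", "recognise"): 0.01,
    ("recognise", "speech"): 0.4,
}

def sentence_log_prob(words, floor=1e-6):
    """Sum log bigram probabilities; unseen word pairs get a small floor value."""
    return sum(math.log(bigram_prob.get(pair, floor)) for pair in zip(words, words[1:]))

hypothesis_a = "it's easy to wreck a nice beach".split()
hypothesis_b = "it's easy to recognise speech".split()

# The decoder keeps whichever hypothesis has the better combined acoustic + language score;
# here the language model alone already prefers hypothesis_b.
print(sentence_log_prob(hypothesis_a))   # lower score (less likely)
print(sentence_log_prob(hypothesis_b))   # higher score (more likely)
```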

Creating a highly accurate ASR system requires large amounts of data to train both the acoustic model and the language model.  By using domain-specific models, trained on a large volume of data, it is possible to achieve accuracy greater than 95%.
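
As a point of reference, transcription accuracy is usually reported via word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words.  Accuracy above 95% corresponds roughly to a WER below 5%.  Below is a minimal sketch of the standard edit-distance calculation; the example phrases are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("buy ten cable at the figure",
                      "buy ten cable at the finger"))  # 1 error over 6 words ≈ 0.167
```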

Voice in the financial markets
Nearly all the key interactions in the financial markets are done by voice - trades or enquiries related to transactions that are large, complicated or for an illiquid product; important advice and important client interactions.  This is simply because voice is the predominant form of human communication and it has qualities that other media do not match: immediacy, empathy, nuance, instant feedback.  If you walk onto the trading floor of a large bank or broker you will see a thousand people, all on the telephone - talking, shouting, communicating.  The financial markets run on voice.

The opportunity to apply ASR to the financial markets is immense and will be truly transformational.

Traders will become more efficient, vast amounts of data will be harnessed, fewer mistakes will be made, and there will be vastly more transparency and auditability.  Regulators will have the same level of transparency and audit trails with voice interactions as with electronic ones, and this will allow market participants to choose the form of communication that best serves their needs.  The markets will become safer, cheaper and more efficient.

The GreenKey approach to ASR
Our approach is highly flexible and we are working with the industry to develop the standards.  The GreenKey team has a background in financial telephony and algorithmic trading, making us well suited to applying ASR to the domain of financial markets.  We chose the leading open-source ASR frameworks, DNN models and rule-based WFST graph decoders.

We optimised them for parallel computing and we use GPUs to give us blazingly fast processing speed.  We have developed our models with thousands of hours of "trader speak" to make them highly domain specific.  We then train the models further for each specific user by having them read text into the system.  The system analyses the person's specific voice and uses it to fine-tune the acoustic model.  The GreenKey app can be deployed on any voice device that a trader or broker uses, including trading turrets, desk phones, mobiles or any other audio device.  All spoken communications can then be converted into usable data that can be leveraged to drive workflows or analysed for compliance, regulatory, sales and trading purposes.

To increase the accuracy even further for trading applications, we use a custom grammar approach that works like the FIX protocol for electronic communications.  In ASR, the grammar describes all the likely terms the engine should expect and controls when to switch various speech functions on and off.  When well tuned, the grammar delivers robust results by better defining the target space and determining whether to acquire, ignore or expand upon certain inputs.

In markets where traders typically use a standard syntax and sequence of words for their trades, we have established that grammar within the system to make the engine more efficient and reliable.  GreenKey is developing global industry standards for voice communications and voice trading. 
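
As a rough illustration of the grammar idea, here is a minimal sketch that maps a recognised utterance onto structured, FIX-style trade fields.  The syntax ("buy/sell <quantity> <instrument> at <price>") and the instrument list are assumptions for illustration only, not GreenKey's actual grammar.

```python
import re

# Hypothetical mapping from trader slang to instrument symbols.
INSTRUMENTS = {"cable": "GBP/USD", "euro": "EUR/USD", "dollar yen": "USD/JPY"}

# Assumed utterance shape: "buy/sell <quantity> <instrument> at <price>".
PATTERN = re.compile(
    r"(?P<side>buy|sell)\s+"
    r"(?P<qty>\d+)\s+"
    r"(?P<instrument>cable|euro|dollar yen)\s+"
    r"at\s+(?P<price>[\d.]+)",
    re.IGNORECASE,
)

def parse_quote(transcript: str):
    """Map a recognised utterance onto structured trade fields, FIX-style."""
    match = PATTERN.search(transcript)
    if not match:
        return None  # utterance falls outside the grammar: ignore or escalate
    return {
        "side": match["side"].lower(),
        "quantity": int(match["qty"]),
        "symbol": INSTRUMENTS[match["instrument"].lower()],
        "price": float(match["price"]),
    }

print(parse_quote("buy 10 cable at 1.2350"))
# {'side': 'buy', 'quantity': 10, 'symbol': 'GBP/USD', 'price': 1.235}
```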

Our domain-specific approach gives very high accuracy (routinely higher than the Google API, and greater than 95%) that would not be achievable without our:

  • Optimal acoustic model: because the voice originates on GreenKey, we can detect utterances, then normalise and filter the audio into a desired sampling rate (see the sketch after this list).  This is crucial because recordings from trader turret systems have unique characteristics that confuse generic ASR systems, including low-bandwidth cabling and signalling and unusual sampling rates.  Further, we have a speaker-dependent system that uses machine learning to train the system to the individual user's pronunciation.  We can then append new features to the acoustic model that are specific to each user, which further improves accuracy for that particular user.
  • Proprietary language model: with our large data set of financial-market-specific statements and conversations, we have been able to optimise our language model.
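
Here is a minimal sketch of the kind of pre-processing described in the first bullet: peak-normalising the level of a mono recording and resampling it to the rate the acoustic model expects.  The 8 kHz-to-16 kHz rates and the use of NumPy/SciPy are assumptions for illustration, not a description of the production pipeline.

```python
import numpy as np
from scipy.signal import resample_poly

def prepare_audio(samples: np.ndarray, source_rate: int = 8000,
                  target_rate: int = 16000) -> np.ndarray:
    """Peak-normalise a mono signal and resample it to the target rate."""
    samples = samples.astype(np.float32)
    peak = float(np.max(np.abs(samples))) or 1.0   # avoid dividing by zero on silence
    samples = samples / peak                        # scale into [-1, 1]
    return resample_poly(samples, target_rate, source_rate)

# e.g. one second of (simulated) low-bandwidth turret audio at 8 kHz
turret_audio = np.random.randn(8000)
print(prepare_audio(turret_audio).shape)            # -> (16000,)
```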

Use Cases

We use our GreenKey ASR platform for two distinct products, and we are only just scratching the surface of use cases for each of them: 

  • General transcription: turning every spoken word into text (.wav files into .txt files) to create searchable, usable data.  To achieve very high levels of accuracy we do this with a small delay, which allows the language model to fully apply its probability determinations.
  • Voice commands (voice-driven workflows): extracting certain keywords from a conversation in real time to drive workflows, using a user-defined mapping and framework approach (a sketch follows this list).  This can also include using a keyword in a live stream of data to start and stop the ASR.
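
As a rough sketch of the second use case, the snippet below scans a stream of recognised fragments and fires a workflow action whenever a mapped keyword appears.  The fragment stream, the keywords and the actions are hypothetical and stand in for a user-defined mapping.

```python
from typing import Callable, Dict, Iterable

def route_keywords(transcript_stream: Iterable[str],
                   actions: Dict[str, Callable[[str], None]]) -> None:
    """Scan each recognised fragment and fire the action mapped to any keyword it contains."""
    for fragment in transcript_stream:
        lowered = fragment.lower()
        for keyword, action in actions.items():
            if keyword in lowered:
                action(fragment)

# Hypothetical user-defined mapping: spoken keywords driving downstream workflows.
actions = {
    "done":   lambda text: print("log trade:", text),
    "ticket": lambda text: print("open ticket:", text),
}

route_keywords(["buy 10 cable at 1.2350 done",
                "raise a ticket for the client"], actions)
```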

Conclusion
ASR has come a very long way in the last 5 years and is ready to transform our lives. When we talk about "Big Data", we often forget that the problem is as much related to data capture as it is to data analysis. GreenKey is about enabling intelligent voice data capture to help customers improve their analytics capabilities.

Having built the leading financial-markets domain-specific engine, at GreenKey we are super excited to digitize voice and change the way the markets communicate.

Acronyms and Terminology:

  • Automatic Speech Recognition (ASR): the methodologies and technologies that enable the recognition and translation of spoken language into text by computers.
  • Keyword Spotting:  a subfield of ASR that deals with the identification of keywords in utterances.
  • Transcription: the conversion of human speech into a text transcript.
  • Machine Learning: the construction of algorithms that can learn from example inputs and make data-driven predictions by building a model.
  • Deep Learning: applying machine learning to model high-level abstractions to discover patterns in large datasets with complex structures.  The first successful use case for deep learning was ASR.
  • Deep Neural Networks (DNNs): a biologically-inspired programming paradigm which enables a computer to learn from observational data.
  • Natural Language Processing (NLP): a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
  • Graphics Processing Unit (GPU): a specialised processor that complements the central processing unit (CPU) in a computer.  GPUs have a highly parallel structure that makes them more efficient than CPUs for algorithms where large blocks of data are processed in parallel.
  • Weighted finite-state transducer (WFST) decoder: a unified framework that integrates the many different models (acoustic, pronunciation, language) into a single search graph via composition operations, improving search efficiency through optimisation algorithms and giving the flexibility to add new models.