Voice. Many of us take it for granted, but it has been among the most powerful tools at our disposal since the beginning of civilization. And yet for the vast majority of the hours we spend interacting with technology, we do it quietly. Unless, of course, you are chatting away on Skype or one of the myriad communication apps that have swamped the market. The thing to note is that in a two-way conversation with one or more people on each end, a few things are happening, as data is being:
- Created by real people on each end of the medium or channel
- Exchanged between real people
- Processed by real people
- Sometimes recorded by a machine and archived
Voice is simply a human medium. This post is predominantly a reflection on what happens when the node on one end of the exchange, or in the future on both ends, becomes a machine that is able to interact through voice. Through conversation. Enter Conversational Interfaces. Today they are starting off with a machine on one end and a human on the other. Tomorrow we enter Star Wars, where R2-D2 and C-3PO exchange data through voice (that’s right, George Lucas may have thought of it first).
We like to think of the advent of voice technologies in three waves:
- Recorder (yes, as in a simple recording device, but even that went digital in the 1980s)
- Interactive Voice Response (IVR) via telephony (voice recognition and subsequent action)
- Voice for a Conversational Interface (CI)
Video killed the radio star. There was a move towards the visual as the best means of showing value, meaning and context. Voice became a part of video and was subsumed by it. No one really thinks about how important audio cues are until they realize how a director manipulates their feelings through the score while they watch a horror movie. Try watching the same scene again without the sound. It’s not as scary. The human brain layers audio information over the visuals to make the experience richer. Our thesis is that, much like this, voice is here to stay and today’s run of the Graphical User Interface is about to get turbocharged through the introduction of CI.
True voice recognition traces back to the work of a Princeton man. Another American first. Hidden Markov Modelling (HMM) is a complex mathematical pattern-matching technique developed by Lenny Baum and his colleagues at the Institute for Defense Analyses in Princeton, New Jersey. The research was then shared with the likes of ARPA (the military, as always, was ahead of the game, and the Advanced Research Projects Agency is also credited with having invented the precursor to the internet) and, of course, those old chestnuts Big Blue (IBM) and Bell Labs at AT&T.
HMM enabled significant research by private and military scientists, which in turn allowed for the creation of IVR technology. IVR could perhaps be described as a Fintech enabler, as it was first developed by Charles Schwab so that hundreds of customers could call in and get quotes on stocks and options. Once the technology was proven at scale, it was widely adopted across corporate America well into the 1990s. Early use of IVR must have been significantly frustrating for some customers due to the limited vocabulary it recognized. IVR was and remains, contrary to public impression, much more than a dumb terminal. The technology takes voice as an input, recognizes it, processes it and then returns a result from a finite set of expected outcomes. In that sense it is largely rule-based. Many, like myself, who have seen the world of call centers evolve view IVR as the backbone that enabled them. This was largely possible thanks to Computer Telephony Integration, which allowed voice to be provided as an input to computers for processing and actioning. Financial Services used IVR to effectively create ‘extended office hours’ for simple information that could be provided to customers about their accounts, products and services.
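For readers who like to see the idea in code, here is a minimal sketch in Python of the rule-based step described above: recognized speech goes in, and one reply from a finite set of expected outcomes comes out. It assumes the speech has already been turned into text by a recognizer, and every name in it (INTENTS, FALLBACK, match_intent, respond) as well as the keywords and replies are hypothetical, made up purely for illustration. It is not how any particular IVR product works.

```python
import re
from typing import Optional

# Hypothetical finite set of expected outcomes: each intent maps
# trigger keywords to one canned reply. All values are illustrative.
INTENTS = {
    "balance": ({"balance", "funds"}, "Please verify your identity to hear your balance."),
    "quote": ({"quote", "price", "stock"}, "Please say the ticker symbol you want a quote for."),
    "hours": ({"hours", "open", "opening"}, "Our branches are open 9am to 5pm, Monday to Friday."),
}

# Anything outside the expected outcomes falls through to a human.
FALLBACK = "Sorry, I did not understand. Transferring you to an agent."


def match_intent(transcript: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the transcript, else None."""
    # Lowercase and keep only word characters so "balance?" matches "balance".
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    for name, (keywords, _reply) in INTENTS.items():
        if words & keywords:
            return name
    return None


def respond(transcript: str) -> str:
    """Map an already-recognized utterance to one of the canned replies."""
    intent = match_intent(transcript)
    if intent is None:
        return FALLBACK
    return INTENTS[intent][1]


if __name__ == "__main__":
    for utterance in ["What is my balance?", "Are you open today", "Tell me a joke"]:
        print(utterance, "->", respond(utterance))
```

The limited vocabulary that frustrated early callers shows up here as the small keyword sets: anything the rules do not anticipate lands on the fallback.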
CI will not exist in a vacuum. Never before could we analyze this much data in the palm of our hand this quickly. Nor could we store this much data in such small components. Moore’s law has certainly enabled an entirely new capability for voice, and mobile has essentially broken the leash that tied the beast to its desktop kennel. Enter voice analytics, the sister to CI, along with Artificial Intelligence, to decipher complex meaning and provide a suitable, appropriate and successful response.
There is complexity in voice around accent (will a non-native speaker who has trouble pronouncing words ever be as enabled?) and in the nuances of mood (will AI be able to interpret mood correctly and distinguish excitement from anger?). What about the fact that we spend an inordinate number of minutes saying ‘can you hear me now’ because of poor network coverage? Will we be as patient with the Voice Bots? We are less forgiving when it comes to machines. The expectations are heightened. And disappointment can quickly lead to abandonment. There is also the very serious challenge of how those who are deaf or partially deaf can participate in what I deem to be the ‘Voice economy’. In time I am sure that voice will somehow be able to transcend these challenges through sheer processing horsepower, or perhaps even through the use of vibration to create visual cues (think about sound being used to manipulate particles that form letters on the other end).
In a bid not to get cut off from its main source of revenue, visual advertising, Google has launched its own platform to facilitate voice. The three horsemen of the voice apocalypse are Google (Home), Amazon (Echo) and Apple (Siri). The future of advertising in CI is unclear. Is there a return to the era of radio jingles? Will that be an unacceptable disruption to the User Experience around voice? Skills are only just being written into the Echo repositories around the world, and it’s too early to set standards in stone, though best practices are emerging and fast being proven or disproven. One certain challenge is the depth and breadth of voice offerings needed to captivate and retain the user base. Can voice develop a broad App ecosystem as quickly as GUI did?
Voice is a deep and complex field with more data points than would appear obvious to the untrained ear. The computer of today, however, is a sophisticated listener and can interpret a multitude of data points related to voice that provide clear and consistent data on the origin and provenance of the voice print. It is not entirely surprising to see some banks, like HSBC, now implementing voice-based security. Though I have to say that there is something distinctly unattractive about repeating the words ‘my voice is my password’ to authenticate oneself over a phone call to what seems to be a next-generation IVR Bot. Emotional Intelligence classes for Bots are likely to be available at General Assembly in the coming decade.
But before we get all excited about the power of what voice can do, we as technologists need to collect data to train the machines to better serve the complexity of the languages people use as they seek help with the utilities that service their lives. Our hypothesis is that voice will eventually be a dominant interface, supported by GUI. The reason why GUI is likely to stick around until we have implantables and total integration with human thought and consciousness is likely to be as mundane as the office worker not wanting everyone to know that he is shopping for home and contents insurance while sipping the morning coffee at his desk. The good news: turning the newspaper page with sticky donut fingers and buying insurance just got a whole lot easier.
We have pondered further on the use of voice in insurance and look forward to discussing more of this with you next month. Until then please do write in with thoughts and feedback to firstname.lastname@example.org.