Engineering war stories from teaching robots how to talk to humans
Björn Helgeson has been working with voice interactions in different shapes and forms over the years - Google Actions, Amazon Skills, Sphinx and the Furhat Robot platform. In this post, he describes some basics behind the flourishing voice technology field.
Coming from a world of web and graphical user interfaces, creating voice interaction experiences has been a great learning tool for me - for understanding new technology and age-old human behaviour alike. Over the course of the last decade we've seen voice technologies go from being quirky peculiarities in automated answering machines, to taking up more and more space in our homes, phones and cars. This is a write-up of some experiences from working with these technologies - some under-the-hood peeks, common misconceptions and lessons learnt.
The first thing people usually don’t realize about computers understanding humans is that there are two distinct processes involved. The first part is ASR (Automatic Speech Recognition, or more simply put: speech-to-text), and the second part is NLU (Natural Language Understanding). You can think of ASR as processing a sound file with some speech in it, and getting back a text of what that speech was. Figuring this out is something we engineers rely heavily on machine-learning algorithms for. This means that whoever has the most data is likely to build the best speech-to-text services. That, in turn, means that most people working with voice interaction technologies use one of the big boys’ services - Google, Amazon, Apple, etc.
So ASR for many engineers is about obtaining a sound-file from the user, sending it off to Google and getting a response back with some text. Magic. Most folks think the work is done at this point. Oh no - this is where the second part comes in.
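The two-stage shape of this pipeline can be sketched in a few lines. Both functions below are stand-ins: a real `asr` would send the audio to a cloud speech-to-text API, and a real `nlu` would be a trained model or rule engine - the names and the returned fields are illustrative assumptions, not any particular vendor's API.

```python
def asr(audio_bytes: bytes) -> str:
    """Stand-in for a speech-to-text service call.

    A real implementation would POST the audio to a cloud API and
    return the top transcription hypothesis.
    """
    return "paris to london tomorrow around noon"

def nlu(text: str) -> dict:
    """Stand-in for an NLU engine: map a transcript to structured meaning."""
    return {"intent": "book_trip", "raw_text": text}

def handle_utterance(audio_bytes: bytes) -> dict:
    text = asr(audio_bytes)   # step 1: speech-to-text (ASR)
    return nlu(text)          # step 2: text-to-meaning (NLU)

print(handle_utterance(b"...")["intent"])  # book_trip
```

The point of keeping the two stages separate is that they fail differently: ASR errors give you the wrong words, NLU errors give you the wrong meaning for the right words - and you debug them with different tools.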
Figure 1. Can you figure out what was was said in this sound clip? Google can. It was ‘potato salad’.
NLU is interpreting what a text means in a specific context. If I’m building an automated travel agent, I need to codify into my app how the text “/Paris to London tomorrow around noon/” should be destructured in that specific context. From a sentence like that, perhaps we could extract something like the following data:

origin: Paris
destination: London
date: tomorrow
time: around 12:00
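A toy version of that extraction can be done with a regular expression. This is a deliberately naive sketch - real NLU engines use trained models rather than one hand-written pattern, and the slot names here are my own invention:

```python
import re

# Naive slot extractor for the travel-agent example. The pattern only
# handles this one sentence shape; a real NLU engine generalizes far
# beyond it.
PATTERN = re.compile(
    r"(?P<origin>\w+) to (?P<destination>\w+) (?P<date>\w+) around (?P<time>\w+)"
)

def extract_slots(text: str) -> dict:
    match = PATTERN.search(text)
    return match.groupdict() if match else {}

print(extract_slots("Paris to London tomorrow around noon"))
# {'origin': 'Paris', 'destination': 'London', 'date': 'tomorrow', 'time': 'noon'}
```

The regex falls over the moment a user says “to London from Paris” or “around lunchtime-ish” - which is exactly why the next paragraph talks about machine-learned models and piles of response examples.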
This is usually a mix of machine-learned algorithms, codified business logic and coming up with a whole bunch of response examples to feed to your NLU engine. But as we can see, even after we have a textual representation of what the user said, our job is only half done - we still need to extract meaning from it. And this is far from trivial.
Because even for simple yes-or-no questions, humans have a plethora of weird ways to express themselves that a computer needs help discerning. To the question “Would you like to continue?”, humans can come up with any number of things other than a plain “Yes” - things that other humans would interpret as affirmative, like “go on”, “proceed”, “go ahead”, “let’s do it”, etc. You can also have cases where the same response to a yes/no question means opposite things depending on the context. The response “Sometimes” is semantically interpreted as “Yes” when the question is “Are you ever lonely?”, but as “No” when the question is “Are you always on time?”.
Every question has its own context, and every question its own response space. For a high-performing response handler, even to a simple question, you’ll most likely need to be able to interpret thousands of different texts. So designing quality voice interactions means working a lot with the NLU engine to make it extract the right meaning from your users’ responses.
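The “Sometimes” example above can be made concrete with a small context-aware interpreter. The phrase lists are tiny illustrative samples - a production handler would hold thousands of examples per question, usually learned rather than hand-written:

```python
# Map free-form answers to yes/no, with context-dependent handling of
# ambiguous responses. Phrase sets are illustrative, not exhaustive.
AFFIRMATIVE = {"yes", "yeah", "go on", "proceed", "go ahead", "let's do it"}
NEGATIVE = {"no", "nope", "not now", "never"}

# For "sometimes", the polarity depends on the question that was asked.
CONTEXTUAL = {
    "sometimes": {
        "are you ever lonely?": "yes",    # "ever": any occurrence counts
        "are you always on time?": "no",  # "always": any exception counts
    }
}

def interpret_yes_no(answer: str, question: str) -> str:
    answer = answer.strip().lower()
    if answer in AFFIRMATIVE:
        return "yes"
    if answer in NEGATIVE:
        return "no"
    return CONTEXTUAL.get(answer, {}).get(question.lower(), "unknown")

print(interpret_yes_no("Sometimes", "Are you ever lonely?"))    # yes
print(interpret_yes_no("Sometimes", "Are you always on time?")) # no
```

Note that the interpreter takes the question as a parameter - that is the whole trick: the same answer string routes to opposite meanings depending on which question is currently on the table.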
Machine learning algorithms can come up with some pretty weird stuff
Back to the first part of language understanding again. Since the speech-to-text services that we voice engineers use on a daily basis are machine learnt, they are trained on lots and lots of examples of real-life audio. That makes them very context-dependent. Google’s and Amazon’s speech-to-text will have a much better chance of coming up with the correct transcription if you give them some context. Answering “Yes I have” will be handled much better than only “Yes”, because the machine learning algorithms have figured out that “Yes I have” is a common pattern of words - and also that “yes” acoustically resembles so many other things, like “best”, “yeast”, “Jeff”, etc.
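One way this context dependence surfaces in practice: ASR services return an n-best list of hypotheses, and you can bias it toward phrases you expect in the current dialog state (cloud APIs expose this as “phrase hints” or speech adaptation). Here is a toy rescorer in that spirit - the scores and the boost value are made up for illustration:

```python
# Bias an ASR n-best list toward phrases we expect in this dialog
# state. Scores are hypothetical acoustic confidences.
def rescore(nbest: list[tuple[str, float]], expected: set[str],
            boost: float = 0.3) -> str:
    """Return the hypothesis with the highest context-biased score."""
    def biased(hyp_score):
        hyp, score = hyp_score
        return score + (boost if hyp in expected else 0.0)
    return max(nbest, key=biased)[0]

# "yes" is acoustically confusable with "yeast", "Jeff", ...
nbest = [("yeast", 0.40), ("yes", 0.35), ("jeff", 0.25)]

# Without context we'd pick "yeast"; knowing we just asked a
# yes/no question, we pick "yes".
print(rescore(nbest, expected={"yes", "no"}))  # yes
```

Real services do this biasing inside the recognizer rather than as a post-hoc rescoring step, but the intuition - expected phrases get a score bump - is the same.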
This context-needy-ness became hilariously apparent a few years ago when I was building a word-recall game to be part of a memory-test application. The computer was reading out a bunch of random words for the user to remember and repeat: “house”, “toy”, “they”, “ice”, “fun”, etc. The more words you remembered, the higher your score. The only problem was that Google’s speech recognition algorithms could not for the life of them make sense of this list of unconnected words. Desperately trying to force some coherence out of the nonsense, the speech recognition engine consistently returned things that made more contextual sense instead, deciding the user had probably said: “House of the Rising Sun”.
We had another embarrassing problem designing a diabetes screening robot a while back. Some users couldn’t even make it past the screening selection stage - basically, answering the question "Which condition would you like to be screened for?". Digging into the logs we found something peculiar. At the top of the list of misunderstood responses was the word “diabeetus”. The engineering team was puzzled. Why would Google’s speech-to-text service consistently return a misspelled version of the ailment? After some brief googling, it turned out the voice recognition was highly influenced by a popular internet meme https://www.youtube.com/watch?v=m6CeGgzaGSE, named precisely “diabeetus” to mimic the southern accent of a man in a pharmaceutical commercial. At one point, the meme spelling of “diabetes” was so prevalent in the speech recognition responses that no matter how clearly articulated or well accentuated our attempts were (imagine a bunch of engineers desperately screaming “DI—A—BE—TES!” at a microphone), we still couldn’t produce the correctly spelled response.
To have a dialogue we need two people who are both talking - preferably not at the same time. Deciding when your dialogue counterpart has finished talking and it’s your turn to speak is a tricky subject. Even humans get it wrong sometimes. Your Home Assistant or Echo device relies mainly on how much time has passed since you stopped talking. It’s fairly crude, but usually quite effective for answers to simple questions like “Do you take your coffee with milk?” or “How old are you?” - we expect the user to make some short noise and then fall silent. However, when asking more stimulating questions like “Are you happy in life?”, the response might have more acoustic variance. The respondent might ponder the question for a while, take a dramatic pause, or stop between two sentences to think. Using silence alone makes it very hard for a robot to know if you’ve finished your “turn” or if you’re just pausing for contemplation.
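The silence-based approach described above boils down to a very small piece of logic: declare end-of-turn once the user has been quiet for some threshold. A minimal sketch, assuming per-frame voice-activity flags from upstream (frame length and threshold values are illustrative):

```python
FRAME_MS = 30            # duration of one audio frame
SILENCE_THRESHOLD = 700  # ms of trailing silence before we take the turn

def end_of_turn(is_speech: list[bool]) -> bool:
    """is_speech: per-frame voice-activity flags, oldest frame first."""
    silent_ms = 0
    # Walk backwards from the most recent frame, counting trailing silence.
    for frame_has_speech in reversed(is_speech):
        if frame_has_speech:
            break
        silent_ms += FRAME_MS
    return silent_ms >= SILENCE_THRESHOLD

# 20 frames of speech followed by 25 silent frames (750 ms of silence):
print(end_of_turn([True] * 20 + [False] * 25))  # True
# Only 10 silent frames (300 ms) - probably just a pause:
print(end_of_turn([True] * 20 + [False] * 10))  # False
```

The single `SILENCE_THRESHOLD` constant is exactly the crudeness the text describes: one number has to cover both snappy “yes/no” answers and slow, thoughtful ones.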
Humans are usually quite good at turn-taking in conversations, but have you ever been in a video meeting with a slight delay?
More sophisticated technologies can take advantage of other patterns in basic human behaviour. We tend to look away as we consider something carefully, and direct our gaze back onto our interlocutor as we finish speaking. So if we as voice interaction designers can grab a video feed of the user, we can determine the gaze direction and whether the user is facing the robot or not. If the user is looking away after the original silence threshold has passed, we wait a bit more before taking the turn. More experimental methods of improving endpoint detection involve looking at the pitch of the voice at the end of the message, or running a semantic analysis to see whether the utterance so far reads like a complete sentence.
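Combining the silence timer with the gaze signal is a small extension of the same decision: hold the turn longer while the user is looking away. A sketch, with made-up threshold values and an assumed boolean gaze signal from a vision pipeline:

```python
SILENCE_MS = 700         # base silence threshold before taking the turn
AVERTED_EXTRA_MS = 1500  # extra patience while the user looks away

def should_take_turn(silent_ms: int, user_facing_robot: bool) -> bool:
    """Decide whether the robot should speak, given trailing silence
    (in ms) and whether the user's gaze is on the robot."""
    threshold = SILENCE_MS if user_facing_robot else SILENCE_MS + AVERTED_EXTRA_MS
    return silent_ms >= threshold

# Same 900 ms of silence, two different gaze states:
print(should_take_turn(900, user_facing_robot=True))   # True: turn is over
print(should_take_turn(900, user_facing_robot=False))  # False: still thinking
```

The interesting part is that gaze doesn't replace the silence timer - it just stretches it, which matches how humans grant each other extra thinking time when the listener can see the speaker is mid-thought.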
Another weird thing we humans do without thinking much about it is backchannels. In addition to the messages described above, going back and forth in strict turns, we actually give each other a lot of information in between. When outlining your latest vacation trip to a friend, you’re probably getting a constant feed of positive backchannels - nodding, or saying “yeah”, “uh-huh”, “mhm” and “right” - indicating your counterpart’s active listening and understanding: your message is being received all right. If, on the other hand, she tilts her head and mutters “hmm…” or “eeeh”, she might be indicating that she doesn’t agree with you or isn’t following what you’re saying - negative backchannels. In the same manner we can convey all sorts of useful information to the speaker /while speaking/. Because this communication occurs off of the main, turn-based message channel, it’s dubbed the “backchannel”.
Backchannels are imperative in designing quality voice interactions. If you’re not convinced of their importance, try staying silent the next time you’re on the phone and someone’s telling you something - see how soon you’ll hear an “Are you still there?”.
To us Swedes, being notoriously consensus-seeking, the most common backchannels are the “mm” sound and the nod. We do them a lot. In fact, in a recent project we conducted a study to investigate a Swede’s typical listening behaviour, in order to emulate it in an actively listening robot. We filmed people while they listened to another human, and counted the number of times they produced any kind of backchannel. It turned out that some people were making 60 backchannels in a 30-second listening period - or 2 per second. While this sounds bizarre, it didn’t stand out to anyone as strange listening behaviour in this context. We just happen to be that eager about telling our conversation partner everything’s all right.