Voice input is the future. So why is it so bad in videogames?

In Her, a new film by Spike Jonze, Joaquin Phoenix portrays a desperate writer who falls in love with the tender, caring voice of his operating system. Crazy? Well, if you ask Siri, “Do you have a boyfriend?” she’ll reply, “I have you. That’s enough family for me.” But push much further than that and things misfire. The courting ritual feels as awkward as a tongue-tied attempt to ask permission to go to the restroom in high-school Spanish class.

When it comes to our relationships with our computers, it looks like love is out of the question for the time being. But what about free-flowing, platonic conversation, the way you would chew the fat with another human being? When will there be computers that can laugh at our jokes, encourage us to get out of bed, and ask about the grandkids? For now, we’re mostly dictating commands to machines—telling our Kinect to change the channel, or ordering our battle units to take out the snipers in the trees.

“We’re starting to use speech recognition for managing a calendar, or searching the web, but using it for conversation is a very new thing,” says Martin Reddy of ToyTalk, whose app The Winston Show encourages kids to yak with a googly-eyed host who looks like he’d be fun to squeeze. But the aim in the long run is to turn task management into entertainment, facilitating open-ended conversation that doesn’t always end predictably.

In developing their loquacious software, ToyTalk is on a mission to discover the art and science of chit-chat, Reddy tells me. That they are even posing the question shows how far speech recognition has come. Audrey, the earliest speech recognition program, developed at Bell Laboratories in 1952, could only recognize spoken digits. Nowadays, programs like Siri can interpret Southern accents and draw on geo-locational data. But when it comes to the social conventions of conversation we take for granted, such as reading cues and understanding inflection, computers remain clueless.

Getting to a point in artificial intelligence where man and machine exchange ideas colloquially is easier said than done. Many hurdles stand in the way, the most apparent being that these robotic voices just don’t understand us very well. “Voice is an incredibly inaccurate and imprecise input device,” Reddy says, noting that voice recognition software hears things right only around 80 to 90 percent of the time.

There are other niggling issues that often make our artificial interlocutors behave like drunken stooges. Homophones, words that sound the same but mean different things, are an endless source of confusion. A program like Siri knows that we typically aren’t talking about dried grass when we say “hey,” but a farmer might have trouble googling futures for “hay” on the commodities market, though he might find a good price on a condominium in “Haiti.” Speech is full of obvious obstacles that can trip up a non-human listener, such as telling whether someone is actually speaking or whether the garbage disposal is just running. On top of that, computers have no reliable way to tell when more than one person is talking.

Admittedly, this puts a serious damper on the romantic prospect of getting up close and personal with the lady inside your phone. In other instances, though, the disembodied voice may make for a better listener than a real-life significant other. “On the AI side, we know how to respond to you,” Reddy says, pointing to the sophisticated profiling capabilities of their software. “We can pick something that’s more tuned to your location, age, and gender,” he says. The longer you converse with Winston, the more personalized his responses become. “There is definitely adaptation and learning going on,” he says.

Still, our deepest thoughts are pretty insignificant if these brilliant binary calculators can’t relate to what we’re feeling inside. Judging mood and emotion is a steep challenge for voice recognition software, but Reddy and his team think they can crack it. “Are you saying it sadly? This is something humans can pick up on. We should pick up on that too,” he says. The problem is that while humans are wired to be emotional, all electronic devices have are speakers, cameras, and a microphone. These become part of the workaround. “A big thing for us is knowing how well our jokes land, so we do laugh detection. Is there laughter in the background?” Reddy says, calling the process “sentiment analysis.” He says that we are in the initial phase of emotion-savvy devices, although techniques such as monitoring the pitch of the voice to detect depression and using facial recognition to capture a smile are coming.
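To give a rough sense of what pitch-based mood detection might look like, here is a minimal, purely illustrative sketch, not ToyTalk’s actual system. It assumes the open-source librosa audio library, a hypothetical recording called clip.wav, and invented thresholds for what counts as a low, flat voice.

```python
# Illustrative sketch only: guess a speaker's mood from the pitch of a clip.
# Assumes the librosa library is installed; thresholds are made up.
import numpy as np
import librosa

def estimate_mood(wav_path: str) -> str:
    # Load the recording as a mono waveform at its native sample rate.
    y, sr = librosa.load(wav_path, sr=None, mono=True)

    # Track the fundamental frequency (pitch) over time, limited to a
    # typical adult speaking range of roughly 65-300 Hz.
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=300, sr=sr)
    pitch = f0[voiced & ~np.isnan(f0)]
    if pitch.size == 0:
        return "no speech detected"

    # Crude heuristic: a low, monotone voice reads as subdued. A real system
    # would fuse many more cues, such as word choice, energy, laughter, video.
    if np.median(pitch) < 120 and np.std(pitch) < 20:
        return "subdued"
    return "animated"

print(estimate_mood("clip.wav"))  # hypothetical input file
```

Even this toy version shows why Reddy calls it an initial phase: pitch alone can’t separate a sad speaker from a tired one, which is why laugh detection, facial recognition, and other cues get layered on top.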

While efforts to implement voice recognition in videogames have so far been lackluster, Reddy sees enormous potential down the road. “I’m playing a lot of Borderlands 2 with my friends. We’re talking over wireless microphone, using our voice for strategics. It would be nice if you could talk to the non-playable characters in the same way,” he says. This could mean having a back-and-forth with the captain driving your Warthog, or perhaps even one day using your human charms to seduce an anthropoid in Mass Effect. In a world populated by both real and artificial intelligences, you couldn’t tell the difference between computer-controlled characters and the avatars of other players. And you’d stop to think: is this one real, or not?

Microphone image via National Film and Sound Archive Australia