In the last console generation, motion control was the innovation that sold 100 million Wiis. It was thus inevitable that Microsoft and Sony would come up with their own versions of Nintendo’s technology, but while Sony’s PlayStation Move is in effect a WiiMote with its precision refined to the nth degree, Microsoft’s Kinect was from the beginning something very different. Launched in November 2010, the Kinect allowed players to interact with their Xbox 360s using their entire bodies. You could swing your arm to throw a grenade without worrying about losing your grip on a controller and breaking your TV. Dance games were finally, really dance games. Microsoft sold eight million Kinects in the first 60 days, and 24 million by February of 2013.
More than three years later, however, it can seem like we’re still waiting for the great motion game that will revolutionize everything. Wii sales plummeted in its final years, Nintendo’s successor, the Wii U, is still struggling to gain traction, and while Kinect sales have been impressive for a peripheral released five years after the Xbox 360 was introduced, there are still more than 50 million Xbox 360 consoles out there without a Kinect. With the Xbox One, Microsoft is making sure no such gap exists for its new console: every Xbox One ships with an updated version of the Kinect, even though this means a retail price $100 higher than Sony’s PlayStation 4.
Tim Carmody is a former senior writer and reporter for Wired and The Verge, and has been a columnist and contributor at Newsweek, National Geographic, Hazlitt Tech Review, Snarkmarket and other outlets. Carmody wrote at length about the Kinect at its release, and talked to me about how the Kinect’s potential as an interface technology might still be unrealized.
The design of the Xbox One seems to be premised on a bet that the Kinect will become the way that a large number of people will choose to control their home entertainment centers. What does it look like as a new interface technology is adopted?
The last major new interface technology that got adopted was multitouch touchscreens, where we moved away from keyboards and toward virtual controls. (…) The illusion of multitouch is that you’re directly manipulating something on the screen, as opposed to pressing a button that’s on the side, or using a joystick or anything else. Now, in actuality what’s happening is that there’s a lot of specific software code making certain kinds of interactions into things that the computer can understand as meaningful and significant. Some of these things are user-mappable, where you can control what a behavior means, but ultimately, what matters is that the computer, the operating system, everything that’s interacting with the interface has to register certain kinds of movement, certain manipulations, as meaningful, and others as not.
It’s the same with the camera on the Kinect. Even though the camera is registering lots of different points on your body in fairly high resolution, it has to kind of break out an alphabet of your body. That requires both some choices by the programmer and a bit of a learning curve for users, even though some of those things are going to feel “natural,” either because they’re similar to older technologies we already use or because they come close enough to imitating an action that they feel relatively intuitive. And that’s the magic—how close can it get to something recognizable? How quickly can you teach people to use these gestures?
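The "alphabet of the body" Carmody describes can be sketched as a simple classifier. This is a toy illustration, not the Kinect SDK: the thresholds, the gesture names, and the idea of classifying per-frame wrist displacement are all assumptions made for the example.

```python
# Toy sketch of reducing raw body motion to a small "alphabet" of gestures.
# Hypothetical thresholds and gesture names; not the actual Kinect pipeline.

def classify(wrist_dx: float, wrist_dy: float) -> str:
    """Map a frame-to-frame wrist displacement (in metres) to a gesture token.

    Anything below the threshold is deliberately discarded as noise --
    the "meaningless" movement the interface chooses not to register.
    """
    if wrist_dx > 0.3:
        return "SWIPE_RIGHT"
    if wrist_dx < -0.3:
        return "SWIPE_LEFT"
    if wrist_dy > 0.3:
        return "RAISE"
    return "NONE"
```

Real gesture recognizers track many joints over time with far richer models, but the principle is the same: a few movements are promoted to symbols, and everything else is ignored.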
Do you see the Kinect as participating in that illusion of directly manipulating something on the screen, or is there a level of abstraction closer to using a mouse to move a cursor that results from the distance between the user and the screen?
There are definitely certain levels of abstraction, just like when we use language we learn to take some parts of what people say as meaningful, and others as not. For example, in English, in a literal sense, we discard pitch. We may attend to pitch for tone, or for aesthetic beauty, but in a literal sense, how high or how low you say something isn’t a quality that matters.
Kinect is kind of like that, and some elements are technological limitations and some are choices. Kinect, for example, has an RGB camera that sees you in color so that it can identify you. But its main camera for understanding movement as an interface doesn’t recognize color at all. It recognizes packets of light, it measures time of flight, and it has some really sophisticated algorithms that understand the behavior of light as it hits different surfaces, so that it can get a pretty good three-dimensional picture of your body. But it’s not in color, because color is extraneous information when it comes to understanding what kind of movements you’re making in a video game. It’s always going to abstract something, because computers are always abstracting things, and human beings are too. We’re always discarding information that doesn’t matter to make better sense of what does. And especially with games, you’re trying to create rules that people can understand and that are predictable. If every single movement you made in a Kinect game were meaningful, it would be chaos. You’d have no idea what to do or where to start. So being able to reduce those movements to a specific gesture is really powerful.
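The time-of-flight measurement Carmody mentions reduces to a simple relation: the camera times how long emitted light takes to return, and the distance is half the round trip at the speed of light. A minimal sketch of that relation, for illustration only; real sensors measure modulated-light phase shifts per pixel and correct for noise:

```python
# Illustrative time-of-flight depth calculation. This is the underlying
# physics, not the Kinect's actual per-pixel processing pipeline.

C = 299_792_458.0  # speed of light in a vacuum, m/s

def depth_from_round_trip(t_seconds: float) -> float:
    """Distance to a surface, given the measured round-trip time of a pulse."""
    return C * t_seconds / 2.0

# A pulse returning after ~13.3 nanoseconds implies a surface roughly 2 m
# away -- about living-room range.
```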
Now, the Kinect is more sophisticated in the range of things it can capture than probably any other user interface. It does color, it does distance, it does shape and texture, and it also does voice. It can do height and width and speed. It can capture a range of movements better than anything short of wearing some sort of full-body sensor suit. That would be the alternative. Imagine if you had 50 WiiMotes strapped to your body, all beaming information back while the computer tried to make sense of all of them. That would give you possibly more information than the Kinect. But maybe not.
Even without a console like the Xbox, media players and cable boxes are increasingly making TVs into smart devices, but I’m not sure I’ve ever seen a touchscreen TV. Is there a reason that multitouch touchscreens haven’t been used on televisions?
There are several things that make the television set not super suitable for being a touchscreen. One is that it’s extremely large; another is that you lose a lot on some screens when your fingers obscure what you can see. (…) The television screen is also a collective space. It’s fixed, in most cases, on a wall or on a stand, and the experiences you have there are collective. So you need something that allows for collective input. The TV is also a piece of furniture that shares pretty fixed relationships to other pieces of furniture, couches and chairs and tables, and so on. For the television to be a computing device, it has to fit well into that lifestyle. It’s just not going to be adopted otherwise.
There still are a lot of trade-offs. There’s a lot of stuff you can’t do with multitouch that you can do with a keyboard. So the way the Xbox is built, it uses game controllers; it uses remote controls for movies that look a lot like television remote controls; it uses voice; and it uses gesture and motion recognition through a camera.
The other possibilities are really difficult, although some of them can work. But the camera is really powerful, and the camera is super casual. The camera is fixed in one place the same way that the television and the Xbox are. You can hook the camera and the microphone together, as the Kinect does, and have it always pointing at the user. You can actually reduce the number of assumptions about what inputs you’re going to have based on all that information. Cameras are really powerful things in the living room.
The other thing I should add is that voice and camera are powerful for both gaming and browsing video, and you can borrow gestures from smartphones. You can move your hand from left to right and all of a sudden you’re swiping through albums of music or movies the same way you would on your touchscreen. You can leverage tools and skills you already have from another technology. The Kinect is really well suited to the television as a technology, the kind of anthropology and the physical restrictions of the television. And it’s also pretty well suited to the kinds of uses we have for televisions, established or emerging: linear programming, recorded or on-demand video, gaming, and telecommunication.
Telecom is kind of the new emergent medium that we haven’t really used televisions for before, but it’s something that’s happening, led by the Kinect but happening in lots of other places too. Comcast’s X1 box has a camera and a microphone, and this is something more and more companies are interested in as a way to extend what these devices can do.
That isn’t to say the Kinect is very good at everything. It probably isn’t as good at fine hand motion as anything with an accelerometer or a gyroscope in it like a WiiMote or a smartphone. It is reasonably good at voice commands. It’s starting to get pretty good at distinguishing between multiple people in a room, which is actually a real challenge.
On PCs, Microsoft encouraged the use of open source drivers that allowed people to develop their own applications using the Kinect as an input device. Is there any chance that the Xbox One will eventually include the same sort of user configurability? Is that the sort of thing that people are looking for in the middle of their living room?
It can be a big advantage to companies that offer more configurability. The foil right now, more than Xbox vs. PlayStation, might be the number of people developing games based on open-source technology—Ouya’s Android-based console, or Valve’s Steam Box. Microsoft has dominated PC gaming through Windows and dominated console gaming with the Xbox, so somebody like Valve is extremely interested in alternate forms of input, in giving both users and developers greater access than a more locked-down system gives them. If that really becomes the arms race, configurability could be something that gives a platform an edge. There are Kinect hackers and media-lab people who are using the cameras and the basic toolkits and the code to try to extend and enlarge the capabilities of the Kinect. Having four Kinects in a room, for instance, and then running that information into a PC, creating something that can be more massively multiplayer. You could outfit an entire game space with Kinect cameras, and people could wear Oculus Rift headsets, Google Glass, or some other kind of heads-up display, and you could do real, live-action gameplay using a combination of these tools. That’s one direction things could head when you get these tools talking to each other.
At the same time, when the Kinect came out, the Xbox sold crazy numbers, and what they sold the most of were Xboxes without Kinects. People bought Xboxes because games were great on the Xbox, and watching movies was great on the system. So access to content becomes more of a draw for most people than access to alternate input. And that’s finally what it all comes down to. The Wii was such a success because Nintendo made really great games for the Wii—Wii Sports, Super Mario, Zelda, the things they made themselves that really figured out that controller. They made some really good stuff that was really compelling and got a lot of people back into gaming. It’s the same with the Kinect: if they can make games specifically for the Kinect that are super compelling, then they’re going to win. There are two competition fronts, access to content and the kind of input, and they always go hand in hand, because the experience is always about the two working together. Microsoft is pretty good at partnering with people, especially if they can figure out a way that it all benefits them. If I had to bet on anybody in this space, I’m going to bet on Microsoft.
Header image via Nicola since 1972