

Stop your VR experiments, it’s time to talk ethics

In 1961, Stanley Milgram began a famous series of experiments testing the nature of obedience to authority. His paradigm—which involved instructing participants to administer what they believed were painful and even lethal electric shocks to a person in another room—is now primarily taught in undergraduate psychology courses as an example of an unethical research design. Critics argued that, by deliberately exposing uninformed volunteers to an emotionally distressing situation, Milgram created risks for the people in his lab that outweighed the potential benefits of the experiment’s results.

It is essential to remember that Milgram’s motivation in constructing his design was not sadistic or random, but driven by a specific and powerful political agenda. Beginning with the well-publicized Nuremberg trials at the end of World War II and continuing into the 1960s, former Nazis on trial repeatedly defended their actions by claiming that they were following orders and powerless to disobey. U.S. political leaders felt pressured to establish that such large-scale coercion could not be possible under the American value system, which spawned an era of rigid nationalism and communist witch-hunting. Even Milgram began his experiments assuming that only a small minority of participants—all Americans living in or around New Haven, Connecticut—would be compelled to blindly follow authority to the point of harming another individual. When he found that in fact 65 percent of volunteers were willing to administer the highest level of voltage, he allowed this data to inform and revise his hypothesis, even though it ran counter to prevailing social sentiments that “good Americans” were moral, independent, and strong-minded. His design may have been unethical by today’s standards, but his scientific integrity remained intact.

///

In 2005, two psychological researchers named Sheese and Graziano published a report of an experiment in which they tasked participants at Purdue University with playing Doom (1993) for 25 minutes. Alone in small, windowless rooms, half of the volunteers played the game as usual, using guns to shoot opponents; they were marked in the researchers’ spreadsheet as the “violent group.” Those in the “non-violent group” walked the game’s lo-fi corridors with no weapons and nothing to do for the same amount of time. All participants then transitioned out of the game and into a decision-matrix scenario—commonly used in social science research—in which they had to choose either to cooperate with or to defect from another person in order to earn points. The only information withheld from them was that the “person” they were playing with was actually a computer simulation. In effect, the researchers were testing how playing one game affected the way people played a different game, though they hoped to extrapolate far greater conclusions.
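To make the structure of that decision-matrix scenario concrete, here is a minimal sketch of a cooperate-or-defect point game played against a simulated partner. The payoff values and the partner’s behavior below are illustrative assumptions, not details taken from Sheese and Graziano’s paper.

# Illustrative sketch only: the payoff values and the simulated partner's
# strategy are hypothetical, not taken from the Purdue study.
import random

# Points earned as (participant, partner) for each pair of choices.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def simulated_partner():
    """The 'other person' is actually a computer simulation; here it
    simply cooperates most of the time."""
    return "cooperate" if random.random() < 0.8 else "defect"

def play_round(participant_choice):
    """Return (participant_points, partner_points) for one round."""
    partner_choice = simulated_partner()
    return PAYOFFS[(participant_choice, partner_choice)]

print(play_round("defect"))     # e.g. (5, 0) if the partner cooperated
print(play_round("cooperate"))  # e.g. (3, 3) or (0, 5)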


Sheese and Graziano found that both groups were equally likely to cooperate and to view the other “person” as trustworthy. However, they reported that the “violent group” was moderately more likely to defect than their counterparts. From this minor finding amid a plethora of non-significant results, the authors concluded that “playing violent video games may undermine prosocial and altruistic motivation… [and] appears to have contributed directly to participants’ willingness to exploit….” Their interpretation was, at best, a stretch, and at worst a deliberate distortion of the data. No mention was made of the fact that Doom was a 12-year-old title at the time—archaeological on a timeline of videogames—and thus a bizarre choice to represent contemporary violence in games. Nor did the researchers consider that a better name for their “non-violent group” might have been the “boredom group,” and that what they were really measuring may have been the difference between people actively engaged in an exciting task for 25 minutes (the “violent group”) and people forced to wander an old videogame stripped of anything directly engaging.


Perhaps most significantly, this study, which yielded little and yet labeled its findings as “remarkable,” was not published in an obscure academic journal but appeared in Psychological Science, one of the most influential and widely distributed psychology journals in the world.

This seemingly minor paper garnered such major attention because, as with Milgram, a strong political agenda was driving Sheese and Graziano’s research. Here was another attempt to address nebulous social hysteria with hard laboratory science: six years before the Purdue study, two young men stalked the halls of Columbine High School in Colorado with semi-automatic weapons and shotguns, killing a dozen of their classmates and one teacher, injuring 21 others, and ultimately killing themselves. The public was aghast, and the media was baffled over how “good kids” (which, like “good Americans” in Milgram’s era, stood as a euphemism for white and middle-class) could perpetrate such a barbarous act. Alongside goth culture and rap music, videogames were nominated as a potential corruptor.

With alarming speed, many social scientists took up the mantle of proving that videogames made people violent in the real world. Unlike Milgram, however, researchers like Sheese and Graziano, along with the editors of publications like Psychological Science, did not let the unimpressive nature of incoming data deter them from their initial hypotheses. Only in the past five years, as political fires have cooled, have new studies been published demonstrating how much bad science took place for over a decade surrounding games—and even so, such studies aren’t finding real estate in the most prominent journals.

///

On March 16, 2016, Polygon reported on an experiment conducted by Patrick Harris, lead designer at game studio Minority Media. The experimental design, as Harris presented it in his 2016 Game Developers Conference talk (Minority Media did not respond to our request to discuss the details further), consisted of this: Harris went into a virtual reality space along with a single female participant who had not been briefed on what would happen. Harris then deliberately harassed her in the most invasive ways he could think of, as afforded by VR’s unique level of sensory immersion. The woman later described the ordeal as a “damaging experience.”

Harris presented this experiment as evidence that harassment is “way, way, way worse” in VR than in other online spaces, and that measures must therefore be put in place preemptively to protect other people from the very experience to which he purposely subjected an unsuspecting woman. His presentation at GDC represented an intersection of the studies outlined above: like both Milgram and the Purdue researchers, Harris was politically motivated, in this instance by the incendiary topic of online abuse in gaming (which has existed since time immemorial but has only recently started to be acknowledged as a genuine problem by both mainstream media outlets and game developers). There is a public interest in understanding why people harass other people online, and what harm it can cause, and Harris seemed eager to bring these questions to bear on the emerging medium of VR.


The comparisons don’t stop there, unfortunately. Also like Milgram, Harris conducted an experiment that modern standards regard as wholly unethical (Milgram, at least, had the excuse of predating the creation of those standards). And like the Purdue researchers’ study, Harris’s experiment was methodologically ill-conceived and thus resulted in erroneous conclusions; that is to say, it was bad science.

In terms of ethics, the standard for research conducted on human subjects is clear: if there is reason to suspect that an experiment will cause pain or distress to a participant, you should try to find another way to test your hypothesis. Scientists seek knowledge, but not at all costs. Not only could Harris have reasonably expected his virtual antics to cause distress; the entire point of the exercise seemed to be to harass another person as intensely as possible. If he were a real scientist, the burden would be placed on Harris—by a disciplinary committee, most likely—to justify how the woman’s “damaging experience” yielded data worthy of her ordeal.


Unfortunately, there is no data. All we know is that Harris succeeded in creeping a woman out, which he reportedly felt guilty about, though not enough to stop him from doing it (a devotion to his cause that, incidentally, is perfectly predicted by the Milgram experiments). There was no apparent method of measuring the woman’s distress, and no way to compare it to the distress caused by other forms of online harassment. To declare that harassment in VR is “way worse” than in other places is therefore not only specious, but dismissive of those who suffer abuse in games, on social media, and elsewhere.

Harris’s experiment was less science than a poorly conceived polemic, poised to brand the emerging medium of VR as “dangerous,” just as videogames were branded following Columbine. His intentions may have come from a place of true concern, but that fact cannot forgive shoddy methodology. Claiming that we know something before we know it is more dangerous than any virtual simulation—it has the potential to marginalize people, suppress opposing information, and restrict artistic freedom.

Discovering the causes and correlates of unspeakable human behavior, from Nazism to school shootings to online harassment, is an important pursuit of social science. But just as important is the pursuit of truth through research methods that are both ethical and rigorous. VR is new and untested; we have an opportunity to understand its nature through an open and balanced discourse between developers, critics, and the general public. Rather than racing to be the first to declare VR all-good or all-bad on the basis of limited data, the goal should be to approach the medium with both a scientist’s curiosity and a humanist’s compassion.
