Re-Used Halo 3 Study Completely Skews Results to Frame Sexist Agenda [Updated]

This article has received an official response from the author of this study. In it, the author goes into detail about the decisions that led to many of my problems with his article. He explains why only the specific responses that were recorded were included, why he chose to look only at kills and deaths as a measure of skill rather than adding in player assists to help clarify a player's skill, and other problems I failed to consider at the time I originally wrote this article. If you are reading this Mr. Kasumovic, thank you very much for your response.



Recently there has been an explosion of scientific research into video games, and with good reason. With an industry value of over 85 billion dollars in 2014 [1], video games have rapidly become the biggest entertainment industry in the world.  However, the potential that video games have to change gamers remains relatively unknown, and is an aspect that researchers are dying to understand.

Hundreds of new studies are being published each year looking at various gaming aspects, such as how platformers increase spatial reasoning, or how online cooperative games improve social skills, and so on. However, as has long been the case, a portion of these studies are misleading and flawed, and dramatic but unsound results seem more likely to get media attention.

Unfortunately, it seems that not even peer review can prevent all of the studies like these from making their way into academic journals. Take, for instance, a study published recently discussing how people who played the game Halo 3 poorly were more likely to be sexist. This study, which has been presented as factually sound by several gaming and tech sites, has several distinct flaws that need to be fixed before the research should be taken seriously by anyone.

That is not to say that the entire study should be tossed out. Some aspects, such as the way the researchers went about obtaining their original material, are quite useful. For example, the researchers used three different gamertags on Xbox Live. Two of these tags used a male and a female voice to determine whether players reacted differently toward different genders; the third tag was used as a control and didn’t speak at all. Because the control never spoke, its data was removed on the grounds that it didn’t tell anything about the gender interactions.

This approach could prove to be a very useful way of determining whether players are generally more hostile online when certain signifiers are presented to different genders. Another useful factor was the use of outside transcribers to determine what was said in each match. This helps prevent personal bias, and allows the study to properly quantify the exact data needed to determine their statistics. However, while these aspects are noteworthy, they do not save this study from being an otherwise poor one.

To start with, the data that this study used was not current in the least. It was obtained from a study on Halo 3 conducted back in 2012, a point in time where the game was already five years old. While using the data from older studies is a perfectly acceptable practice, the researchers presented the older data as if it were new and original. This is reflected in their results page, which stated “we played” rather than the more accurate “we obtained from the original data”. Due to the wording, readers are left with the false notion that this information is a current and accurate depiction of today’s gamer, rather than a reflection of the Halo 3 gaming community in 2012.

Despite this, the researchers did provide a thorough description of how they processed the data from their original study. For each match in the data set, the researchers had independent transcribers record anything that was spoken, without letting the transcribers know the purpose behind their transcriptions. The researchers then crosschecked 10 percent of all transcriptions and determined that they were accurate. Next, the author and an independent coder looked through the transcriptions for comments toward the experimental player. These comments were classified as positive, negative, or neutral. Negative comments were then checked to see if they contained any sexist remarks made to the female experimental player.

While the distinction was made in order to determine when females were more likely to receive sexist comments, it willingly ignored any sexist remarks directed toward men. Without this critical piece of information, readers and researchers alike cannot determine whether the number of negative sexist comments observed was normal for both male and female players, or whether female players received a statistically significantly larger number of sexist comments than men. If the former is true, then the findings would only prove that players who perform poorly are generally more negative than players who perform well, hardly the insight into gender norms and online gaming that the authors and media coverage billed it as.

The study suffers from another major problem: it generalized old information to the public of today without any indication of who comprised its sample, other than that they were males who spoke when playing Halo 3 five years after it was released. This lack of information is partially due to the semi-anonymous nature of Xbox Live pseudonyms, but lacking key information such as location, ethnicity, or age makes it impossible to properly generalize this information to the public. They did not even record the one bit of demographic information that was available to them, player skill.

The manner in which skill was determined would not distinguish between, as an example, a teenager who may have never played competitive online games alongside women before and a fifty-year-old Klansman. Without the necessary demographic information, it is not possible to determine the significance of the data in relation to the general public. Without factoring in other potential causes for the sexist behavior, or even attempting to control for such factors, the author has attempted to convince the public that the only reason for such sexist attitudes is the fact that a woman is performing better than a man in a game, and nothing else.

While the researchers didn’t control for factors such as age, nationality, location, or even ethnicity, they did make sure to carefully explain just what information was used. In total, the older study from which the data was obtained had 1136 participants, of whom roughly 574 played against the female voice and 567 played against the male voice. However, these exact figures were omitted from the study because the author made sure to write only what he thought of as “necessary” for the reader to come up with their own opinion.

This necessary information meant only looking at the participants that actually spoke, which ended up being 189 participants. While it is easy for a casual reader to assume that data from people who didn’t talk wasn’t useful, it is critical for researchers to publish all of the information that they obtained. Including such “unnecessary” information means the difference between a figure of 1.9% and one of 13.4% of participants using sexist comments once it is stripped of context, as happens in media coverage; the latter figure is seven times greater than the former.

One of my old statistics professors told me in my undergrad program that you can prove anything using statistics. Change the number of participants in a study and you can prove that raising the temperature in a classroom can improve the likelihood of scoring an A on a final exam. This is what the researchers proved: if you manipulate the data enough, you can show that male gamers who play poorly will make sexist comments toward women while losing.

Out of every player that participated on the female side of the study, only 1.9% made a sexist comment. That is 11/574 (or 1.9%). Break that down to only players who were active teammates of the female experimental player, and we are left with 11/246 (or 4.47%). If we reduce the numbers further to only the 84 participants that actually talked, we’re at 11/84 (or 13.4%) of all participants using sexist remarks.
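The effect of the shrinking denominator can be checked with a few lines of arithmetic. The sketch below is only an illustration using the counts quoted in this article; the labels and the helper name `share_percent` are made up for the example and do not come from the study. Note that, taking the counts exactly as stated, 11/84 works out to about 13.1% rather than 13.4%, so one of the quoted figures may be slightly off.

```python
# Counts as quoted in this article; the labels are descriptive names
# chosen for this illustration, not terms taken from the study.
SEXIST_COMMENTERS = 11

denominators = {
    "all players in female-voice games": 574,
    "teammates of the female-voice player": 246,
    "teammates who spoke": 84,
}


def share_percent(count, total):
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


# Same numerator every time; only the denominator changes.
shares = {
    label: share_percent(SEXIST_COMMENTERS, total)
    for label, total in denominators.items()
}

for label, pct in shares.items():
    print(f"{label}: {SEXIST_COMMENTERS}/{denominators[label]} = {pct}%")
```

The numerator never changes; the headline percentage grows only because the pool it is divided by gets smaller.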

This generalization promoted by manipulated data is far from the only over-generalization due to inconsistent or incomplete data. Factors such as Playlist Rank are more dependent on the amount of time played than on skill. This Playlist Rank system only takes into account the number of times a player has won or tied, lost, or disconnected, rather than factors such as cooperation, assists, or even driving vehicles, let alone a proper kill-to-death ratio. In this study, Rank is believed to be a “status symbol” that shows “dominance” to the other players, rather than a factor that could easily be manipulated, or paid for.

In this respect, Rank is no more than a cosmetic decal that shows how long a person has been playing. While this could have been a contributing factor had this study looked at games such as Call of Duty, where specific activity directly relates to Rank and thus an increase in competitiveness to ensure a higher rank, this is not the case in Halo 3. The only time skill becomes a factor in Halo 3’s rank system is at a much higher level than typically used in this study.

A second factor this study tries to use is the kill-death ratio to show the skill of the player. According to the study, the number of kills you have relative to the number of deaths you’ve had typically shows your general level of skill. Again, this takes out all cooperative, team-oriented behavior. Just because a player isn’t killing the opposing team doesn’t mean they aren’t a credit to their team. Assists matter in a team-based game, and driving the vehicle while others take the kill is important. By removing these important factors, the data cannot accurately represent who performs the best in a team-based game like Halo 3, and it skews the results to show only players who take aggressive action as dominant.

This study did show statistical significance for several interactions between the experimental player and the participants. It is important to understand that statistical significance is not proof of the conclusions the authors draw from it; it only indicates that the observed pattern would be unlikely to arise from random chance alone. Given the small sample size, even the variation due to chance is quite large.
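To give a rough sense of how wide that chance variation can be, the sketch below computes a 95% Wilson score interval for 11 sexist commenters out of 84 speakers, the counts quoted earlier in this article. The choice of the Wilson interval and the function name `wilson_interval` are assumptions made for illustration; this is not an analysis the study itself reports.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% confidence level.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts taken from this article: 11 commenters among 84 speakers.
low, high = wilson_interval(11, 84)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 7% to 22%, so a rate three times higher at one end than the other is compatible with the same eleven observations.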

The analysis does not account at all for countless demographic confounders or other biases that the researchers do not control for and that could easily correlate with in-game communication habits, including their inappropriate use of Rank (which effectively measures the amount of time playing the game) and kill/death ratio (which is influenced by cooperative team-oriented playstyles). Instead they use this as a launching-off point to propose an elaborate hypothesis about the sociology of “low-status males”, as if this uncontrolled, scant, and noisy data is capable of providing insight into the human condition.

Despite the willful manipulation of the statistics, the over-generalization of the participant motives, the denial of a very useful control, the refusal to acknowledge demographic differences, and the fact that this data comes from a game that’s nearly eight years old, journalists continued to use this information.

This data was used to say that people who are bad at playing games are sexist. This study only had 11 participants say sexist things, but 11 participants became the public at large. This is why everyone should learn how to analyze a scientific paper, if only to come to your own conclusions.


