
Flaw #3 with review scores: the aggregate

In my previous two posts in this series, I discussed a couple of reasons why review scores from individual sources can be problematic.  But what happens when the scores from all the different sources are aggregated to create a single, average score?  Fortunately, in most cases, the effects of the problems tend to cancel out rather than compound, so aggregate scores are a little more trustworthy than individual scores.  There are still a few problems, however, that deserve mentioning.

When averaging a bunch of different scores, aside from knowing the grand average, it is also important to have a sense of the distribution.  Metacritic does a good job of this by providing a “Metascore” along with a list of all the individual review scores in order from highest to lowest, but unfortunately it doesn’t calculate the standard deviation (a quantitative measure of the spread) of the scores.  Knowledge of the distribution is important for understanding how much reviewers agree or disagree on the merits and shortcomings of a game.  If the distribution is narrow, then the reviewers pretty much agree on how good or bad the game is; on the other hand, if the distribution is wide, then the reviews are more of a mixed bag.  This can be particularly significant for games scoring in the 70-85% range, which tend to fall into two main categories:  fundamentally mediocre and love-it-or-hate-it.  Simply looking at the average scores is not enough to make the distinction.
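To make the point about spread concrete, here is a minimal sketch in Python.  The two score lists are made up for illustration (they are not real Metacritic data), and the helper name summarize is my own.  Both hypothetical games average exactly 75%, but the standard deviation separates the fundamentally mediocre one from the love-it-or-hate-it one.

```python
from statistics import mean, pstdev

# Hypothetical review scores (in percent) for two made-up games; not real data.
mediocre = [74, 75, 76, 75, 75, 74, 76, 75]
divisive = [95, 55, 90, 60, 100, 50, 85, 65]


def summarize(scores):
    """Return (average, population standard deviation) for a list of scores."""
    return mean(scores), pstdev(scores)


for name, scores in (("mediocre", mediocre), ("divisive", divisive)):
    average, spread = summarize(scores)
    print(f"{name}: average = {average:.1f}, standard deviation = {spread:.1f}")
```

The averages are identical, so only the second number tells you whether the reviewers actually agree with one another.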

Another problem with aggregate scores is the need to convert individual scores to a common scale.  Both GameRankings and Metacritic use simple formulas to convert letter and numerical scores into percentages, but these formulas prove to be problematic once you realize that two different scores which convert to the same percentage may not have the same meaning.  For example, a 50% from one site might mean average, whereas a 50% at a different site could mean failing.  Trying to compare those two scores based on the numbers alone is like comparing apples to oranges.  It doesn’t work.
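Here is a rough sketch of the conversion problem in Python.  I don't know the exact formulas GameRankings or Metacritic use, so the linear rescaling below is an assumption, and the two sites, their scales, and the function name to_percent are all hypothetical.

```python
def to_percent(score, lowest, highest):
    """Linearly rescale a score from [lowest, highest] to a 0-100 percentage.

    This is an assumed conversion, not the aggregators' actual formula.
    """
    return 100 * (score - lowest) / (highest - lowest)


# Hypothetical sites: one rates games from 1 to 5 stars and treats 3 as
# "average"; the other uses a school-style 0-100 scale where 50 is "failing".
site_a = {"scale": (1, 5), "score": 3, "meaning": "average"}
site_b = {"scale": (0, 100), "score": 50, "meaning": "failing"}

for site in (site_a, site_b):
    lowest, highest = site["scale"]
    percent = to_percent(site["score"], lowest, highest)
    print(f"{percent:.0f}% means '{site['meaning']}'")
```

Both scores come out as 50%, and the aggregate treats them identically even though the two reviewers meant very different things.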

One last thing to be aware of when dealing with aggregate scores is how the average scores are calculated.  Are they just straight-up averages?  Are they weighted averages (and if so, how are the weights assigned)?  Are the highest and lowest scores dropped before computing the averages?  This may be a more subtle point, but the more information you have the better.  You don’t want to use numbers that you don’t understand when making your next video game purchase.
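To see how much the method matters, here is a small Python sketch of the three approaches mentioned above.  The scores and weights are invented for illustration, and none of this is meant to describe how any particular site actually computes its averages.

```python
from statistics import mean

# Hypothetical scores (in percent) and weights; the first outlet is assumed
# to count three times as much as each of the others.
scores = [90, 85, 80, 75, 40]
weights = [3, 1, 1, 1, 1]


def straight_average(scores):
    """Plain arithmetic mean of all scores."""
    return mean(scores)


def weighted_average(scores, weights):
    """Mean in which each score counts in proportion to its weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def trimmed_average(scores):
    """Mean after dropping the single highest and single lowest score."""
    return mean(sorted(scores)[1:-1])


print(f"straight: {straight_average(scores):.1f}")
print(f"weighted: {weighted_average(scores, weights):.1f}")
print(f"trimmed:  {trimmed_average(scores):.1f}")
```

The same five reviews produce three different "average" scores, which is why it helps to know which method an aggregator uses.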

While aggregate review scores are less susceptible to some of the flaws found in the scores issued by individual reviewers, there are still a few things to keep in mind when working with them.  Just remember that you can never get all of the information you need to make an informed purchase with only a single number.


Flaw #2 with review scores: bashing

For the second entry in my series on why modern video game review scores are flawed, I’d like to discuss bashing.  The term “bashing,” as it is commonly used, refers to insults or other criticism levied against someone or something, often unnecessarily and/or inappropriately.  In the context of review scores, I consider bashing to be any deduction that is not the direct result of one or more of a game’s demerits.

There are two main types of bashing that affect review scores – sequel bashing and bashing for attention.  I’ll start with sequel bashing since it’s the more common offense, especially among the major review sites.  Sequel bashing affects all games that are not the first in their respective series.  Reviewers seem to feel that a sequel must completely blow away its predecessor if it is to receive a score at least as high as the predecessor’s.  Never mind that a sequel may be fundamentally better than its predecessor – if it doesn’t reinvent or revolutionize every single aspect of the original game, it’s doomed to get a lower score simply because of this flawed interpretation of what a sequel should be.  Admittedly, there are some sequels that manage to muck things up (e.g., Devil May Cry 2, Rayman Raving Rabbids 2, Suikoden IV), and I do not deny that those games deserve lower scores.  In general, however, if a sequel manages to improve upon its predecessor in several noteworthy ways, it deserves a higher score.

But what about series that go stale?  In other words, what can we say about sequels that are neither better nor worse than their predecessors?  I argue that there’s no reason to sequel bash these games either.  Sure, if a sequel suffers from outdated visuals or unimproved sound quality then it is entitled to a lower score because the standards for all games go up as time progresses, but if the core gameplay remains unchanged, there’s really no further reason to deduct from the score.  Consider the following remark from Matt Casamassina’s review of Mario Party 8 on IGN:

In spite of our issues with the game, people who loved Mario Party 7 will probably enjoy Mario Party 8, too, but we’ve chosen not to reward Nintendo with an undeserved high score for a copy/paste sequel.

So scoring MP8 1.8/10 lower than MP7 is deserved for being of roughly the same caliber?  What if I’m new to the Mario Party series and I’m looking for somewhere to start?  Going by the review scores, I’ll probably gravitate away from MP8, even though it is referred to as a “copy/paste sequel” with no severe flaws that distinguish it from the other Mario Party games.  That’s not right.

The other main type of bashing is bashing for attention, and while this is less common than sequel bashing, it is still a problem.  Bashing for attention occurs when one reviewer takes issue with a game that is consistently well received by other reviewers, and that reviewer gives the game an undeservedly low score in order to set his or her review apart from the others.  See if you can find the reviewers guilty of bashing for attention in the following examples:

Example #1, Example #2, Example #3

I’m not saying that a reviewer cannot find fault with a game unless the other reviewers also find the same fault, but you must really be missing something if your score for a game is 40-50% lower than everyone else’s.  Even though reviews are subjective by nature, there still must be an objective component to accurately weigh the merits and demerits of the game.  Otherwise, cheap tactics like bashing for attention only serve to undermine the regard given to review scores.

Flaw #1 with review scores: the 10-point scale

This entry is the first in a series I plan to write about why modern video game review scores are flawed.  One of the first steps in deciding whether or not to purchase a game is to read the reviews, and review scores are meant to provide at a glance what each reviewer thought of the game.  Converting paragraphs of text into a single numerical score, however, isn’t exactly a smooth process, and crucial information can be lost in translation.  But what good is a score if it does not accurately reflect the reviewer’s opinion?  In this post, I show why using a 10-point scale to review games is a bad idea.

The most critical problem with the 10-point scale is that the meaning of the middle scores is not well defined and not consistent from reviewer to reviewer.  It can generally be agreed that a game receiving a score of 9 or 10 is exceptional, and a game receiving a 0 or 1 (depending on how low the review scale goes, which also varies with the reviewer) is terrible.  But what can we say about a game that gets a 7?  Is a game considered average or below average if it gets a 5?  Admittedly, some reviewers and review sites do attempt to explain what their scores mean, but it would be nice to see a universal review scale so that such explanations would be unnecessary.

Another issue concerns the granularity of the scale, or how finely the scale is divided.  For example, a 10-point scale with granularity 0.5 allows scores of 0.0, 0.5, 1.0, and so on, all the way up to 10.0.  If a scale has too little granularity, then the review score has little meaning since two games could get the same score even if, comparatively, one is noticeably better than the other.  On the other hand, too much granularity goes beyond the realm of human discernment – is a game that earns a 6.3 really any better than a game earning a 6.2?  There’s also the problem of having constant granularity along the entire scale.  We may care whether a game earns an 8 or a 9, but we’d probably care a lot less whether a different game earns a 2 or a 3.
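As a quick illustration of granularity, here is a short Python sketch that lists the scores a scale allows for a given step size.  The function name allowed_scores and its default 0-to-10 range are my own assumptions for the example.

```python
def allowed_scores(step, lowest=0.0, highest=10.0):
    """List every score a reviewer could assign on a scale with this step size."""
    count = round((highest - lowest) / step) + 1
    # Round each value to hide floating-point noise such as 0.30000000000000004.
    return [round(lowest + i * step, 10) for i in range(count)]


for step in (1.0, 0.5, 0.1):
    print(f"step {step}: {len(allowed_scores(step))} possible scores")
```

A step of 1.0 leaves a reviewer only 11 scores to choose from, a step of 0.5 gives 21, and a step of 0.1 gives 101, which is far more distinctions than anyone can reliably make between games.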

Given these points, I would argue that a traditional letter grading system would be best, provided that the reviewer uses a different mindset when assigning a grade and does not equate the letter grades with numbers.  Everyone can intuit that A = superb, B = good, C = average, D = below average, and F = failing, and with the addition of +’s and -’s, the granularity is ideal.  I’ve noticed that EGM magazine and have switched to a letter-grade review system, which is a good start.  GameSpot also revamped its review system not too long ago, and its review scale now has granularity 0.5 (as opposed to 0.1).  These are all changes for the better, and it would be great if other reviewers followed suit.