The Great Ratings Recalculation

As American Idol grew more popular among television watchers, it drew a wider cross-section of music enthusiasts. Among the many consequences was a drastic drop-off in the number of "showstopper" performances, which we at WNTS.com arbitrarily define as having an approval rating of 90 or higher. We explained why in an editorial here ... and here ... and, uh, here too.

At the urging of many of our readers (and because the showstopper decline irked us, too), we looked into ways in which we could 'normalize' the ratings across seasons in some sort of organized, impartial, and meaningful fashion. In the summer of 2010, we put a proposal online in our Library for how we might do this, including a full list of adjusted ratings from the first eight seasons. Based on the feedback of many correspondents, we spent another year tweaking the formulas. Finally, in June of 2011, we put the Great Ratings Recalculation into production. Here's its gory story of glory, as it were.

First, The Cleanup

The GRR turned out to be way more work than we bargained for, but ultimately in a positive way.

In order to calculate accurate numbers, particularly for the very highest- and very lowest-rated performances where most of the Idolsphere's interest lies, we came to the conclusion that we'd need to go back into our trusty Excel spreadsheets and extract the "raw" numbers to at least a couple of decimal places. If we didn't, then any two performances from the same season that had the same (rounded) approval rating would remain tied after we applied the adjustments...even if those two performances were actually 0.999 points apart originally, one barely rounding up and the other barely rounding down.
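
The tie problem is easy to see in a small Python sketch. The two raw ratings below are made up, and `adjust` is only a hypothetical stand-in for any season-wide adjustment, not the formula we actually used.

```python
import math

def round_half_up(x):
    """Round to the nearest whole number, with .5 always rounding up."""
    return math.floor(x + 0.5)

def adjust(rating, scale=1.2, center=50.0):
    """Hypothetical stand-in adjustment: stretch a rating away from 50."""
    return center + (rating - center) * scale

# Made-up raw ratings, almost a full point apart; both display as 90.
raw_a, raw_b = 89.51, 90.49

# Adjusting the rounded values: the two performances stay tied.
from_rounded = (adjust(round_half_up(raw_a)), adjust(round_half_up(raw_b)))

# Adjusting the raw values: the gap between them survives.
from_raw = (adjust(raw_a), adjust(raw_b))
```

Whatever the adjustment is, feeding it identical rounded inputs can only produce identical outputs, which is why the decimals had to be recovered first.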

In the process of doing this, we found a few minor errors that we made, plus some inconsistencies in our spreadsheet formulas from season to season. Those were the natural by-product of our Math Dept. making minor tweaks here and there over the years to improve accuracy. We then decided that if we were going to do this job right – and we will tell you right now, it was so much bloody work that we have no intention of ever undertaking it a second time – we should fix those glitches as well.

We also took the opportunity to look at a couple of past performances whose original ratings were suspect for one reason or another. From a GRR standpoint, the most significant of those was Adam Lambert's Mad World. As you may recall, that night's episode, like many in that difficult season, ran very long because the judges were operating under the assumption that American Idol was all about finding the best unsigned music critic in America. "Mad World" was the final performance, and it didn't start until well past 9pm, after many people's DVRs had automatically shut off. Thus, quite a few web reviewers didn't adjudge the performance, and while reading their comments we got the distinct impression that a disproportionate number of those people were among Lambert's more strident detractors – they made it clear they had zero interest in chasing down the video clip online. "Mad World"'s 92, the highest "raw" approval rating in the past four seasons, has made us uncomfortable ever since.

There's simply no scientific way to solve a problem like this, so we discussed it among ourselves and with a handful of our regular correspondents. The verdict: the ruling on the field stands. Perhaps Lambert caught a lucky break under the circumstances...but if so, then so have many other contestants over the years, in countless different ways. Good (and bad) fortune has always been one of the components of an approval rating, so trying to correct for it in this particular case makes no sense whatsoever. Besides, the memorable performance is near-universally hailed today as one of the best in AI history, so its new ranking in the WNTS Top 40 certainly doesn't feel out of line.

In the process of doing all of this pre-GRR cleanup work, about 220 performances' "raw" ratings – the ones you've seen on WNTS for the past umpteen years – changed. In 210 of these cases, the movement was precisely one point after rounding. For eight other performances, it was two points.

In just two cases did a performance move more than that, and in both cases the reason was, ahem, stochastic digital fluctuation. In other words, we evidently fat-fingered their ratings when we transferred them from the spreadsheet to the database. Those two performances, if you care, are Clay Aiken's Someone Else's Star (a 56.1 that we mistook for 51.6) and the big mover...the unforgettable Never Too Much by Scott Savol. You forgot about that one years ago, you say? Um, yeah, to be honest, so did we. But we goofed big-time here: instead of a 28, Savol actually earned a 38. Which of course means nothing in the grand scheme of things...except, er...it might. Had we used the correct performance rating, Savol would have survived Week Two of the AI4 Camp Should-A-Been replay, ousting Constantine Maroulis. Then in Week Three, Savol and the equally unforgettable Joseph Murena would have tied for the sixth and final Guys' spot in the Finals at 36. Who would have won on decimal points? Most likely Murena, since Savol's Top 16 score was actually a 35.62 that rounded up. But we're afraid to re-run Murena's projected rating to be certain, because if Savol advances then it would scramble everything that followed. Someday, maybe we'll work up the nerve to check.

Then, The Recalculation

The formula that we ultimately settled on is pretty simple, all things considered.

First, we threw out things like reprise performances and Original Winners' Songs™, since they're not terribly well liked by reviewers and reveal very little about a contestant. Then, for each season, we calculated the mean and standard deviation of approval ratings. Careful...this is not the familiar WNTS s.d., which measures variance of opinion about each performance. Instead, we took the old-fashioned standard deviation of every performance rating that year. In years with high opinion variances, this tends to be relatively low (because ratings are bunched more closely together around 50), and vice versa.
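
As a rough illustration of that per-season step, here is a minimal Python sketch. The ratings and the season labels are invented, and the choice of population (rather than sample) standard deviation is an assumption, since the text above doesn't say which one applies.

```python
from statistics import mean, pstdev

# Made-up approval ratings, keyed by hypothetical season labels.
ratings_by_season = {
    "bunched": [42.0, 46.0, 50.0, 54.0, 58.0],
    "spread": [20.0, 35.0, 50.0, 65.0, 80.0],
}

def season_stats(ratings):
    """Return (mean, population standard deviation) of one season's ratings."""
    return mean(ratings), pstdev(ratings)

# One (mean, s.d.) pair per season.
stats = {season: season_stats(r) for season, r in ratings_by_season.items()}
```

Both invented seasons average 50, but the "bunched" one has a much smaller standard deviation, which is the situation described for years with high opinion variance.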

To normalize the ratings, we basically massaged them so that every season's bell curve distribution became roughly the same. There are a few details and tweaks that we're leaving out, but they didn't have much effect on the final numbers. Ultimately, very high and very low ratings from seasons with high "performance" s.d.'s increased and decreased, respectively. Ratings closer to 50 changed very little or not at all. At the end of the day, every season wound up with almost the same average rating as before. Only the outlying performances – that is, the 5-star and 1-star crowds – were much affected. Those are the numbers you now see on the site.
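
For the curious, the general idea of rescaling a season's spread can be sketched in a few lines of Python. This is not the production formula: the details and tweaks mentioned above are omitted, and the shared `target_sd` value, the use of the population standard deviation, and the clamping to the 0-100 range are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def normalize_season(ratings, target_sd=20.0):
    """Rescale ratings around the season's own mean to an assumed common s.d."""
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        # Every rating is identical, so there is nothing to rescale.
        return list(ratings)
    scale = target_sd / sigma
    adjusted = [mu + (r - mu) * scale for r in ratings]
    # Assumption: keep results inside the 0-100 approval scale.
    return [min(100.0, max(0.0, a)) for a in adjusted]

# Made-up season; target_sd=10.0 is an arbitrary illustrative value.
season = [42.0, 46.0, 50.0, 54.0, 58.0]
adjusted = normalize_season(season, target_sd=10.0)
```

Because each rating is scaled by its distance from the season's mean, the season's average is preserved (as long as nothing gets clamped), ratings near the middle barely move, and the outliers move the most.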

Finally, The Presentation

When we were through with this twisted process, we realized that we now had three different "classes" of approval ratings on WNTS. There are the "overnight" ratings that we publish after every episode, which tend to fluctuate quite a bit as we poll more sites. Then there are the "in-season" or "raw" numbers that we arrive at by the weekend, and which do not change very much for the duration of that season. Finally, there are the "final" scores after the GRR adjustments are applied at each year's end.

To distinguish between these three rating classes, we borrowed the familiar gold/silver/bronze paradigm from the Olympics. As the ratings become more mature and (hopefully) accurate, they move up the medals' food chain:

— Bronze stars represent overnight ratings.
— Silver stars represent in-season ratings.
— Gold stars represent final ratings.

For single performance ratings, this is pretty easy to grasp. However, things get a little more complex when we get into aggregate ratings, such as the average for an artist or a contestant. In those cases, we use the lowest level of any performance in the average. For example, if a song's performances comprise 10 golds, 2 silvers and 1 bronze, we list it as bronze. For an aggregate rating to be gold, all of its performances must be fully normalized and adjusted.
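
The "lowest level wins" rule for aggregates can be sketched as follows. The names `LEVELS` and `aggregate_level` are hypothetical, chosen only for this illustration.

```python
# Maturity order, from least to most mature.
LEVELS = ["bronze", "silver", "gold"]

def aggregate_level(performance_levels):
    """Return the least mature level among the performances in an average."""
    return min(performance_levels, key=LEVELS.index)

# The example from the text: 10 golds, 2 silvers and 1 bronze.
example = ["gold"] * 10 + ["silver"] * 2 + ["bronze"]
level = aggregate_level(example)
```

Here `level` comes out as bronze, and an aggregate is gold only when every performance in it is gold.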

Starting in 2012, gold, silver and bronze will be in everyday use on the site. In the interim, for the duration of calendar year 2011 we deliberately left the S10 Final Three episode tagged as "in-season", and the S10 Finale as "overnight", so that our readers can get used to the pretty new colors. But do keep in mind that both episodes do indeed comprise only final, adjusted, normalized, they-ain't-changin'-anymore approval ratings.

-- The staff of WNTS.com
