So you say you want to learn all there is to know about the WhatNotToSing.com performance rating system? As the old saying goes: Be careful what you wish for, you may get it. Below is the full story of how we calculate the ratings, from soup to nuts. (And we have to deal with plenty of the latter.)
A word of advice before proceeding: this is a very lengthy article and it gets quite technical in spots. If you want a very simple executive summary of the performance ratings, we suggest you read our website's primer for new visitors, Intro to WNTS 101. If that's not enough, we have a more in-depth overview (but without all the gory statistical details you'll find here) on our Frequently Asked Questions page.
Before we discuss how we calculate the approval ratings, let's take a moment to discuss what they really are....
WhatNotToSing.com approval ratings measure how well (or not so well) a performance was received by the Internet fans of American Idol. A common misconception is that approval ratings are synonymous with how "good" or "bad" the singer was that night. In reality, singing quality is just one factor (albeit the most significant one) that affects the final number. Others include song choice; presentation; the judges' remarks; the contestant's personality, composure, and performance history; and how well other contestants performed that night in comparison.
We'd love to isolate and rate separately all the factors that go into a performance. But we don't know how to do that, at least not objectively. Thus we rate the performance as a whole, then use the reviewers' comments to interpret the results (and allow our readers to do the same) and determine what America liked and didn't like about it. This is why we avoid using terms like "better" and "worse" on the site when comparing performances – we use phrases like "higher-rated" and "less-liked" instead.
The goal of Project WNTS is to provide a consistent, objective comparison basis between performances. If we do this successfully, fans and future Idol contestants can make intelligent decisions based on hard facts, instead of silly bromides (like "Never sing Aretha Franklin!") that don't stand up to analysis.
As for what separates a "good" vs. "bad" performance, that's entirely up to you. If you believe that Summertime was the worst performance in AI history and one with a single-digit approval rating was the best, our official response is: "whatever." There are a few performances that America rated in the 40s and 50s that we happen to think were terrific, and at least one in the 90s whose popularity we still don't understand. No matter. America decides, we report.
This is important: Do not mistake WhatNotToSing.com approval ratings for a voters' poll. Ratings are a measure of how well viewers liked or disliked a performance. We make no attempt whatsoever to predict how they'll actually cast their ballots.
American Idol voting patterns are extremely complex, as anyone who's watched the show for any length of time already understands. For better and for worse, performance quality is just one piece of the puzzle. What matters most is how much a contestant motivates America – particularly his or her fanbase of loyal supporters – to pick up their phones at the end of the episode. (Perversely, sometimes the best way to do this is to perform spectacularly badly, though that's not a well you want to go to too often.)
Fortunately, there is a large cadre of "free agent" voters out there who cast their ballots primarily for the evening's best performances. Singing well will thus almost always (note: we said "almost") see you safely through to the next week. Even more encouragingly, there is a very strong correlation between approval ratings and long-term survival on the show. But sometimes the non-performance forces are so strong that there is simply nothing a contestant can do to overcome them. Historically, one performance from Season Two illustrates this phenomenon best: Band Of Gold.
Among the several popular websites that do specialize in real-time Idol voting predictions, we recommend DialIdol.com.
We now return you to your regularly scheduled article.
WhatNotToSing.com performance ratings are based entirely on publicly-posted opinions we find on the World Wide Web. They come from blogs, newspaper articles, and forums; from message boards, roundtables, chat room logs, online polls and feeds of every type imaginable. Some have dubbed this virtual world of American Idol fans "The Idolsphere".
A small minority of these sources are websites dedicated solely to Idol. A larger number are general-interest forums covering television shows, and reality shows in particular. Most, however, are websites that have nothing to do with Idol whatsoever, and this is by design. We strive for the widest possible sampling of opinions, not just those of the show's most rabid fans.
We've pulled opinions from hobby sites dedicated to, among other things, ice skating, snorkeling, scrapbooking, hunting, photography, movies, Disney, travel, and video games. We've pulled them from religious sites, kids' sites, military sites, ethnic sites, activist sites, senior citizens' sites, networking sites, tech sites, and sports sites — you'd be surprised and perhaps encouraged at how well Idol ties America together. We've pulled opinions from viciously partisan political blogs of both wings, sometimes holding our noses until we get to the page with the information we need. (Like the Army motto goes: get in, get the job done, get out.) Thanks to search engines like Google and Yahoo!, a few simple search terms are all it takes to produce hundreds of results in places we'd otherwise never dream of looking.
Are there any sources that we don't use? Yes, but that's a much shorter list:
Additionally, we only use reviews posted freely and publicly on the Web. So if you post reviews on a password-protected or otherwise publicly inaccessible site, or if you email your reviews directly to us, sorry, we cannot use them.
That we won't say. Given some of the shenanigans that go on in the Idolsphere, we don't want to tempt über-fans and pranksters into trying to game the ratings.
We can say, however, that there are no websites we use each and every week. Instead, we take random samplings of opinions from a randomly chosen set of sites. There are a couple of popular Idol forums on the web that we scan more often than others – one tends to skew younger and male, the other older and female, and that helps us even out the demographics. In no case, however, does any one site account for more than 10% of the data points for a given week.
We place very few restrictions on where we accept opinions from. But, when we accept them is a different matter entirely.
From the earliest days of the WNTS Project, it became apparent that opinions changed drastically after the voting results were announced. Performances that the Idolsphere felt were merely below-average on Tuesday became "atrocities" and "screechfests" on Wednesday if that contestant advanced while a different, better-liked contestant was 'shockingly' eliminated. Similarly, performances that were originally considered good but nothing special became instant Grammy nominees if the singer ended up in the Bottom Three or sent home.
In short, once the results are announced, many reviewers tend to critique America's voting habits more than the performance itself. And this would defeat the entire purpose of WhatNotToSing.com – namely, to use sound statistical techniques to evaluate which songs, artists, and music styles receive the most favorable receptions. Therefore, we only count reviews timestamped in the 24-hour window between the performance and the start of the results show. (We do make occasional exceptions for bloggers and columnists who've established a good track record of impartiality.)
Our policies and restrictions hold us back very little, because there is a wealth of Idol musings to choose from on the Web. From Season Four on, we've collected at least 300 different opinions for each performance, usually many more. For Season Two and Season Three, which we calculated retroactively, we were still able to collect at least 200 individual opinions.
Only for Season One did we run into trouble, because there aren't enough surviving Internet sources available. Reviews of the four pre-finals episodes, when viewership was minuscule, were particularly difficult to come by. To supplement our data, we established the Season One Review Crew in 2008, in which WhatNotToSing.com readers were invited to send us their critiques and opinions from that first season. This was the only case in which approval ratings were calculated significantly from reviews published long after the fact.
Performance reviews on the Web come in many flavors. The more descriptive and intelligently written they are, the more we love and revere them (and the greater weight they ultimately have on the overall WhatNotToSing.com ratings.) Here's a sampling of what we use, from least sophisticated to most.
On message boards and chat rooms, simple rankings are most popular. Fans express their opinion of the evening's performances by placing them in order from best to worst, e.g.:
These are useful, but they're not terribly descriptive. Was the gap between Chris and Taylor as large as the gap between Taylor and Katherine? Was Chris far better than Kellie, or were they all bunched in the "muddled middle"? With basic rankings, there's no way to tell – all we know is the order.
More sophisticated reviewers go a step further by indicating where they felt the clear delineations in performance level fell, e.g.:
This is better; now we know the reviewer felt that the seven performers fell into three strata, and that Chris and Taylor were a clear cut above the rest (and Kellie a distant cut below.) But there are still questions. Were Chris and Taylor really that good, or were they just the best of a mediocre episode? Was Kellie that bad, or just average on a night in which everyone else brought their A-game? Unless the reviewer includes some descriptive text (even if it's as little as "this episode stunk!") we can't even hazard a guess. We need more contextual information. Which brings us to....
These are the next step up in the critics' food chain. Here, reviewers not only provide ordinals for the evening, but they attempt to assess each performance qualitatively. For example:
Chris - 10 out of 10
Taylor - 9.5
Katherine - 8
Elliot - 8
Ace - 7.5
Paris - 6.5
Kellie - 3
Now we're getting somewhere. According to this reviewer, Chris wasn't just OK on an otherwise bad episode. He was very good in the general scheme of things – presumably, having been assigned a maximum 10 of 10 rating, as good as it gets.
Better still, if this reviewer grades all episodes using the same scale, we now have an idea how each performance fares relative to performances from other episodes. We'll discuss this more later on.
Can we infer also that Kellie had a very bad night? Not quite. The problem here is that every reviewer's scale is relative. Perhaps a '3' in this person's book is very bad, or perhaps it's poor but not awful. If you think we're splitting hairs, you haven't visited too many Idol blogs, have you? It's not uncommon to read reviews along these lines:
Kellie: OMG that was the Worst. Singing. Ever!!!!
Was there a SINGLE NOTE in tune??
My ears are still bleeding! I thought I was going to throw up when she missed that high
note towards the end. Awful, awful, awful - if she starts singing again I'm going
to puncture my eardrums and crawl out onto my fire escape until it's over!!!!
5.5 stars out of 10.
Uh, excuse us, but...5.5 stars??! One shudders to think what this reviewer would consider a 1-star performance, or what he or she would do upon hearing one. Maybe commit ritual suicide in the living room? Anyway, this (made-up) example just underscores the fact that we can't treat rating values as absolute and simply average the scores. We have to convert them as best we can to a common, meaningful scale – mathematicians call this process "normalization", if you're interested. And that takes us to...
Brief, well-written, impartial critiques of each performance are, in our eyes, the cream of the crop. They not only offer the finest description of a performance, but they often provide us with a means to put all of our data into a complete context. Take this hypothetical critique, for example:
Katherine, "Someone To Watch Over Me": Fantastic! Kat really put her
full voice into it. Diana DeGarmo sang this pretty well during AI3, but Katherine's version
was much better. It was her best performance since the semifinals, and along with
Chris, the best of the night.
4.5 stars (of 5)
We would send this reviewer some combination of flowers, chocolates and pizza if we could, because this is ideal for our purposes. Obviously, he or she really liked the performance, but look how much useful information she provided in three short sentences:
We'll take other data sources into account as well, such as results from online polls and public chat rooms, as well as "Best Of The Season" lists that are a staple of some blogs and forums. However, we don't give any of these as much weight as the rankings, ratings, and critiques posted between the performance and the results show. Those make up about 98% of all WhatNotToSing.com approval ratings.
OK, we've collected all these opinions. How do we turn them into numerical approval ratings on a scale of 0 to 100? Here's the approach we use. If you've found the reading to be pretty easy so far, we'd better warn you that the article is about to take a sharp turn for the technical. If you have math anxiety, just skip this chapter.
For each episode, we start by calculating the "ordinal" score of each performance by each reviewer. If there are 10 performances, the one that the reviewer felt was the best of the night is awarded 9 points, because there were nine performances it "beat". The second-best gets 8 points, and so on to the worst performance, which scores zero. We do this for all reviews, then sum the results.
This step is actually far more complex than we just described, because we also adjust for factors such as ties, missed performances, "(gap)"s in a reviewer's list, multiple-song weeks (i.e., the Final 5 onwards), and a few other things. By the end of this step, we have a consensus ranking of the performances and a very rough idea of how they rate in relation to one another.
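The basic ordinal step can be sketched in a few lines of Python. Everything below is illustrative only: the contestant names and rankings are hypothetical, and, as noted above, the real calculation also adjusts for ties, gaps, missed performances, and multi-song weeks.

```python
def ordinal_points(ranking):
    """Best-to-worst list -> {contestant: points}: the best of N
    performances earns N-1 points (it "beat" N-1 others), the worst 0."""
    n = len(ranking)
    return {name: n - 1 - i for i, name in enumerate(ranking)}

def consensus(rankings):
    """Sum each contestant's ordinal points across all reviewers."""
    totals = {}
    for ranking in rankings:
        for name, pts in ordinal_points(ranking).items():
            totals[name] = totals.get(name, 0) + pts
    return totals

# Three hypothetical reviewers' rankings of a four-performance night:
reviews = [
    ["Chris", "Taylor", "Katherine", "Kellie"],
    ["Taylor", "Chris", "Kellie", "Katherine"],
    ["Chris", "Katherine", "Taylor", "Kellie"],
]
print(consensus(reviews))
# Chris: 3+2+3 = 8, Taylor: 2+3+1 = 6, Katherine: 1+0+2 = 3, Kellie: 0+1+0 = 1
```

The totals give the consensus ranking for the night: Chris first, Kellie last.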
Next, we set aside all reviews that contained rankings only; there's nothing else we can glean from them. For the remaining reviews, we start the process over again, but this time we average the ratings of each performance. Because every reviewer uses a different rating scale (1 to 10, or 0 to 5 stars, or A through F), we convert them as best as we can to our 0 to 100 scale.
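As an illustration, the simplest possible conversion is a straight linear map from a reviewer's scale onto 0 to 100. (The function name is ours, and the real conversion is more nuanced, since it also accounts for how each reviewer actually uses their scale.)

```python
def to_percent(value, lo, hi):
    """Linearly map a rating on a reviewer's scale [lo, hi] to 0-100."""
    return 100.0 * (value - lo) / (hi - lo)

print(to_percent(9.5, 0, 10))  # a 9.5-of-10 maps to 95.0
print(to_percent(4.5, 0, 5))   # 4.5 stars of 5 maps to 90.0
```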
We should mention here that we treat some rated reviews more equally than others. In particular, reviewers who use the same rating scale consistently throughout a season, or who provide useful comparison points to other episodes and seasons, or who just include some intelligent commentary (as noted above), are given rather more weight. That's because we're much more confident of being able to quantify their true opinions.
Next, we compare the list of ratings (which have meaningful numbers assigned to each entry) with the list of rankings (which do not). Sometimes they're consistent, but oftentimes there are a few discrepancies. If so, then we adjust the rating list to bring it roughly into line with the ranking list. What's next?
Here's where things get tricky. We have to "normalize" the preliminary results so that they lie in a scale consistent across all episodes and all seasons. Our goal, in other words, is for a performance with an approval rating of 50 in the Season "X" Finale to be equivalent to that of one having scored 50 in the Season "Y" Semifinals.
To do this, we again rely on well-written critiques, plus rated reviews from bloggers and columnists whom we know use consistent rating scales. Sometimes this is easy; we can "eyeball" the results and make slight adjustments to bring everything into line. Sometimes it's not, particularly for episodes in which the standard deviation of reviews is very high.
What's 'standard deviation'? It's a very common statistical measure of how much variance there is in a data set. Say we have two performances with approval ratings of 50, each based on six reviewers' collective opinions. For Performance A, all six reviewers scored it as a 50. For Performance B, three scored it at 80 and three at 20, which averages to 50. The difference is in the standard deviation (often abbreviated as the Greek letter sigma σ): A's is 0, B's is nearly 33. The higher the value, the more differences of opinion there were among the Idolsphere. (For WhatNotToSing.com approval ratings, the average σ is about 18.)
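The worked example above can be checked directly with Python's statistics module (using the sample standard deviation, which is what yields the "nearly 33" figure for Performance B):

```python
from statistics import mean, stdev  # stdev = sample standard deviation

a = [50, 50, 50, 50, 50, 50]  # Performance A: unanimous 50s
b = [80, 80, 80, 20, 20, 20]  # Performance B: a split opinion

print(mean(a), stdev(a))  # 50 and 0.0 -- no disagreement at all
print(mean(b), stdev(b))  # 50 and about 32.86, the "nearly 33" above
```

Same average rating, wildly different levels of agreement.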
To "fit" one episode's performance ratings into the whole, we make use of an old statistics technique known as Least Squares. Continuing the example from the previous chapter, let's say we've calculated Katherine's preliminary rating for the week as 86. But, based on several critiques, we know that it compared favorably to a performance we previously rated at 88, but unfavorably to ones we previously rated at 80 and 82.
Hold on, you shout, that's impossible! How can a performance's numerical rating be >= 88 but at the same time <= 80 and 82 ?! Welcome to our world. People's opinions and tastes vary greatly, so we virtually never have the luxury of mathematically consistent comparison points.
That's where Least Squares comes in. We adjust the rating so as to minimize (deep breath) the sum of the squared differences between the final rating and each comparison point. We could spend ten paragraphs explaining this concept further, or we can just point you to Wikipedia for the details.
At any rate, applying Least Squares to the preliminary rating of 86 plus the three comparison points of 88, 82, and 80, we arrive at a new rating of 84. (Trust us.) Typically we do not go through this complicated process for every performance – rather, we save it for the highest- and lowest-rated performances of the episode, to establish reliable endpoints. Then we let the middle performances "float" to their final resting spots.
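For the curious, the value that minimizes the sum of squared differences to a set of fixed points is simply their mean, which is how 86, 88, 82, and 80 settle at 84. A minimal sketch:

```python
# Preliminary rating plus the three comparison points from the example:
points = [86, 88, 82, 80]

def least_squares_fit(pts):
    """The minimizer of sum((x - p)^2) over fixed points p is their mean."""
    return sum(pts) / len(pts)

print(least_squares_fit(points))  # 84.0

# Sanity check by brute force over the whole 0-100 scale:
best = min(range(101), key=lambda x: sum((x - p) ** 2 for p in points))
print(best)  # 84
```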
At this point we're done with the grunt work, but we sometimes make a few very small adjustments. Primarily, we try to adjust for the fact that some contestants (particularly those who are said to have 'overstayed their welcome' in the competition) acquire strong negative backlashes, and these often seep into the Idolsphere's reviews. So we may throw out opinions from reviewers we believe may have become too emotional or too biased against a certain contestant or his fans.
We sometimes also throw out reviewers whom we believe have become too biased for a certain contestant. We do this very sparingly, because building a strong, rabid fanbase is part of the Idol game. We're highly reluctant to punish a contestant for doing his or her job too well.
As the season progresses, many reviewers start producing "Best Of The Season" and "Worst Of The Season" lists, plus occasional cross-season comparisons. We'll only consider these from reviewers who've established a strong track record with us for impartiality and intelligence. We sometimes use these to tweak previous weeks' ratings, rarely more than a point or two. (This is why we publish a disclaimer that ratings should not be considered final until the season is complete.) Once the season is finished, each performance's approval rating is officially frozen.
To make life a little simpler for our readers, we often refer to approval ratings as ranging from 1 to 5 stars, in 20-point intervals. There's nothing significant about these intervals, mathematically or otherwise. They're just a convenient way to split up the scale.
1 star:  0 to 19
2 stars: 20 to 39
3 stars: 40 to 59
4 stars: 60 to 79
5 stars: 80 to 100
Is it possible for a performance to wind up with a rating below 0 or above 100 after all adjustments are considered? Theoretically, yes. We'd guess the actual limits are roughly -3 and 103. The chance of a performance falling out of the 0-to-100 scale in reality, however, is virtually nil.
Thanks for having read this far! If you have further questions on the rating system or suggestions on how to improve it, we'd love to hear from you.
-- The Ratings Board of WNTS.com