(Beware – this article includes a link to some probable spoilers for tomorrow’s Hottest 100 count. You can read this article without reading those spoilers.)
 

You’re probably familiar with Triple J’s Hottest 100. It’s the world’s largest write-in music poll. Last year, Triple J made an easy, shareable link for people to post their votes on Twitter and Facebook. Alas, these links were easily scraped from the web, and the Warmest 100 (link to 2012 count) was born. The top 10 (but not its order) was revealed, and the top three was guessed perfectly.

This year, voters weren’t given a shareable link, but a few thousand people took photos of their confirmation e-mails and posted them to Instagram. With a tiny bit of OCR work, the Warmest 100 guys posted their predictions for this year. They found about half as many votes as they did last year with the scraping method, which is no mean feat, given the lack of indexing.

So the question is — how useful are these votes in predicting the Hottest 100? What songs can we be sure will be in the Hottest 100? How certain is that top 10?

Both years, Justin Warren independently replicated their countdown (spoilers), and he has written up his methodology for collecting the votes this year. I asked him for his data so I could do some analysis; happily, he obliged.

He’s since updated his method, and his counts, and written those up, too (spoilers).

Update: he’s updated his method *again* based on some feedback I offered, and has also written that up (spoilers). This is the data my final visualisation runs off.

So, what have I done with the data?

Bootstrap Analysis

When you have a sample — a small number of votes — from the entire count, you can’t really be certain where each song will actually appear in the count.

In this case, Justin collected 17,000 votes out of an estimated 1.5 million total votes. That’s a sample of roughly 1% of the total estimated vote. It’s a sample, but we have no idea how well it reflects the actual count.

If we think that the votes we have are a representative sample of all of the votes, then what we’d like to know is what would happen if we scaled this sample up to the entire count. Where would songs end up if there’s a slight inaccuracy in our sample?

The good news is that computers give us a way to figure that out!

Bootstrap analysis (due to Efron) is a statistical technique that takes a sample of votes from the whole set of votes, and randomly generates a new set of votes, with about as many votes as the original sample. The trick is that you weight each song by the number of votes it received in the sample, so songs are picked in roughly the same proportion as they appear in the sample. The random sampling based on these weights is what adds the noise.

You can think of this sample as a “noisy” version of the original sample. That is, it will be a version of the original sample, but with slight variation.
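To make that concrete, here’s a minimal sketch of one resampling step in Python. It assumes the sample has already been tallied into a `vote_counts` Counter mapping each song to the number of votes it received in the sample — a hypothetical structure of my own, not Justin’s actual data format.

```python
import random
from collections import Counter

def bootstrap_sample(vote_counts, n_votes=None):
    """Draw one 'noisy' resample of the original sample.

    Songs are drawn with replacement, weighted by their share of the
    sample, so popular songs stay popular on average, but the random
    draw adds a little noise.
    """
    songs = list(vote_counts)
    weights = [vote_counts[song] for song in songs]
    if n_votes is None:
        n_votes = sum(weights)  # about as many votes as the original sample
    draws = random.choices(songs, weights=weights, k=n_votes)
    return Counter(draws)
```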

If you repeat this sampling process several thousand times, and rank the songs each time, you can get a feel for where each song could appear in the rankings.
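Continuing the sketch above, repeating the process and recording each song’s rank in every trial might look like this (the helper names are mine, not Justin’s):

```python
def rank_songs(counts):
    """Rank songs by votes, with 1 being the most-voted song."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {song: position + 1 for position, song in enumerate(ordered)}

def bootstrap_ranks(vote_counts, trials=10_000):
    """Record, for every song, its rank in each bootstrap trial."""
    ranks = {song: [] for song in vote_counts}
    for _ in range(trials):
        trial = rank_songs(bootstrap_sample(vote_counts))
        for song in vote_counts:
            # A song can miss out on votes entirely in a given trial;
            # treat that as ranking below everything else.
            ranks[song].append(trial.get(song, len(vote_counts) + 1))
    return ranks
```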

How do you do that? Well, you can look at all of the rankings a given song gets across the randomised sets. Sort this list, and pick the middle 98% of them. Based on that middle 98% of rankings, you can be 98% confident that the song will land at one of those positions. In statistics, this middle 98% is called a 98% bootstrap confidence interval.

You can repeat this for different confidence levels by picking a different proportion of the rankings around the middle.
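Extracting an interval from a song’s list of bootstrap ranks is just a matter of sorting and trimming the tails. A sketch, again with my own hypothetical helper names:

```python
def confidence_interval(rank_list, level=0.98):
    """The middle `level` fraction of a song's bootstrap rankings."""
    ordered = sorted(rank_list)
    tail = (1.0 - level) / 2.0
    lo = int(tail * len(ordered))
    hi = int((1.0 - tail) * len(ordered)) - 1
    return ordered[lo], ordered[hi]

# e.g. the 98% and 70% intervals for one song's collected ranks:
# confidence_interval(ranks["Some Song"], 0.98)
# confidence_interval(ranks["Some Song"], 0.70)
```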

I’ve used Google Spreadsheets to visualise these confidence intervals. The lightest blues are the 99% confidence intervals. The darkest blue intervals are the 70% confidence intervals. The darkest single cell is the median, i.e. the middle of all of the rankings that we collected for that song in the bootstrap process.

The visualisation is up on Google Docs. (spoilers, etc).

I’ve also run the same visualisation on Justin’s 2012 data; it’s less of a spoiler than the 2013 version, if you care about that, and it can inform the rest of the article for you.

Notes

First up, a bit on my methodology: Justin’s data didn’t separate votes into their original ballots, so I had to pick songs individually. To improve accuracy, I selected songs in blocks of 10, where each song in a block is different; this vaguely resembles the actual voting process, in which each voter picks 10 different songs.
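Here’s how that block-of-10 variation might look as a sketch, reusing the hypothetical `vote_counts` and weighted drawing from earlier:

```python
def bootstrap_sample_in_blocks(vote_counts, n_ballots):
    """Resample votes in blocks of 10 distinct songs, loosely
    mimicking real ballots (each voter picks 10 different songs)."""
    songs = list(vote_counts)
    weights = [vote_counts[song] for song in songs]
    draws = Counter()
    for _ in range(n_ballots):
        ballot = set()
        while len(ballot) < 10:
            # Keep drawing until the block holds 10 distinct songs.
            ballot.add(random.choices(songs, weights=weights, k=1)[0])
        draws.update(ballot)
    return draws
```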

In my experiments, I ran the sampling and ranking process 10,000 times.

You’ll notice some interesting trends in this visualisation. The first one is that the higher the song is in the countdown, the narrower its blue interval is. Why is this so?

Well, as songs get more popular, the gap in votes between neighbouring songs grows. In Justin’s sample of the votes, #73 and #100 were separated by just 15 votes, so neighbouring songs in that stretch are often separated by a single vote or less. If one or two votes changed between #73 and #100, that ordering could change spectacularly. Given Justin’s sample is 17,000 votes, 15 votes represents a 0.1% change in the vote.

So at those low rankings, a tiny change in votes can make for a massive difference in ranking.

At the other end of the count, #1 and #2 are separated by 16 votes, #3 and #4 by 22 votes, and #4 and #5 by 51 votes. Down the bottom of the list, 16 votes could move a song 33 places in our count; at the top, those same 16 votes would need to change hands just to swap positions 1 and 2.

What this means from a statistical perspective is that the closer to the top you are, the more work you need to do to change your position in the count.

You’ll also see this phenomenon on the right-hand side of the intervals: a band of a given colour will generally stretch further to the right of the median than the same colour does to the left. Once again, this is because lower ranks swap around more easily than higher ranks.

Update: Since writing this article, I ran one more test – how many of the songs in the top 100 of Justin’s raw sample of votes will make it into the actual Hottest 100? Well, bootstrapping helps us here too. For each bootstrap trial, I take the top 100 songs and see how many of those are in the raw top 100. I reckon, with 98% confidence, that we’ll get 91 songs in the actual Hottest 100. Thanks to David Quach for the suggestion.
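A sketch of that check under the same assumptions as before, reusing the hypothetical `bootstrap_sample` and `vote_counts`; I’m reading “98% confidence” as the overlap that at least 98% of trials reach, which is my interpretation rather than a statement of the exact method used:

```python
def top_n(counts, n=100):
    """The n most-voted songs in a tally."""
    return set(sorted(counts, key=counts.get, reverse=True)[:n])

raw_top_100 = top_n(vote_counts)
overlaps = sorted(
    len(top_n(bootstrap_sample(vote_counts)) & raw_top_100)
    for _ in range(10_000)
)
# The overlap that 98% of trials meet or exceed:
print(overlaps[int(0.02 * len(overlaps))])
```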

In summary: the Warmest 100 approach is statistically a very good indicator of the top 4 songs. The top 4 is almost certainly correct (except that 1&2 and 3&4 might swap around between themselves). Everything up to #7 will probably be in the top 10.

The sampling approach is less accurate at the bottom, but I’m pretty confident everything in the top 70 will be in the actual top 100.

I’m also pretty confident that 91 of the songs in the raw top 100 will appear in the actual top 100.

End

I’ll be making some notes on how these confidence intervals got borne out in the actual count on Monday. I’m very interested to see how this analysis gives us a better idea of how accurate the Warmest 100 actually is.