Blogs
The case of the red-haired kids
Dec 31, 2012 02:17 PM , By Vasudevan Mukunth
When going through data - whether from Google or CERN - understanding how and how not to look at the numbers can make a large difference.

Seriously, shame on me for not noticing the release of a product named Correlate until December 2012. Correlate by Google was released in May last year and is a tool to see how two different search trends have panned out over a period of time. But instead of letting you pick out searches and compare them, Correlate saves a bit of time by letting you choose one trend and then automatically picks out trends similar to the one you’ve your eye on.

For instance, I used the “Draw” option and drew a straight, gently climbing line from September 19, 2004, to July 24, 2011 (both randomly selected). Next, I chose “India” as the source of search queries for this line to be compared with, and hit “Correlate”. Voila! Google threw up 10 search trends that varied over time just as my line had.

Since I’ve picked only India, the space from which the queries originate remains fixed, making this a temporal trend - a time-based one. If I’d fixed the time - like a particular day, something short enough to not produce strong variations - then it’d have been a spatial trend, something plottable on a map.

Now, there were a lot of numbers on the results page. The 10 trends displayed in fact were ranked according to a particular number “r” displayed against them. The highest ranked result, “free english songs”, had r = 0.7962. The lowest ranked result, “to 3gp converter”, had r = 0.7653.

And as I moused over the chart itself, I saw two numbers, one each against the two trends being tracked. For example, on March 1, 2009, the “Drawn Series” line had a number +0.701, and the “free english songs” line had a number -0.008, against it.

What do these numbers mean?

This is what I want to really discuss because they have strong implications on how lay people interpret data that appears in the context of some scientific text, like a published paper. Each of these numbers is associated with a particular behaviour of some trend at a specific point. So, instead of looking at it as numbers and shapes on a piece of paper, look at it for what it represents and you’ll see so many possibilities coming to life.

The numbers against the trends, +0.701 for “Drawn Series” (my line) and -0.008 for “free english songs” in March ‘09, are the deviations. The deviation is a lovely metric because it sort of presents the local picture in comparison to the global picture, and this perspective is made possible by the simple technique used to evaluate it.

Consider my line. Each of the points on the line has a certain value. Use this information to find their average value. Now, the deviation is how much a point’s value is away from the average value.

It’s like if 11 red-haired kids were made to stand in a line ordered according to the redness of their hair. If the “average” colour around was a perfect orange, then the kid with the “reddest” hair and the kid with the palest-red hair will be the most deviating. Kids with some semblance of orange in their hair-colour will be progressively less deviating until they’re past the perfect “orangeness”, and the kid with perfectly-orange hair will completely non-deviating.

So, on August 23, 2009, “Drawn Series” was higher than its average value by 0.701 and “free english songs” was lower than its average value by 0.008. Now, if you’re wondering what the units are to measure these numbers: Deviations are dimensionless fractions - which means they’re just numbers whose highness or lowness are indications of intensity.

And what’re they fractions of? The value being measured along the trend being tracked.

Now, enter standard deviation. Remember how you found the average value of a point on my line? Well, the standard deviation is the average value among all deviations. It’s like saying the children fitting a particular demographic are, for instance, 25 per cent smarter on average than other normal kids: the standard deviation is 25 per cent and the individual deviations are similar percentages of the “smartness” being measured.

So, right now, if you took the bigger picture, you’d see the chart, the standard deviation (the individual deviations if you chose to mouse-over), the average, and that number “r”. The average will indicate the characteristic behaviour of the trend - let’s call it “orange” - the standard deviation will indicate how far off on average a point’s behaviour will be deviating in comparison to “orange” - say, “barely orange”, “bloody”, etc. - and the individual deviations will show how “orange” each point really is.

At this point I must mention that I conveniently oversimplified the example of the red-haired kids to avoid a specific problem. This problem has been quite single-handedly responsible for the news-media wrongly interpreting results from the LHC/CERN on the Higgs search.

In the case of the kids, we assumed that, going down the line, each kid’s hair would get progressively darker. What I left out was how much darker the hair would get with each step.

Let’s look at two different scenarios.

Scenario 1: The hair gets darker by a fixed amount each step.

Let’s say the first kid’s got hair that’s 1 units of orange, the fifth kid’s got 5 units, and the 11th kid’s got 11 units. This way, the average “amount of orange” in the lineup is going to be 6 units. The deviation on either side of kid #6 is going to increase/decrease in steps of 1. In fact, from the first to the last, it’s going to be 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, and 5. Straight down and then straight up.

Scenario 2: The hair gets darker slowly and then rapidly, also from 1 to 11 units.

In this case, the average is not going to be 6 units. Let’s say the “orangeness” this time is 1, 1.5, 2, 2.5, 3, 3.5, 4, 5.5, 7.5, 9.75, and 11 per kid, which brings the average to ~4.6591 units. In turn, the deviations are 3.6591, 3.1591, 2.6591, 2, 1591, 1.6591, 1.1591, 0.6591, 0.8409, 2.8409, 5.0909, and 6.3409. In other words, slowly down and then quickly more up.

In the second scenario, we saw how the average got shifted to the left. This is because there were more less-orange kids than more-orange ones. What’s more important is that it didn’t matter if the kids on the right had more more-orange hair than before. That they were fewer in number shifted the weight of the argument away from them!

In much the same way, looking for the Higgs boson from a chart that shows different peaks (number of signature decay events) at different points (energy levels), with taller but fewer peaks to one side and shorter but many more peaks to the other, can be confusing. While more decays could’ve occurred at discrete energy levels, the Higgs boson is more likely (note: not definitely) to be found within the energy-level where decays occur more frequently (in the chart below, decays are seen to occur more frequently at 118-126 GeV/c2 than at 128-138 GeV/c2 or 110-117 GeV/c2).

If there’s a tall peak where a Higgs isn’t likely to occur, then that’s an outlier, a weirdo who doesn’t fit into the data. It’s probably called an outlier because its deviation from the average could be well outside the permissible deviation from the average.

This also means it’s necessary to pick the average from the right area to identify the right outliers. In the case of the Higgs, if its associated energy-level (mass) is calculated as being an average of all the energy levels at which a decay occurs, then freak occurrences and statistical noise are going to interfere with the calculation. But knowing that some masses of the particle have been eliminated, we can constrain the data to between two energy levels, and then go after the average.

So, when an uninformed journalist looks at the data, the taller peaks can catch the eye, even run away with the ball. But look out for the more closely occurring bunches - that’s where all the action is!

If you notice, you’ll also see that there are no events at some energy levels. This is where you should remember that uncertainty cuts both ways. When you’re looking at a peak and thinking “This can’t be it; there’s some frequency of decays to the bottom, too”, you’re acknowledging some uncertainty in your perspective. Why not acknowledge some uncertainty when you’re noticing absent data, too?

While there’s a peak at 126 GeV/c2, the Higgs weighs between 124-125 GeV/c2. We know this now, so when we look at the chart, we know we were right in having been uncertain about the mass of the Higgs being 126 GeV/c2. Similarly, why not say “There’s no decays at 113 GeV/c2, but let me be uncertain and say there could’ve been a decay there that’s escaped this measurement”?

Maybe this idea’s better illustrated with this chart.

- Idea from Prof. Matt Strassler's blog

There’s a noticeable gap between 123 and 125 GeV/c2. Just looking at this chart and you’re going to think that with peaks on either side of this valley, the Higgs isn’t going to be here… but that’s just where it is! So, make sure you address uncertainty when you’re determining presences as well as absences.

So, now, we’re finally ready to address “r”, the Pearson covariance coefficient. It’s got a formula, and I think you should see it. It’s pretty neat.

(TeX: r\quad =\quad \frac { { \Sigma }_{ i=1 }^{ n }({ X }_{ i }\quad -\quad \overset { \_ }{ X } )({ Y }_{ i }\quad -\quad \overset { \_ }{ Y } ) }{ \sqrt { { \Sigma }_{ i=1 }^{ n }{ ({ X }_{ i }\quad -\quad \overset { \_ }{ X } ) }^{ 2 } } \sqrt { { \Sigma }_{ i=1 }^{ n }{ (Y_{ i }\quad -\quad \overset { \_ }{ Y } ) }^{ 2 } } })

The equation says "Let's see what your Pearson covariance, "r", is by seeing how much all of your variations are deviant keeping in mind both your standard deviations."

The numerator is what’s called the covariance, and the denominator is basically the product of the standard deviations. X-bar, which is X with a bar atop, is the average value of X - my line - and the same goes for Y-bar, corresponding to Y - “mobile games”. Individual points on the lines are denoted with the subscript “i”, so the points would be X1, X2, X3, ..., and Y1, Y2, Y3, …”n” in the formula is the size of the sample - the number of days over which we’re comparing the two trends.

The Pearson covariance coefficient is not called the Pearson deviation coefficient, etc., because it normalises the graph’s covariance. Simply put, covariance is a measure of how much the two trends vary together. It can have a minimum value of 0, which would mean one trend’s variation has nothing to do with the other’s, and a maximum value of 1, which would mean one trend’s variation is inescapably tied with the variation of the other’s. Similarly, if the covariance is positive, it means that if one trend climbs, the other would climb, too. If the covariance is negative, then one trend’s climbing would mean the other’s descending (In the chart below, between Oct ’09 and Jan ’10, there’s a dip: even during the dive-down, the blue line is on an increasing note – here, the local covariance will be negative).

Apart from being a conveniently defined number, covariance also records a trend’s linearity. In statistics, linearity is a notion that stands by its name: like a straight line, the rise or fall of a trend is uniform. If you divided up the line into thousands of tiny bits and called each one on the right the “cause” and the one on the left the “effect”, then you’d see that linearity means each effect for each cause is either an increase or a decrease by the same amount.

Just like that, if the covariance is a lower positive number, it means one trend’s growth is also the other trend’s growth, and in equal measure. If the covariance is a larger positive number, you’d have something like the butterfly effect: one trend moves up by an inch, the other shoots up by a mile. This you’ll notice is a break from linearity. So if you plotted the covariance at each point in a chart as a chart by itself, one look will tell you how the relationship between the two trends varies over time (or space).


Please Wait while comments are loading...