Kemp’s Blog

A technical blog about technical things

Careful data handling

While processing some experimental data recently I was comparing two data streams and saw:

RMSE   Corr   n    p-value
3.26   0.19   39   0.25

In this case, “Corr” is Pearson’s correlation coefficient and the quoted p-value is for the significance of that correlation. Given that the data lie on a scale of -4 to 4, an RMSE of 3.26 and a correlation of 0.19 make it fairly safe to say that the two data streams don’t have very much in common. The small sample size is unfortunate but couldn’t be helped in this case.
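For reference, here is a minimal sketch of how these summary statistics can be computed using only the Python standard library. The function names (`rmse`, `pearson`, `t_statistic`) are hypothetical, and the post does not say which tool produced the numbers above. The `t_statistic` helper assumes the p-value comes from the usual t-test on the correlation with n - 2 degrees of freedom, which is an assumption rather than something stated here.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pearson(a, b):
    """Pearson's correlation coefficient r (undefined if either input is constant)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length, at least 2 samples")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


def t_statistic(r, n):
    """t statistic for testing r against zero (assumed test, n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

Under that assumption, plugging in the values from the table (`t_statistic(0.19, 39)`) gives a t statistic of roughly 1.18, which is small for 37 degrees of freedom and fits with a p-value nowhere near the usual significance thresholds.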

It wasn’t quite that simple, however: I noticed that two of the samples shouldn’t have been included in the comparison (for very well-defined experimental reasons). Two points out of 39 shouldn’t make a big difference, right? 😉 Having removed these points from the comparison, I got the following results:

RMSE   Corr   n    p-value
2.05   0.69   37   2.14e-06

As you can see, this made a significant difference (literally). Making a scatterplot of these results earlier on would have pointed me to this mistake much more quickly.
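To illustrate how much two bad points can move these statistics, here is a small sketch on made-up data. Everything in it is hypothetical: the sample values, the exclusion reason, and the `summarize` helper are invented for illustration and are not the experimental data discussed here. Each sample carries an explicit reason for exclusion (or `None`), so the decision to drop a point is recorded alongside the data.

```python
import math

# Hypothetical samples: (stream_a, stream_b, reason_to_exclude or None).
# Values are invented for illustration only.
SAMPLES = [
    (-3.0, -2.5, None),
    (-2.0, -2.5, None),
    (-1.0, -0.5, None),
    (0.0, 0.5, None),
    (1.0, 0.5, None),
    (2.0, 2.5, None),
    (3.0, 2.5, None),
    (-3.5, 4.0, "made-up reason: instrument fault"),
    (3.5, -4.0, "made-up reason: instrument fault"),
]


def summarize(samples):
    """Return (RMSE, Pearson r, n) for a list of (a, b, reason) samples."""
    a = [s[0] for s in samples]
    b = [s[1] for s in samples]
    n = len(samples)
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return rmse, cov / math.sqrt(ss_a * ss_b), n


# Compare the statistics with and without the flagged samples.
all_stats = summarize(SAMPLES)
kept_stats = summarize([s for s in SAMPLES if s[2] is None])

for label, (rmse, corr, n) in (("all", all_stats), ("kept", kept_stats)):
    print(f"{label:5s} RMSE={rmse:.2f} Corr={corr:.2f} n={n}")
```

In this toy case the two flagged points sit at opposite corners of the range, so they are enough to cancel out an otherwise strong positive correlation. With so few samples, points at the extremes carry a lot of leverage.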

Remember, kids: 1) you need to know your data (why is each of those points included in the analysis?), and 2) scatterplots are your friend when looking at any type of correlation.