Shepherd’s *pi* correlation is a robust test for the statistical association between two variables. Robust in this context means that it is less susceptible to outliers and other violations of the assumptions made by standard parametric tests. I developed this procedure together with my colleagues *Geraint Rees* and *Ben de Haas* at the UCL Institute of Cognitive Neuroscience. It is neither very sophisticated nor the only way to treat data sets containing outliers. However, it is meant to be relatively straightforward to apply and it can help refine your analysis. Like all tests, it probably does not work equally well in all situations, but it appears to do a good job of maintaining reasonable power without inflating false alarm rates when used with data that are typical of experiments in systems neuroscience. The MATLAB functions for it are here (they come with documentation and example data):

The procedure works by first bootstrapping the Mahalanobis distances of the data, that is, the distance of each data point from the bivariate mean, taking into account the covariance structure. The distances are bootstrapped because outliers could themselves skew the Mahalanobis distance. Data points whose bootstrapped squared Mahalanobis distance exceeds 6 are then removed as outliers. Shepherd’s *pi* is simply Spearman’s *rho* computed on the remaining data. The p-value is adjusted because – ironically – removing outliers can inflate false alarm rates. More details about this procedure can be found in this article: Better ways to improve standards in brain-behavior correlation analysis.
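The actual implementation is the MATLAB toolbox linked above; purely as an illustration, here is a minimal Python sketch of the idea just described. The function names are mine, the bootstrap scheme is my reading of the procedure, and the doubling of the p-value is one simple form the adjustment could take – consult the article and toolbox for the authoritative details.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_mahalanobis(x, y, n_boot=1000, rng=None):
    """Average squared Mahalanobis distance of each original point from
    the bivariate mean, estimated over bootstrap resamples of the data
    (resampling makes the distance estimate less sensitive to outliers)."""
    rng = np.random.default_rng(rng)
    data = np.column_stack([x, y])
    n = len(data)
    dists = np.zeros(n)
    for _ in range(n_boot):
        sample = data[rng.integers(0, n, n)]          # resample with replacement
        mu = sample.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(sample, rowvar=False))
        diff = data - mu
        # squared Mahalanobis distance: diff @ cov_inv @ diff, per row
        dists += np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return dists / n_boot

def shepherds_pi(x, y, n_boot=1000, threshold=6.0, rng=None):
    """Sketch of Shepherd's pi: Spearman's rho after removing points whose
    bootstrapped squared Mahalanobis distance exceeds `threshold`.
    The p-value is doubled (capped at 1) as a simple correction for the
    false-alarm inflation caused by removing outliers (an assumption here)."""
    d = bootstrap_mahalanobis(x, y, n_boot=n_boot, rng=rng)
    keep = d <= threshold
    rho, p = spearmanr(x[keep], y[keep])
    return rho, min(1.0, 2.0 * p), keep
```

For example, a planted outlier far off an otherwise clean linear relationship should be flagged and excluded, leaving the rank correlation of the remaining points largely intact.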

We do not propose that Shepherd’s *pi* should be used in isolation. Relying only on robust or non-parametric statistics is not sensible because they are generally underpowered compared to parametric tests. When calculating correlations it therefore always makes sense to start with Pearson’s *r* or – if you suspect influential outliers or non-linear relationships – Spearman’s *rho* (although if you know about non-linearity beforehand, it probably makes more sense to transform the data to linearise it). Further, you should bootstrap confidence intervals for the correlation coefficient and compare them to the nominal confidence interval one would expect under the assumptions of the test (a function for this is also included in my toolbox above). Only if there is a major discrepancy between the two intervals are more robust tests really necessary. Finally, keep in mind that a non-significant robust test does not rule out the existence of an effect. You must ask yourself: Do the results still show the same trend? And are the outliers you detected truly biological outliers?
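The bootstrap-versus-nominal comparison could be sketched as follows. This is my own illustration, not the toolbox function: it uses a percentile bootstrap for Pearson’s *r* and the standard Fisher *z*-transform interval as the nominal one, and the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def compare_cis(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for Pearson's r, alongside the nominal CI
    from the Fisher z-transform assumed by the parametric test.
    A large discrepancy between the two suggests the parametric
    assumptions are violated and robust tests are worth running."""
    rng = np.random.default_rng(seed)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    # Bootstrap: resample (x, y) pairs together and recompute r each time
    boot_r = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)
        boot_r[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    boot_ci = np.percentile(boot_r, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # Nominal CI: Fisher z-transform with normal approximation
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    zc = stats.norm.ppf(1 - alpha / 2)
    nominal_ci = np.tanh([z - zc * se, z + zc * se])
    return r, boot_ci, nominal_ci
```

With well-behaved bivariate normal data the two intervals should roughly agree; it is when the bootstrap interval is much wider (or shifted) that the robust follow-up becomes worthwhile.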

Robust statistics should not be taken as ground truth: they can flag up spurious or artifactual data, but their verdicts, too, can be spurious. Consider this completely hypothetical example: you have a basket of apples ranging in colour from bright green to deep red, with various shades in between. Further, there are three random guys called Albert, Carl, and Ramon. They all taste the apples and rate each of them on a scale from sour to sweet. It is conceivable that we will then find a positive correlation between colour and sweetness for the judgements made by all three. In this situation it is also quite likely that there is considerable agreement between the raters on how they rated the individual apples – thus there is also a correlation between the three sets of sweetness ratings. Now imagine you discovered that the correlation between apple colour and Albert’s ratings is strongly driven by outliers. Can we really take this as evidence that this correlation is spurious? That would be flawed reasoning, as the sweetness judgements appear to be quite reliable across our random sample of notable scientists. Rather than the correlation, it is presumably your criteria for identifying outliers that are incorrect.

This illustrates the importance of estimating the reliability of individual data points by replicating or retesting as many of the measures as possible, directly and/or conceptually, and including all of these tests in the analysis. In addition, it may help to test the correlation not only for each of the three raters but also for the mean across their ratings, which may already remove some of the noise associated with the individual ratings. Moreover, you may want to apply the same principle to the colour variable, for instance by obtaining some objective measure of the apples’ colours, such as the wavelength reflected under consistent lighting. Naturally, it is not always possible to collect such additional data, but it is advisable to accumulate as many independent measures as is feasible. One way to address this problem may also be to split the data in half and test whether the measures are consistent.
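The split-half check at the end of that paragraph could look something like this minimal sketch (the function name is mine, and what counts as an acceptable discrepancy between the two halves is a judgement call, not a fixed rule):

```python
import numpy as np

def split_half_correlations(x, y, seed=0):
    """Randomly split the sample into two halves and compute the
    correlation in each; a large discrepancy between the two halves
    suggests the overall correlation is not stable."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    half = len(x) // 2
    h1, h2 = idx[:half], idx[half:]
    r1 = np.corrcoef(x[h1], y[h1])[0, 1]
    r2 = np.corrcoef(x[h2], y[h2])[0, 1]
    return r1, r2
```

For a genuinely stable relationship, both halves should show a similar correlation; repeating the split with different random seeds gives a rough sense of how variable the estimate is.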

You also need to ensure that your measurements are meaningful. Imagine, for example, that we had a fourth rater, Ben, whose ratings are highly erratic and show no evidence of a correlation with colour (for example, he rates half the apples as extremely sour or extremely sweet with no apparent pattern). But then you find out that this is because he hates the taste of apples to begin with, and to overcome his nausea he quickly makes his ratings and moves on to the next apple. This is obviously not evidence against the hypothesis that colour and taste are linked. Rather, it suggests that this rater is not providing any useful data.

These examples underline the notion that outliers are not merely a statistical phenomenon – you must interpret the data at hand and make a judgement on whether a spurious effect is likely. It is not only about the numbers.