August 12th, 2014

White's reality check for data snooping in R

With this post, I try to provide an easy-to-understand, step-by-step, practical implementation of White's reality check as presented in A reality check for data snooping. I'm using R for the implementation. Note that by the end of this article you probably won't have a piece of code ready to copy and paste. What you should have instead is a better understanding of the method, which is primarily why I'm going through the paper like this. Also note that I'm writing this in the specific context of trading and trading strategies/models (though that shouldn't matter too much).

Data snooping

Data snooping is one of the many forms of self-deceit traders fall into while developing a trading strategy. Appropriately named, White's reality check tries to reduce the level of self-deceit and provides a fairly simple but powerful method of separating the wheat from the chaff among data-mined trading strategies/models/patterns/whatever.

Quoting Halbert White,

data snooping occurs when a given set of data is used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results.

In other words, if you're going to data mine for profitable strategies, you better not bet the ranch on every profitable strategy you run across. It's still quite possible that at least some of your results were simply due to luck. Before you bet anything, submit your models to a reality check.

Bootstrap Reality Check

Step 1 - the benchmark

Before we can evaluate pretty much anything, we must first have a benchmark or standard to measure the tested object against. When measuring the performance of trading strategies this benchmark is rather easy to determine. To put it bluntly, if we've measured our performance on detrended data, a good benchmark for any trading strategy is zero. The expected return of a random strategy (i.e. coin flipping) on data with mean zero (detrended) is zero. If we can make more than zero on detrended data then we're probably doing something right (better than coin flipping).

Using White's notation (page 13), we're going to set the benchmark model $\hat h_0$ to

$$\hat h_{0, t+1} = 0, \quad (t = R, ..., T)$$

where the $R$ to $T$ interval is our observed history containing $n$ values.
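As a side note, "detrended" simply means the mean return has been removed from the data. Here's a minimal sketch of what that looks like, using a made-up `prices` series (not from the paper):

```r
# Hypothetical example: detrend a price series by removing its mean log return
set.seed(1)
prices <- 100 * cumprod(exp(rnorm(1000, mean = 0.0005, sd = 0.01))) # toy prices with drift
log.ret <- diff(log(prices))         # daily log returns
detrended <- log.ret - mean(log.ret) # subtract the drift
mean(detrended)                      # effectively zero: hence the zero benchmark
```

Once the drift is gone, a coin-flipping strategy has an expected return of zero, which is exactly why zero works as a benchmark.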

Step 2 - the models (strategies)

The word "model" is what the paper uses, and rightly so, but in our case I find it somewhat inappropriate. We're not trying to measure the difference between a model's predicted price and the benchmark price. Rather, we're trying to compare the returns on the benchmark (zero) with the returns on our strategies. So instead of declaring a generic model as an example, as White does (still on page 13), we're simply going to define the relevant values for our strategies as

$$\hat h_{x, t+1} = \ln\left(\frac{eq_{t+1}}{eq_t}\right), \quad (t = R, ..., T), (x = 1, ..., s)$$

where $eq_t$ is equity at time $t$, and $\hat h_{x, t+1}$ is the collection of daily log returns for each strategy, each strategy having an index $x$ going from $1$ to $s$ - that is, we're testing $s$ strategies. This example assumes, of course, that we're working with daily data. However, this doesn't have to be the case. We could be talking about $n$ trades, weeks, whatever; it doesn't make a difference.
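As a sketch of this step, assuming we have each strategy's equity curve as a row of a matrix (the `eq` matrix below is entirely hypothetical, simulated just for illustration):

```r
# Hypothetical: s strategies, each with n+1 daily equity values
set.seed(2)
s <- 3; n <- 1000
eq <- t(apply(matrix(rnorm(s * (n + 1), mean = 0, sd = 0.01), nrow = s), 1,
              function(r) 100 * exp(cumsum(r)))) # toy equity curves, always positive
h <- t(apply(eq, 1, function(e) diff(log(e))))   # s x n matrix of daily log returns
```

Each row of `h` is one strategy's series of daily log returns, i.e. one collection of $\hat h_{x, t+1}$ values.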

Step 3 - performance measure

We have our daily log returns and we have our benchmark. It's time to measure how well each strategy has done in testing. For each strategy $x$, $x = 1, ..., s$, we have the collection of excess daily returns given by

$$\hat f_{x, t+1} = \hat h_{x, t+1} - \hat h_{0, t+1}, \quad (t = R, ..., T), (x = 1, ..., s)$$

This might seem a bit silly given that $\hat h_{0, t+1}$ is zero for all values of $t$, but I'm going through the motions here to maintain a parallel with the original paper. Since we're using a benchmark equal to zero, $\hat f_{x, t+1}$ above is obviously the same thing as $\hat h_{x, t+1}$, the collection of daily log returns, but if the benchmark were something other than zero this step would make a real difference.

Before moving on, we must also calculate the mean of $\hat f_{x, t+1}$ above, that is, the mean daily excess log return. This we declare as

$$\bar f_x = \frac{1}{n} \sum_{t=R}^{T} \hat f_{x, t+1}, \quad (x = 1, ..., s)$$
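With a zero benchmark, this whole step is a couple of lines in R. Here's a self-contained sketch, where `h` is a stand-in (simulated) $s \times n$ matrix of daily log returns, one row per strategy:

```r
# Sketch: excess returns over a zero benchmark
set.seed(3)
s <- 3; n <- 1000
h <- matrix(rnorm(s * n, sd = 0.01), nrow = s) # stand-in for the s x n daily log returns
benchmark <- 0          # the benchmark return series is identically zero
f <- h - benchmark      # excess returns; identical to h for a zero benchmark
f.mean <- rowMeans(f)   # mean daily excess log return, one value per strategy
```

With a nonzero benchmark you'd subtract its return series element-wise instead of the constant zero.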

Step 4 - the bootstrap

We're almost there. What we need to do now is learn how to run a bootstrap. In math language, we're looking for the collection of mean excess log returns of the bootstrapped samples. Each bootstrapped sample has index $i$ going from $1$ to $N = 500$.

$$\bar f_{x,i}^* = \frac{1}{n} \sum_{t=R}^{T} \hat f_{x, \theta_i(t)+1}, \quad (x = 1, ..., s), (i = 1, ..., N)$$

This might look a bit complicated in math language but it's rather easy to implement in code. All we're doing is resampling each daily log return series $x$ with replacement $N$ times and then calculating the mean of each resampled series.

Let's put that in R code. First, let's calculate the $N$ sets of random observation indexes of length $n$, $\{\theta_i(t), t = R, ..., T\}, i = 1, ..., N$. There are a few methods for doing this and which one you should use depends mainly on the structure of the data you're working with.

If the data points you're working with are i.i.d. then using a bootstrap with fixed block length 1 is fine. If the data points are not independent (as is usually the case with time series) you should use a bootstrap method which produces blocks of random length having geometric mean length equal to $1/q$, where $q$ is between $0$ and $1$. The original paper uses $q = 1$ for the simplified example and $q = 0.5$ for the practical application. There are various methods for determining an appropriate value of $q$, but we'll keep those out of this post.

Anyway, let's look at the code. First, a method using $q = 1$, that is, a fixed block length equal to $1$.

n <- 1000 # number of daily observations (distance from R to T)
N <- 500  # number of resamples
theta <- matrix(nrow = N, ncol = n) # theta indexes, N indexes of length n
for (i in 1:N) { # generate theta_i indexes
    theta[i, ] <- sample(n, n, replace = TRUE) # row i is one theta index collection of length n
}

This works fairly well for data such as ours, as daily log return points are fairly independent. However, you can also use an R package and do a proper time series bootstrap with random block lengths having geometric mean $1/q$. Let's say $q = 0.5$, so $1/q = 2$. In R:

library(boot) # boot package for ts bootstrap
x <- rnorm(n) # random vector of length n
y <- tsboot(x, mean, R = N, l = 2, sim = "geom") # N resamples, geometric mean block length 2
theta <- boot.array(y) # N index collections of length n;
                       # each row is a theta index collection

Using these indexes to calculate $\bar f_{x,i}^*$ above is now quite simple.

s <- 100 # number of models
f <- matrix(nrow = s, ncol = n) # matrix holding n daily log returns for each strategy x
                                # (assumed filled in with your strategies' returns)
f.mean.boot <- matrix(nrow = s, ncol = N) # N bootstrapped resample means for each strategy x
for (x in 1:s) { # go through each model
    for (i in 1:N) { # resample N times
        f.mean.boot[x, i] <- mean(f[x, theta[i, ]]) # mean of resampled f[x,] using theta[i,]
    }
}

f.mean.boot[x,i] above is our $\bar f_{x,i}^*$. We'll be using this in the next step.

Step 5 - the vees

At last, we get to see some actual results - the vees, as I like to call them. I'm talking about, of course, the two V values, $\bar V_x$ and $\bar V_{x,i}^*$. Let's start with $x = 1$. We have

$$\bar V_1 = n^{1/2} \bar f_1$$
$$\bar V_{1,i}^* = n^{1/2} (\bar f_{1,i}^* - \bar f_1)$$

If we're only testing one strategy, that is, $s = 1$, we don't need anything beyond these first two vees. All we have to do is compare the sample value of $\bar V_1$ to the percentiles of $\bar V_{1,i}^*$ and see if our trading strategy is any good at a significance level $\alpha$ of our choosing.

This is what the code would look like for just a single strategy:

f.mean <- numeric(s) # vector holding mean daily returns for each strategy x
for (x in 1:s) { # go through each strategy
    f.mean[x] <- mean(f[x, ]) # mean returns for model x
}
V1 <- n^(1/2) * f.mean[1]
V1.boot <- numeric(N)
for (i in 1:N) {
    V1.boot[i] <- n^(1/2) * (f.mean.boot[1, i] - f.mean[1])
}

V1 in the code above is of course $\bar V_1$ and V1.boot is $\bar V_{1,i}^*$. To compare the sample value of $\bar V_1$ to the percentiles of $\bar V_{1,i}^*$, all we have to do is write the following code:

p <- length(which(V1.boot > V1))/N #returns p-value for strategy

This counts the values higher than V1 among V1.boot and divides that count by N to get the fraction of all V1.boot values which exceed V1. This is our p-value. If this p-value is lower than our $\alpha$ level of significance, then the null hypothesis claiming the model produces returns less than or equal to zero can be rejected.

Moving on to strategy number two, i.e. $x = 2$, we have the following:

$$\bar V_2 = \max\{ n^{1/2} \bar f_2, \bar V_1 \}$$
$$\bar V_{2,i}^* = \max\{ n^{1/2} (\bar f_{2,i}^* - \bar f_2), \bar V_{1,i}^* \}$$

At this point we're scrapping the first two vees and working with these last ones. These stand for the better of the two models. To test if the best model among the two beats the benchmark, just like before, we simply compare the sample value of $\bar V_2$ to the percentiles of $\bar V_{2,i}^*$. In R:

V2 <- max(n^(1/2) * f.mean[2], V1)
V2.boot <- numeric(N)
for (i in 1:N) {
    V2.boot[i] <- max(n^(1/2) * (f.mean.boot[2, i] - f.mean[2]), V1.boot[i])
}

This process goes on for all strategies $x = 1, ..., s$. At point $x = s$ we'd have

$$\bar V_s = \max\{ n^{1/2} \bar f_s, \bar V_{s-1} \}$$
$$\bar V_{s,i}^* = \max\{ n^{1/2} (\bar f_{s,i}^* - \bar f_s), \bar V_{s-1,i}^* \}$$

and just like before we'd compare $\bar V_s$ to the percentiles of $\bar V_{s,i}^*$. Specifically, we have to sort the values of $\bar V_{s,i}^*$, plug $\bar V_s$ in there, count how many values are higher than $\bar V_s$, and divide that number by $N$. If we denote the count of values higher than $\bar V_s$ as $M$, we have a bootstrap reality check p-value

$$p = \frac{M}{N}$$

Note that the paper sets $M$ as the count of values smaller than $\bar V_s$ and then calculates $p = 1 - \frac{M}{N}$. It's the same thing, but I like my version better. It's also how I write it in code.

This is what the iterative process would look like in R if we were to keep only the current and previous V values of interest in memory.

prev.V <- n^(1/2) * f.mean[1] # V1
prev.V.boot <- numeric(N)     # V1.boot
for (i in 1:N) {
    prev.V.boot[i] <- n^(1/2) * (f.mean.boot[1, i] - f.mean[1])
}
current.V.boot <- numeric(N)
for (x in 2:s) { # start from V2 and go on till Vs
    current.V <- max(n^(1/2) * f.mean[x], prev.V)
    for (i in 1:N) {
        current.V.boot[i] <- max(n^(1/2) * (f.mean.boot[x, i] - f.mean[x]), prev.V.boot[i])
    }
    prev.V <- current.V           # carry the running maxima forward
    prev.V.boot <- current.V.boot # before moving to the next strategy
}
p <- length(which(current.V.boot > current.V)) / N # p-value for best model

The p-value at the end is what interests us most. If that p-value is smaller than the $\alpha$ value we chose for our experiment, then the result obtained for the best model among all tested models is indeed statistically significant. This means we can reject the null hypothesis that the best model produces zero expected returns, with a certain degree of confidence that our best model didn't do so well purely by chance.
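To see the whole procedure end to end, here's a self-contained toy run on simulated zero-mean returns, so the null is true by construction and any "best" strategy is pure luck. All the data here is made up for illustration, and the inner loop is replaced with a vectorized `pmax` that computes the same running maximum:

```r
set.seed(42)
n <- 1000; N <- 500; s <- 100
f <- matrix(rnorm(s * n, mean = 0, sd = 0.01), nrow = s) # fake excess returns, no real edge

theta <- matrix(nrow = N, ncol = n)
for (i in 1:N) theta[i, ] <- sample(n, n, replace = TRUE) # i.i.d. bootstrap indexes

f.mean <- rowMeans(f) # mean excess return per strategy
f.mean.boot <- matrix(nrow = s, ncol = N) # bootstrapped resample means
for (x in 1:s)
    for (i in 1:N)
        f.mean.boot[x, i] <- mean(f[x, theta[i, ]])

prev.V <- n^(1/2) * f.mean[1]
prev.V.boot <- n^(1/2) * (f.mean.boot[1, ] - f.mean[1])
for (x in 2:s) { # running maxima across all s strategies
    prev.V <- max(n^(1/2) * f.mean[x], prev.V)
    prev.V.boot <- pmax(n^(1/2) * (f.mean.boot[x, ] - f.mean[x]), prev.V.boot)
}
p <- length(which(prev.V.boot > prev.V)) / N
p # with no real edge in the data, a small p-value here would be spurious
```

Running this a few times with different seeds gives a feel for how the reality check behaves when there's genuinely nothing to find.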

Conclusions

We've gone through a step by step implementation of the bootstrap reality check and arrived at a result which allows us to select the best performing strategy among a series of multiple strategies with a certain degree of confidence that our results are not simply the result of good luck. I hope you found this post useful in understanding the method. For more technical details, proofs, etc, please see the original paper, A reality check for data snooping.

Other good reads on the topic would be http://en.wikipedia.org/wiki/False_discovery_rate (for more of an overview/intro), this paper by Benjamini and Hochberg (link), and, for an expanded version of the bootstrap reality check, the paper by Romano and Wolf, Stepwise Multiple Testing as Formalized Data Snooping.