With this post, I try to provide an easy-to-understand, step-by-step, practical implementation of White's reality check as presented in A reality check for data snooping. I'm using R for the implementation. Note that at the end of this article you probably won't have a piece of code ready to copy and paste. What you should have instead is a better understanding of the method, which is primarily why I'm going through the paper like this. You should also note that I'm writing this in the specific context of trading and trading strategies/models (it shouldn't matter too much though).
Data snooping is one of the many tools of self-deceit traders use in the process of developing a trading strategy. Appropriately named, White's reality check tries to reduce the level of self-deceit and provides a fairly simple but powerful method of separating the wheat from the chaff in terms of data mined trading strategies/models/patterns/whatever.
Quoting Halbert White,
data snooping occurs when a given set of data is used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results.
In other words, if you're going to data mine for profitable strategies, you better not bet the ranch on every profitable strategy you run across. It's still quite possible that at least some of your results were simply due to luck. Before you bet anything, submit your models to a reality check.
Before we can evaluate pretty much anything, we must first have a benchmark or standard to refer the tested object to. When measuring the performance of trading strategies this benchmark is rather easy to determine. To put it bluntly, if we've measured our performance on detrended data, a good benchmark for any trading strategy is zero. The expected return of a random strategy (i.e. coin flipping) on data having mean zero (detrended) is zero. If we can make more than zero on detrended data then we're probably doing something right (better than coin flipping).
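Just to make that concrete, here's a minimal sketch of detrending a made-up daily price series by removing its in-sample mean log return. That particular detrending choice (and the made-up series) is mine, not something the paper prescribes:

price <- cumprod(c(100, 1 + rnorm(1000, 0.0005, 0.01))) #made-up daily price series with drift
r <- diff(log(price))                                   #daily log returns
r.detrended <- r - mean(r)                              #subtract the in-sample mean: drift removed
round(mean(r.detrended), 12)                            #effectively zero, so coin flipping
                                                        #has zero expected return on this data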
Using White's notation (page 13), we're going to set the benchmark model's return to

$$r_{0,t} = 0, \qquad t = R, \dots, T,$$

where the $R$ to $T$ interval is our observed history containing $n$ values.
The word "model" is what the paper uses and this is as it should be but in our case I find it somewhat inappropriate. We're not trying to measure the difference between a model's predicted price and the benchmark price. Rather, we're trying to compare the returns on the benchmark (zero) with the returns on our strategies. So instead of declaring a generic model as an example, as White does (still on page 13), we're simply going to define the relevant values for our strategies as
where is equity at time , and the collection of daily log returns for each strategy, where each strategy has an index going from to - that is, we're testing strategies. This example assumes of course we're working with daily data. However, this doesn't have to be the case. We could be talking about trades, weeks, whatever, it does not make a difference.
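As a quick sketch of what that looks like in R (the made-up equity curves below are purely my own illustration):

s <- 100                                      #number of strategies
n <- 1000                                     #number of daily observations
E <- t(replicate(s, 10000 * cumprod(1 + rnorm(n + 1, 0, 0.01)))) #made-up equity curves, one per row
r <- t(apply(E, 1, function(e) diff(log(e)))) #s rows of n daily log returns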
We have our daily log returns and we have our benchmark. It's time to measure how well each strategy has done in testing. For each strategy $k$, where $k = 1, \dots, s$, we have the collection of excess daily returns given by

$$f_{k,t} = r_{k,t} - r_{0,t}, \qquad t = R, \dots, T.$$
This might seem a bit silly given that $r_{0,t}$ is zero for all values of $t$, but I'm going through the motions here to try and maintain a parallel to the original paper. We're using a benchmark equal to zero here, so obviously $f_{k,t}$ above is merely the same thing as $r_{k,t}$, the collection of daily log returns, but if the benchmark was something other than zero this step would've made a big difference.
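For completeness, this is roughly what the step would look like in code; the nonzero benchmark r0 below is a hypothetical of mine, just to show where the subtraction would actually matter:

f <- r                        #zero benchmark: excess returns equal the raw log returns
r0 <- rnorm(n, 0.0002, 0.01)  #hypothetical nonzero benchmark, e.g. buy-and-hold log returns
f.alt <- sweep(r, 2, r0)      #excess returns against that benchmark instead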
Before moving on, we must also calculate the mean of $f_{k,t}$ above, that is, the mean daily log excess returns. This we declare as

$$\bar{f}_k = \frac{1}{n}\sum_{t=R}^{T} f_{k,t}.$$
We're almost there. What we need to do now is learn how to run a bootstrap. In math language, we're looking for

$$\bar{f}^*_{k,i} = \frac{1}{n}\sum_{t=R}^{T} f_{k,\theta_i(t)},$$

the collection of mean excess log returns of the bootstrapped samples. Each bootstrapped sample has index $i$, and $i$ goes from $1$ to $N$.
This might look a bit complicated in math language but it's rather easy to implement in code. All we're doing is resampling with replacement each daily log returns series a number of times and then calculating the mean for each resampled series.
Let's put that in R code. First, let's calculate the $N$ sets of random observation indexes of length $n$, the $\theta_i$. There are a few methods for doing this and which one you should use depends mainly on the structure of the data you're working with.
If the data points you're working with are i.i.d. then using a stationary bootstrap of fixed block length 1 would be fine. If the data points are not independent (as is the case with time series, most of the time) you should use a bootstrap method which produces blocks of random length having geometric mean length equal to $1/q$, where $q$ is between $0$ and $1$. The original paper uses one value of $q$ for the simplified example and another for the practical application. There are various methods even for determining an appropriate value of $q$, but we'll keep those out of this post.
Anyways, let's look at the code. First, a method using $q = 1$, that is, a fixed block length equal to $1$.
n <- 1000 #number of daily observations (distance from R to T)
N <- 500  #number of resamples
theta <- matrix(nrow = N, ncol = n) #theta indexes, N indexes of length n
for(i in 1:N){ #generate theta_i indexes
  theta[i,] <- sample(n, n, replace = TRUE) #index i, length n
                                            #each row is a theta
                                            #index collection
}
This works fairly well for data such as the one we're working with, as daily log return points are fairly independent. However, you can also use an R package and do a proper time series bootstrap for random block lengths having geometric mean $1/q$. Let's say $q = 0.5$ and therefore $1/q = 2$. In R:
library(boot) #boot package for ts bootstrap
x <- rnorm(n) #dummy vector of length n, only used to generate the indexes
y <- tsboot(x, mean, R = N, l = 2, sim = "geom") #N resamples, geometric mean block length = 2
theta <- boot.array(y) #returns the N index collections of length n
                       #each row is a theta index
                       #collection
Using these indexes to calculate $\bar{f}^*_{k,i}$ above is now quite simple.
s <- 100 #number of models
f <- matrix(nrow = s, ncol = n) #matrix holding n daily log returns
                                #for each strategy x (in practice, fill each
                                #row with that strategy's excess daily returns)
f.mean.boot <- matrix(nrow = s, ncol = N) #N bootstrapped resample means
                                          #for each x strategy
for(x in 1:s){   #go through each model
  for(i in 1:N){ #resample N times
    f.mean.boot[x,i] <- mean(f[x, theta[i,]]) #mean of resampled f[x,]
                                              #using theta[i,]
  }
}
f.mean.boot[x,i] above is our $\bar{f}^*_{k,i}$. This we'll be using in the next step.
At last, we get to see some actual results - the vees, as I like to call them. I'm talking about, of course, the two V values, $\bar{V}_1$ and $\bar{V}^*_{1,i}$. Let's start with $k = 1$. We have

$$\bar{V}_1 = \sqrt{n}\,\bar{f}_1 \qquad \text{and} \qquad \bar{V}^*_{1,i} = \sqrt{n}\,(\bar{f}^*_{1,i} - \bar{f}_1), \quad i = 1, \dots, N.$$
If we're only testing one strategy, that is, $s = 1$, we don't need anything beyond these first two vees. All we have to do is compare the sample value of $\bar{V}_1$ to the percentiles of $\bar{V}^*_{1,i}$ and see if our trading strategy is any good at a significance level of our choosing.
This is how the code would look for just a single strategy:
f.mean <- numeric(s) #vector holding mean daily returns
                     #for each strategy x
for(x in 1:s){ #go through each strategy
  f.mean[x] <- mean(f[x,]) #mean returns for model x
}
V1 <- n^(1/2)*f.mean[1]
V1.boot <- numeric(N)
for(i in 1:N){
  V1.boot[i] <- n^(1/2)*(f.mean.boot[1,i] - f.mean[1])
}
V1 in the code above is of course $\bar{V}_1$ and V1.boot is $\bar{V}^*_{1,i}$. To compare the sample value of $\bar{V}_1$ to the percentiles of $\bar{V}^*_{1,i}$ all we have to do is write the following code:
p <- length(which(V1.boot > V1))/N #returns p-value for strategy 1
This counts the values higher than V1 among V1.boot and divides that count by N, giving the fraction of all V1.boot values which exceed V1. This is our p-value. If this p-value is lower than our chosen level of significance, the null hypothesis claiming the model produces expected returns lower than or equal to zero can be rejected.
Moving on to strategy number two, i.e. $k = 2$, we have the following:

$$\bar{V}_2 = \max\{\sqrt{n}\,\bar{f}_2,\ \bar{V}_1\} \qquad \text{and} \qquad \bar{V}^*_{2,i} = \max\{\sqrt{n}\,(\bar{f}^*_{2,i} - \bar{f}_2),\ \bar{V}^*_{1,i}\}, \quad i = 1, \dots, N.$$

At this point we're scrapping the first two vees and working with these last ones. These stand for the better of the two models. To test whether the best model among the two beats the benchmark, just like before, we simply compare the sample value of $\bar{V}_2$ to the percentiles of $\bar{V}^*_{2,i}$. In R:
V2 <- max(n^(1/2)*f.mean[2], V1)
V2.boot <- numeric(N)
for(i in 1:N){
  V2.boot[i] <- max(n^(1/2)*(f.mean.boot[2,i] - f.mean[2]), V1.boot[i])
}
This process goes on for all $s$ strategies. At point $k = s$ we'd have

$$\bar{V}_s = \max\{\sqrt{n}\,\bar{f}_s,\ \bar{V}_{s-1}\} \qquad \text{and} \qquad \bar{V}^*_{s,i} = \max\{\sqrt{n}\,(\bar{f}^*_{s,i} - \bar{f}_s),\ \bar{V}^*_{s-1,i}\}, \quad i = 1, \dots, N,$$

and just like before we'd compare $\bar{V}_s$ to the percentiles of $\bar{V}^*_{s,i}$. Specifically, we have to sort the values of $\bar{V}^*_{s,i}$, plug $\bar{V}_s$ in there, count how many values are higher than $\bar{V}_s$, and divide that number by $N$. If we denote the count of those values higher than $\bar{V}_s$ as $M$ we have a bootstrap reality check p-value

$$p = \frac{M}{N}.$$
Note that the paper sets $M$ as the count of those values smaller than $\bar{V}_s$ and then calculates $p = 1 - M/N$. It's the same thing but I like my version better. It's also how I write it in code.
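Just to make the equivalence concrete, here's a tiny sketch using placeholder vees (V and V.boot are stand-in names of mine, not objects defined earlier in the post):

V.boot <- rnorm(N)                           #placeholder bootstrap vees
V <- 1.5                                     #placeholder sample vee
p.mine <- length(which(V.boot > V))/N        #my version: count of bootstrap vees above V, over N
p.paper <- 1 - length(which(V.boot < V))/N   #the paper's way: one minus the count below V, over N
all.equal(p.mine, p.paper)                   #TRUE, barring exact ties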
This is how this iterative process would look in R if we were to keep only the current and previous V values of interest in memory.
prev.V <- n^(1/2)*f.mean[1] #V1
prev.V.boot <- numeric(N)   #V1.boot
for(i in 1:N){
  prev.V.boot[i] <- n^(1/2)*(f.mean.boot[1,i] - f.mean[1])
}
current.V.boot <- numeric(N)
for(x in 2:s){ #start from V2 and go on till Vs
  current.V <- max(n^(1/2)*f.mean[x], prev.V)
  for(i in 1:N){
    current.V.boot[i] <- max(n^(1/2)*(f.mean.boot[x,i] - f.mean[x]), prev.V.boot[i])
  }
  prev.V <- current.V           #carry the running maximum forward...
  prev.V.boot <- current.V.boot #...so the next strategy is compared against it
}
p <- length(which(current.V.boot > current.V))/N #returns p-value for best model
The p-value at the end is what interests us most. If that p-value is smaller than the significance level we chose for our experiment then the result obtained for the best model among all tested models is indeed statistically significant. This means we can reject the null hypothesis claiming the best model's expected returns are no greater than zero, with a certain degree of confidence that our best model didn't do so well purely by chance.
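To tie everything together, here's a minimal end-to-end sketch of the procedure wrapped in a single function. The function name, its arguments and the choice of a geometric mean block length of 2 are my own picks for illustration, not something the paper prescribes:

library(boot)

#f: s x n matrix of daily log excess returns, one row per strategy
#N: number of bootstrap resamples; l: geometric mean block length (1/q)
reality.check <- function(f, N = 500, l = 2){
  s <- nrow(f)
  n <- ncol(f)
  y <- tsboot(rnorm(n), mean, R = N, l = l, sim = "geom") #dummy series, we only want the indexes
  theta <- boot.array(y)                                  #N index collections of length n
  f.mean <- rowMeans(f)                                   #mean excess return per strategy
  V <- sqrt(n) * f.mean[1]                                #running maximum of the sample vees
  V.boot <- sqrt(n) * (apply(theta, 1, function(idx) mean(f[1, idx])) - f.mean[1])
  if(s > 1){
    for(x in 2:s){
      f.mean.boot.x <- apply(theta, 1, function(idx) mean(f[x, idx]))
      V <- max(sqrt(n) * f.mean[x], V)
      V.boot <- pmax(sqrt(n) * (f.mean.boot.x - f.mean[x]), V.boot)
    }
  }
  length(which(V.boot > V))/N #bootstrap reality check p-value for the best strategy
}

#example call on made-up strategies with no edge:
#f <- matrix(rnorm(100*1000, 0, 0.01), nrow = 100)
#reality.check(f) #should usually come out large, i.e. nothing to get excited about

Feeding it pure noise, as in the commented-out example call, should usually produce a large p-value, which is exactly what we'd want from strategies with no edge.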
We've gone through a step-by-step implementation of the bootstrap reality check and arrived at a result which allows us to select the best performing strategy among a series of multiple strategies with a certain degree of confidence that our results are not simply down to good luck. I hope you found this post useful in understanding the method. For more technical details, proofs, etc., please see the original paper, A reality check for data snooping.
Other good reads on the topic would be http://en.wikipedia.org/wiki/False_discovery_rate (for more of an overview/intro), this paper by Benjamini and Hochberg (link), and, for an expanded version of the bootstrap reality check, the paper by Romano and Wolf, Stepwise Multiple Testing as Formalized Data Snooping.