In the last volatility filters post we saw that trades from a simple Trend Following system (20-50 MA cross-over) had a different expectancy depending on the relative level of volatility at trade entry. This suggested that a filter blocking the trades that are most volatile at entry (in the top decile: 90th to 100th percentile of past volatility) would raise the expectancy per trade.
However, this conclusion was obtained from a single observation (i.e. one back-test sample). Ideally we want many samples, to be able to establish more robust conclusions – or at least to be able to calculate a level of confidence in the result from our one observation.
And this is where the bootstrap test comes in handy. In this post I’ll try and illustrate how we can apply the bootstrap concept to a slightly different problem, in order to strengthen back-testing research results.
BOOTSTRAP TO COMPARE TWO POPULATIONS
Instead of using the bootstrap test to check if the profitability of a single back-test is statistically significant, we use it to check whether the difference between 2 sample groups is statistically significant. Here, the 2 sample groups are:
- Trades with entry volatility at 0-90% of historical levels (lower-volatility group)
- Trades with entry volatility at 90-100% of historical levels (high-volatility group)
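To make the grouping concrete, here is a minimal Python sketch (not the original study's code). The trade data is a purely hypothetical stand-in, and the names r_multiples and vol_rank are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the back-test output: one R-multiple and
# one entry-volatility percentile rank (0-100) per trade.
r_multiples = rng.normal(0.2, 1.0, size=400)
vol_rank = rng.uniform(0.0, 100.0, size=400)

# Split the trades into the two groups by entry-volatility decile.
low_vol = r_multiples[vol_rank < 90.0]    # lower-volatility group (0-90%)
high_vol = r_multiples[vol_rank >= 90.0]  # high-volatility group (90-100%)
```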
The difference between the 2 groups can be measured by the difference of their means. In the original samples, the difference between the mean R-multiple values was fairly large, with a ratio of nearly 3-to-1:
- Lower-volatility group: Average R-multiple = 0.27
- High-volatility group: Average R-multiple = 0.10
However, this difference could be down to random variation. Running a statistical test such as a bootstrap will enable us to determine a level of confidence in the results.
As with a two-sample t-test, we first need to formulate the null hypothesis (H0):
H0: the mean of each group is identical (mean1 = mean2). This is equivalent to the difference of the means being nil (mean1 – mean2 = 0)
with the alternative hypothesis (Ha):
Ha: the mean of group 1 (lower-volatility group) is higher than the mean of group 2 (high-volatility group), i.e. mean1 > mean2
The goal of the bootstrap test is to generate the sampling distribution of the difference of the means and to calculate a p-value for rejecting the null hypothesis.
Here are the steps to follow (a code sketch follows the list):
- Form two samples. Each sample contains all trades’ R-multiple values for one of the groups.
- For each resample, select random instances (with replacement) of R-multiples from each group, drawing as many instances as the group originally contains. Calculate the mean of each resampled group and compute their difference (the bootstrapped statistic).
- Perform a large number of resamples to generate a large number of bootstrapped statistics (difference of the means between the 2 group resamples).
- Form the sampling distribution of the difference of means generated in the step above.
- Derive the p-value: the proportion of bootstrapped statistics for which the difference of the means is 0 or less.
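Below is a minimal Python sketch of these steps (my own illustration, not the original study's code), reusing the hypothetical low_vol and high_vol arrays from the earlier snippet:

```python
import numpy as np

def bootstrap_mean_diff(group1, group2, n_resamples=10_000, seed=1):
    """Bootstrap the sampling distribution of mean(group1) - mean(group2).

    Each resample draws, with replacement, as many values from each
    group as the group originally contains.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        s1 = rng.choice(group1, size=group1.size, replace=True)
        s2 = rng.choice(group2, size=group2.size, replace=True)
        diffs[i] = s1.mean() - s2.mean()
    return diffs

# Sampling distribution of the difference of the means
boot_diffs = bootstrap_mean_diff(low_vol, high_vol)

# One-sided p-value: share of bootstrapped differences at or below zero
p_value = np.mean(boot_diffs <= 0.0)
print(f"p-value: {p_value:.4f}")
```

Drawing each resample at the original group size keeps the estimation error of each group's mean comparable to that of the original samples.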
For the case under study, the bootstrap gave us this sampling distribution (10,000 resamples):
We can then calculate the p-value, which is 0.0276 (97.24% of the bootstrapped differences are positive, leaving only 2.76% at zero or below). We can therefore reject the null hypothesis.
The conclusion is that the result is statistically significant at the 5% level (p-value less than 0.05): if the null hypothesis were true, there would be only a slim chance (2.76%) of observing a difference this large, so we run only a small risk of a Type I error in rejecting it.
So, the filter seems to markedly improve the performance of the system. But of course, this is no panacea… It is entirely possible that a filtered-out trade would have ended up being a large runner.
With Trend Following’s dependency on infrequent large winners, missing a major trade because it has been filtered out could make a big difference to that year’s performance.