In the last volatility filters post we saw that trades from a simple Trend Following system (20-50 MA cross-over) had **different expectancy based on the relative level of volatility** at trade entry. This suggested that a filter blocking the trades entered at the highest volatility levels (in the top decile: 90 to 100% of past volatility) would raise the expectancy per trade.

However, this conclusion was obtained via a **single observation** (i.e. one back-test sample). Ideally we want many samples, to be able to establish more robust conclusions – or at least to calculate a **level of confidence** in the result from our one observation.

And this is where the **bootstrap test** comes in handy. In this post I’ll try and illustrate how we can apply the bootstrap concept to a slightly different problem, in order to strengthen back-testing research results.

**BOOTSTRAP TO COMPARE TWO POPULATIONS**

Instead of using the bootstrap test to check if the profitability of a single back-test is statistically significant, we use it to check whether the **difference between 2 sample groups** is statistically significant. Here, the 2 sample groups are:

- Trades with entry volatility at 0-90% of historical levels (lower-volatility group)
- Trades with entry volatility at 90-100% of historical levels (high-volatility group)

The difference between the 2 groups can be measured by the difference of the means of each group. In the original samples, the difference between the **R-multiple mean values** was fairly large, with a ratio of nearly 3-to-1:

- Lower-volatility group: Average R-multiple = 0.27
- High-volatility group: Average R-multiple = 0.10
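
Computing this observed statistic in code is a one-liner. Here is a minimal sketch, assuming the R-multiples of each group sit in two NumPy arrays (`low_vol_r` and `high_vol_r` are hypothetical names, filled with illustrative placeholder values rather than the actual trades from the back-test):

```python
import numpy as np

# Hypothetical R-multiple samples, one value per trade (illustrative values only)
low_vol_r = np.array([0.8, -1.0, 2.5, -0.5, 1.3, 0.6, -1.0, 1.9])  # entries at 0-90% of past volatility
high_vol_r = np.array([-1.0, 0.6, -0.8, 1.5, 0.2, -1.0, 1.3])      # entries at 90-100% of past volatility

# Observed test statistic: difference of the group means
observed_diff = low_vol_r.mean() - high_vol_r.mean()
```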

However, this difference could be down to **random variation**. Running a statistical test such as a bootstrap will enable us to determine a **level of confidence** in the results.

Similarly to a two-sample t-test, we need to formulate the null hypothesis (H0):

H0: the mean of each group is identical (mean1 = mean2). This is equivalent to the difference of the means being nil (mean1 – mean2 = 0)

with the alternative hypothesis (Ha):

Ha: the mean of group 1 (lower-volatility group) is higher than the mean of group 2 (high-volatility group) – making this a one-sided test

The goal of the bootstrap test is to generate the **sampling distribution of the difference of the means** and to calculate a p-value for rejecting the null hypothesis.

Here are the steps to follow (a code sketch implementing them comes just after this list):

- Form two samples. Each sample contains all the R-multiple values of the trades in one group.
- For each resample, draw random instances (with replacement) of R-multiples from each group, keeping the same number of instances as there are trades in that group. Calculate the mean of each resampled group and compute their difference (the bootstrapped statistic).
- Perform a large number of resamples to generate a large number of bootstrapped statistics (differences of the means between the 2 group resamples).
- Form the sampling distribution of the differences of means generated in the step above.
- Derive the p-value of the difference of the means being 0 or less.
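
Putting these steps together, here is a minimal NumPy sketch of the two-sample bootstrap. It reuses the hypothetical `low_vol_r` / `high_vol_r` arrays from the earlier snippet; `n_resamples` and the seed are parameters of the sketch, not values taken from the original study:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the resamples are reproducible
n_resamples = 10_000

boot_diffs = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample each group with replacement, keeping each group's own size
    low_resample = rng.choice(low_vol_r, size=low_vol_r.size, replace=True)
    high_resample = rng.choice(high_vol_r, size=high_vol_r.size, replace=True)
    # Bootstrapped statistic: difference of the resampled means
    boot_diffs[i] = low_resample.mean() - high_resample.mean()
```

The `boot_diffs` array is the sampling distribution of the difference of the means: each entry is one plausible value of the statistic under resampling variation.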

Running the bootstrap on the case under study (10,000 resamples) generated the sampling distribution of the difference of the means.

From it, we can calculate the **p-value**, which is **0.0276** (97.24% of the bootstrapped differences are positive, leaving only 2.76% at zero or below). We can therefore reject the null hypothesis.
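
In code, the p-value falls straight out of the `boot_diffs` array from the sketch above: it is simply the fraction of bootstrapped differences at or below zero.

```python
# p-value: fraction of bootstrapped mean differences that are zero or negative
p_value = np.mean(boot_diffs <= 0)
print(f"p-value: {p_value:.4f}")  # 0.0276 on the actual trade data; will differ for the toy arrays above
```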

The conclusion is that the result is statistically significant at the 5% level (p-value below 0.05): if the null hypothesis were true, there would be only a slim chance (2.76%) of observing a difference this large – the risk of a Type I error we accept by rejecting it.

So, the filter seems to markedly improve the performance of the system. But of course, this is not a panacea… It is entirely possible that a filtered-out trade would have ended up being a large runner.

With Trend Following’s dependency on infrequent large winners, missing a major trade because it has been filtered out could make a big difference to that year’s performance.