Backtesting Bias:
Feels Good, Until You Blow Up

In an ideal trading universe (free from backtesting bias), we’d all have a big golden “causation magnifying glass”. Through the lens of this fictional tool, you’d zoom in and understand the fleeting, enigmatic nature of the financial markets, stripping bare all its causes and effects.

Knowing exactly what causes exploitable inefficiencies would make predicting market behaviour and building profitable trading strategies a fairly cushy gig, right?

If you’re an engineer or scientist reading this, you are probably nodding along, hoping I’ll say the financial markets show some kind of domino effect for capitalists. That you can model them with the kinds of analytical methods you’d throw at a construction project or a petri dish. But unfortunately, trying to shoehorn the markets into formulas is a futile exercise… like stuffing Robot Wealth’s frontman Kris into a suit.

Since the markets aren’t strictly deterministic, this makes testing your new and exciting strategy ideas a bit tricky. We’d all love to know for sure whether our ideas will be profitable before we throw real money at them. But, since you can’t realistically apply mathematical equations to your strategy to derive its future performance, you’ll need to resort to the next best thing — experimentation.

By experimenting with your trading strategy during development, you’re left wide open to some fatal errors when assessing its potential future performance. These can, and will, cost you time, money and many, many headaches.

In this post, you’re going to learn how to identify and avoid several common, often expensive backtesting biases so you can build more robust, more profitable systematic trading strategies. Let’s go!

How do you experiment as a systematic trader?

As far as I know, you’ve got two options when it comes to testing systematic trading strategies:

  1. You can find out the actual performance of your strategy
  2. You can find out the likely performance of your strategy

The first option simply involves throwing real money at a live version of your strategy and seeing if it does anything exciting. The second is testing your strategy, usually on past data, before following option 1 later down the pipeline. If, like us, you don’t enjoy needlessly throwing away wads of your capital, you’ll first test your strategy via simulation, assessing whether it is likely to perform as well in the future as it did in the past.

In financial trading, such a simulation of past performance is called a backtest.

Despite what you’ll read online in what I’ll call retail trading folklore, backtesting your strategies is not as simple as aligning signals with entry and exit prices and summoning the results. Simply doing a backtest is one thing, but gaining accurate, actionable data that will help you keep more of your capital is another.

Such simplistic approaches will undoubtedly lead to some backtest bias.

In this post, we’re going to focus on a few ways that biases in your development methodology can hold you back from successful trading. Many of these effects are subtle yet profound: they can and will creep into your development process and can have disastrous effects on strategy performance.

Look-Ahead Bias aka Peeking Bias

If you could travel back in time, you’d probably be quite tactical about your “creative” work wherever you landed. Google? Your idea. Predicting future events to become a local deity? Hold my feather crown.

In trading, you might want to hold off on all that. Look-ahead bias is a type of backtesting bias introduced by allowing future knowledge to affect your decisions around historical scenarios or events. When you run backtests, this bias shows up as trade decisions that act upon knowledge that would not have been available at the time the original decision was taken.

What does this look like in practice?

A popular example is executing an intra-day trade on the basis of the day’s closing price, when that closing price is not actually known until the end of the day.
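To make that concrete, here’s a minimal pandas sketch of the mistake and a fix. The data file, column names and the 20-day moving-average rule are purely illustrative placeholders, not a recommended strategy:

```python
import pandas as pd

# Hypothetical daily bars with 'open' and 'close' columns, indexed by date.
prices = pd.read_csv("daily_bars.csv", index_col="date", parse_dates=True)

# LOOK-AHEAD BIAS: the signal is computed from today's close, but the trade
# is entered at today's open -- the close wasn't known at entry time.
signal = (prices["close"] > prices["close"].rolling(20).mean()).astype(int)
biased_returns = signal * (prices["close"] / prices["open"] - 1)

# Fix: lag the signal by one bar so today's trade only uses information
# that was available at yesterday's close.
lagged_signal = signal.shift(1)
honest_returns = lagged_signal * (prices["close"] / prices["open"] - 1)
```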

Even if you use backtesting software that’s designed to guard against look-ahead bias, you need to be careful. A subtle but potentially serious mistake is to use the entire simulation period to calculate a trade parameter (for example, a portfolio optimization parameter) which is then retrospectively applied at the beginning of the simulation.

This error is so common that you must always double check for it. And triple check if your backtest looks really good.
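Here’s a sketch of that subtler mistake, using volatility-based position sizing as a stand-in for the “trade parameter” (the numbers and window length are assumptions). The biased version estimates the parameter over the whole backtest and applies it from day one; the honest version only uses data available at each point in time:

```python
import numpy as np
import pandas as pd

# Hypothetical daily strategy returns.
rng = np.random.default_rng(0)
rets = pd.Series(rng.normal(0.0005, 0.01, 1_000))

# SUBTLE LOOK-AHEAD: volatility estimated over the ENTIRE simulation period,
# then applied retrospectively to size positions at the very start.
full_sample_vol = rets.std() * np.sqrt(252)
biased_size = 0.10 / full_sample_vol            # constant size built from future data

# Fix: estimate the parameter from a trailing window and lag it one bar,
# so each day's position size uses only information available at the time.
trailing_vol = rets.rolling(60).std().shift(1) * np.sqrt(252)
honest_size = 0.10 / trailing_vol
```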

Overfitting Bias aka Over-Optimization Bias

If you run a backtest producing annual returns in the ballpark of thousands of percent, don’t quit your day job – you’ve likely stumbled across overfitting bias. Apart from being comedic firewood for your favourite FX forum, these backtests are useless for real, systematic trading purposes.

Check out the following plots to see overfitting as backtesting bias in action. The blue squares in the figure below are an artificially generated quadratic function with some noise added to distort the underlying signal. The lines represent various models fitted to the data points. The red line is a linear regression line; the green, blue and orange lines are quadratic, cubic and quartic functions respectively. Apart from the linear regression line, these all do a decent job of modelling the data, for this region in the parameter space.

The pink line is a high-order polynomial regression: notice that it fits this data best of all:

[Figure: polynomial models of increasing order fitted to the noisy in-sample data]

But do these models hold up out-of-sample? What I’m really asking is, how well do they generalize to data that was not used in the model-fitting process?

Well, the next plot shows the performance of the quadratic, cubic and quartic functions in a new region of the observed variable space, meaning an out-of-sample data set. In this case, the quadratic function is clearly the best performer, and we know that it most closely matches the underlying generating function – this is an example of a well-fit model.

The other models do a pretty crummy job of predicting the value of the function for this new, unseen region of parameter space, even though they looked pretty attractive on the in-sample data.

[Figure: the quadratic, cubic and quartic models predicting the out-of-sample region]

The best model on the in-sample data set, the high-order polynomial, does a terrible job of modeling this out-of-sample region. In fact, in order to see it, we have to look at a completely different portion of the y-axis, and even use a logarithmic scale to make sense of it:

[Figure: the high-order polynomial’s out-of-sample predictions, shown on a logarithmic scale]

This model is predicting hugely negative values of our function when we know that it could never generate a single negative value (thanks to the quadratic term in the underlying function). The function looks nothing like a quadratic function: it is more like a hyperbolic function. Or bad modern art.
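If you want to see this for yourself, here’s a small numpy sketch of the same experiment. The underlying quadratic, the noise level and the sample regions are illustrative assumptions rather than the exact values behind the figures above:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Underlying quadratic generating function (illustrative coefficients).
    return 2.0 * x**2 + 3.0 * x + 1.0

# In-sample region used to fit the models, with noise to distort the signal.
x_in = np.linspace(0.0, 5.0, 30)
y_in = f(x_in) + rng.normal(0.0, 5.0, x_in.size)

# A new, unseen region of the variable space for the out-of-sample test.
x_out = np.linspace(5.0, 8.0, 30)
y_out = f(x_out) + rng.normal(0.0, 5.0, x_out.size)

for degree in (1, 2, 3, 4, 9):   # linear, quadratic, cubic, quartic, high-order
    coeffs = np.polyfit(x_in, y_in, degree)
    mse_in = np.mean((np.polyval(coeffs, x_in) - y_in) ** 2)
    mse_out = np.mean((np.polyval(coeffs, x_out) - y_out) ** 2)
    print(f"degree {degree}: in-sample MSE {mse_in:10.1f}  out-of-sample MSE {mse_out:14.1f}")
```

You should see the same pattern as in the plots: the high-order fit wins in-sample and loses badly in the out-of-sample region.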

This misrepresentation of the underlying process is a classic example of overfitting, and it’ll have you banging your head against the wall a lot in your early days of algo trading. In fact, you’ll face this problem every day in your strategy development; you just learn how to eat its punches with experience.

Overfitting bias affects strategies that are tested on in-sample data. The same data is used to optimize and then test the strategy. Common sense will tell you that a strategy will perform well on the data with which it was optimized – that’s the whole point of optimization! What’s more, exhaustively searching the parameter space and choosing a local performance maximum will undoubtedly lead to overfitting and failure in an out-of-sample test.

It is crucial to understand that the purpose of the in-sample data set is not to measure the performance of a strategy. The in-sample data is used to develop the strategy and find parameter values that may be suitable. At best, you should consider the in-sample results to be indicative of whether the strategy can be profitable at all.

Avoid using the in-sample results to benchmark likely future performance of a strategy.

Look again at the figures above. The model with the best in-sample performance was the high-order polynomial shown by the pink line. That said, its out-of-sample performance was as enviable as stepping on a plug. The quadratic, cubic and quartic models all performed reasonably well in the in-sample test, but the quadratic model was the clear star performer in the out-of-sample test. Obviously, you can infer little about performance on unseen data using in-sample testing.

Here’s the really insidious part…

When you fit a model (a trading strategy) to a noisy data set (and financial data is a rave), you risk fitting your model to the noise, rather than the underlying signal. The underlying signal is the anomaly or price effect that you believe provides profitable trading opportunities, and this signal is what you are actually trying to capture with your model.

Noise gets between you and the money. It’s a random process, and it’s unlikely to repeat itself exactly the same way. If you fit your model to the noise, you’ll end up with a random model. Unless you enjoy paying for your broker’s 12oz rib-eye steak, this isn’t something you should ever trade.

The expected payout of a random model is zero; with costs factored in, it is less than zero.

So what’s the overarching lesson from all this?

Well, in-sample data is only useful in the following ways:

  1. Finding out whether a strategy can be profitable and under what conditions
  2. Determining which parameters have a significant impact on performance
  3. Determining sensible ranges over which parameters might be optimized
  4. Debugging the strategy, that is, ensuring trades are being entered as expected

Given the topic of this post, you’ll notice something missing from that list: measuring the performance of a trading strategy.

Any estimate of performance you derive from an in-sample test is plagued with overfitting and similar backtesting biases and is likely to be optimistic – unless your entire development process is watertight… but that’s a story for another time.

The solution to overfitting bias is adopting a sensible approach to the markets and strategy development. This includes:

  • Keeping strategies simple. The fewer fittable parameters, the better.
  • Favouring trades that can be rationalised in a sentence over blindly data mining for trading rules.
  • Optimising for robustness, not in-sample performance (more on this below).
  • Avoiding the temptation to be precise in your model specification. Market data is noisy and fickle, and any signal is weak.
  • Avoiding trades that will, at best, marginally cover retail trading costs, such as scalping.
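To illustrate the robustness point, here’s a sketch of what “optimising for robustness” can look like in practice: rather than grabbing the single best parameter value from an in-sample scan, you look for a broad plateau of acceptable performance. The moving-average rule, file name and parameter grid are all placeholders:

```python
import numpy as np
import pandas as pd

def in_sample_sharpe(prices: pd.Series, lookback: int) -> float:
    """Toy moving-average rule: long when price is above its moving average.
    Returns the annualised Sharpe ratio of the resulting daily returns (no costs)."""
    signal = (prices > prices.rolling(lookback).mean()).astype(int).shift(1)
    daily = signal * prices.pct_change()
    return float(np.sqrt(252) * daily.mean() / daily.std())

# Hypothetical daily close prices.
prices = pd.read_csv("daily_bars.csv", index_col="date", parse_dates=True)["close"]

lookbacks = range(10, 210, 10)
sharpes = pd.Series({lb: in_sample_sharpe(prices, lb) for lb in lookbacks})

# A lone spike is probably luck; a neighbourhood of similar values is more
# likely to survive out-of-sample. Smooth across neighbouring parameter
# values and pick the centre of the best region, not the single best point.
plateau = sharpes.rolling(3, center=True).mean()
print("Single best lookback:", sharpes.idxmax())
print("Centre of most robust region:", plateau.idxmax())
```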

I’ll repeat myself: don’t use in-sample data to measure your strategy’s performance unless you have been very very very careful!

Data-Mining Bias aka Selection Bias

This one is unavoidable. So, rather than spending an eternity trying to eliminate this bias entirely, just be aware of it and accept that, generally, your strategies won’t perform as well in the markets as they did in your simulations.

You’ll commonly introduce data-mining bias when selecting the best performer from a bunch of strategy variants, variables or markets to continue developing. If you’re persistent enough in trying strategies and markets, you’ll eventually find one that performs well simply due to luck.

Think of it this way. Say you develop a trend following strategy in FX. The strategy crushes its backtest on EUR/USD but flops on USD/JPY. Any sensible person would trade the EUR/USD market, right? Sure, but you’ve just introduced selection bias into your process. Now, your estimate of the strategy’s performance is upwardly biased. Should you throw the strategy in the bin? Not necessarily. Maybe it performed well on EUR/USD specifically for good reason. But nevertheless, some selection bias has crept into your development process.

There are statistical tests to account for data mining bias, including comparing the performance of the strategy with a distribution of random performances. You can find examples of this in the Robot Wealth Advanced Course.
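One simple flavour of that idea (not necessarily the exact test from the course) is a permutation test: shuffle the strategy’s historical positions many times, recompute performance each time, and see how often pure luck matches or beats the real result. A minimal sketch, assuming you already have arrays of positions and market returns:

```python
import numpy as np

rng = np.random.default_rng(1)

def total_return(positions: np.ndarray, market_returns: np.ndarray) -> float:
    # Simple performance metric: sum of position-weighted market returns.
    return float(np.sum(positions * market_returns))

def luck_p_value(positions: np.ndarray, market_returns: np.ndarray,
                 n_trials: int = 10_000) -> float:
    """Fraction of randomly shuffled position sequences that perform at least
    as well as the real strategy. Small values suggest the edge isn't just luck."""
    actual = total_return(positions, market_returns)
    # Shuffling destroys any genuine timing skill while preserving the
    # strategy's overall exposure and the market's return distribution.
    random_results = np.array([
        total_return(rng.permutation(positions), market_returns)
        for _ in range(n_trials)
    ])
    return float(np.mean(random_results >= actual))
```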

You can also use out-of-sample data, but this quickly becomes problematic as we only have a finite amount of historical data on which to develop.

So your best bet to overcome selection bias is simply adopting a sensible, measured approach to strategy development, as outlined above.

Backtesting Bias: Conclusion

As a rule of thumb, you want to build robust trading strategies that exploit real market anomalies or inefficiencies. What’s more, you want to do this with an approach grounded in simplicity. The more complex your approach, the more likely you are to fall into the backtesting bias traps we’ve talked about above. Either way, it’s surprisingly easy to find strategies that appear to do well, but whose performance turns out to be due to luck or randomness. That’s part of the game.

You have probably noticed that I introduced the concept of a “sensible approach to strategy development”. But what does that look like? We’ve covered it at a conceptual level, which is useful in its own right. That said, we think that teaching our approach in detail is much better accomplished through our Algo Bootcamps. In Bootcamp, you can learn first-hand by watching us apply the process in real-time and participating in the development of real trading strategies. There are some very smart people inside our community, too, who can keep you on the right track.

This Backtesting Bias blog post is an excerpt from our Algorithmic Trading with Zorro course, which will be available in the coming weeks.
