Optimal Data Windows for Training a Machine Learning Model for Financial Prediction

It would be great if machine learning were as simple as feeding data to an out-of-the-box implementation of some learning algorithm, then standing back and admiring the predictive utility of the output. As anyone who has dabbled in this area will confirm, it is never that simple. We have features to engineer and transform (no trivial task – see here and here for an exploration with applications for finance), not to mention the vagaries of dealing with data that is not independent and identically distributed (non-IID). In my experience, landing on a model that fits the data acceptably at the outset of a modelling exercise is unlikely; a little (or a lot!) of effort usually has to be spent tuning and debugging the algorithm to achieve acceptable performance.

In the case of non-IID time series data, we also face the dilemma of how much data to use to train a predictive model. Given the non-stationarity of asset prices, if we use too much data, we run the risk of training our model on data that is no longer relevant. If we use too little data, we run the risk of building a model that fits noise in the few observations it sees and fails to generalise. This raises the question: is there an ideal amount of data to include in machine learning models for financial prediction? I don’t know, but I doubt the answer is clear cut, since we never know when the underlying process is about to undergo significant change. I hypothesise that it makes sense to use the minimum amount of data that leads to acceptable model performance, and testing this is the subject of this post.

How Much Data?

In classical data science, model performance generally improves as the amount of training data is increased. However, as mentioned above, due to the non-IID nature of the data we use in finance, this happy assumption is not necessarily applicable. My theory is that using too much data (that is, using a training window that extends far into the past) is actually detrimental to model performance.

In order to explore this idea, I decided to build a model based on previous asset returns and measures of volatility. The volatility measure that I used is the 5-period Average True Range (ATR) minus the 20-period ATR normalized over the last 50 periods. The data used is the EUR/USD daily exchange rate sampled at 9:00am GMT between 2006 and 2016.
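The volatility feature can be sketched roughly as follows, in Python rather than the R and Zorro tooling used for the post. This is an illustration under stated assumptions, not the original pipeline: the ATR is taken as a simple moving average of the true range (Wilder's smoothing is the other common convention), the scaling follows the median and inter-quartile formula given in the comments below, `cdf` is assumed to be the standard normal CDF, and the function names `atr`, `atr_spread` and `compress` are hypothetical.

```python
from statistics import NormalDist, median, quantiles


def atr(highs, lows, closes, period):
    """Average True Range as a simple moving average of the true range.

    Assumption: a plain average is used; the post does not say which
    smoothing convention was applied.
    """
    # The true range of bar i needs the previous close, so it starts at bar 1.
    true_ranges = [
        max(highs[i] - lows[i],
            abs(highs[i] - closes[i - 1]),
            abs(lows[i] - closes[i - 1]))
        for i in range(1, len(closes))
    ]
    out = [None] * len(closes)
    for i in range(period, len(closes)):
        # true_ranges[j] belongs to bar j + 1, so this slice covers
        # bars i - period + 1 .. i.
        out[i] = sum(true_ranges[i - period:i]) / period
    return out


def atr_spread(highs, lows, closes, fast=5, slow=20):
    """Raw volatility measure: fast ATR minus slow ATR."""
    fast_atr = atr(highs, lows, closes, fast)
    slow_atr = atr(highs, lows, closes, slow)
    return [
        None if f is None or s is None else f - s
        for f, s in zip(fast_atr, slow_atr)
    ]


def compress(values, lookback=50):
    """Scale each value into [-1, 1] using the trailing median and IQR."""
    normal = NormalDist()  # assumption: cdf is the standard normal CDF
    out = [None] * len(values)
    for i in range(lookback - 1, len(values)):
        window = values[i - lookback + 1:i + 1]
        if any(v is None for v in window):
            continue  # not enough history yet
        q1, _, q3 = quantiles(window, n=4)
        iqr = q3 - q1
        if iqr == 0:
            out[i] = 0.0  # flat window: treat as neutral
            continue
        z = 0.5 * (values[i] - median(window)) / iqr
        out[i] = 2.0 * normal.cdf(z) - 1.0
    return out
```

Calling `compress(atr_spread(highs, lows, closes))` gives a series that is `None` until enough history is available and bounded between -1 and 1 afterwards. The quartile convention (`statistics.quantiles` defaults to the exclusive method) is a further assumption and may differ slightly from the original calculation.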

The model used the previous three values of the returns and volatility series as the input features and the next day’s market direction as the target. I trained a simple two-class logistic regression model using R’s glm function with a time-series cross-validation approach. This approach involves training the model on a window of data and predicting the outcome of the next period, then shifting the training window forward in time by one period. The model is then retrained on the new window and the next period’s outcome predicted. This process is repeated along the length of the time series. The cross-validated performance of the model is simply the performance of the next-day predictions under some suitable performance measure. I recorded the profit factor and Sharpe ratio of the model’s predictions. I used class probabilities to determine the positions for the next day as follows:

if P_{up} >= 0.55, go long at open

if P_{down} >= 0.55, go short at open

if 0.45 < P_{up} < 0.55 (equivalent to 0.45 < P_{down} < 0.55), remain flat

where P_{up} and P_{down} are the calculated probabilities for the next day’s market direction to be positive and negative respectively.

Positions were liquidated at the close.
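The trading rule and the two performance measures can be sketched as below. This is a minimal illustration in Python, not the code behind the charts: the function names are hypothetical, returns are assumed to be open-to-close returns for each day, the Sharpe ratio is assumed to be annualised over 252 trading days with a zero risk-free rate, and flat days are counted as zero-return days.

```python
from math import sqrt
from statistics import mean, stdev


def position(p_up, threshold=0.55):
    """Map the predicted probability of an up day to +1, -1 or 0."""
    if p_up >= threshold:
        return 1           # go long at the open
    if 1.0 - p_up >= threshold:
        return -1          # P_down = 1 - P_up, go short at the open
    return 0               # remain flat


def strategy_returns(p_ups, open_to_close_returns, threshold=0.55):
    """Daily strategy returns: position times that day's open-to-close return."""
    return [
        position(p, threshold) * r
        for p, r in zip(p_ups, open_to_close_returns)
    ]


def profit_factor(returns):
    """Gross profit divided by gross loss."""
    gains = sum(r for r in returns if r > 0)
    losses = -sum(r for r in returns if r < 0)
    return gains / losses if losses > 0 else float("inf")


def sharpe_ratio(returns, periods_per_year=252):
    """Annualised mean over standard deviation (assumed convention)."""
    sd = stdev(returns)
    return mean(returns) / sd * sqrt(periods_per_year) if sd > 0 else 0.0
```

For example, `strategy_returns([0.6, 0.4, 0.5], [0.01, 0.02, 0.03])` goes long on the first day, short on the second and stays flat on the third.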

In order to investigate the effects of the size of the data window, I varied its size between 15 and 1,600 days and recorded the cross-validated performance for each case. I also recorded the average in-sample performance on each of the training windows. Slicing up the data so that the various cross-validation samples were consistent across window lengths took some effort, but this wrangling was made simpler by Max Kuhn’s caret package (I once again tip my hat to him).
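The rolling-window procedure, with every window length evaluated on the same prediction days, might look like the sketch below. It is written in Python rather than R, and a small gradient-descent logistic regression stands in for `glm` (which fits by iteratively reweighted least squares), so fitted probabilities will not match exactly. The names `fit_logistic`, `predict_up` and `walk_forward`, the step count and the learning rate are all assumptions, and features are assumed to be roughly unit scale.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_up(weights, x):
    """P(up) for one feature row; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(X, y, steps=200, learning_rate=0.1):
    """Fit logistic regression by batch gradient descent (stand-in for glm)."""
    weights = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, label in zip(X, y):
            err = predict_up(weights, row) - label
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights


def walk_forward(X, y, window, first_test):
    """Train on the `window` rows before day t, predict day t, then roll on.

    Row t must hold only features known before day t opens, and y[t] is
    day t's direction (1 = up, 0 = down), so rows before t are leak-free.
    Using the same `first_test` for every window length keeps the set of
    predicted days identical across window lengths.
    """
    if window > first_test:
        raise ValueError("first_test must be at least the window length")
    predictions = {}
    for t in range(first_test, len(X)):
        weights = fit_logistic(X[t - window:t], y[t - window:t])
        predictions[t] = predict_up(weights, X[t])
    return predictions
```

Setting `first_test` to the longest window under study (1,600 here) means the 15-day and 1,600-day models are scored on exactly the same days, which is one way to get the consistency across window lengths described above.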

The results are presented below.

[Figure IS-CV Profit Factor: in-sample and cross-validated profit factor by training window length]

We can see that for the smallest window lengths, the in-sample performance greatly exceeds the cross-validated performance. In other words, when we use very little data, the model fits the training data well, but fails to generalize out of sample. It has a variance problem, which is what we would expect.

Then things get interesting. As we add slightly more data in the form of a longer training window, the in-sample performance decreases, but the cross-validated performance increases, very quickly rising to meet the in-sample performance. In-sample and cross-validated performance are very similar for window lengths between roughly 25 and 75 days. This is an important result, because when the cross-validated performance approximates the in-sample performance, it suggests that the model is capturing the underlying signal and is therefore likely to generalise well. Encouragingly, this performance is reasonably robust across that range. If we had only one data point showing reasonable cross-validated performance, I wouldn’t trust that this wasn’t due to randomness. The existence of a region of reasonable performance gives us a degree of confidence in the results.

As we add yet more data to our training window, we can see that the in-sample performance continues to deteriorate, eventually reaching a lower limit, and that the cross-validated performance likewise continues to decline, with a notable exception around 500 days. This suggests that as we increase the training window length, the model develops a bias problem and underfits the data.

These results are perhaps confounded by the fact that the optimal window length may be a characteristic of this particular market and the particular 10-year period used in this experiment. Actually, I feel this is quite likely. I haven’t run this experiment on other markets or time periods yet, but I strongly suspect that each market will exhibit different optimal window lengths, and that these will probably themselves vary with time. Notwithstanding this, it appears that we can at least conclude that in finance, more data is not necessarily better.

Equity Curves

I know how much algorithmic traders like to see an equity curve, so here is the model performance using a variety of selected window lengths, as well as the buy and hold equity curve of the underlying. Transaction costs are not included.

[Figure EquityCurves0.6: equity curves for selected training window lengths, with buy and hold of the underlying]

In this case, the absolute performance is nothing spectacular*. However, it demonstrates the differences in the quality of the predictions obtained using different window lengths for training the models. We can clearly see that more is not necessarily better, at least for this particular period of time.

Performance as a Function of Class Probability Threshold

It is also interesting to investigate how performance varies across the different window lengths as a function of the class probability threshold used in the trading decisions. Here is a heatmap of the model’s Sharpe ratio for various window lengths and class probability thresholds.

[Figure SharpeHeatMap: Sharpe ratio by training window length and class probability threshold]

We can see a fairly obvious region of higher Sharpe ratios for lower window lengths and generally increasing class probability threshold. The region of the higher Sharpe ratios for longer window lengths and higher class probabilities (the upper right corner) is actually slightly misleading, since the number of trades taken for these model configurations is vanishingly small. However, we can see that when those trades do occur, they tend to be of a higher quality.
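One way to guard against reading too much into thinly traded cells of such a heatmap is to record the trade count next to each Sharpe ratio. The sketch below is a hypothetical helper, not the code used for the chart; the `min_trades` cut-off of 30 and the 252-day annualisation are arbitrary assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def sweep_thresholds(p_ups, returns, thresholds, min_trades=30):
    """Sharpe ratio and trade count for each class probability threshold."""
    rows = []
    for th in thresholds:
        # +1 when P_up clears the threshold, -1 when P_down does, else flat.
        positions = [
            1 if p >= th else -1 if 1.0 - p >= th else 0
            for p in p_ups
        ]
        pnl = [pos * r for pos, r in zip(positions, returns)]
        n_trades = sum(1 for pos in positions if pos != 0)
        sd = stdev(pnl) if len(pnl) > 1 else 0.0
        sharpe = mean(pnl) / sd * sqrt(252) if sd > 0 else None
        rows.append({
            "threshold": th,
            "trades": n_trades,
            "sharpe": sharpe,
            # Flag cells with too few trades to take the Sharpe seriously.
            "reliable": n_trades >= min_trades,
        })
    return rows
```

Cells flagged as not reliable could then be greyed out or dropped before plotting, so that a handful of lucky trades does not dominate the picture.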

Finally, here are several equity curves for a window length of 30 days and various class probability thresholds.

[Figure Window30Returns: equity curves for a 30-day training window at various class probability thresholds]

Conclusions

This post investigated the effects of varying the length of the training window on the performance of a simple logistic regression model for predicting the next-day direction of the EUR/USD exchange rate. Results indicated that more data does not necessarily lead to a better predictive model. In fact, there may be a case for using a relatively small window of training data to force the model to continuously re-learn and adapt to the most immediate market conditions. There appears to be a trade-off to contend with: very small windows exhibit vast differences between performance on the training set and performance on out-of-sample data, while very large windows perform poorly both in-sample and out-of-sample.

While absolute performance of the model was nothing to get excited about, the model used here was a very simple logistic regression classifier and minimal effort was spent on feature engineering. This suggests that the outcomes of this research could potentially be used in conjunction with more sophisticated algorithms and features to build a model with acceptable performance. This will be the subject of future posts.

The axiom ‘he who has the most data wins’ holds in many data science applications. This doesn’t appear to be the case when it comes to building predictive models for the financial markets. Rather, the research presented here suggests that the development and engineering of the model itself may play a far larger role in its out-of-sample performance. This implies that model performance is more a function of the skill of the developer than of the ability to obtain as much data as possible. I find that to be a very satisfying conclusion.

Source Code

Here’s some source code if you are interested in reproducing my results. Warning: it is slightly hacky and takes a long time to run if you store all the in-sample performance data! By default I have commented out that part of the code.

*Of course, building a production trading model is not the point of the exercise. Apologies for pointing this out; I know most of you already understand this, but I invariably get emails after every post from people questioning the performance of the ‘trading algorithms’ I post on my blog. Just to be clear, I am not posting trading algorithms! I am sharing my research. Performance on market data, particularly relative performance, is a quick and easy way to interpret the results of this research. I don’t intend for anyone (myself included) to use the simple logistic regression model presented here in a production environment. However, I do intend to use the concepts presented in this post to improve my existing models or build entirely new ones. There is more than enough information in this post for you to do the same, if you so desire.

15 Comments

  • Tom

    August 9, 2016

    This very much agrees with my own observations. I think that shorter time spans probably capture regime changes more cleanly rather than getting confused by a variety of behaviours. This may be different if more complex models such as deep neural networks are used but they will probably present other problems. Anyway, many thanks for the well-written article.

    • Robot Master

      August 10, 2016

      Thanks for the reply, Tom. I agree with both your points. I am still very interested in applying deep neural nets to algo trading, particularly to streamline the feature engineering phase. But they do present their own unique challenges in their practical application. There is certainly a case for using simpler models, particularly if you can ensemble them together in a clever way.

  • sven

    August 11, 2016

    RM, Thanks very much for your detailed article. I’ve been trying to reproduce it on my end but am getting hung up on the structure of your EU_daily.csv file and the definition of your volatility measure. You describe it above as “The volatility measure that I used is the 5-period Average True Range (ATR) minus the 20-period ATR normalized over the last 50 periods.” I’ve implemented it using the ATR function within the ‘TTR’ library for 5 periods and 20 periods. I divide this difference by the 50 day running min of the 50 day difference. I’m not sure if this is how you meant ‘normalized over the last 50 periods’.

    Any help would be great! Thanks again for your work.

    • sven

      August 11, 2016

      Correction…

      I divide this difference by the 50 day running min of the difference.

      • Robot Master

        August 11, 2016

        Hey Sven, thanks for reading.

        I scaled my volatility measure over 50 days using the following formula: 2 * cdf(0.5*(x_i-x_{median})/(x_{P75}-x_{P25}))-1

        where x is the value of the raw volatility measure, the median and inter-quartile range of x are calculated over the last 50 periods and the subscript i denotes the current value.

        There is a very useful compression function built into Zorro that performs this calculation; I used it and then exported the data for import into R.

        This was fairly quick and dirty, so you might get better results using other volatility measures.

        Cheers

        • sven

          August 12, 2016

          You’re welcome, thanks again for the interesting post.

          Understood on the function above. You’re right, there are countless other more complex (and simpler) ways to measure volatility but this is more than sufficient to illustrate the point.

          In line 114, you reference a threshold of 52.5%. I assume that this was simply the most recent choice at the time you posted the source code; I couldn’t tell from the charts above whether this was what you used or whether you used 55% as referenced in the formulas.

          I’ve more or less replicated your work (thanks for the help doing so!) and found a similar pattern regarding less data yielding better results although my equity curve does not look quite as stable. Furthermore, I’m using 5:00PM EST as the snapshot for daily rates… a follow on area for research could be testing for stability when predicting different times of the day.

          • Robot Master

            August 12, 2016

            Hi Sven

            You’ve got a great eye for detail! Well spotted. I used 55% in the charts shown.

            Glad to hear you found a similar pattern regarding the shorter window lengths. I have found that when sampling foreign exchange data at daily frequency, the sampling time tends to have a significant impact for many currency pairs, so I am not surprised that your equity curves look different to mine. Check out this paper for a possible explanation: Currency Returns in Different Time Zones. The results in this paper generally agree with what I’ve experienced in the FX markets.

  • Vivek

    August 11, 2016

    Thanks for your interesting post.

    You find that a short training window works better, but you are using relatively few inputs. If you had more predictors, maybe a larger window would work better.

    Your plot of cross-validation performance vs. window length could be smoothed by something like lowess, since the many relative minima and maxima are likely due to noise.

    • Robot Master

      August 12, 2016

      Hi Vivek

      It is possible that a larger window would work better with more predictors. Having more predictors essentially leads to a more complex model, which may help overcome the bias problem developed by my model for larger window lengths. If you feel like researching this and sharing the results, I’d love to hear from you.

      Indeed, a lowess smoother would reduce the impact of those noisy minima and maxima in the CV-window plot. Thanks for the tip!

  • laurent K

    August 21, 2016

    Hi and thanks for sharing your research.

    My own research also came to the conclusion that shorter windows were more suitable. In particular, extreme events (for instance 2007-2008) had a huge impact on the outcome. However, throwing away data is rather counterintuitive. As factors are dynamic, it of course makes sense to shorten the estimation window to capture only the most relevant information. Nevertheless, as quants we hypothesize that the future behavior of the market will be somewhat self-similar. So, although old data is outdated, it should contain some (useful) information which could reduce uncertainty about future outcomes.

    Recently, I briefly considered another option, which was to train once on a short window and then shrink the result toward the longer window (or alternatively use an ensemble approach and combine all windows). We could set the shrinkage factor either by minimizing a cost function or simply by using an exponential smoothing factor. In any case, the diversification effect should work and we can hope to retain some benefit of old information.

    Another option would be to model explicitly the dynamic of the feature. In portfolio optimization with factors, the betas are often modeled as simple mean-reverting process. So including a lagged term of the features, in order to capture their dynamic, could help, but I would be more careful about overfitting in this case.

    Digression: For the threshold choice, sizing position according to the forecast confidence (after some rescaling) generally works fine and could be interesting to investigate further.

    I’ll try to investigate these options should I have time and share the results if anyone is interested.

    Best, L

    • Robot Master

      August 31, 2016

      Hey Laurent

      Thanks so much for your thoughtful comments. I really agree about being cautious about throwing away old data. Intuitively, one would think there would have to be something useful in there, even if it is obscured by the passage of time. I really like the idea of ensembling a bunch of different windows and maybe using a majority vote as the basis for trading decisions. Maybe weighting the more recent data windows would help too.

      I’m super-interested in hearing about any results you are willing to share. I’ll also investigate the ensembling idea as the subject of a future blog post.

      Thanks again!
      Kris

  • Sachin

    January 4, 2017

    Hi Kris,

    Just wondering if you could give us your csv file so that we can play around with it as well. I realise I can download the exchange rates myself, but not sure how to calculate atrRegime metric.

    Cheers,
    Sachin

    • Kris Longmore

      January 9, 2017

      Hey Sachin

      Nice to hear from you mate. Sorry for the slow reply. Yep sure thing, I’ll provide some data sets in csv format that you can play around with. I’ll generate files for various daily exchange rates and close times (which I think is a really interesting feature in itself). The data set I used in my analysis was quite arbitrary – daily EUR/USD sampled at 9:00 GMT. Give me a couple of days and I’ll upload them to the site.

      Cheers
      Kris
