My first post on using machine learning for financial prediction took an in-depth look at various feature selection methods as a data pre-processing step in the quest to mine financial data for profitable patterns. I looked at various methods to identify predictive features including Maximal Information Coefficient (MIC), Recursive Feature Elimination (RFE), algorithms with built-in feature selection, selection via exhaustive search of possible generalized linear models, and the Boruta feature selection algorithm. I personally found the Boruta algorithm to be the most intuitive and elegant approach, but regardless of the method chosen, the same features seemed to keep on turning up in the results.

In this post, I will take this analysis further and use these features to build predictive models that could form the basis of autonomous trading systems. Firstly, I’ll provide an overview of the algorithms that I have found to generally perform well on this type of machine learning problem as well as those algorithms recommended by David Aronson (2013) in Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments (SSML). I’ll also discuss a framework for measuring the performance of various models to facilitate robust comparison and model selection. Finally, I will discuss methods for combining predictions to produce ensembles that perform better than any of the constituent models alone.

Without further ado, let’s dive in and discuss some machine learning algorithms.

Algorithm selection

Anyone familiar with machine learning can tell you that the quantity of algorithms available to the practitioner these days is staggering. With the rise of open source tools like R, used widely by industry, academics and hobbyists to collaborate on and share machine learning research, some of the most cutting edge statistical learning frameworks are literally at our fingertips.

It’s an exciting time, but also somewhat daunting.

At last count there were 81 machine learning packages listed on CRAN’s Machine Learning Task View, and many of those packages provide access to numerous individual algorithms! It’s impractical to perform an exhaustive search for the ‘best’ algorithm for a particular task, but it is certainly possible to arrive at some guidelines around what tends to work better for particular use cases.

Aronson and Masters (2013) prefer linear and quadratic regression, boosted trees, and general regression neural networks. They state that “a single decision tree’s utility is debatable for financial data”, as is that of bagged ensembles of trees such as random forests, although boosted trees may be more appropriate. Personally, I found that neural networks initialized using stacked autoencoders were more promising than the general regression networks preferred by Aronson and Masters. I also found models based on Friedman’s gradient boosting machine (Friedman, 2001) particularly useful. In addition, I was able to obtain surprisingly decent results from a simple k-Nearest Neighbors algorithm, but I had less success with bagging methods like random forests. Like Aronson and Masters, I avoided using single decision trees; for the number of variables used in the investigation (see next section), there seems little point. The table below shows the algorithms that I investigated and highlights those that showed the most promise for this particular use case.


Even with the small selection listed here, we clearly have some decisions to make. I’ll present some results from these models below.

Choosing combinations of variables

For the purpose of this exercise, I chose the six features from the previous post that I feel are most likely to convey predictive information about the target variable. There are numerous combinations of features that we could use to build individual models, for example, various combinations of two, three or more variables. Assuming we only build models based on at least two variables, we have 57 possible unique combinations from a pool of six features. Aronson and Masters (2013) caution against using too many variables due to the risk of overfitting the data, stating that two or three variables is usually the practical limit. In this analysis, I’ll take their advice and explore models with two or three features only. This limits the number of unique combinations to 35 (15 pairs plus 20 triples).

The following features were chosen (refer to the previous post for calculation details):

  1. 3-period momentum
  2. Delta Bollinger width
  3. 10-period velocity
  4. 10:100 period ATR ratio
  5. 10:20 period ATR ratio
  6. 7-period ATR
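
The combination counts quoted above can be checked in a few lines. This is a Python sketch rather than the R code used for the post, and the short feature labels are hypothetical names chosen only for illustration.

```python
from itertools import combinations
from math import comb

# Hypothetical short labels for the six features listed above.
FEATURES = [
    "mom3", "delta_bb_width", "velocity10",
    "atr_ratio_10_100", "atr_ratio_10_20", "atr7",
]

# All subsets containing at least two of the six features.
at_least_two = sum(comb(len(FEATURES), k) for k in range(2, len(FEATURES) + 1))

# Subsets restricted to two or three features, enumerated explicitly so that
# each one can later be paired with an algorithm.
two_or_three = [c for k in (2, 3) for c in combinations(FEATURES, k)]

print(at_least_two)       # 57
print(len(two_or_three))  # 35
```

Pairing each of the 35 subsets with each of 7 algorithms gives the 245 candidate models compared later in the post.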

Each feature was normalized to the range $[-1, +1]$ using a rolling 50-period window. I generally find that this method gives superior model performance to normalizing over the entire data set.
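
The exact normalization formula is in the previous post and is not repeated here, so the following Python sketch assumes simple min-max scaling over the trailing window. The function name `rolling_minmax` is hypothetical.

```python
def rolling_minmax(values, window=50):
    """Scale each value to [-1, +1] using the min and max of the trailing window.

    Only the current and earlier observations are used, so no future
    information leaks into the scaled feature. (Assumed min-max scaling;
    the post does not spell out the formula.)
    """
    scaled = []
    for i in range(len(values)):
        if i + 1 < window:
            scaled.append(None)          # not enough history yet
            continue
        recent = values[i + 1 - window : i + 1]
        lo, hi = min(recent), max(recent)
        if hi == lo:
            scaled.append(0.0)           # flat window carries no information
        else:
            scaled.append(2.0 * (values[i] - lo) / (hi - lo) - 1.0)
    return scaled


# Tiny illustration with a 3-period window instead of 50.
print(rolling_minmax([1.0, 2.0, 3.0, 2.0, 5.0], window=3))
# [None, None, 1.0, -1.0, 1.0]
```

Because the scaling window moves with the data, the same raw value can map to different scaled values at different times, which is what lets the feature adapt to changing conditions.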

Download the raw data I used in this study via this link: EUvarsD1. In addition, the code below uses Zorro to calculate the features and target variables listed above and outputs them to a CSV file like the one in the download link. Using this code, you can generate your own feature sets for use in the modelling framework described in this post. You can also use it to change any other parameters, such as the normalization period and the amount of historical data. Using Zorro, this can easily be adapted to other markets as well.

A framework for measuring and comparing performance

We have 35 possible variable combinations and 7 algorithms with which to construct predictive models. The subset of variables was constrained based on the feature selection process discussed in the last post. I’ve constrained the list of algorithms by attempting to maximize their diversity. For example, I’ve chosen a simple nearest neighbor algorithm, a bagging algorithm, boosting algorithms, tree-based models, neural networks and so on. Clearly, I’ve constrained my universe of models to only a fraction of what is possible. Still, there is a lot of choice. We could randomly choose various models in the hope of landing on something profitable, however since with today’s computing power we very much have the means to implement it, I much prefer the idea of a systematic, comprehensive assessment. I’ll describe my framework for such an assessment below.

In this example, I will train each model to maximize the return of trading the model’s predictions normalized to the recent volatility measured by the 100-period ATR. Simple enough, but how would we objectively assess the performance of each model against this metric? And more importantly, how do we account for overfitting? Ideally, we would measure the out-of-sample performance of each model, but of course we have a finite amount of data and we need to maximize its utility. This is where cross validation comes in. There are plenty of great sources on the internet for detailed descriptions of cross validation, so I will only describe the procedure briefly.

Cross validation involves dividing the training data into k portions, training a model on k-1 portions and testing it on the portion that was held out. The performance on the hold out set is a first estimate of the true out of sample performance. The procedure is repeated using each subsample as the hold out test set and the results finally aggregated. This is known as k-fold cross validation. The estimate can be made more robust (usually) by randomly resampling the data into new portions and repeating the procedure. This is known as repeated k-fold cross validation. Other variations include bootstrapped cross validation (where resampling is undertaken with replacement – that is, individual observations can appear more than once in any subsample) and leave-one-out cross validation (where one observation at a time is held out, the remainder of the data used to train a model, and the performance of the model estimated by predicting the held out observation, repeated for every observation in the training set).
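
To make the partitioning concrete, here is a minimal Python sketch of k-fold and repeated k-fold index generation. It is an illustration only; the function names are hypothetical and it is not how caret or any other library implements resampling.

```python
import random


def k_fold_indices(n_obs, k, seed=None):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]     # k roughly equal portions
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, test_idx


def repeated_k_fold_indices(n_obs, k, repeats, seed=0):
    """Repeat k-fold with a fresh random partition on each repeat."""
    for r in range(repeats):
        yield from k_fold_indices(n_obs, k, seed=seed + r)


for train_idx, test_idx in k_fold_indices(10, 5, seed=1):
    print(sorted(test_idx), "held out;", len(train_idx), "used for training")
```

Setting `k` equal to the number of observations gives leave-one-out cross validation.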

Cross validation is a very useful procedure for estimating the true out of sample performance while maximizing the utility of the training data. This sounds fantastic, and it is for most data sets, but time series data presents some unique challenges. Consider a data set with no temporal dimension and in which the observations are independent and identically distributed. In any predictive modelling task for such a data set, we can never have too much data (within the practical constraints of computing resources of course). This is not necessarily the case with time series data.

If we use too much of the time series to train our model, we risk including irrelevant and outdated patterns which of course by definition are unlikely to show up in the next instances of the time series. If we use too little data, we run the risk of under-fitting the model and missing the predictive information we hope to capture. How much data do we need then? I don’t know, but I intend to find out.

Rob Hyndman describes a method for cross-validating time series data which is extremely useful for algorithmic system development as it actually mirrors a procedure that can be used for live trading. Also known as “rolling origin forecast evaluation” and “forward chaining”, it is the best method I have found for quickly estimating the future performance of a model. The procedure is implemented thus:

  1. Fit the model to a window of sequential data of length $t$: $x_1, x_2, \ldots, x_t$
  2. Predict the next value in the sequence, $x_{t+1}$, and compute the forecast error by comparing the prediction with the observed value.
  3. Shift the origin of the window forward by one and repeat steps 1 and 2.
  4. Repeat the process until the window starts at $x_{n-t}$, where $n$ is the length of the series (i.e. until we run out of data for creating a window of length $t$ with a following value to predict)
  5. Aggregate the forecast error for an estimate of the out of sample performance of the model

For readers familiar with walk forward analysis, time series cross validation is equivalent to walk forward analysis with the test set being a single period.
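
The five steps above can be sketched in Python as follows. The "model" here is a placeholder that forecasts the window mean, chosen only so the example is self-contained; any fit and predict pair could be swapped in. All function names are hypothetical.

```python
def rolling_origin_errors(series, window, fit, predict):
    """Time series cross validation with a fixed-length rolling window.

    fit(train) returns a model; predict(model) returns the one-step-ahead
    forecast. Returns the list of forecast errors (observed - predicted).
    """
    errors = []
    for start in range(len(series) - window):       # step 3: shift by one
        train = series[start : start + window]
        model = fit(train)                           # step 1: fit on the window
        forecast = predict(model)                    # step 2: predict next value
        errors.append(series[start + window] - forecast)
    return errors                                    # step 4: loop ends when data runs out


def fit_mean(train):
    """Placeholder model: remember the window mean."""
    return sum(train) / len(train)


def predict_mean(model):
    """Placeholder forecast: the stored window mean."""
    return model


errs = rolling_origin_errors([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3, fit_mean, predict_mean)
mae = sum(abs(e) for e in errs) / len(errs)          # step 5: aggregate the errors
print(errs, mae)   # [2.0, 2.0, 2.0] 2.0
```

A series of length $n$ and a window of length $t$ yield $n - t$ one-step forecasts, so longer windows leave fewer out-of-sample predictions to aggregate.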

Is there an optimal window length?

As mentioned above, I want to investigate the existence or otherwise of an optimal amount of data to include in the rolling window of training data. How much data is too much? Too little? At what point do we incorporate old and outdated information into a model to its detriment? At what point do we underfit the model due to lack of data?

For this experiment, I’ll model the EUR/USD exchange rate using a gradient boosting machine, a neural network and a k-nearest neighbors algorithm using various window lengths in the cross-validation procedure. My hypothesis is that there exists an optimal amount of data that maximizes the performance of a model for this particular time series. I am choosing three different algorithms in order to test the sensitivity of the optimal window length to the choice of algorithm. I expect that the length of the window itself will be an optimization parameter that is different for each market and that may itself change with time.

The figures below show the Sharpe ratios and directional accuracy by window length for the cross-validated model predictions. Returns of the analogous trading system were calculated as follows:

for all $y > 0$, go long at the close and hold the position for 1 period

for all $y < 0$, go short at the close and hold the position for 1 period

where $y$ is the prediction of the next period’s return normalized to the 100-period ATR. Results are exclusive of trading costs.
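
As an illustration of how these rules turn predictions into performance figures, here is a Python sketch. The post does not state how the Sharpe ratio was annualized, so the 252 periods per year and the zero risk-free rate below are assumptions.

```python
import math
import statistics


def strategy_returns(predictions, next_returns):
    """Long when the prediction is positive, short when negative, else flat."""
    out = []
    for y, r in zip(predictions, next_returns):
        position = (y > 0) - (y < 0)        # +1, -1 or 0
        out.append(position * r)
    return out


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return float("nan")
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def directional_accuracy(predictions, next_returns):
    """Fraction of non-zero predictions whose sign matches the realized return."""
    pairs = [(y, r) for y, r in zip(predictions, next_returns) if y != 0]
    hits = sum((y > 0) == (r > 0) for y, r in pairs)
    return hits / len(pairs)


preds = [0.2, -0.1, 0.3, -0.4]     # made-up predictions
rets = [0.5, 0.2, -0.1, -0.3]      # made-up realized ATR-normalized returns
print(strategy_returns(preds, rets), directional_accuracy(preds, rets))
```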

Sharpe ratio by window length

Accuracy by window length

The Sharpe ratio shows an obvious increase with increasing window length. This came as a surprise to me, as I suspected that there would be a point where data was too far removed from current market conditions to be of use in a predictive trading model. If such a point exists, it is clearly greater than 1,000 days. In hindsight, normalizing each feature using a rolling 50-period normalization window very likely ensures that the model dynamically adapts to changing conditions, but I must admit that I stumbled upon this more by accident than by design.

Setting aside the obvious trend in the Sharpe ratio results, the other thing that jumps out at me is that the cross-validated performance of the k-Nearest Neighbors algorithm is at least on par with, and at times significantly better than, the much more complex gradient boosting machines and neural networks. It is interesting that such a simple algorithm can hold its own against the more complex models.

In my last post, I talked about the concept of a “prediction threshold”, being a method of filtering trades by entering the market only when the magnitude of the predicted next-period return exceeds a certain value. My hypothesis for doing so is that there is merit in filtering the small moves, which would be affected by randomness, in favor of the larger moves. Said differently, it makes sense (depending on the objectives of the trader) to target those returns in the tails of the distribution. Testing this hypothesis, I plotted a heatmap of Sharpe ratios by window length and prediction threshold for the neural network model:

NNet Sharpe Ratio by Window Length and Prediction Threshold

In general, there appears to be merit to this hypothesis with Sharpe ratios generally increasing for increasing prediction threshold.
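
A prediction threshold can be sketched in Python as a filter on the trading rule. The numbers are made up for illustration, and recording a zero return on days without a trade is one possible convention, not necessarily the one used to produce the heatmap.

```python
def thresholded_returns(predictions, next_returns, threshold):
    """Trade only when |prediction| exceeds the threshold; otherwise stay flat."""
    out = []
    for y, r in zip(predictions, next_returns):
        if abs(y) > threshold:
            out.append(r if y > 0 else -r)
        else:
            out.append(0.0)                 # no position, no return
    return out


preds = [0.05, -0.30, 0.20, -0.02]          # made-up predictions
rets = [-0.4, -0.6, 0.5, 0.3]               # made-up realized returns
for thr in (0.0, 0.1, 0.25):
    print(thr, thresholded_returns(preds, rets, thr))
```

Raising the threshold reduces the number of trades, so any Sharpe ratio computed at a high threshold rests on fewer observations and deserves correspondingly more caution.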

Here’s the code I used for the neural network models (the models based on k-nearest neighbors and gradient boosting machines used a similar procedure):

Comprehensive model comparison

Now that I know that, in general, a longer rolling window helps a predictive model achieve a larger Sharpe ratio, my next experiment will be a comprehensive comparison of the 245 possible models from our 7 algorithms and 35 variable combinations.

The results below show the cross-validated Sharpe ratios for each combination of algorithm and variable subset investigated in this study using a rolling window of 1,000 days of history as the training data. In each case, the optimal model was found by tuning the individual algorithm’s hyperparameters across a sensible subset of possible values. No prediction threshold was used to filter trades (that is, each model took a long or short position every day). Once again, transaction costs are excluded.

Comprehensive model comparison 2

The neural network initialized using a stacked autoencoder is the clear out-performer of this group of algorithms. This model seems to be able to learn the underlying predictive patterns better than any other algorithm used in this study, and it does so consistently regardless of the combination of variables used. Initialization with a stacked autoencoder essentially forces the network to recreate the input data in the pre-training phase, which results in a network that tends to learn the features that form a good representation of its input, reducing the noise component of the input data. With this in mind, it may be possible to use this technique with a larger number of features, potentially forgoing an extensive feature selection process. This is an interesting idea, and something to pursue at another time.

We can also see that the boosting algorithms performed relatively well, the multivariate adaptive regression spline (MARS) models were consistent, and that k-nearest neighbors performed well for such a simple model. The Cubist models were less consistent and the random forest models performed worst of all.

The best Sharpe ratio of all the models was 1.72.

The following R code shows how I implemented this framework for efficiently testing and comparing the algorithms used in this study:

Accounting for data mining bias

Data mining bias refers to the unfortunate selection of a trading model based on randomly good performance. For instance, a system with no basis in economic or financial reality has a profit expectancy of exactly zero, excluding transaction costs. However, due to the finite sample size of a backtest, sometimes such a system will show a backtested performance that can lead us to believe it is better (or worse) than random. As the number of samples grows in live trading, the worthlessness of such a system becomes apparent.
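
A small simulation makes the effect tangible. The Python sketch below generates systems whose daily returns are pure zero-mean noise and reports the best backtested mean return among them. The 1% daily volatility and the 250-day backtest are arbitrary assumptions.

```python
import random


def best_of_random_systems(n_systems, n_days, seed=0):
    """Best mean daily return among systems that have no edge by construction."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_systems):
        daily = [rng.gauss(0.0, 0.01) for _ in range(n_days)]   # zero expectancy
        best = max(best, sum(daily) / n_days)
    return best


print(best_of_random_systems(1, 250, seed=3))     # a single worthless system
print(best_of_random_systems(245, 250, seed=3))   # the best of 245 worthless systems
```

The more candidates are tested, the better the best of them looks, even though every one has an expectancy of zero.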

Data mining bias shows up often, but it is of particular concern when we harness computing power to systematically test numerous potential trading models, exactly as I have done in this post. White (2000) describes a method for accounting for this data mining bias, referred to as White’s Reality Check or the Bootstrap Reality Check. David Aronson covers it in detail in “Evidence-Based Technical Analysis.” I will only describe the procedure briefly here, but here is the link to the original paper.

White’s Reality Check requires that we keep a record of all variants of the trading model that were tested during the development process and produce a zero-mean returns series for each. We then randomize these returns series using bootstrap resampling and note the total return of the best performer. This process is repeated several thousand times, and the median best return corresponds to the data mining bias introduced by the development process. We can then observe where the originally selected best model fits into the distribution of bootstrapped results to obtain a confidence level relating to its possession or otherwise of an actual positive expectancy. Saddest of all, and I think this is why White used the term “reality check”, is that the expected performance of the selected strategy in real trading is only the performance measured in its backtest MINUS the median performance of the bootstrapped returns series.
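
The procedure can be sketched in Python as below. This is a simplified illustration and not the R code used for the post: it uses a plain bootstrap that resamples periods independently (White's paper uses the stationary bootstrap, which preserves serial dependence), and it scores models on mean return rather than on the Sharpe ratio. The function name is hypothetical.

```python
import random


def reality_check(returns_by_model, n_boot=1000, seed=0):
    """Simplified bootstrap reality check on mean return.

    returns_by_model: list of equal-length per-period return lists, one for
    every model tested. Returns (best_observed, median_bootstrap_best, p_value).
    """
    rng = random.Random(seed)
    n = len(returns_by_model[0])
    means = [sum(r) / n for r in returns_by_model]
    best_observed = max(means)

    # Zero-mean each series so that no model has an edge under the null.
    centered = [[x - m for x in r] for r, m in zip(returns_by_model, means)]

    boot_best = []
    for _ in range(n_boot):
        # Resample periods with replacement. The same periods are used for
        # every model so that cross-model correlation is preserved.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_best.append(max(sum(c[i] for i in idx) / n for c in centered))

    boot_best.sort()
    median_best = boot_best[n_boot // 2]
    p_value = sum(b >= best_observed for b in boot_best) / n_boot
    return best_observed, median_best, p_value


# Made-up example: 20 pure-noise models over 200 periods.
rng = random.Random(1)
noise = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(20)]
print(reality_check(noise, n_boot=200, seed=2))
```

The median bootstrapped best is the estimate of the data mining bias, and the p-value is the fraction of bootstrap replicates in which the best zero-edge model matched or beat the selected model.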

Here is the histogram of Sharpe ratios from implementing White’s Reality Check on the 245 models under comparison for 5,000 bootstrap iterations:

Bootstrapped Sharpe Ratios

The best Sharpe ratio I obtained is approximately equivalent to the median bootstrapped best Sharpe ratio, implying that its expectancy is actually close to zero. However, I have clearly mis-estimated the data mining bias since I excluded the models discarded during the hyperparameter tuning phase of model construction. In addition, this method is known to have a bias towards Type II errors. In other words, this method tends to reject systems that do have an edge. In any event, following is the R code that implements the reality check. I leave it to the interested reader to implement the reality check across the full set of models including those rejected during hyperparameter tuning.

Ensembles and hybrid methods

Ensembling is the practice of aggregating the predictions of multiple models in order to achieve a prediction accuracy that exceeds any individual model. It is analogous to using a committee of experts to reach a consensus. There are numerous ways to create an ensemble, including:

  • Bagging: aggregate models based on bootstrapped training data, for example the random forest algorithm.
  • Boosting: models developed sequentially where each additional model aims to improve performance on the least accurate part of the feature space of the previous model. An example is the boosted trees used previously in this study.
  • Stacking: model predictions are combined using a “meta-model”. In my experience, this tends to result in over-fit models.
  • Aggregating models based on random subsets of the input features.
  • Aggregating models based on random subsets of the training data, which is equivalent to bagging without replacement as opposed to bootstrapping.
  • Aggregating models based on different algorithms.

We have seen above that the models that incorporate boosting fared better than the random forest models, which use the bagging technique. Aronson and Masters (2013) advocate the latter three methods listed above as a protection against overfitting. They argue that even if the component models are over-fit, if they are trained on different training and/or feature sets, they will be exposed to different noise patterns while real patterns will tend to be represented across the various training sets. The idea is that the noise patterns cancel out, while the real patterns are reinforced. Ensembles appear to be most effective when the component models each have a positive expectancy and their predictions are uncorrelated.

In this study, I will focus on combining different algorithms and different subsets of the feature set and I will once again advocate a systematic approach. Following is my approach for selecting component models for an ensemble:
  1. Build a model for each variable and algorithm combination (already done, above)
  2. Exclude any model with a cross-validated Sharpe ratio < $x$, say 1.25
  3. Examine a correlation matrix of the predictions of the remaining models. Retain a subset of models such that all correlations are < 0.75
  4. Combine the remaining models by either averaging the predictions or forming a majority vote on the direction
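
The four steps above can be sketched in Python as follows. The post does not say how the low-correlation subset in step 3 was chosen, so the greedy rule here (walk down the models in order of Sharpe ratio and keep each one that is sufficiently uncorrelated with those already kept) is an assumption, as are all function names.

```python
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def select_models(models, min_sharpe=1.25, max_corr=0.75):
    """models: dict of name -> (cv_sharpe, predictions). Greedy selection."""
    ranked = sorted(
        (m for m in models.items() if m[1][0] >= min_sharpe),   # step 2
        key=lambda m: m[1][0], reverse=True,
    )
    kept = []
    for name, (_, preds) in ranked:                              # step 3
        if all(pearson(preds, models[k][1]) < max_corr for k in kept):
            kept.append(name)
    return kept


def average_ensemble(pred_lists):
    """Step 4a: average the component predictions period by period."""
    return [statistics.mean(col) for col in zip(*pred_lists)]


def majority_vote(pred_lists):
    """Step 4b: +1, -1 or 0 according to the direction most models agree on."""
    votes = [sum((p > 0) - (p < 0) for p in col) for col in zip(*pred_lists)]
    return [(v > 0) - (v < 0) for v in votes]


# Made-up cross-validated Sharpe ratios and predictions for four models.
models = {
    "a": (1.6, [0.1, -0.2, 0.3, -0.1]),
    "b": (1.5, [0.2, -0.4, 0.6, -0.2]),    # perfectly correlated with "a"
    "c": (1.3, [-0.1, -0.3, 0.2, 0.4]),
    "d": (0.9, [0.3, 0.1, -0.2, 0.1]),     # below the Sharpe cut-off
}
kept = select_models(models)
print(kept, majority_vote([models[k][1] for k in kept]))
```

With an even number of component models the vote can tie, in which case this sketch returns 0 (no position).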

The figure below shows the equity curves of the best model and the two ensembles formed by averaging and majority vote described above (trading position sizes on EUR/USD relative to the 100-period ATR and excluding trading costs).

Ensemble Comparison

In this case, there is little overall difference between the best model and either of the ensembles. There may be other possible ensembles that perform better; I haven’t performed an exhaustive search. Note that in the absence of an out of sample test following the aggregation of component models into an ensemble, this step has introduced a measure of selection bias into the results. Here’s the code to reproduce these results:

Finally, I’ll try combining the predictions using a linear regression model as a “meta-model”. This approach has the potential to curve-fit the results, so I will test it using the time series cross validation approach used above:

This gives a Sharpe ratio of only 0.94, significantly less than the simpler ensembles and indeed the best of the component models.

Next Steps

This post and the last provide a framework for systematically comparing predictive models based on machine learning algorithms as the basis of trading systems. Despite the length of the posts, they have barely scratched the surface of what is possible. The following are ideas for future research that I intend to pursue. I would love to hear from people interested in collaborating on any of these ideas and I offer my work thus far as a basis from which to proceed:

  • Regime-based split linear models. The idea is that the training data is split based on different market regimes on the basis of a particular variable or condition for determining where the split occurs. Regimes characterized by extreme volatility are inherently difficult to predict, and it may be useful to simply avoid trading during these periods. A regime-based approach could potentially verify this hypothesis and provide clues as to its practical application. I have briefly investigated this approach and did not find a measurable increase in performance, but I haven’t investigated closely enough to rule out this idea completely.
  • Another regime-based approach is to dynamically adjust the relative weights on individual component models based on each model’s performance in the current market conditions. This is appealing in that it does not require classification of the market regime. The weight adjustment is essentially regime-agnostic in the sense that it would not care about terms such as “trending” or “range bound”, whose identification of course carries substantial lag. By taking its cues from real-time performance, a dynamically adaptive weight allocation approach would minimize this lag to the extent possible.
  • So far I have simply aggregated models into ensembles through averaging predictions, combining directional votes and using simple linear regression as a “meta-model”. My feeling is that combining predictions using non-linear stacking methods will lead to over-fitting. However, it would be useful to verify this feeling on hard data.
  • I have also read about “sequential prediction” which, like boosting, involves building an ensemble from a series of models. If the first model is able to learn a dominant pattern, its residuals are then used as input to a second model under the assumption that these residuals contain the noise terms as well as more subtle patterns. The first model’s predictions are then considered a measure of the dominant pattern and the second model’s predictions an estimate of the deviation of the first model’s prediction from the target.
  • A related approach is to attempt to capture linear relationships in the data using classical time series methods such as ARIMA/GARCH and then use the residuals of these models as input to an algorithm capable of capturing more complex non-linear relationships, such as a neural network.
  • Finally, the work presented here has only considered the most recent values of individual features in model construction. There may be benefit in incorporating recent past values as features, for example the 1-period return for each of the last 3 periods.
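
As an illustration of the last idea in the list, lagged returns can be turned into features with a few lines of Python. This is a sketch only; the function name and the choice to order lags from most recent to oldest are assumptions.

```python
def add_lags(returns, n_lags=3):
    """Build (features, target) pairs from a list of 1-period returns.

    features holds the returns of the last n_lags periods, most recent
    first; target is the return of the following period.
    """
    samples = []
    for t in range(n_lags - 1, len(returns) - 1):
        features = [returns[t - lag] for lag in range(n_lags)]
        samples.append((features, returns[t + 1]))
    return samples


print(add_lags([0.1, -0.2, 0.3, 0.0, 0.5]))
# [([0.3, -0.2, 0.1], 0.0), ([0.0, 0.3, -0.2], 0.5)]
```

Each extra lag adds a feature per base variable, so the earlier caution about limiting the number of inputs applies here too.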

A Note on the Practicalities of Trading Systems Research

If I can offer one piece of advice based on my experiences, I would caution systems researchers (especially those with a penchant for the theoretical) against trying to build the “perfect” predictive model, if such a thing even exists. Admittedly it is an interesting exercise, but an approach much more grounded in the realities of extracting profits from the markets is to focus on building a model that meets one’s needs. If I could turn back the clock to the tune of about twelve months, I would go live with my machine learning models much earlier (as soon as they were fit for purpose) and then refine them as time went on. I made the mistake of repeatedly putting off the go-live date until I had improved the models’ performance just a bit more. In doing so, I missed some perfectly achievable profits. In addition, I would have uncovered implementation and execution issues much more efficiently.


You have seen in this post and the last that mining financial data for profitable patterns in a statistically sound manner is incredibly time consuming and effort-intensive. What you’ve read in these last two posts is simply a distillation of a significant research effort and one that I hope provides a framework for others interested in this field to take it further.

We have seen that neural networks initialized with stacked autoencoders have a great deal of potential in predictive models for trading systems, but that incredibly simple models like the k-nearest neighbors algorithm can also perform adequately. I have also touched on the specter of data mining bias and explored one possible method for accounting for it. Finally, we explored ensembles of component models, but didn’t get a significant boost to model performance in this case.

I intend to pursue the ideas listed in the Next Steps section, likely beginning with the sequential prediction approach. If people are interested in collaborating on any of these ideas, I would very much like to hear from you.


Thanks, very interesting.
Can you share the data set files?

Hi Lev

Apologies for that oversight. You can now access the raw data via a download link in the post (under the heading “Choosing combinations of variables”). I’ve also added the source code (written in Lite-C, compatible with Zorro) that I used to generate this data. This code can be modified to generate custom feature sets for any market or time period for which you have price data. The code will output a CSV file compatible with the framework in the post.


Hi Kris

Thanks for the source and csv.

I tried to reproduce your “heatmap of Sharpe ratios by window length and prediction threshold” and I get different results. All my predictions were in the range (-0.3 – 0.3).


That result has really got me scratching my head. I suspect something is not quite right with the implementation on your end because with so many models being compared, you would expect to see much more variation in the Sharpes beyond that range through random variation alone.

If anyone else gets this result, please let me know in the comments.

I mean predictions, not Sharpes.
The problem is only in “heatmap of Sharpe ratios by window length and prediction threshold” sample
modellist.nnet[[j]]$pred$pred in range of (-0.3 – 0.3)

OK, I understand now. That’s actually not an unexpected result and indicates that in this case the model is having difficulty predicting the observations in the tails of the returns/atr distribution. Here’s one of my results, for comparison:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.51530 -0.10380 -0.01300 -0.01491 0.07690 0.40590

Check out what happens with the directional accuracy of these predictions though. They are actually good enough to give the model a decent cross-validated equity curve. They also hint at some potential improvements that can be made. For example, would a classification approach be better than the regression approach presented here? What about intelligent subsampling of the input to each iteration of the time series cross-validation procedure? I actually don’t know off the top of my head if such approaches would make a difference, but the results you see hint that there are improvements that could be made.

Fantastic tutorial series! Had never come across a neural network initialised with stacked autoencoders before. Looking forward to playing about with this technique.

Thanks Gekko! They are a relatively recent addition to my toolkit as well. The unsupervised reconstruction of the input appears to help the network effectively clarify a pattern if it does exist. The Zorro guys have done some work in this area too, using a wider feature set and a deeper architecture than the ones I’ve used here. They achieved a directional accuracy in the order of ~60%. My own work corroborates this. SAEs have enormous potential.

This grabbed my attention as well. I have not had much success with NNs of various architectures for regression problems. Classification of course is quite good, although my lack of patience for fiddling with feature scaling and data shape often just leads me to use a GBM. I wonder if there is any literature out there explaining why it would help with regression specifically. There is quite a bit that explains why boosted trees and NNs don’t fare well in regression. If unsupervised pre-training really changes the game for regression that’s very encouraging, as there are a lot of those techniques already out there and more on the horizon. I am going to start testing tout de suite.

Hi Shane. In my experience, neural nets fare extremely well on complex regression tasks. They have been referred to as “universal approximators” for good reason, having the ability to approximate any linear or non-linear function (compare this with for example a linear time series model like ARIMA). Being powerful learners, they also have the ability to very easily over-fit the training data. Therefore it makes sense to apply appropriate guards against over-fitting: regularization techniques (L1, L2), random dropouts, unsupervised pre-training and early stopping can be effective ways to get the most out of a neural network while preventing its power from getting out of control.

Also, if you are more comfortable with classification, you can simply re-frame the regression problem into a classification one using something like:
if ( > 0) target.variable = 1
else target.variable = 0

Thanks for the reply (and the great blog material in general). I actually have no discomfort with regression, and wrt finance/trading, continuous or countable targets are usually my choice. I simply just have not been able to surpass a MARS model or a custom nonparametric approach using a probabilistic programming library with any NN (and still haven’t after a few tests with deepnet BTW). However, this may have a great deal to do with my feature space and my choice of target. One observation I have made is that these R NN libraries have done a great job of setting some defaults on the myriad hyperparameters and providing some quick automated ways to iterate through them.
One particular architecture that I note is glaringly lacking is Recurrent Neural Networks. LSTMs, GRUs and other time series/sequence aware NN architectures have achieved the best performance for me wrt finance/trading. Obviously, their ability to model non-stationary data is desirable. So hopefully someone is generous enough to write some libraries for those in R that become part of the caret ecosystem. In the meantime, I’ll keep at it with Torch, Keras, Blocks etc.

Again, you have motivated me to spend some more time with these models and the caret ecosystem in particular. Look forward to future posts.

Hi Shane

Investigating recurrent networks is a priority for me too. These guys use recurrent architectures in many of the topics discussed in this publication and report good results in the FX markets. I know of no library in R that incorporates RNNs just yet. Time to make a foray into the world of R library creation perhaps?

Hi Kris
Great article and thanks for posting!
I am interested in the stacked autoencoder approach you are using.
If you are using this to compress the data, can you not use more than 2/3 features to create inputs to your NN? Can I also ask how you go about choosing the optimal number of layers and the number of nodes in the autoencoder?

Hi Thomas

Indeed it does make sense to use more features with the stacked autoencoder approach. The unsupervised reconstruction of the input assists the network to detect any predictive patterns that may be present. This means that redundant or noisy features will have less of an impact on the output. I would however caution against a brute force approach where you throw everything you’ve got at such a network. You will quickly run into practical computation issues, particularly if you are simultaneously exploring various network architectures. Some intelligent feature engineering goes a long way.

In terms of choosing an optimal architecture, unfortunately there isn’t a simple formula that I’m aware of. Yoshua Bengio at the University of Montreal published some very useful guidelines for practical training of deep architectures which I use to guide my starting point. From there, I’ll train and cross-validate several architectures and hyperparameter sets. Max Kuhn’s caret package in R is my go-to toolkit for doing this as efficiently as possible. Here’s a link to Bengio’s paper: and Max’s website:

One word of caution, if I may: It’s easy to get bogged down in the search for an “optimal” model even having discovered several models that will serve the intended purpose very well. The search for an “optimal” model could go on indefinitely if we allowed it. Focus on a defined objective for your model and let that guide your efforts.

Thanks for commenting!

Kris, that’s another great post!
I love that “Comprehensive model comparison” heatmap.

I like using the K-Nearest Neighbors algorithm for my “exploration” (i.e. exploring my feature space) because it is MUCH faster than most other types of algorithms.
I find that KNN usually gives reasonable results, thus it allows me to find interesting features much more quickly ….

Let’s say I have 1000 features;
choose(1000, 2) + choose(1000, 3) = 166,666,500
choose(1000, 2) + choose(1000, 3) + choose(1000, 4) = 41,583,791,250

That’s a lot of combinations !
With KNN I can explore ~100,000 random combinations over a 24 hour period. That’s perhaps a 100x speedup compared with say SVM.

Best, Nick

Thanks Nick!

I agree, the heatmaps reveal a lot of useful information quickly. Plus they brighten up the blog a little!

That’s a lot of features! An exhaustive search of even the 2- and 3-variable combinations at a rate of 100,000/day would require approximately 4.5 years to complete! But of course you are doing random search of the feature space which is obviously a lot more efficient. Still, without seeing your data set, I would speculate that out of the 1,000 features, there would be many highly correlated, and therefore redundant features which could be removed with little to no detriment of the final model. Again, without seeing your data, I suspect that an intelligent feature selection phase would assist greatly.
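
For anyone wanting to check the arithmetic, the counts and the time estimate can be reproduced in a few lines of R. The rate of 100,000 combinations per day is simply the figure Nick quoted; the variable names are mine.

```r
n.features <- 1000

# Number of distinct 2- and 3-feature subsets, then adding 4-feature subsets
combos.2.3 <- choose(n.features, 2) + choose(n.features, 3)
combos.2.4 <- combos.2.3 + choose(n.features, 4)

# Time for an exhaustive search of the 2- and 3-feature subsets
rate.per.day <- 1e5                      # combinations evaluated per day (quoted above)
days.needed  <- combos.2.3 / rate.per.day
years.needed <- days.needed / 365.25

format(c(combos.2.3, combos.2.4), big.mark = ",")
round(years.needed, 1)
```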

Your approach got me thinking and I started googling to see whether others were also using this approach. Turns out there is some activity in this area. Have you seen this paper:

Have you experimented with such a feature-weighted version of k-NN?

Yes, a lot of them will be correlated to some degree and an intelligent feature selection process does help.

It’s easy to have 1000 features these days. Even starting with the standard technical indicators in TTR (~50 I think), you could add Super Smoother filters, take differences, apply PCA …. Then we have macro, fundamental and sentiment data….
A lot of people won’t like the data-driven approach – whatever works for them I suppose ….

Cheers Kris.

Indeed! There is certainly an air of dismissiveness to the data-driven approach in the quant finance community. Personally, I see it as just another weapon in the arsenal and a useful source of portfolio diversification in addition to the classical approaches.


Hi Kris,

Do you choose the best network structure for “live” trading after the optimization?
nnetGrid <- expand.grid(.layer1 = c(2:3), .layer2 = c(0:2), .layer3=0, .hidden_dropout=c(0, 0.1, 0.2), .visible_dropout=c(0, 0.1, 0.2))

I ask because I understand that for each model you choose the result with the best profit using the summary function?

Hi Lev

That line simply specifies the grid of hyperparameters that are tested. The caret train function works as follows: for each hyperparameter combination, build a model on each iteration of the cross-validation procedure and calculate the average performance across each cross-validation iteration. Once this is done, the ‘best’ combination of hyperparameters is selected and a model fitted on the entire training data set. The specific case of time series cross validation is very applicable to trading since it mimics the process we would go through in order to trade the markets using these methods, with one important caveat: if at any point in time we picked the ‘best’ model based on this hyperparameter tuning process for live trading, we introduce selection bias and therefore the cross-validated performance is an over-estimate of actual performance. An unbiased estimate would only be possible with another out of sample validation data set. White’s Reality Check is another method of dealing with this.
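
To make that procedure concrete, here is a rough base-R sketch of the same idea. It is not what caret does internally line for line: the data are simulated, the model is a simple polynomial regression with its degree as the only hyperparameter, and the growing training window with a 50-observation test block is just one possible time series cross-validation scheme.

```r
set.seed(1)

# Simulated data standing in for a real feature/target pair
n   <- 300
x   <- runif(n, -1, 1)
dat <- data.frame(x = x, y = sin(3 * x) + rnorm(n, sd = 0.3))

# Hyperparameter grid (here a single hyperparameter: polynomial degree)
grid <- expand.grid(degree = 1:4)

# Time series CV: train on observations 1..end.train, test on the next block
initial.window <- 100
horizon        <- 50
train.ends     <- seq(initial.window, n - horizon, by = horizon)

# For each hyperparameter value, average the out-of-sample RMSE across the CV iterations
cv.rmse <- sapply(grid$degree, function(d) {
  fold.rmse <- sapply(train.ends, function(end.train) {
    train <- dat[1:end.train, ]
    test  <- dat[(end.train + 1):(end.train + horizon), ]
    fit   <- lm(y ~ poly(x, d), data = train)
    sqrt(mean((test$y - predict(fit, newdata = test))^2))
  })
  mean(fold.rmse)
})

# Select the 'best' value and refit on the entire training set
best.degree <- grid$degree[which.min(cv.rmse)]
final.model <- lm(y ~ poly(x, best.degree), data = dat)
```

The selection bias mentioned above shows up in `min(cv.rmse)`: because it is the minimum over several noisy estimates, it will tend to flatter the chosen model.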

In practice, I find that a further out of sample validation set is generally a more accurate predictor of future performance, but the utility of this is limited by the finite nature of our data set. I am happy enough to go live with a model selected in this way, but I temper my expectations of its future performance using White’s Reality Check.

Interesting post – many thanks. Re your second point under “Next Steps” were you thinking along the lines of a variant of Zorro’s “Equity Curve Trading”, only rather than turning models on/off adjusting their relative position sizes (though some might be reduced to zero)?

Was also wondering if you had considered using equity curve plus other performance stats (e.g. rolling Sharpe/Sortino/K-Ratio) as predictors for a standalone overlay model that would automatically control individual model position size (weighting)?

Hi Andy. I actually had in mind something like your second point: using a rolling performance measure as a means of controlling individual model weighting. However there are of course many approaches and I don’t favour one over the other. What I am really trying to achieve is to reduce the selection bias in my system by including as many models as possible. The downside of my approach is that there is always a lag in calculating the “optimal” portfolio weights.
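
A minimal sketch of what such a rolling weighting scheme might look like is below. Everything in it is an assumption for illustration: the returns are simulated, the performance measure is a 60-day rolling Sharpe-style ratio, and models with a negative score are simply given zero weight.

```r
set.seed(42)

# Simulated daily returns for three hypothetical models (not real results)
n.days <- 250
rets <- cbind(m1 = rnorm(n.days, 0.0005, 0.01),
              m2 = rnorm(n.days, 0.0002, 0.01),
              m3 = rnorm(n.days, 0.0000, 0.01))

window  <- 60
weights <- matrix(NA_real_, n.days, ncol(rets),
                  dimnames = list(NULL, colnames(rets)))

for (t in (window + 1):n.days) {
  # Only returns realised before day t are used, which is the source of the lag
  past   <- rets[(t - window):(t - 1), , drop = FALSE]
  sharpe <- colMeans(past) / apply(past, 2, sd)
  score  <- pmax(sharpe, 0)   # models with negative rolling performance get zero weight
  weights[t, ] <- if (sum(score) > 0) score / sum(score) else rep(0, ncol(rets))
}

# Weighted portfolio return; NA for the first `window` days, before any weights exist
portfolio <- rowSums(weights * rets)
```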

Thanks. A long while back I had a similar lag issue with calculating the weights of a portfolio of non-ML models. One technique that seemed to work (occasionally quite well) and minimised the lag was to use a Kalman filter on the equity curves and calculate various difference measures based on that. I tried using all Kalman values, the filtered values, the smooths and the predictions. As I recall, the predictions worked best, which I didn’t expect.

Hi – thanks for the fascinating and informative post!
Did you try holding some data out and seeing how one of your deep neural nets performed on it? When I follow along with your code and attempt this, the predictions on out of sample data have very little variation, suggesting an input scaling problem or that the network has been overfit…

Hey Jim
Yes I did try holding out some data and testing the performance of the various neural nets. My out of sample predictions do show a sensible amount of variation. I get a greater variation between the neural nets trained on smaller training windows and those trained on larger windows, as one would expect. Further, while directional out of sample predictions tend to converge as the training window increases, I still see variation in the absolute out of sample predictions.

I doubt the problem you are having is related to input scaling if you are using the same data that I used, since the inputs are all scaled to the range -1 to +1 (you can verify this with the R command “summary(eu)”). Of course, if you are using your own data, you would need to ensure that you pre-processed your input features appropriately.
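
If you are working with your own data, one simple way to get features into that range is min-max scaling. The sketch below is an assumption about how you might do it rather than a record of how the posted data set was prepared; the helper `scale.to.unit` and the numbers are made up.

```r
# Hypothetical helper: linearly map x so that lo -> -1 and hi -> +1
scale.to.unit <- function(x, lo = min(x), hi = max(x)) {
  2 * (x - lo) / (hi - lo) - 1
}

# Illustrative raw feature values
train <- c(12, 15, 9, 20, 14)
test  <- c(10, 22)

# Take the scaling limits from the training data only, then reuse them out of sample
lo <- min(train)
hi <- max(train)
train.scaled <- scale.to.unit(train, lo, hi)
test.scaled  <- scale.to.unit(test, lo, hi)

range(train.scaled)   # spans -1 to +1 by construction
range(test.scaled)    # can fall outside that range
```

Out of sample values that land well outside -1 to +1 are worth checking for, as they can push a network into regions it never saw in training.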

Thanks for checking out my blog!

Hi Kris,

Great posts! Many thanks for sharing.

Have you ever put real money to work trading your machine learning systems? Some people say these systems always mysteriously fail, while others say GS, JP Morgan, DE Shaw etc. all use machine learning systems to trade and make big bucks. What’s your comment on this?

Hi Jeff

No problem, glad you found it useful. Yes, I am currently allocating to my ML strategies. I don’t think there is any great mystery about trading systems failing – it happens all the time. I have little idea what those guys you mentioned do to make money, but if they aren’t using machine learning, they are forgoing a very useful tool.

Hi Kris,

Thank you so much for your comment. What do your average profit factors look like for your machine learning trading systems? Which factor do you think leads to a successful trading system, indicators or algorithms, if you’re asked to name only one?

Happy Holidays


Hi Jeff

If I had to pick one, I’d pick execution as the most important determinant of algo trading success. That’s probably a very boring response, but one that I think is justified!

Happy holidays to you too.


Thanks so much for this really insightful tutorial, I think this is one of the best examples of nnet application to risk and return data in the public domain.

I have a basic question on the absretSummary function you’re using to assess the learner’s performance. Are you using the observed risk-adjusted returns in the line below, rather than geometric or logn scaled returns?

trades <- positions*data[, "obs"]

Thanks & Regards,

Matt, thanks for the kind words, great to hear that the article was useful.

You’re correct – I’m using the observed risk adjusted returns in my objective function: the last close minus the prior close divided by the 100-period average true range.
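
For readers who want to build a similar target themselves, here is a rough base-R sketch. The prices are simulated, and I have assumed a simple 100-period moving average of the true range measured at the same bar as the price change; Wilder-style smoothing or a lagged ATR are equally plausible choices.

```r
set.seed(7)

# Simulated price series standing in for real OHLC data
n     <- 300
close <- 1.10 + cumsum(rnorm(n, 0, 0.002))
high  <- close + runif(n, 0, 0.002)
low   <- close - runif(n, 0, 0.002)

# True range: the largest of the bar's range and the gaps from the prior close
prev.close <- c(NA, close[-n])
true.range <- pmax(high - low, abs(high - prev.close), abs(low - prev.close))

# 100-period average true range as a trailing simple moving average
atr.period <- 100
atr <- as.numeric(stats::filter(true.range, rep(1 / atr.period, atr.period), sides = 1))

# Volatility-normalised return: last close minus prior close, divided by the ATR
norm.ret <- (close - prev.close) / atr
```

The first 100 values of `norm.ret` are NA because the first bar has no prior close and the moving average needs a full window of true ranges.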


Someone pointed out that there is no sense in doing a variable analysis, because with a different data window or a different time period the variables would be different.

I see that if you choose a specific time (such as the years of the strategy) and data window (such as the period used to generate an indicator), everything will be different. An SMA crossover strategy might win using a 20-period average in 2009 and a 15-period average in 2015. The fact that the period changes does not mean that the analysis is wrong, just as the fact that the variable importance may change does not mean that the analysis is wrong.

I see two solutions for the change in variable importance depending on data window etc.:

1- We use different data windows and years to do the analysis and pick only the variables that remain important in all the different scenarios. Then the discussion is closed because the variables chosen are the best in each scenario.

2- We perform a WFO where the variables used to train the model are changed as the parameters of the model are tuned in every WFO cycle. In that way “variable tuning” should be added to the analysis, meaning that if we choose to retrain the model on the first trading day of each month, then the variable analysis should also be done as part of that training. Does it make sense?

For sure, you raise really good points. Both your solutions are sensible; however in my experience you’ll have more success with the second one. Including the variable selection step in the model re-training process is not uncommon in machine learning applications to time series data. Another option is to create models with the best subset of variables over different lookback periods and ensemble their predictions.
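
A bare-bones sketch of the second solution, with variable selection repeated inside every walk-forward cycle, might look like the following. The data are simulated, the window lengths are arbitrary, and ranking features by absolute correlation with the target is only a stand-in for a proper selection method such as Boruta.

```r
set.seed(123)

# Simulated features and target; only f1 and f2 carry any signal
n <- 600
p <- 8
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("f", 1:p)))
y <- 0.5 * X[, "f1"] - 0.4 * X[, "f2"] + rnorm(n)

train.len <- 200   # observations used for selection and fitting in each cycle
test.len  <- 50    # observations predicted before the next re-training
n.keep    <- 3     # number of features retained in each cycle

starts   <- seq(1, n - train.len - test.len + 1, by = test.len)
selected <- list()
preds    <- rep(NA_real_, n)

for (s in starts) {
  tr <- s:(s + train.len - 1)
  te <- (s + train.len):(s + train.len + test.len - 1)

  # Variable selection is part of each cycle and sees the training window only
  score <- abs(cor(X[tr, ], y[tr]))[, 1]
  keep  <- names(sort(score, decreasing = TRUE))[1:n.keep]

  # Fit on the selected features, then predict the following out-of-sample block
  fit <- lm(y ~ ., data = data.frame(y = y[tr], X[tr, keep, drop = FALSE]))
  preds[te] <- predict(fit, newdata = data.frame(X[te, keep, drop = FALSE]))

  selected[[length(selected) + 1]] <- keep
}
```

Inspecting `selected` afterwards shows how stable the chosen variables are from one cycle to the next, which speaks directly to the question raised above.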

Great content on your site!

I’m trying to get my head around stacked autoencoders and dropout.

Quora: “Though the fundamental principle is same, making sure that large no of the parameters do not over-fit the data. But on a closer look they work differently. While denoising work on  input layer only, dropout work on all layers (but output).  So I guess when u have a deep network and you also want the hidden units of some higher layer to avoid over-fitting you might want to choose dropout over denoising. For shallow models both should work the same.”

The following quote is from the following article:

The idea of adding noise to the states of units has previously been used in the context of Denoising Autoencoders (DAEs) by Vincent et al. (2008, 2010) where noise is added to the input units of an autoencoder and the network is trained to reconstruct the noise-free input. Our work extends this idea by showing that dropout can be effectively applied in the hidden layers as well and that it can be interpreted as a form of model averaging

Seems like autoencoders and dropout are two ways of achieving the same thing, although dropout can be used on more layers. Any thoughts on this would be most welcome!


I guess you could say that, although I’m not sure about the mathematical equivalence. In my experience, dropout has been the most effective way to control overfitting in my applications. As I understand denoising autoencoders, we apply noise only on the input layer, while of course dropout extends through the network. I was heavily into autoencoders a few years ago, but have moved away from them in the deep learning research I’m currently doing.

Hi Kris, and thank you so much. All your work is fantastic!

I have a question:

Why are the modellist.nnet[[j]]$pred$pred values different from predict(modellist.nnet[[j]])?


Thanks & Regards


