## Machine learning for financial prediction: experimentation with David Aronson’s latest work – part 1

**Update 1: In response to a suggestion from a reader, I’ve added a section on feature selection using the Boruta package.**

**Update 2: Responding to another suggestion, I’ve added some equity curves of a simple trading system using the knowledge gained from this analysis.**

**Update 3: In response to a comment from Alon, I’ve added some Lite-C code that generates the training data used in this post and outputs it to a .csv file. You can have some serious fun with this and it enables you to greatly extend the research presented here. Enjoy!**

One of the first books I read when I began studying the markets a few years ago was David Aronson’s *Evidence Based Technical Analysis*. The engineer in me was attracted to the ‘Evidence Based’ part of the title. This was soon after I had digested a trading book that claimed a basis in chaos theory, the link to which actually turned out to be non-existent. Apparently using complex-sounding terms in the title of a trading book lends some measure of credibility. Anyway, *Evidence Based Technical Analysis* is largely a justification of a scientific approach to trading, including a method for rigorous assessment of the presence of data mining bias in backtest results. There is also a compelling discussion based in cognitive psychology of the reasons that some traders turn away from objective methods and embrace subjective beliefs. I find this area fascinating.

Readers of this blog will know that I am very interested in using machine learning to profit from the markets. Imagine my delight when I discovered that David Aronson had co-authored a new book with Timothy Masters titled *Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments* – which I will herein refer to as SSML. I quickly devoured the book and have used it as a handy reference ever since. While it is intended as a companion to Aronson’s (free) software platform for strategy development, it contains numerous practical tips for any machine learning practitioner and I’ve implemented most of his ideas in R.

I used SSML to guide my early forays into machine learning for trading, and this series describes some of those early experiments. A detailed review of everything I learned from SSML and all the research it inspired would be too voluminous to relate here; what follows is an account of some of the more significant and practical lessons I encountered along the way.

This post will focus on feature engineering and also introduce the data mining approach. The next post will focus on algorithm selection and ensemble methods for combining the predictions of numerous learners.

## The data mining approach

Data mining is just one approach to extracting profits from the markets and is different to a model-based approach. Rather than constructing a mathematical representation of price, returns or volatility from first principles, data mining involves searching for patterns first and then fitting a model to those patterns after the fact. Both model-based and data mining approaches have pros and cons, and I contend that using both can provide a valuable source of portfolio diversity.

The Financial Hacker summed up the advantages and disadvantages of the data mining approach nicely:

The advantage of data mining is that you do not need to care about market hypotheses. The disadvantage: those methods usually find a vast amount of random patterns and thus generate a vast amount of worthless strategies. Since mere data mining is a blind approach, distinguishing real patterns – caused by real market inefficiencies – from random patterns is a challenging task. Even sophisticated reality checks can normally not eliminate all data mining bias. Not many successful trading systems generated by data mining methods are known today.

David Aronson himself cautions against putting blind faith in data mining methods:

Though data mining is a promising approach for finding predictive patterns in data produced by largely random complex processes such as financial markets, its findings are upwardly biased. This is the data mining bias. Thus, the profitability of methods discovered by data mining must be evaluated with specialized statistical tests designed to cope with the data mining bias.

I would add that the implicit assumption behind the data mining approach is that the patterns identified will continue to repeat in the future. Of course, the validity of this assumption is unlikely to ever be certain.

Data mining is a term that can mean different things to different people depending on the context. When I refer to a data mining approach to trading systems development, I am referring to the use of statistical learning algorithms to uncover relationships between feature variables and a target variable (in the regression context, these would be referred to as the independent and dependent variables, respectively). The feature variables are observations that are assumed to have some relationship to the target variable and could include, for example, historical returns, historical volatility, various transformations or derivatives of a price series, economic indicators, and sentiment barometers. The target variable is the object to be predicted from the feature variables and could be the future return (next day return, next month return etc), the sign of the next day’s return, or the actual price level (although the latter is not really recommended, for reasons that will be explained below).

Although I differentiate between the data mining approach and the model-based approach, the data mining approach can also be considered an exercise in predictive modelling. Interestingly, the model-based approaches that I have written about previously (for example ARIMA, GARCH, Random Walk etc) assume linear relationships between variables. Modelling non-linear relationships using these approaches is (apparently) complex and time consuming. On the other hand, some statistical learning algorithms can be considered ‘universal approximators’ in that they have the ability to model any linear or non-linear relationship. It was not my intention to get into a philosophical discussion about the differences between a model-based approach and a data mining approach, but clearly there is some overlap between the two. In the near future I am going to write about my efforts to create a hybrid approach that attempts a synergistic combination of classical linear time series modelling and non-linear statistical learning – trust me, it is actually much more interesting than it sounds. Watch this space.

## Variables and feature engineering

### The prediction target

The first and most obvious decision to be made is the choice of target variable. In other words, what are we trying to predict? For one-day ahead forecasting systems, profit is the usual target. I used the next day’s return normalized to the recent average true range, the implication being that in live trading, position sizes would be inversely proportionate to the recent volatility. In addition, by normalizing the target variable in this way, we may be able to train the model on multiple markets, as the target will be on the same scale for each.
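This target construction can be sketched in base R. This is a minimal illustration, assuming daily OHLC columns named `high`, `low` and `close`; the ATR here is a simple moving average of the true range, which may differ from the exact smoothing used in SSML:

```r
# ATR: simple moving average of the true range over n periods
atr <- function(high, low, close, n = 100) {
  prev.close <- c(NA, head(close, -1))
  tr <- pmax(high - low, abs(high - prev.close), abs(low - prev.close), na.rm = TRUE)
  as.numeric(filter(tr, rep(1 / n, n), sides = 1))  # stats::filter moving average
}

# target: next day's log return divided by the recent ATR
make.target <- function(ohlc, n = 100) {
  ret.next <- c(diff(log(ohlc$close)), NA)             # next day's log return, aligned to today
  ret.next / atr(ohlc$high, ohlc$low, ohlc$close, n)   # volatility-normalized
}
```

Because the target is expressed in ATR units rather than raw returns, the same model can in principle be trained across instruments with very different price levels and volatilities.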

### Choosing predictive variables

In SSML, Aronson states that the golden rule of feature selection is that the predictive power should come primarily from the features and not from the model itself. My research corroborated this statement, with many (but not all) algorithm types returning correlated predictions for the same feature set. I found that the choice of features had a far greater impact on performance than choice of model. The implication is that spending considerable effort on feature selection and feature engineering is well and truly justified. I believe it is critical to achieving decent model performance.

Many variables will have little or no relationship with the target variable and including these will lead to overfitting or other forms of poor performance. Aronson recommends using chi-square tests and Cramer’s V to quantify the relationship between variables and the target. I actually didn’t use this approach, so I can’t comment on it. I used a number of other approaches, including ranking a list of candidate features according to their Maximal Information Coefficient (MIC) and selecting the highest ranked features, Recursive Feature Elimination (RFE) via the caret package in R, an exhaustive search of all linear models, and Principal Components Analysis (PCA). Each of these is discussed below.

### Some candidate features

Following is the list of features that I investigated as part of this research, most of which were derived from SSML. The list consists only of derivatives and transformations of the price series; I haven’t yet tested exogenous variables, such as economic indicators, the price histories of other instruments and the like, but I think these are deserving of attention too. The list is by no means exhaustive, but it provides a decent starting point:

- 1-day log return
- Trend deviation: the logarithm of the closing price divided by the lowpass filtered price
- Momentum: the price today relative to the price *x* days ago, normalized by the standard deviation of daily price changes
- ATR: the average true range of the price series
- Velocity: a one-step ahead linear regression forecast on closing prices
- Linear forecast deviation: the difference between the most recent closing price and the closing price predicted by a linear regression line
- Price variance ratio: the ratio of the variance of the log of closing prices over a short time period to that over a long time period
- Delta price variance ratio: the difference between the current value of the price variance ratio and its value *x* periods ago
- The Market Meanness Index: a measure of the likelihood of the market being in a state of mean reversion, created by the Financial Hacker
- MMI deviation: the difference between the current value of the Market Meanness Index and its value *x* periods ago
- The Hurst exponent
- ATR ratio: the ratio of an ATR of a short (recent) price history to an ATR of a longer period
- Delta ATR ratio: the difference between the current value of the ATR ratio and the value *x* bars ago
- Bollinger width: the log ratio of the standard deviation of closing prices to the mean of closing prices, that is, a moving standard deviation of closing prices relative to the moving average of closing prices
- Delta Bollinger width: the difference between the current value of the Bollinger width and its value *x* bars ago
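A few of these features can be sketched in base R as follows. The lookback parameters are placeholders, and in practice the normalizing standard deviation in the momentum calculation would be computed over a rolling window rather than the full history:

```r
# 'close' is a numeric vector of closing prices
lag.n <- function(x, n) c(rep(NA, n), head(x, -n))

# 1-day log return
log.return.1d <- function(close) c(NA, diff(log(close)))

# momentum: price today relative to the price n days ago,
# normalized by the standard deviation of daily price changes
momentum <- function(close, n) {
  (close - lag.n(close, n)) / sd(diff(close), na.rm = TRUE)
}

# price variance ratio: variance of log closing prices over a short
# window divided by that over a long window
price.variance.ratio <- function(close, short = 10, long = 100) {
  lc <- log(close)
  roll.var <- function(x, n) sapply(seq_along(x), function(i)
    if (i < n) NA else var(x[(i - n + 1):i]))
  roll.var(lc, short) / roll.var(lc, long)
}
```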

Thus far I have only considered the most recent value of each variable. I suspect that the recent history of each variable would provide another useful dimension of data to mine. I left this out of the feature selection stage since it makes more sense to firstly identify features whose current values contain predictive information about the target variable before considering their recent histories. Incorporating this from the beginning of the feature selection stage would increase the complexity of the process by several orders of magnitude and would be unlikely to provide much additional value. I base that statement on a number of my own assumptions, not to mention the practicalities of the data mining approach, rather than any hard evidence.

### Transforming the candidate features

In my experiments, the variables listed above were used with various cutoff periods (that is, the number of periods used in their calculation). Typically, I used values between 3 and 20 since Aronson states in SSML that lookback periods greater than about 20 will generally not contain information useful to the one period ahead forecast. Some variables (like the Market Meanness Index) benefit from a longer lookback. For these, I experimented with 50, 100, and 150 bars.

Additionally, it is important to enforce a degree of stationarity on the variables. David Aronson again:

Using stationary variables can have an enormous positive impact on a machine learning model. There are numerous adjustments that can be made in order to enforce stationarity such as centering, scaling, and normalization. So long as the historical lookback period of the adjustment is long relative to the frequency of trade signals, important information is almost never lost and the improvements to model performance are vast.

- Scaling: divide the indicator by the interquartile range (note, not by the standard deviation, since the interquartile range is not as sensitive to extremely large or small values).
- Centering: subtract the historical median from the current value.
- Normalization: both of the above. Roughly equivalent to traditional z-score standardization, but uses the median and interquartile range rather than the mean and standard deviation in order to reduce the impact of outliers.
- Regular normalization: scales the data over the lookback period via (x - min)/(max - min), then re-centers the result to the desired range (here, -1 to +1).

In my experiments, I generally adopted regular normalization using the most recent 50 values of the features.
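These adjustments can be sketched as rolling-window transforms in base R. The helper names are my own, and the 50-bar window matches the value I generally used:

```r
# apply f to a trailing window of n values at each position
roll.apply <- function(x, n, f) sapply(seq_along(x), function(i)
  if (i < n) NA else f(x[(i - n + 1):i]))

center.f    <- function(x, n = 50) x - roll.apply(x, n, median)      # centering
scale.f     <- function(x, n = 50) x / roll.apply(x, n, IQR)         # scaling by interquartile range
normalize.f <- function(x, n = 50) center.f(x, n) / roll.apply(x, n, IQR)  # both

# 'regular normalization': rescale the last n values to the range -1 to +1
reg.normalize <- function(x, n = 50) {
  mins <- roll.apply(x, n, min)
  maxs <- roll.apply(x, n, max)
  2 * (x - mins) / (maxs - mins) - 1
}
```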

### Removing highly correlated variables

It makes sense to remove variables that are highly correlated with other variables since they are unlikely to provide additional information that isn’t already contained in the other variables. Keeping these variables will also add unnecessary computation time, increase the risk of overfitting and bias the final model towards the correlated variables.

```r
library(caret)     # for findCorrelation
library(corrplot)

# compute the correlation matrix without the target; features in data frame 'gu_data'
cor.mat <- cor(gu_data[, -1])
highCor <- findCorrelation(cor.mat, 0.50)  # apply correlation filter
highCor <- highCor + 1                     # increment by one to account for removing the target above
gu_data_filt <- gu_data[, -highCor]
cor.mat.filt <- cor(gu_data_filt)
corrplot(cor.mat.filt, order = "hclust", type = "lower")  # plot correlation matrix
```

These are the remaining variables and their pairwise correlations:

### Feature selection via Maximal Information

The maximal information coefficient (MIC) is a non-parametric measure of two-variable dependence designed specifically for rapid exploration of many-dimensional data sets. Read more about MIC here. I used the minerva package in R to rank my variables according to their MIC with the target variable (next day’s return normalized to the 100-period ATR). Results shown below and throughout this post are for the GBP/USD exchange rate from 19 March 2009 to 31 December 2014 (daily data):

```r
## MIC
library(minerva)

# features and target in data frame 'gu_data_filt'; target in column 1
mine.filt <- mine(gu_data_filt[, -1], gu_data_filt[, 1])
mic <- as.data.frame(mine.filt$MIC)
mic.ordered <- mic[order(mic[, 1]), , drop = FALSE]

### RESULTS
# MMI            0.1210261
# atrRat10_20    0.1287799
# bWidth10       0.1315506
# apc3           0.1353889
# ATR7           0.1356449
# mom5           0.1404709
# deltaPVR5      0.1447020
# bWidth100      0.1454078
# deltaMMI       0.1467144
# deltaATRRat5   0.1503392
# bWidth20       0.1517109
# atrRat10_100   0.1559397
# velocity10     0.1601864
# deltabWidth3   0.1607715
# mom3           0.1653665
```

These results show that none of the features have a particularly high MIC with respect to the target variable, which is what I would expect from noisy data such as daily exchange rates. However, certain variables have a higher MIC than others. In particular, the 3-period momentum, the 3-period delta bollinger width and the 10-period velocity outperform the rest of the variables by a decent margin.

Notice that the top 2 MICs come from variables that have a short lookback period (3 days). The top variables are also fairly simple measures of trend, momentum and price volatility, or the recent changes in those characteristics.

### Recursive feature elimination

I also used recursive feature elimination (RFE) via the caret package in R to isolate the most predictive features from my list of candidates. RFE is an iterative process: a model is built from the entire set of features, the features are ranked by importance, the least important are discarded, and the process is repeated on the reduced set. The feature subset that produces the best-performing model is declared the most useful.

I performed RFE using a random forest model:

```r
library(caret)

# define the control function and CV method
cntrl <- rfeControl(functions = rfFuncs, method = "repeatedcv", repeats = 50, number = 5)
rfe.results <- rfe(gu_data_filt[, -1], gu_data_filt[, 1], sizes = c(2:15), rfeControl = cntrl)
print(rfe.results)
predictors(rfe.results)  # list final feature set

#### Results
# The top 5 variables (out of 15):
#    atrRat10_100, atrRat10_20, velocity10, ATR7, mom5
```

In this case, the RFE process has emphasized variables that describe volatility and trend, but has decided that the best performance is obtained by incorporating all 15 variables into the model. Here’s a plot of the cross validated performance of the best feature set for various numbers of features:

I am tempted to take the results of the RFE with a grain of salt. My reasons are:

- The RFE algorithm does not consider interactions between variables. For example, assume that two variables individually have no effect on model performance, but due to some relationship between them they improve performance when both are included in the feature set. RFE is likely to miss this predictive relationship.
- The performance of RFE is directly related to the ability of the specific algorithm (in this case random forest) to uncover relationships between the variables and the target. At this stage of the process, we have absolutely no evidence that the random forest model is applicable in this sense to our particular data set.
- Finally, the implementation of RFE that I used was the ‘out of the box’ caret version. This implementation uses root mean squared error (RMSE) as the objective function, however I don’t believe that RMSE is the best objective function for this data due to the significant influence of extreme values on model performance. It is possible to have a low RMSE but poor overall performance if the model is accurate across the middle regions of the target space (corresponding to small wins and losses), but inaccurate in the tails (corresponding to big wins and losses).

In order to address (3) above, I implemented a custom summary function so that the RFE was performed such that the cross-validated absolute return was maximized. I also applied the additional criterion that only predictions with an absolute value of greater than 5 would be considered under the assumption that in live trading we wouldn’t enter positions unless the prediction exceeded this value. The custom summary function that I used and the results are as follows:

```r
absretSummary <- function(data, lev = NULL, model = NULL) {
  # take a position only when the absolute prediction exceeds 5
  positions <- ifelse(abs(data[, "pred"]) > 5, sign(data[, "pred"]), 0)
  trades <- abs(c(1, diff(positions)))
  profits <- positions * data[, "obs"]
  profit <- sum(profits)
  names(profit) <- "profit"
  return(profit)
}

### Results
# The top 2 variables (out of 2): atrRat10_100, atrRat10_20
```

The results are a little different to those obtained using RMSE as the objective function. The focus is still well and truly on the volatility indicators, but in this case the best cross validated performance occurred when selecting only 2 out of the 15 candidate variables. Here’s a plot of the cross validated performance of the best feature set for various numbers of features:

The model clearly performs better in terms of absolute return for a smaller number of predictors. Performance bottoms at 8 predictors and then improves, but never again achieves the performance obtained with 2-4 predictors. This is consistent with Aronson’s assertion that we should stick with at most 3-4 variables otherwise overfitting is almost unavoidable.

The performance profile of the model tuned on absolute return is very different to that of the model tuned on RMSE, which displays a consistent improvement as the number of predictors is increased. Using RMSE as the objective function (which seems to be the default in many applications I’ve come across) would result in a very sub-optimal final model in this case. This highlights the importance of ensuring that the objective function is a good proxy for the performance being sought in practice.

In the RFE example above, I used 50 iterations of 5-fold cross validation, but I haven’t held out a test set of data or estimated performance with an inner cross validation loop.

### Models with in-built feature selection

A number of machine learning algorithms have feature selection in-built. Max Kuhn’s website for the caret package contains a list of such models that are accessible through the caret package. I’ll apply several and compare the features selected to those selected with other methods. For this experiment, I used a diverse range of algorithms that include various ensemble methods and both linear and non-linear interactions:

- Bagged multivariate adaptive regression splines (MARS)
- Boosted generalized additive model (bGAM)
- Lasso
- Spike and slab regression (SSR)
- Model tree
- Stochastic gradient boosting (SGB)
- Bayesian additive regression trees (BART)

For each model, I did only very basic hyperparameter tuning within caret using 5-fold cross validation repeated 10 times. This returns the model with the best cross-validated performance (that is, the best average absolute return over each cross validation run). In practice, I would actually repeat cross validation at least 100 times and average the performance across the resulting 500 models. Maximization of absolute return was used as the objective function.
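As an illustration of the setup, a single model was tuned roughly along these lines. This is a sketch, assuming the `absretSummary` function shown earlier and the features/target in `gu_data_filt` (target in column 1); `bagEarth` is caret's method name for bagged MARS:

```r
library(caret)

# repeated CV with the custom absolute-return summary function
cntrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
                      summaryFunction = absretSummary)

mars.model <- train(gu_data_filt[, -1], gu_data_filt[, 1],
                    method = "bagEarth",            # bagged MARS
                    trControl = cntrl,
                    metric = "profit", maximize = TRUE)
```

Swapping `method` for the other algorithms (`gamboost`, `lasso`, `spikeslab`, `gbm`, `bartMachine` and so on) repeats the experiment for each learner under the same objective.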

The table below shows the top 5 variables and the performance of the final cross validated model for each algorithm:

| Algorithm | Abs. Ret | Var. 1 | Var. 2 | Var. 3 | Var. 4 | Var. 5 |
|-----------|----------|--------|--------|--------|--------|--------|
| MARS | 316 | mom5 | bWidth10 | deltaMMI | velocity10 | apc3 |
| bGAM | 494 | atrRat10_100 | mom5 | bWidth100 | deltaPVR5 | ATR7 |
| Lasso | 251 | velocity10 | mom3 | mom5 | apc3 | deltaPVR5 |
| SSR | 117 | atrRat10_100 | mom5 | bWidth100 | deltaPVR5 | ATR7 |
| Model tree | 121 | atrRat10_100 | mom5 | bWidth100 | ATR7 | deltaATRRat5 |
| SGB | 63 | atrRat10_100 | deltaATRRat5 | ATR7 | mom5 | deltaPVR5 |
| BART | 198 | atrRat10_100 | bWidth20 | mom5 | atrRat10_20 | deltaMMI |

We can see that 5-day momentum is included in the top 5 predictors for every algorithm I investigated. The ratio of the 10- to 100-day ATR featured 5 out of 7 times, and was the top feature every time it was selected. The bollinger width variable was selected 5 times, in one form or another (10- and 20-day variables once each and 100-day variable thrice). Other notable mentions include the 7-day ATR and the 5-day change in the price variance ratio which were both included in the top variables 4 times. The table below summarizes the frequency with which each variable was selected:

13 of the 15 variables were selected in the top 5 by at least one algorithm. Only MMI and deltabWidth3 never made the top 5.

### Model selection using glmulti

The glmulti package fits all possible unique generalized linear models from the variables and returns the ‘best’ models as determined by an information criterion (Akaike in this case). The package is essentially a wrapper for the glm (generalized linear model) function that allows selection of the ‘best’ model or models, providing insight into the most predictive variables. By default, glmulti builds models from the main effects, but there is an option to also include pairwise interactions between variables. This increases the computation time considerably, and I found that the resulting ‘best’ models were orders of magnitude more complex than those obtained using main effects only, while their results were on par.
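For context, the fitted object `x` examined below comes from a call along these lines. This is a sketch, assuming the target column is named `target` in `gu_data_filt`:

```r
library(glmulti)

# exhaustive search over all main-effects GLMs, ranked by AIC
x <- glmulti(target ~ ., data = gu_data_filt,
             level = 1,            # main effects only (level = 2 adds pairwise interactions)
             method = "h",         # exhaustive (h) search
             crit = "aic",         # information criterion for ranking
             confsetsize = 1000)   # retain the best 1,000 models
```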

```r
plot(x, type = 's', col = 'blue')
print(x)

# glmulti.analysis
# Method: h / Fitting: glm / IC used: aic
# Level: 1 / Marginality: FALSE
# From 1000 models:
# Best IC: 8443.65959954518
# Best model:
# [1] "target ~ 1 + atrRat10_100 + atrRat10_20 + bWidth100 + ATR7"
# Evidence weight: 0.00982721902929488
# Worst IC: 8449.63630478929
# 17 models within 2 IC units.
# 901 models to reach 95% of evidence weight.
```

We retain the models whose AICs are less than 2 units from the ‘best’ model. 2 units is a rule of thumb for models that, for all intents and purposes, are likely to be on par in terms of their performance:

```r
weights <- weightable(x)
bst <- weights[weights$aic <= min(weights$aic) + 2, ]
print(bst)

#    model
# 1  target ~ 1 + atrRat10_100 + atrRat10_20 + bWidth100 + ATR7
# 2  target ~ 1 + atrRat10_100 + bWidth100 + ATR7
# 3  target ~ 1 + bWidth100 + ATR7
# 4  target ~ 1 + atrRat10_100 + deltaATRRat5 + atrRat10_20 + bWidth100 + ATR7
# 5  target ~ 1 + atrRat10_100 + deltaMMI + atrRat10_20 + bWidth100 + ATR7
# 6  target ~ 1 + atrRat10_100 + atrRat10_20 + bWidth100 + bWidth20 + ATR7
# 7  target ~ 1 + atrRat10_100 + deltaMMI + bWidth100 + ATR7
# 8  target ~ 1 + velocity10 + bWidth100 + ATR7
# 9  target ~ 1 + atrRat10_100 + deltabWidth3 + atrRat10_20 + bWidth100 + ATR7
# 10 target ~ 1 + mom5 + atrRat10_100 + atrRat10_20 + bWidth100 + ATR7
# 11 target ~ 1 + bWidth100 + bWidth20 + ATR7
# 12 target ~ 1 + mom3 + atrRat10_100 + atrRat10_20 + bWidth100 + ATR7
# 13 target ~ 1 + velocity10 + atrRat10_100 + atrRat10_20 + bWidth100 + ATR7
# 14 target ~ 1 + atrRat10_100 + MMI + atrRat10_20 + bWidth100 + ATR7
# 15 target ~ 1 + apc3 + atrRat10_100 + atrRat10_20 + bWidth100 + ATR7
# 16 target ~ 1 + deltaPVR5 + atrRat10_100 + atrRat10_20 + bWidth100 + ATR7
# 17 target ~ 1 + atrRat10_100 + bWidth10 + atrRat10_20 + bWidth100 + ATR7
```

Notice any patterns here? The top models all selected the 7-day ATR and a bollinger width variable. The ATR ratios also feature heavily. Noticeably sparse are the momentum variables. This is confirmed with this plot of the model averaged variable importance (averaged over the best 1,000 models):

Note that these models only considered the main, linear effects of each variable on the target. Of course, there is no guarantee that any relationship is linear, if it exists at all. Still, this method provides some useful insight.

One of the great things about glmulti is that it facilitates model-averaged predictions – more on this when I delve into ensembles in part 2 of this series.

### Generalized linear model with stepwise feature selection

Finally, I used a generalized linear model with stepwise feature selection:

```r
# generalized linear model with stepwise selection
glmStepAICModel <- train(gu_data_filt[, -1], gu_data_filt[, 1],
                         method = "glmStepAIC", trControl = cntrl,
                         metric = "profit", maximize = TRUE)
print(glmStepAICModel$finalModel)

# Coefficients:
#  (Intercept)  atrRat10_100   atrRat10_20     bWidth100          ATR7
#      -11.166       -11.319         8.911        -5.287       913.226
#
# Degrees of Freedom: 766 Total (i.e. Null);  762 Residual
# Null Deviance:     2709000
# Residual Deviance: 2670000    AIC: 8444
```

The final model selected 4 of the 15 variables: the ratio of the 10- to 100-day ATR, the ratio of the 10- to 20-day ATR , the 100-day bollinger width and the 7-day ATR.

### Boruta: all relevant feature selection

Boruta finds relevant features by comparing the importance of the original features with the importance of random variables. Random variables are obtained by permuting the order of values of the original features. Boruta finds a minimum, mean and maximum value of the importance of these permuted variables, and then compares these to the original features. Any original feature that is found to be more relevant than the maximum random permutation is retained.

Boruta does not measure the absolute importance of individual features, rather it compares each feature to random permutations of the original variables and determines the relative importance. This theory very much resonates with me and I intuit that it will find application in weeding out uninformative features from noisy financial data. The idea of adding randomness to the sample and then comparing performance is analogous to the approach I use to benchmark my systems against a random trader with a similar trade distribution.
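The Boruta run itself can be sketched as follows, assuming the features and target sit in `gu_data_filt` with the target in column 1:

```r
library(Boruta)

set.seed(101)
boruta.results <- Boruta(gu_data_filt[, -1], gu_data_filt[, 1],
                         maxRuns = 1000)        # iterations of the importance test

print(boruta.results)
getSelectedAttributes(boruta.results)           # features confirmed as relevant
plot(boruta.results)                            # importance box plots vs shadow features
```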

The box plots in the figure below show the results obtained when I ran the Boruta algorithm for the 15 filtered variables for 1,000 iterations. The blue box plots show the permuted variables of minimum, mean and maximum importance, the green box plots indicate the original features that ranked higher than the maximum importance of the random permuted variables, and the variables represented by the red box plots are discarded.

The variables retained by the Boruta algorithm were: MMI, 5-period momentum, 5-period delta ATR ratio, delta MMI, 7-day ATR, 10-day velocity, the 10-to-20 period ATR ratio and the 10-to-100 period ATR ratio. The latter 4 were ranked significantly higher than the former 4, with the 10-to-100 day ATR ratio being the clear winner. These results are largely consistent with the results obtained through other methods, perhaps with the exception of the inclusion of the MMI and the delta MMI, however these variables were ranked only marginally better than the best random permuted variable.

Side note: The developers state that “Boruta” means “Slavic spirit of the forest.” As something of a slavophile myself, I did some googling and discovered that this description is something of a euphemism. Check out some of the items that pop up in a Google image search!

### Discussion of feature selection methods

It is important to note that any feature selection process naturally invites a degree of selection bias. For example, from a large set of uninformative variables, a small number may randomly correlate with the target variable. The selection algorithm would then rank these variables highly. The error would only be (potentially) uncovered through cross validation of the selection algorithm or by using an unseen test or validation set. Feature selection is difficult and can often make predictive performance worse, since it is easy to over-fit the feature selection criterion. It is all too easy to end up with a subset of attributes that works really well on one particular sample of data, but not necessarily on any other. There is a fantastic discussion of this at the Statistics Stack Exchange community that I have linked here because it is just so useful.

It is critical to take steps to minimize selection bias at every opportunity. The results of any feature selection process should be cross validated or tested on an unseen hold-out set. If feature selection on the hold-out set yields a vastly different set of predictors, something has obviously gone wrong – or the features are worthless. The approach I took in this post was to cross validate the results of each test that I performed, with the exception of the Maximal Information Criterion and glmulti approaches. I’ve also selected features based on data for one market only. If the selected features are not robust, this will show up with poor performance when I attempt to build predictive models for other markets using these features.
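One simple way to implement this check is to repeat the selection on a held-out portion of the data and compare the chosen predictors. This is a sketch, assuming caret is loaded and `cntrl` is the `rfeControl` object defined earlier; the 70/30 split preserves time order, which matters for market data:

```r
# split the data in time order rather than randomly
n <- nrow(gu_data_filt)
train.idx <- 1:floor(0.7 * n)

# run the same feature selection on each split independently
sel.train <- rfe(gu_data_filt[train.idx, -1], gu_data_filt[train.idx, 1],
                 sizes = 2:15, rfeControl = cntrl)
sel.hold  <- rfe(gu_data_filt[-train.idx, -1], gu_data_filt[-train.idx, 1],
                 sizes = 2:15, rfeControl = cntrl)

# features chosen on both splits are the more trustworthy ones
intersect(predictors(sel.train), predictors(sel.hold))
```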

I think that it is useful to apply a wide range of methods for feature selection, and then look for patterns and consistencies across these methods. This approach intuitively seems far more likely to yield useful information than drawing absolute conclusions from a single feature selection process. Applying this logic to the approach described above, we can conclude that the 5-day momentum, the ratio of the 10- to 100-day ATR, the ratio of the 10- to 20-day ATR, the 100-day bollinger width and the 7-day ATR are probably the most likely to yield useful information, since they continually show up in most of the feature selection methods that I investigated. Other variables that may be worth considering include the 5-day delta price variance ratio, the delta MMI and the 10-day velocity.

In part 2 of this article, I’ll describe how I built and combined various models based on these variables.

## Principal Components Analysis

An alternative to feature selection is Principal Components Analysis (PCA), which attempts to reduce the dimensionality of the data while retaining the majority of the information. PCA is a linear technique: it transforms the data by linearly projecting it onto a lower dimension space while preserving as much of its variation as possible. Another way of saying this is that PCA attempts to transform the data so as to express it as a sum of uncorrelated components.

Again, note that PCA is limited to a linear transformation of the data, and there is no guarantee that a non-linear transformation wouldn't be better suited. Another significant assumption when using PCA is that the principal components of future data will look like those of the training data. It's also possible that the smallest component, describing the least variance, is the only one carrying information about the target variable; it would likely be lost when only the major variance contributors are retained.
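As a quick sketch of what this looks like in practice (assuming `gu_data_filt` holds the target in its first column, as in the R code used in this post), base R's `prcomp` reports how much variance each component captures:

```r
# Center and scale so features contribute on a comparable footing
pca <- prcomp(gu_data_filt[, -1], center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component.
# Note the ranking is by variance explained, not by any relationship
# to the target variable - hence the caveat above.
summary(pca)$importance["Proportion of Variance", ]
```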

To investigate the effects of PCA on model performance, I cross validated 2 random forest models, the first using the principal components of the 15 variables, and the other using all 15 variables in their raw form. I chose the random forest model since it includes feature selection and thus may reveal some insights about how PCA stacks up in relation to other feature selection methods. For both models, I performed 5-fold cross validation repeated 200 times for a total of 1,000 surrogate models. I also used the same random seed so that the cross validation folds would be the same across both models, allowing direct comparison.

```r
set.seed(101)
pca.model <- train(gu_data_filt[, -1], gu_data_filt[, 1], method = 'rf',
                   preProcess = c('pca'), trControl = cntrl)
set.seed(101)
raw.model <- train(gu_data_filt[, -1], gu_data_filt[, 1], method = 'rf',
                   trControl = cntrl)
```

In order to infer the difference in model performance, I collected the results from each resampling iteration of both final models and compared their distributions via a pair of box and whisker plots:

```r
resamp.results <- resamples(list(PCA = pca.model, RAW = raw.model))
trellis.par.set(theme = col.whitebg())
bwplot(resamp.results, layout = c(1, 1))
```

The model built on the raw data outperforms the model built on the data's principal components in this case. The resampled mean profit is higher and the distribution is shifted in the positive direction. Sadly however, both distributions look only slightly better than random, and both are wide.

## A simple trading system

I will go into more detail about building a practical trading system using machine learning in the next post, but the following demonstrates a simple trading system based on some of the information gained from the analysis presented above. The system is based on three of the indicators that the feature selection analysis identified as being predictive of the target variable: 10-day velocity, 3-day momentum and 7-day ATR. The data used was the GBP/USD exchange rate's daily closing price (at midnight GMT) from 2009-2015. I trained a generalized boosted regression model with the gbm package in R, using these indicators as the independent variables to predict the next-day return normalized to the recent ATR.
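A minimal sketch of this kind of model follows. The column names (`velocity10`, `mom3`, `atr7`) and the hyperparameters are illustrative assumptions rather than the exact values used, and this sketch fits the standard squared-error objective rather than a custom profit-based one:

```r
library(gbm)

# Sketch only: column names and hyperparameters are illustrative.
# train.data is assumed to hold the three indicators plus the
# ATR-normalized next-day return as 'target'.
gbm.fit <- gbm(target ~ velocity10 + mom3 + atr7,
               data = train.data,
               distribution = "gaussian",   # squared-error loss
               n.trees = 1000,
               interaction.depth = 3,
               shrinkage = 0.01,
               cv.folds = 5)

# Predict using the CV-optimal number of trees
best.iter <- gbm.perf(gbm.fit, method = "cv")
preds <- predict(gbm.fit, test.data, n.trees = best.iter)
```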

Firstly, I investigated the cross-validated performance on the entire data set and compared this performance with the distribution of performances from a system based on random trading. I split the data set into 5 random segments, trained a model (with an objective function that maximized total profit) on 4 of the 5 segments, and then tested it on the 5th. The result of the test on the 5th, out-of-sample segment was saved as the performance statistic for that model run. I repeated this process 5 times such that each segment was held out as the test set once. I then repeated the whole procedure 1,000 times, ending up with 5,000 instances of the total-profit performance statistic. I compared this distribution to the distribution obtained by taking N/5 random trades 5,000 times, where N is the number of next-day return observations in the entire data set. The distributions of the results are shown below:

There is almost no difference in the distributions of the random trading strategy and the strategy created with the carefully selected features! This doesn’t look good, and suggests that for all my effort, the features identified as being predictive turned out to be little better than random. Or did they?
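The random-trading benchmark can be sketched as follows; `returns` is assumed to be the vector of next-day return observations, and the random long/short assignment is my assumption about how a "random trade" is defined:

```r
# Benchmark distribution: 5,000 portfolios of N/5 random trades,
# each trade a randomly sampled return with a random direction.
# 'returns' is assumed to hold the next-day return observations.
n.trades <- floor(length(returns) / 5)
random.profits <- replicate(5000, {
  trades <- sample(returns, n.trades)
  sum(trades * sample(c(-1, 1), n.trades, replace = TRUE))
})
hist(random.profits, breaks = 50,
     main = "Random trading benchmark", xlab = "Total profit")
```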

The returns series of most financial instruments consists of a relatively large number of small positive and small negative values and a smaller number of large positive and large negative values. I assert that the values whose magnitude is smaller are more random in nature than the values whose magnitude is large. On any given day, all things being equal, a small negative return could turn out to be a small positive return by the time the close rolls around, or vice versa, as a result of any number of random occurrences related to the fundamentals of the exchange rate. These same random occurrences are less likely to push a large positive return into negative territory and vice versa, purely on account of the size of the price swings involved. Following this logic, I hypothesize that my model is likely to be more accurate in its extreme predictions than in its ‘normal’ range.

We can test this hypothesis on the simple trading strategy described above by entering positions only when the model predicts a return that is large in magnitude. I divided the data into training and testing sets using a 75:25 split. I trained the model on the 75% training data set and tested it on the 25% testing data set using increasing prediction thresholds as my entry criteria. Here are the results, along with the buy and hold return from the testing data set:

Now the strategy is looking a lot more enticing. We can see that increasing the prediction threshold for entering a trade significantly improves the out of sample performance compared with the buy and hold strategy and the strategy based on raw model predictions (corresponding to a prediction threshold of zero). The prediction threshold can be adjusted depending on the trader’s appetite for high returns (threshold = 3) as opposed to minimal drawdown (threshold = 20).
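The thresholding logic described above can be sketched in a few lines; `preds` and `actual` are assumed to be the model's test-set predictions and the realized returns:

```r
# Trade only when the predicted magnitude clears the threshold;
# trade direction follows the sign of the prediction.
threshold.profit <- function(preds, actual, threshold) {
  take <- abs(preds) > threshold
  sum(sign(preds[take]) * actual[take])
}

# Compare total profit across increasing entry thresholds
sapply(c(0, 3, 5, 10, 20),
       function(th) threshold.profit(preds, actual, th))
```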

## Conclusions

- The MIC analysis implied that variables based on only 3-5 days of price history are the most useful in predicting the next day's normalized return. 3-day momentum was a clear winner, and delta Bollinger width, 10-day velocity and the 10-to-100-day ATR ratio also scored relatively well.
- The RFE analysis indicated that it may be prudent to focus on variables that measure volatility or recent changes in volatility, for lookback periods of up to 10 days.
- An exhaustive search of all possible generalized linear models that considered main interactions using glmulti implied that the 7-day ATR and Bollinger width variables are most predictive. The ATR ratios are also regularly selected in the top models.
- Stepwise feature selection using a generalized linear model selected the 10-to-100-day and 10-to-20-day ATR ratios, the 100-day Bollinger width and the 7-day ATR.
- Transforming the variables using PCA reduced the performance of a random forest model relative to using the raw variables.

The same features seem to be selected over and over again using the different methods. Is this just a fluke, or has this long and exhaustive data mining exercise revealed something of practical use to a trader? In part 2 of this series, I will investigate the performance of various machine learning algorithms based on this feature selection exercise. I’ll compare different algorithms and then investigate combining their predictions using ensemble methods with the objective of creating a useful trading system.

## References

Aronson, D. 2006, *Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals*.

Aronson, D. and Masters, T. 2014, *Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments: Developing Predictive-Model-Based Trading Systems Using TSSB*.

## Appendix – Code for Generating the Training File

In response to a request from a reader, here is some code that I used to generate the training data used in this analysis. See my reply to Alon below, reproduced here:

I created this code a while ago and have since added to it, so there are a bunch of other indicators and transforms that will be output to a csv file. You can pick out the ones that I’ve used in this post, or experiment with the others, as you like. The output file is called “variables.csv” and this is what I read into my training data object “gu_data” that you see in the R code used in this post.

Also, you will need to manually exclude some rows from the output file where indicator values cannot be calculated because of an insufficient lookback period.
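If you load the file into R, a sketch like the following handles that step, assuming the incomplete warm-up rows show up as NA values:

```r
# Load the exported indicator file and drop the warm-up rows whose
# indicators could not be computed over a full lookback window.
gu_data <- read.csv("variables.csv", stringsAsFactors = FALSE)
gu_data <- na.omit(gu_data)
```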

This code is written in Lite-C and is compatible with the Zorro platform, which is an awesome tool for doing this sort of analysis. Some exciting things you can do to extend my research by tweaking this code include:

- Try other bar periods (I used daily);
- If using daily bars, try sampling at a more sensible time (i.e. not midnight, as I have done here);
- Obtain the same data for any market for which you have data by changing the asset() call towards the start of the script;
- Use Zorro’s recently added pre-processing tools for better scaling/compressing/normalizing of the data than I have used here (these tools were added to the platform after I wrote this post);
- Add other indicators or data that you are interested in;
- Experiment with longer prediction horizons. By default, I used a one-day prediction horizon. Could predicting further into the future give better results?
- Try using a classification approach by changing the target variable from a return (a numerical value) to a market direction (a binary value);
- Plenty more that I haven’t even thought of…

```c
// Click [Test] for exporting price data to a .csv file in the Data folder
// The records are stored in the format: time, open, high, low, close ...
// f.i. "31.12.12 00:00, 1.32205, 1.32341, 1.32157, 1.32278 ..."

//function for Z-score transformation
var zscore(vars data, int period)
{
	return (data[0] - SMA(data, period))/StdDev(data, period);
}

//function for calculating the n-period IBS
var nperIBS(vars high, vars low, vars close, int period)
{
	int i;
	var ibs;
	for (i = 0; i < period; i++)
	{
		ibs += (close[i] - low[i])/(high[i] - low[i]);
	}
	return (ibs/period);
}

function run()
{
	StartDate = 2009;
	EndDate = 2015;
	LookBack = 400;
	asset("GBP/USD");
	int period = 50; //lookback for calculating normalized/standardized values

	vars Price = series(price());
	vars High = series(priceHigh());
	vars Low = series(priceLow());
	vars Close = series(priceClose());
	vars logPrice = series(log(Price[0]));
	vars logOpen = series(log(priceOpen()));
	vars logHigh = series(log(High[0]));
	vars logLow = series(log(Low[0]));
	vars logClose = series(log(Close[0]));
	vars logDeltaClose = series(logClose[0] - logClose[1]);

	// ATR based on log prices
	vars logATR = series(ATR(logOpen, logHigh, logLow, logClose, 7));

	//Trend variable
	vars trendSlow = series(LowPass(Price, 100));
	vars trendFast = series(LowPass(Price, 20));
	vars trend = series(trendFast[0] - trendSlow[0]);
	vars trendNorm = series(Normalize(trend, period));
	vars trendZscore = series(zscore(trend, period));

	//Linear regression of log price / ATR
	vars velocity3 = series(TSF(logPrice, 3)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars velocity5 = series(TSF(logPrice, 5)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars velocity10 = series(TSF(logPrice, 10)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars velocity3N = series(Normalize(velocity3, period));
	vars velocity5N = series(Normalize(velocity5, period));
	vars velocity10N = series(Normalize(velocity10, period));

	//Quadratic regression of log price / ATR
	vars accel3 = series(polyfit(0, logPrice, 3, 2, 1)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars accel5 = series(polyfit(0, logPrice, 5, 2, 1)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars accel10 = series(polyfit(0, logPrice, 10, 2, 1)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars accel3N = series(Normalize(accel3, period));
	vars accel5N = series(Normalize(accel5, period));
	vars accel10N = series(Normalize(accel10, period));

	//Cubic regression of log price / ATR
	vars cubic10 = series(polyfit(0, logPrice, 10, 3, 1)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars cubic25 = series(polyfit(0, logPrice, 25, 3, 1)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars cubic50 = series(polyfit(0, logPrice, 50, 3, 1)/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars cubic10N = series(Normalize(cubic10, period));
	vars cubic25N = series(Normalize(cubic25, period));
	vars cubic50N = series(Normalize(cubic50, period));

	//Price momentum
	vars mom3 = series((logDeltaClose[0] - logDeltaClose[3])/sqrt(Moment(logDeltaClose, 100, 2)));
	vars mom5 = series((logDeltaClose[0] - logDeltaClose[5])/sqrt(Moment(logDeltaClose, 100, 2)));
	vars mom10 = series((logDeltaClose[0] - logDeltaClose[10])/sqrt(Moment(logDeltaClose, 100, 2)));
	vars mom3N = series(Normalize(mom3, period));
	vars mom5N = series(Normalize(mom5, period));
	vars mom10N = series(Normalize(mom10, period));

	//Close minus MA
	vars closeDevMA3 = series(log(Close[0]/SMA(Close, 3))/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars closeDevMA5 = series(log(Close[0]/SMA(Close, 5))/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars closeDevMA10 = series(log(Close[0]/SMA(Close, 10))/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars closeDevMA3N = series(Normalize(closeDevMA3, period));
	vars closeDevMA5N = series(Normalize(closeDevMA5, period));
	vars closeDevMA10N = series(Normalize(closeDevMA10, period));

	//Linear forecast dev
	vars linDev3 = series((logPrice[0] - velocity3[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars linDev5 = series((logPrice[0] - velocity5[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars linDev10 = series((logPrice[0] - velocity10[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars linDev3N = series(Normalize(linDev3, period));
	vars linDev5N = series(Normalize(linDev5, period));
	vars linDev10N = series(Normalize(linDev10, period));

	// Quad forecast dev
	vars quadDev3 = series((logPrice[0] - accel3[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars quadDev5 = series((logPrice[0] - accel5[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars quadDev10 = series((logPrice[0] - accel10[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars quadDev3N = series(Normalize(quadDev3, period));
	vars quadDev5N = series(Normalize(quadDev5, period));
	vars quadDev10N = series(Normalize(quadDev10, period));

	// cubic forecast dev
	vars cubicDev3 = series((logPrice[0] - accel3[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars cubicDev5 = series((logPrice[0] - accel5[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars cubicDev10 = series((logPrice[0] - accel10[0])/ATR(logOpen, logHigh, logLow, logClose, 7));
	vars cubicDev3N = series(Normalize(cubicDev3, period));
	vars cubicDev5N = series(Normalize(cubicDev5, period));
	vars cubicDev10N = series(Normalize(cubicDev10, period));

	// abs price change oscillator
	vars apc3 = series((Moment(logDeltaClose, 3, 1) - Moment(logDeltaClose, 100, 1))/ATR(logOpen, logHigh, logLow, logClose, 100));
	vars apc5 = series((Moment(logDeltaClose, 5, 1) - Moment(logDeltaClose, 100, 1))/ATR(logOpen, logHigh, logLow, logClose, 100));
	vars apc10 = series((Moment(logDeltaClose, 10, 1) - Moment(logDeltaClose, 100, 1))/ATR(logOpen, logHigh, logLow, logClose, 100));
	vars apc3N = series(Normalize(apc3, period));
	vars apc5N = series(Normalize(apc5, period));
	vars apc10N = series(Normalize(apc10, period));

	//Price variance ratio
	vars priceVarRatSlow = series(Moment(logPrice, 20, 2)/Moment(logPrice, 100, 2));
	vars priceVRSNorm = series(Normalize(priceVarRatSlow, period));
	vars priceVRSZscore = series(zscore(priceVarRatSlow, period));
	vars priceVarRatFast = series(Moment(logPrice, 10, 2)/Moment(logPrice, 20, 2));
	vars priceVRFNorm = series(Normalize(priceVarRatFast, period));
	vars priceVRFZscore = series(zscore(priceVarRatFast, period));
	vars priceVarRat3 = series(Moment(logPrice, 3, 2)/Moment(logPrice, 100, 2));
	vars priceVarRat5 = series(Moment(logPrice, 5, 2)/Moment(logPrice, 100, 2));
	vars priceVarRat10 = series(Moment(logPrice, 10, 2)/Moment(logPrice, 100, 2));
	vars priceVarRat3N = series(Normalize(priceVarRat3, period));
	vars priceVarRat5N = series(Normalize(priceVarRat5, period));
	vars priceVarRat10N = series(Normalize(priceVarRat10, period));

	// delta price variance ratio
	vars deltaPVR3 = series(priceVarRat3[0] - priceVarRat3[3]);
	vars deltaPVR5 = series(priceVarRat5[0] - priceVarRat5[5]);
	vars deltaPVR10 = series(priceVarRat10[0] - priceVarRat10[10]);
	vars deltaPVR3N = series(Normalize(deltaPVR3, period));
	vars deltaPVR5N = series(Normalize(deltaPVR5, period));
	vars deltaPVR10N = series(Normalize(deltaPVR10, period));

	//ATR ratio
	vars atrRatSlow = series(ATR(20)/ATR(100));
	vars atrRatSlowNorm = series(Normalize(atrRatSlow, period));
	vars atrRatSlowZscore = series(zscore(atrRatSlow, period));
	vars atrRatFast = series(ATR(10)/ATR(20));
	vars atrRatFastNorm = series(Normalize(atrRatFast, period));
	vars atrRatFastZscore = series(zscore(atrRatFast, period));
	vars atrRat3 = series(ATR(3)/ATR(100));
	vars atrRat5 = series(ATR(5)/ATR(100));
	vars atrRat10 = series(ATR(10)/ATR(100));
	vars atrRat3N = series(Normalize(atrRat3, period));
	vars atrRat5N = series(Normalize(atrRat5, period));
	vars atrRat10N = series(Normalize(atrRat10, period));

	// delta ATR ratio
	vars deltaATRrat3 = series(atrRat3[0] - atrRat3[3]);
	vars deltaATRrat5 = series(atrRat5[0] - atrRat5[5]);
	vars deltaATRrat10 = series(atrRat10[0] - atrRat10[10]);
	vars deltaATRrat3N = series(Normalize(deltaATRrat3, period));
	vars deltaATRrat5N = series(Normalize(deltaATRrat5, period));
	vars deltaATRrat10N = series(Normalize(deltaATRrat10, period));

	//Bollinger width
	vars bWidthSlow = series(log(sqrt(Moment(Price, 100, 2))/Moment(Price, 100, 1)));
	vars bWidthSlowNorm = series(Normalize(bWidthSlow, period));
	vars bWidthSlowZscore = series(zscore(bWidthSlow, period));
	vars bWidthFast = series(log(sqrt(Moment(Price, 20, 2))/Moment(Price, 20, 1)));
	vars bWidthFastNorm = series(Normalize(bWidthFast, period));
	vars bWidthFastZscore = series(zscore(bWidthFast, period));
	vars bWdith3 = series(log(sqrt(Moment(Price, 3, 2))/Moment(Price, 3, 1)));
	vars bWdith5 = series(log(sqrt(Moment(Price, 5, 2))/Moment(Price, 5, 1)));
	vars bWdith10 = series(log(sqrt(Moment(Price, 10, 2))/Moment(Price, 10, 1)));
	vars bWdith3N = series(Normalize(bWdith3, period));
	vars bWdith5N = series(Normalize(bWdith5, period));
	vars bWdith10N = series(Normalize(bWdith10, period));

	// delta bollinger width
	vars deltabWdith3 = series(bWdith3[0] - bWdith3[3]);
	vars deltabWdith5 = series(bWdith5[0] - bWdith5[5]);
	vars deltabWdith10 = series(bWdith10[0] - bWdith10[10]);
	vars deltabWidth3N = series(Normalize(deltabWdith3, period));
	vars deltabWidth5N = series(Normalize(deltabWdith5, period));
	vars deltabWidth10N = series(Normalize(deltabWdith10, period));

	//Close to close
	vars ctoC = series(log((priceClose(0)/priceClose(1))));
	vars ctoCNorm = series(Normalize(ctoC, period));
	vars ctoCZscore = series(zscore(ctoC, period));

	//Distance from 10-day high
	vars highDist = series(HH(10,0)/priceClose(0));
	vars highDistNorm = series(Normalize(highDist, period));
	vars highDistZscore = series(zscore(highDist, period));

	//Distance from 10-day low
	vars lowDist = series(LL(10,0)/priceClose(0));
	vars lowDistNorm = series(Normalize(lowDist, period));
	vars lowDistZscore = series(zscore(lowDist, period));

	//Close to long-term filtered price
	vars trendDevSlow = series(log(priceClose(0)/LowPass(Price, 100)));
	vars trendDevSlowNorm = series(Normalize(trendDevSlow, period));
	vars trendDevSlowZscore = series(zscore(trendDevSlow, period));
	vars trendDevFast = series(log(priceClose(0)/LowPass(Price, 20)));
	vars trendDevFastNorm = series(Normalize(trendDevFast, period));
	vars trendDevFastZscore = series(zscore(trendDevFast, period));

	//ATR
	vars ATR3 = series(ATR(3));
	vars ATR5 = series(ATR(5));
	vars ATR10 = series(ATR(10));
	vars ATRWeek = series(ATR(7));
	vars ATRFast = series(ATR(25));
	vars ATRMid = series(ATR(100));
	vars ATRSlow = series(ATR(250));
	vars ATR3N = series(Normalize(ATR3, period));
	vars ATR5N = series(Normalize(ATR5, period));
	vars ATR10N = series(Normalize(ATR10, period));
	vars ATRFastN = series(Normalize(ATRFast, period));

	//Market Meanness Index
	//vars MMISlow = series(MMI(Price, 500));
	vars MMIMod = series(MMI(Price, 200)/100);
	vars MMIModSmooth = series(Smooth(MMIMod, 10));
	vars MMIFast = series(MMI(Price, 100)/100);
	vars MMIFastSmooth = series(Smooth(MMIFast, 10));
	vars MMIFaster = series(MMI(Price, 50)/100);
	vars MMIFasterSmooth = series(Smooth(MMIFaster, 10));
	vars MMIFastest = series(MMI(Price, 20)/100);
	vars MMIFastestSmooth = series(Smooth(MMIFastest, 10));
	vars MMIFasterN = series(Normalize(MMIFaster, period));
	vars MMIFastestN = series(Normalize(MMIFastest, period));

	//MMI deviation
	vars deltaMMIFastest3 = series(MMIFastest[0] - MMIFastest[3]);
	vars deltaMMIFastest5 = series(MMIFastest[0] - MMIFastest[5]);
	vars deltaMMIFastest10 = series(MMIFastest[0] - MMIFastest[10]);
	vars deltaMMIFastest3N = series(Normalize(deltaMMIFastest3, period));
	vars deltaMMIFastest5N = series(Normalize(deltaMMIFastest5, period));
	vars deltaMMIFastest10N = series(Normalize(deltaMMIFastest10, period));

	//Hurst exponent
	vars HurstSlow = series(Hurst(Price, 200));
	vars HurstMod = series(Hurst(Price, 100));
	vars HurstFast = series(Hurst(Price, 50));
	vars HurstFaster = series(Hurst(Price, 20)); //20 is the minimum time period for the Hurst exponent

	//Time series forecast
	//vars tsfSlow = series(Normalize(series(TSF(Price, 200) - priceClose(0)), 50));
	//vars tsfMod = series(Normalize(series(TSF(Price, 100) - priceClose(0)), 50));
	//vars tsfFast = series(Normalize(series(TSF(Price, 50) - priceClose(0)), 50));
	//vars tsfFaster = series(Normalize(series(TSF(Price, 20) - priceClose(0)), 50));
	//vars tsfFastest = series(Normalize(series(TSF(Price, 10) - priceClose(0)), 50));

	//n-period IBS
	vars ibsOne = series(nperIBS(High, Low, Close, 1));
	vars ibsOneN = series(Normalize(ibsOne, period));

	//target
	vars target = series(100*(priceClose(0)-priceClose(1))/ATR(100)); //price change normalized to ATR

	char line[2000];   // generous buffers - the header string alone approaches 1,000 characters
	char header[2000];

	//get the values for the previous day - using the values from the day the trade is entered introduces look-ahead bias
	sprintf(line, "%02i.%02i.%02i, %2.3f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f, %.5f\n",
		day(), month(), year(), target[0],
		priceHigh(1), priceLow(1), priceClose(1),
		velocity3N[1], velocity5N[1], velocity10N[1],
		accel3N[1], accel5N[1], accel10N[1],
		cubic10N[1], cubic25N[1], cubic50N[1],
		mom3N[1], mom5N[1], mom10N[1],
		closeDevMA3N[1], closeDevMA5N[1], closeDevMA10N[1],
		linDev3N[1], linDev5N[1], linDev10N[1],
		quadDev3N[1], quadDev5N[1], quadDev10N[1],
		cubicDev3N[1], cubicDev5N[1], cubicDev10N[1],
		apc3N[1], apc5N[1], apc10N[1],
		priceVarRat3N[1], priceVarRat5N[1], priceVarRat10N[1],
		deltaPVR3N[1], deltaPVR5N[1], deltaPVR10N[1],
		atrRat3N[1], atrRat5N[1], atrRat10N[1],
		deltaATRrat3N[1], deltaATRrat5N[1], deltaATRrat10N[1],
		bWdith3N[1], bWdith5N[1], bWdith10N[1],
		deltabWidth3N[1], deltabWidth5N[1], deltabWidth10N[1],
		MMIFasterN[1], MMIFastestN[1],
		deltaMMIFastest3N[1], deltaMMIFastest5N[1], deltaMMIFastest10N[1],
		HurstMod[1], HurstFast[1], HurstFaster[1],
		ibsOneN[1], trendNorm[1], ctoCNorm[1],
		highDistNorm[1], lowDistNorm[1],
		trendDevSlowNorm[1], trendDevFastNorm[1],
		priceVRSNorm[1], priceVRFNorm[1],
		atrRatSlowNorm[1], atrRatFastNorm[1],
		bWidthSlowNorm[1], bWidthFastNorm[1],
		ATRWeek[1], ATRFast[1], ATRMid[1], ATRSlow[1]);

	if(is(INITRUN))
	{
		file_delete("Data\\variables.csv");
		sprintf(header, "DD/MM/YYYY, target, H, L, C, velocity3N, velocity5N, velocity10N, accel3N, accel5N, accel10N, cubic10N, cubic25N, cubic50N, mom3N, mom5N, mom10N, closeDevMA3N, closeDevMA5N, closeDevMA10N, linDev3N, linDev5N, linDev10N, quadDev3N, quadDev5N, quadDev10N, cubicDev3N, cubicDev5N, cubicDev10N, apc3N, apc5N, apc10N, priceVarRat3N, priceVarRat5N, priceVarRat10N, deltaPVR3N, deltaPVR5N, deltaPVR10N, atrRat3N, atrRat5N, atrRat10N, deltaATRrat3N, deltaATRrat5N, deltaATRrat10N, bWdith3N, bWdith5N, bWdith10N, deltabWidth3N, deltabWidth5N, deltabWidth10N, MMIFasterN, MMIFastestN, deltaMMIFastest3N, deltaMMIFastest5N, deltaMMIFastest10N, HurstMod, HurstFast, HurstFaster, ibsOneN, trendNorm, ctoCNorm, highDistNorm, lowDistNorm, trendDevSlowNorm, trendDevFastNorm, priceVRSNorm, priceVRFNorm, atrRatSlowNorm, atrRatFastNorm, bWidthSlowNorm, bWidthFastNorm, ATRWeek, ATRFast, ATRMid, ATRSlow\n");
		file_append("Data\\variables.csv", header);
	}
	else
		file_append("Data\\variables.csv", line);
}
```

## dcast

March 4, 2016

Great post! I’ll be sure to follow the next parts of the series. I’m just starting in the world of algo trading and based on what you wrote SSML looks like a good starting point. Would you recommend any other book?

## Robot Master

March 4, 2016

Thanks! Glad you liked it.

My reading list for starting out with algo trading includes both of Ernie Chan’s books and Aronson’s first book, ‘Evidence Based Technical Analysis’. Jaekle and Tomasini is a handy reference too. If you don’t already have the background, add an introductory statistics/probability text (sorry, can’t recommend one off the top of my head). If you want to delve into time series analysis, Ruey Tsay’s book on the topic is heavy going, but worth persisting with. In terms of machine learning, Lantz’s ‘Machine Learning with R’ will get you started. Depending on your stats background, Hastie, Tibshirani et al. released two works – ‘The Elements of Statistical Learning’ (more advanced) and ‘An Introduction to Statistical Learning’ – these are the logical next steps for machine learning. Finally, I find the caret package in R to be immensely useful, and the author of the package also wrote a book called ‘Applied Predictive Modelling’ – definitely worth a read if you are interested in the machine learning side of things.

## Jacques Joubert

April 14, 2016

Brilliant post Kris, thank you.

This has really helped me to get started in the area of using ML in trading. Was even part of the recommended reading I was given.

Will email you about adding your recommended books here to the Quantocracy book list.

I am really looking forward to the follow up post.

God speed

Jacques Joubert

## Robot Master

April 16, 2016

Thanks for that feedback, Jacques. I am really glad you found it useful.

Best

Kris

## Nick

March 5, 2016

Excellent work!

I love Aronson’s books also. Did you implement a version of the MCPT ?

Best, Nick

## Robot Master

March 5, 2016

Thanks Nick! I haven’t yet implemented MCPT in R, but I think the ‘coin’ package could be used. Do you know of any others? MCPT would be a good inclusion as an update to this post, or perhaps in the next.

## Nick

March 5, 2016

I’m not familiar with ‘coin’ but thanks for mentioning that … I’ve implemented my own version of the MCPT.

BTW, you mention that you use Zorro. Have you used their R bridge for strategy development or live trading ?

Best, Nick

## Robot Master

March 5, 2016

Hey Nick,

Yes, I’ve used Zorro’s R bridge extensively for strategy development and am using it for live trading right now. It’s an integral part of my workflow and I am much more efficient during the development stage compared with using either Zorro or R separately. It’s a very useful piece of kit to have in your toolbox.

I’ll do a post in the near future on exactly how this works and why it is efficient.

## Dekalog

March 5, 2016

Very interesting work. You might also be interested in the “Boruta” all-relevant feature selection method, which can be found at https://m2.icm.edu.pl/boruta/

I look forward to your next few posts.

## Robot Master

March 5, 2016

Thanks for pointing me to Boruta, Dekalog. I’ll update the post to include this method. I see that it is implemented as an R package – how convenient!

## Robot Master

March 11, 2016

Dekalog,

I’ve updated the post with a paragraph on Boruta. Thanks for pointing me to this package – very easy to use and wonderfully intuitive logic behind it.


## IT

March 10, 2016

Hi and thanks for sharing your work. One suggestion and request I would make would be to plot equity curves of some of the more promising candidates against, say, a simple buy-and-hold model. For example, if your in-sample training set found ATR and BB (with some parameter set) useful and you created a model built around those features, you could plot both the in- and out-of-sample equity curves of the results (you also need to determine how you want to translate those predictions to actions, e.g. Buy/Sell/Flat). I’ve found the equity curve can reveal a lot about the results that may not be evident from looking at standalone metrics. One test may show promising results around a certain set of variables, but depending on how the individual results are finally aggregated, a different set of features may have performed much better.

## Robot Master

March 10, 2016

Hi IT, thanks for that comment. I take your point – equity curves certainly reveal a lot more useful information than the result of a feature selection test. And after all, this is a blog about trading. I had originally intended to leave all of the strategy development stuff until the next post where I will write about how I transform a model output into actionable intelligence, combining predictions from multiple learners and all that. But you’re right – an illustrative trading strategy and equity curve would fit nicely here. I’ll update the post accordingly.

Thanks again for your feedback – always much appreciated.

## Deshawn Janas

March 13, 2016

That will be the common answer you will come across for every machine learning and knowledge discovery problem you face. There is no best learning system, classification system, or statistical method. You need to know your data set well enough to realize where one approach might be better than others.

## Robot Master

March 13, 2016

Couldn't agree more, Deshawn. There is an important place for subject matter knowledge in nearly any machine learning task.

## pcz

March 22, 2016

Hello and thank you for the great post. I discovered TSSB thanks to it and now I'm playing with the application. I see you also use ATR normalized return. I tried to compute it in R and compare the values to TSSB outputs but they differ. For example on D1 GBP/USD data from 2009-11-11 to 2016-03-01 mean value of my computed ATR 250 normalized return is -0.01237755 while TSSB values give mean of -0.01529271. Do you experience the same (which would probably mean that the implementations differ)? Or did I make a mistake?

## Robot Master

March 22, 2016

Hello and thanks for commenting. Are you sampling your data at the same time in both applications? Since foreign exchange doesn't have an official closing time, make sure that the one you use is consistent across both applications. Also, I believe TSSB calculates this variable using open prices, which in theory shouldn't differ from the previous day's closing price. Unless of course there is a price gap between one day's close and the next's open, which can happen due to weekends and holidays. Can those ideas account for the differences you are seeing?

I have not actually run anything in TSSB myself. I like using R too much!
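For anyone wanting to check their own numbers against TSSB's, here is a minimal R sketch of an ATR-normalised one-day return. The column names and the `ohlc` data frame are illustrative, and it uses TTR's `ATR()` (Wilder smoothing); TSSB's internal calculation may differ, which is exactly the kind of implementation gap discussed above.

```r
# Sketch: ATR-normalised one-day return. Assumes daily OHLC data sampled
# at a consistent time; column names are illustrative.
library(TTR)

atr_norm_return <- function(ohlc, n = 250) {
  # ATR() takes a High-Low-Close matrix and returns tr, atr, trueHigh, trueLow
  atr <- ATR(ohlc[, c("High", "Low", "Close")], n = n)[, "atr"]
  raw <- c(NA, diff(ohlc$Close))  # raw one-day price change
  raw / atr                       # scale by recent volatility
}
```

Comparing the tail of this vector against TSSB's output (watch for trailing NAs, as pcz discovered below) is a quick sanity check.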

## pcz

April 11, 2016

Hey, and sorry for the late reply – I was without my laptop for a while.

I completely understand, it's easy to fall in love with R! :) I also wanted to rewrite the stuff in R, but after reading the whole book I decided to give TSSB a go because I saw how much work had already been put into it.

Or maybe I will combine the approaches somehow. I noticed there is an automation framework for TSSB written in Python which could be modified to handle both R and TSSB (unfortunately it's a Windows-only solution).

About the problem – it turned out it was only because at the end of the TSSB returns vector there was a bunch of NA values, while my computed one had values all the way through. I should have checked the end of the data frame more carefully; a stupid mistake.

## Jeff

December 22, 2016

Hi pcz,

Where is this "automation framework for TSSB"? Did I miss something?

Jeff

Pingback: Build Better Strategies! Part 4: Machine Learning – The Financial Hacker

Pingback: Machine Learning Section Added to Our Library with @RobotWealth | Quantocracy


## Erk

May 30, 2016

Hi Robot Master, great post! I was wondering if you have thought about the potential implications of using ATR-normalized returns in your target, especially for feature selection. What I mean is: by normalizing with ATR you are introducing a strong recent-volatility component into your target variable, and as any quant would appreciate, volatility clusters heavily, hence recent volatility measures popping up in your best features across the board. (Maybe not only the time you redefined your target metric as profit, where you only observed 2 variables.) And of course you can estimate volatility pretty decently using some recent volatility information, but when it comes to making a directional bet you won't have much of an edge. What do you think? I only had a quick read of the post, so apologies if I am missing something. Keep up the good work.

## Robot Master

June 12, 2016

Hi Erk

Thanks for the comment. You raise a very good point – by normalizing the target variable to the current ATR, a strong volatility component is introduced. This was actually a recommendation that Aronson makes in the book on which the post is based, but may not be the best solution, particularly in relation to feature selection. I’ve done some recent work using the raw first difference of the time series and found that the feature selection algorithms tend towards slightly different features under this scenario.

This suggests to me that there may be merit in using a differenced time series as the target variable during the feature selection phase, and then during the model building phase normalizing the target variable to the ATR so as to ensure volatility is included at some level in the trading model.

Thanks for commenting, great suggestion!

Pingback: Build Better Strategies! Part 4: Machine Learning | Trade Signal Machine Hub

## Mike Scirocco

June 13, 2016

Thank you for a very thorough write-up of your very interesting work. I'm looking forward to the next installment. The recommended reading list is appreciated as well. (I hope you will consider making a separate blog entry of Recommended Reading so the books are easy to find in one entry.)

## Robot Master

June 14, 2016

Hi Mike

Thanks for commenting. Great idea regarding the recommended reading list! I will make that my next post.

## Sung

June 13, 2016

Thanks so much for posting this machine learning content!

I'm trying to follow along and apply your post using R, but I'm having trouble building the data table using the variables in "Some candidate features". I can see the Zorro code in the second post that creates the selected variables, but since I've never used Zorro, I don't understand how you made them.

Could you explain how you created the variables in "Some candidate features" using R? I know this is time-consuming, but I would really appreciate it.

## Robot Master

June 14, 2016

Hi Sung

I created those variables using Zorro and then exported them for use in R. You can reproduce them using the descriptions in the section that you are referring to, whether that be in R, C, or some other language. If you get really stuck, shoot me an email using the contact form and I’ll send you an R implementation.

## sung

June 20, 2016

Hello, I'm Sung.

I'm wondering what data, lev and model inputs should be passed in to run the absretSummary function:

```r
absretSummary <- function(data, lev = NULL, model = NULL) {
  # NOTE: the threshold comparison below was garbled by the comment system;
  # the value 0.5 is a reconstruction, not necessarily the original
  positions <- ifelse(abs(data[, "pred"]) > 0.5, sign(data[, "pred"]), 0)
  trades <- abs(c(1, diff(positions)))
  profits <- positions * data[, "obs"]
  profit <- sum(profits)
  names(profit) <- "profit"
  return(profit)
}
```

Pingback: Optimal Data Windows for Training a Machine Learning Model for Financial Prediction | Robot Wealth

## Krzysztof

August 11, 2016

Great post, however…

1) Cross-validation

It's not allowed to use cross-validation for time series, as it simply introduces a future leak and bias. See Rob Hyndman's page http://robjhyndman.com/hyndsight/crossvalidation/

2) From my experience, every subset of financial data will give you different features. The method I use to find the best features is based on bootstrapping and is called the Neyman-Pearson method. Different features will be selected for the BUY side and different ones for the SELL side, so perhaps it's good to split the selection.

3) I read the paper about the Boruta method. I'm using filter methods based on correlation (FCBF) and mutual information (MRMR, JMI etc.). Then I bootstrap those selections with the Neyman-Pearson method. Wrapper methods (like Boruta) take too much time to compute, so you can't bootstrap them, especially if you use big and many datasets.

Krzysztof


## Robot Master

August 11, 2016

Hi Krzysztof

Thanks for reading my blog. Regarding your points:

1) I wouldn't say it's not allowed. But I agree that regular cross-validation (k-fold, bootstrap, leave-one-out etc.) is not suitable for financial data. I now use and recommend the time series cross-validation approach that is mentioned at the bottom of the link you provided. See my post on selecting optimal data windows for more information.

2) Thanks for the heads-up re the Neyman-Pearson method. I haven't used it, but I will look into it. I agree that different models for the long and short sides may be a good idea.

3) I find Boruta to be quite useful for my applications. I also like the intuition behind the algorithm. It's only one of many possible feature selection algorithms though, and I'm sure your methods are great too.

Thanks again for reading.
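To make the time series cross-validation idea concrete, here is a minimal base-R sketch of rolling-origin resampling (the function name and window sizes are illustrative): train on a fixed window, test on the block immediately following it, then roll forward, so no future observation ever leaks into the training window.

```r
# Rolling-origin ("time series") cross-validation index generator.
# Each fold trains on `initial` consecutive observations and tests on the
# `horizon` observations that immediately follow, then rolls forward by `skip`.
ts_cv_indices <- function(n, initial = 500, horizon = 50, skip = 50) {
  starts <- seq(1, n - initial - horizon + 1, by = skip)
  lapply(starts, function(s) {
    list(train = s:(s + initial - 1),
         test  = (s + initial):(s + initial + horizon - 1))
  })
}

folds <- ts_cv_indices(n = 1000)
# every test index strictly follows its training window: no future leak
```

caret implements the same scheme via `trainControl(method = "timeslice")`, but the hand-rolled version above makes the mechanics explicit.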

## Krzysztof

August 11, 2016

Hello again,

ad 3) None of the methods is great or the best; combining them with the Neyman-Pearson method just gives you a better chance that you won't end up with completely irrelevant features after selection. But it's just a better chance… All this analysis of financial data is very slippery. After more than 5 years of applying ML algos to it, my only conclusion is that only a very extensive backtest on multiple data sets from different instruments can give the answer as to whether something works better or not. You can try to repeat your steps on a series from a different instrument, or from the same instrument but a different period, and I guess you will get different results (e.g. different features are relevant, or a different technique is better).

Krzysztof

Pingback: Better Strategies 5: Developing a Machine Learning System – The Financial Hacker

## Tom

August 16, 2016

Hello, I am Tom. Can you tell me how to use the absretSummary function in rfeControl and rfe? Thank you.

```r
absretSummary <- function(data, lev = NULL, model = NULL) {
  # NOTE: the threshold comparison below was garbled by the comment system;
  # the value 0.5 is a reconstruction, not necessarily the original
  positions <- ifelse(abs(data[, "pred"]) > 0.5, sign(data[, "pred"]), 0)
  trades <- abs(c(1, diff(positions)))
  profits <- positions * data[, "obs"]
  profit <- sum(profits)
  names(profit) <- "profit"
  return(profit)
}
```

## Robot Master

August 31, 2016

Hey Tom

This is just a custom summary function for use with the caret `train()` function. The summary function is simply the objective function for the training process; this one trains the learning algorithm towards maximizing the profit at the end of the training period. You can easily write your own for maximizing the Sharpe ratio or any other performance metric of interest. Max Kuhn (the author of the caret package) explains the use of the summary function here. In particular, scroll down to read about the `trainControl()` function and alternative performance metrics. Hope that helps.
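As a concrete illustration, here is a hedged sketch of how such a summary function plugs into `trainControl()`. The resampling parameters are illustrative, the commented-out `train()` call assumes a `gu_data` frame with a `target` column as in the post, and the threshold inside the summary function is a reconstruction (the original comment garbled that line).

```r
library(caret)

# Custom summary function: caret passes a data frame with "pred" and "obs"
# columns; convert predictions to positions and report the total profit.
absretSummary <- function(data, lev = NULL, model = NULL) {
  positions <- ifelse(abs(data[, "pred"]) > 0.5, sign(data[, "pred"]), 0)
  profit <- sum(positions * data[, "obs"])
  names(profit) <- "profit"
  profit
}

# Walk-forward resampling that scores each window with absretSummary
ctrl <- trainControl(method = "timeslice",
                     initialWindow = 500, horizon = 50, fixedWindow = TRUE,
                     summaryFunction = absretSummary)

# model <- train(target ~ ., data = gu_data, method = "glmnet",
#                metric = "profit", maximize = TRUE, trControl = ctrl)
```

Note `maximize = TRUE`: caret needs to know whether the custom metric should be maximized or minimized.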

## alon

August 23, 2016

Great article! Really good research in all aspects.

Is it possible to post the code showing how you made the training file (gu_data)?

Thanks,

## Robot Master

August 31, 2016

Thanks for reading! I've added the code showing how I created the training file as an appendix. I created this code a while ago and have since added to it, so there are a bunch of other indicators and transforms that will be output to a csv file. You can pick out the ones that I've used in this post, or experiment with the others, as you like. The output file is called "variables.csv" and this is what I read into my training data object "gu_data" in this post.

Also, you will need to manually exclude some rows from the output file where indicator values are unable to be calculated due to an insufficient lookback period.

This code is written in Lite-C and is compatible with the Zorro platform, which is an awesome tool for doing this sort of analysis. Some exciting things you can do to extend my research by tweaking this code include:

1. Try other bar periods (I used daily)

2. If using daily bars, try sampling at a more sensible time (i.e. not midnight, as I have done here)

3. Obtain the same data for any market for which you have data by changing the `asset()` call towards the start of the script

4. Use Zorro's recently added pre-processing tools for better scaling/compressing/normalizing of the data than I have used here (these tools were added to the platform after I wrote this post)

5. Add other indicators or data that you are interested in

6. Experiment with longer prediction horizons. By default, I used a one-day prediction horizon. Could predicting further into the future give better results?

7. Try using a classification approach by changing the target variable from a return (a numerical value) to a market direction (a binary value).

8. Plenty more that I haven’t even thought of…

Hope this is useful for you.

## David

December 26, 2016

Great work! A very good tutorial for a beginner like me!

A small question though: why did you multiply the price change of the target by 100?

Thanks.

## Kris Longmore

January 9, 2017

Hi David

No reason other than aesthetics. I just find it nicer to look at figures on the left of the decimal point. You’ll get the same results if you don’t multiply by 100.

Cheers

Kris

## David

February 13, 2017

Hi Kris,

Thanks for the reply.

Another small question bothered me while reading your post.

In the "Removing highly correlated variables" step, we retained variables with correlations below 0.5 with the others. Why, then, did some highly correlated pairs still remain and show up on the map (e.g. velocity10 vs. atrRat10, or atrRat10 vs. atrRat10_20)?

Is there something possibly wrong?

Thanks.

## Kris Longmore

February 14, 2017

Hello David

You can answer your own question very easily by looking into the documentation of the relevant function, `caret::findCorrelation()`. I'll do the leg-work for you so that we have a reference here on Robot Wealth. From the documentation:

> The absolute values of pair-wise correlations are considered. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation.

The key point is that the function considers the *mean absolute correlation* of each variable when the pair-wise correlation between them is higher than the threshold. The function will then remove the variable that has the highest mean absolute correlation with all remaining variables. This is not the same as simply removing any variable that has a pair-wise correlation with another variable greater than the cutoff value. Hope that helps.
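A tiny synthetic demonstration of this behaviour (the variable names and data here are made up, not from the post): `findCorrelation()` flags one member of the highly correlated pair, chosen by mean absolute correlation, rather than everything touching the cutoff.

```r
library(caret)
set.seed(1)

x <- rnorm(200)
d <- data.frame(a = x,
                b = x + rnorm(200, sd = 0.1),  # near-duplicate of a
                c = rnorm(200))                # roughly independent

# Only the a/b pair exceeds the cutoff, so exactly one of them is flagged:
# whichever has the larger mean absolute correlation with the other columns.
drop_idx <- findCorrelation(cor(d), cutoff = 0.5)
d_filt <- d[, -drop_idx]
```

After filtering, `d_filt` keeps `c` plus one of `a`/`b`, which is exactly the behaviour described in the documentation quote above.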

## David

February 15, 2017

Hi Kris,

I think the point is that whenever two variables have a high correlation, one of them will be removed. The mean absolute correlation is just a way to decide which one gets cut.

So your correlation map shouldn't show pairs like velocity10 and atrRat10 with a large circle/high correlation. Otherwise, what's the point of using findCorrelation() and setting the cutoff?

## Kris Longmore

February 15, 2017

Hi David

The point is that the pair-wise correlations reveal useful information about the data, as do the results of the `findCorrelation()` function. Further, if there were many variables from which to select, the function is an efficient first step in processing them. It's also interesting to see which variables the function identifies for removal in relation to their pair-wise correlations with the remaining variables.

I don't see any basis for your statement 'your correlation map shouldn't have the pair like velocity10 to atrRat10 with large circle/high correlation.' Why not? Does it not reveal useful information? Have I misrepresented the pair-wise correlations? I think you might be confusing the roles of the correlation map and the `findCorrelation()` function by assuming they represent the same thing, or perhaps you see them as being somehow contradictory rather than complementary. I'm struggling to see where you're coming from here.

## David

February 15, 2017

Hi Kris,

I don’t think I’m confused with the function and the map, but still thanks for the reply. I’ll think it through again and check my code.

Pingback: Site News: Dabbling in Ads and Where All the Clicks Went in 2016 | Quantocracy

Pingback: Back to Basics: Introduction to Algorithmic Trading – Robot Wealth

## Vincenzo Somma

February 16, 2017

Hi,

I’m an R beginner and trying to reproduce the results.

I don’t understand where the “gu_data_filt” variable is filled with data in the first small code block of the article.

Any help appreciated.

Thanks

Vincenzo

## Kris Longmore

February 19, 2017

Vincenzo, you found an error – nicely spotted. There was a missing line of code which has been added.

## David

February 20, 2017

Hi Vincenzo,

I suggest adding `highCor <- highCor + 1;` before `gu_data_filt <- gu_data[, -highCor];`, for your reference.

## Kris Longmore

February 20, 2017

David,

Can you explain that suggestion? Why do you need to increment the values in the `highCor` object?

## David

February 20, 2017

It's because we use cor(gu_data[, -1]), which makes highCor contain the wrong column numbers (all are 1 smaller) for the features to be removed. Adding 1 back gives the correct result.
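David's point can be reproduced on toy data (the names mirror the post's, but the values here are synthetic): because the target column is dropped before computing correlations, every index returned by `findCorrelation()` is off by one relative to the full data frame.

```r
library(caret)
set.seed(42)

x <- rnorm(100)
gu_data <- data.frame(target = rnorm(100),
                      f1 = x,
                      f2 = x + rnorm(100, sd = 0.01),  # near-duplicate of f1
                      f3 = rnorm(100))

# Correlations are computed on the features only, so indices returned by
# findCorrelation() are relative to f1..f3, not to gu_data itself
highCor <- findCorrelation(cor(gu_data[, -1]), cutoff = 0.5)
highCor <- highCor + 1                 # shift back to gu_data's numbering
gu_data_filt <- gu_data[, -highCor]    # target column survives intact
```

Without the `highCor + 1` line, an index of 1 would delete the target column instead of the offending feature.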

## Kris Longmore

February 20, 2017

Nicely spotted! You are absolutely correct. I've updated the code accordingly. Note that this would alter the features which get passed through to the next feature selection stage.

## Vincenzo Somma

February 20, 2017

Hi Kris,

If you have time, can you update the article with the correct resulting features?

Just to compare my results with yours.

Thanks

## Kris Longmore

February 20, 2017

Hey Vincenzo,

You mean the features that fall out of the correlation analysis? Or the entire post? Sure, I’ll update the figures and results, when I can find the time.

Note however that there aren’t any ‘correct’ features – a better way to approach the problem is to think in terms of which features or combination of features are most useful for the problem at hand.

## Vincenzo Somma

February 20, 2017

Hi Kris,

I mean in your reply to David, pasted below: the resulting features are incorrect because of the code issue he outlined. It would be nice if you could update the article's feature selection results after this fix.

Thanks for setting up a great interesting web site.

“Note that this would alter the features which get passed through to the next feature selection stage”

## Kris Longmore

February 20, 2017

Understood. Yes, updating the post is on my radar. Might take me a week or two to find the time, but I'll make sure it happens.

Thanks for the kind words.

## David

February 20, 2017

Glad it helped. Looking forward to the updated result. Thanks for the good work.

## corburt erilio

February 21, 2017

Great post. I am facing a couple of these problems.