Machine learning for Trading:
Adventures in Feature Selection
- 2019: In this first Machine Learning for Trading post, we’ve added a section on feature selection using the Boruta package, equity curves of a simple trading system, and some Lite-C code that generates the training data.
- 2020: I’ve updated the original post with some new thinking about data-mining, refreshed the code, updated the data and plots, and added the code and data to our GitHub repository.
My thinking about data mining has evolved
Back when I originally wrote this article, there was a commonly held idea that a newly-hyped approach to predictive modeling known as machine learning could discern predictive patterns in market data. A quick search on SSRN will turn up dozens of examples of heroic attempts at this very thing, many of which have been downloaded thousands of times.
Personally, I spent more hours than I care to count on this approach. And while I learned an absolute ton, I can also say that nothing that I trade today emerged from such a data-mining exercise.
Over the years since I first wrote this article, a realisation slowly dawned on me:
Trading is very hard, and these techniques don’t really help that much with the hardest part.
I think, in general, the trading and investment community has had a similar awakening.
OK, so what’s the “hardest part” of trading?
Operational issues of running a trading business aside, the hardest part of trading is maximising the probability that the edges you choose to trade continue to pay off in the future.
Of course, we can never be entirely sure about anything in the markets. They change. Edges come and go. There’s usually anxiety that an edge isn’t really an edge at all, that it’s simply a statistical mirage. There is uncertainty everywhere.
Perhaps the most honest goal of the quantitative researcher is to reduce this uncertainty as far as reasonably possible.
Unfortunately (or perhaps fortunately, if you take the view that if it were easy, everyone would do it), reducing this uncertainty takes a lot of work and more than a little market nous.
In the practical world of our own trading, we do this in a number of ways centred on detailed and careful analysis:
- Does the edge make sense from a structural, economic, financial, or behavioural perspective?
- Is there a reason for it to exist that I can explain in terms of taking on risk or operational overhead that others don’t want, or providing a service?
- Is it stable through time?
- Does it show up in the assets that I’d expect it to, given my explanation for why it exists?
- What else could explain it? Have I isolated the effect from things we already know about?
- What other edges can I trade with this one to diversify my risk?
In the world of machine learning and data mining, “reducing uncertainty” involves accounting for data-mining bias (the tendency to eventually find things that look good if you look at enough combinations). There are statistical tests for data-mining bias which, if one is being generous, offer plausible-sounding tools for validating data mining efforts. However, I’m not here to be generous to myself, and I can admit that the appeal of such tools, at least for me, lay in the promise of avoiding the really hard work of careful analysis. I don’t need to do the analysis, because a statistical test can tell me how certain my edge is!
But what a double-edged sword such avoidance turns out to be.
If you’ve ever tried to trade a data-mined strategy, regardless of what your statistical test for data-mining bias told you, you know that it’s a constant battle with your anxiety and uncertainty. Because you haven’t done the work to understand the edge, it’s impossible to just leave it alone. You’re constantly adjusting, wondering, and looking for answers after the fact. It turns into an endless cycle – and I’ve personally seen it play out at all levels from beginner independent traders through to relatively sophisticated and mature professional trading firms.
The real tragedy about being on this endless cycle is that it short-circuits the one thing that is most effective at reducing uncertainty, at least at the level of your overall portfolio – finding new edges to trade.
This reality leads me to an approach for adding a new trade to our portfolio:
- Do the work to reduce the uncertainty to the extent possible. You don’t want to trade just anything, you want to trade high-probability edges that you understand.
- Trade it at a size that can’t hurt you at the portfolio level if you’re wrong – and we will all be wrong from time to time.
- Leave it alone and go look for something else to trade.
The third point is infinitely more palatable if you’ve done the work and understand the things you’re already trading.
Having said all that, I’m not about to abandon machine learning and other statistical tools. They absolutely have their place, but it’s worth thinking about the relative importance of what to concentrate on and what we spend our time on.
At one extreme, we might think that market insight and quantitative analysis (what we’d call “feature engineering” in machine learning speak) is the most important thing and that we should spend all our time there.
However, the problem with this approach is that we would miss out on effective and well-understood techniques (for example PCA, lasso regression, and others) that very much help with modeling and analysis. Understanding these tools well enough to know what they are and when they might help greatly enhances your effectiveness as a quantitative researcher.
On the other extreme, we might think that spending all our time on machine learning, data mining and statistical tests is appropriate. This is akin to owning a top-notch toolkit for servicing a car, but not knowing anything about cars, and leads to the endless cycle of patching things up mentioned above.
One of the first books I read, when I began studying the markets, was David Aronson’s Evidence-Based Technical Analysis. The engineer in me was attracted to the ‘Evidence-Based’ part of the title. This was soon after I had digested a trading book that claimed a basis in chaos theory, the link to which actually turned out to be non-existent (apparently using complex-sounding terms in the title of a trading book lends some measure of credibility… and book sales).
Evidence-Based Technical Analysis promotes a scientific approach to trading, including a detailed method for the assessment of data-mining bias in backtest results. There’s also a discussion around the reasons that some traders turn away from objective methods and embrace subjective beliefs. I find this area fascinating.
Readers know that I am interested in using machine learning to profit from the markets. I was excited to discover that David Aronson had co-authored a new book with Timothy Masters titled Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments – which I’ll now refer to as SSML. While it is intended as a companion to Aronson’s (free) software platform for strategy development, it has a bunch of practical tips for anyone using machine learning for trading the financial markets and I’ve implemented many of his ideas in R.
So Kris, how does this backstory of your reading habits benefit me?
Well, SSML was a survival guide of sorts during my early forays into machine learning for trading, and I want to walk you through some of those early experiments, focusing on the more significant and practical learnings that I encountered along the way. Maybe this can be a source of inspiration for your own research.
This first post will focus on feature engineering and also introduce the data mining approach. Machine Learning for Trading Part 2 will focus on algorithm selection and ensemble methods for combining the predictions of numerous learners.
Let’s get started!
The data mining approach
Data mining is just one approach to extracting profits from the markets and is different from a model-based approach.
Rather than constructing a mathematical representation of price, returns, or volatility from first principles, data mining involves searching for patterns first and then fitting a model to those patterns. Both model-based and data mining approaches have pros and cons.
The Financial Hacker summed up the advantages and disadvantages of the data mining approach nicely:
The advantage of data mining is that you do not need to care about market hypotheses. The disadvantage: those methods usually find a vast amount of random patterns and thus generate a vast amount of worthless strategies. Since mere data mining is a blind approach, distinguishing real patterns – caused by real market inefficiencies – from random patterns is a challenging task. Even sophisticated reality checks can normally not eliminate all data mining bias. Not many successful trading systems generated by data mining methods are known today.
David Aronson himself cautions against putting blind faith in data mining methods:
Though data mining is a promising approach for finding predictive patterns in data produced by largely random complex processes such as financial markets, its findings are upwardly biased. This is the data mining bias. Thus, the profitability of methods discovered by data mining must be evaluated with specialized statistical tests designed to cope with the data mining bias.
Data mining is a term that can mean different things to different people depending on the context. When I refer to a data mining approach to trading systems development, I am referring to the use of statistical learning algorithms to uncover relationships between feature variables and a target variable (in the regression context, these would be referred to as the independent and dependent variables, respectively).
The feature variables are observations that are assumed to have some relationship to the target variable and could include, for example, historical returns, historical volatility, various transformations or derivatives of a price series, economic indicators, and sentiment barometers. The target variable is the object to be predicted from the feature variables and could be the future return (next day return, next month return, etc), the sign of the next day’s return, or the actual price level (although the latter is not really recommended, for reasons that will be explained below).
Although I differentiate between the data mining approach and the model-based approach, the data mining approach can also be considered an exercise in predictive modeling. Interestingly, the model-based approaches that I have written about previously (for example ARIMA, GARCH, Random Walk etc) assume linear relationships between variables. Modeling non-linear relationships using these approaches is (apparently) complex and time-consuming. On the other hand, some statistical learning algorithms can be considered ‘universal approximators’ in that they have the ability to model any linear or non-linear relationship. It was not my intention to get into a philosophical discussion about the differences between a model-based approach and a data mining approach, but clearly, there is some overlap between the two.
I would add that the implicit assumption behind the data mining approach is that the patterns identified will continue to repeat in the future. Of course, this assumption is made in one form or another however we approach the markets, but the point is it’s always subject to uncertainty.
Variables and feature engineering
The prediction target
The first and most obvious decision to be made is the choice of target variable. In other words, what are we trying to predict?
For one-day-ahead forecasting systems, some measure of profit derived from a correct prediction seems a sensible starting point. For this post, I chose the next day’s return normalized to the recent average true range, the implication being that in live trading, position sizes would be inversely proportional to the recent volatility. In addition, by normalizing the target variable in this way, we control for the confounding effect of volatility and provide a means to scale the target across multiple markets.
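As a rough illustration, here is a minimal sketch of that target calculation. The vector names (`high`, `low`, `close`) and the 100-period ATR lookback are assumptions for the example rather than the exact code behind the post's data.

```r
# Minimal sketch: next day's return normalised by a recent ATR.
# Assumes plain numeric vectors of daily bars: high, low, close.
library(TTR)

atr <- ATR(cbind(high, low, close), n = 100)[, "atr"]

# next day's log return, aligned so that element t is the return from t to t+1
next_ret <- c(diff(log(close)), NA)

# volatility-normalised prediction target
target <- next_ret / atr
```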
Choosing predictive variables
In SSML, Aronson states that the golden rule of feature selection is that the predictive power should come primarily from the features and not from the model itself. This is eminently good advice. A model is a tool by which we can understand a tradable effect – it’s not an effect in itself.
Later in this post, you’ll notice this golden rule playing out in a practical sense with many algorithm types returning correlated predictions for the same feature set. Further, the choice of features had a far greater impact on performance than the choice of model.
The implication is that spending considerable effort on feature selection and feature engineering is well and truly justified. This hints at my earlier comments about seeking to understand an effect rather than blindly data mining.
Many variables will have little or no relationship with the target and including these will lead to overfitting or other forms of poor performance. Aronson recommends using Chi-squared tests and Cramer’s V to quantify the relationship between variables and the target. I actually didn’t use this approach, but instead used a number of others including ranking a list of candidate features according to their Maximal Information Coefficient (MIC) and selecting the highest ranked features, Recursive Feature Elimination (RFE) via the caret package in R, feature selection via the Boruta algorithm, an exhaustive search of all linear models, and Principal Components Analysis (PCA). Each of these is discussed below.
Some candidate features
Following is the list of features that I investigated as part of this research; most were derived from SSML. The list is by no means exhaustive and consists only of derivatives and transformations of the price series. I haven’t yet tested alternative data sets, volume features, the price histories of related instruments and the like, but I think these deserve attention too. Still, it provides a decent starting point (a sketch of a few of these calculations follows the list):
- 1-day log return
- Trend deviation: the logarithm of the closing price divided by the lowpass filtered price
- Momentum: the price today relative to the price x days ago, normalized by the standard deviation of daily price changes.
- ATR: the average true range of the price series
- Velocity: a one-step-ahead linear regression forecast on closing prices
- Linear forecast deviation: the difference between the most recent closing price and the closing price predicted by a linear regression line
- Price variance ratio: the ratio of the variance of the log of closing prices over a short time period to that over a long time period.
- Delta price variance ratio: the difference between the current value of the price variance ratio and its value x periods ago.
- The Market Meanness Index: A measure of the likelihood of the market being in a state of mean reversion, created by the Financial Hacker.
- MMI deviation: The difference between the current value of the Market Meanness Index and its value x periods ago.
- The Hurst exponent
- ATR ratio: the ratio of an ATR of a short (recent) price history to an ATR of a longer period.
- Delta ATR ratio: the difference between the current value of the ATR ratio and the value x bars ago.
- Bollinger width: the log ratio of the standard deviation of closing prices to the mean of closing prices, that is a moving standard deviation of closing prices relative to the moving average of closing prices.
- Delta Bollinger width: the difference between the current value of the Bollinger width and its value x bars ago.
- Absolute price change oscillator: the difference between a short and long lookback mean log price divided by a 100-period ATR of the log price.
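To make the list above a little more concrete, here is a minimal sketch of how a few of these features might be computed. Variable names and lookback lengths are illustrative assumptions, not the exact parameters used in the analysis.

```r
# Minimal sketch of a few candidate features, on numeric vectors high, low, close.
library(TTR)  # ATR, runSD, runMean

hlc <- cbind(high, low, close)

# ATR ratio: a short (recent) ATR relative to a longer-period ATR
atr_ratio <- ATR(hlc, n = 20)[, "atr"] / ATR(hlc, n = 100)[, "atr"]

# Bollinger width: log of a moving standard deviation of closes relative to
# the moving average of closes
b_width <- log(runSD(close, n = 20) / runMean(close, n = 20))

# Momentum: price today relative to the price n days ago, normalised by the
# standard deviation of daily price changes over a longer window
n <- 10
momentum <- (close - c(rep(NA, n), head(close, -n))) /
  c(NA, runSD(diff(close), n = 100))
```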
Limitations and caveats on the candidate features
These features cover various momentum, mean reversion, and volatility effects. In the feature selection work below, we use a number of data-mining tools to assess their predictive utility. However, at this point, I make no attempt to justify their inclusion in the model on any economic or structural basis – a crucial step that in the real world you wouldn’t skip. This requires some detailed and careful analysis and in the interest of demonstrating some tools, I’m not going to do that analysis here.
Thus far I have only considered the most recent value of each variable. The recent history of each variable would provide another dimension of data to mine. I left this out of the feature selection stage since it makes more sense to firstly identify features whose current values contain predictive information about the target variable before considering their recent histories. Incorporating this from the beginning of the feature selection stage would increase the complexity of the process by several orders of magnitude – so let’s also skip that for now.
Transforming the candidate features
In my experiments, the variables listed above were used with various cutoff periods (that is, the number of periods used in their calculation). Typically, I used values between 3 and 20 since Aronson states in SSML that lookback periods greater than about 20 will generally not contain information useful to the one period ahead forecast. Some variables (like the Market Meanness Index) benefit from a longer lookback. For these, I experimented with 50, 100, and 150 periods.
Additionally, it is important to enforce a degree of stationarity on the variables. David Aronson again:
Using stationary variables can have an enormous positive impact on a machine learning model. There are numerous adjustments that can be made in order to enforce stationarity such as centering, scaling, and normalization. So long as the historical lookback period of the adjustment is long relative to the frequency of trade signals, important information is almost never lost and the improvements to model performance are vast.
- Scaling: divide the indicator by the interquartile range (note, not by the standard deviation, since the interquartile range is not as sensitive to extremely large or small values).
- Centering: subtract the historical median from the current value.
- Normalization: both of the above. Roughly equivalent to traditional z-score standardization, but uses the median and interquartile range rather than the mean and standard deviation in order to reduce the impact of outliers.
- Regular normalization: rescales the data to the range -1 to +1 over the lookback period, i.e. (x - min)/(max - min), re-centered to the desired range.
In my experiments, I generally adopted regular normalization using the most recent 50 values of the features.
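Here is a minimal sketch of these rolling transformations, assuming a numeric feature vector `x` and a trailing window of 50 observations as above; the zoo helper is my choice for the example, not necessarily what was used originally.

```r
# Minimal sketch: rolling centering, scaling and normalization of a feature x.
library(zoo)  # rollapplyr: trailing-window apply

window <- 50
roll_median <- rollapplyr(x, window, median, fill = NA)
roll_iqr    <- rollapplyr(x, window, IQR,    fill = NA)

centered   <- x - roll_median                 # centering
scaled     <- x / roll_iqr                    # scaling
normalized <- (x - roll_median) / roll_iqr    # centering + scaling

# "Regular" normalization: rescale to [-1, 1] over the trailing window
roll_min <- rollapplyr(x, window, min, fill = NA)
roll_max <- rollapplyr(x, window, max, fill = NA)
reg_norm <- 2 * (x - roll_min) / (roll_max - roll_min) - 1
```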
If you’re following along with the code and data provided, I used the data for the EUR/USD exchange rate (sampled daily at 9 am London time, for the period 2009-2019). The raw data was created with the Zorro Trading Automation Platform using the script in the GitHub repository here.
Removing highly correlated variables
It makes sense to remove variables that are highly correlated with other variables since they are unlikely to provide additional information that isn’t already contained elsewhere in the feature space. Keeping these variables will also add unnecessary computation time, increase the risk of overfitting and bias the final model towards the correlated variables.
caret::findCorrelation() finds the variables which, if removed, will reduce pairwise correlations below some cutoff value. With a cutoff of 0.3, these are the remaining variables and their pairwise correlations:
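A minimal sketch of this filtering step, assuming a data frame `features` holding the candidate variables:

```r
# Minimal sketch: drop features to bring pairwise correlations below the cutoff.
library(caret)

cor_matrix <- cor(features, use = "pairwise.complete.obs")
drop_idx   <- findCorrelation(cor_matrix, cutoff = 0.3)  # columns to remove

features_filtered <- if (length(drop_idx) > 0) features[, -drop_idx] else features
```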
Feature selection via Maximal Information
The maximal information coefficient (MIC) is a non-parametric measure of two-variable dependence designed specifically for rapid exploration of many-dimensional data sets. While MIC is limited to univariate relationships (that is, it does not consider variable interactions), it does pick up non-linear relationships between dependent and independent variables.
Read more about MIC here. I used the minerva package in R to rank my variables according to their MIC with the target variable (the next day’s return normalized to the 100-period ATR). Here’s the output:
> mic
                var          Y
1        atrRatSlow 0.10202538
2             trend 0.10143818
3              mom3 0.09854331
4             apc10 0.09752206
5     deltaATRrat10 0.09599392
6         deltaPVR5 0.09521441
7  deltaMMIFastest5 0.09438674
8         MMIFaster 0.09219701
9       HurstFaster 0.09126731
10        deltaPVR3 0.08766022
11       bWidthSlow 0.08661493
12          ATRSlow 0.00000000
None of the features have a particularly high MIC with respect to the target variable, which is what I would expect from noisy data such as daily exchange rates sampled at an arbitrary time.
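For reference, a minimal sketch of how such a ranking can be produced with minerva, assuming the filtered feature set `features_filtered` and the `target` vector from the earlier sketches:

```r
# Minimal sketch: rank features by their MIC with the target.
library(minerva)

ok  <- complete.cases(features_filtered, target)  # drop rows with NAs first
mic <- mine(x = as.matrix(features_filtered[ok, ]), y = target[ok])$MIC

mic_ranked <- data.frame(var = rownames(mic), Y = mic[, 1])
mic_ranked[order(-mic_ranked$Y), ]
```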
Recursive feature elimination
I also used recursive feature elimination (RFE) via the caret package to isolate the most predictive features from my list of candidates. RFE is an iterative process that involves constructing a model from the entire set of features, retaining the best-performing features, and then repeating the process until all the features are eliminated. The model with the best performance is identified and the feature set from that model is declared the most useful.
I performed cross-validated RFE using a random forest model. Using the top five variables identified in the RFE process (which include atrRatSlow), we get a decent drop in root mean squared error (RMSE). If we use all twelve variables that remain after filtering on the correlation matrix, we get a further drop in cross-validated RMSE, although the top five features seem to account for the majority of the performance.
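A minimal sketch of the RFE step with caret's built-in random forest functions; the subset sizes and the number of folds are illustrative assumptions.

```r
# Minimal sketch: cross-validated recursive feature elimination with a random forest.
library(caret)

set.seed(42)
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

rfe_fit <- rfe(x = features_filtered[ok, ], y = target[ok],
               sizes = c(3, 5, 8, 12), rfeControl = rfe_ctrl)

rfe_fit              # RMSE for each subset size
predictors(rfe_fit)  # the selected variables
```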
Here’s a plot of RMSE against the number of variables retained:
The RFE process assigned the features an importance value:
The RFE process emphasized variables that describe volatility and trend, and there seems to be general, although imperfect, agreement with the MIC results.
I am tempted to take the results of the RFE with a grain of salt because:
- The RFE algorithm does not fully account for interactions between variables. For example, assume that two variables individually have no effect on model performance, but due to some relationship between them they improve performance when both are included in the feature set. RFE is likely to miss this predictive relationship.
- The implementation of RFE that I used was the ‘out of the box’ caret version, which uses root mean squared error (RMSE) as the objective function. I don’t believe that RMSE is best for this data, due to the influence of extreme values on model performance: it is possible to have a low RMSE but poor overall performance if the model is accurate across the middle regions of the target space (corresponding to small wins and losses) but inaccurate in the tails (corresponding to big wins and losses).
To address the second point above, I implemented a custom summary function that maximises the cross-validated absolute return. I also applied the additional criterion that only predictions with an absolute value greater than 5 would be considered, to reflect what we might do if trading the model’s predictions (a sketch of such a summary function appears below). The inclusion of eight features led to the highest cross-validated absolute return (although this varied between 7 and 10 with different settings of the random seed):
This approach led to some differences in the feature importance ranking:
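Here is a minimal sketch of the kind of caret-compatible summary function described above. The function name, the fixed threshold of 5, and the way it is wired into rfe() are assumptions for illustration rather than the exact code behind the results.

```r
# Minimal sketch: score a resample by the absolute return captured when acting
# only on predictions whose magnitude exceeds a threshold.
absReturnSummary <- function(data, lev = NULL, model = NULL) {
  threshold <- 5
  traded <- abs(data$pred) > threshold
  # trade in the direction of the prediction; data$obs is the realised
  # volatility-normalised next-day return
  c(AbsReturn = sum(sign(data$pred[traded]) * data$obs[traded]))
}

# Plugging it into the RFE process in place of RMSE:
rfe_ctrl$functions$summary <- absReturnSummary
rfe_abs <- rfe(x = features_filtered[ok, ], y = target[ok],
               sizes = 3:12, metric = "AbsReturn", maximize = TRUE,
               rfeControl = rfe_ctrl)
```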
Models with in-built feature selection
A number of machine learning algorithms have feature selection built in. Max Kuhn’s website for the caret package contains a list of such models that are accessible through caret. I’ll apply several and compare the features they select with those selected by the other methods.
For this experiment, I used a diverse range of algorithms that include various ensemble methods and both linear and non-linear interactions:
- Bagged multivariate adaptive regression splines (MARS)
- Boosted generalized additive model (bGAM)
- Spike and slab regression (SSR)
For each model, I did only very basic hyperparameter tuning using time series cross-validation with a train window length of 200 days and a test window length of 20 days. Maximization of absolute return was used as the objective function. Following cross-validation, caret trains a model on the full data set with the best cross-validated hyperparameters – but this is not what we want if we are to mimic actual trading behaviour (we are more interested in the aggregated performance across each test window, which caret very neatly allows us to access – details on this below when we investigate a trading system).
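As an illustration of this setup, here is a minimal sketch using caret's "timeslice" resampling with a 200-day training window and a 20-day test window. The learner shown ("bagEarth", i.e. bagged MARS) and the reuse of the absReturnSummary function sketched earlier are assumptions for the example.

```r
# Minimal sketch: walk-forward (time-slice) cross-validation in caret.
library(caret)

ts_ctrl <- trainControl(method          = "timeslice",
                        initialWindow   = 200,    # training window length
                        horizon         = 20,     # test window length
                        fixedWindow     = TRUE,   # slide the window, don't grow it
                        savePredictions = "final",
                        summaryFunction = absReturnSummary)

fit <- train(x = features_filtered[ok, ], y = target[ok],
             method    = "bagEarth",              # bagged MARS; swap for other learners
             metric    = "AbsReturn", maximize = TRUE,
             trControl = ts_ctrl)

# out-of-sample predictions for each test window live in fit$pred
head(fit$pred)
```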
The figure below shows the proportional frequency with which variables were selected in the top 5 by each algorithm. For instance, a value of 0.75 indicates that a variable was selected in the top 5 by 75% of the algorithms tested:
We see less overlap with the other feature selection methods; however, the slow volatility and fast momentum features do appear regularly. All of the variables that passed the correlation filter were selected in the top 5 by at least one algorithm.
The overall lack of consistency hints that our features probably aren’t overly predictive.
Model selection using glmulti
The glmulti package fits all possible unique generalized linear models from the variables and returns the ‘best’ models as determined by an information criterion (Akaike’s, in this case). The package is essentially a wrapper for the glm (generalized linear model) function that allows selection of the ‘best’ model or models, providing insight into the most predictive variables. By default, glmulti builds models from main effects only, but there is an option to also include pairwise interactions between variables. This increases the computation time considerably, and I found that the resulting ‘best’ models were orders of magnitude more complex than those obtained using main effects only, while results were on par.
# glmulti.analysis
# Method: h / Fitting: glm / IC used: aicc
# Level: 1 / Marginality: FALSE
# From 100 models:
# Best IC: 31216.8153639823
# Best model:
# "target ~ 1 + MMIFaster"
# Evidence weight: 0.0283330120226094
# Worst IC: 31220.2936057687
# 29 models within 2 IC units.
# 90 models to reach 95% of evidence weight.
We retain the models whose AICs are less than two units from the ‘best’ model. Two units is a rule of thumb for models that are likely to be on par in terms of their performance:
#                                                       model     aicc    weights
# 1                                   target ~ 1 + MMIFaster 31216.82 0.02833301
# 2                           target ~ 1 + apc10 + MMIFaster 31216.94 0.02668724
# 3                      target ~ 1 + MMIFaster + atrRatSlow 31217.13 0.02418697
# 4               target ~ 1 + apc10 + deltaPVR3 + MMIFaster 31217.17 0.02377989
# 5                       target ~ 1 + deltaPVR3 + MMIFaster 31217.18 0.02361508
# 6              target ~ 1 + apc10 + MMIFaster + atrRatSlow 31217.30 0.02226910
# 7          target ~ 1 + deltaPVR3 + MMIFaster + atrRatSlow 31217.47 0.02042182
# 8  target ~ 1 + apc10 + deltaPVR3 + MMIFaster + atrRatSlow 31217.50 0.02009271
# 9                      target ~ 1 + MMIFaster + bWidthSlow 31218.33 0.01330753
# 10                          target ~ 1 + MMIFaster + trend 31218.47 0.01237686
# 11             target ~ 1 + MMIFaster + trend + atrRatSlow 31218.48 0.01233432
# 12                                             target ~ 1  31218.52 0.01211088
# 13     target ~ 1 + deltaATRrat10 + MMIFaster + atrRatSlow 31218.52 0.01209924
# 14                  target ~ 1 + deltaATRrat10 + MMIFaster 31218.52 0.01207773
# 15             target ~ 1 + apc10 + MMIFaster + bWidthSlow 31218.52 0.01207104
# 16          target ~ 1 + apc10 + deltaATRrat10 + MMIFaster 31218.56 0.01184645
# 17 target ~ 1 + apc10 + deltaATRrat10 + MMIFaster + atrRatSlow 31218.57 0.01177378
# 18                    target ~ 1 + MMIFaster + HurstFaster 31218.58 0.01174280
# 19                        target ~ 1 + MMIFaster + ATRSlow 31218.59 0.01165791
# 20         target ~ 1 + deltaPVR3 + MMIFaster + bWidthSlow 31218.59 0.01164927
# 21            target ~ 1 + apc10 + MMIFaster + HurstFaster 31218.65 0.01133414
# 22 target ~ 1 + apc10 + deltaPVR3 + MMIFaster + bWidthSlow 31218.66 0.01126059
# 23               target ~ 1 + MMIFaster + deltaMMIFastest5 31218.70 0.01105118
# 24                target ~ 1 + apc10 + MMIFaster + ATRSlow 31218.71 0.01100487
# 25                                 target ~ 1 + atrRatSlow 31218.72 0.01092551
# 26  target ~ 1 + deltaPVR3 + MMIFaster + trend + atrRatSlow 31218.75 0.01079397
# 27                      target ~ 1 + deltaPVR5 + MMIFaster 31218.78 0.01059770
# 28              target ~ 1 + deltaPVR3 + MMIFaster + trend 31218.79 0.01057023
# 29                           target ~ 1 + mom3 + MMIFaster 31218.81 0.01046743
Notice any patterns here? Many of the top models selected the ATR ratio and MMI features, as well as the price change oscillator (apc10). Perhaps surprisingly, the momentum variables are sparsely represented. This is confirmed by the plot of model-averaged variable importance (averaged over the best 100 models):
Note that these models only considered the main, linear effects of each variable on the target. Of course, there is no guarantee that any relationship is linear, if it exists at all. Further, there is the implicit assumption of stationary relationships amongst the variables – which is unlikely to hold. Still, this method provides some useful insight.
One of the great things about glmulti is that it facilitates model-averaged predictions – more on this when I delve into ensembles in part 2 of this series.
Generalized linear model with stepwise feature selection
Finally, I used a generalized linear model with stepwise feature selection:
# Coefficients:
# (Intercept)        trend   atrRatSlow
#      -1.907       -3.632       -5.100
#
# Degrees of Freedom: 2024 Total (i.e. Null);  2022 Residual
# Null Deviance:      8625000
# Residual Deviance:  8593000    AIC: 22670
The final model selected 2 of the 15 variables: the ratio of the 20- to 100-day ATR, and the difference between a short-term and long-term trend indicator.
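For completeness, a minimal sketch of this kind of stepwise selection using base R's step() on a full main-effects GLM (the exact stepwise routine behind the results above may differ):

```r
# Minimal sketch: AIC-based stepwise selection from the full main-effects GLM.
full_glm <- glm(target ~ ., data = model_data, family = gaussian())
step_glm <- step(full_glm, direction = "both", trace = FALSE)
step_glm
```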
Boruta: all relevant feature selection
Boruta finds relevant features by comparing the importance of the original features with the importance of random variables. Random variables are obtained by permuting the order of values of the original features. Boruta finds a minimum, mean, and maximum value of the importance of these permuted variables, and then compares these to the original features. Any original feature that is found to be more relevant than the maximum random permutation is retained.
Boruta does not measure the absolute importance of individual features. Rather, it compares each feature to random permutations of the original variables and determines the relative importance. This theory very much resonates with me, particularly for weeding out uninformative features from noisy financial data. The idea of adding randomness to the sample and then comparing performance is analogous to an approach I experimented with to benchmark my systems against a random trader with a similar trade distribution.
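A minimal sketch of running Boruta over the filtered feature set, again assuming the `model_data` data frame; maxRuns and the seed are illustrative.

```r
# Minimal sketch: all-relevant feature selection with Boruta.
library(Boruta)

set.seed(42)
boruta_fit <- Boruta(target ~ ., data = model_data, maxRuns = 1000)

print(boruta_fit)
plot(boruta_fit, las = 2, cex.axis = 0.7)  # box plots like those shown below
getSelectedAttributes(boruta_fit)          # the confirmed-important features
```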
The box plots in the figure below show the results obtained when I ran the Boruta algorithm for the 12 filtered variables for 1,000 iterations. The blue box plots show the permuted variables of minimum, mean and maximum importance, the green box plots indicate the original features that ranked higher than the maximum importance of the random permuted variables, and the variables represented by the red box plots are discarded.
# Boruta performed 103 iterations in 3.279162 mins.
# 8 attributes confirmed important: apc10, atrRatSlow, ATRSlow, bWidthSlow, deltaATRrat10 and 3 more;
# 4 attributes confirmed unimportant: deltaMMIFastest5, HurstFaster, MMIFaster, mom3;
These results are generally but not perfectly consistent with the results obtained through other methods.
Side note: The developers state that “Boruta” means “Slavic spirit of the forest.” As something of a Slavophile myself, I did some googling and discovered that this description is quite a euphemism. Check out some of the items that pop up in a Google image search!
Discussion of feature selection methods
Any feature selection process naturally invites a degree of selection bias. For example, from a large set of uninformative variables, a small number may randomly correlate with the target variable.
The selection algorithm would then rank these variables highly. The error would only be (potentially) uncovered through cross-validation of the selection algorithm or by using an unseen test or validation set. Feature selection is difficult and can often make predictive performance worse since it is easy to over-fit the feature selection criterion. It is all too easy to end up with a subset of attributes that works really well on one particular sample of data, but not necessarily on any other. There is a fantastic discussion of this at the Statistics Stack Exchange community that I have linked here because it is just so useful.
It is critical to take steps to minimize selection bias at every opportunity. The results of any feature selection process should be cross-validated or tested on an unseen hold out set. If the hold out set selects a vastly different set of predictors, something has obviously gone wrong – or the features are worthless.
The approach I took in this post was to cross-validate the results of each test that I performed, with the exception of the Maximal Information Coefficient and glmulti approaches. I’ve also selected features based on data for one market only. If the selected features are not robust, this would likely show up as poor performance when I attempt to build predictive models for other related markets using these features.
I think that it could be useful to apply a wide range of methods for feature selection and then look for patterns and consistencies across these methods. This approach seems to intuitively be far more likely to yield useful information than drawing absolute conclusions from a single feature selection process.
Applying this logic to the approach described above, we can conclude that the ratio of the 10- to 20-day ATR (atrRatSlow), the trend difference indicator (trend), the absolute price change oscillator (apc10), and the change in price variance ratio (deltaPVR3) are probably the most likely to yield useful information since they show up in most of the feature selection methods that I investigated.
In part 2 of this article, I’ll describe how I built and combined various models based on these variables.
Principal Components Analysis
An alternative to feature selection is Principal Components Analysis (PCA), which attempts to reduce the dimensionality of the data while retaining the majority of the information contained therein.
PCA is a linear technique: it transforms the data by linearly projecting it onto a lower dimension space while preserving as much of its variation as possible. Another way of saying this is that PCA attempts to transform the data so as to express it as a sum of uncorrelated components.
Again, note that PCA is limited to a linear transformation of the data. Another significant assumption when using PCA is that the principal components of future data will look like those of the training data.
To investigate the effects of PCA on model performance, I cross-validated two random forest models, the first using the principal components of the 12 variables, and the other using those variables in their raw form. I chose the random forest model since it includes feature selection and thus may reveal some insights about how PCA stacks up in relation to other feature selection methods. For both models, I performed time-series cross-validation on a training window of 200 days and a testing window of 50 days.
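Here is a minimal sketch of that comparison, using caret's preProcess hook to generate the principal components inside each resample; the names and settings carry over from the earlier sketches and are assumptions for illustration.

```r
# Minimal sketch: random forest on raw features vs. principal components,
# compared over the same time-slice resamples.
library(caret)

pca_ctrl <- trainControl(method = "timeslice",
                         initialWindow = 200, horizon = 50, fixedWindow = TRUE,
                         savePredictions = "final",
                         summaryFunction = absReturnSummary)

rf_raw <- train(x = features_filtered[ok, ], y = target[ok],
                method = "rf", metric = "AbsReturn", maximize = TRUE,
                trControl = pca_ctrl)

rf_pca <- train(x = features_filtered[ok, ], y = target[ok],
                method = "rf", metric = "AbsReturn", maximize = TRUE,
                trControl = pca_ctrl,
                preProcess = c("center", "scale", "pca"))

# compare the resample distributions via box and whisker plots
resamps <- resamples(list(raw = rf_raw, pca = rf_pca))
bwplot(resamps, metric = "AbsReturn")
```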
In order to infer the difference in model performance, I collected the results from each resampling iteration of both final models and compared their distributions via a pair of box and whisker plots:
The model built on the raw data under-performs the model built on the principal components in this case: the mean profit of the PCA model is slightly higher and its distribution is shifted in the positive direction.
A simple trading system
I will go into more detail about building an example trading system using machine learning in the next post, but the following demonstrates a simple system based on some of the information gained from the analysis presented above.
The system is based on four of the features that the feature selection analysis identified as being potentially predictive of the target variable: the ratio of the 10- to 20-day ATR (atrRatSlow), the trend difference indicator (trend), the absolute price change oscillator (apc10), and the change in price variance ratio (deltaPVR3).
I trained a generalized boosted regression model with the gbm package, using these features as the independent variables to predict the next day’s return normalized to the recent ATR. The model was trained on a sliding window of 200 days and tested on the adjacent 50 days over the length of the entire data set. I didn’t include transaction costs.
The returns series of most financial instruments consists of a relatively large number of small positive and small negative values and a smaller number of large positive and large negative values. I hypothesize that the values whose magnitude is smaller are more random in nature than the values whose magnitude is larger. On any given day, all things being equal, a small negative return could turn out to be a small positive return by the time the close rolls around, or vice versa, as a result of any number of random occurrences related to the fundamentals of the exchange rate. These same random occurrences are less likely to push a large positive return into negative territory and vice versa, purely on account of the size of the price swings involved.
Following this logic, I think that my model is likely to be more accurate in its extreme predictions than in its ‘normal’ range. We can test this hypothesis on the simple trading strategy described above by entering positions only when the model predicts a return that is large in magnitude.
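Here is a minimal sketch of that walk-forward GBM with threshold-based entries. The hyperparameters, the T20 threshold shown, and the `model_data` structure are assumptions for illustration, not the exact settings behind the equity curves below.

```r
# Minimal sketch: sliding-window GBM on the four selected features, trading
# only when the prediction is large in magnitude.
library(gbm)

feats     <- c("atrRatSlow", "trend", "apc10", "deltaPVR3")
train_len <- 200
test_len  <- 50
threshold <- 20            # e.g. the "T20" variant

n <- nrow(model_data)
strategy_ret <- rep(NA_real_, n)

for (s in seq(1, n - train_len - test_len + 1, by = test_len)) {
  train_idx <- s:(s + train_len - 1)
  test_idx  <- (s + train_len):(s + train_len + test_len - 1)

  fit <- gbm(reformulate(feats, response = "target"),
             data = model_data[train_idx, ],
             distribution = "gaussian",
             n.trees = 500, interaction.depth = 3, shrinkage = 0.01)

  preds <- predict(fit, model_data[test_idx, ], n.trees = 500)

  # long when prediction > threshold, short when < -threshold, flat otherwise
  pos <- ifelse(abs(preds) > threshold, sign(preds), 0)
  strategy_ret[test_idx] <- pos * model_data$target[test_idx]
}

equity <- cumsum(na.omit(strategy_ret))
plot(equity, type = "l", main = "Sliding-window GBM with threshold entries")
```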
Here are the results of each “strategy” (T0 corresponds to a prediction threshold of 0, T20 to a prediction threshold of 20, etc), along with the return of the (volatility adjusted) underlying from the testing data set:
While the strategy may look interesting, it is actually not overly robust to changing the random initialization or the hyperparameters of the GBM algorithm – more work is needed to turn this into a viable strategy.
We can see that increasing the prediction threshold for entering a trade resulted in a reduced final equity, but increased risk-adjusted returns. The model significantly outperformed the underlying.
Summary of the feature selection exercise
- Several approaches agreed that the ratio of long- and short-term ATRs was the most important feature.
- This feature and the price change oscillator were selected by all approaches used in this post.
- The RFE analysis indicated that it may be prudent to focus on variables that measure long-term volatility or recent changes in volatility relative to the longer term.
- An exhaustive search of all possible generalized linear models (main effects only) using glmulti implied that the 20- to 100-day ATR ratio and the MMI variables are most predictive.
- Stepwise feature selection using a generalized linear model returned similar results.
- Boruta identified 8 useful variables, with the ATR ratio and price change oscillator the clear winners.
- Transforming the variables using PCA slightly improved the performance of a random forest model relative to using the raw variables.
- Undoubtedly the results of the simple trading strategy are upwardly biased. After all, we’ve selected the very features we found to be most useful on our data set, albeit with some care around cross-validation. It would be informative to test a rolling feature selection approach where the features are selected at each model building stage.
The same features seem to be selected over and over again using the different methods. Is this just a fluke, or has this long and exhaustive data mining exercise revealed something of practical use to a trader? In part 2 of this series, I will investigate the performance of various machine learning algorithms based on this feature selection exercise. I’ll compare different algorithms and then investigate combining their predictions using ensemble methods with the objective of creating a useful trading system.
Aronson, D. 2006, Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals.
Aronson, D. and Masters, T. 2014, Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments: Developing Predictive-Model-Based Trading Systems Using TSSB.