It would be great if **machine learning** were as simple as feeding data to an out-of-the-box implementation of some learning algorithm, then standing back and admiring the predictive utility of the output. As anyone who has dabbled in this area will confirm, it is never that simple. We have features to engineer and transform (no trivial task – see here and here for an exploration with applications for finance), not to mention the vagaries of dealing with data that is **non-Independent and Identically Distributed** (non-IID). In my experience, landing on a model that fits the data acceptably at the outset of a modelling exercise is unlikely; a little (or a lot!) of effort usually has to be spent on tuning and debugging the algorithm to achieve acceptable performance.

In the case of **non-IID time series data**, we also face the dilemma of how much data to use when training a predictive model. Given the **non-stationarity of asset prices**, if we use too much data, we run the risk of training our model on data that is no longer relevant. If we use too little data, we run the risk of building a model that fits the noise in a small sample and generalises poorly. This raises the question: **Is there an ideal amount of data to include in machine learning models for financial prediction?** I don’t know, but I doubt the answer is clear cut, since we never know when the underlying process is about to undergo significant change. I hypothesise that it makes sense to use the minimum amount of data that leads to acceptable model performance, and testing this is the subject of this post.

**How Much Data?**

In classical data science, model performance *generally* improves as the amount of training data is increased. However, as mentioned above, due to the non-IID nature of the data we use in finance, this happy assumption is not necessarily applicable. My theory is that using too much data (that is, using a training window that extends far into the past) is actually detrimental to model performance.

In order to explore this idea, I decided to build a model based on previous asset returns and measures of volatility. The volatility measure that I used is the 5-period Average True Range (ATR) minus the 20-period ATR normalized over the last 50 periods. The data used is the EUR/USD daily exchange rate sampled at 9:00am GMT between 2006 and 2016.
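As a rough illustration of how such a volatility feature could be built, here is a minimal base R sketch. It rests on assumptions the post does not confirm: the ATR is taken as a simple moving average of the true range, "normalized" is read as a rolling z-score over 50 periods, and the helper names (`roll_mean`, `true_range`, `atr_regime`) and the synthetic prices are purely hypothetical.

```r
# Trailing simple moving average; NA until n observations are available (assumed smoothing)
roll_mean <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

# True range: the largest of high-low, |high - previous close|, |low - previous close|
true_range <- function(high, low, close) {
  prev_close <- c(NA, head(close, -1))
  pmax(high - low, abs(high - prev_close), abs(low - prev_close), na.rm = TRUE)
}

# Fast ATR minus slow ATR, expressed as a rolling z-score (assumed normalisation)
atr_regime <- function(high, low, close, fast = 5, slow = 20, norm = 50) {
  tr     <- true_range(high, low, close)
  spread <- roll_mean(tr, fast) - roll_mean(tr, slow)
  mu     <- roll_mean(spread, norm)
  sigma  <- sqrt(pmax(roll_mean(spread^2, norm) - mu^2, 0))
  (spread - mu) / sigma
}

# Synthetic daily bars, for illustration only
set.seed(1)
n     <- 300
close <- 1.10 + cumsum(rnorm(n, sd = 0.005))
high  <- close + abs(rnorm(n, sd = 0.003))
low   <- close - abs(rnorm(n, sd = 0.003))
vol   <- atr_regime(high, low, close)
```

With these defaults the first 68 values are `NA`, because the 20-period average needs 20 bars and the 50-period normalisation needs a further 49.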

The model used the previous three values of the returns and volatility series as the input features and the next day’s market direction as the target. I trained a simple two-class logistic regression model using R’s *glm* function with a time-series cross-validation approach. This approach involves training the model on a window of data and predicting the outcome of the next period, then shifting the training window forward in time by one period. The model is then retrained on the new window and the next period’s outcome predicted. This process is repeated along the length of the time series. The cross-validated performance of the model is simply the performance of the next-day predictions under some suitable performance measure. I recorded the profit factor and Sharpe ratio of the model’s predictions. I used class probabilities to determine the positions for the next day as follows:

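if $P_{up} > \text{threshold}$, go long at open

if $P_{down} > \text{threshold}$, go short at open

if $P_{up} \le \text{threshold}$ and $P_{down} \le \text{threshold}$ (equivalent to $1 - \text{threshold} \le P_{up} \le \text{threshold}$, since $P_{up} + P_{down} = 1$), remain flat

where $P_{up}$ and $P_{down}$ are the calculated probabilities for the next day’s market direction to be positive and negative respectively, and $\text{threshold}$ is the class probability threshold.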

Positions were liquidated at the close.
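The walk-forward procedure and the long/short/flat rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses synthetic returns and lagged returns alone as features, and the names `walk_forward_glm`, `position_from_prob` and `lag_n` are hypothetical. The profit factor and Sharpe ratio formulas follow those in the source code at the end of the post, which does the actual work through caret's `timeslice` resampling.

```r
# Fit on rows (t - window):(t - 1), predict P(up) for row t, then slide forward one row
walk_forward_glm <- function(dat, window) {
  n    <- nrow(dat)
  p_up <- rep(NA_real_, n)
  for (t in (window + 1):n) {
    train   <- dat[(t - window):(t - 1), ]
    fit     <- suppressWarnings(glm(up ~ ., data = train, family = binomial))
    p_up[t] <- suppressWarnings(predict(fit, newdata = dat[t, , drop = FALSE], type = "response"))
  }
  p_up
}

# +1 = long, -1 = short, 0 = flat, using P(down) = 1 - P(up)
position_from_prob <- function(p_up, th = 0.5) {
  ifelse(p_up > th, 1, ifelse(1 - p_up > th, -1, 0))
}

profit_factor <- function(trades) sum(trades[trades > 0]) / abs(sum(trades[trades < 0]))
sharpe_ratio  <- function(trades, periods = 252) sqrt(periods) * mean(trades) / sd(trades)

# Synthetic example (random returns, so no real edge is expected)
set.seed(42)
ret   <- rnorm(200, sd = 0.006)
lag_n <- function(x, k) c(rep(NA, k), head(x, -k))
dat   <- na.omit(data.frame(up   = as.integer(ret > 0),
                            ret1 = lag_n(ret, 1),
                            ret2 = lag_n(ret, 2),
                            ret3 = lag_n(ret, 3)))
p_up     <- walk_forward_glm(dat, window = 50)
realised <- ret[-(1:3)]               # returns aligned with the rows kept by na.omit
oos      <- !is.na(p_up)              # rows that received an out-of-sample prediction
trades   <- position_from_prob(p_up[oos], th = 0.525) * realised[oos]
cat("PF:", round(profit_factor(trades), 2), " SR:", round(sharpe_ratio(trades), 2), "\n")
```

Warnings are suppressed because very short windows can produce perfectly separated or rank-deficient fits, which is itself a symptom of training on too little data.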

In order to investigate the effects of the size of the data window, I varied its size between 15 and 1,600 days and recorded the cross-validated performance for each case. I also recorded the average in-sample performance on each of the training windows. Slicing up the data so that the various cross-validation samples were consistent across window lengths took some effort, but this wrangling was made simpler by Max Kuhn’s caret package (I once again tip my hat to him).
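To make the alignment issue concrete: a model with a training window of w days produces its first prediction on day w + 1, so shorter windows yield more predictions than longer ones. One way to compare them fairly, mirroring the `commonData` index in the source code at the end of the post, is to keep only the predictions that fall in the period covered by the longest window. The toy sizes and stand-in predictions below are hypothetical.

```r
windows <- c(15, 50, 100)   # toy window lengths
n_obs   <- 120              # toy number of observations

# Stand-in for each model's predictions: here simply the row each prediction refers to
preds <- lapply(windows, function(w) (w + 1):n_obs)

# Rows for which every window length has a prediction
common <- (max(windows) + 1):n_obs

# Keep the last length(common) predictions of each model so all columns cover the same days
aligned <- sapply(preds, function(p) tail(p, length(common)))
colnames(aligned) <- as.character(windows)
```

Each column of `aligned` now refers to the same 20 days, so performance differences reflect the window length rather than the evaluation period.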

The results are presented below.

We can see that for the smallest window lengths, the in-sample performance greatly exceeds the cross-validated performance. In other words, when we use very little data, the model fits the training data well, but fails to generalize out of sample. It has a variance problem, which is what we would expect.

Then things get interesting. As we add slightly more data in the form of a longer training window, the in-sample performance decreases, but the cross-validated performance increases, very quickly rising to meet the in-sample performance. In-sample and cross-validated performance are very similar for window lengths between roughly 25 and 75 days. This is an important result, because when the cross-validated performance approximates the in-sample performance, it suggests that the model is capturing the underlying signal and is therefore likely to generalise well. Encouragingly, this performance is reasonably robust across that range. If we had only one data point showing reasonable cross-validated performance, I wouldn’t trust that this wasn’t due to randomness. The existence of a region of reasonable performance gives us a degree of confidence in the results.

As we add yet more data to our training window, we can see that the in-sample performance continues to deteriorate, eventually reaching a lower limit, and that the cross-validated performance likewise continues to decline, with a notable exception around 500 days. This suggests that as we increase the training window length, the model develops a bias problem and underfits the data.

These results are perhaps confounded by the fact that the optimal window length may be a characteristic of this particular market and the particular 10-year period used in this experiment. Actually, I feel this is quite likely. I haven’t run this experiment on other markets or time periods yet, but I strongly suspect that each market will exhibit different optimal window lengths, and that these will probably themselves vary with time. Notwithstanding this, it appears that we can at least conclude that in finance, more data is not necessarily better.

**Equity Curves**

I know how much algorithmic traders like to see an equity curve, so here is the model performance using a variety of selected window lengths, as well as the buy and hold equity curve of the underlying. Transaction costs are not included.

In this case, the absolute performance is nothing spectacular*. However, it demonstrates the differences in the quality of the predictions obtained using different window lengths for training the models. We can clearly see that more is not necessarily better, at least for this particular period of time.

**Performance as a Function of Class Probability Threshold**

It is also interesting to investigate how performance varies across the different window lengths as a function of the class probability threshold used in the trading decisions. Here is a heatmap of the model’s Sharpe ratio for various window lengths and class probability thresholds.

We can see a fairly obvious region of higher Sharpe ratios for lower window lengths and generally increasing class probability threshold. The region of the higher Sharpe ratios for longer window lengths and higher class probabilities (the upper right corner) is actually slightly misleading, since the number of trades taken for these model configurations is vanishingly small. However, we can see that when those trades do occur, they tend to be of a higher quality.
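One way to guard against being misled by such sparsely traded configurations is to report the number of trades next to the Sharpe ratio for each threshold. The sketch below does this on synthetic probabilities and returns; the function name `threshold_summary` and all of the data are hypothetical, and the Sharpe formula is the one used in the source code at the end of the post.

```r
# For each threshold: number of days with a position, and annualised Sharpe ratio
threshold_summary <- function(p_up, ret, thresholds) {
  t(sapply(thresholds, function(th) {
    pos    <- ifelse(p_up > th, 1, ifelse(1 - p_up > th, -1, 0))
    trades <- pos * ret
    c(threshold = th,
      n_trades  = sum(pos != 0),
      sharpe    = sqrt(252) * mean(trades) / sd(trades))  # NaN when no trades are taken
  }))
}

# Synthetic class probabilities clustered around 0.5, and random returns
set.seed(7)
p_up       <- pmin(pmax(rnorm(500, mean = 0.5, sd = 0.04), 0), 1)
ret        <- rnorm(500, sd = 0.006)
thresholds <- seq(0.50, 0.65, 0.05)
res        <- threshold_summary(p_up, ret, thresholds)
print(res)
```

The trade count can only fall as the threshold rises, so a high Sharpe ratio at a high threshold should be read alongside how few observations produced it.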

Finally, here are several equity curves for a window length of 30 days and various class probability thresholds.

**Conclusions**

This post investigated the effects of varying the length of the training window on the performance of a simple logistic regression model for predicting the next-day direction of the EUR/USD exchange rate. Results indicated that more data does not necessarily lead to a better predictive model. In fact, there may be a case for using a relatively small window of training data to force the model to continuously re-learn and adapt to the most immediate market conditions. There appears to be a trade-off to contend with: very small windows exhibit vast differences between performance on the training set and performance on out-of-sample data, while very large windows perform poorly both in-sample and out-of-sample.

While absolute performance of the model was nothing to get excited about, the model used here was a very simple logistic regression classifier and minimal effort was spent on feature engineering. This suggests that the outcomes of this research could potentially be used in conjunction with more sophisticated algorithms and features to build a model with acceptable performance. This will be the subject of future posts.

The axiom *he who has the most data wins* is widely applicable in data science. This doesn’t appear to be the case when it comes to building predictive models for the financial markets. Rather, the research presented here suggests that the development and engineering of the model itself may play a far larger role in its out-of-sample performance. This implies that model performance is more a function of the skill of the developer than of the ability to obtain as much data as possible. I find that to be a very satisfying conclusion.

**Source Code**

Here’s some source code if you are interested in reproducing my results. Warning: it is slightly hacky and takes a long time to run if you store all the in-sample performance data! By default I have commented out that part of the code.

```r
##############################
#### HOW MUCH DATA??? ########
##############################

library(caret)
library(deepnet)
library(foreach)
library(doParallel)
library(e1071)
library(quantmod)
library(reshape2)
library(ggplot2) # needed for the ggplot calls below

# import data
eu <- read.csv("EU_Daily.csv", header = TRUE, stringsAsFactors = FALSE)

# create directional statistics
eu$Direction <- factor(ifelse(eu$Returns > 0, "up", "down"))

# calculate features - lagged return and volatility values
periods <- c(1:5)
Returns <- eu$Returns
Volatility <- eu$atrRegime
Direction <- eu$Direction
lagReturns <- data.frame(lapply(periods, function(x) Lag(Returns, x)))
colnames(lagReturns) <- c('Ret1', 'Ret2', 'Ret3', 'Ret4', 'Ret5')
lagVolatility <- data.frame(lapply(periods, function(x) Lag(Volatility, x)))
colnames(lagVolatility) <- c('Vol1', 'Vol2', 'Vol3', 'Vol4', 'Vol5')

# create data set based on lagged returns and volatility indicators
# with next day return and direction as targets
dat <- data.frame(Returns, Direction, lagReturns, lagVolatility)
dat <- dat[-c(1:14), ] # remove zeros from initial volatility calcs
dat <- na.omit(dat)

# preserve out of sample data
# Train <- dat[1:(nrow(dat)-500), ]
# Test <- dat[-(1:(nrow(dat)-500)), ]
Train <- dat # for using all data in training set

# logistic regression models --------------------------------------------------------------
direction <- 2 # column number of direction variable
features <- c(3, 4, 5, 8, 9, 10) # feature columns (Ret1-Ret3, Vol1-Vol3)
returns <- 1 # returns column

windows <- c(15, 20, 25, 30, 40, 50, 75, 100, 125, 150, 200, 250, 300, 350, 400, 450,
             500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600)

modellist.lr <- list()
PFTrain <- vector()
SharpeTrain <- vector()
j <- 1
for (i in windows) {
  timecontrol <- trainControl(method = 'timeslice', initialWindow = i, horizon = 1,
                              classProbs = TRUE, returnResamp = 'final',
                              fixedWindow = TRUE, savePredictions = 'final')
  cl <- makeCluster(8)
  registerDoParallel(cl)
  set.seed(503)
  modellist.lr[[j]] <- train(Train[, features], Train[, direction],
                             method = 'glm', family = 'binomial',
                             trControl = timecontrol)

  #### to run the in-sample performance module, comment out the next four lines
  #### (through the loop's closing brace) and uncomment the block that follows
  j <- j+1
  print(i)
  stopCluster(cl)
}

#   ctrl <- trainControl(method='none', classProbs = TRUE)
#   tradesCP <- list()
#   pfCP <- vector()
#   srCP <- vector()
#   indexes <- modellist.lr[[j]]$control$index
#   for(k in c(1:length(modellist.lr[[j]]$control$index)) )
#   {
#     model <- train(Train[indexes[[k]], features], Train[indexes[[k]], direction],
#                    method = 'glm', family = 'binomial',
#                    trControl = ctrl)
#
#     predsCP <- predict(model, Train[indexes[[k]], features], type = 'prob')
#     th <- 0.5
#
#     # short trades earn the negative of the return, as in the cross-validated block below
#     tradesCP[[k]] <- ifelse(predsCP$up > th, Train[indexes[[k]], returns],
#                             ifelse(predsCP$down > th, -Train[indexes[[k]], returns], 0))
#     pfCP[k] <- sum(tradesCP[[k]][tradesCP[[k]] > 0])/abs(sum(tradesCP[[k]][tradesCP[[k]] < 0]))
#     srCP[k] <- sqrt(252)*mean(tradesCP[[k]])/sd(tradesCP[[k]])
#
#     cat("\niteration:",k, "PF:",round(pfCP[k], digits = 2), "SR:",round(srCP[k], digits = 2))
#
#   }
#   PFTrain[j] <- mean(pfCP)
#   SharpeTrain[j] <- mean(srCP)
#   stopCluster(cl)
#   j <- j+1
#   cat("\n",i)
# }

# resampled performance - use class probabilities as a trade threshold
thresholds <- seq(0.50, 0.65, 0.005)
windowIndex <- c(1:length(windows))
# adjust matrix dimensions based on loop size
pf.cv.th <- matrix(nrow = length(windowIndex), ncol = length(thresholds),
                   dimnames = list(c(as.character(windows)), c(as.character(thresholds))))
sr.cv.th <- matrix(nrow = length(windowIndex), ncol = length(thresholds),
                   dimnames = list(c(as.character(windows)), c(as.character(thresholds))))
Trades.DF <- matrix(nrow = length((max(windows)+1):nrow(Train)), ncol = length(windowIndex),
                    dimnames = list(c(), c(as.character(windows))))

for(i in windowIndex) {
  # evaluate every model on the period covered by the longest window
  commonData <- c((max(windows)+1):nrow(Train))
  # keep only the last length(commonData) predictions of this model
  nPred <- length(modellist.lr[[i]]$pred$up)
  keep <- (nPred-length(commonData)+1):nPred
  j <- 1
  for(th in thresholds) {
    trades <- ifelse(modellist.lr[[i]]$pred$up[keep] > th, Train$Returns[commonData],
                     ifelse(modellist.lr[[i]]$pred$down[keep] > th, -Train$Returns[commonData], 0))
    plot(cumsum(trades), type = 'l', col = 'blue', xlab = 'day', ylab = 'cumP',
         main = paste0('Window: ', windows[i], ' Cl.Prob Thresh: ', th), ylim = c(-0.5, 1.0))
    lines(cumsum(Train$Returns[commonData]), col = 'red')
    pf.cv.th[i, j] <- sum(trades[trades>0])/abs(sum(trades[trades<0]))
    sr.cv.th[i, j] <- sqrt(252)*mean(trades)/sd(trades)
    # tolerance-based comparison avoids a floating point mismatch with the seq() values
    if(isTRUE(all.equal(th, 0.525))) Trades.DF[, i] <- trades
    j <- j+1
  }
}

#### plot training vs cv performance - class probs
#### (PFTrain and SharpeTrain are only populated by the in-sample module above)
th <- 11 # threshold index
plot(PFTrain, pf.cv.th[, th])
PF.Compare <- data.frame(windows, PFTrain, pf.cv.th[, th])
colnames(PF.Compare) <- c('window', 'IS', 'CV')
SR.Compare <- data.frame(windows, SharpeTrain, sr.cv.th[, th])
colnames(SR.Compare) <- c('window', 'IS', 'CV')

# base R plots
plot(windows, SharpeTrain, type = 'l', ylim = c(-0.9, 1.5), col = 'blue')
lines(windows, sr.cv.th[, th], col = 'red')

plot(windows, PFTrain, type = 'l', ylim = c(0.8, 1.5), col = 'blue')
lines(windows, pf.cv.th[, th], col = 'red')

# nice ggplots
pfMolten <- melt(PF.Compare, id = c('window'))
srMolten <- melt(SR.Compare, id = c('window'))

pfPlot <- ggplot(data=pfMolten, aes(x=window, y=value, colour=variable)) +
  geom_line(size = 0.75) +
  scale_colour_manual(values = c("steelblue3","tomato3")) +
  theme(legend.title=element_blank()) +
  ylab('Profit Factor') +
  ggtitle("In-Sample and Cross-Validated Performance")

sharpePlot <- ggplot(data=srMolten, aes(x=window, y=value, colour=variable)) +
  geom_line(size = 1.25) +
  scale_colour_manual(values = c("steelblue3","tomato3")) +
  theme(legend.title=element_blank()) +
  ylab('Sharpe Ratio') +
  ggtitle("In-Sample and Cross-Validated Performance")

# heatmap plots of performance and thresholds
srHeat <- melt(sr.cv.th)
colnames(srHeat) <- c('window', 'threshold', 'sharpe')
srHeat$window <- factor(srHeat$window, levels = c(windows))

pfHeat <- melt(pf.cv.th[-c(15:22), ])
colnames(pfHeat) <- c('window', 'threshold', 'pf')
pfHeat$window <- factor(pfHeat$window, levels = c(windows))

sharpeHeatMap <- ggplot(data=srHeat, aes(x = window, y = threshold)) +
  geom_tile(aes(fill = sharpe), colour = "white") +
  scale_fill_gradient(low = "#fff7bc", high = "#e6550d", name = 'Sharpe Ratio') +
  scale_x_discrete(breaks = windows[seq(2,28,2)]) +
  xlab('Window Length (not to scale)') +
  ylab('Class Probability Threshold') +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

pfHeatMap <- ggplot(data=pfHeat, aes(x = window, y = threshold)) +
  geom_tile(aes(fill = pf), colour = "white") +
  scale_fill_gradient(low = "#e5f5f9", high = "#2ca25f", name = 'Profit Factor') +
  xlab('Window Length (not to scale)') +
  ylab('Class Probability Threshold') +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

# Equity curves
#colnames(Trades.DF) <- as.character(windows)
Trades.DF <- as.data.frame(Trades.DF)
Equity <- cumsum((Trades.DF[, as.character(c(15, 30, 500, 1000, 1500))]))
Equity$Underlying <- cumsum(Train$Returns[commonData])
Equity$Index <- c(1:nrow(Equity))
EquityMolten <- melt(Equity, id = 'Index')

equityPlot <- ggplot(data=EquityMolten, aes(x=Index, y=value, colour=variable)) +
  geom_line(size = 0.6) +
  scale_color_manual(name = c('Window\nLength'),
                     values = c('darkorchid3', 'dodgerblue3', 'darkolivegreen4',
                                'darkred', 'goldenrod', 'grey35')) +
  ylab('Return') +
  xlab('Day') +
  ggtitle("Returns")

## Equity curves for a single window length
thresholds <- seq(0.50, 0.95, 0.01)
Index <- which(windows == 30)
Trades.win <- matrix(nrow = length((max(windows)+1):nrow(Train)), ncol = length(thresholds),
                     dimnames = list(c(), c(as.character(thresholds))))
commonData <- c((max(windows)+1):nrow(Train))
nPred <- length(modellist.lr[[Index]]$pred$up)
keep <- (nPred-length(commonData)+1):nPred
j <- 1
for(th in thresholds) {
  trades <- ifelse(modellist.lr[[Index]]$pred$up[keep] > th, Train$Returns[commonData],
                   ifelse(modellist.lr[[Index]]$pred$down[keep] > th, -Train$Returns[commonData], 0))
  Trades.win[, j] <- trades
  j <- j+1
}

Trades.win.DF <- as.data.frame(Trades.win)
Equity.win <- cumsum(Trades.win.DF[, seq(1, 31, 5)])
Equity.win$Underlying <- cumsum(Train$Returns[commonData])
Equity.win$Index <- c(1:nrow(Equity.win))
Equity.winMolten <- melt(Equity.win, id = 'Index')

singleWinPlot <- ggplot(data=Equity.winMolten, aes(x=Index, y=value, colour=variable)) +
  geom_line(size = 0.6) +
  scale_colour_brewer(palette = 'Dark2', name = c('Class\nProbability\nThreshold')) +
  ylab('Return') +
  xlab('Day') +
  ggtitle("Returns")

###################################
```

*Of course, building a production trading model is not the point of the exercise. Apologies for pointing this out; I know most of you already understand this, but I invariably get emails after every post from people questioning the performance of the ‘trading algorithms’ I post on my blog. Just to be clear, I am not posting trading algorithms! I am sharing my research. Performance on market data, particularly relative performance, is a quick and easy way to interpret the results of this research. I don’t intend for anyone (myself included) to use the simple logistic regression model presented here in a production environment. However, I do intend to use the concepts presented in this post to improve my existing models or build entirely new ones. There is more than enough information in this post for you to do the same, if you so desire.