In the last article, I described an application of the k-means clustering algorithm for classifying candlesticks based on the relative position of their open, high, low and close. This was a simple enough exercise, but now I tackle something more challenging: isolating information that is both useful and practical to real trading. I’ll initially try two approaches:
- Investigate whether there are any statistically significant patterns in certain clusters following others
- Investigate the distribution of next day returns following the appearance of a candle from each cluster
The insights gained from this analysis will hopefully inform the next direction of this research.
In the last article, I classified twelve months of daily candles (June 2014 – July 2015) into eight clusters. To simplify the analysis and ensure that enough instances of each cluster are observed, I’ll reduce the number of clusters to four and extend the history to cover 2008-2015. I’ll exclude my 2015 data for now in case I need a final, unseen test set at some point in the future.
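As a rough illustration of the clustering step just described, here is a minimal sketch in base R. It runs on synthetic candles rather than the GBP/JPY history, so the prices, the variable names (`op`, `hi`, `lo`, `cl`) and the `nstart = 10` setting are all assumptions made for the example, not part of the original analysis.

```r
# Minimal sketch on synthetic candles (hypothetical data, not the GBP/JPY series)
set.seed(123)  # k-means starts from random centres, so fix the seed

n  <- 500
op <- 150 + cumsum(rnorm(n, sd = 0.5))         # hypothetical opening prices
cl <- op + rnorm(n, sd = 0.8)                  # hypothetical closing prices
hi <- pmax(op, cl) + abs(rnorm(n, sd = 0.3))   # high is at or above open and close
lo <- pmin(op, cl) - abs(rnorm(n, sd = 0.3))   # low is at or below open and close

# Features: high, low and close expressed relative to the open
features <- cbind(HO = hi - op, LO = lo - op, CO = cl - op)

# Four clusters; nstart = 10 (an assumption) keeps the best of ten random starts
fit <- kmeans(features, centers = 4, nstart = 10)

table(fit$cluster)   # number of candles assigned to each cluster
fit$centers          # average candle shape of each cluster
```

Inspecting `fit$centers` is a quick way to see which cluster corresponds to large up days, large down days and so on, since the cluster numbering itself is arbitrary.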
Here’s a subset of the candles over the entire price history (2008-2014, 2015 is held out) grouped by cluster:
Roughly speaking, cluster one corresponds to a significant up day, cluster two to a significant down day, cluster three to a small up day (although there are a few instances of small down days and neutral days), and cluster four to a small down day (although again there are a few instances of small up days and doji-type candles).
The clusters are not perfectly homogeneous visually. For example, in cluster two there are one or two instances of candles with a large range, but whose overall downwards movement is small. These candles resemble what we know as ‘hammers’ or ‘pins’ in traditional candlestick parlance. Cluster three, while dominated by small up candles, also includes many small down days and hammers/pins. Cluster four is similarly imperfect. My trading experience tells me, however, that theoretical perfection, while interesting, is an indulgence best left to academic practitioners. I am much more interested in whether the information my research brings to light holds up well enough to make a profit.
Part 1 – Clustering patterns
For part one of the analysis, I need a baseline for comparative purposes. The obvious baseline is simply the proportional occurrence of each cluster across the data set. Below is a bar plot describing this (excluding the 2015 data). I’ve added a line graph showing the same information as this will be used for comparative purposes below.
Any lagged clustering patterns will need to be statistically significantly different from these baseline occurrences to be of interest.
To start with, let’s look at the lagged proportional cluster occurrences across the entire data set. In the table below, the columns represent the percentage proportion of each cluster that immediately follows the cluster in the rows (rounded for viewing purposes). This information is reproduced as a series of line graphs in the figure below.
Looking at these results, what can we infer? Some observations that stand out are:
- A big down candle (cluster 2) is repeated in 29% of instances. It is followed by a smaller down candle in 28% of cases. Adding these, we can say that a big down day is followed by another down day 57% of the time.
- A big move in either direction is more often than not preceded by a big move.
- Neutral days or relatively smaller moves also tend to follow each other.
- The patterns of next-cluster proportions are different enough from the sample proportions to warrant further investigation.
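One way to put the comparison against the baseline on a firmer footing is a chi-squared test of independence on the table of consecutive cluster labels. The sketch below is only an illustration under stated assumptions: the label sequence is randomly generated, and the names `current` and `following` are introduced here for the example. It builds the baseline proportions, the row-normalised next-cluster table, and the test.

```r
set.seed(42)
# Hypothetical label sequence standing in for the k-means output
cluster <- sample(1:4, size = 1500, replace = TRUE,
                  prob = c(0.15, 0.10, 0.40, 0.35))

n <- length(cluster)
current   <- factor(cluster[-n], levels = 1:4)  # candle on day t
following <- factor(cluster[-1], levels = 1:4)  # candle on day t + 1

# Baseline: unconditional share of each cluster, in percent
baseline_pct <- 100 * prop.table(table(cluster))

# Row-normalised table: share of each following cluster given the current one
transition_count <- table(current, following)
transition_pct   <- 100 * prop.table(transition_count, margin = 1)
round(transition_pct)

# Chi-squared test of independence between consecutive labels
test <- chisq.test(transition_count)
test$p.value
```

A small p-value would suggest that the next cluster depends on the current one. Because the labels here are drawn independently, no such dependence should be expected from this synthetic example.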
I considered investigating the stability of these results with respect to time. For example, it would be possible to set up a moving window on the data set and look at whether these relationships vary across time. However, I’m not sure it would reveal any great insight that is actionable in a practical way. I might come back to this later. For now, let’s jump into something more interesting that is more likely to reveal the information I am interested in: returns associated with individual clusters.
Part 2 – Returns series analysis
For the second part of the analysis I’ll simply construct a cumulative returns series for a holding period of one day for each cluster. If anything interesting turns up in this analysis, I will attempt to validate it using my 2015 out-of-sample data before using it in a trading system. Transaction costs are not included in this analysis.
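The construction of these series can be sketched in a few lines of base R. This is a simplified illustration on made-up returns and labels, not the GBP/JPY data: each day's return is credited to a cluster only if the previous day's candle belonged to that cluster, and the results are then compounded.

```r
set.seed(7)
n <- 750
daily_returns <- rnorm(n, mean = 0, sd = 0.007)    # hypothetical daily returns
cluster <- sample(1:4, size = n, replace = TRUE)   # hypothetical cluster labels

# Cluster label of the previous day, aligned with today's return
prev_cluster <- c(NA, cluster[-n])

# Return on day t if we hold only when the candle on day t - 1 was in cluster k
cluster_returns <- sapply(1:4, function(k) {
  ifelse(prev_cluster == k, daily_returns, 0)
})
cluster_returns[1, ] <- 0   # no signal is available for the first day
colnames(cluster_returns) <- paste0("cluster", 1:4)

# Compounded cumulative returns for each cluster and for the underlying
cum_returns    <- apply(cluster_returns, 2, function(r) cumprod(1 + r) - 1)
cum_underlying <- cumprod(1 + daily_returns) - 1

tail(cbind(underlying = cum_underlying, cum_returns), 1)
```

Since every day after the first follows exactly one cluster, the four cluster series partition the underlying's daily returns between them.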
Here is a chart of the cumulative next day returns for each cluster as well as the cumulative daily returns of the underlying instrument.
These returns are nothing to get excited about, but a few things stand out. Firstly, the next day returns associated with cluster four significantly out-perform the underlying. More interestingly, it seems to be agnostic to the prevailing trend in the underlying, except for the GFC period. Cluster one also outperforms the underlying, but is somewhat correlated. Cluster two does not contain enough instances to make any useful judgements, and cluster three is highly correlated with the underlying.
Below is the chart of cumulative next day returns for cluster four in the post-GFC period from September 2009 (a year after the US government stepped in to rescue Fannie Mae and Freddie Mac in September 2008). I realise that I am cherry picking from my results; however, it does appear that there is value in investigating the performance of cluster four as financial markets adjusted to a new regime.
Performance is encouraging. Cluster four out-performs the underlying and does so with a much lower drawdown. However, it would have been impossible to trade this as a system during the period shown since the clusters were identified using that very data. Next, I will investigate cluster four on my out of sample data, which was not used in the cluster identification process.
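Applying the existing clusters to unseen data amounts to assigning each new candle to the nearest of the centres found in-sample, without refitting. The sketch below shows one way to do that in base R. The helper names `make_features` and `assign_cluster` are hypothetical, and the data are synthetic; it is an illustration of the idea rather than the method used to produce the results discussed here.

```r
set.seed(123)

# Hypothetical generator of (HO, LO, CO) features for n candles
make_features <- function(n) {
  co <- rnorm(n, sd = 0.8)
  cbind(HO = pmax(co, 0) + abs(rnorm(n, sd = 0.3)),
        LO = pmin(co, 0) - abs(rnorm(n, sd = 0.3)),
        CO = co)
}
train <- make_features(1000)   # stands in for the in-sample candles
test  <- make_features(250)    # stands in for the held-out candles

fit <- kmeans(train, centers = 4, nstart = 10)

# Assign each row of x to the closest centre (squared Euclidean distance)
assign_cluster <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(x, 2, centers[k, ])^2)
  })
  max.col(-d, ties.method = "first")   # column index of the smallest distance
}

test_cluster <- assign_cluster(test, fit$centers)
table(test_cluster)
```

Because the centres are fixed before the held-out data are touched, the cluster labels on the test set carry no look-ahead information.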
On the out-of-sample data, cluster four out-performs the underlying by a small margin, and it does so with a much more benign drawdown, although its trend-agnosticism seems to have disappeared. The evidence suggests that cluster four has a small but useful predictive utility.
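Drawdown comparisons like this one can be made concrete with a small helper. The following is a minimal sketch, assuming simple daily returns that are compounded; `max_drawdown` is a name introduced for this example, and the return series are randomly generated rather than taken from the analysis.

```r
# Maximum drawdown: largest peak-to-trough fall in compounded equity
max_drawdown <- function(returns) {
  equity <- cumprod(1 + returns)
  peak   <- cummax(c(1, equity))[-1]   # running high-water mark, starting at 1
  max(1 - equity / peak)
}

set.seed(2015)
underlying <- rnorm(250, sd = 0.007)   # hypothetical daily returns
in_market  <- rbinom(250, 1, 0.35)     # hypothetical: 1 if the prior candle was in the cluster
strategy   <- underlying * in_market   # flat on all other days

c(underlying = max_drawdown(underlying), strategy = max_drawdown(strategy))
```

A strategy that is only in the market part of the time will often show a smaller drawdown simply because it has less exposure, which is worth keeping in mind when comparing it with buy and hold.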
Conclusions and future research directions
I’ve presented a simple application of the k-means clustering algorithm to the prediction of a single financial instrument. I found that the relative values of the daily open, high, low and close appear to have a degree of predictive power for this instrument, particularly in the post-GFC period. These results were validated in an out of sample test. Although gross returns were not spectacular, they were achieved with a much lower drawdown.
A benefit of using an unsupervised learner is that it reduces the chances of over-fitting since there is no optimization on some target value. The trade off is that the information discovered by the learner may have little or no intrinsic utility. In this case, the number of clusters could be an optimization parameter if we varied it and selected the value that resulted in the best returns series.
I used only a small amount of data, made a number of assumptions and applied the approach to just one instrument, chosen at random. There are numerous potential future research directions to take this idea further, including:
- The forex market trades on a 24-hour basis, so the choice of daily closing time is somewhat arbitrary. The time used in this analysis was set at 17:00 EST; however, a more sensible choice for this particular market (GBP/JPY) might be 17:00 UTC.
- The approach could be tested on intra-day data, although this would be subject to the vagaries of intra-day volatility cycles which would likely need to be accounted for.
- Other data containing potentially predictive information could also be supplied to the algorithm, for example trend and volatility indicators.
- The analysis could be extended from single candles to two- and three-candle patterns.
- The k-means algorithm is limited in that it requires the user to input the number of clusters before any classification occurs. We could investigate the effect of varying the number of clusters; however, hierarchical clustering may provide a better alternative, as it does not require any a priori knowledge about the number of clusters into which the data naturally groups.
- Others have suggested the application of a Markov Chain Monte Carlo model in order to build a predictive model based on joint probability tables. This is new territory for me and would require some research into the methodology before I attempted it, however I assume it would require that the joint probabilities were somewhat stable with respect to time, or at least include some means to account for any instability.
- Other markets may or may not be more amenable to this approach.
- Transaction costs should be incorporated into the model.
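To give a flavour of the hierarchical clustering idea mentioned in the list above, here is a minimal base R sketch on synthetic candle features. The data, the Ward linkage and the range of cuts are all assumptions chosen for illustration.

```r
set.seed(123)
co <- rnorm(300, sd = 0.8)   # hypothetical close relative to open
features <- cbind(HO = pmax(co, 0) + abs(rnorm(300, sd = 0.3)),
                  LO = pmin(co, 0) - abs(rnorm(300, sd = 0.3)),
                  CO = co)

# Build the tree once, using Ward linkage on Euclidean distances
tree <- hclust(dist(features), method = "ward.D2")

# Cut the same tree into 2 to 6 groups without re-running the clustering
labels_by_k <- cutree(tree, k = 2:6)
apply(labels_by_k, 2, function(l) length(unique(l)))

# plot(tree, labels = FALSE) draws the dendrogram for visual inspection
```

The tree is built without specifying a number of clusters, although a cut height or group count still has to be chosen at some point, and the dendrogram can help inform that choice.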
Here’s the complete R code used in this analysis:
library(fpc)
library(cluster)
library(quantmod)
library(ggplot2)
library(PerformanceAnalytics)

# read in data
data <- read.csv("GBP_JPY.csv", stringsAsFactors = FALSE)
colnames(data) <- c("Date", "GBP_JPY.Open", "GBP_JPY.High", "GBP_JPY.Low", "GBP_JPY.Close") # quantmod requires these names
data$Date <- as.POSIXct(data$Date, format = "%d/%m/%Y")
data <- as.xts(data[, -1], order.by = data[, 1])
data <- data["2008::2014", 1:4] # in-sample data set
chart_Series(data)

# create HLC relative to O
data$HO <- data[, 2] - data[, 1]
data$LO <- data[, 3] - data[, 1]
data$CO <- data[, 4] - data[, 1]

# K-Means clustering with clusters based on HO, LO, CO
class_factors <- data[, 5:7]
set.seed(123) # required in order to reproduce results
fit <- kmeans(class_factors, 4)
m <- fit$cluster # vector of the cluster assigned to each candle

# which candles were classified into each cluster?
cluster <- as.xts(m)
index(cluster) <- index(data) # coerce index of cluster series to match data's index
new_data <- merge.xts(data, cluster)

# plot candles by cluster
# NOTE: sample_data was not defined in the original listing. As an assumption,
# the last 200 in-sample candles are used as the subset here.
sample_data <- tail(new_data, 200)
# type and theme are arguments of the charting call, not of xts(), and
# chartTheme() is the theme constructor for chartSeries()
chartSeries(xts(coredata(sample_data)[order(as.numeric(sample_data$cluster)), ],
                order.by = index(sample_data)),
            type = "candlesticks", TA = NULL,
            theme = chartTheme("black", up.col = "green", dn.col = "red"))

# count and proportion of each cluster's occurrence in training data
library(plyr)
cluster_count <- count(as.data.frame(new_data), vars = "cluster") # count() expects a data frame
cluster_count$prop_percent <- cluster_count$freq * 100 / sum(cluster_count$freq)
ggplot(cluster_count, aes(x = cluster, y = prop_percent)) +
  geom_bar(stat = "identity", fill = "blue") +
  geom_line(stat = "identity", colour = "black") # bars with an overlaid line

# proportional probability table for next candle
# note: for xts objects, lag(x, 1) holds the previous day's value, so rows index
# the later candle and columns the candle before it
count_table <- table(new_data$cluster, lag(new_data$cluster, 1))
prop_table <- prop.table(count_table, 1) * 100
round(prop_table)
prop_df <- data.frame(round(prop_table))
colnames(prop_df) <- c("Following_Cluster", "Cluster", "Proportion_percent")
ggplot(prop_df, aes(x = Cluster, y = Proportion_percent, group = Following_Cluster)) +
  geom_line(aes(colour = Following_Cluster))

# returns analysis
new_data$daily_returns <- dailyReturn(new_data)
new_data$cluster1 <- ifelse(new_data$cluster == 1, 1, 0)
new_data$cluster2 <- ifelse(new_data$cluster == 2, 1, 0)
new_data$cluster3 <- ifelse(new_data$cluster == 3, 1, 0)
new_data$cluster4 <- ifelse(new_data$cluster == 4, 1, 0)
# yesterday's cluster flag multiplied by today's return
cluster1_returns <- lag(new_data$cluster1, 1) * new_data$daily_returns
cluster2_returns <- lag(new_data$cluster2, 1) * new_data$daily_returns
cluster3_returns <- lag(new_data$cluster3, 1) * new_data$daily_returns
cluster4_returns <- lag(new_data$cluster4, 1) * new_data$daily_returns

# comparative performance of each cluster vs buy and hold
chart.CumReturns(cbind(dailyReturn(new_data), cluster1_returns[-1, ], cluster2_returns[-1, ],
                       cluster3_returns[-1, ], cluster4_returns[-1, ]),
                 legend.loc = "bottomright", main = "Cumulative Returns")

# cluster 4 post-GFC
chart.CumReturns(cbind(dailyReturn(new_data["200909::"]), cluster4_returns["200909::", ]),
                 legend.loc = "bottomright", main = "Cumulative Returns")

###### apply k-means to test set (2015)
test_data <- read.csv("GBP_JPY.csv", stringsAsFactors = FALSE)
colnames(test_data) <- c("Date", "GBP_JPY.Open", "GBP_JPY.High", "GBP_JPY.Low", "GBP_JPY.Close") # quantmod requires these names
test_data$Date <- as.POSIXct(test_data$Date, format = "%d/%m/%Y")
test_data <- as.xts(test_data[, -1], order.by = test_data[, 1])
test_data <- test_data["2015", 1:4]

# create HLC relative to O
test_data$HO <- test_data[, 2] - test_data[, 1]
test_data$LO <- test_data[, 3] - test_data[, 1]
test_data$CO <- test_data[, 4] - test_data[, 1]
test_data_kmeans <- data.frame(test_data[, 5:7]) # predict.kmeans seems to be incompatible with xts object

library(DeducerExtras)
test_clusters <- predict.kmeans(fit, data = test_data_kmeans)
test_data$cluster <- test_clusters
test_data$daily_returns <- dailyReturn(test_data)
test_data$cluster4 <- ifelse(test_data$cluster == 4, 1, 0)
cluster4_test_returns <- lag(test_data$cluster4, 1) * test_data$daily_returns

# comparative performance of cluster 4 vs buy and hold
chart.CumReturns(cbind(dailyReturn(test_data), cluster4_test_returns[-1, ]),
                 legend.loc = "bottomright", main = "Cumulative Returns")