You don't need to keep chasing that one perfect strategy...
See how you can use algos to systematise your trading, remove emotions and finally make your capital grow!
Grab your FREE Guide to Algo Trading PDF:
In the last article, I described an application of the k-means clustering algorithm for classifying candlesticks based on the relative position of their open, high, low and close. This was a simple enough exercise, but now I tackle something more challenging: isolating information that is both useful and practical to real trading. I’ll initially try two approaches:
The insights gained from this analysis will hopefully inform the next direction of this research.
In the last article, I classified twelve months of daily candles (June 2014 – July 2015) into eight clusters. To simplify the analysis and ensure that enough instances of each cluster are observed, I’ll reduce the number of clusters to four and extend the history to cover 2008-2015. I’ll exclude my 2015 data for now in case I need a final, unseen test set at some point in the future.
Here’s a subset of the candles over the entire price history (2008-2014, 2015 is held out) grouped by cluster:
Roughly speaking, cluster one corresponds to a significant up day, cluster two to a significant down day, cluster three to a small up day (although there are a few instances of small down days and neutral days), and cluster four to a small down day (although again there are a few instances of small up days and doji-type candles).
The clusters are not visually perfectly homogeneous. For example, in cluster two there are one or two instances of candles with a large range, but whose overall downwards movement is small. These candles resemble what we know as ‘hammers’ or ‘pins’ in traditional candlestick parlance. Cluster three, while dominated by small up candles, also includes many small down days and hammers/pins. Cluster four is similarly imperfect. My trading experience tells me however that theoretical perfection, while interesting, is an indulgence best left to academic practitioners. I am much more interested in whether the information my research brings to light holds up enough to make a profit.
For part one of the analysis, I need a baseline for comparative purposes. The obvious baseline is simply the proportional occurrence of each cluster across the data set. Below is a bar plot describing this (excluding the 2015 data). I’ve added a line graph showing the same information as this will be used for comparative purposes below.
Any lagged clustering patterns will need to be statistically significantly different from these baseline occurrences to be of interest.
To start with, let’s look at the lagged proportional cluster occurrences across the entire data set. In the table below, the columns represent the percentage proportion of each cluster that immediately follows the cluster in the rows (rounded for viewing purposes). This information is reproduced as a series of line graphs in the figure below.
Looking at these results, what can we infer? Some observations that stand out are:
I considered investigating the stability of these results with respect to time. For example, it would be possible to set up a moving window on the data set and look at whether these relationships vary across time. However, I’m not sure it would reveal any great insight that is actionable in a practical way. I might come back to this later. For now, lets jump into something more interesting that is more likely to reveal the information I am interested in: returns associated with individual clusters.
For the second part of the analysis I’ll simply construct a cumulative returns series for a holding period of one day for each cluster. If anything interesting turns up in this analysis, I willl attempt to validate it using my 2015 out of sample data before using it in a trading system. Transaction costs are not included in this analysis.
Here is a chart of the cumulative next day returns for each cluster as well as the cumulative daily returns of the underlying instrument.
These returns are nothing to get excited about, but a few things stand out. Firstly, the next day returns associated with cluster four significantly out-perform the underlying. More interestingly, it seems to be agnostic to the prevailing trend in the underlying, except for the GFC period. Cluster one also outperforms the underlying, but is somewhat correlated. Cluster two does not contain enough instances to make any useful judgements, and cluster three is highly correlated with the underlying.
Below is the chart of cumulative next day returns for cluster four in the post GFC period from September 2009 (the time when the US government stepped in to rescue Fannie Mae and Freddie Mac). I realise that I am cherry picking from my results, however it does appear that there is value in investigating the performance of cluster four as financial markets adjusted to a new regime.
Performance is encouraging. Cluster four out-performs the underlying and does so with a much lower drawdown. However, it would have been impossible to trade this as a system during the period shown since the clusters were identified using that very data. Next, I will investigate cluster four on my out of sample data, which was not used in the cluster identification process.
Cluster four out-performs the underlying by a small margin, but it does so with a much more benign drawdown, although its trend-agnosticism seems to have disappeared. The evidence suggests that cluster four has a small but useful predictive utility.
I’ve presented a simple application of the k-means clustering algorithm to the prediction of a single financial instrument. I found that the relative values of the daily open, high, low and close appear to have a degree of predictive power for this instrument, particularly in the post-GFC period. These results were validated in an out of sample test. Although gross returns were not spectacular, they were achieved with a much lower drawdown.
A benefit of using an unsupervised learner is that it reduces the chances of over-fitting since there is no optimization on some target value. The trade off is that the information discovered by the learner may have little or no intrinsic utility. In this case, the number of clusters could be an optimization parameter if we varied it and selected the value that resulted in the best returns series.
I used only a small amount of data, made a number of assumptions and applied the approach to just one instrument, chosen at random. There are numerous potential future research directions to take this idea further, including:
Here’s the complete R code used in this analysis:
# read in data
data <- read.csv("GBP_JPY.csv", stringsAsFactors = F)
colnames(data) <- c("Date", "GBP_JPY.Open", "GBP_JPY.High", "GBP_JPY.Low", "GBP_JPY.Close") # quantmod requires these names
data$Date <- as.POSIXct(data$Date, format = "%d/%m/%Y")
data <- as.xts(data[, -1], order.by = data[, 1])
data <- data["2008::2014", 1:4] # in-sample data set
# create HLC relative to O
data$HO <- data[,2]-data[,1]
data$LO <- data[,3]-data[,1]
data$CO <- data[,4]-data[,1]
# # K-Means Clustering with clusters based on HO, LO, CO
class_factors <- data[, 5:7]
set.seed(123) # required in order to reproduce results
fit <- kmeans(class_factors,4)
m <- fit$cluster # vector of the cluster assigned to each candle
# which canldes were classifed into each cluster?
cluster <- as.xts(m)
index(cluster) <- index(data) #coerce index of cluster series to match data's index
new_data <- merge.xts(data, cluster)
# plot candles by cluster
chart_Series(xts(coredata(sample_data)[order(sample_data$cluster),],type="candlesticks", order.by = index(sample_data), theme = chartTheme('black',up.col='green',dn.col='red')))
# count and proportion of each cluster's occurrence in training data
cluster_count <- count(new_data, vars = "cluster")
cluster_count$prop_percent <- cluster_count$freq*100/sum(cluster_count$freq)
ggplot(cluster_count, aes(x = cluster, y= prop_percent)) + geom_bar(stat = 'identity', fill = 'blue') + geom_line(stat = 'identity', colour = 'black') # plot as bars
# proportaional probability table for next candle
count_table <- table(new_data$cluster , lag(new_data$cluster, 1))
prop_table <- prop.table(table(new_data$cluster , lag(new_data$cluster, 1)), 1) * 100
prop_df <- data.frame(round(prop_table))
colnames(prop_df) <- c("Following_Cluster", "Cluster", "Proportion_percent")
ggplot(prop_df, aes(x=Cluster, y=Proportion_percent, group=Following_Cluster)) + geom_line(aes(colour = Following_Cluster))
# returns analysis
new_data$daily_returns <- dailyReturn(new_data)
new_data$cluster1 <- ifelse(new_data$cluster == 1, 1, 0)
new_data$cluster2 <- ifelse(new_data$cluster == 2, 1, 0)
new_data$cluster3 <- ifelse(new_data$cluster == 3, 1, 0)
new_data$cluster4 <- ifelse(new_data$cluster == 4, 1, 0)
cluster1_returns <- lag(new_data$cluster1, 1) * new_data$daily_returns
cluster2_returns <- lag(new_data$cluster2, 1) * new_data$daily_returns
cluster3_returns <- lag(new_data$cluster3, 1) * new_data$daily_returns
cluster4_returns <- lag(new_data$cluster4, 1) * new_data$daily_returns
# comparitive performance of each cluster vs buy and hold
chart.CumReturns(cbind(dailyReturn(new_data), cluster1_returns[-1,], cluster2_returns[-1,], cluster3_returns[-1,], cluster4_returns[-1,]), legend.loc = "bottomright", main = "Cumulative Returns")
# cluster 4 post-GFC
chart.CumReturns(cbind(dailyReturn(new_data["200909::"]), cluster4_returns["200909::",]), legend.loc = "bottomright", main = "Cumulative Returns")
###### apply k-menas to test set (2015)
test_data <- read.csv("GBP_JPY.csv", stringsAsFactors = F)
colnames(test_data) <- c("Date", "GBP_JPY.Open", "GBP_JPY.High", "GBP_JPY.Low", "GBP_JPY.Close") # quantmod requires these names
test_data$Date <- as.POSIXct(test_data$Date, format = "%d/%m/%Y")
test_data <- as.xts(test_data[, -1], order.by = test_data[, 1])
test_data <- test_data["2015", 1:4] #nrow(data), 1:4]
# create HLC relative to O
test_data$HO <- test_data[,2]-test_data[,1]
test_data$LO <- test_data[,3]-test_data[,1]
test_data$CO <- test_data[,4]-test_data[,1]
test_data_kmeans <- data.frame(test_data[, 5:7]) # predict.kmeans seems to be incompatible with xts object
test_clusters <- predict.kmeans(fit, data = test_data_kmeans)
test_data$cluster <- test_clusters
test_data$daily_returns <- dailyReturn(test_data)
test_data$cluster4 <- ifelse(test_data$cluster == 4, 1, 0)
cluster4_test_returns <- lag(test_data$cluster4, 1) * test_data$daily_returns
# comparitive performance of cluster 4 vs buy and hold
chart.CumReturns(cbind(dailyReturn(test_data), cluster4_test_returns[-1, ]), legend.loc = "bottomright", main = "Cumulative Returns")
Before you continue....
Want to see how others trade for a living with algos — so maybe you can too?
Learn where to start and see how systematic retail traders generate profit long-term: