How might we calculate rolling correlations between constituents of an ETF, given a dataframe of prices?
For problems like this, the tidyverse really shines. There are a number of ways to solve this problem … read on for our solution, and let us know if you’d approach it differently!
First, we load some packages and some data that we extracted earlier. `xlfprices.RData` contains a dataframe, `prices_xlf`, of constituents of the XLF ETF and their daily prices. You can get this data from our GitHub repository.
The dataset isn’t entirely accurate, as it contains prices of today’s constituents and doesn’t account for historical changes to the makeup of the ETF. But that won’t matter for our purposes.
```r
library(tidyverse)
library(lubridate)
library(glue)
library(here)

theme_set(theme_bw())

load(here::here("data", "xlfprices.RData"))

prices_xlf %>%
  head(10)
```
```
## # A tibble: 10 x 10
##    ticker date        open  high   low close   volume dividends closeunadj inSPX
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>      <dbl> <lgl>
##  1 AFL    2019-11-29  54.8  55.1  54.8  54.8  1270649         0       54.8 TRUE
##  2 AIG    2019-11-29  52.8  53.2  52.6  52.7  2865501         0       52.7 TRUE
##  3 AIZ    2019-11-29 133.  134.  133.  133.    202854         0      133.  TRUE
##  4 AJG    2019-11-29  93.3  93.6  93.0  93.3   392489         0       93.3 TRUE
##  5 ALL    2019-11-29 112.  112.  111.  111.    817942         0      111.  TRUE
##  6 AMP    2019-11-29 164.  165.  163.  164.    404660         0      164.  TRUE
##  7 AON    2019-11-29 203.  204.  202.  204.    415940         0      204.  TRUE
##  8 AXP    2019-11-29 120.  121.  120.  120.   1961463         0      120.  TRUE
##  9 BAC    2019-11-29  33.4  33.5  33.2  33.3 19503395         0       33.3 TRUE
## 10 BEN    2019-11-29  27.8  27.9  27.4  27.5  1485635         0       27.5 TRUE
```
We’d like to be able to calculate rolling average pairwise correlations between all the stocks as tidily as possible.
That requires that we calculate the rolling pairwise correlation between all the stock combinations in the index and then take the mean of all those.
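Concretely, if the ETF has $N$ constituents, the quantity we’re after on each day $t$ is the mean of the rolling correlations over all unique pairs:

$$\bar{\rho}_t = \frac{2}{N(N-1)} \sum_{i < j} \rho_{i,j,t}$$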
A good way to tackle such problems is to chunk them down into bite-sized pieces and then solve each piece in turn. We split the problem into the following steps:
- calculate returns for each ticker
- create a long dataframe of all the pairwise ticker combinations for each day by doing a full join of the data on itself, keyed by date
- remove instances where we have the same stock twice (corresponding to the diagonal of the correlation matrix)
- remove instances where we have the complementary pair of the same stocks, e.g. we only want one of AAPL-GOOG and GOOG-AAPL (this is equivalent to removing the upper or lower triangle of the correlation matrix)
- use `slider::slide2_dbl` to do the rolling correlation calculation
- group by date and take the mean
Calculating returns
The first step is straightforward – we simply calculate close-to-close returns and return a long dataframe of dates, tickers, and returns:
```r
# calculate returns to each stock
df <- prices_xlf %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(return = close / dplyr::lag(close) - 1) %>%
  select(date, ticker, return)

# function for prettier web display
pretty_table <- function(df) {
  require(kableExtra)

  df %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}

df %>%
  head(50) %>%
  pretty_table()
```
date | ticker | return |
---|---|---|
2015-01-02 | AFL | NA |
2015-01-05 | AFL | -0.0261952 |
2015-01-06 | AFL | -0.0089106 |
2015-01-07 | AFL | 0.0062765 |
2015-01-08 | AFL | 0.0097775 |
2015-01-09 | AFL | -0.0212020 |
2015-01-12 | AFL | -0.0057991 |
2015-01-13 | AFL | -0.0049751 |
2015-01-14 | AFL | -0.0077586 |
2015-01-15 | AFL | -0.0033015 |
2015-01-16 | AFL | 0.0142957 |
2015-01-20 | AFL | -0.0029220 |
2015-01-21 | AFL | 0.0015515 |
2015-01-22 | AFL | 0.0180723 |
2015-01-23 | AFL | -0.0071006 |
2015-01-26 | AFL | 0.0049379 |
2015-01-27 | AFL | -0.0086411 |
2015-01-28 | AFL | -0.0196548 |
2015-01-29 | AFL | 0.0050558 |
2015-01-30 | AFL | -0.0098873 |
2015-02-02 | AFL | 0.0190960 |
2015-02-03 | AFL | 0.0158157 |
2015-02-04 | AFL | 0.0269081 |
2015-02-05 | AFL | 0.0049440 |
2015-02-06 | AFL | 0.0049196 |
2015-02-09 | AFL | -0.0022846 |
2015-02-10 | AFL | 0.0050703 |
2015-02-11 | AFL | 0.0061839 |
2015-02-12 | AFL | -0.0004852 |
2015-02-13 | AFL | 0.0053398 |
2015-02-17 | AFL | 0.0035410 |
2015-02-18 | AFL | -0.0096231 |
2015-02-19 | AFL | 0.0004858 |
2015-02-20 | AFL | 0.0056653 |
2015-02-23 | AFL | -0.0112667 |
2015-02-24 | AFL | 0.0102556 |
2015-02-25 | AFL | -0.0017725 |
2015-02-26 | AFL | 0.0037127 |
2015-02-27 | AFL | 0.0011258 |
2015-03-02 | AFL | 0.0051406 |
2015-03-03 | AFL | -0.0046348 |
2015-03-04 | AFL | 0.0011240 |
2015-03-05 | AFL | 0.0072173 |
2015-03-06 | AFL | -0.0101911 |
2015-03-09 | AFL | 0.0022523 |
2015-03-10 | AFL | -0.0200642 |
2015-03-11 | AFL | 0.0072072 |
2015-03-12 | AFL | 0.0188649 |
2015-03-13 | AFL | -0.0083001 |
2015-03-16 | AFL | 0.0133591 |
Next, we create a long dataframe of all the combinations for each day by doing a full join of the data on itself, by date.
```r
# combinations by date
pairwise_combos <- df %>%
  full_join(df, by = "date")

pairwise_combos %>%
  na.omit() %>%
  head(20) %>%
  pretty_table()
```
date | ticker.x | return.x | ticker.y | return.y |
---|---|---|---|---|
2015-01-05 | AFL | -0.0261952 | AFL | -0.0261952 |
2015-01-05 | AFL | -0.0261952 | AIG | -0.0197826 |
2015-01-05 | AFL | -0.0261952 | AIZ | -0.0224288 |
2015-01-05 | AFL | -0.0261952 | AJG | -0.0059600 |
2015-01-05 | AFL | -0.0261952 | ALL | -0.0198232 |
2015-01-05 | AFL | -0.0261952 | AMP | -0.0320993 |
2015-01-05 | AFL | -0.0261952 | AON | -0.0096470 |
2015-01-05 | AFL | -0.0261952 | AXP | -0.0264459 |
2015-01-05 | AFL | -0.0261952 | BAC | -0.0290503 |
2015-01-05 | AFL | -0.0261952 | BEN | -0.0331591 |
2015-01-05 | AFL | -0.0261952 | BK | -0.0257044 |
2015-01-05 | AFL | -0.0261952 | BLK | -0.0258739 |
2015-01-05 | AFL | -0.0261952 | C | -0.0315149 |
2015-01-05 | AFL | -0.0261952 | CB | -0.0163404 |
2015-01-05 | AFL | -0.0261952 | CB | 0.0000000 |
2015-01-05 | AFL | -0.0261952 | CBOE | 0.0318813 |
2015-01-05 | AFL | -0.0261952 | CFG | -0.0234249 |
2015-01-05 | AFL | -0.0261952 | CINF | -0.0143383 |
2015-01-05 | AFL | -0.0261952 | CMA | -0.0356448 |
2015-01-05 | AFL | -0.0261952 | CME | 0.0056728 |
So far so good.
Now we’ve got some wrangling to do. We want to remove instances where we have the same stock for ticker.x
and ticker.y
, which corresponds to the diagonal on the correlation matrix.
We also want to remove instances where we have the same stock, but with the ticker.x
and ticker.y
designations reversed (this is equivalent to removing the upper or lower triangle of the correlation matrix).
Note that we need to ungroup our dataframe (we grouped it by ticker earlier) – if we don’t, the grouping variable gets added back to the output and thwarts our attempt to filter distinct pairs.
```r
pairwise_combos <- pairwise_combos %>%
  ungroup() %>%  # important!!
  # drop diagonal
  filter(ticker.x != ticker.y) %>%
  # remove duplicate pairs (eg A-AAL, AAL-A)
  mutate(tickers = ifelse(
    ticker.x < ticker.y,
    glue("{ticker.x}, {ticker.y}"),
    glue("{ticker.y}, {ticker.x}")
  )) %>%
  distinct(date, tickers, .keep_all = TRUE)

pairwise_combos %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()
```
date | ticker.x | return.x | ticker.y | return.y | tickers |
---|---|---|---|---|---|
2015-01-05 | AFL | -0.0261952 | AIG | -0.0197826 | AFL, AIG |
2015-01-05 | AFL | -0.0261952 | AIZ | -0.0224288 | AFL, AIZ |
2015-01-05 | AFL | -0.0261952 | AJG | -0.0059600 | AFL, AJG |
2015-01-05 | AFL | -0.0261952 | ALL | -0.0198232 | AFL, ALL |
2015-01-05 | AFL | -0.0261952 | AMP | -0.0320993 | AFL, AMP |
2015-01-05 | AFL | -0.0261952 | AON | -0.0096470 | AFL, AON |
2015-01-05 | AFL | -0.0261952 | AXP | -0.0264459 | AFL, AXP |
2015-01-05 | AFL | -0.0261952 | BAC | -0.0290503 | AFL, BAC |
2015-01-05 | AFL | -0.0261952 | BEN | -0.0331591 | AFL, BEN |
2015-01-05 | AFL | -0.0261952 | BK | -0.0257044 | AFL, BK |
2015-01-05 | AFL | -0.0261952 | BLK | -0.0258739 | AFL, BLK |
2015-01-05 | AFL | -0.0261952 | C | -0.0315149 | AFL, C |
2015-01-05 | AFL | -0.0261952 | CB | -0.0163404 | AFL, CB |
2015-01-05 | AFL | -0.0261952 | CBOE | 0.0318813 | AFL, CBOE |
2015-01-05 | AFL | -0.0261952 | CFG | -0.0234249 | AFL, CFG |
2015-01-05 | AFL | -0.0261952 | CINF | -0.0143383 | AFL, CINF |
2015-01-05 | AFL | -0.0261952 | CMA | -0.0356448 | AFL, CMA |
2015-01-05 | AFL | -0.0261952 | CME | 0.0056728 | AFL, CME |
2015-01-05 | AFL | -0.0261952 | COF | -0.0230331 | AFL, COF |
2015-01-05 | AFL | -0.0261952 | DFS | -0.0223378 | AFL, DFS |
2015-01-05 | AFL | -0.0261952 | ETFC | -0.0324865 | AFL, ETFC |
2015-01-05 | AFL | -0.0261952 | FITB | -0.0301831 | AFL, FITB |
2015-01-05 | AFL | -0.0261952 | FRC | -0.0299424 | AFL, FRC |
2015-01-05 | AFL | -0.0261952 | GL | -0.0179099 | AFL, GL |
2015-01-05 | AFL | -0.0261952 | GS | -0.0312227 | AFL, GS |
2015-01-05 | AFL | -0.0261952 | HBAN | -0.0295238 | AFL, HBAN |
2015-01-05 | AFL | -0.0261952 | HIG | -0.0210577 | AFL, HIG |
2015-01-05 | AFL | -0.0261952 | ICE | 0.0054677 | AFL, ICE |
2015-01-05 | AFL | -0.0261952 | IVZ | -0.0291262 | AFL, IVZ |
2015-01-05 | AFL | -0.0261952 | JPM | -0.0310450 | AFL, JPM |
Next, we’ll use the brilliantly useful slider package and the function `slide2_dbl` to do the rolling correlation calculation (slider implements a number of rolling window calculation functions – we’ll explore it more in another post):
```r
period <- 60

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x,
    .y = return.y,
    .f = ~cor(.x, .y),
    .before = period,
    .complete = TRUE)
  ) %>%
  select(date, tickers, rollingcor)

pairwise_corrs %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()
```
date | tickers | rollingcor |
---|---|---|
2015-04-01 | AFL, AIG | 0.7818676 |
2015-04-02 | AFL, AIG | 0.7718580 |
2015-04-06 | AFL, AIG | 0.7678625 |
2015-04-07 | AFL, AIG | 0.7680022 |
2015-04-08 | AFL, AIG | 0.7813979 |
2015-04-09 | AFL, AIG | 0.7711979 |
2015-04-10 | AFL, AIG | 0.7678292 |
2015-04-13 | AFL, AIG | 0.7469418 |
2015-04-14 | AFL, AIG | 0.7423953 |
2015-04-15 | AFL, AIG | 0.7422602 |
2015-04-16 | AFL, AIG | 0.7375739 |
2015-04-17 | AFL, AIG | 0.7422152 |
2015-04-20 | AFL, AIG | 0.7391824 |
2015-04-21 | AFL, AIG | 0.7285547 |
2015-04-22 | AFL, AIG | 0.7220685 |
2015-04-23 | AFL, AIG | 0.7302781 |
2015-04-24 | AFL, AIG | 0.7220693 |
2015-04-27 | AFL, AIG | 0.6898930 |
2015-04-28 | AFL, AIG | 0.6830527 |
2015-04-29 | AFL, AIG | 0.6761612 |
2015-04-30 | AFL, AIG | 0.6472679 |
2015-05-01 | AFL, AIG | 0.5972614 |
2015-05-04 | AFL, AIG | 0.6544743 |
2015-05-05 | AFL, AIG | 0.6505913 |
2015-05-06 | AFL, AIG | 0.6460236 |
2015-05-07 | AFL, AIG | 0.6449847 |
2015-05-08 | AFL, AIG | 0.6497471 |
2015-05-11 | AFL, AIG | 0.6656422 |
2015-05-12 | AFL, AIG | 0.6721218 |
2015-05-13 | AFL, AIG | 0.6832579 |
The syntax of `slide2_dbl` might look odd if it’s the first time you’ve seen it, but it leverages the tidyverse’s functional programming tools to repeatedly apply a function (given by `.f = ~cor(...)`) over windows of our data specified by `.before` (the number of prior observations to include in each window) and `.complete` (whether to evaluate `.f` on complete windows only).
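To make those window semantics concrete, here’s a toy example (note that `.before` counts prior observations, so each complete window contains `.before + 1` values – with `period = 60`, each correlation uses 61 returns):

```r
# windows grow from the left edge until they reach 3 elements
slider::slide_dbl(1:5, ~ sum(.x), .before = 2)
## [1]  1  3  6  9 12

# with .complete = TRUE, incomplete windows return NA instead
slider::slide_dbl(1:5, ~ sum(.x), .before = 2, .complete = TRUE)
## [1] NA NA  6  9 12
```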
The `~` notation might look odd too. In this case, it’s used as shorthand for an anonymous function: `function(.x, .y) { cor(.x, .y) }`.
So our pipeline of operations above is exactly the same as this one:
```r
pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x,
    .y = return.y,
    .f = function(.x, .y) { cor(.x, .y) },  # long-hand anonymous function
    .before = period,
    .complete = TRUE)
  ) %>%
  select(date, tickers, rollingcor)
```
Now, the other confusing things about this transformation are the seemingly inconsistent arguments in `slide2_dbl`:

- we designate a `.x` and a `.y` argument
- but we also define a function with these arguments
Actually, the `.x` and `.y` names are conventions used throughout the tidyverse to designate variables that are subject to non-standard evaluation (more on what that means in another post – it’s not critical right now). In our `slide2_dbl` function, `.x` is passed as the first argument to `.f` and `.y` is passed as the second.
That means that we could equally write our transformation like this, and it would be equivalent:
```r
pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x,
    .y = return.y,
    .f = function(arg1, arg2) { cor(arg1, arg2) },  # the names of the args don't matter
    .before = period,
    .complete = TRUE)
  ) %>%
  select(date, tickers, rollingcor)
```
Finally, to get the mean rolling correlation of the ETF constituents, we simply group by date and take the mean of the group:
```r
mean_pw_cors <- pairwise_corrs %>%
  group_by(date) %>%
  summarise(mean_pw_corr = mean(rollingcor, na.rm = TRUE))

mean_pw_cors %>%
  na.omit() %>%
  ggplot(aes(x = date, y = mean_pw_corr)) +
  geom_line() +
  labs(
    x = "Date",
    y = "Mean Pairwise Correlation",
    title = "Rolling Mean Pairwise Correlation",
    subtitle = "XLF Constituents"
  )
```
Conclusion
In this post, we broke down our problem of calculating the rolling mean correlation of the constituents of an ETF into various chunks and solved them one at a time to get the desired output.
The tidy data manipulation snippets we used here will be useful for doing similar transformations, such as rolling beta calculations, as well as single-variable rolling calculations such as volatility.
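For instance, a single-variable rolling calculation drops straight into the same pattern via `slider::slide_dbl`. Here’s a minimal sketch of a rolling volatility calculation on the returns dataframe from earlier (the √252 annualisation factor is an assumption for daily data):

```r
# rolling annualised volatility per ticker, using the same window
# conventions as the correlation pipeline above
rolling_vols <- df %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingvol = slider::slide_dbl(
    .x = return,
    .f = ~ sd(.x, na.rm = TRUE) * sqrt(252),  # annualise daily volatility
    .before = period,
    .complete = TRUE)
  ) %>%
  select(date, ticker, rollingvol)
```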
One problem that we glossed over here is that our largest dataframe – the one containing the pairwise combinations of returns – consisted of just under 3 million rows. That means we can easily do this entire piece of analysis in memory.
Things get slightly more difficult if we want to calculate the mean rolling correlation of the constituents of a larger ETF or index.
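As a rough sanity check on those numbers – and on why a bigger universe blows up quickly – the pairwise dataframe has one row per unique pair per day, and the pair count grows quadratically with the number of constituents:

```r
# back-of-envelope size of the pairwise dataframe: unique pairs x days
n_tickers <- n_distinct(prices_xlf$ticker)
n_dates <- n_distinct(prices_xlf$date)
choose(n_tickers, 2) * n_dates  # rows after dropping diagonal and duplicate pairs

# for a 500-stock index, the pair count alone is far larger
choose(500, 2)
## [1] 124750
```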
In another post, we’ll solve this problem for the S&P 500 index. We’ll also consider how the index has changed over time.
Are correlations and covariances like standard deviations and variances? That is, just as it’s frowned upon to average up correlations but okay to average up variances, might you want to calculate the average covariance rather than the average correlation?
I’m not sure that it’s frowned upon to average pairwise correlations as such… perhaps if you were averaging multiple correlation estimates from a single pair of variables you’d run into some problems. But that’s not what we’re doing here.
Ultimately it depends on what you’re doing with the calculation. We’re using it as part of a broader study to understand the effect of ETF-driven trading on constituent stocks. In particular, whether ETF-driven sell-offs in the constituents tend to drive correlations up and lead to exploitable mispricings. In that context, I think the mean pairwise correlation is fine.
Hi Kris! Awesome post! Super useful! The full_join is a very cool approach.
For the rolling correlation with the slider package, things got a bit slow when running over 17 million rows (on a different dataset). I tried tibbletime’s rollify function as an alternative, and it’s also a bit slow. Then I tried the roll package’s roll_cor function, and it runs 30x faster.
The roll package seems to work fine with the tidyverse, except for some functions, like roll_max and roll_min, that crash on my dataset.
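For reference, here’s a rough sketch of how `roll::roll_cor` might slot into the pipeline from the post (an assumption: that `roll_cor` accepts vector inputs inside `mutate`, as the comment suggests; `width = period + 1` matches slider’s `.before = period` with `.complete = TRUE`):

```r
# a sketch, not benchmarked here: roll::roll_cor in place of slider
library(roll)

pairwise_corrs_roll <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = roll_cor(return.x, return.y,
                               width = period + 1)) %>%  # 61-observation windows
  select(date, tickers, rollingcor)
```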
The RollingWindow package won’t get you around the memory issues but is pretty quick. https://github.com/andrewuhl/RollingWindow