How to Calculate Rolling Pairwise Correlations in the Tidyverse

How might we calculate rolling correlations between constituents of an ETF, given a dataframe of prices?

For problems like this, the tidyverse really shines. There are a number of ways to solve this problem … read on for our solution, and let us know if you’d approach it differently!

First, we load some packages and some data that we extracted earlier. xlfprices.RData contains a dataframe, prices_xlf, of constituents of the XLF ETF and their daily prices. You can get this data from our GitHub repository.

The dataset isn’t entirely accurate, as it contains prices of today’s constituents and doesn’t account for historical changes to the makeup of the ETF. But that won’t matter for our purposes.

library(tidyverse)
library(lubridate)
library(glue)
library(here)
theme_set(theme_bw())

load(here::here("data", "xlfprices.RData"))

prices_xlf %>%
  head(10)
## # A tibble: 10 x 10
##    ticker date        open  high   low close   volume dividends closeunadj inSPX
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>      <dbl> <lgl>
##  1 AFL    2019-11-29  54.8  55.1  54.8  54.8  1270649         0       54.8 TRUE 
##  2 AIG    2019-11-29  52.8  53.2  52.6  52.7  2865501         0       52.7 TRUE 
##  3 AIZ    2019-11-29 133.  134.  133.  133.    202854         0      133.  TRUE 
##  4 AJG    2019-11-29  93.3  93.6  93.0  93.3   392489         0       93.3 TRUE 
##  5 ALL    2019-11-29 112.  112.  111.  111.    817942         0      111.  TRUE 
##  6 AMP    2019-11-29 164.  165.  163.  164.    404660         0      164.  TRUE 
##  7 AON    2019-11-29 203.  204.  202.  204.    415940         0      204.  TRUE 
##  8 AXP    2019-11-29 120.  121.  120.  120.   1961463         0      120.  TRUE 
##  9 BAC    2019-11-29  33.4  33.5  33.2  33.3 19503395         0       33.3 TRUE 
## 10 BEN    2019-11-29  27.8  27.9  27.4  27.5  1485635         0       27.5 TRUE
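
Before going further, it's worth sanity-checking the shape of the data. A quick sketch along these lines counts the tickers and checks the date range:

# quick look at the shape of the data
prices_xlf %>%
  summarise(
    n_tickers = n_distinct(ticker),
    first_date = min(date),
    last_date = max(date)
  )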

We’d like to be able to calculate rolling average pairwise correlations between all the stocks as tidily as possible.

That requires that we calculate the rolling pairwise correlation between all the stock combinations in the index and then take the mean of all those.

A good way to tackle such problems is to chunk them down into bite-sized pieces and then solve each piece in turn. We split the problem into the following steps:

  • calculate returns for each ticker
  • create a long dataframe of all the pairwise ticker combinations for each day by doing a full join of the data on itself, keyed by date
  • remove instances where we have the same stock twice (corresponding to the diagonal of the correlation matrix)
  • remove instances where we have the complementary pair of the same stocks, eg we only want one of AAPL-GOOG and GOOG-AAPL (this is equivalent to removing the upper or lower triangle of the correlation matrix)
  • use slider::slide2_dbl to do the rolling correlation calculation
  • group by date and take the mean

Calculating returns

The first step is straightforward – we simply calculate close-to-close returns and return a long dataframe of dates, tickers, and returns:

# calculate returns to each stock
df <- prices_xlf %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(return = close / dplyr::lag(close) - 1) %>%
  select(date, ticker, return)

# function for prettier web display
pretty_table <- function(df) {
  require(kableExtra)

  df %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}

df %>%
  head(50) %>%
  pretty_table()
date ticker return
2015-01-02 AFL NA
2015-01-05 AFL -0.0261952
2015-01-06 AFL -0.0089106
2015-01-07 AFL 0.0062765
2015-01-08 AFL 0.0097775
2015-01-09 AFL -0.0212020
2015-01-12 AFL -0.0057991
2015-01-13 AFL -0.0049751
2015-01-14 AFL -0.0077586
2015-01-15 AFL -0.0033015
2015-01-16 AFL 0.0142957
2015-01-20 AFL -0.0029220
2015-01-21 AFL 0.0015515
2015-01-22 AFL 0.0180723
2015-01-23 AFL -0.0071006
2015-01-26 AFL 0.0049379
2015-01-27 AFL -0.0086411
2015-01-28 AFL -0.0196548
2015-01-29 AFL 0.0050558
2015-01-30 AFL -0.0098873
2015-02-02 AFL 0.0190960
2015-02-03 AFL 0.0158157
2015-02-04 AFL 0.0269081
2015-02-05 AFL 0.0049440
2015-02-06 AFL 0.0049196
2015-02-09 AFL -0.0022846
2015-02-10 AFL 0.0050703
2015-02-11 AFL 0.0061839
2015-02-12 AFL -0.0004852
2015-02-13 AFL 0.0053398
2015-02-17 AFL 0.0035410
2015-02-18 AFL -0.0096231
2015-02-19 AFL 0.0004858
2015-02-20 AFL 0.0056653
2015-02-23 AFL -0.0112667
2015-02-24 AFL 0.0102556
2015-02-25 AFL -0.0017725
2015-02-26 AFL 0.0037127
2015-02-27 AFL 0.0011258
2015-03-02 AFL 0.0051406
2015-03-03 AFL -0.0046348
2015-03-04 AFL 0.0011240
2015-03-05 AFL 0.0072173
2015-03-06 AFL -0.0101911
2015-03-09 AFL 0.0022523
2015-03-10 AFL -0.0200642
2015-03-11 AFL 0.0072072
2015-03-12 AFL 0.0188649
2015-03-13 AFL -0.0083001
2015-03-16 AFL 0.0133591


Next, we create a long dataframe of all the combinations for each day by doing a full join of the data on itself, by date.

# combinations by date
pairwise_combos <- df %>%
  full_join(df, by = "date")  
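
As an aside: on recent versions of dplyr (1.1.0 and later), this self-join will raise a warning about a many-to-many relationship, since each date matches many rows on both sides. That's expected here, and you can silence the warning by declaring the relationship explicitly – a sketch, assuming dplyr >= 1.1.0:

# explicitly declare the many-to-many self-join (dplyr >= 1.1.0)
pairwise_combos <- df %>%
  full_join(df, by = "date", relationship = "many-to-many")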

pairwise_combos %>%
  na.omit() %>%
  head(20) %>%
  pretty_table()
date ticker.x return.x ticker.y return.y
2015-01-05 AFL -0.0261952 AFL -0.0261952
2015-01-05 AFL -0.0261952 AIG -0.0197826
2015-01-05 AFL -0.0261952 AIZ -0.0224288
2015-01-05 AFL -0.0261952 AJG -0.0059600
2015-01-05 AFL -0.0261952 ALL -0.0198232
2015-01-05 AFL -0.0261952 AMP -0.0320993
2015-01-05 AFL -0.0261952 AON -0.0096470
2015-01-05 AFL -0.0261952 AXP -0.0264459
2015-01-05 AFL -0.0261952 BAC -0.0290503
2015-01-05 AFL -0.0261952 BEN -0.0331591
2015-01-05 AFL -0.0261952 BK -0.0257044
2015-01-05 AFL -0.0261952 BLK -0.0258739
2015-01-05 AFL -0.0261952 C -0.0315149
2015-01-05 AFL -0.0261952 CB -0.0163404
2015-01-05 AFL -0.0261952 CB 0.0000000
2015-01-05 AFL -0.0261952 CBOE 0.0318813
2015-01-05 AFL -0.0261952 CFG -0.0234249
2015-01-05 AFL -0.0261952 CINF -0.0143383
2015-01-05 AFL -0.0261952 CMA -0.0356448
2015-01-05 AFL -0.0261952 CME 0.0056728


So far so good.

Now we’ve got some wrangling to do. We want to remove instances where we have the same stock for ticker.x and ticker.y, which corresponds to the diagonal on the correlation matrix.

We also want to remove instances where we have the same stock, but with the ticker.x and ticker.y designations reversed (this is equivalent to removing the upper or lower triangle of the correlation matrix).

Note that we need to ungroup our dataframe (we grouped it by ticker earlier) – if we don’t, the grouping variable will be added back and thwart our attempts to filter distinct cases.

pairwise_combos <- pairwise_combos %>%
  ungroup() %>%  # important!! 
# drop diagonal 
  filter(ticker.x != ticker.y) %>% 
# remove duplicate pairs (eg A-AAL, AAL-A)
  mutate(tickers = ifelse(ticker.x < ticker.y, glue("{ticker.x}, {ticker.y}"), glue("{ticker.y}, {ticker.x}"))) %>%
  distinct(date, tickers, .keep_all = TRUE) 
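
As a quick sanity check, each date should now have n * (n - 1) / 2 rows – one per unordered pair – where n is the number of tickers with data on that date. Something like this (a sketch) lets us eyeball that:

# count pairs per date: expect n * (n - 1) / 2 for n tickers
pairwise_combos %>%
  group_by(date) %>%
  summarise(n_pairs = n()) %>%
  head()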

pairwise_combos %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()
date ticker.x return.x ticker.y return.y tickers
2015-01-05 AFL -0.0261952 AIG -0.0197826 AFL, AIG
2015-01-05 AFL -0.0261952 AIZ -0.0224288 AFL, AIZ
2015-01-05 AFL -0.0261952 AJG -0.0059600 AFL, AJG
2015-01-05 AFL -0.0261952 ALL -0.0198232 AFL, ALL
2015-01-05 AFL -0.0261952 AMP -0.0320993 AFL, AMP
2015-01-05 AFL -0.0261952 AON -0.0096470 AFL, AON
2015-01-05 AFL -0.0261952 AXP -0.0264459 AFL, AXP
2015-01-05 AFL -0.0261952 BAC -0.0290503 AFL, BAC
2015-01-05 AFL -0.0261952 BEN -0.0331591 AFL, BEN
2015-01-05 AFL -0.0261952 BK -0.0257044 AFL, BK
2015-01-05 AFL -0.0261952 BLK -0.0258739 AFL, BLK
2015-01-05 AFL -0.0261952 C -0.0315149 AFL, C
2015-01-05 AFL -0.0261952 CB -0.0163404 AFL, CB
2015-01-05 AFL -0.0261952 CBOE 0.0318813 AFL, CBOE
2015-01-05 AFL -0.0261952 CFG -0.0234249 AFL, CFG
2015-01-05 AFL -0.0261952 CINF -0.0143383 AFL, CINF
2015-01-05 AFL -0.0261952 CMA -0.0356448 AFL, CMA
2015-01-05 AFL -0.0261952 CME 0.0056728 AFL, CME
2015-01-05 AFL -0.0261952 COF -0.0230331 AFL, COF
2015-01-05 AFL -0.0261952 DFS -0.0223378 AFL, DFS
2015-01-05 AFL -0.0261952 ETFC -0.0324865 AFL, ETFC
2015-01-05 AFL -0.0261952 FITB -0.0301831 AFL, FITB
2015-01-05 AFL -0.0261952 FRC -0.0299424 AFL, FRC
2015-01-05 AFL -0.0261952 GL -0.0179099 AFL, GL
2015-01-05 AFL -0.0261952 GS -0.0312227 AFL, GS
2015-01-05 AFL -0.0261952 HBAN -0.0295238 AFL, HBAN
2015-01-05 AFL -0.0261952 HIG -0.0210577 AFL, HIG
2015-01-05 AFL -0.0261952 ICE 0.0054677 AFL, ICE
2015-01-05 AFL -0.0261952 IVZ -0.0291262 AFL, IVZ
2015-01-05 AFL -0.0261952 JPM -0.0310450 AFL, JPM


Next, we’ll use the brilliantly useful slider package and the function slide2_dbl to do the rolling correlation calculation (slider implements a number of rolling window calculation functions – we’ll explore it more in another post):

period <- 60

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x, 
    .y = return.y, 
    .f = ~cor(.x, .y), 
    .before = period,   # window includes the 60 prior observations plus the current one
    .complete = TRUE)   # only evaluate on complete windows; return NA otherwise
    ) %>%
  select(date, tickers, rollingcor)

pairwise_corrs %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()
date tickers rollingcor
2015-04-01 AFL, AIG 0.7818676
2015-04-02 AFL, AIG 0.7718580
2015-04-06 AFL, AIG 0.7678625
2015-04-07 AFL, AIG 0.7680022
2015-04-08 AFL, AIG 0.7813979
2015-04-09 AFL, AIG 0.7711979
2015-04-10 AFL, AIG 0.7678292
2015-04-13 AFL, AIG 0.7469418
2015-04-14 AFL, AIG 0.7423953
2015-04-15 AFL, AIG 0.7422602
2015-04-16 AFL, AIG 0.7375739
2015-04-17 AFL, AIG 0.7422152
2015-04-20 AFL, AIG 0.7391824
2015-04-21 AFL, AIG 0.7285547
2015-04-22 AFL, AIG 0.7220685
2015-04-23 AFL, AIG 0.7302781
2015-04-24 AFL, AIG 0.7220693
2015-04-27 AFL, AIG 0.6898930
2015-04-28 AFL, AIG 0.6830527
2015-04-29 AFL, AIG 0.6761612
2015-04-30 AFL, AIG 0.6472679
2015-05-01 AFL, AIG 0.5972614
2015-05-04 AFL, AIG 0.6544743
2015-05-05 AFL, AIG 0.6505913
2015-05-06 AFL, AIG 0.6460236
2015-05-07 AFL, AIG 0.6449847
2015-05-08 AFL, AIG 0.6497471
2015-05-11 AFL, AIG 0.6656422
2015-05-12 AFL, AIG 0.6721218
2015-05-13 AFL, AIG 0.6832579


The syntax of slide2_dbl might look odd if it’s the first time you’ve seen it, but it leverages the tidyverse’s functional programming tools to repeatedly apply a function (given by .f = ~cor(.x, .y)) over windows of our data specified by .before (the number of prior periods to include in each window) and .complete (whether to evaluate .f on complete windows only).
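
A toy example makes the windowing behaviour concrete. Note that each window includes the current element as well as the .before prior elements – so .before = 60 above gives windows of 61 observations:

# 3-element windows: the current element plus the 2 before it,
# with incomplete windows returning NA
slider::slide_dbl(1:5, mean, .before = 2, .complete = TRUE)
## [1] NA NA  2  3  4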

The ~ notation might look odd too. In this case, it’s used as shorthand for an anonymous function: function(.x, .y) {cor(.x, .y)}

So our pipeline of operations above is exactly the same as this one:

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x, 
    .y = return.y, 
    .f = function(.x, .y) { cor(.x, .y) },  # long-hand anonymous function
    .before = period, 
    .complete = TRUE)
    ) %>%
  select(date, tickers, rollingcor)

Now, the other potentially confusing thing about this transformation is the seemingly inconsistent arguments in slide2_dbl:

  • we designate a .x and a .y argument
  • but we also define a function with these arguments

Actually, the .x and .y names are argument-naming conventions used throughout the tidyverse, particularly in purrr-style functions and lambdas (more on that in another post – it’s not critical right now). In our slide2_dbl call, .x is passed as the first argument to .f and .y is passed as the second.
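
The same convention shows up in purrr’s map functions – for example, map2_dbl passes its first input as .x and its second as .y:

# purrr uses the same .x/.y convention for two-input functions
purrr::map2_dbl(1:3, 4:6, ~ .x + .y)
## [1] 5 7 9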

That means that we could equally write our transformation like this, and it would be equivalent:

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x, 
    .y = return.y, 
    .f = function(arg1, arg2) { cor(arg1, arg2) },  # the name of the args doesn't matter
    .before = period, 
    .complete = TRUE)
    ) %>%
  select(date, tickers, rollingcor)

Finally, to get the mean rolling correlation of the ETF constituents, we simply group by date and take the mean of the group:

mean_pw_cors <- pairwise_corrs %>%
  group_by(date) %>%
  summarise(mean_pw_corr = mean(rollingcor, na.rm = TRUE))

mean_pw_cors %>%
  na.omit() %>%
  ggplot(aes(x = date, y = mean_pw_corr)) +
    geom_line() +
    labs(
      x = "Date",
      y = "Mean Pairwise Correlation",
      title = "Rolling Mean Pairwise Correlation",
      subtitle = "XLF Constituents"
    )

[Plot: rolling mean pairwise correlation of the XLF constituents over time]

Conclusion

In this post, we broke down our problem of calculating the rolling mean correlation of the constituents of an ETF into various chunks and solved them one at a time to get the desired output.

The tidy data manipulation snippets we used here will be useful for doing similar transformations, such as rolling beta calculations, as well as single-variable rolling calculations such as volatility.
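
For example, a rolling beta calculation follows exactly the same pattern – here’s a sketch, assuming we’ve joined a hypothetical benchmark_return column onto our returns dataframe:

# sketch: rolling beta of each stock to a benchmark
# (benchmark_return is a hypothetical column, not in our dataset)
rolling_betas <- df %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingbeta = slider::slide2_dbl(
    .x = return,
    .y = benchmark_return,
    .f = ~cov(.x, .y) / var(.y),  # beta = cov(stock, benchmark) / var(benchmark)
    .before = period,
    .complete = TRUE)
    ) %>%
  select(date, ticker, rollingbeta)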

One problem that we glossed over here is that our largest dataframe – the one containing the pairwise combinations of returns – consisted of just under 3 million rows. That means we can easily do this entire piece of analysis in memory.
Things get slightly more difficult if we want to calculate the mean rolling correlation of the constituents of a larger ETF or index.

In another post, we’ll solve this problem for the S&P 500 index. We’ll also consider how the index has changed over time.

9 thoughts on “How to Calculate Rolling Pairwise Correlations in the Tidyverse”

  1. Are correlations and covariances like standard deviations and variances? I.e. just as it’s frowned upon to average correlations but okay to average variances, might you want to calculate the average covariance rather than the average correlation?

    • I’m not sure that it’s frowned upon to average pairwise correlations as such… perhaps if you were averaging multiple correlation estimates from a single pair of variables you’d run into some problems. But that’s not what we’re doing here.

      Ultimately it depends on what you’re doing with the calculation. We’re using it as part of a broader study to understand the effect of ETF-driven trading on constituent stocks. In particular, whether ETF-driven sell-offs in the constituents tend to drive correlations up and lead to exploitable mispricings. In that context, I think the mean pairwise correlation is fine.

  2. Hi Kris! Awesome post! Super useful! The full_join is a very cool approach.

    For the rolling correlation with the slider package, it got a bit slow when running over 17 million rows (on a different dataset). I tried tibbletime’s “rollify” function as an alternative, and it’s also a bit slow. Then I tried the “Roll” package with the roll_cor function, and it runs 30X faster.
    The Roll package seems to work fine with the tidyverse, except for some functions, like roll_max and roll_min, that crash on my dataset.
