How to Calculate Rolling Pairwise Correlations in the Tidyverse

How might we calculate rolling correlations between constituents of an ETF, given a dataframe of prices?

For problems like this, the tidyverse really shines. There are a number of ways to solve this problem … read on for our solution, and let us know if you’d approach it differently!

First, we load some packages and some data that we extracted earlier. xlfprices.RData contains a dataframe, prices_xlf, of constituents of the XLF ETF and their daily prices. You can get this data from our GitHub repository.

The dataset isn’t entirely accurate, as it contains prices of today’s constituents and doesn’t account for historical changes to the makeup of the ETF. But that won’t matter for our purposes.

library(tidyverse)
library(lubridate)
library(glue)
library(here)
theme_set(theme_bw())

load(here::here("data", "xlfprices.RData"))

prices_xlf %>%
  head(10)

## # A tibble: 10 x 10
##    ticker date        open  high   low close   volume dividends closeunadj inSPX
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>      <dbl> <lgl>
##  1 AFL    2019-11-29  54.8  55.1  54.8  54.8  1270649         0       54.8 TRUE 
##  2 AIG    2019-11-29  52.8  53.2  52.6  52.7  2865501         0       52.7 TRUE 
##  3 AIZ    2019-11-29 133.  134.  133.  133.    202854         0      133.  TRUE 
##  4 AJG    2019-11-29  93.3  93.6  93.0  93.3   392489         0       93.3 TRUE 
##  5 ALL    2019-11-29 112.  112.  111.  111.    817942         0      111.  TRUE 
##  6 AMP    2019-11-29 164.  165.  163.  164.    404660         0      164.  TRUE 
##  7 AON    2019-11-29 203.  204.  202.  204.    415940         0      204.  TRUE 
##  8 AXP    2019-11-29 120.  121.  120.  120.   1961463         0      120.  TRUE 
##  9 BAC    2019-11-29  33.4  33.5  33.2  33.3 19503395         0       33.3 TRUE 
## 10 BEN    2019-11-29  27.8  27.9  27.4  27.5  1485635         0       27.5 TRUE

We’d like to be able to calculate rolling average pairwise correlations between all the stocks as tidily as possible.

That requires that we calculate the rolling pairwise correlation between all the stock combinations in the index and then take the mean of all those.

A good way to tackle such problems is to chunk them down into bite-sized pieces and then solve each piece in turn. We split the problem into the following steps:

calculate returns for each ticker
create a long dataframe of all the pairwise ticker combinations for each day by doing a full join of the data on itself, keyed by date
remove instances where we had the same stock twice (corresponding to the diagonal of the correlation matrix)
remove instances where we have the complementary pair of the same stocks, eg we only want one of APPL-GOOG and GOOG-APPL (this is equivalent to removing the upper or lower triangle of the correlation matrix)
use slider::slide2_dbl to do the rolling correlation calculation
group by date and take the mean

Calculating returns

The first step is straightforward – we simply calculate close-to-close returns and return a long dataframe of dates, tickers, and returns:

# calculate returns to each stock
df <- prices_xlf %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(return = close / dplyr::lag(close) - 1) %>%
  select(date, ticker, return)

# function for prettier web display
pretty_table <- function(df) {
  require(kableExtra)

  df %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}

df %>%
  head(50) %>%
  pretty_table()

date	ticker	return
2015-01-02	AFL	NA
2015-01-05	AFL	-0.0261952
2015-01-06	AFL	-0.0089106
2015-01-07	AFL	0.0062765
2015-01-08	AFL	0.0097775
2015-01-09	AFL	-0.0212020
2015-01-12	AFL	-0.0057991
2015-01-13	AFL	-0.0049751
2015-01-14	AFL	-0.0077586
2015-01-15	AFL	-0.0033015
2015-01-16	AFL	0.0142957
2015-01-20	AFL	-0.0029220
2015-01-21	AFL	0.0015515
2015-01-22	AFL	0.0180723
2015-01-23	AFL	-0.0071006
2015-01-26	AFL	0.0049379
2015-01-27	AFL	-0.0086411
2015-01-28	AFL	-0.0196548
2015-01-29	AFL	0.0050558
2015-01-30	AFL	-0.0098873
2015-02-02	AFL	0.0190960
2015-02-03	AFL	0.0158157
2015-02-04	AFL	0.0269081
2015-02-05	AFL	0.0049440
2015-02-06	AFL	0.0049196
2015-02-09	AFL	-0.0022846
2015-02-10	AFL	0.0050703
2015-02-11	AFL	0.0061839
2015-02-12	AFL	-0.0004852
2015-02-13	AFL	0.0053398
2015-02-17	AFL	0.0035410
2015-02-18	AFL	-0.0096231
2015-02-19	AFL	0.0004858
2015-02-20	AFL	0.0056653
2015-02-23	AFL	-0.0112667
2015-02-24	AFL	0.0102556
2015-02-25	AFL	-0.0017725
2015-02-26	AFL	0.0037127
2015-02-27	AFL	0.0011258
2015-03-02	AFL	0.0051406
2015-03-03	AFL	-0.0046348
2015-03-04	AFL	0.0011240
2015-03-05	AFL	0.0072173
2015-03-06	AFL	-0.0101911
2015-03-09	AFL	0.0022523
2015-03-10	AFL	-0.0200642
2015-03-11	AFL	0.0072072
2015-03-12	AFL	0.0188649
2015-03-13	AFL	-0.0083001
2015-03-16	AFL	0.0133591

Next, we create a long dataframe of all the combinations for each day by doing a full join of the data on itself, by date.

# combinations by date
pairwise_combos <- df %>%
  full_join(df, by = "date")  

pairwise_combos %>%
  na.omit() %>%
  head(20) %>%
  pretty_table()

date	ticker.x	return.x	ticker.y	return.y
2015-01-05	AFL	-0.0261952	AFL	-0.0261952
2015-01-05	AFL	-0.0261952	AIG	-0.0197826
2015-01-05	AFL	-0.0261952	AIZ	-0.0224288
2015-01-05	AFL	-0.0261952	AJG	-0.0059600
2015-01-05	AFL	-0.0261952	ALL	-0.0198232
2015-01-05	AFL	-0.0261952	AMP	-0.0320993
2015-01-05	AFL	-0.0261952	AON	-0.0096470
2015-01-05	AFL	-0.0261952	AXP	-0.0264459
2015-01-05	AFL	-0.0261952	BAC	-0.0290503
2015-01-05	AFL	-0.0261952	BEN	-0.0331591
2015-01-05	AFL	-0.0261952	BK	-0.0257044
2015-01-05	AFL	-0.0261952	BLK	-0.0258739
2015-01-05	AFL	-0.0261952	C	-0.0315149
2015-01-05	AFL	-0.0261952	CB	-0.0163404
2015-01-05	AFL	-0.0261952	CB	0.0000000
2015-01-05	AFL	-0.0261952	CBOE	0.0318813
2015-01-05	AFL	-0.0261952	CFG	-0.0234249
2015-01-05	AFL	-0.0261952	CINF	-0.0143383
2015-01-05	AFL	-0.0261952	CMA	-0.0356448
2015-01-05	AFL	-0.0261952	CME	0.0056728

So far so good.

Now we’ve got some wrangling to do. We want to remove instances where we have the same stock for ticker.x and ticker.y, which corresponds to the diagonal on the correlation matrix.

We also want to remove instances where we have the same stock, but with the ticker.x and ticker.y designations reversed (this is equivalent to removing the upper or lower triangle of the correlation matrix).

Note that we need to ungroup our dataframe (we grouped it earlier) – if we don’t ungroup our variables, the grouping variable will be added back and thwart attempts to filter distinct cases.

pairwise_combos <- pairwise_combos %>%
  ungroup() %>%  # important!! 
# drop diagonal 
  filter(ticker.x != ticker.y) %>% 
# remove duplicate pairs (eg A-AAL, AAL-A)
  mutate(tickers = ifelse(ticker.x < ticker.y, glue("{ticker.x}, {ticker.y}"), glue("{ticker.y}, {ticker.x}"))) %>%
  distinct(date, tickers, .keep_all = TRUE) 

pairwise_combos %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()

date	ticker.x	return.x	ticker.y	return.y	tickers
2015-01-05	AFL	-0.0261952	AIG	-0.0197826	AFL, AIG
2015-01-05	AFL	-0.0261952	AIZ	-0.0224288	AFL, AIZ
2015-01-05	AFL	-0.0261952	AJG	-0.0059600	AFL, AJG
2015-01-05	AFL	-0.0261952	ALL	-0.0198232	AFL, ALL
2015-01-05	AFL	-0.0261952	AMP	-0.0320993	AFL, AMP
2015-01-05	AFL	-0.0261952	AON	-0.0096470	AFL, AON
2015-01-05	AFL	-0.0261952	AXP	-0.0264459	AFL, AXP
2015-01-05	AFL	-0.0261952	BAC	-0.0290503	AFL, BAC
2015-01-05	AFL	-0.0261952	BEN	-0.0331591	AFL, BEN
2015-01-05	AFL	-0.0261952	BK	-0.0257044	AFL, BK
2015-01-05	AFL	-0.0261952	BLK	-0.0258739	AFL, BLK
2015-01-05	AFL	-0.0261952	C	-0.0315149	AFL, C
2015-01-05	AFL	-0.0261952	CB	-0.0163404	AFL, CB
2015-01-05	AFL	-0.0261952	CBOE	0.0318813	AFL, CBOE
2015-01-05	AFL	-0.0261952	CFG	-0.0234249	AFL, CFG
2015-01-05	AFL	-0.0261952	CINF	-0.0143383	AFL, CINF
2015-01-05	AFL	-0.0261952	CMA	-0.0356448	AFL, CMA
2015-01-05	AFL	-0.0261952	CME	0.0056728	AFL, CME
2015-01-05	AFL	-0.0261952	COF	-0.0230331	AFL, COF
2015-01-05	AFL	-0.0261952	DFS	-0.0223378	AFL, DFS
2015-01-05	AFL	-0.0261952	ETFC	-0.0324865	AFL, ETFC
2015-01-05	AFL	-0.0261952	FITB	-0.0301831	AFL, FITB
2015-01-05	AFL	-0.0261952	FRC	-0.0299424	AFL, FRC
2015-01-05	AFL	-0.0261952	GL	-0.0179099	AFL, GL
2015-01-05	AFL	-0.0261952	GS	-0.0312227	AFL, GS
2015-01-05	AFL	-0.0261952	HBAN	-0.0295238	AFL, HBAN
2015-01-05	AFL	-0.0261952	HIG	-0.0210577	AFL, HIG
2015-01-05	AFL	-0.0261952	ICE	0.0054677	AFL, ICE
2015-01-05	AFL	-0.0261952	IVZ	-0.0291262	AFL, IVZ
2015-01-05	AFL	-0.0261952	JPM	-0.0310450	AFL, JPM

Next, we’ll use the brilliantly useful slider package and the function slide2_dbl to do the rolling correlation calculation (slider implements a number of rolling window calculation functions – we’ll explore it more in another post):

period <- 60

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x, 
    .y = return.y, 
    .f = ~cor(.x, .y), 
    .before = period, 
    .complete = TRUE)
    ) %>%
  select(date, tickers, rollingcor)

pairwise_corrs %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()

date	tickers	rollingcor
2015-04-01	AFL, AIG	0.7818676
2015-04-02	AFL, AIG	0.7718580
2015-04-06	AFL, AIG	0.7678625
2015-04-07	AFL, AIG	0.7680022
2015-04-08	AFL, AIG	0.7813979
2015-04-09	AFL, AIG	0.7711979
2015-04-10	AFL, AIG	0.7678292
2015-04-13	AFL, AIG	0.7469418
2015-04-14	AFL, AIG	0.7423953
2015-04-15	AFL, AIG	0.7422602
2015-04-16	AFL, AIG	0.7375739
2015-04-17	AFL, AIG	0.7422152
2015-04-20	AFL, AIG	0.7391824
2015-04-21	AFL, AIG	0.7285547
2015-04-22	AFL, AIG	0.7220685
2015-04-23	AFL, AIG	0.7302781
2015-04-24	AFL, AIG	0.7220693
2015-04-27	AFL, AIG	0.6898930
2015-04-28	AFL, AIG	0.6830527
2015-04-29	AFL, AIG	0.6761612
2015-04-30	AFL, AIG	0.6472679
2015-05-01	AFL, AIG	0.5972614
2015-05-04	AFL, AIG	0.6544743
2015-05-05	AFL, AIG	0.6505913
2015-05-06	AFL, AIG	0.6460236
2015-05-07	AFL, AIG	0.6449847
2015-05-08	AFL, AIG	0.6497471
2015-05-11	AFL, AIG	0.6656422
2015-05-12	AFL, AIG	0.6721218
2015-05-13	AFL, AIG	0.6832579

The syntax of slide2_dbl might look odd if it’s the first time you’ve seen it, but it leverages the tidyverse’s functional programming tools to repeatedly apply a function (given by .f = ~cor(...)) over windows of our data specified by before (number of prior periods to use in the window) and complete (whether to evaluate .f on complete windows only).

The ~ notation might look odd too. In this case, it’s used as shorthand for an anonymous function: function(.x, .y) {cor(.x, .y)}

So our pipeline of operations above is exactly the same as this one:

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x, 
    .y = return.y, 
    .f = function(.x, .y) { cor(.x, .y) },  # long-hand anonymous function
    .before = period, 
    .complete = TRUE)
    ) %>%
  select(date, tickers, rollingcor)

Now, the other confusing things about this transformation are the seemingly inconsistent arguments in slider2_dbl:

we designate a .x and a .y argument
but we also define a function with these arguments

Actually, the .x and .y names are conventions used throughout the tidyverse to designate variables that are subject to non-standard evaluation (more on what that means in another post – it’s not critical right now). In our slide2_dbl function, .x is passed as the first argument to .f and .y is passed as the second.

That means that we could equally write our transformation like this, and it would be equivalent:

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x, 
    .y = return.y, 
    .f = function(arg1, arg2) { cor(arg1, arg2) },  # the name of the args doesn't matter
    .before = period, 
    .complete = TRUE)
    ) %>%
  select(date, tickers, rollingcor)

Finally, to get the mean rolling correlation of the ETF constituents, we simply group by date and take the mean of the group:

mean_pw_cors <- pairwise_corrs %>%
  group_by(date) %>%
  summarise(mean_pw_corr = mean(rollingcor, na.rm = TRUE))

mean_pw_cors %>%
  na.omit() %>%
  ggplot(aes(x = date, y = mean_pw_corr)) +
    geom_line() +
    labs(
      x = "Date",
      y = "Mean Pairwise Correlation",
      title = "Rolling Mean Pairwise Correlation",
      subtitle = "XLF Constituents"
    )

plot of chunk unnamed-chunk-9

Conclusion

In this post, we broke down our problem of calculating the rolling mean correlation of the constituents of an ETF into various chunks and solved them one at a time to get the desired output.

The tidy data manipulation snippets we used here will be useful for doing similar transformations, such as rolling beta calculations, as well as single-variable rolling calculations such as volatility.

One problem that we glossed over here is that our largest dataframe – the one containing the pairwise combinations of returns – consisted of just under 3 million rows. That means we can easily do this entire piece of analysis in memory.
Things get slightly more difficult if we want to calculate the mean rolling correlation of the constituents of a larger ETF or index.

In another post, we’ll solve this problem for the S&P 500 index. We’ll also consider how the index has changed over time.

9 thoughts on “How to Calculate Rolling Pairwise Correlations in the Tidyverse”

Pingback: Handling a Large Universe of Stock Price Data in R: Profiling with profvis - Robot Wealth
Pingback: Weekly Roundup 22 May - Doubling Down in Losing Trades Like a Drunken Hedge Fund Manager - Robot Wealth
T.C.

May 23, 2020 at 9:53 am

Are correlations and covariances like standard deviation and variances? i.e just like it’s frowned upon to average up correlations, but it’s okay to average up variances, might you want to calculate the average covariance, rather than the average correlation?
- Kris Longmore
  
  May 25, 2020 at 3:01 pm
  
  I’m not sure that it’s frowned upon to average pairwise correlations as such…perhaps if you were averaging multiple correlation estimates from a singlepair of variables you’d run into some problems. But that’s not what we’re doing here.
  
  Ultimately it depends on what you’re doing with the calculation. We’re using it as part of a broader study to understand the effect of ETF-driven trading on constituent stocks. In particular, whether ETF-driven sell-offs in the constituents tend to drive correlations up and lead to exploitable mispricings. In that context, I think the mean pairwise correlation is fine.
Pingback: Performant R Programming: Chunking a Problem into Smaller Pieces - Robot Wealth
Carlos Mata

May 31, 2020 at 5:06 am

Hi Kris!. Awesome post! Super useful!. The full_join is a very cool approach.

For the rolling correlation with slider package, it got a bit slow when running over 17 million rows (over a different dataset). I tried tibbletime’s “rollify” function as an alternative, and it’s also a bit slow. Then tried the “Roll” package with roll_cor function.. and it runs 30X faster.
The Roll package seems works fine with tidyverse except for some functions that are crashing on my dataset like roll_max and roll_min.
Pingback: Corr-correlation – Data Science Austria
Pingback: Corr-correlation | R-bloggers
Brent Morrison

October 17, 2020 at 6:05 pm

The RollingWindow package won’t get you around the memory issues but is pretty quick. https://github.com/andrewuhl/RollingWindow

How to Calculate Rolling Pairwise Correlations in the Tidyverse

Calculating returns

Conclusion

9 thoughts on “How to Calculate Rolling Pairwise Correlations in the Tidyverse”

Leave a Comment Cancel reply

© 2022 Robot Wealth Pty LtD

Privacy Policy | Terms Of Use | FAQ | [email protected]