# slider

When data is too big to fit into memory, one approach is to break it into smaller pieces, operate on each piece, and then join the results back together. Here's how to do that to calculate rolling mean pairwise correlations of a large stock universe. Background We've been using the problem of calculating mean rolling correlations of ETF constituents as a test case for solving in-memory computation limitations in R. We're interested in this calculation as a research input to a statistical arbitrage strategy that leverages ETF-driven trading in the constituents. We wrote about an early foray into this trade. Previously, we introduced this problem along with the concept of profiling code for performance bottlenecks here. We can do the calculation in-memory without any trouble for a regular ETF, say XLF (the SPDR financial sector ETF), but we quickly run into problems if we want to look at SPY. In this post, we're going to explore one workaround for R's in-memory limitations by splitting the problem into smaller pieces and recombining them to get our desired result. The problem When...

Recently, we wrote about calculating mean rolling pairwise correlations between the constituent stocks of an ETF. The tidyverse tools dplyr and slider solve this somewhat painful data wrangling operation about as elegantly and intuitively as possible. Why did you want to do that? We're building a statistical arbitrage strategy that relies on indexation-driven trading in the constituents. We wrote about an early foray into this trade - we're now taking the concepts a bit further. But what about the problem of scaling it up? When we performed this operation on the constituents of the XLF ETF, our largest intermediate dataframe consisted of around 3-million rows, easily within the capabilities of modern laptops. XLF currently holds 68 constituent stocks. So for any day, we have [latex] \frac{68*67}{2} = 2,278 [/latex] correlations to estimate (67 because we don't want the diagonal of the correlation matrix, take half as we only need its upper or lower triangle). We calculated five years of rolling correlations, so we had [latex] 5*250*2,278 = 2,847,500 [/latex] correlations in total. Piece of cake. The problem gets a lot...

How might we calculate rolling correlations between constituents of an ETF, given a dataframe of prices? For problems like this, the tidyverse really shines. There are a number of ways to solve this problem … read on for our solution, and let us know if you'd approach it differently! First, we load some packages and some data that we extracted earlier. xlfprices.RData contains a dataframe, prices_xlf, of constituents of the XLF ETF and their daily prices. You can get this data from our GitHub repository. The dataset isn't entirely accurate, as it contains prices of today's constituents and doesn't account for historical changes to the makeup of the ETF. But that won't matter for our purposes. library(tidyverse) library(lubridate) library(glue) library(here) theme_set(theme_bw()) load(here::here("data", "xlfprices.RData")) prices_xlf %>% head(10) ## # A tibble: 10 x 10 ## ticker date open high low close volume dividends closeunadj inSPX ## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> ## 1 AFL 2019-11-29 54.8 55.1 54.8 54.8 1270649 0 54.8 TRUE ## 2 AIG 2019-11-29 52.8 53.2 52.6 52.7 2865501 0 52.7 TRUE ##...