# dplyr

Posted on May 28, 2020 by

When data is too big to fit into memory, one approach is to break it into smaller pieces, operate on each piece, and then join the results back together. Here's how to do that to calculate rolling mean pairwise correlations of a large stock universe.

**Background**

We've been using the problem of calculating mean rolling correlations of ETF constituents as a test case for solving in-memory computation limitations in R. We're interested in this calculation as a research input to a statistical arbitrage strategy that leverages ETF-driven trading in the constituents. We wrote about an early foray into this trade. Previously, we introduced this problem, along with the concept of profiling code for performance bottlenecks, here.

We can do the calculation in-memory without any trouble for a regular ETF, say XLF (the SPDR financial sector ETF), but we quickly run into problems if we want to look at SPY. In this post, we're going to explore one workaround for R's in-memory limitations: splitting the problem into smaller pieces and recombining them to get our desired result.

**The problem**

When...
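The split-apply-combine pattern described above can be sketched as follows. This is a minimal illustration with hypothetical data and chunk sizes (the object and chunk names are ours, not from the original post); the per-chunk operation here is a simple summary rather than a rolling correlation:

```r
library(dplyr)
library(purrr)

# Hypothetical long dataframe of daily returns: ticker, date, return
set.seed(42)
returns <- expand.grid(
  ticker = paste0("STK", 1:9),
  date = seq.Date(as.Date("2020-01-01"), by = "day", length.out = 50)
) %>%
  mutate(return = rnorm(n(), 0, 0.01))

# Split the universe into smaller chunks of tickers
chunks <- split(unique(returns$ticker), rep(1:3, each = 3))

# Operate on each chunk independently, then bind the partial
# results back together into one dataframe
result <- map_dfr(chunks, function(tkrs) {
  returns %>%
    filter(ticker %in% tkrs) %>%
    group_by(ticker) %>%
    summarise(mean_return = mean(return), .groups = "drop")
})

nrow(result)  # one row per ticker: 9
```

Because each chunk is processed independently, only one chunk's intermediate data needs to live in memory at a time.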

Posted on May 22, 2020 by

Recently, we wrote about calculating mean rolling pairwise correlations between the constituent stocks of an ETF. The tidyverse tools dplyr and slider solve this somewhat painful data wrangling operation about as elegantly and intuitively as possible.

**Why did you want to do that?**

We're building a statistical arbitrage strategy that relies on indexation-driven trading in the constituents. We wrote about an early foray into this trade - we're now taking the concepts a bit further.

But what about the problem of scaling it up? When we performed this operation on the constituents of the XLF ETF, our largest intermediate dataframe consisted of around 3 million rows, easily within the capabilities of modern laptops. XLF currently holds 68 constituent stocks, so for any day we have $\frac{68 \times 67}{2} = 2{,}278$ correlations to estimate (67 because we exclude the diagonal of the correlation matrix, and we take half because we only need its upper or lower triangle). We calculated five years of rolling correlations, so we had $5 \times 250 \times 2{,}278 = 2{,}847{,}500$ correlations in total. Piece of cake. The problem gets a lot...
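The pair-counting arithmetic above is easy to verify in a couple of lines (the helper name is ours, for illustration):

```r
# Number of distinct pairwise correlations among n stocks:
# n * (n - 1) / 2 -- off-diagonal entries, upper triangle only
n_pairs <- function(n) n * (n - 1) / 2

n_pairs(68)            # 2278 pairs for XLF's 68 constituents
5 * 250 * n_pairs(68)  # 2847500 rows over ~5 years of trading days
```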

Posted on May 20, 2020 by

Working with modern APIs, you will often have to wrangle with data in JSON format. This article presents some tools and recipes for working with JSON data with R in the tidyverse. We'll use purrr::map functions to extract and transform our JSON data, and we'll provide intuitive examples of the cross-overs and differences between purrr and dplyr.

```r
library(tidyverse)
library(here)
library(kableExtra)

pretty_print <- function(df, num_rows) {
  df %>%
    head(num_rows) %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}
```

**Load JSON as nested named lists**

This data has been converted from raw JSON to nested named lists using jsonlite::fromJSON with the simplify argument set to FALSE (that is, all elements are converted to named lists). The data consists of market data for SPY options with various strikes and expiries. We got it from the options data vendor Orats, whose data API we enjoy almost as much as their orange website. If you want to follow along, you can sign up for a free trial of the API and load the data directly from the Orats API with the...
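To give a flavour of the purrr::map approach on nested named lists, here is a minimal sketch. The toy list below is illustrative only - a stand-in for JSON parsed with simplification turned off, not the actual Orats payload:

```r
library(purrr)
library(dplyr)

# Toy nested list standing in for parsed JSON: one named list per option
options_data <- list(
  list(ticker = "SPY", strike = 300, expiry = "2020-06-19", bid = 1.10, ask = 1.20),
  list(ticker = "SPY", strike = 310, expiry = "2020-06-19", bid = 0.55, ask = 0.65)
)

# map_dbl pulls one numeric field out of each list element
strikes <- map_dbl(options_data, "strike")

# map_dfr converts each named list to a one-row tibble, then row-binds them
options_df <- map_dfr(options_data, as_tibble)
```

Passing a string like `"strike"` as the second argument to a map function is purrr shorthand for extracting the element of that name from each list.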

Posted on May 18, 2020 by

How might we calculate rolling correlations between constituents of an ETF, given a dataframe of prices? For problems like this, the tidyverse really shines. There are a number of ways to solve this problem … read on for our solution, and let us know if you'd approach it differently!

First, we load some packages and some data that we extracted earlier. xlfprices.RData contains a dataframe, prices_xlf, of constituents of the XLF ETF and their daily prices. You can get this data from our GitHub repository. The dataset isn't entirely accurate, as it contains prices of today's constituents and doesn't account for historical changes to the makeup of the ETF. But that won't matter for our purposes.

```r
library(tidyverse)
library(lubridate)
library(glue)
library(here)

theme_set(theme_bw())

load(here::here("data", "xlfprices.RData"))

prices_xlf %>% head(10)
## # A tibble: 10 x 10
##    ticker date        open  high   low close  volume dividends closeunadj inSPX
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>      <dbl> <lgl>
##  1 AFL    2019-11-29  54.8  55.1  54.8  54.8 1270649         0       54.8 TRUE
##  2 AIG    2019-11-29  52.8  53.2  52.6  52.7 2865501         0       52.7 TRUE
## ...
```
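For a single pair of tickers, the core rolling-correlation step can be sketched with slider (the package this series uses), by widening the returns and sliding a correlation window down the two columns. The data below is simulated for illustration and the 20-day window is an arbitrary choice, not the post's:

```r
library(dplyr)
library(tidyr)
library(slider)

set.seed(1)
# Hypothetical daily returns for two tickers in long format
returns <- tibble(
  date = rep(seq.Date(as.Date("2020-01-01"), by = "day", length.out = 60), 2),
  ticker = rep(c("AFL", "AIG"), each = 60),
  return = rnorm(120, 0, 0.01)
)

# Widen so each ticker is a column, then compute a 20-day rolling correlation
roll_corr <- returns %>%
  pivot_wider(names_from = ticker, values_from = return) %>%
  mutate(
    corr = slide2_dbl(AFL, AIG, cor, .before = 19, .complete = TRUE)
  )
```

With `.complete = TRUE`, the first 19 rows are NA because a full 20-day window isn't yet available. Doing this for all pairs is the combinatorial part that makes the problem interesting at scale.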

Posted on May 14, 2020 by

In this post, we're going to show how a quant trader can manipulate stock price data using the dplyr R package.

**Getting set up and loading data**

Load the dplyr package via the tidyverse package.

```r
if (!require('tidyverse')) install.packages('tidyverse')
library(tidyverse)
```

First, load some price data. energystockprices.RDS contains a data frame of daily price observations for 3 energy stocks.

```r
prices <- readRDS('energystockprices.RDS')
prices
```

We've organised our data so that:

- Every column is a variable.
- Every row is an observation.

In this data set:

- We have 13,314 rows in our data frame.
- Each row represents a daily price observation for a given stock.
- For each observation, we measure the open, high, low and close prices, and the volume traded.

This is a very helpful way to structure your price data. We'll see how we can use the dplyr package to manipulate price data for quant analysis.

**The main dplyr verbs**

There are 6 main functions to master in dplyr:

- filter() picks out observations (rows) by some filter criteria
- arrange() reorders the observations (rows)
- select() picks out the variables (columns)
- mutate() creates new variables (columns) by...
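The verbs above compose naturally with the pipe. Here is a minimal sketch on a toy price frame (a stand-in for energystockprices.RDS - the tickers and values are made up):

```r
library(dplyr)

# Toy stand-in for the energy stock prices data frame
prices <- tibble(
  ticker = c("XOM", "XOM", "CVX", "CVX"),
  date   = as.Date(c("2020-05-01", "2020-05-04", "2020-05-01", "2020-05-04")),
  close  = c(43.1, 44.0, 90.5, 92.3),
  volume = c(1e6, 1.2e6, 8e5, 9e5)
)

xom <- prices %>%
  filter(ticker == "XOM") %>%        # pick rows matching a criterion
  arrange(desc(date)) %>%            # reorder rows, newest first
  select(ticker, date, close) %>%    # pick columns
  mutate(log_close = log(close))     # create a new column
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes them chain so cleanly.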

Posted on May 12, 2020 by

In this post, we are going to construct snapshots of historic S&P 500 index constituents from freely available data on the internet.

Why? Well, one of the biggest challenges in looking for opportunities amongst a broad universe of stocks is choosing what stock "universe" to look at. One approach to dealing with this is to pick the stocks that are currently in the S&P 500 index. Unfortunately, the stocks that are currently in the S&P 500 index weren't all there last year. A third of them weren't there ten years ago...

If we create a historical data set by picking current S&P 500 index constituents, then we will be including historical data for smaller stocks that weren't in the index at that time. These are all going to be stocks that did very well, historically, or else they wouldn't have gotten into the index! So this universe selection technique biases our stock returns higher: the average past return of current SPX constituents is higher than the average past return of historic SPX constituents. It's easy...
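The survivorship effect described above is easy to demonstrate with a toy simulation (the numbers here are invented purely for illustration). If you select today's constituents by looking at past performance, their average past return is higher than the universe's by construction:

```r
set.seed(123)

# Simulated past annual returns for 500 hypothetical stocks
past_returns <- rnorm(500, mean = 0.05, sd = 0.20)

# "Current constituents": suppose the index retained the
# top-performing two thirds of these stocks
current <- sort(past_returns, decreasing = TRUE)[1:333]

# Conditioning the universe on past performance biases
# the historical average return upward
mean(current) > mean(past_returns)  # TRUE by construction
```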

Posted on Apr 30, 2020 by