# tidyverse

Posted on Jun 02, 2020

Holding data in a tidy format works wonders for one's productivity. Here we will explore the tidyr package, which is all about creating tidy data. In particular, let's develop an understanding of the tidyr::pivot_longer and tidyr::pivot_wider functions for switching between different formats of tidy data. In this video, you'll learn:

- What tidy data looks like
- Why it's a sensible approach
- The difference between long and wide tidy data
- How to efficiently switch between the two formats
- When and why you'd use each of the two formats

What's tidy data? Tidy data is data where:

- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.

Why do we care? It turns out there are huge benefits to thinking about the “shape” of your data and the best way to structure and manipulate it for your problem. Tidy data is a standard way of shaping data that facilitates analysis. In particular, tidy data works very well with the tidyverse tools, which means less time spent transforming and cleaning data and more time spent solving problems. In...
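As a minimal sketch of the two formats (the data here is made up for illustration), pivot_longer and pivot_wider are inverses of each other:

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one price column per ticker
wide <- tibble(
  date = as.Date("2020-06-01") + 0:2,
  XLF  = c(22.1, 22.3, 22.0),
  SPY  = c(304.2, 306.5, 303.9)
)

# Wide -> long: one row per (date, ticker) observation
long <- wide %>%
  pivot_longer(-date, names_to = "ticker", values_to = "close")

# Long -> wide: back to one column per ticker
wide_again <- long %>%
  pivot_wider(names_from = ticker, values_from = close)
```

The long format is usually the more convenient one for grouped tidyverse operations; the wide format suits matrix-style calculations like correlations.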

Posted on May 28, 2020

When data is too big to fit into memory, one approach is to break it into smaller pieces, operate on each piece, and then join the results back together. Here's how to do that to calculate rolling mean pairwise correlations of a large stock universe. Background We've been using the problem of calculating mean rolling correlations of ETF constituents as a test case for solving in-memory computation limitations in R. We're interested in this calculation as a research input to a statistical arbitrage strategy that leverages ETF-driven trading in the constituents. We wrote about an early foray into this trade. Previously, we introduced this problem along with the concept of profiling code for performance bottlenecks here. We can do the calculation in-memory without any trouble for a regular ETF, say XLF (the SPDR financial sector ETF), but we quickly run into problems if we want to look at SPY. In this post, we're going to explore one workaround for R's in-memory limitations by splitting the problem into smaller pieces and recombining them to get our desired result. The problem When...
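The general pattern — split the universe into pieces, operate on each piece, recombine — can be sketched in a few lines of base R (the chunking and the per-chunk function here are purely illustrative stand-ins for the real correlation work):

```r
# Hypothetical universe of tickers, split into chunks of at most 3
tickers <- paste0("STOCK", 1:10)
chunks <- split(tickers, ceiling(seq_along(tickers) / 3))

# Stand-in for the real per-chunk computation
# (e.g. rolling pairwise correlations for these tickers)
process_chunk <- function(tks) {
  data.frame(ticker = tks, result = nchar(tks))
}

# Operate on each piece, then join the results back together
results <- do.call(rbind, lapply(chunks, process_chunk))
```

Because each chunk is processed independently, only one piece needs to fit in memory at a time.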

Posted on May 27, 2020

When you're working with large universes of stock data, you'll come across a lot of challenges:

- Stocks pay dividends and other distributions that have to be accounted for.
- Stocks are subject to splits and other corporate actions, which also have to be accounted for.
- New stocks are listed all the time - you won't have as much history for these stocks as for others.
- Stocks are delisted, and many datasets do not include the price history of delisted stocks.
- Stocks can be suspended or halted for a period of time, leading to trading gaps.
- Companies grow and shrink: the “top 100 stocks by market cap” in 1990 looks very different to the same group in 2020; “growth stocks” in 1990 look very different to “growth stocks” in 2020, etc.

The challenges are well understood, but dealing with them is not always straightforward. One significant challenge is gaps in data. Quant analysis gets very hard if you have missing or misaligned data. If you're working with a universe of 1,000 stocks, life is a lot easier if you have an...
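One way to make gaps explicit is to pad the panel out to a full ticker-by-date grid, so missing observations show up as NAs rather than silently absent rows. A base-R sketch on made-up data:

```r
# Illustrative price panel with a gap: BBB has no row for the second date
prices <- data.frame(
  ticker = c("AAA", "AAA", "BBB"),
  date   = as.Date(c("2020-05-18", "2020-05-19", "2020-05-18")),
  close  = c(10.0, 10.2, 55.1)
)

# Full ticker x date grid
full_grid <- expand.grid(
  ticker = unique(prices$ticker),
  date   = unique(prices$date),
  stringsAsFactors = FALSE
)

# Left-join prices onto the grid; gaps become NA closes
padded <- merge(full_grid, prices, all.x = TRUE)
```

From here you can decide explicitly how to treat each NA (fill forward, drop the ticker, etc.) instead of discovering misalignment later.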

Posted on May 22, 2020

Recently, we wrote about calculating mean rolling pairwise correlations between the constituent stocks of an ETF. The tidyverse tools dplyr and slider solve this somewhat painful data wrangling operation about as elegantly and intuitively as possible. Why did you want to do that? We're building a statistical arbitrage strategy that relies on indexation-driven trading in the constituents. We wrote about an early foray into this trade - we're now taking the concepts a bit further. But what about the problem of scaling it up? When we performed this operation on the constituents of the XLF ETF, our largest intermediate dataframe consisted of around 3 million rows, easily within the capabilities of modern laptops. XLF currently holds 68 constituent stocks, so for any day we have $\frac{68 \times 67}{2} = 2,278$ correlations to estimate (67 because we exclude the diagonal of the correlation matrix, and we halve the count because we only need the upper or lower triangle). We calculated five years of rolling correlations, so we had $5 \times 250 \times 2,278 = 2,847,500$ correlations in total. Piece of cake. The problem gets a lot...
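The counting logic generalises: for $n$ constituents there are $\frac{n(n-1)}{2}$ distinct pairwise correlations, so the totals above can be reproduced directly:

```r
# Distinct pairwise correlations among n stocks
# (upper triangle of the correlation matrix, diagonal excluded)
n_pairs <- function(n) n * (n - 1) / 2

n_pairs(68)            # 2278 pairs for XLF's 68 constituents
5 * 250 * n_pairs(68)  # 2,847,500 rolling estimates over 5 years of ~250 trading days
```

Because the pair count grows quadratically in $n$, moving from XLF to the ~500 names in SPY multiplies the workload by roughly fifty.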

Posted on May 20, 2020

Working with modern APIs, you will often have to wrangle with data in JSON format. This article presents some tools and recipes for working with JSON data with R in the tidyverse. We'll use purrr::map functions to extract and transform our JSON data. And we'll provide intuitive examples of the cross-overs and differences between purrr and dplyr.

```r
library(tidyverse)
library(here)
library(kableExtra)

pretty_print <- function(df, num_rows) {
  df %>%
    head(num_rows) %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}
```

Load JSON as nested named lists

This data has been converted from raw JSON to nested named lists using jsonlite::fromJSON with the simplify argument set to FALSE (that is, all elements are converted to named lists). The data consists of market data for SPY options with various strikes and expiries. We got it from the options data vendor Orats, whose data API I enjoy almost as much as their orange website. If you want to follow along, you can sign up for a free trial of the API and load the data directly from the Orats API with the...
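To give a flavour of the purrr approach, here's a sketch on a hand-built nested list shaped like unsimplified jsonlite::fromJSON output. The field names are illustrative, not Orats' actual schema:

```r
library(purrr)
library(tibble)

# Hypothetical nested named lists, one element per option quote
opts <- list(
  list(strike = 280, expiry = "2020-06-19", bid = 9.1, ask = 9.3),
  list(strike = 285, expiry = "2020-06-19", bid = 6.4, ask = 6.6)
)

# Extract a single field across all elements
strikes <- map_dbl(opts, "strike")

# Flatten each element into a one-row tibble and bind the rows
quotes <- map_dfr(opts, ~ tibble(
  strike = .x$strike,
  expiry = .x$expiry,
  mid    = (.x$bid + .x$ask) / 2
))
```

Once the nested lists are rectangled into a tibble like `quotes`, the usual dplyr verbs take over.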

Posted on May 18, 2020

How might we calculate rolling correlations between constituents of an ETF, given a dataframe of prices? For problems like this, the tidyverse really shines. There are a number of ways to solve this problem … read on for our solution, and let us know if you'd approach it differently!

First, we load some packages and some data that we extracted earlier. xlfprices.RData contains a dataframe, prices_xlf, of constituents of the XLF ETF and their daily prices. You can get this data from our GitHub repository. The dataset isn't entirely accurate, as it contains prices of today's constituents and doesn't account for historical changes to the makeup of the ETF. But that won't matter for our purposes.

```r
library(tidyverse)
library(lubridate)
library(glue)
library(here)

theme_set(theme_bw())

load(here::here("data", "xlfprices.RData"))

prices_xlf %>%
  head(10)
```

```
## # A tibble: 10 x 10
##    ticker date        open  high   low close  volume dividends closeunadj inSPX
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>      <dbl> <lgl>
##  1 AFL    2019-11-29  54.8  55.1  54.8  54.8 1270649         0       54.8 TRUE
##  2 AIG    2019-11-29  52.8  53.2  52.6  52.7 2865501         0       52.7 TRUE
## ...
```
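Before tackling the full pairwise problem, the core rolling-window operation for a single pair of series can be sketched with slider (synthetic return series; the 60-observation window is chosen purely for illustration):

```r
library(slider)

set.seed(42)
x <- rnorm(250)            # synthetic daily returns for stock A
y <- 0.5 * x + rnorm(250)  # correlated synthetic returns for stock B

# Rolling 60-observation correlation; incomplete leading windows return NA
roll_cor <- slide2_dbl(x, y, cor, .before = 59, .complete = TRUE)
```

slide2_dbl walks the two vectors in lockstep, applying cor to each trailing window - the same sliding-window idea extends to every constituent pair.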

Posted on May 14, 2020

In this post, we're going to show how a quant trader can manipulate stock price data using the dplyr R package.

Getting set up and loading data

Load the dplyr package via the tidyverse package.

```r
if (!require('tidyverse')) install.packages('tidyverse')
library(tidyverse)
```

First, load some price data. energystockprices.RDS contains a data frame of daily price observations for 3 energy stocks.

```r
prices <- readRDS('energystockprices.RDS')
prices
```

We've organised our data so that:

- Every column is a variable.
- Every row is an observation.

In this data set:

- We have 13,314 rows in our data frame.
- Each row represents a daily price observation for a given stock.
- For each observation, we measure the open, high, low and close prices, and the volume traded.

This is a very helpful way to structure your price data. We'll see how we can use the dplyr package to manipulate price data for quant analysis.

The main dplyr verbs

There are 6 main functions to master in dplyr:

- filter() picks out observations (rows) by some filter criteria
- arrange() reorders the observations (rows)
- select() picks out the variables (columns)
- mutate() creates new variables (columns) by...
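To give a feel for how these verbs chain together, here's a sketch on a small made-up price tibble (the real post uses energystockprices.RDS; the tickers and numbers below are illustrative only):

```r
library(dplyr)

# Illustrative daily prices for two tickers (made up, not the post's data)
prices <- tibble(
  ticker = rep(c("XOM", "CVX"), each = 3),
  date   = rep(as.Date("2020-05-11") + 0:2, times = 2),
  close  = c(43.1, 44.0, 43.5, 89.2, 90.1, 88.7)
)

result <- prices %>%
  filter(ticker == "XOM") %>%           # keep one stock's rows
  arrange(date) %>%                     # order observations by date
  select(date, close) %>%               # keep only the columns we need
  mutate(ret = close / lag(close) - 1)  # add a daily simple-return column
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe-chained style so composable.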

Posted on May 12, 2020

In this post, we are going to construct snapshots of historic S&P 500 index constituents, from freely available data on the internet. Why? Well, one of the biggest challenges in looking for opportunities amongst a broad universe of stocks is choosing what stock "universe" to look at. One approach to dealing with this is to pick the stocks that are currently in the S&P 500 index. Unfortunately, the stocks that are currently in the S&P 500 index weren't all there last year. A third of them weren't there ten years ago... If we create a historical data set by picking current S&P 500 index constituents, then we will be including historical data for smaller stocks that weren't in the index at that time. These are all going to be stocks that did very well, historically, or else they wouldn't have gotten in the index! So this universe selection technique biases our stock returns higher. The average past returns of current SPX constituents is higher than the average past returns of historic SPX constituents, due to this upward bias. It's easy...

Posted on May 08, 2020
There are 2 good reasons to buy put options:

- because you think they are cheap
- because you want downside protection

In the latter case, you are looking to use the skewed payoff profile of the put option to protect a portfolio against large downside moves without capping your upside too much. The first requires a pricing model, or at least an understanding of when and under what conditions put options tend to be cheap. The second doesn't necessarily. We'll assume that we're going to have to pay a premium to protect our portfolio - and that not losing a large amount of money is more important than the exact price we pay for it. Let's run through an example…

We have a portfolio comprised entirely of 100 shares of SPY. About $29k worth. We can plot a payoff profile for our whole portfolio. This is going to show the dollar P&L from our portfolio at various SPY prices. At the time of writing, SPY closed at $287.05.

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, rvest, slider, tidyquant, alphavantager, kableExtra)

SPYprice <- 287.05
```

...
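The payoff profile itself is simple arithmetic. A base-R sketch using the quoted $287.05 price and a hypothetical protective put - the $280 strike and $10 premium are made-up numbers for illustration, not quotes:

```r
spy_price <- 287.05
shares    <- 100

# Hypothetical protective put (illustrative strike and premium)
strike  <- 280
premium <- 10

# Dollar P&L of the share position across a range of terminal SPY prices
prices_at_expiry <- seq(200, 350, by = 5)
stock_pnl <- shares * (prices_at_expiry - spy_price)

# Adding the long put caps the downside at a known worst case
protected_pnl <- stock_pnl +
  shares * pmax(strike - prices_at_expiry, 0) -  # put payoff at expiry
  shares * premium                               # premium paid up front
```

Under these assumptions the worst case is fixed at shares * (strike - spy_price) - shares * premium, however far SPY falls, while the upside is reduced only by the premium.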