Finally, there’s one other method that belongs partially to DSP, but probably more to control theory: the Kalman filter, which finds application in trading strategies like this one.

The post Weekly Roundup 29 May – Crash Protection, Sloppy Regressions and Data Munging Skillz appeared first on Robot Wealth.

*Here’s a round-up of our new articles this week. They cover crash protection, sloppy, noisy regressions, and data-munging skills.*

Large capital losses can be devastating to your trading account.

A couple of weeks ago, we explained how you can use SPY put options to protect your portfolio against severe market downside.

*If you’re prepared to take on a little more sloppiness, there are often cheaper approaches available…*

Find Cheap Options for Effective Crash Protection Using Crash Regressions

Data manipulation skills are crucial to efficient quant trading. In the following posts, Ajet, Kris and I explain some of the skills you need to work with modern financial datasets.

It’s important not to use data from the future to analyse the past. Rolling and expanding windows are essential tools to help “walk your data forward” to avoid these issues.

When you’re working with large universes of stock data, you’ll come across a lot of challenges. This article explains a trick to help deal with missing stock data.

How to Fill Gaps in Large Stock Data Universes Using tidyr and dplyr

The kind of stuff that makes money tends to involve looking for edge across massive data sets. You can’t work with data this size on your laptop. Here’s how to start thinking about chunking it down.

Performant R Programming: Chunking a Problem into Smaller Pieces

*Have a wonderful weekend y’all…*


The post Performant R Programming: Chunking a Problem into Smaller Pieces appeared first on Robot Wealth.

When data is too big to fit into memory, one approach is to break it into smaller pieces, operate on each piece, and then join the results back together. Here’s how to do that to calculate rolling mean pairwise correlations of a large stock universe.

We’ve been using the problem of calculating mean rolling correlations of ETF constituents as a test case for solving in-memory computation limitations in R.

We’re interested in this calculation as a research input to a statistical arbitrage strategy that leverages ETF-driven trading in the constituents. We wrote about an early foray into this trade.

Previously, we introduced this problem along with the concept of profiling code for performance bottlenecks here. We can do the calculation in-memory without any trouble for a regular ETF, say XLF (the SPDR financial sector ETF), but we quickly run into problems if we want to look at SPY.

In this post, we’re going to explore one workaround for R’s in-memory limitations by splitting the problem into smaller pieces and recombining them to get our desired result.

When we performed this operation on the constituents of the XLF ETF, our largest intermediate dataframe consisted of around 3 million rows, easily within the capabilities of modern laptops.

XLF currently holds 68 constituent stocks. So for any day, we have 68 × 67 / 2 = 2,278 correlations to estimate (67 rather than 68 because we exclude the diagonal of the correlation matrix, and we halve because we only need its upper or lower triangle).

We calculated five years of rolling correlations, so we had 5 × 250 × 2,278 = 2,847,500 correlations in total.

*Piece of cake.*

The problem gets a lot more interesting if we consider the SPY ETF and its 500 constituents.

For any day, we’d have 500 × 499 / 2 = 124,750 correlations to estimate. On five years of data, that’s 5 × 250 × 124,750 = 155,937,500 correlations in total.
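As a quick sanity check on this arithmetic, here’s the counting logic in a few lines of Python (pure illustration; the figures match those above):

```python
# Number of unique pairwise correlations for n stocks: n * (n - 1) / 2
# (exclude the diagonal, keep only one triangle of the correlation matrix)
def n_pairs(n):
    return n * (n - 1) // 2

# XLF: 68 constituents, ~250 trading days per year, 5 years of data
xlf_daily = n_pairs(68)           # 2,278 correlations per day
xlf_total = 5 * 250 * xlf_daily   # 2,847,500 in total

# SPY: ~500 constituents
spy_daily = n_pairs(500)          # 124,750 correlations per day
spy_total = 5 * 250 * spy_daily   # 155,937,500 in total

print(xlf_daily, xlf_total, spy_daily, spy_total)
```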

I tried to do all of that at once in memory on my laptop…and failed.

So our original problem of designing the data wrangling pipeline to achieve our goal has now morphed into a problem of overcoming performance barriers. Let’s see what we can do about that.

First, we load some libraries and data (you can get the data used in this post from our github repo):

```r
library(tidyverse)
library(lubridate)
library(glue)
library(here)
library(microbenchmark)
library(profvis)

theme_set(theme_bw())

load(here::here("data", "spxprices_2015.RData"))

spx_prices <- spx_prices %>%
  filter(inSPX == TRUE)
```

Next, load the functions in our pipeline (we explored these in more detail in the last post):

```r
# pad any missing values
pad_missing <- function(df) {
  df %>%
    complete(ticker, date)
}

# calculate returns to each stock
get_returns <- function(df) {
  df %>%
    group_by(ticker) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(return_simple = close / dplyr::lag(close) - 1) %>%
    select(date, ticker, return_simple)
}

# full join on date
fjoin_on_date <- function(df) {
  df %>%
    full_join(df, by = "date")
}

# ditch corr matrix diagonal and one half
wrangle_combos <- function(combinations_df) {
  combinations_df %>%
    ungroup() %>%
    # drop diagonal
    filter(ticker.x != ticker.y) %>%
    # remove duplicate pairs (eg A-AAL, AAL-A)
    mutate(tickers = ifelse(
      ticker.x < ticker.y,
      glue("{ticker.x}, {ticker.y}"),
      glue("{ticker.y}, {ticker.x}")
    )) %>%
    distinct(date, tickers, .keep_all = TRUE)
}

pairwise_corrs <- function(combination_df, period) {
  combination_df %>%
    group_by(tickers) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(rollingcor = slider::slide2_dbl(
      .x = return_simple.x,
      .y = return_simple.y,
      .f = ~ cor(.x, .y),
      .before = period - 1,  # resulting window size is before + current element
      .complete = TRUE
    )) %>%
    select(date, tickers, rollingcor)
}

mean_pw_cors <- function(correlations_df) {
  correlations_df %>%
    group_by(date) %>%
    summarise(mean_pw_corr = mean(rollingcor, na.rm = TRUE))
}
```

For completeness, here’s our full pipeline with and without the intermediate objects:

```r
# with intermediate objects
spx_prices <- spx_prices %>% pad_missing()
returns_df <- spx_prices %>% get_returns()
combos_df <- returns_df %>% fjoin_on_date()
wrangled_combos_df <- combos_df %>% wrangle_combos()
corr_df <- wrangled_combos_df %>% pairwise_corrs(period = 60)
meancorr_df <- corr_df %>% mean_pw_cors()

# without intermediate objects
meancorr_df <- spx_prices %>%
  pad_missing() %>%
  get_returns() %>%
  fjoin_on_date() %>%
  wrangle_combos() %>%
  pairwise_corrs(period = 60) %>%
  mean_pw_cors()
```

We know that the bottleneck is the rolling pairwise correlations calculation. But the prior steps can also blow our memory limits, particularly if we’ve got other objects in our environment. So we’ll split the entire pipeline into chunks.

But first, let’s talk about *why* it’s valid to split our pipeline into chunks.

We can chunk our data in this case because the output – the mean of the rolling pairwise correlations – is only dependent on the window of returns data over which those correlations are calculated.

For example, if our window used 20 periods, we could calculate today’s value from the matrix of returns for our stock universe over the last 20 periods. The calculation has no other dependencies.

The implication is that we could do all of those 20-period mean correlation calculations independently, then jam all the individual outputs together and get the correct answer.

Sweet!
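We can convince ourselves of this with a toy example in Python/pandas (synthetic data, a deliberately small universe; `mean_rolling_corr` is our own illustrative helper, not the R pipeline above). Chunks computed with a window’s worth of overlap reproduce the full-sample result exactly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
wdw = 20

# Toy daily returns for 4 "stocks" over 100 days
returns = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("ABCD"))

def mean_rolling_corr(df, wdw):
    # Rolling pairwise correlations, then the cross-sectional mean per day
    corrs = df.rolling(wdw).corr()  # MultiIndex (day, ticker) x ticker
    # Average the off-diagonal entries of each day's correlation matrix
    # (the matrix is symmetric, so this equals the mean of one triangle)
    return corrs.groupby(level=0).apply(
        lambda m: (m.values.sum() - np.trace(m.values)) / (m.size - len(m))
    )

full = mean_rolling_corr(returns, wdw)

# Chunked: carry wdw - 1 rows of overlap so each chunk's first window
# is complete, then keep only the rows the chunk is responsible for
pieces = []
for start in range(0, len(returns), 25):
    chunk = returns.iloc[max(0, start - (wdw - 1)):start + 25]
    res = mean_rolling_corr(chunk, wdw)
    pieces.append(res[res.index >= start])

chunked = pd.concat(pieces)
print(np.allclose(full.dropna(), chunked.dropna()))  # True
```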

But there are a couple of things to consider.

Every time we calculate a window of returns, the first value of the window will be `NA`: we need yesterday’s price to calculate the (close-to-close) return today.

Those prices exist in our raw data, but by extracting each window, we’re artificially dropping them for our calculation. We can see this if we take a few slices of our prices:

```r
wdw <- 5

spx_prices <- spx_prices %>%
  pad_missing() %>%
  group_by(ticker) %>%
  arrange(date)

spx_prices %>%
  slice(1:wdw) %>%
  get_returns() %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = return_simple) %>%
  select(starts_with("A"))

spx_prices %>%
  slice((1 + wdw):(wdw + wdw)) %>%
  get_returns() %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = return_simple) %>%
  select(starts_with("A"))
```

If you run that code, you can see that the first row of each slice is NA.

One solution would be to do that return calculation on all the data upfront, which we can do in memory, but isn’t really in the spirit of what we’re trying to demonstrate here.

Instead, we’ll extract more price data than we need for our return windows such that each window is complete.
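The idea is easy to see in a toy pandas example (hypothetical prices; `pct_change` plays the role of our return calculation): slicing exactly `wdw` prices loses a return to the leading `NA`, while slicing `wdw + 1` prices gives a complete window of returns.

```python
import pandas as pd

prices = pd.Series([100, 101, 99, 105, 102, 103])
wdw = 3

# Naively slicing a window of wdw prices loses a return to the leading NaN
naive = prices.iloc[3:3 + wdw].pct_change()
print(naive.isna().sum())   # 1 -> only wdw - 1 usable returns

# Slicing wdw + 1 prices gives a complete window of wdw returns
padded = prices.iloc[2:3 + wdw].pct_change().dropna()
print(len(padded))          # 3 == wdw
```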

That implies that we could process our data one chunk at a time, where each chunk has a minimum size of `wdw + 1`. Let’s test that out.

```r
wdw <- 60

test <- spx_prices %>%
  slice(1:(wdw + 1))

system.time({
  test_corr <- test %>%
    get_returns() %>%
    fjoin_on_date() %>%
    wrangle_combos() %>%
    na.omit() %>%  # safe to omit NA here as we've aligned the data by padding missing values
    pairwise_corrs(period = wdw) %>%
    na.omit() %>%  # removes NA in the first window due to the complete = TRUE requirement
    mean_pw_cors()
})
#  user  system elapsed
# 50.64    1.14   51.84
```

Hmmm. That took the best part of a minute. And that’s a single calculation! Clearly that’s not going to be feasible when we have hundreds or thousands of calculations to perform.

Let’s use a larger chunk size this time so that from one chunk we can do five rolling window calculations instead of one. Is processing time additive, or are there speedups to be had by going for scale?

```r
wdw <- 60

test <- spx_prices %>%
  slice(1:(wdw + 1 + 5))

system.time({
  test_corrs <- test %>%
    get_returns() %>%
    fjoin_on_date() %>%
    wrangle_combos() %>%
    na.omit() %>%  # safe to omit NA here as we've aligned the data by padding missing values
    pairwise_corrs(period = wdw) %>%
    na.omit() %>%  # removes NA in the first window due to the complete = TRUE requirement
    mean_pw_cors()
})
#  user  system elapsed
# 76.01    1.90   78.37
```

That’s interesting. Doing five windows didn’t result in five times the computation time.

This would be partly explained by the execution path being “hot” following the first calculation, but there does seem to be some significant increase in efficiency when we do several calculations in a single chunk.

Let’s use profvis to see if we can figure out what’s going on.

First, we profile the single window case:

```r
library(profvis)

profvis({
  test <- spx_prices %>%
    slice(1:(wdw + 1))

  returns_df <- test %>% get_returns()
  combos_df <- returns_df %>% fjoin_on_date()
  wrangled_combos_df <- combos_df %>% wrangle_combos() %>% na.omit()
  corr_df <- wrangled_combos_df %>% pairwise_corrs(period = 60) %>% na.omit()
  meancorr_df <- corr_df %>% mean_pw_cors()
})
```

Next, let’s profile the multi-window case:

```r
profvis({
  test <- spx_prices %>%
    slice(1:(wdw + 1 + 5))

  returns_df <- test %>% get_returns()
  combos_df <- returns_df %>% fjoin_on_date()
  wrangled_combos_df <- combos_df %>% wrangle_combos() %>% na.omit()
  corr_df <- wrangled_combos_df %>% pairwise_corrs(period = 60) %>% na.omit()
  meancorr_df <- corr_df %>% mean_pw_cors()
})
```

In both cases, we spend a comparable amount of time in each step, except for the pairwise correlation step, in which the five-window case saw a 50% increase in time spent – which seems like a very good deal!

The time spent calculating the mean of the correlations doubled, but this accounts for only a negligible amount of the total time spent so it’s not worth worrying about.

This suggests that doing as many windows as possible in each chunk will likely give us the biggest bang for our buck.

Notice also in the previous `profvis` output that the largest amount of memory is allocated at the `wrangle_combos` step (this is not the *slowest* step, but it produces the largest dataframe). This step will give us a good proxy for estimating a sensible chunk size.

But first, we need to know how much memory we can actually allocate.

It depends on your operating system and your machine specs. I’m on Windows 10 and my machine has 32 GB of RAM.

I can use `memory.limit()` and `pryr::mem_used()` to see the memory status of my machine:

```r
library(pryr)

memory.limit()
# 32537

mem_used()
# 4.76 GB
```

Cool – apparently R can max out my RAM (not that you’d let it…) and R has used 4.76 GB. That’s mostly because I’ve got a bunch of large objects in memory from things I did previously, and which I’ll remove before I do anything serious.

We can’t get too precise when it comes to estimating how much memory an R object might require.

The memory allocation of R objects doesn’t grow linearly with size, as R requests oversized blocks of memory and then manages those blocks, rather than incrementally asking the operating system for more each time something is created.

There are also memory overheads with the data structures themselves, such as metadata and pointers to other objects in memory.
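The same kind of back-of-the-envelope estimate is available in Python via `DataFrame.memory_usage` (a rough analogue of `pryr::object_size`; the dataframe here is a made-up stand-in for our combinations data, not the real thing):

```python
import numpy as np
import pandas as pd

# A rough stand-in for one day's worth of the combinations dataframe
n = 100_000
df = pd.DataFrame({
    "date": pd.to_datetime("2020-01-02"),
    "tickers": pd.Series(["AAPL, MSFT"] * n),
    "ret_x": np.zeros(n),
    "ret_y": np.zeros(n),
})

# deep=True counts the actual string payloads, not just the pointers
bytes_used = df.memory_usage(deep=True).sum()
per_row = bytes_used / n
print(f"{bytes_used / 1e6:.1f} MB total, ~{per_row:.0f} bytes/row")

# Extrapolate: estimated size of a chunk of 250 days x 124,750 pairs
est_gb = per_row * 250 * 124_750 / 1e9
print(f"estimated chunk size: ~{est_gb:.1f} GB")
```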

Let’s make a dataframe of combinations for 250 days of stock data (about one year), calculate the size of the object, and use that to estimate whether we might be able to cope with that chunk size:

```r
test <- spx_prices %>%
  slice(1:250) %>%
  get_returns() %>%
  fjoin_on_date() %>%
  wrangle_combos()

test %>% object_size()
# 2.28 GB
```

OK – we should be able to process our data a year at a time, with a bit of wiggle room to overlap our chunks (we must overlap our chunks because the first value calculated in a new chunk needs the previous `wdw + 1` values for the first calculation). Let’s give it a shot.

```r
# helper functions
get_slice <- function(df, idx, chunk_size) {
  df %>%
    slice(idx:(idx + chunk_size))
}

process_chunk <- function(chunk) {
  chunk %>%
    get_returns() %>%
    fjoin_on_date() %>%
    wrangle_combos() %>%
    na.omit() %>%  # safe to omit NA here as we've aligned the data by padding missing values
    pairwise_corrs(period = wdw) %>%
    na.omit() %>%  # removes NA in the first window due to the complete = TRUE requirement
    mean_pw_cors()
}

# set up sequential chunk processing
wdw <- 50
chunk_days <- 250

num_days <- spx_prices %>%
  ungroup() %>%
  pull(date) %>%
  n_distinct(na.rm = TRUE) - wdw

num_chunks <- ceiling(num_days / (chunk_days + wdw + 1)) + 1

corr_list <- list()
system.time({
  for (i in c(1, c(1:num_chunks) * chunk_days)) {
    corr_list[[i]] <- spx_prices %>%
      get_slice(i, (chunk_days + wdw + 1)) %>%
      process_chunk()
  }
})
#  user  system elapsed
#  3670   109.8    3819
```

Result! That takes quite a long time, but at least we’ve managed to get the job done.

Let’s check out the final product:

```r
bind_rows(corr_list) %>%
  filter(date >= "2017-01-01", date <= "2019-01-01") %>%
  ggplot(aes(x = date, y = mean_pw_corr)) +
  geom_line() +
  labs(
    x = 'Date',
    y = 'Mean pairwise correlation',
    title = 'Rolling Mean Pairwise Correlation',
    subtitle = 'SPX constituents'
  ) +
  theme_bw()
```

Looks good!

As well as testing the output for correctness, the next step would be to consider getting the job done faster by farming the chunks out to individual workers so that they could be processed in parallel.

We can do this because each chunk is independent of the other chunks – it’s simply a matter of splitting our data appropriately, performing the calculation on each chunk, and combining the results back together.
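As a sketch of that idea (in Python rather than R, with synthetic prices, and a thread pool standing in for real worker processes; the helper names are our own):

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
# Toy price panel: 1000 days x 10 tickers, kept well away from zero
prices = pd.DataFrame(rng.normal(size=(1000, 10)).cumsum(axis=0) + 500.0)

wdw, chunk_days = 50, 250

def mean_pw_corr(price_chunk):
    # returns -> rolling pairwise correlations -> daily mean of off-diagonal
    corrs = price_chunk.pct_change().rolling(wdw).corr()
    return corrs.groupby(level=0).apply(
        lambda m: (m.values.sum() - np.trace(m.values)) / (m.size - len(m))
    )

def process_chunk(start):
    # Overlap the previous chunk by wdw rows so the first window is complete
    lo = max(0, start - wdw)
    res = mean_pw_corr(prices.iloc[lo:start + chunk_days])
    return res[res.index >= start]  # keep only this chunk's own rows

# Chunks are independent, so they can be farmed out to a pool of workers
with ThreadPoolExecutor(max_workers=4) as ex:
    pieces = list(ex.map(process_chunk, range(0, len(prices), chunk_days)))

result = pd.concat(pieces).sort_index()
print(result.tail())
```

In practice you’d use processes rather than threads for a CPU-bound job like this, but the chunk/overlap/recombine logic is identical.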

There are a bunch of other potentially cheap optimisations that we mentioned in the introductory post – we’ll explore these as well.

In this post, we split our big data problem of calculating mean rolling pairwise correlations of a large universe of stocks into manageable chunks and processed the entire job in memory using an everyday laptop.

The main obstacles that we needed to think about were the introduction of NA values upon calculation of returns from prices, correct alignment of our data by date, and the need for overlapping chunks due to the rolling nature of the operation being performed.


The post How to Fill Gaps in Large Stock Data Universes Using tidyr and dplyr appeared first on Robot Wealth.

When you’re working with large universes of stock data you’ll come across a lot of challenges:

- Stocks pay dividends and other distributions that have to be accounted for.
- Stocks are subject to splits and other corporate actions which also have to be accounted for.
- New stocks are listed all the time – you won’t have as much history for these stocks as for other stocks.
- Stocks are delisted, and many datasets do not include the price history of delisted stocks.
- Stocks can be suspended or halted for a period of time, leading to trading gaps.
- Companies grow and shrink: the “top 100 stocks by market cap” in 1990 looks very different to the same group in 2020; “growth stocks” in 1990 look very different to “growth stocks” in 2020 etc.

The challenges are well understood, but dealing with them is not always straightforward.

One significant challenge is gaps in data.

Quant analysis gets very hard if you have missing or misaligned data.

If you’re working with a universe of 1,000 stocks life is a lot easier if you have an observation for each stock for each trading date, regardless of whether it actually traded that day. That way:

- you can always do look-ups by date
- any grouped aggregations or rolling window aggregations will operate on the same date range for every ticker
- you can easily sense-check the size of your data: it should have `trading_days * number_of_stocks` rows.

If you work with “wide” matrix-like data, these challenges are obvious because you have one row for every date in your data set, and the columns represent an observation for each ticker.

We usually work with long or “tidy” data – where each observation is an observation for a stock for a given day.

How do we work productively with this data, whilst still ensuring that we fill in any gaps in our long data with `NA`s?

The tidyverse makes this very straightforward. Let me show you!

First, here’s some dummy data to illustrate the problem:

```r
library(tidyverse)

testdata <- tibble(
  date = c(1, 1, 2, 2, 2, 3, 3),
  ticker = c('AMZN', 'FB', 'AMZN', 'FB', 'TSLA', 'AMZN', 'TSLA'),
  returns = 1:7 / 100
)
testdata
```

```
## # A tibble: 7 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     2 AMZN      0.03
## 4     2 FB        0.04
## 5     2 TSLA      0.05
## 6     3 AMZN      0.06
## 7     3 TSLA      0.07
```

- TSLA is missing from date 1 as it only started trading after the others
- FB is missing from date 3 as it was put on trading halt after Citron Research hacked into Zuck’s memory banks

Ideally we want a row for every date for every stock – with returns set to NA in the case where data is missing.

That way we can always look up a price by date. And we can always be sure that any grouped operations by ticker return the same size data set.

Turns out that the `tidyr::complete` function is exactly what we’re looking for. It turns *implicit* missing values – like the returns for TSLA on date 1 and FB on date 3 – into *explicit* missing values:

```r
tidydata <- testdata %>%
  complete(date, ticker)
tidydata
```

```
## # A tibble: 9 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     1 TSLA     NA
## 4     2 AMZN      0.03
## 5     2 FB        0.04
## 6     2 TSLA      0.05
## 7     3 AMZN      0.06
## 8     3 FB       NA
## 9     3 TSLA      0.07
```

Easy!

Now we have a row for every date for every stock.

Now we can safely do grouped aggregations by ticker, on the understanding that the data is the same size for all tickers, and we’ve removed one large source of potential analysis screw-up…

```r
tidydata %>%
  group_by(ticker) %>%
  summarise(count = n())
```

```
## # A tibble: 3 x 2
##   ticker count
##   <chr>  <int>
## 1 AMZN       3
## 2 FB         3
## 3 TSLA       3
```
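If you work in Python rather than R, pandas offers a close analogue of `complete` via reindexing over the full date × ticker product (a hypothetical translation of the example above, not from the original post):

```python
import pandas as pd

testdata = pd.DataFrame({
    "date":    [1, 1, 2, 2, 2, 3, 3],
    "ticker":  ["AMZN", "FB", "AMZN", "FB", "TSLA", "AMZN", "TSLA"],
    "returns": [i / 100 for i in range(1, 8)],
})

# Pandas analogue of tidyr::complete(date, ticker): build the full
# cartesian product of dates and tickers, then reindex onto it
full_index = pd.MultiIndex.from_product(
    [testdata["date"].unique(), testdata["ticker"].unique()],
    names=["date", "ticker"],
)
tidydata = (
    testdata.set_index(["date", "ticker"])
            .reindex(full_index)   # missing combos become NaN rows
            .reset_index()
)
print(tidydata)  # 9 rows, NaN for TSLA on date 1 and FB on date 3
```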

There’s also a more verbose way to achieve our aim, and I’m showing it here because I think it’s useful to see how different functions and libraries connect and cross over in the tidyverse (right now I’m fascinated by the intersection of the `purrr::map` functions and the `dplyr::summarise_if`, `_at` and `_all` functions…but that’s a story for another time).

The verbose approach is as follows:

- use `tidyr::pivot_wider` to reshape the data to one row per date, with a column for each stock
- use `tidyr::pivot_longer` to reshape it back to its longer format.

Let’s do it step by step…

First, we make it wide:

```r
widedata <- testdata %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = returns)
widedata
```

```
## # A tibble: 3 x 4
##    date  AMZN    FB  TSLA
##   <dbl> <dbl> <dbl> <dbl>
## 1     1  0.01  0.02 NA
## 2     2  0.03  0.04  0.05
## 3     3  0.06 NA     0.07
```

Where we had missing rows, we now have `NA`.

Now we make it long again:

```r
tidydata <- widedata %>%
  pivot_longer(-date, names_to = 'ticker', values_to = 'returns')
tidydata
```

```
## # A tibble: 9 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     1 TSLA     NA
## 4     2 AMZN      0.03
## 5     2 FB        0.04
## 6     2 TSLA      0.05
## 7     3 AMZN      0.06
## 8     3 FB       NA
## 9     3 TSLA      0.07
```

And again we have a row for every date for every stock.

```r
tidydata %>%
  group_by(ticker) %>%
  summarise(count = n())
```

```
## # A tibble: 3 x 2
##   ticker count
##   <chr>  <int>
## 1 AMZN       3
## 2 FB         3
## 3 TSLA       3
```

Here’s the complete pipeline:

```r
testdata %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = returns) %>%
  pivot_longer(-date, names_to = 'ticker', values_to = 'returns')
```

```
## # A tibble: 9 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     1 TSLA     NA
## 4     2 AMZN      0.03
## 5     2 FB        0.04
## 6     2 TSLA      0.05
## 7     3 AMZN      0.06
## 8     3 FB       NA
## 9     3 TSLA      0.07
```

One of the benefits of working with longer “tidy” data is that we can have multiple variables per date/stock observation.

```r
testwider <- testdata %>%
  mutate(volume = 100:106, otherfeature = 200:206)
testwider
```

```
## # A tibble: 7 x 5
##    date ticker returns volume otherfeature
##   <dbl> <chr>    <dbl>  <int>        <int>
## 1     1 AMZN      0.01    100          200
## 2     1 FB        0.02    101          201
## 3     2 AMZN      0.03    102          202
## 4     2 FB        0.04    103          203
## 5     2 TSLA      0.05    104          204
## 6     3 AMZN      0.06    105          205
## 7     3 TSLA      0.07    106          206
```

Again, we’re missing data for TSLA on date 1 and FB on date 3, but now we’re also missing `volume` and `otherfeature` in addition to `returns`.

To use `complete`, nothing changes from earlier:

```r
testwider %>%
  complete(date, ticker)
```

```
## # A tibble: 9 x 5
##    date ticker returns volume otherfeature
##   <dbl> <chr>    <dbl>  <int>        <int>
## 1     1 AMZN      0.01    100          200
## 2     1 FB        0.02    101          201
## 3     1 TSLA     NA       NA           NA
## 4     2 AMZN      0.03    102          202
## 5     2 FB        0.04    103          203
## 6     2 TSLA      0.05    104          204
## 7     3 AMZN      0.06    105          205
## 8     3 FB       NA       NA           NA
## 9     3 TSLA      0.07    106          206
```

However if we want to pivot back and forth, we do the following:

- use `pivot_wider` to reshape the data to one row per date, with a column for each stock
- use `pivot_longer` to reshape it back to its longer format
- use `left_join` to recover the rest of the variables from the original data.

```r
testwider %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = returns) %>%
  pivot_longer(-date, names_to = 'ticker', values_to = 'returns') %>%
  left_join(testwider, by = c('date', 'ticker', 'returns'))
```

```
## # A tibble: 9 x 5
##    date ticker returns volume otherfeature
##   <dbl> <chr>    <dbl>  <int>        <int>
## 1     1 AMZN      0.01    100          200
## 2     1 FB        0.02    101          201
## 3     1 TSLA     NA       NA           NA
## 4     2 AMZN      0.03    102          202
## 5     2 FB        0.04    103          203
## 6     2 TSLA      0.05    104          204
## 7     3 AMZN      0.06    105          205
## 8     3 FB       NA       NA           NA
## 9     3 TSLA      0.07    106          206
```

- Missing values in financial data threaten the validity of quant analysis due to inadvertent misalignment
- Wide data tends to highlight such missing data
- Long data tends to hide it
- `tidyr::complete` is a succinct and efficient way to ensure that missing observations are accounted for with `NA`
- Like most tasks in R, there is more than one way to go about it, but `complete` should be your go-to function.

All the code in this post is available in our github repo where you can find lots of other recipes and tools to make your life as a quant researcher easier.

Handling a Large Universe of Stock Price Data in R: Profiling with profvis

How to Calculate Rolling Pairwise Correlations in the Tidyverse


The post Find Cheap Options for Effective Crash Protection Using Crash Regressions appeared first on Robot Wealth.

One way we can quantify a stock’s movement relative to the market index is by calculating its “beta” to the market.

To calculate the beta of MSFT to SPY (for example) we:

- calculate daily MSFT returns and daily SPY returns
- align the returns with one another
- regress MSFT returns against SPY returns.

This shows the procedure, graphically:

```r
library(tidyverse)
library(ggpmisc)

msftspyreturns %>%
  ggplot(aes(x = spy_returns, y = stock_returns, color = date)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y ~ x', color = 'red') +
  stat_poly_eq(aes(label = stat(eq.label)), formula = 'y ~ x', parse = TRUE) +
  ggtitle('Stock returns vs SPY returns')
```

The formula in the top left shows the slope of the linear regression is 1.08. So we’d say that we have estimated the beta of MSFT to be 1.08.

To make this estimation we used all available daily return observations back to 2000.
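For intuition, the same regression can be sketched in Python on synthetic returns (the numbers below are made up; `np.polyfit` stands in for R’s `lm`, and the slope of the fit is the beta estimate):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Synthetic daily returns: the "stock" moves ~1.1x the index plus noise
spy = pd.Series(rng.normal(0, 0.01, n))
stock = 1.1 * spy + rng.normal(0, 0.005, n)

# Align the two series (trivial here; real data needs an inner join on date)
df = pd.DataFrame({"spy": spy, "stock": stock}).dropna()

# Regress stock returns on index returns; the slope is the beta
beta, alpha = np.polyfit(df["spy"], df["stock"], deg=1)
print(round(beta, 2))  # close to the true value of 1.1
```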

**But we don’t have to… **

If we’re looking to buy “out of the money” put options for crash protection then we’re really just interested in the behaviour of the stock **during severe market downside.**

So, following Paul Willmott’s Crashmetrics approach, we might choose to only do the regression on days when SPY decreased by more than 4.5%.

```r
msftspyreturns %>%
  filter(spy_returns < -0.045) %>%
  ggplot(aes(x = spy_returns, y = stock_returns, color = date)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y ~ x', color = 'red') +
  stat_poly_eq(aes(label = stat(eq.label)), formula = 'y ~ x', parse = TRUE) +
  ggtitle('Stock returns vs SPY returns for SPY decline larger than -4.5%')
```

In this analysis, we’ve concentrated only on extreme negative returns.

We estimate the beta of MSFT under crash conditions to be 1.1, which is slightly higher than the estimation for the full sample.

Historically, MSFT has behaved much the same relative to the market index in crash conditions as it does in more benign times. It’s probably a reasonable choice for buying put options for crash protection.

*But, what if we could find stocks that had a high beta to the index during big down days, but lower betas during more benign market moves?*

Buying put options in these stocks may get us good protection in a crash, for a reasonable price.

For each current S&P 500 constituent we will calculate the beta to SPY ETF returns for:

- all available daily observations
- days in which SPY declined more than 4.5%

```r
library(broom)

full_betas <- spx_returns %>%
  inner_join(
    select(SPY_returns, date, spy_returns = c2c_returns_simple),
    by = 'date'
  ) %>%
  group_by(ticker) %>%
  group_modify(~ broom::tidy(lm(c2c_returns_simple ~ spy_returns, data = .x))) %>%
  filter(term == 'spy_returns') %>%
  select(ticker, beta = estimate)

crash_betas <- spx_returns %>%
  inner_join(
    select(SPY_returns, date, spy_returns = c2c_returns_simple),
    by = 'date'
  ) %>%
  filter(spy_returns <= -0.045) %>%  # keep only days SPY declined more than 4.5%
  group_by(ticker) %>%
  group_modify(~ broom::tidy(lm(c2c_returns_simple ~ spy_returns, data = .x))) %>%
  filter(term == 'spy_returns') %>%
  select(ticker, beta = estimate)
```

We’re most interested in stocks which have relatively high betas in crash times and lower betas in normal times…

Next we:

- calculate `convexity` as the difference between the crash beta coefficient and the full-sample coefficient for each stock
- sort by `convexity`, descending.

This gives us a shortlist of stocks to look at as candidates to buy put options as portfolio crash protection.
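The convexity screen itself is a one-liner once the two beta tables exist. A minimal pandas sketch, with entirely made-up beta values (the real numbers come from the regressions above):

```python
import pandas as pd

# Hypothetical full-sample and crash betas for a few tickers
full_betas = pd.DataFrame({
    "ticker": ["MSFT", "WYNN", "KO"],
    "beta":   [1.08, 1.6, 0.6],
})
crash_betas = pd.DataFrame({
    "ticker": ["MSFT", "WYNN", "KO"],
    "beta":   [1.10, 2.4, 0.7],
})

# convexity = crash beta - full-sample beta, sorted descending
shortlist = (
    full_betas.merge(crash_betas, on="ticker", suffixes=("_full", "_crash"))
              .assign(convexity=lambda d: d["beta_crash"] - d["beta_full"])
              .sort_values("convexity", ascending=False)
)
print(shortlist)  # the made-up "WYNN" tops the list
```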

Let’s look at WYNN, the casino and hotel company:

With the limited data we have available (we threw out most of our data!) it appears we may get effective downside protection by buying WYNN puts.

You might debate whether to force the regression through the origin – we chose not to, but you could argue it either way.

You might use this kind of screen in conjunction with other factors which are predictive of option value, to find cheap portfolio hedges.

The Google Sheets document below lists all data for the S&P 500 as at 25/5/2020:

How to Find Cheap Options to Buy and Expensive Options to Sell


The post Rolling and Expanding Windows For Dummies appeared first on Robot Wealth.

In today’s article, we are going to take a look at **rolling and expanding windows.**

By the end of the post, you will be able to answer these questions:

- What is a
**rolling window**? - What is an
**expanding window**? - Why are they useful?

Here is a normal window.

We use normal windows because we want to have a glimpse of the outside, the bigger the window the more of the outside we get to see.

Also as a general rule of thumb, the bigger the windows on someone’s house, the better their stock portfolio did …

Just like real windows, data windows also offer us a small glimpse into something larger.

**A moving window allows us to investigate a subset of our data.**

Oftentimes, we want to know a statistical property of our time series data, but because all of the time machines are locked up in Roswell, we can’t calculate a statistic over the full sample and use that to gain insight.

That would introduce look-ahead bias in our research.

Here’s an extreme example: we’ve plotted the TSLA price and its full-sample mean.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load TSLA OHLC
df = pd.read_csv('TSLA.csv')

# Calculate full sample mean
full_sample_mean = df['close'].mean()

# Plot
plt.plot(df['close'], label='TSLA')
plt.axhline(full_sample_mean, linestyle='--', color='red', label='Full Sample Mean')
plt.legend()
plt.show()
```

In this case, if we just bought TSLA when the price was under the mean and sold it above the mean, we would have made a killing – well, at least up to 2019…

**But the problem is that we wouldn’t have known the mean value at that point in time.**

So it’s pretty obvious why we can’t use the entire sample, but what can we do then? One way we could approach this problem is by using rolling or expanding windows.

If you’ve ever used a *Simple Moving Average*, then congratulations – you’ve used a rolling window.

Let’s say you have 20 days of stock data and you want to know the mean price of the stock for the last 5 days. What do you do?

You take the last 5 days, sum them up and divide by 5.

But what if you want to know the average of the previous 5 days for each day in your data set?

This is where rolling windows can help.

In this case, our window would have a size of 5, meaning for each point in time it contains the mean of the last 5 data points.

Let’s visualize an example with a moving window of size 5 step by step.

```python
# Random stock prices
data = [100, 101, 99, 105, 102, 103, 104, 101, 105, 102,
        99, 98, 105, 109, 105, 120, 115, 109, 105, 108]

# Create pandas DataFrame from list
df = pd.DataFrame(data, columns=['close'])

# Calculate a 5 period simple moving average
sma5 = df['close'].rolling(window=5).mean()

# Plot
plt.plot(df['close'], label='Stock Data')
plt.plot(sma5, label='SMA', color='red')
plt.legend()
plt.show()
```

So let’s break down this chart.

- We have 20 days of stock prices in this chart, labelled Stock Data.
- For each point in time (the blue dot) we want to know what’s the 5 day mean price.
- The stock data used for the calculation is the stuff between the 2 blue vertical lines.
- After we calculate the mean from 0-5 our mean for day 5 becomes available.
- To get the mean for day 6 we need to shift the window by 1 so, the data window becomes 1-6.

And this is what’s known as a Rolling Window: the size of the window is fixed. All we are doing is rolling it forward.

As you’ve probably noticed, we don’t have SMA values for points 0-4. This is because our window size (also known as the lookback period) requires at least 5 data points to do the calculation.
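To make the mechanics concrete, here’s the same 5-period mean computed by hand with a sliding window and checked against pandas’ `rolling` (this just re-derives the SMA from the chart above, nothing new):

```python
import pandas as pd

data = [100, 101, 99, 105, 102, 103, 104, 101, 105, 102,
        99, 98, 105, 109, 105, 120, 115, 109, 105, 108]
s = pd.Series(data)
wdw = 5

# Manual rolling mean: slide a fixed-size window forward one step at a time.
# The first wdw - 1 positions have no value -- not enough history yet.
manual = [None] * (wdw - 1) + [
    sum(data[i - wdw + 1:i + 1]) / wdw for i in range(wdw - 1, len(data))
]

sma = s.rolling(window=wdw).mean()
print(sma.tail())  # pandas agrees with the hand-rolled version
```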

Where rolling windows are a fixed size, expanding windows have a fixed starting point, and incorporate new data as it becomes available.

Here’s the way I like to think about this:

*“What’s the mean of the past n values at this point in time?”* – use rolling windows here.

*“What’s the mean of all the data available up to this point in time?”* – use expanding windows here.

Expanding windows have a fixed lower bound. Only the upper bound of the window is rolled forward (the window gets bigger).

Let’s visualize an expanding window with the same data from the previous plot.

import pandas as pd
import matplotlib.pyplot as plt

#Random stock prices
data = [100,101,99,105,102,103,104,101,105,102,99,98,105,109,105,120,115,109,105,108]

#Create pandas DataFrame from list
df = pd.DataFrame(data, columns=['close'])

#Calculate expanding window mean
expanding_mean = df['close'].expanding(min_periods=1).mean()

#Calculate full sample mean for reference
full_sample_mean = df['close'].mean()

#Plot
plt.plot(df['close'], label='Stock Data')
plt.plot(expanding_mean, label='Expanding Mean', color='red')
plt.axhline(full_sample_mean, label='Full Sample Mean', linestyle='--', color='red')
plt.legend()
plt.show()

You can see that in the beginning, the expanding mean is a bit jittery. That’s because we have only a few data points at the start of the plot. As we get more data, the window expands, until eventually the expanding window mean converges to the full sample mean, because the window has reached the size of the entire data set.
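To make the contrast concrete, here’s a small sketch comparing the final values of a rolling and an expanding mean on the same data:

```python
import pandas as pd

close = pd.Series([100, 101, 99, 105, 102, 103, 104, 101, 105, 102])

rolling_mean = close.rolling(window=5).mean()
expanding_mean = close.expanding(min_periods=1).mean()

# The expanding mean's final value equals the full-sample mean
assert abs(expanding_mean.iloc[-1] - close.mean()) < 1e-12

# The rolling mean's final value only uses the last 5 observations
assert abs(rolling_mean.iloc[-1] - close.iloc[-5:].mean()) < 1e-12
```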

It is important not to use data from the future to analyse the past. Rolling and expanding windows are essential tools to help “walk your data forward” to avoid these issues.

Using Digital Signal Processing in Quantitative Trading Strategies

The post Rolling and Expanding Windows For Dummies appeared first on Robot Wealth.

The post Weekly Roundup 22 May – Doubling Down in Losing Trades Like a Drunken Hedge Fund Manager appeared first on Robot Wealth.

*Here’s a round-up of our new articles this week. They cover options trading, digital signal processing, data munging and Kris’s luxurious moustache…*

Every new trader tries out a few insane trading ideas!

In a new series on the blog, Kris explores **three insane trading strategies** that tempted him back when he didn’t know any better.

First, he looks at the Martingale betting scheme.

*Is doubling your bet size after a losing trade really a good idea?*

Most approaches to **options trading** are stupid. Here is a non-stupid approach.

Options trading is just like anything else. You’ve got to buy the cheap stuff and sell the expensive stuff.

How to Find Cheap Options to Buy and Expensive Options to Sell

In this monster post, Kris explores techniques from the field of **digital signal processing**, and whether they can be useful to us as **systematic traders**.

Using Digital Signal Processing in Quantitative Trading Strategies

One of the biggest advantages you can have in **equity trading** is going broad…

*But how do you pick a broad, bias-free universe for an equity strategy backtest?*

Here’s how:

You’ll want to bookmark this one. Usually, you need to pay good money for this data.

Data manipulation skills are crucial to efficient quant trading. In the following posts, Kris and I appear in video form to talk you through the skills you need to be a **financial data munging** machine…

How to Calculate Rolling Pairwise Correlations in the Tidyverse

Handling a Large Universe of Stock Price Data in R: Profiling with profvis

We’ve recently started a new GitHub repository containing “recipes” for doing trading analysis in R.

It will include the code from our blog posts, some of our older proprietary research, and recipes you can copy and paste in your own analysis.

It’s very early days, but we’re filling it out each week.

You can find it here. Be sure to “star” the repository.

*If you like this stuff, find it annoying, or connect with it emotionally in any significant way, please share it with your friends…*

The post Weekly Roundup 22 May – Doubling Down in Losing Trades Like a Drunken Hedge Fund Manager appeared first on Robot Wealth.

The post Handling a Large Universe of Stock Price Data in R: Profiling with profvis appeared first on Robot Wealth.

Recently, we wrote about calculating mean rolling pairwise correlations between the constituent stocks of an ETF.

The tidyverse tools `dplyr` and `slider` solve this somewhat painful data wrangling operation about as elegantly and intuitively as possible.

Why did you want to do that?

We’re building a statistical arbitrage strategy that relies on indexation-driven trading in the constituents. We wrote about an early foray into this trade – we’re now taking the concepts a bit further.

But what about the problem of scaling it up?

When we performed this operation on the constituents of the XLF ETF, our largest intermediate dataframe consisted of around 3 million rows, easily within the capabilities of modern laptops.

XLF currently holds 68 constituent stocks. So for any day, we have 68*67/2 = 2,278 correlations to estimate (67 because we exclude the diagonal of the correlation matrix, and we halve that because we only need its upper or lower triangle).

We calculated five years of rolling correlations, so we had 5*250*2,278 = 2,847,500 correlations in total.

*Piece of cake.*

The problem gets a lot more interesting if we consider the SPY ETF and its 500 constituents.

For any day, we’d have 500*499/2 = 124,750 correlations to estimate. On five years of data, that’s 5*250*124,750 = 155,937,500 correlations in total.
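A quick sanity check of this arithmetic (a throwaway Python sketch; the analysis itself is in R):

```python
def n_pairwise(n):
    """Unique off-diagonal pairs in an n x n correlation matrix."""
    return n * (n - 1) // 2

# XLF: 68 constituents
assert n_pairwise(68) == 2278
assert 5 * 250 * n_pairwise(68) == 2_847_500

# SPY: 500 constituents -- roughly 55x the XLF problem
assert n_pairwise(500) == 124_750
assert 5 * 250 * n_pairwise(500) == 155_937_500
```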

I tried to do all of that at once in memory on my laptop…and failed.

So our original problem of designing the data wrangling pipeline to achieve our goal has now morphed into a problem of overcoming performance barriers.

There are a number of strategies that could be employed to solve this. So over a series of posts, we’ll explore the concept of writing performant R code via various approaches to solving our problem:

- Getting more RAM by renting a big virtual machine in the cloud
- Splitting the data into chunks and doing it sequentially in local memory (RAM and hard disk)
- Parallelising the above, say using `foreach` – but speed isn’t really the issue here, it’s RAM
- Considering different data structures – `data.table`, `Matrix` – which will carry less overhead than a regular `data.frame` or `tibble`
- Horizontal scaling with the `future` package
- R packages for dealing with memory issues: `ff`, `bigmemory`, `MonetDB.R`
- Doing the calculation in Rcpp, since representing data in C++ carries less overhead than in R
- Using Spark, a cluster computing platform for farming the problem out to multiple machines
- Using Dataflow, a serverless batch data processing tool available on commercial cloud providers
- Using BigQuery, a Google Cloud database for storing and querying massive datasets (there are equivalents on AWS)
- Using Cloud Run, a Google Cloud managed compute platform for scaling containerised applications

These all have their own tradeoffs, from re-writing code for compatibility reasons (and the additional burden of testing for correctness against the original algorithm), to foregoing interactivity, to operating system compatibility, to cost.
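To preview the chunking idea (the second option above), here’s a minimal, language-agnostic sketch in Python. The correlation values are placeholders – the point is that we only ever hold one chunk and a running summary in memory, never the full result set:

```python
from itertools import combinations

def chunked(seq, size):
    """Yield successive slices of at most `size` elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# 500 tickers -> 124,750 unique pairs; pairs x dates is far too big
# to materialise as one dataframe in memory
tickers = [f"S{i:03d}" for i in range(500)]
pairs = list(combinations(tickers, 2))
assert len(pairs) == 124_750

# Accumulate a running sum/count per chunk instead of one giant frame
total, count = 0.0, 0
for chunk in chunked(pairs, 10_000):
    # placeholder: pretend each pair's correlation estimate is 0.5
    corrs = [0.5 for _ in chunk]
    total += sum(corrs)
    count += len(corrs)

mean_corr = total / count
assert count == 124_750
assert mean_corr == 0.5
```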

We’ll explore these in some detail over the coming days and weeks.

It’s all too easy to focus on optimising code before you really should. That can be an incredible waste of time: the hours spent can more than offset any speedups your optimisation efforts deliver.

In general, the following steps will mostly prevent you from doing this:

- Get the code working (we did that in the previous post)
- Profile (we’ll do that shortly)
- Fix obvious things, for instance, pre-allocating large objects instead of growing them in a loop
- Weigh options for optimising code, if required:
  - do nothing
  - implement `lapply`, or better, vectorisation (noting that it can consume a lot of memory)
  - Rcpp
  - parallelisation
  - bytecode compiler for small speedups
  - some combination of the above
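The vectorisation option above is worth a quick illustration. Here’s a minimal Python sketch (the post’s pipeline is in R, but the principle is identical): an element-by-element loop versus a single vectorised call into optimised native code.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(size=10_000)

# Loop version: accumulate element by element in interpreted code
def loop_sum_squares(x):
    total = 0.0
    for v in x:
        total += v * v
    return total

# Vectorised version: one call, same result, far less interpreter overhead
vectorised = float(np.dot(returns, returns))

assert math.isclose(loop_sum_squares(returns), vectorised, rel_tol=1e-9)
```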

Profiling is the process of identifying bottlenecks in code.

Before we even think about optimising our code, we need to know what we should be optimising. Profiling is the detective work that helps you understand where your development time is best spent.

And while premature optimisation is to be avoided, performance *does* matter.

Bottlenecks in your code might surprise you. We all want to be better programmers – profiling code provides useful lessons and is an opportunity to identify and fix bad practices.

So even if you delay optimising your code (you definitely should delay it), there’s little overhead and lots to be gained from profiling your code as you develop it.

And sometimes, you just have to, in order to fix a critical bottleneck.

In our case, we essentially know where the bottleneck is – it’s the enormous data frame of pairwise correlations that we try to compute in memory. But still, we’ll profile the code to demonstrate how it’s done, and to ensure we don’t have any surprises.

Getting simple timings as a basic measure of performance is straightforward.

- `system.time()` is useful for timing blocks of code by running them once – but timing one evaluation can be misleading.
- `Rprof()` can be used for timing the execution of functions and statements.
- `microbenchmark` is a de facto standard among many R users. It provides statistical timing measurements and has some nice plot outputs. There’s also `rbenchmark`.
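For readers working in Python, the standard library’s `timeit` provides the same kind of repeated, statistical timing. A minimal sketch (not part of the original R workflow):

```python
import timeit

setup = "xs = list(range(10_000))"

# Time two equivalent expressions; repeat to smooth out machine noise,
# and take the minimum of the repeats as the most stable estimate
gen_t = min(timeit.repeat("sum(x * x for x in xs)", setup=setup, number=50, repeat=3))
lst_t = min(timeit.repeat("sum([x * x for x in xs])", setup=setup, number=50, repeat=3))

print(f"generator: {gen_t:.4f}s  list comp: {lst_t:.4f}s")
```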

We’ll start by timing our code for calculating mean rolling pairwise correlations using `microbenchmark`.

First, we load our packages and data (you can get the data from our GitHub repository – which if you clone will enable you to run the relevant Rmd document directly). It consists of prices for SPX constituents since 2015; we filter on a flag we added to indicate whether a particular stock was in the index on a particular date:

library(tidyverse)
library(lubridate)
library(glue)
library(here)
library(microbenchmark)
library(profvis)

theme_set(theme_bw())

load(here::here("data", "spxprices_2015.RData"))

spx_prices <- spx_prices %>%
  filter(inSPX == TRUE)

Next we “functionise” the steps in our pipeline of operations. This will make profiling more straightforward, and the output of `microbenchmark` easier to interpret:

# calculate returns to each stock
get_returns <- function(df) {
  df %>%
    group_by(ticker) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(return = close / dplyr::lag(close) - 1) %>%
    select(date, ticker, return)
}

# full join on date
fjoin_on_date <- function(df) {
  df %>%
    full_join(df, by = "date")
}

# ditch corr matrix diagonal, one half
wrangle_combos <- function(combinations_df) {
  combinations_df %>%
    ungroup() %>%
    # drop diagonal
    filter(ticker.x != ticker.y) %>%
    # remove duplicate pairs (eg A-AAL, AAL-A)
    mutate(tickers = ifelse(ticker.x < ticker.y,
                            glue("{ticker.x}, {ticker.y}"),
                            glue("{ticker.y}, {ticker.x}"))) %>%
    distinct(date, tickers, .keep_all = TRUE)
}

pairwise_corrs <- function(combination_df, period) {
  combination_df %>%
    group_by(tickers) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(rollingcor = slider::slide2_dbl(
      .x = return.x,
      .y = return.y,
      .f = ~cor(.x, .y),
      .before = period,
      .complete = TRUE)
    ) %>%
    select(date, tickers, rollingcor)
}

mean_pw_cors <- function(correlations_df) {
  correlations_df %>%
    group_by(date) %>%
    summarise(mean_pw_corr = mean(rollingcor, na.rm = TRUE))
}
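For intuition, the heart of `pairwise_corrs` – a rolling correlation between two return series – can be sketched in a few lines of pandas (an illustrative analogue of the R pipeline, not the post’s actual code; the data here is random):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(size=(300, 2)), columns=["A", "B"])

# 60-period rolling correlation between the two return series
roll_corr = returns["A"].rolling(window=60).corr(returns["B"])

# The first window-1 values are undefined; the rest are valid correlations
assert roll_corr.iloc[:59].isna().all()
assert roll_corr.iloc[59:].between(-1, 1).all()
```

The full problem is this calculation repeated for every one of the 124,750 SPY constituent pairs, which is where memory becomes the constraint.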

Now, let’s see what happens if we try to run the pipeline on the full dataset:

returns_df <- get_returns(spx_prices)
combos_df <- fjoin_on_date(returns_df)
wrangled_combos_df <- wrangle_combos(combos_df)
corr_df <- pairwise_corrs(wrangled_combos_df, period = 60)
meancorr_df <- mean_pw_cors(corr_df)

# Error: cannot allocate vector of size 3.7 Gb

I can’t allocate enough RAM to hold one of the dataframes in memory.

I could change R’s memory allocation (by default on Windows I *think* it’s 4GB) by doing `memory.limit(size = new_size)`, but from experience, I know that simply going large won’t solve this particular problem, at least on my machine.

Now if we ask `microbenchmark` to time our code for us, we’ll get some insights into where this breaks down.

Admittedly this is something of an awkward use-case – the typical use-case is comparing the speed of different implementations of the same thing, but here we’re using `microbenchmark` to get insight into the timings of each key step in our pipeline, to infer potential bottlenecks (which may or may not be related to our memory issue, but it’s a decent starting point).

As the process is quite long-running, we only run it twice (the default is 100), measure the output in seconds, and only operate on a subset of our data:

prices_subset <- spx_prices %>%
  filter(date >= "2019-07-01", date < "2020-01-01")

mb <- microbenchmark(
  returns_df <- get_returns(prices_subset),
  combos_df <- fjoin_on_date(returns_df),
  wrangled_combos_df <- wrangle_combos(combos_df),
  corr_df <- pairwise_corrs(wrangled_combos_df, period = 60),
  meancorr_df <- mean_pw_cors(corr_df),
  times = 2,
  unit = "s",
  control = list(order = "block", warmup = 1)
)

mb

## Unit: seconds
## expr min lq mean
## returns_df <- get_returns(prices_subset) 0.0493790 0.0493790 0.05453255
## combos_df <- fjoin_on_date(returns_df) 2.3088978 2.3088978 2.34896230
## wrangled_combos_df <- wrangle_combos(combos_df) 54.3862645 54.3862645 55.17421270
## corr_df <- pairwise_corrs(wrangled_combos_df, period = 60) 262.8221508 262.8221508 267.15088150
## meancorr_df <- mean_pw_cors(corr_df) 0.8372799 0.8372799 0.84102875
## median uq max neval cld
## 0.05453255 0.0596861 0.0596861 2 a
## 2.34896230 2.3890268 2.3890268 2 a
## 55.17421270 55.9621609 55.9621609 2 b
## 267.15088150 271.4796122 271.4796122 2 c
## 0.84102875 0.8447776 0.8447776 2 a

The bottleneck is quite obvious: it’s the operation that calculates the rolling pairwise correlations. No surprise there.

If you call `boxplot` on the output of `microbenchmark`, you get a nice graphical view of the results:

boxplot(mb, unit = "s", log = FALSE)

So now we’ve got some basic insight into which operations represent bottlenecks in our pipeline.

However, we often want to get more detailed information. For instance, `microbenchmark` only tells us about time; it doesn’t tell us about memory usage – and in this case, running out of RAM is the important thing.

Looking into this requires a different profiling tool. `Rprof` ships with base R, but recently I’ve been using `profvis`, which has a highly interpretable graphical output.

`profvis` is simple to use: just wrap an expression or function call in `profvis({...})` and observe the HTML output, which opens in a new tab in RStudio, and which you can save via the `prof_output` argument.

`profvis` can handle a block of code:

profvis({
  returns_df <- get_returns(prices_subset)
  combos_df <- fjoin_on_date(returns_df)
  wrangled_combos_df <- wrangle_combos(combos_df)
  corr_df <- pairwise_corrs(wrangled_combos_df, period = 60)
  meancorr_df <- mean_pw_cors(corr_df)
}, prof_output = 'profile_out.Rprof')

The graphical output shows the time spent on each line of code in milliseconds. The graph is interactive within R Studio – you can zoom and move around to get better views.

You can also see each line of code and the memory allocated and deallocated (negative values), and the time spent on each line:

We can see that the vast majority of time was spent in the `pairwise_corrs` function, and that it allocated about 18.5GB of memory.

It’s no surprise that this function is our bottleneck, but seeing the RAM usage quantified like that is certainly useful. Remember also that we’re only using a small subset of our data here. Our actual problem is much bigger.

So it’s quite clear that in order to solve this particular problem, we need to find a way around R’s memory limitations with respect to the `pairwise_corrs` operation.

Finally, we should take a look at the rolling pairwise correlations that we calculated for the subset of our larger problem, since we really like visualisation (and to check that things at least look superficially sensible):

meancorr_df %>%
  na.omit() %>%
  ggplot(aes(x = date, y = mean_pw_corr)) +
  geom_line() +
  labs(
    x = "Date",
    y = "Mean Pairwise Correlation",
    title = "Rolling Mean Pairwise Correlation",
    subtitle = "SPX Constituents"
  )

In this post, we introduced the idea of scaling up our mean rolling pairwise correlation operation to accommodate the constituents of the S&P 500. We listed some options for doing so, noting that they all involve various trade-offs.

Basic profiling indicated that, as expected, the operation that performs the pairwise rolling correlations is the bottleneck, allocating over 18 GB of RAM even on a small subset of the total problem.

Since R computations are by default carried out in-memory, we have a problem. The following posts will explore various solutions.

How to Calculate Rolling Pairwise Correlations in the Tidyverse

How to Run Trading Algorithms on Google Cloud Platform in 6 Easy Steps

The post Handling a Large Universe of Stock Price Data in R: Profiling with profvis appeared first on Robot Wealth.

The post How to Wrangle JSON Data in R with jsonlite, purr and dplyr appeared first on Robot Wealth.

Working with modern APIs, you will often have to wrangle data in JSON format.

This article presents some tools and recipes for working with JSON data with R in the tidyverse.

We’ll use `purrr::map` functions to extract and transform our JSON data. And we’ll provide intuitive examples of the cross-overs and differences between `purrr` and `dplyr`.

library(tidyverse)
library(here)
library(kableExtra)

pretty_print <- function(df, num_rows) {
  df %>%
    head(num_rows) %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}

This data has been converted from raw JSON to nested named lists using `jsonlite::fromJSON` with the `simplifyVector` argument set to `FALSE` (that is, all elements are converted to named lists).

The data consists of market data for SPY options with various strikes and expiries. We got it from the options data vendor Orats, whose data API I enjoy almost as much as their orange website.

If you want to follow along, you can sign-up for a free trial of the API, and load the data directly from the Orats API with the following code *(just define your API key in the ORATS_token variable):*

library(httr)

ORATS_token <- 'YOUR_KEY_HERE'

res <- GET('https://api.orats.io/data/strikes?tickers=SPY', add_headers(Authorization = ORATS_token))

if (http_type(res) == 'application/json') {
  strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE)
} else {
  stop('No json returned')
}

if (http_error(res)) {
  stop(paste('API request error:', status_code(res), odata$message, odata$documentation_url))
}

Now, if you want to read this data directly into a nicely formatted dataframe, replace the line:

`strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE)`

with

`strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = TRUE, flatten = TRUE)`

However, you should know that it isn’t always possible to coerce JSON into nicely shaped dataframes this easily – often the raw JSON won’t contain primitive types, or will have nested key-value pairs on the same level as your desired dataframe columns, to name a couple of obstacles.

In that case, it’s useful to have some tools – like the ones in this post – for wrangling your source data.
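The same pre-flight checks we perform below in R – confirm the records are homogeneous, then build columns from them – look like this in Python, for readers coming from that side (a sketch; the field names are illustrative, borrowed from the strikes data):

```python
import json

# A nested payload shaped like the Orats response (structure assumed for illustration)
raw = json.loads("""
{"data": [
  {"ticker": "SPY", "strike": 140, "callBidPrice": 152.37},
  {"ticker": "SPY", "strike": 145, "callBidPrice": 147.37}
]}
""")

strikes = raw["data"]

# Check every record has the same keys before building a table
keysets = {tuple(sorted(rec)) for rec in strikes}
assert len(keysets) == 1

# Build a column-oriented table from the homogeneous records
cols = {k: [rec[k] for rec in strikes] for k in strikes[0]}
assert cols["strike"] == [140, 145]
```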

So let’s look at that `strikes` object, and show how we can wrangle it into something useful…

str(strikes, max.level = 1)

## List of 1
##  $ data:List of 2440

This tells us we have a component named “data”. Let’s look at that a little more closely:

str(strikes$data, max.level = 1, list.len = 10)

## List of 2440
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##   [list output truncated]

This suggests we have homogenous lists of 40 elements each (an assumption we’ll check shortly).

Let’s look at one of those lists:

str(strikes$data[[1]])

## List of 40
##  $ ticker          : chr "SPY"
##  $ tradeDate       : chr "2020-05-19"
##  $ expirDate       : chr "2020-05-29"
##  $ dte             : int 11
##  $ strike          : int 140
##  $ stockPrice      : num 293
##  $ callVolume      : int 0
##  $ callOpenInterest: int 0
##  $ callBidSize     : int 20
##  $ callAskSize     : int 23
##  $ putVolume       : int 0
##  $ putOpenInterest : int 2312
##  $ putBidSize      : int 0
##  $ putAskSize      : int 7117
##  $ callBidPrice    : num 152
##  $ callValue       : num 153
##  $ callAskPrice    : num 153
##  $ putBidPrice     : int 0
##  $ putValue        : num 1.12e-25
##  $ putAskPrice     : num 0.01
##  $ callBidIv       : int 0
##  $ callMidIv       : num 0.98
##  $ callAskIv       : num 1.96
##  $ smvVol          : num 0.476
##  $ putBidIv        : int 0
##  $ putMidIv        : num 0.709
##  $ putAskIv        : num 1.42
##  $ residualRate    : num -0.00652
##  $ delta           : int 1
##  $ gamma           : num 9.45e-16
##  $ theta           : num -0.00288
##  $ vega            : num 2e-11
##  $ rho             : num 0.0384
##  $ phi             : num -0.0802
##  $ driftlessTheta  : num -6.07e-09
##  $ extSmvVol       : num 0.478
##  $ extCallValue    : num 153
##  $ extPutValue     : num 1.77e-25
##  $ spotPrice       : num 293
##  $ updatedAt       : chr "2020-05-19 20:02:33"

All these elements look like they can be easily handled. For instance, I don’t see any more deeply nested lists, weird missing values, or anything else that looks difficult.

So now I’ll pull out the interesting bit:

strikes <- strikes[["data"]]

length(strikes)

## [1] 2440

This is where we’ll check that our sublists are indeed homogeneously named, as we assumed above:

strikes %>%
  map(names) %>% # this applies the base R function names to each sublist, and returns a list of lists with the output
  unique() %>%
  length() == 1

## [1] TRUE

We should also check the variable types are consistent as we need single types in each column of a dataframe (although R will warn if it is forced to coerce one type to another).

Here’s an interesting thing. It uses a nested `purrr::map` to get the variable types for each element of each sublist. They’re actually not identical according to this:

strikes %>%
  map(.f = ~{map_chr(.x, .f = class)}) %>%
  unique() %>%
  length()

## [1] 39

This is actually a little puzzling. Inspecting the individual objects suggests that we do have identical types. If anyone has anything to say about this, I’d love to hear about it in the comments. In any event, after we make our dataframe, we should check that the variable types are as expected.

Now, to that dataframe…

`purrr::flatten` removes one level of hierarchy from a list (`unlist` removes them all). Here, `flatten` is applied to each sub-list in `strikes` via `purrr::map_df`.

We use the variant `flatten_df`, which returns each sublist as a dataframe, making it compatible with `purrr::map_df`, which requires a function that returns a dataframe.

strikes_df <- strikes %>%
  map_df(flatten_df)

strikes_df %>%
  pretty_print(30)

ticker | tradeDate | expirDate | dte | strike | stockPrice | callVolume | callOpenInterest | callBidSize | callAskSize | putVolume | putOpenInterest | putBidSize | putAskSize | callBidPrice | callValue | callAskPrice | putBidPrice | putValue | putAskPrice | callBidIv | callMidIv | callAskIv | smvVol | putBidIv | putMidIv | putAskIv | residualRate | delta | gamma | theta | vega | rho | phi | driftlessTheta | extSmvVol | extCallValue | extPutValue | spotPrice | updatedAt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SPY | 2020-05-19 | 2020-05-29 | 11 | 140 | 292.55 | 0 | 0 | 20 | 23 | 0 | 2312 | 0 | 7117 | 152.37 | 152.5790 | 152.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.980149 | 1.960300 | 0.476046 | 0.000000 | 0.708976 | 1.417950 | -0.0065171 | 1.000000 | 0.0000000 | -0.0028827 | 0.0000000 | 0.0383618 | -0.0801790 | 0.0000000 | 0.478157 | 152.5790 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 145 | 292.55 | 0 | 0 | 20 | 23 | 0 | 2322 | 0 | 5703 | 147.37 | 147.5800 | 147.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.936511 | 1.873020 | 0.476046 | 0.000000 | 0.676907 | 1.353810 | -0.0065171 | 1.000000 | 0.0000000 | -0.0029856 | 0.0000000 | 0.0397319 | -0.0801790 | 0.0000000 | 0.478157 | 147.5800 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 150 | 292.55 | 0 | 1 | 20 | 18 | 0 | 1912 | 0 | 5703 | 142.39 | 142.5810 | 142.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.894396 | 1.788790 | 0.476046 | 0.000000 | 0.645945 | 1.291890 | -0.0065171 | 1.000000 | 0.0000000 | -0.0030886 | 0.0000000 | 0.0411020 | -0.0801790 | 0.0000000 | 0.478157 | 142.5810 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 155 | 292.55 | 0 | 0 | 22 | 23 | 0 | 1483 | 0 | 5583 | 137.36 | 137.5820 | 137.81 | 0.00 | 0.0000000 | 0.01 | 0 | 0.844663 | 1.689330 | 0.476046 | 0.000000 | 0.616016 | 1.232030 | -0.0065171 | 1.000000 | 0.0000000 | -0.0031915 | 0.0000000 | 0.0424720 | -0.0801790 | 0.0000000 | 0.478157 | 137.5820 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 160 | 292.55 | 0 | 0 | 20 | 22 | 0 | 929 | 0 | 7016 | 132.39 | 132.5830 | 132.85 | 0.00 | 0.0000000 | 0.01 | 0 | 0.823228 | 1.646460 | 0.476046 | 0.000000 | 0.587053 | 1.174110 | -0.0065171 | 1.000000 | 0.0000000 | -0.0032945 | 0.0000000 | 0.0438421 | -0.0801790 | 0.0000000 | 0.478157 | 132.5830 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 165 | 292.55 | 0 | 0 | 20 | 22 | 0 | 1874 | 0 | 5943 | 127.39 | 127.5840 | 127.85 | 0.00 | 0.0000000 | 0.01 | 0 | 0.784980 | 1.569960 | 0.476046 | 0.000000 | 0.558997 | 1.117990 | -0.0065171 | 1.000000 | 0.0000000 | -0.0033974 | 0.0000000 | 0.0452122 | -0.0801790 | 0.0000000 | 0.478157 | 127.5840 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 170 | 292.55 | 0 | 0 | 22 | 23 | 0 | 4055 | 0 | 7407 | 122.37 | 122.5850 | 122.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.739266 | 1.478530 | 0.476046 | 0.000000 | 0.531711 | 1.063420 | -0.0065171 | 1.000000 | 0.0000000 | -0.0035004 | 0.0000000 | 0.0465822 | -0.0801790 | 0.0000000 | 0.478157 | 122.5850 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 175 | 292.55 | 0 | 0 | 22 | 23 | 0 | 2992 | 0 | 5103 | 117.36 | 117.5860 | 117.81 | 0.00 | 0.0000000 | 0.01 | 0 | 0.694933 | 1.389870 | 0.476046 | 0.000000 | 0.504757 | 1.009510 | -0.0065171 | 1.000000 | 0.0000000 | -0.0036034 | 0.0000000 | 0.0479523 | -0.0801790 | 0.0000000 | 0.478157 | 117.5860 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 180 | 292.55 | 0 | 0 | 20 | 31 | 0 | 4320 | 0 | 9302 | 112.39 | 112.5870 | 112.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.668237 | 1.336470 | 0.476046 | 0.000000 | 0.478572 | 0.957144 | -0.0065171 | 1.000000 | 0.0000000 | -0.0037063 | 0.0000000 | 0.0493224 | -0.0801790 | 0.0000000 | 0.478157 | 112.5870 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 185 | 292.55 | 0 | 0 | 20 | 31 | 1200 | 5863 | 0 | 4686 | 107.39 | 107.5880 | 107.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.633658 | 1.267320 | 0.476046 | 0.000000 | 0.453113 | 0.906225 | -0.0065171 | 1.000000 | 0.0000000 | -0.0038093 | 0.0000000 | 0.0506924 | -0.0801790 | 0.0000000 | 0.478157 | 107.5880 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 190 | 292.55 | 0 | 2 | 8 | 31 | 5 | 8253 | 0 | 4199 | 102.39 | 102.5890 | 102.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.600019 | 1.200040 | 0.476046 | 0.000000 | 0.428340 | 0.856680 | -0.0065171 | 1.000000 | 0.0000000 | -0.0039123 | 0.0000000 | 0.0520625 | -0.0801790 | -0.0000001 | 0.478157 | 102.5890 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 195 | 292.55 | 0 | 10 | 20 | 31 | 130 | 5417 | 0 | 5965 | 97.39 | 97.5902 | 97.83 | 0.00 | 0.0000002 | 0.01 | 0 | 0.567271 | 1.134540 | 0.476046 | 0.000000 | 0.404219 | 0.808438 | -0.0065171 | 1.000000 | 0.0000000 | -0.0040155 | 0.0000002 | 0.0534326 | -0.0801790 | -0.0000003 | 0.478157 | 97.5902 | 0.0000003 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 200 | 292.55 | 0 | 72 | 8 | 31 | 2139 | 11544 | 0 | 3657 | 92.39 | 92.5912 | 92.83 | 0.00 | 0.0000015 | 0.01 | 0 | 0.535369 | 1.070740 | 0.476046 | 0.000000 | 0.380715 | 0.761431 | -0.0065171 | 1.000000 | 0.0000001 | -0.0041200 | 0.0000008 | 0.0548026 | -0.0801790 | -0.0000019 | 0.478157 | 92.5912 | 0.0000016 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 205 | 292.55 | 0 | 20 | 23 | 20 | 0 | 2035 | 0 | 2949 | 87.38 | 87.5922 | 87.84 | 0.00 | 0.0000084 | 0.01 | 0 | 0.507409 | 1.014820 | 0.476046 | 0.000000 | 0.357803 | 0.715607 | -0.0065171 | 0.999998 | 0.0000004 | -0.0042307 | 0.0000037 | 0.0561726 | -0.0801789 | -0.0000097 | 0.478157 | 87.5922 | 0.0000092 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 210 | 292.55 | 0 | 7 | 23 | 18 | 177 | 2745 | 2433 | 6196 | 82.39 | 82.5933 | 82.83 | 0.01 | 0.0000393 | 0.02 | 0 | 0.473935 | 0.947870 | 0.476046 | 0.670012 | 0.691367 | 0.712722 | -0.0065171 | 0.999992 | 0.0000015 | -0.0043642 | 0.0000158 | 0.0575421 | -0.0801784 | -0.0000402 | 0.478157 | 82.5933 | 0.0000438 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 215 | 292.55 | 0 | 6 | 100 | 100 | 182 | 3439 | 5133 | 4919 | 77.40 | 77.5945 | 77.83 | 0.01 | 0.0001671 | 0.02 | 0 | 0.444328 | 0.888655 | 0.476046 | 0.625396 | 0.645686 | 0.665975 | -0.0065171 | 0.999969 | 0.0000056 | -0.0045768 | 0.0000572 | 0.0589103 | -0.0801765 | -0.0001500 | 0.478157 | 77.5945 | 0.0001791 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 220 | 292.55 | 0 | 7 | 23 | 31 | 253 | 67134 | 6636 | 4014 | 72.39 | 72.5959 | 72.84 | 0.01 | 0.0006089 | 0.02 | 0 | 0.417191 | 0.834383 | 0.476046 | 0.581817 | 0.600771 | 0.619725 | -0.0065171 | 0.999894 | 0.0000181 | -0.0050092 | 0.0001931 | 0.0602742 | -0.0801705 | -0.0004800 | 0.478157 | 72.5960 | 0.0006497 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 225 | 292.55 | 0 | 26 | 20 | 8 | 72 | 21021 | 4297 | 5090 | 67.40 | 67.5984 | 67.84 | 0.02 | 0.0020239 | 0.03 | 0 | 0.388099 | 0.776197 | 0.476046 | 0.574524 | 0.586509 | 0.598494 | -0.0065171 | 0.999669 | 0.0000522 | -0.0060175 | 0.0005729 | 0.0616259 | -0.0801525 | -0.0013866 | 0.478157 | 67.5985 | 0.0021449 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 230 | 292.55 | 1 | 3 | 22 | 20 | 496 | 60686 | 2857 | 5356 | 62.41 | 62.6033 | 62.86 | 0.03 | 0.0058940 | 0.04 | 0 | 0.364226 | 0.728453 | 0.476046 | 0.552809 | 0.561179 | 0.569550 | -0.0065171 | 0.999099 | 0.0001322 | -0.0082422 | 0.0015687 | 0.0629492 | -0.0801068 | -0.0035119 | 0.478157 | 62.6036 | 0.0062251 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 235 | 292.55 | 0 | 10 | 50 | 50 | 63 | 4115 | 3681 | 10023 | 57.43 | 57.6142 | 57.81 | 0.04 | 0.0157637 | 0.05 | 0 | 0.324739 | 0.649478 | 0.476046 | 0.523494 | 0.530320 | 0.537145 | -0.0065171 | 0.997754 | 0.0003047 | -0.0129200 | 0.0038339 | 0.0642087 | -0.0799990 | -0.0080951 | 0.478157 | 57.6150 | 0.0165731 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 240 | 292.55 | 1 | 279 | 50 | 50 | 1170 | 39193 | 3012 | 9390 | 52.46 | 52.6379 | 52.83 | 0.06 | 0.0384204 | 0.07 | 0 | 0.301680 | 0.603359 | 0.476046 | 0.501376 | 0.505950 | 0.510524 | -0.0065171 | 0.994905 | 0.0006370 | -0.0218349 | 0.0066480 | 0.0653442 | -0.0797706 | -0.0169246 | 0.478157 | 52.6393 | 0.0398239 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 245 | 292.55 | 0 | 438 | 50 | 50 | 119 | 10747 | 2182 | 9653 | 47.48 | 47.6773 | 47.89 | 0.09 | 0.0768502 | 0.10 | 0 | 0.285231 | 0.570462 | 0.469306 | 0.478714 | 0.482291 | 0.485867 | -0.0065171 | 0.990302 | 0.0011422 | -0.0344797 | 0.0139756 | 0.0663346 | -0.0794014 | -0.0294951 | 0.475861 | 47.6865 | 0.0860111 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 247 | 292.55 | 0 | 50 | 50 | 50 | 5 | 1290 | 5483 | 1760 | 45.50 | 45.6971 | 45.89 | 0.10 | 0.0962040 | 0.11 | 0 | 0.273995 | 0.547991 | 0.463901 | 0.466226 | 0.469087 | 0.471948 | -0.0065171 | 0.987998 | 0.0013914 | -0.0401191 | 0.0140348 | 0.0666926 | -0.0792168 | -0.0351075 | 0.473638 | 45.7138 | 0.1129540 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 248 | 292.55 | 0 | 15 | 50 | 50 | 16 | 2129 | 3666 | 700 | 44.52 | 44.7063 | 44.92 | 0.11 | 0.1052830 | 0.12 | 0 | 0.273081 | 0.546162 | 0.459372 | 0.462066 | 0.464921 | 0.467776 | -0.0065171 | 0.986881 | 0.0015163 | -0.0425395 | 0.0175110 | 0.0668746 | -0.0791272 | -0.0375142 | 0.471728 | 44.7280 | 0.1269210 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 249 | 292.55 | 0 | 4 | 50 | 50 | 10 | 13505 | 4517 | 800 | 43.51 | 43.7161 | 43.93 | 0.12 | 0.1148780 | 0.13 | 0 | 0.269065 | 0.538129 | 0.455004 | 0.457923 | 0.460772 | 0.463621 | -0.0065171 | 0.985691 | 0.0016497 | -0.0450826 | 0.0175471 | 0.0670505 | -0.0790318 | -0.0400441 | 0.469783 | 43.7437 | 0.1424310 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 250 | 292.55 | 2 | 426 | 50 | 50 | 1799 | 26307 | 4278 | 700 | 42.53 | 42.7269 | 42.95 | 0.13 | 0.1254170 | 0.14 | 0 | 0.266140 | 0.532280 | 0.451191 | 0.453797 | 0.456145 | 0.458493 | -0.0065171 | 0.984389 | 0.0017949 | -0.0478911 | 0.0175834 | 0.0672173 | -0.0789274 | -0.0428401 | 0.467820 | 42.7618 | 0.1603170 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 251 | 292.55 | 0 | 7 | 50 | 50 | 271 | 15028 | 4687 | 2012 | 41.53 | 41.7392 | 41.92 | 0.14 | 0.1375140 | 0.15 | 0 | 0.256440 | 0.512881 | 0.447223 | 0.448362 | 0.450653 | 0.452945 | -0.0065171 | 0.982918 | 0.0019560 | -0.0509297 | 0.0218205 | 0.0673700 | -0.0788094 | -0.0458673 | 0.465778 | 41.7812 | 0.1795280 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 252 | 292.55 | 0 | 10 | 50 | 50 | 104 | 661 | 4033 | 1857 | 40.55 | 40.7531 | 40.93 | 0.15 | 0.1512220 | 0.16 | 0 | 0.252133 | 0.504267 | 0.443405 | 0.442846 | 0.445133 | 0.447419 | -0.0065171 | 0.981266 | 0.0021342 | -0.0542692 | 0.0218653 | 0.0675078 | -0.0786770 | -0.0491964 | 0.463703 | 40.8038 | 0.2019230 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 253 | 292.55 | 0 | 38 | 50 | 50 | 1312 | 3752 | 582 | 7464 | 39.57 | 39.7627 | 39.94 | 0.17 | 0.1606130 | 0.18 | 0 | 0.247748 | 0.495497 | 0.437633 | 0.441328 | 0.443198 | 0.445068 | -0.0065171 | 0.979986 | 0.0022891 | -0.0564869 | 0.0219092 | 0.0676766 | -0.0785743 | -0.0514013 | 0.460124 | 39.8223 | 0.2202700 | 292.55 | 2020-05-19 20:02:33 |

SPY | 2020-05-19 | 2020-05-29 | 11 | 254 | 292.55 | 0 | 22 | 50 | 50 | 67 | 2797 | 2927 | 2528 | 38.58 | 38.7807 | 38.96 | 0.18 | 0.1784280 | 0.19 | 0 | 0.244667 | 0.489333 | 0.434100 | 0.434687 | 0.436553 | 0.438419 | -0.0065171 | 0.977890 | 0.0025081 | -0.0605062 | 0.0267923 | 0.0677777 | -0.0784063 | -0.0554131 | 0.457933 | 38.8483 | 0.2459960 | 292.55 | 2020-05-19 20:02:33 |

`purrr` and `dplyr`

Here are some other interesting things that we can do with the nested lists via `purrr`, and their equivalent operation on the `strikes_df` dataframe using `dplyr`. The intent is to gain some intuition for `purrr` using what you already know about `dplyr`.

strikes %>% map(names) %>% unique() %>% unlist()

## [1] "ticker" "tradeDate" "expirDate" "dte" "strike" "stockPrice"
## [7] "callVolume" "callOpenInterest" "callBidSize" "callAskSize" "putVolume" "putOpenInterest"
## [13] "putBidSize" "putAskSize" "callBidPrice" "callValue" "callAskPrice" "putBidPrice"
## [19] "putValue" "putAskPrice" "callBidIv" "callMidIv" "callAskIv" "smvVol"
## [25] "putBidIv" "putMidIv" "putAskIv" "residualRate" "delta" "gamma"
## [31] "theta" "vega" "rho" "phi" "driftlessTheta" "extSmvVol"
## [37] "extCallValue" "extPutValue" "spotPrice" "updatedAt"

This is equivalent to the following `dplyr` operation on the `strikes_df` dataframe:

strikes_df %>% names

## [1] "ticker" "tradeDate" "expirDate" "dte" "strike" "stockPrice"
## [7] "callVolume" "callOpenInterest" "callBidSize" "callAskSize" "putVolume" "putOpenInterest"
## [13] "putBidSize" "putAskSize" "callBidPrice" "callValue" "callAskPrice" "putBidPrice"
## [19] "putValue" "putAskPrice" "callBidIv" "callMidIv" "callAskIv" "smvVol"
## [25] "putBidIv" "putMidIv" "putAskIv" "residualRate" "delta" "gamma"
## [31] "theta" "vega" "rho" "phi" "driftlessTheta" "extSmvVol"
## [37] "extCallValue" "extPutValue" "spotPrice" "updatedAt"

You can see the connection: `map(strikes, names)` applies `names` to each sublist in `strikes`, returning a list of names for each sublist, which we then check for a single unique case and convert to a character vector via `unlist`.

In the dataframe version, we’ve already mapped each sublist to a dataframe row. We can get the column names of the dataframe by calling `names` directly on this object.

strikes %>%
  map_chr("ticker") %>%  # this makes a character vector of list elements "ticker"
  unique()

## [1] "SPY"

Calling the `purrr::map` functions on a list with the name of a common sub-element returns the value associated with each sub-element. `map` returns a list; here we use `map_chr` to return a character vector. This only works if the thing being returned from the sub-element is indeed a character.

This is equivalent to the following `dplyr` operation on the `strikes_df` dataframe:

strikes_df %>% distinct(ticker) %>% pull()

## [1] "SPY"

In the `dplyr` dataframe version, we’ve already mapped our tickers to their own column. So we simply call `distinct` on that column to get the unique values. A `pull` converts the resulting tibble to a vector.

In this case, the `purrr` solution is somewhat convoluted:

callBids <- strikes %>% map_dbl("callBidPrice")
callAsks <- strikes %>% map_dbl("callAskPrice")
putBids <- strikes %>% map_dbl("putBidPrice")
putAsks <- strikes %>% map_dbl("putAskPrice")

data.frame(
  strike = strikes %>% map_dbl("strike"),
  expirDate = strikes %>% map_chr("expirDate"),
  callMid = map2_dbl(.x = callBids, .y = callAsks, ~{(.x + .y)/2}),
  putMid = map2_dbl(.x = putBids, .y = putAsks, ~{(.x + .y)/2})
) %>%
  pretty_print(10)

strike | expirDate | callMid | putMid |
---|---|---|---|

140 | 2020-05-29 | 152.600 | 0.005 |

145 | 2020-05-29 | 147.600 | 0.005 |

150 | 2020-05-29 | 142.610 | 0.005 |

155 | 2020-05-29 | 137.585 | 0.005 |

160 | 2020-05-29 | 132.620 | 0.005 |

165 | 2020-05-29 | 127.620 | 0.005 |

170 | 2020-05-29 | 122.600 | 0.005 |

175 | 2020-05-29 | 117.585 | 0.005 |

180 | 2020-05-29 | 112.610 | 0.005 |

185 | 2020-05-29 | 107.610 | 0.005 |

Since our mapping function requires two inputs, we need to use the `map2` functions, and must set up the inputs as a first step.

The `dplyr` equivalent on the dataframe object is much more succinct:

strikes_df %>%
  mutate(
    callMid = (callBidPrice + callAskPrice)/2,
    putMid = (putBidPrice + putAskPrice)/2
  ) %>%
  select(strike, expirDate, callMid, putMid) %>%
  pretty_print(10)

strike | expirDate | callMid | putMid |
---|---|---|---|

140 | 2020-05-29 | 152.600 | 0.005 |

145 | 2020-05-29 | 147.600 | 0.005 |

150 | 2020-05-29 | 142.610 | 0.005 |

155 | 2020-05-29 | 137.585 | 0.005 |

160 | 2020-05-29 | 132.620 | 0.005 |

165 | 2020-05-29 | 127.620 | 0.005 |

170 | 2020-05-29 | 122.600 | 0.005 |

175 | 2020-05-29 | 117.585 | 0.005 |

180 | 2020-05-29 | 112.610 | 0.005 |

185 | 2020-05-29 | 107.610 | 0.005 |

We can also leverage the fact that a dataframe is represented as a list of columns to use `purrr` functions directly on dataframes. These recipes are quite useful for quickly getting to know a dataframe.

For instance, we can get the type of each column:

strikes_df %>% map_chr(class)

## ticker tradeDate expirDate dte strike stockPrice callVolume
## "character" "character" "character" "integer" "numeric" "numeric" "integer"
## callOpenInterest callBidSize callAskSize putVolume putOpenInterest putBidSize putAskSize
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## callBidPrice callValue callAskPrice putBidPrice putValue putAskPrice callBidIv
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## callMidIv callAskIv smvVol putBidIv putMidIv putAskIv residualRate
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## delta gamma theta vega rho phi driftlessTheta
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## extSmvVol extCallValue extPutValue spotPrice updatedAt
## "numeric" "numeric" "numeric" "numeric" "character"

Which is equivalent to a `dplyr::summarise_all`, except that this returns a tibble rather than a vector:

strikes_df %>% summarise_all(~class(.x))

## # A tibble: 1 x 40
## ticker tradeDate expirDate dte strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 chara~ character character inte~ numer~ numeric integer integer integer integer integer
## # ... with 29 more variables: putOpenInterest <chr>, putBidSize <chr>, putAskSize <chr>, callBidPrice <chr>,
## # callValue <chr>, callAskPrice <chr>, putBidPrice <chr>, putValue <chr>, putAskPrice <chr>, callBidIv <chr>,
## # callMidIv <chr>, callAskIv <chr>, smvVol <chr>, putBidIv <chr>, putMidIv <chr>, putAskIv <chr>, residualRate <chr>,
## # delta <chr>, gamma <chr>, theta <chr>, vega <chr>, rho <chr>, phi <chr>, driftlessTheta <chr>, extSmvVol <chr>,
## # extCallValue <chr>, extPutValue <chr>, spotPrice <chr>, updatedAt <chr>

We can also get the number of distinct values in each column using `purrr` functions:

strikes_df %>% map_dbl(n_distinct)

## ticker tradeDate expirDate dte strike stockPrice callVolume
## 1 1 31 31 151 1 162
## callOpenInterest callBidSize callAskSize putVolume putOpenInterest putBidSize putAskSize
## 1097 144 165 498 1691 1054 926
## callBidPrice callValue callAskPrice putBidPrice putValue putAskPrice callBidIv
## 2151 2228 2199 1281 2419 1311 1907
## callMidIv callAskIv smvVol putBidIv putMidIv putAskIv residualRate
## 2423 2430 2044 2345 2425 2428 2039
## delta gamma theta vega rho phi driftlessTheta
## 2137 2238 2192 2211 2188 2133 2196
## extSmvVol extCallValue extPutValue spotPrice updatedAt
## 2016 2168 2413 1 1

Again, this is equivalent to a `dplyr::summarise_all`, different return objects aside:

strikes_df %>% summarise_all(~n_distinct(.x))

## # A tibble: 1 x 40
## ticker tradeDate expirDate dte strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 1 31 31 151 1 162 1097 144 165 498
## # ... with 29 more variables: putOpenInterest <int>, putBidSize <int>, putAskSize <int>, callBidPrice <int>,
## # callValue <int>, callAskPrice <int>, putBidPrice <int>, putValue <int>, putAskPrice <int>, callBidIv <int>,
## # callMidIv <int>, callAskIv <int>, smvVol <int>, putBidIv <int>, putMidIv <int>, putAskIv <int>, residualRate <int>,
## # delta <int>, gamma <int>, theta <int>, vega <int>, rho <int>, phi <int>, driftlessTheta <int>, extSmvVol <int>,
## # extCallValue <int>, extPutValue <int>, spotPrice <int>, updatedAt <int>

If we wanted to put both of these things together, there’s an elegant `purrr` solution:

strikes_df %>%
  map_df(
    ~data.frame(num_distinct = n_distinct(.x), type = class(.x)),
    .id = "variable"
  )

## variable num_distinct type
## 1 ticker 1 character
## 2 tradeDate 1 character
## 3 expirDate 31 character
## 4 dte 31 integer
## 5 strike 151 numeric
## 6 stockPrice 1 numeric
## 7 callVolume 162 integer
## 8 callOpenInterest 1097 integer
## 9 callBidSize 144 integer
## 10 callAskSize 165 integer
## 11 putVolume 498 integer
## 12 putOpenInterest 1691 integer
## 13 putBidSize 1054 integer
## 14 putAskSize 926 integer
## 15 callBidPrice 2151 numeric
## 16 callValue 2228 numeric
## 17 callAskPrice 2199 numeric
## 18 putBidPrice 1281 numeric
## 19 putValue 2419 numeric
## 20 putAskPrice 1311 numeric
## 21 callBidIv 1907 numeric
## 22 callMidIv 2423 numeric
## 23 callAskIv 2430 numeric
## 24 smvVol 2044 numeric
## 25 putBidIv 2345 numeric
## 26 putMidIv 2425 numeric
## 27 putAskIv 2428 numeric
## 28 residualRate 2039 numeric
## 29 delta 2137 numeric
## 30 gamma 2238 numeric
## 31 theta 2192 numeric
## 32 vega 2211 numeric
## 33 rho 2188 numeric
## 34 phi 2133 numeric
## 35 driftlessTheta 2196 numeric
## 36 extSmvVol 2016 numeric
## 37 extCallValue 2168 numeric
## 38 extPutValue 2413 numeric
## 39 spotPrice 1 numeric
## 40 updatedAt 1 character

But the best I can do with `dplyr` is somewhat less elegant:

strikes_df %>% summarise_all( list(~n_distinct(.x), ~class(.x)) )

## # A tibble: 1 x 80
## ticker_n_distin~ tradeDate_n_dis~ expirDate_n_dis~ dte_n_distinct strike_n_distin~ stockPrice_n_di~ callVolume_n_di~
## <int> <int> <int> <int> <int> <int> <int>
## 1 1 1 31 31 151 1 162
## # ... with 73 more variables: callOpenInterest_n_distinct <int>, callBidSize_n_distinct <int>,
## # callAskSize_n_distinct <int>, putVolume_n_distinct <int>, putOpenInterest_n_distinct <int>,
## # putBidSize_n_distinct <int>, putAskSize_n_distinct <int>, callBidPrice_n_distinct <int>, callValue_n_distinct <int>,
## # callAskPrice_n_distinct <int>, putBidPrice_n_distinct <int>, putValue_n_distinct <int>, putAskPrice_n_distinct <int>,
## # callBidIv_n_distinct <int>, callMidIv_n_distinct <int>, callAskIv_n_distinct <int>, smvVol_n_distinct <int>,
## # putBidIv_n_distinct <int>, putMidIv_n_distinct <int>, putAskIv_n_distinct <int>, residualRate_n_distinct <int>,
## # delta_n_distinct <int>, gamma_n_distinct <int>, theta_n_distinct <int>, vega_n_distinct <int>, rho_n_distinct <int>,
## # phi_n_distinct <int>, driftlessTheta_n_distinct <int>, extSmvVol_n_distinct <int>, extCallValue_n_distinct <int>,
## # extPutValue_n_distinct <int>, spotPrice_n_distinct <int>, updatedAt_n_distinct <int>, ticker_class <chr>,
## # tradeDate_class <chr>, expirDate_class <chr>, dte_class <chr>, strike_class <chr>, stockPrice_class <chr>,
## # callVolume_class <chr>, callOpenInterest_class <chr>, callBidSize_class <chr>, callAskSize_class <chr>,
## # putVolume_class <chr>, putOpenInterest_class <chr>, putBidSize_class <chr>, putAskSize_class <chr>,
## # callBidPrice_class <chr>, callValue_class <chr>, callAskPrice_class <chr>, putBidPrice_class <chr>,
## # putValue_class <chr>, putAskPrice_class <chr>, callBidIv_class <chr>, callMidIv_class <chr>, callAskIv_class <chr>,
## # smvVol_class <chr>, putBidIv_class <chr>, putMidIv_class <chr>, putAskIv_class <chr>, residualRate_class <chr>,
## # delta_class <chr>, gamma_class <chr>, theta_class <chr>, vega_class <chr>, rho_class <chr>, phi_class <chr>,
## # driftlessTheta_class <chr>, extSmvVol_class <chr>, extCallValue_class <chr>, extPutValue_class <chr>,
## # spotPrice_class <chr>, updatedAt_class <chr>

Intuitively, you’d reach for something like this:

try( strikes_df %>% summarise_all( ~data.frame(num_distinct = n_distinct(.x), type = class(.x)) ) )

## Error : Column `ticker` must be length 1 (a summary value), not 2

But we get an error related to the fact that `summarise` wants to return a single value for each variable being summarised, that is, a dataframe with a single row.

There are probably better `dplyr` solutions out there, but this illustrates an important point: the `purrr::map` functions are highly customisable, able to apply a function to individual elements in a collection, returning a data object of your choosing. `dplyr::summarise` really shines when you need to aggregate or reduce variables to a single value.

In this post we explored the `purrr::map` functions for wrangling a data set consisting of nested lists, as you might have if you were reading JSON data into R.

We also explored the cross-over and differences in use-cases for `purrr` and `dplyr` functions.

The post How to Wrangle JSON Data in R with jsonlite, purrr and dplyr appeared first on Robot Wealth.

The post Using Digital Signal Processing in Quantitative Trading Strategies appeared first on Robot Wealth.

In this post, we look at tools and functions from the field of digital signal processing. *Can these tools be useful to us as quantitative traders?*

A digital signal is a representation of a physical phenomenon created by sampling that phenomenon at discrete time intervals.

If you think about the way we typically construct a price chart, there are obvious parallels: we sample a stream of ticks at regular intervals and treat that sample as our measure of price.* (Of course, we often aggregate or summarize price data at such intervals too, creating the familiar open-high-low-close bars or candles).*
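To make the sampling idea concrete, here’s a minimal sketch (in Python purely for illustration; the tick stream and the 60-second bar length are invented for the example) of aggregating a stream of ticks into fixed-interval OHLC bars:

```python
import math

# Hypothetical tick stream: (seconds since the open, price)
ticks = [(t, 100 + math.sin(t / 7)) for t in range(0, 600, 2)]

def to_bars(ticks, bar_seconds=60):
    """Aggregate ticks into open-high-low-close bars at fixed time intervals."""
    bars = {}
    for ts, price in ticks:
        key = ts // bar_seconds  # which bar this tick falls into
        if key not in bars:
            bars[key] = [price, price, price, price]  # [open, high, low, close]
        else:
            o, h, l, c = bars[key]
            bars[key] = [o, max(h, price), min(l, price), price]
    return [bars[k] for k in sorted(bars)]

bars = to_bars(ticks)
print(len(bars))  # 600 seconds of ticks at 60-second bars → 10 bars
```

Each bar is itself a sample of the underlying tick stream, which is why DSP’s framing of sampled signals maps so naturally onto price series.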

You can, therefore, see the connection between digital signals and analysis of time-based financial data.

And since the techniques used to process and make sense of digital signals have proven their worth in electrical engineering, telecommunications and other fields, it is tempting to assume that they can unravel the mysteries of the financial markets too.

However, the existence of such a connection doesn’t necessarily imply that DSP holds the key to the markets, or indeed that it is of any use at all in financial applications.

Financial data is much, much noisier than the data in those other applications, for example.

The purpose of most of the DSP tools you’ll come across, including the ones implemented in the Zorro Trading Automation Software, is to uncover useful information about some aspect of one or more cycles present in the signal being analyzed.

A cycle is a repeating pattern (although non-repeating cycles can also exist), and it can be described by its:

- period (\(T\)) – how much time is taken by one complete cycle.
- frequency (\(f\)) – how many times it repeats in a given time interval. Frequency and period are inversely related: \(f = \frac{1}{T}\)
- amplitude (\(A\)) – the magnitude of the distance between the peak and trough of the cycle. Often \(A\) is given as a distance from the midline to the peak (equivalently, the midline to the trough), but in Zorro’s functions it is the peak-to-trough range.
- phase (\(\theta\)) – the fraction of the cycle that has been completed at any point in time (usually measured in degrees or radians). By definition, one complete cycle occupies \(2\pi\) radians, or 360 degrees.
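These definitions can be pinned down numerically. Here’s a small Python sketch (the sampling choices are arbitrary, and it follows Zorro’s convention of \(A\) as the peak-to-trough range):

```python
import math

def cycle_value(t, period, amplitude, phase=0.0):
    """One cycle component: amplitude is the peak-to-trough range A,
    period is T in bars, phase is the offset theta in radians."""
    return (amplitude / 2) * math.sin(2 * math.pi * t / period + phase)

period = 50             # T: bars per complete cycle
frequency = 1 / period  # f = 1/T: cycles per bar
samples = [cycle_value(t, period, amplitude=2) for t in range(10 * period)]

# The peak-to-trough range of the sampled wave recovers the amplitude A = 2
print(round(max(samples) - min(samples), 2))  # → 2.0
```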

Here’s one full cycle of a sine wave with an amplitude of 2 (the peak is at one, the trough is at minus one):

Here’s an example of a time series with a clear cycle and trend component:

And here’s one that I constructed by adding various trends and cycles with different amplitudes, phases and frequencies:

The last plot above was generated using this R code.

x <- seq(0, 20, 0.1)
signal <- 0.7*sin(2*x) - 0.001*x^3 + 0.05*x^2 - x/3 - 0.9*cos(3*x-4) + 0.3*sin(10*x)
plot(x, signal, type='l')

Note how the signal variable is constructed by simply summing multiple cycles and trends.

You can see that a seemingly random-looking signal can actually be decomposed into constituent cycles. In fact, using enough different cycles, you can create a **good representation of any time series**. The tools to do so, for example the Fourier Transform, are well established and accessible in many software packages.
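To see decomposition working in the other direction, here’s a sketch (plain Python with a naive discrete Fourier transform rather than the FFT you’d use in practice; the test signal is made up) that recovers the constituent cycle frequencies of a signal built from two known sine waves:

```python
import cmath
import math

def dft_magnitudes(x):
    """Naive discrete Fourier transform; returns |X[k]| for k = 0..N-1."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n)]

# A signal built from two cycles: 5 and 12 complete periods over the window
n = 200
signal = [math.sin(2 * math.pi * 5 * t / n) + 0.4 * math.sin(2 * math.pi * 12 * t / n)
          for t in range(n)]

mags = dft_magnitudes(signal)
# The two largest positive-frequency bins pick out the two component cycles
top_two = sorted(range(1, n // 2), key=lambda k: mags[k], reverse=True)[:2]
print(sorted(top_two))  # → [5, 12]
```

On a clean, stationary signal like this, the decomposition is exact; the question for traders is whether the same machinery means anything on noisy, non-stationary market data.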

But just because you *can*, doesn’t necessarily mean that you *should*.

Consider that in order for our combination of cycles to be an accurate representation of a time series, they should actually be real and present in the time series. Otherwise, we are simply engaging in an exercise of overfitting, where our parameters are the constituent cycles and their properties.

**Since DSP is concerned with extracting information about cycles in a signal, it will be useful to the extent that there are real and present cycles in the signal.**

What do you think? Are financial time series composed of multiple cycles of varying frequency and amplitude?

We’ll explore this question throughout this post…

There’s another potential problem in using DSP to analyze the markets, beyond the question of the fundamental existence of cycles in financial time series data. Most of the DSP techniques that you’ll encounter are particularly useful in engineering applications where the signal typically repeats – think for example of the voltage in an AC electrical circuit.

Unfortunately, financial time series tend to be non-stationary and non-repeating.

So, even if we accept that financial data *does* consist of multiple cycles that we can detect and measure, and then we do manage to detect and measure them, we are still faced with the problem of them changing through time, perhaps even before we manage to detect them!

Those issues aside, there may still be applications in practical trading where DSP techniques can come in handy. Consider for example some classic technical indicators like the simple moving average (SMA) and the Relative Strength Index (RSI).

Both of these indicators implicitly assume some form of cyclic behaviour in the underlying time series. The SMA looks back a certain time period in its calculation, which can be interpreted as the length of the “trend” the SMA seeks to detect. *(In this case, a trend is just a cycle with a long period, or equivalently a low frequency).*

Likewise, the RSI is calculated over a certain time period, but, in this case, its time period can be interpreted as the length of a high-frequency cycle that it seeks to detect.

When viewing technical indicators from that perspective, DSP techniques may provide superior performance: they will be more responsive to the underlying data and introduce less lag than their traditional counterparts.

Here’s a plot of the indicators and digital filters mentioned above *(with lookback periods of 100 and 14 for the low and high-frequency indicators respectively):*

You can see that the digital filter for detecting low-frequency cycles *(called a Low-Pass Filter, since it lets low-frequency cycles pass and attenuates high-frequency cycles)* is more responsive and tracks price closer than the SMA.

The digital filter for detecting high-frequency cycles (called a High-Pass Filter) is smoother than the RSI and tends to emphasize turning points a little more clearly. Also notice that its output was relatively small during the period in which the market trended down, but became more pronounced as the trend ended. That’s because during the trend it correctly filtered out the low-frequency trend components of the signal which dominated during that time, hence leaving little in the output of the filter.

If nothing else, it appears that digital filters can replace certain traditional indicators. So let’s talk a little more about them.

Digital filters work by detecting cycles of various periods (lengths) in a signal, and then either attenuating (filtering) or passing those cycles, depending on their period.

The *cutoff period *is the period at which the filter begins to attenuate the signal.

- Low-Pass filters attenuate periods *below* their cutoff period.
- High-Pass filters attenuate periods *above* their cutoff period.

The terminology can be a bit confusing, but just remember that Low and High refer to frequency, not period. Low frequency = long period, high frequency = short period.

Another type of digital filter is the Band-Pass filter, which passes cycles with a period within a certain range (the bandwidth), and attenuates cycles outside that range.
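As a rough illustration of the pass/attenuate behaviour, here’s a sketch in Python. These are simple first-order filters built from an exponential moving average, not the filters Zorro’s `LowPass` and `HighPass` functions implement, and the signal and cutoffs are invented for the demonstration:

```python
import math

def low_pass(x, cutoff_period):
    """First-order low-pass (an EMA): passes long-period cycles,
    attenuates short-period ones."""
    alpha = 2 / (cutoff_period + 1)
    out, prev = [], x[0]
    for v in x:
        prev = prev + alpha * (v - prev)
        out.append(prev)
    return out

def high_pass(x, cutoff_period):
    """High-pass as the residual after removing low-frequency content."""
    return [v - l for v, l in zip(x, low_pass(x, cutoff_period))]

def amplitude(x):
    return (max(x) - min(x)) / 2

n = 1000
slow = [math.sin(2 * math.pi * t / 100) for t in range(n)]       # period 100
fast = [0.5 * math.sin(2 * math.pi * t / 10) for t in range(n)]  # period 10

# A low-pass filter with a 30-bar cutoff mostly passes the 100-bar cycle,
# heavily attenuates the 10-bar cycle; the high-pass residual does the reverse
att_slow = amplitude(low_pass(slow, 30)) / amplitude(slow)
att_fast = amplitude(low_pass(fast, 30)) / amplitude(fast)
att_hp_fast = amplitude(high_pass(fast, 30)) / amplitude(fast)
print(att_slow > 0.6, att_fast < 0.2, att_hp_fast > 0.8)  # → True True True
```

Note how sloppy a first-order filter is: even the "passed" cycle loses some amplitude and picks up lag, which is part of why Ehlers-style filter designs exist.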

A useful way to visualize what these different filters do is to apply them to a signal with a known period and observe their output.

Even better, we can supply a signal whose period gradually changes to get an even better understanding. Here’s some Zorro code for generating a sine wave with an increasing period, and plotting the output of various filters applied to this signal:

function run()
{
  set(PLOTNOW);
  MaxBars = 1000;
  asset("");
  ColorUp = ColorDn = 0;

  vars Signal = series(genSine(10, 100));
  vars LowPass50 = series(LowPass(Signal, 50));
  vars LowPass100 = series(LowPass(Signal, 100));
  vars HighPass5 = series(HighPass(Signal, 5));
  vars HighPass20 = series(HighPass(Signal, 20));
  vars BandPass40 = series(BandPass(Signal, 40, 0.1));
  vars BP4040 = series(BandPass(BandPass40, 40, 0.1));
  vars BP404040 = series(BandPass(BP4040, 40, 0.1));

  PlotHeight1 = PlotHeight2 = 150;
  PlotWidth = 800;
  plot("Signal", Signal, MAIN, BLUE);
  plot("LP50", LowPass50, NEW, RED);
  plot("LP100", LowPass100, NEW, RED);
  plot("HP5", HighPass5, NEW, RED);
  plot("HP20", HighPass20, NEW, RED);
  plot("BP40", BandPass40, NEW, RED);
  plot("BP4040", BP4040, NEW, RED);
  plot("BP404040", BP404040, NEW, RED);
}

This code outputs the following plot:

Study the plots above, and experiment with the code yourself. Note that none of the filters perfectly attenuate cycles that lie beyond their cutoff periods, although the stacked Band-Pass filters (the bottom two plots) come closer than the others do. They also introduce some lag.

In the script above, you can see that the `BandPass` function has an additional argument: as well as a cutoff period, it also takes a `Delta` value. The delta value is just a number between 0 and 1 that defines the width of the filter’s pass region, scaled to and centred on the cutoff period.

For example, a Band-Pass filter with a cutoff period of 40 and Delta of 0.1 results in a filter with a bandwidth of 4, running from 38 to 42. It will attempt to filter cycles outside that range.
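The arithmetic is trivial, but worth pinning down. Here’s a quick sketch of the pass-region bounds implied by that description (Python, just for illustration):

```python
def band_pass_bounds(cutoff, delta):
    """Pass-region bounds implied by a cutoff period and Delta:
    bandwidth = cutoff * delta, centred on the cutoff period."""
    bandwidth = cutoff * delta
    return cutoff - bandwidth / 2, cutoff + bandwidth / 2

print(band_pass_bounds(40, 0.1))  # → (38.0, 42.0)
```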

Digital filters can also be stacked by passing the output of one filter to another filter. This generally has the effect of enabling the filter to more tightly focus on the cutoff period of interest, but comes at the cost of introducing additional lag. In the script above, we demonstrated stacking a Band-Pass filter, but you can stack the other filters too. In some of the later examples, we’ll stack various filters to pre-condition our raw data before analyzing the remaining cycles. Note that doing this unavoidably introduces some amount of lag.

All this discussion of digital filters and their cutoff periods leads to the obvious question: what cutoff period should we use? It makes sense that if there were a *dominant cycle* present in the signal, that is, a cycle whose amplitude swamped the smaller cycles, we should target that cycle’s period, because it will present the best trading opportunities.

Zorro implements the `DominantPeriod` function, which calculates the dominant cycle period in a price series using the Hilbert Transformation.

The Hilbert Transformation is one of the fundamental functions of signal processing and enables calculation of, among other things, the instantaneous frequency of a signal.

The documentation of this function is likely to leave you scratching your head though. Testing different values of Period on a signal with a known period reveals that it has very little impact on the result and that the calculated period is quite accurate, up to a limit of around 60. My recommendation is to simply set this argument to a value of 30, which lies approximately halfway along the argument’s valid range.

Here’s a plot of the output of `DominantPeriod` under different conditions:

- The first signal (the top blue plot) is a sine wave of period 50.
- The next two plots are the output of DominantPeriod() with Period arguments set to 30 and 50 respectively. You can see that even when Period is set incorrectly, it does a reasonable job of extracting the cycle length.
- The second signal (blue plot, second from the bottom) is a sine wave whose period decreases from 100 to 10.
- The red plot below is the output of DominantPeriod() with Period set to 30.

You can see that the function’s maximum output is 60, and that it returns 60 for all cycle lengths of 60 or above. It follows that it makes little sense to use this function to detect cycles of length greater than 60. On the other hand, it does a decent job of detecting the decreasing cycle length.

An important note about using this function is that it carries a significant lag, on the order of ten bars. The important implication of this is that by the time a dominant period is detected, *it will already be ten bars old*. That means that if we detected a dominant cycle whose period was 11 bars, we’d only have 1 bar left to act on it due to the lag associated with its detection.

Therefore, probably the minimum cycle length we’d be interested in detecting would be around 13 bars. In reality, however, you might not want to cut it quite so fine. A cycle length of 20 detected after 10 bars is detected half-way through its cycle (at its midline). A cycle of length 40 detected after 10 bars is only a quarter complete and is detected at its peak or trough.
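The arithmetic behind those statements can be sketched in a few lines (Python, using the ten-bar detection lag quoted above):

```python
DETECTION_LAG = 10  # approximate bars of lag in dominant-cycle detection

def completed_fraction(period, lag=DETECTION_LAG):
    """Fraction of the cycle already elapsed by the time it is detected."""
    return lag / period

for period in (13, 20, 40):
    frac = completed_fraction(period)
    # a 20-bar cycle is half over (180 degrees) when detected;
    # a 40-bar cycle is a quarter over (90 degrees)
    print(period, round(frac, 2), round(360 * frac), "degrees")
```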

Here’s a plot that demonstrates this. It shows a 13 (blue), 20 (green) and 40 (red) bar cycle with bar 10 marked by the black vertical line.

*Look at the phase of each of those cycles at bar 10 – which do think represents the best fake trading opportunities?*

A lot of Zorro’s DSP functionality is based on the work of John Ehlers. In one of his books, Ehlers demonstrates the use of the Hilbert Transform for calculating the dominant cycle period, which is the method used in Zorro’s implementation. However, Ehlers actually warns against that approach, saying that a better method is his autocorrelation periodogram.1

Let’s summarize what we’ve learned so far and then set up an experiment to look at the validity of the DSP approach in a trading context.

Here’s what we know so far:

- DSP tools and techniques outperform traditional indicators in terms of responsiveness and lag when used for analogous purposes.
- Digital filters can attenuate cycles of certain lengths, and pass others. We have Low-Pass filters for targeting long cycles, High-Pass filters for targeting short cycles, and Band-Pass filters for targeting a range of cycle lengths. Filters can be connected, or stacked. This often improves their accuracy but comes at the expense of added lag.
- We can estimate the period of the dominant cycle using Zorro’s DominantPeriod() function, which lags by approximately ten bars.

The nice thing about the DSP approach is that it is based almost entirely on theory, which means we can test its validity by finding out whether the theory holds up in practice.

Since it’s a theory, it should also hold anywhere the assumptions which underpin it are valid. So we can actually run some experiments to test the theory and then based on the results infer whether its assumptions are valid. In particular, I want to know more about whether decomposing cycles is a sensible approach to the markets.

Here I’ll test one of John Ehlers’ indicators from *Cycle Analytics for Traders.*2

- At the heart of the indicator is a Band-Pass filter of an arbitrary bandwidth.
- Preceding the Band-Pass filter, the raw price series is passed to a High-Pass filter that is tuned to twice the period of the Band-Pass filter’s upper cutoff. That’s to remove the effects of *spectral dilation*, which Ehlers says arises due to the phenomenon that market swings tend to be larger in amplitude the longer they are measured over.
- Finally, the output of the Band-Pass filter is fed to another High-Pass filter, this time tuned to the other side of the Band-Pass filter’s bandwidth. The output of this High-Pass filter, according to Ehlers, creates a waveform that leads the Band-Pass filter’s output and will cross that output at its exact peaks and valleys, except when the data are trending.3 We’ll refer to this leading waveform as our trigger.

Let’s look at this idea in the context of a very basic trading strategy. We’ll simply reverse long when the trigger crosses under the output of the Band-Pass filter and reverse short when it crosses over.

Here’s the Zorro code for setting this up for a few different assets and time frames. We set transaction costs to zero so that we can get an undistorted comparison of the raw edge (if any) of this approach across time frames. Here’s the code and the output of the portfolio report:

/* DSP Experiments from Cycle Analytics For Traders
   Note Ehlers often speaks in frequency terms, we work in period terms
   High frequency = low cutoff period
   Low frequency = high cutoff period
*/

function run()
{
  set(PLOTNOW);
  StartDate = 2007;
  EndDate = 2016;
  BarPeriod = 60;
  LookBack = 250;
  MaxLong = MaxShort = 1;

  while(asset(loop("EUR/USD", "USD/JPY", "SPX500")))
  while(algo(loop("H1", "H4", "D1")))
  {
    if(Algo == "H1") TimeFrame = 60/BarPeriod;
    else if(Algo == "H4") TimeFrame = 240/BarPeriod;
    else if(Algo == "D1") TimeFrame = 1440/BarPeriod;

    Spread = Commission = Slippage = RollLong = RollShort = 0;
    vars Price = series(price());

    // Set up filter parameters
    int BP_Cutoff = 30;
    var Delta = 0.3;
    int Bandwidth = ceil(BP_Cutoff*Delta);
    int UpperBP_Cutoff = ceil(BP_Cutoff + Bandwidth/2);
    int LowerBP_Cutoff = floor(BP_Cutoff - Bandwidth/2);
    if(Day == StartBar and Asset == "EUR/USD" and Algo == "H1")
      printf("\nFilter Bandwidth is %d to %d Periods", LowerBP_Cutoff, UpperBP_Cutoff);

    // Pre-filter with High-Pass filter
    vars HP = series(HighPass(Price, UpperBP_Cutoff*2));

    // Band-Pass filter
    vars BP = series(BandPass(HP, BP_Cutoff, Delta));

    // Trigger: another High-Pass filter
    vars Trigger = series(HighPass(BP, LowerBP_Cutoff));

    // Trade logic
    if(crossOver(Trigger, BP)) enterLong();
    if(crossUnder(Trigger, BP)) enterShort();

    plot("BP", BP, NEW, BLUE);
    plot("Trigger", Trigger, 0, RED);
  }
}

/* OUTPUT:
Portfolio analysis  OptF  ProF  Win/Loss   Wgt%

EUR/USD avg         .003  0.94  2018/1611  18.9
SPX500 avg          .003  0.88  1824/1522  58.0
USD/JPY avg         .000  0.88  1946/1637  23.1
D1 avg              .002  0.90  184/157    12.1
H1 avg              .000  0.88  4449/3693  71.9
H4 avg              .005  0.95  1155/920   16.0
EUR/USD:D1          .000  0.79  61/52       9.6
EUR/USD:D1:L        .000  0.68  32/24       7.8
EUR/USD:D1:S        .000  0.91  29/28       1.8
EUR/USD:H1          .000  0.96  1538/1244   8.1
EUR/USD:H1:L        .000  0.93  786/605     7.0
EUR/USD:H1:S        .000  0.99  752/639     1.1
EUR/USD:H4          .000  0.99  419/315     1.2
EUR/USD:H4:L        .000  0.93  210/157     3.5
EUR/USD:H4:S        .027  1.05  209/158    -2.4
SPX500:D1           .000  0.98  54/51       0.8
SPX500:D1:L         .016  1.40  32/21      -8.5
SPX500:D1:S         .000  0.70  22/30       9.3
SPX500:H1           .000  0.84  1427/1173  47.3
SPX500:H1:L         .000  0.89  766/534    15.1
SPX500:H1:S         .000  0.80  661/639    32.2
SPX500:H4           .000  0.93  343/298     9.9
SPX500:H4:L         .004  1.05  195/126    -3.6
SPX500:H4:S         .000  0.82  148/172    13.5
USD/JPY:D1          .000  0.93  69/54       1.7
USD/JPY:D1:L        .000  0.89  32/30       1.2
USD/JPY:D1:S        .000  0.96  37/24       0.5
USD/JPY:H1          .000  0.86  1484/1276  16.5
USD/JPY:H1:L        .000  0.85  744/636     8.6
USD/JPY:H1:S        .000  0.86  740/640     8.0
USD/JPY:H4          .000  0.92  393/307     4.9
USD/JPY:H4:L        .000  0.91  196/154     2.8
USD/JPY:H4:S        .000  0.93  197/153     2.1
*/

Not much to like about that at all. Just for completeness, here’s the equity curve:

This result feels like we are almost completely out of sync with the cycles we are trying to exploit! Either that or the cycles are just randomness and can’t be exploited anyway. Let’s explore some more.

In this case, the bandwidth of our Band-Pass filter was tuned to periods of length 25 to 35. That’s neither a very wide range, nor are those particularly long cycles.
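The band edges follow from simple arithmetic on the cutoff period and the Delta parameter. Here's a small Python sketch of the same calculation used in the lite-C listing (the second call previews the wider band tested in the next experiment):

```python
import math

def bandpass_band(cutoff, delta):
    """Band edges implied by the cutoff/Delta arithmetic in the lite-C
    listing: bandwidth = cutoff * delta, centred on the cutoff period."""
    bandwidth = math.ceil(cutoff * delta)
    upper = math.ceil(cutoff + bandwidth / 2)
    lower = math.floor(cutoff - bandwidth / 2)
    return lower, upper

print(bandpass_band(30, 0.3))  # → (25, 35)
print(bandpass_band(50, 0.8))  # → (30, 70)
```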

*Perhaps if we increase both our Band-Pass cutoff period and the Delta parameter in order to target longer cycles we can get better results?*

Here’s the portfolio report for a Band-Pass cutoff of 50 and a Delta of 0.8, which corresponds to a bandwidth that encompasses cycles of length 30 to 70 bars:

```
Portfolio analysis   OptF  ProF  Win/Loss   Wgt%

EUR/USD avg          .019  1.01  1156/817    4.2
SPX500 avg           .012  1.21  1145/771   79.9
USD/JPY avg          .024  1.10  1214/798   15.9
D1 avg               .024  1.34  117/75     32.2
H1 avg               .018  1.06  2683/1848  28.5
H4 avg               .019  1.16  715/463    39.3
EUR/USD:D1           .008  1.04  35/29       1.5
EUR/USD:D1:L         .000  0.86  18/14      -3.1
EUR/USD:D1:S         .074  1.36  17/15       4.6
EUR/USD:H1           .000  0.98  891/628    -2.7
EUR/USD:H1:L         .000  0.95  452/307    -4.9
EUR/USD:H1:S         .023  1.03  439/321     2.2
EUR/USD:H4           .026  1.06  230/160     5.5
EUR/USD:H4:L         .000  0.98  119/76     -0.8
EUR/USD:H4:S         .055  1.15  111/84      6.3
SPX500:D1            .012  1.59  33/25      23.2
SPX500:D1:L          .015  2.66  21/8       22.4
SPX500:D1:S          .001  1.03  12/17       0.7
SPX500:H1            .015  1.11  869/604    25.4
SPX500:H1:L          .023  1.21  468/268    22.9
SPX500:H1:S          .003  1.02  401/336     2.4
SPX500:H4            .020  1.28  243/142    31.4
SPX500:H4:L          .027  1.51  131/62     25.9
SPX500:H4:S          .007  1.09  112/80      5.4
USD/JPY:D1           .028  1.36  49/21       7.6
USD/JPY:D1:L         .029  1.38  26/9        3.6
USD/JPY:D1:S         .027  1.35  23/12       4.0
USD/JPY:H1           .031  1.06  923/616     5.9
USD/JPY:H1:L         .027  1.06  464/305     2.5
USD/JPY:H1:S         .034  1.07  459/311     3.3
USD/JPY:H4           .014  1.05  242/161     2.4
USD/JPY:H4:L         .009  1.03  117/85      0.7
USD/JPY:H4:S         .018  1.07  125/76      1.7
```

Interesting. This appears to be a better result, although long trades on SPX500 contributed the majority of the profit. Still, maybe this deserves a closer look. Here’s the portfolio equity curve:

Ehlers says later in his book that it makes sense to tune the Band-Pass filter to the dominant cycle in order to eliminate other cycles that are of little or no interest.

He then goes on to say that, even better than tuning it to the dominant cycle, tuning it to a slightly shorter period (but with a bandwidth that also captures the dominant cycle) causes the output of the filter to *lead* the dominant cycle slightly. We test this idea in the code below *(the tuning of the Band-Pass filter as described above occurs at the `BP_Cutoff` assignment).*

In this case, Ehlers suggests that we precede the Band-Pass filter with what he calls a Roofing Filter, which is simply a High-Pass and Low-Pass filter connected in series for pre-conditioning the data and removing unwanted cycles. Zorro implements the Roofing Filter as the function Roof() and we use it here. Our trigger line is calculated as 90% of the amplitude of the 1-bar lagged Band-Pass filter output.

The “strategy” now does the following:

- Pre-processes the raw price series using Ehlers’ Roofing Filter
- Calculates the period of the dominant cycle of the pre-processed price series
- Tunes a Band-Pass filter centred on a period slightly lower than the dominant cycle, but with a bandwidth that incorporates the dominant cycle.
- Calculates a trigger line based on 90% of the one-bar lagged Band-Pass filter output.
- We simply reverse long and short when the trigger line crosses over and under the Band-Pass output respectively.
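The cutoff-tuning step above reduces to a single line of arithmetic. Here's a Python sketch of it, translated from the lite-C listing (the filter calls themselves are omitted):

```python
def tuned_cutoff(dominant_period, delta):
    # Shift the band-pass centre below the dominant cycle period so the
    # filter output slightly leads it (same arithmetic as the lite-C code)
    return (1 - 2 / 3 * (0.5 * delta)) * dominant_period

# With Delta = 0.3 the centre sits at 90% of the dominant period
print(tuned_cutoff(40, 0.3))  # → 36.0
```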

In this script, as well as the Band-Pass output and the trigger line, we also plot the dominant cycle and the bandwidth of the Band-Pass filter. I used a Delta of 0.3.

Here’s the code and the portfolio analysis (again we look at a few different assets and time frames):

```c
/* DSP Experiments from Cycle Analytics For Traders
   Note Ehlers often speaks in frequency terms, we work in period terms
   High frequency = low cutoff period
   Low frequency = high cutoff period */

function run()
{
	set(PLOTNOW);
	StartDate = 2007;
	EndDate = 2016;
	BarPeriod = 60;
	LookBack = 200*24;
	MaxLong = MaxShort = 1;

	while(asset(loop("EUR/USD", "USD/JPY", "SPX500")))
	while(algo(loop("H1", "H4", "D1")))
	{
		if(Algo == "H1") TimeFrame = 60/BarPeriod;
		else if(Algo == "H4") TimeFrame = 240/BarPeriod;
		else if(Algo == "D1") TimeFrame = 1440/BarPeriod;

		Spread = Commission = Slippage = RollLong = RollShort = 0;

		vars Price = series(price());

		// Pre-filter with Roofing Filter
		vars HP = series(Roof(Price, 10, 70));

		// Set up filter parameters
		var DomPeriod = DominantPeriod(HP, 30);
		var Delta = 0.3;
		var BP_Cutoff = (1 - 2/3*(0.5 * Delta)) * DomPeriod; // tune to a shorter period
		int Bandwidth = ceil(BP_Cutoff*Delta);
		int UpperBP_Cutoff = ceil(BP_Cutoff + Bandwidth/2);
		int LowerBP_Cutoff = floor(BP_Cutoff - Bandwidth/2);

		// Band-Pass filter
		vars BP = series(BandPass(HP, BP_Cutoff, Delta));

		// Trigger: 90% of the 1-bar lagged Band-Pass output
		vars Trigger = series(0.9*BP[1]);

		// Trade logic
		if(crossOver(Trigger, BP)) enterLong();
		if(crossUnder(Trigger, BP)) enterShort();

		plot("BP", BP, NEW, BLUE);
		plot("Trigger", Trigger, 0, RED);
		plot("DomCyclePeriod", DomPeriod, NEW, BLACK);
		plot("UpperCutoff", UpperBP_Cutoff, 0, RED);
		plot("LowerCutoff", LowerBP_Cutoff, 0, RED);
	}
}

/* OUTPUT:
Portfolio analysis   OptF  ProF  Win/Loss   Wgt%

EUR/USD avg          .031  1.04  3325/2678  22.0
SPX500 avg           .014  1.11  3079/2753  80.0
USD/JPY avg          .006  0.99  3262/2758  -2.0
D1 avg               .021  1.23  325/268    39.2
H1 avg               .020  1.04  7420/6351  36.2
H4 avg               .020  1.05  1921/1570  24.6
EUR/USD:D1           .049  1.18  113/91     11.2
EUR/USD:D1:L         .000  1.00  53/49      -0.1
EUR/USD:D1:S         .090  1.40  60/42      11.3
EUR/USD:H1           .041  1.03  2559/2057   8.8
EUR/USD:H1:L         .000  0.99  1282/1026  -1.1
EUR/USD:H1:S         .092  1.07  1277/1031   9.9
EUR/USD:H4           .009  1.01  653/530     2.0
EUR/USD:H4:L         .000  0.95  330/261    -4.5
EUR/USD:H4:S         .068  1.09  323/269     6.5
SPX500:D1            .023  1.44  114/78     31.4
SPX500:D1:L          .027  1.78  68/28      25.5
SPX500:D1:S          .010  1.15  46/50       5.9
SPX500:H1            .015  1.07  2349/2160  29.5
SPX500:H1:L          .022  1.11  1246/1009  23.9
SPX500:H1:S          .006  1.02  1103/1151   5.5
SPX500:H4            .011  1.09  616/515    19.2
SPX500:H4:L          .018  1.17  336/229    18.8
SPX500:H4:S          .000  1.00  280/286     0.4
USD/JPY:D1           .000  0.91  98/99      -3.4
USD/JPY:D1:L         .000  0.91  48/51      -1.7
USD/JPY:D1:S         .000  0.92  50/48      -1.7
USD/JPY:H1           .000  0.99  2512/2134  -2.1
USD/JPY:H1:L         .000  0.99  1266/1057  -1.0
USD/JPY:H1:S         .000  0.99  1246/1077  -1.1
USD/JPY:H4           .018  1.04  652/525     3.5
USD/JPY:H4:L         .020  1.04  321/267     1.8
USD/JPY:H4:S         .016  1.04  331/258     1.6
*/
```

And here’s the equity curve *(I also plotted the upper and lower range of the Band-Pass filter’s bandwidth around the dominant cycle):*

It occurred to me that maybe we want to avoid reversing our trades when the dominant cycle turns into a trend. When that happens, wouldn’t we want to hold onto our current trade, rather than reversing?

Now, recall that our *DominantPeriod()* function topped out at a cycle period of around 60 bars. It couldn’t detect any cycles longer than that period effectively *(go back and look at the plots of the sine waves and the output of DominantPeriod() above if you’re unsure).*

That means that at the upper end of *DominantPeriod()*’s output range, we have a significant amount of uncertainty about the actual dominant cycle period: it could be 60 bars, it could be 160 bars; *DominantPeriod()* would output roughly the same value in either case.

To test this, we can simply prevent the reversing of our positions when *DominantPeriod()* approaches the upper boundary of its range. That will mean that any currently open trade will stay open until *DominantPeriod()* detects a shorter cycle, hopefully exploiting any cycles that turn into trends. Just replace the trade logic in the script above with this code:

```c
// Trade logic
if(DomPeriod < 45)
{
	if(crossOver(Trigger, BP)) enterLong();
	if(crossUnder(Trigger, BP)) enterShort();
}
```

Here’s the resulting equity curve; the boost in performance is obvious and significant:

The Fourier Transform is another fundamental operation of DSP. It is used to:

- Decompose a signal into its component cycles, and
- Determine the contribution of each cycle to the signal

That is, the Fourier transform can tell us the period (or frequency) of the most significant cycles and also tell us something about their amplitude.

For example, if our Fourier Transform identified a strong cycle with a period of 40 days, and assuming we could reliably detect that cycle, we would know the optimum times to trade it: sell at day 10 (the peak of the cycle) and buy at day 30 (the trough).
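To make this concrete, here's a small numpy sketch (not part of the original workflow) that recovers a 40-day cycle from a noisy synthetic series via the FFT's power spectrum:

```python
import numpy as np

# Synthetic "price" series: a 40-day cycle plus noise. The FFT's power
# spectrum should peak at that period.
rng = np.random.default_rng(1)
n = 400
t = np.arange(n)
signal = np.sin(2 * np.pi * t / 40) + 0.3 * rng.normal(size=n)

spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n, d=1.0)                     # cycles per day
dominant = 1.0 / freqs[1:][np.argmax(spectrum[1:])]   # skip the DC bin
print(dominant)  # → 40.0
```

On real price data the peak is far less clean, which is exactly the problem explored below.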

*If only it were that simple!*

A plot of the contribution of each cycle to a signal is referred to as the signal’s frequency spectrum. Zorro has a tool for creating such plots.

I mentioned above that one of the confounding factors in the application of DSP to financial time series is that such series are typically non-stationary. That means that the frequency spectrum calculated over one period may be completely different over another.

How different?

Well, we already saw in our use of the *DominantPeriod()* function that the strongest cycle tended to wander around quite a bit. Let’s find out what analysis of the frequency spectrum says about that.

To plot the frequency spectrum of a price series, we need to use the *Spectrum()* function. This function calculates the relative contribution of a single cycle to the overall signal. If we plot the contribution of several different cycles, we’ll see just how unstable the relative contributions are. Here’s some code for plotting the contribution of the 24, 48, 72, 96 and 120 hour cycles to the EUR/USD over a few years:

```c
/* SPECTRAL ANALYSIS */

function run()
{
	set(PLOTNOW);
	StartDate = 20120101;
	EndDate = 2016;
	LookBack = 120*5;

	vars Price = series(price());

	PlotHeight1 = 200;
	PlotHeight2 = 100;

	int i;
	for(i=24; i<121; i+=24)
	{
		plot(strf("Spectrum_%d", i), Spectrum(Price, i, i*5), NEW,
			color(100*i/120, BLUE, DARKGREEN, RED, BLACK));
	}
}
```

And here’s the plot:

You can see that the relative contributions (or amplitude, or strength) of these cycles are anything but constant.
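As an aside, the core of such a spectrum measurement is a single-bin DFT: correlate the recent data with a sine and cosine of the period of interest. Here's a rough Python analogue (this is an illustration of the idea, not Zorro's exact `Spectrum()` formula):

```python
import numpy as np

def spectrum_at(prices, period, window):
    """Relative amplitude of one cycle over the last `window` points,
    measured by correlating with sine/cosine of the given period."""
    x = np.asarray(prices[-window:], dtype=float)
    x = x - x.mean()
    t = np.arange(window)
    c = (x * np.cos(2 * np.pi * t / period)).sum()
    s = (x * np.sin(2 * np.pi * t / period)).sum()
    return np.hypot(c, s) / window

# A pure 24-bar cycle registers strongly at period 24, weakly at period 60
prices = np.sin(2 * np.pi * np.arange(240) / 24)
print(spectrum_at(prices, 24, 120) > spectrum_at(prices, 60, 120))  # → True
```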

I wanted to show you this plot as a first step in exploring frequency spectra because often you’ll see a histogram or some other plot that shows the relative contributions of *all* the cycles *over some period of time*.

If you haven’t seen a plot like the one above, you could be forgiven for assuming that such a histogram describes general characteristics of a market: in fact, **what it really shows is just a snapshot in time.**

Let’s build one such snapshot in time. To do that, we call Spectrum() for every cycle period that we are interested in (inside a for() loop) and plot its relative contribution as a bar in a histogram. Here’s the code to do just that, where we plot the spectrum for a single 3-month period:

```c
/* SPECTRAL ANALYSIS */

function run()
{
	set(PLOTNOW);
	StartDate = 20160101;
	EndDate = StartDate + 00000300;
	LookBack = 120*2;

	vars Price = series(price());

	vars Rand = series(0);
	if(is(LOOKBACK))
		Rand[0] = Price[0];
	else
		Rand[0] = Rand[1] + random(0.005)*random();
	// plot("random", Rand, MAIN, BLUE);

	PlotWidth = 800;
	PlotHeight1 = 350;

	int Period;
	for(Period = 6; Period < 120; Period += 1)
	{
		plotBar("Spectrum", Period, Period, Spectrum(Price, Period, 2*Period), BARS+AVG+LBL2, DARKGREEN);
	}
}
```

Here’s the plot for EUR/USD:

If we advance the script to look at the next three-month period, we get this plot:

There are definite peaks in the frequency spectra plots, which suggests that certain cycles do in fact exist in this market.

However, you can see that for the most part, they are not constant, and some peaks shift more than others.

If you repeat the analysis over different periods (months, weeks, or years) you’ll find other shifting, ephemeral, but clearly visible cycles.

*But couldn’t those peaks just be random fluctuations? What evidence is there that they are real cycles, and not just an artefact of randomness? *

Well, we can construct a random price curve that starts at the same level as our real price curve and changes with roughly the same volatility, but does so using a random number generator. Then, we can plot the frequency spectrum of that random price curve and see if we get similar fluctuations.
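The same idea translates directly to numpy (a sketch of the approach, not the Zorro implementation): start the random walk at the real series' first value and step with the same standard deviation as the real daily changes.

```python
import numpy as np

def random_walk_like(prices, seed=0):
    """Random price curve starting at the same level as `prices` and
    stepping with the same standard deviation as its daily changes."""
    prices = np.asarray(prices, dtype=float)
    rng = np.random.default_rng(seed)
    steps = rng.normal(0.0, np.std(np.diff(prices)), size=len(prices) - 1)
    return np.concatenate([[prices[0]], prices[0] + np.cumsum(steps)])

# Stand-in for a real price series, for illustration
real = 1.10 + 0.002 * np.cumsum(np.random.default_rng(2).normal(size=500))
fake = random_walk_like(real)
print(fake[0] == real[0], len(fake))
```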

I already added the code to create the random price series to the script above (the `Rand` series). To test it, just replace *Price* with *Rand* in the *Spectrum()* call. Here’s a plot of one realization of that random price curve (blue), overlaid with the real one (black):

Looks pretty believable, doesn’t it? Visual inspection certainly suggests that it could pass for a real price series. Let’s see what sort of frequency spectrum such a random price curve generates:

Not really the same as the one generated by our real price curve.

Here’s another one in case the first one was a fluke (every time you run the code above, you’ll get a different random price curve, and I encourage you to test this yourself):

This *hints* that the cycles evident in the frequency spectrum of our price curve are not just random artefacts.

If we were convinced by this, then the challenge we would face is overcoming their shifting and ephemeral nature in order to take advantage of these cycles.

I take a somewhat measured attitude to using DSP as an approach to the markets.

We have seen that some financial time series *appear* to exhibit cyclical behaviour and are therefore candidates for exploitation using tools and techniques that can identify and quantify those cycles.

The challenge we face is that those cycles – if they even exist – are not constant in time.

Our analysis suggests that – even if they are not just random – these cycles are constantly shifting, evolving, appearing and disappearing, and will show up in different ways depending on the lens through which you view the markets. That’s no small challenge to overcome.

While it is possible that cycle decomposition, done skillfully, may yield good results, DSP is certainly more readily useful for replacing or upgrading traditional technical indicators. This is in spite of the fact that non-stationary, non-repeating financial data is something of a departure from the typical use cases of these techniques – perhaps because traditional indicators are already burdened with that particular shortcoming, in addition to their sub-optimal lag and attenuation characteristics.

If you wanted to take this approach further, you might want to look at the concept of wavelet decomposition. Wavelets are used to decompose a signal, much like the Fourier Transform, but do so in both the frequency *and* time domains. They may, therefore, be more useful than the frequency (equivalently, period) based decomposition we looked at here.
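As a taste of what wavelet decomposition looks like, here's one level of the simplest wavelet of all, the Haar wavelet, in plain numpy (real work would use a dedicated library; this sketch just shows the split into a low-pass "approximation" and a high-pass "detail" band, each localised in time):

```python
import numpy as np

def haar_step(x):
    """One level of the Haar wavelet transform: pairwise sums give the
    approximation (low-pass) band, pairwise differences the detail
    (high-pass) band, each half the input length."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

x = np.sin(2 * np.pi * np.arange(64) / 16)
a, d = haar_step(x)
# The transform is orthogonal, so energy is preserved across the two bands
print(np.allclose((a**2).sum() + (d**2).sum(), (x**2).sum()))  # → True
```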

In addition, there are other time-frequency decompositions which could be explored as well. If you are interested in pursuing wavelets for applications in algorithmic trading, expect to do *a lot* of research, particularly if you are new to DSP. That’s not meant to scare you off, rather I just want to give you a taste of what’s out there and set some realistic expectations.

Finally, there’s one other method that belongs partially to DSP, but probably more to control theory: the Kalman filter, which finds application in trading strategies like this one.

The post Using Digital Signal Processing in Quantitative Trading Strategies appeared first on Robot Wealth.

The post How to Calculate Rolling Pairwise Correlations in the Tidyverse appeared first on Robot Wealth.

How might we calculate rolling correlations between constituents of an ETF, given a dataframe of prices?

For problems like this, the `tidyverse` really shines. There are a number of ways to solve this problem… read on for our solution, and let us know if you’d approach it differently!

First, we load some packages and some data that we extracted earlier. `xlfprices.RData` contains a dataframe, `prices_xlf`, of constituents of the XLF ETF and their daily prices. You can get this data from our GitHub repository.

The dataset isn’t entirely accurate, as it contains prices of today’s constituents and doesn’t account for historical changes to the makeup of the ETF. But that won’t matter for our purposes.

```r
library(tidyverse)
library(lubridate)
library(glue)
library(here)

theme_set(theme_bw())

load(here::here("data", "xlfprices.RData"))

prices_xlf %>%
  head(10)
```

```
## # A tibble: 10 x 10
##    ticker date        open  high   low close   volume dividends closeunadj inSPX
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>      <dbl> <lgl>
##  1 AFL    2019-11-29  54.8  55.1  54.8  54.8  1270649         0       54.8 TRUE
##  2 AIG    2019-11-29  52.8  53.2  52.6  52.7  2865501         0       52.7 TRUE
##  3 AIZ    2019-11-29 133.  134.  133.  133.    202854         0      133.  TRUE
##  4 AJG    2019-11-29  93.3  93.6  93.0  93.3   392489         0       93.3 TRUE
##  5 ALL    2019-11-29 112.  112.  111.  111.    817942         0      111.  TRUE
##  6 AMP    2019-11-29 164.  165.  163.  164.    404660         0      164.  TRUE
##  7 AON    2019-11-29 203.  204.  202.  204.    415940         0      204.  TRUE
##  8 AXP    2019-11-29 120.  121.  120.  120.   1961463         0      120.  TRUE
##  9 BAC    2019-11-29  33.4  33.5  33.2  33.3  19503395        0       33.3 TRUE
## 10 BEN    2019-11-29  27.8  27.9  27.4  27.5   1485635        0       27.5 TRUE
```

We’d like to be able to calculate *rolling average pairwise correlations* between all the stocks as tidily as possible.

That requires that we calculate the rolling pairwise correlation between all the stock combinations in the index and then take the mean of all those.

A good way to tackle such problems is to chunk them down into bite-sized pieces and then solve each piece in turn. We split the problem into the following steps:

- calculate returns for each ticker
- create a long dataframe of all the pairwise ticker combinations for each day by doing a full join of the data on itself, keyed by date
- remove instances where we had the same stock twice (corresponding to the diagonal of the correlation matrix)
- remove instances where we have the complementary pair of the same stocks, eg we only want one of AAPL-GOOG and GOOG-AAPL (this is equivalent to removing the upper or lower triangle of the correlation matrix)
- use `slider::slide2_dbl` to do the rolling correlation calculation
- group by date and take the mean
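Before walking through the R implementation, here's the same end-to-end idea sketched in Python with pandas, whose `rolling().corr()` computes the pairwise correlation matrices directly. The tickers and prices here are synthetic stand-ins for the XLF data, so this is a cross-check on the approach rather than a reproduction of the results:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the price data (three hypothetical tickers)
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-02", periods=250, freq="B")
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, (250, 3)), axis=0)),
    index=dates, columns=["AFL", "AIG", "BAC"],
)

# close-to-close returns
returns = prices / prices.shift(1) - 1

# rolling 60-day pairwise correlation matrices, one per date
roll = returns.rolling(60).corr()

# mean of the off-diagonal entries of each date's correlation matrix
def mean_offdiag(mat):
    vals = mat.to_numpy()
    return vals[~np.eye(len(vals), dtype=bool)].mean()

mean_pw = roll.groupby(level=0).apply(mean_offdiag).dropna()
print(mean_pw.tail(3))
```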

The first step is straightforward – we simply calculate close-to-close returns and return a long dataframe of dates, tickers, and returns:

```r
# calculate returns to each stock
df <- prices_xlf %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(return = close / dplyr::lag(close) - 1) %>%
  select(date, ticker, return)

# function for prettier web display
pretty_table <- function(df) {
  require(kableExtra)
  df %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}

df %>%
  head(50) %>%
  pretty_table()
```

| date | ticker | return |
|---|---|---|
| 2015-01-02 | AFL | NA |
| 2015-01-05 | AFL | -0.0261952 |
| 2015-01-06 | AFL | -0.0089106 |
| 2015-01-07 | AFL | 0.0062765 |
| 2015-01-08 | AFL | 0.0097775 |
| 2015-01-09 | AFL | -0.0212020 |
| 2015-01-12 | AFL | -0.0057991 |
| 2015-01-13 | AFL | -0.0049751 |
| 2015-01-14 | AFL | -0.0077586 |
| 2015-01-15 | AFL | -0.0033015 |
| 2015-01-16 | AFL | 0.0142957 |
| 2015-01-20 | AFL | -0.0029220 |
| 2015-01-21 | AFL | 0.0015515 |
| 2015-01-22 | AFL | 0.0180723 |
| 2015-01-23 | AFL | -0.0071006 |
| 2015-01-26 | AFL | 0.0049379 |
| 2015-01-27 | AFL | -0.0086411 |
| 2015-01-28 | AFL | -0.0196548 |
| 2015-01-29 | AFL | 0.0050558 |
| 2015-01-30 | AFL | -0.0098873 |
| 2015-02-02 | AFL | 0.0190960 |
| 2015-02-03 | AFL | 0.0158157 |
| 2015-02-04 | AFL | 0.0269081 |
| 2015-02-05 | AFL | 0.0049440 |
| 2015-02-06 | AFL | 0.0049196 |
| 2015-02-09 | AFL | -0.0022846 |
| 2015-02-10 | AFL | 0.0050703 |
| 2015-02-11 | AFL | 0.0061839 |
| 2015-02-12 | AFL | -0.0004852 |
| 2015-02-13 | AFL | 0.0053398 |
| 2015-02-17 | AFL | 0.0035410 |
| 2015-02-18 | AFL | -0.0096231 |
| 2015-02-19 | AFL | 0.0004858 |
| 2015-02-20 | AFL | 0.0056653 |
| 2015-02-23 | AFL | -0.0112667 |
| 2015-02-24 | AFL | 0.0102556 |
| 2015-02-25 | AFL | -0.0017725 |
| 2015-02-26 | AFL | 0.0037127 |
| 2015-02-27 | AFL | 0.0011258 |
| 2015-03-02 | AFL | 0.0051406 |
| 2015-03-03 | AFL | -0.0046348 |
| 2015-03-04 | AFL | 0.0011240 |
| 2015-03-05 | AFL | 0.0072173 |
| 2015-03-06 | AFL | -0.0101911 |
| 2015-03-09 | AFL | 0.0022523 |
| 2015-03-10 | AFL | -0.0200642 |
| 2015-03-11 | AFL | 0.0072072 |
| 2015-03-12 | AFL | 0.0188649 |
| 2015-03-13 | AFL | -0.0083001 |
| 2015-03-16 | AFL | 0.0133591 |

Next, we create a long dataframe of all the combinations for each day by doing a full join of the data on itself, by date.

```r
# combinations by date
pairwise_combos <- df %>%
  full_join(df, by = "date")

pairwise_combos %>%
  na.omit() %>%
  head(20) %>%
  pretty_table()
```

| date | ticker.x | return.x | ticker.y | return.y |
|---|---|---|---|---|
| 2015-01-05 | AFL | -0.0261952 | AFL | -0.0261952 |
| 2015-01-05 | AFL | -0.0261952 | AIG | -0.0197826 |
| 2015-01-05 | AFL | -0.0261952 | AIZ | -0.0224288 |
| 2015-01-05 | AFL | -0.0261952 | AJG | -0.0059600 |
| 2015-01-05 | AFL | -0.0261952 | ALL | -0.0198232 |
| 2015-01-05 | AFL | -0.0261952 | AMP | -0.0320993 |
| 2015-01-05 | AFL | -0.0261952 | AON | -0.0096470 |
| 2015-01-05 | AFL | -0.0261952 | AXP | -0.0264459 |
| 2015-01-05 | AFL | -0.0261952 | BAC | -0.0290503 |
| 2015-01-05 | AFL | -0.0261952 | BEN | -0.0331591 |
| 2015-01-05 | AFL | -0.0261952 | BK | -0.0257044 |
| 2015-01-05 | AFL | -0.0261952 | BLK | -0.0258739 |
| 2015-01-05 | AFL | -0.0261952 | C | -0.0315149 |
| 2015-01-05 | AFL | -0.0261952 | CB | -0.0163404 |
| 2015-01-05 | AFL | -0.0261952 | CB | 0.0000000 |
| 2015-01-05 | AFL | -0.0261952 | CBOE | 0.0318813 |
| 2015-01-05 | AFL | -0.0261952 | CFG | -0.0234249 |
| 2015-01-05 | AFL | -0.0261952 | CINF | -0.0143383 |
| 2015-01-05 | AFL | -0.0261952 | CMA | -0.0356448 |
| 2015-01-05 | AFL | -0.0261952 | CME | 0.0056728 |
So far so good.

Now we’ve got some wrangling to do. We want to remove instances where we have the same stock for `ticker.x` and `ticker.y`, which corresponds to the diagonal on the correlation matrix.

We also want to remove instances where we have the same stock, but with the `ticker.x` and `ticker.y` designations reversed (this is equivalent to removing the upper or lower triangle of the correlation matrix).

Note that we need to ungroup our dataframe (we grouped it earlier) – if we don’t ungroup our variables, the grouping variable will be added back and thwart attempts to filter distinct cases.
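The deduplication trick – ordering the two tickers alphabetically so that each unordered pair gets a single canonical key – is language-agnostic. Here's the same idea in Python:

```python
def pair_key(a, b):
    # Order the two tickers alphabetically so AAPL-GOOG and GOOG-AAPL
    # collapse to the same key -- equivalent to keeping one triangle
    # of the correlation matrix
    return f"{min(a, b)}, {max(a, b)}"

print(pair_key("GOOG", "AAPL"))  # → AAPL, GOOG
print(pair_key("AAPL", "GOOG"))  # → AAPL, GOOG
```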

```r
pairwise_combos <- pairwise_combos %>%
  ungroup() %>%  # important!!
  # drop diagonal
  filter(ticker.x != ticker.y) %>%
  # remove duplicate pairs (eg A-AAL, AAL-A)
  mutate(tickers = ifelse(ticker.x < ticker.y,
                          glue("{ticker.x}, {ticker.y}"),
                          glue("{ticker.y}, {ticker.x}"))) %>%
  distinct(date, tickers, .keep_all = TRUE)

pairwise_combos %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()
```

| date | ticker.x | return.x | ticker.y | return.y | tickers |
|---|---|---|---|---|---|
| 2015-01-05 | AFL | -0.0261952 | AIG | -0.0197826 | AFL, AIG |
| 2015-01-05 | AFL | -0.0261952 | AIZ | -0.0224288 | AFL, AIZ |
| 2015-01-05 | AFL | -0.0261952 | AJG | -0.0059600 | AFL, AJG |
| 2015-01-05 | AFL | -0.0261952 | ALL | -0.0198232 | AFL, ALL |
| 2015-01-05 | AFL | -0.0261952 | AMP | -0.0320993 | AFL, AMP |
| 2015-01-05 | AFL | -0.0261952 | AON | -0.0096470 | AFL, AON |
| 2015-01-05 | AFL | -0.0261952 | AXP | -0.0264459 | AFL, AXP |
| 2015-01-05 | AFL | -0.0261952 | BAC | -0.0290503 | AFL, BAC |
| 2015-01-05 | AFL | -0.0261952 | BEN | -0.0331591 | AFL, BEN |
| 2015-01-05 | AFL | -0.0261952 | BK | -0.0257044 | AFL, BK |
| 2015-01-05 | AFL | -0.0261952 | BLK | -0.0258739 | AFL, BLK |
| 2015-01-05 | AFL | -0.0261952 | C | -0.0315149 | AFL, C |
| 2015-01-05 | AFL | -0.0261952 | CB | -0.0163404 | AFL, CB |
| 2015-01-05 | AFL | -0.0261952 | CBOE | 0.0318813 | AFL, CBOE |
| 2015-01-05 | AFL | -0.0261952 | CFG | -0.0234249 | AFL, CFG |
| 2015-01-05 | AFL | -0.0261952 | CINF | -0.0143383 | AFL, CINF |
| 2015-01-05 | AFL | -0.0261952 | CMA | -0.0356448 | AFL, CMA |
| 2015-01-05 | AFL | -0.0261952 | CME | 0.0056728 | AFL, CME |
| 2015-01-05 | AFL | -0.0261952 | COF | -0.0230331 | AFL, COF |
| 2015-01-05 | AFL | -0.0261952 | DFS | -0.0223378 | AFL, DFS |
| 2015-01-05 | AFL | -0.0261952 | ETFC | -0.0324865 | AFL, ETFC |
| 2015-01-05 | AFL | -0.0261952 | FITB | -0.0301831 | AFL, FITB |
| 2015-01-05 | AFL | -0.0261952 | FRC | -0.0299424 | AFL, FRC |
| 2015-01-05 | AFL | -0.0261952 | GL | -0.0179099 | AFL, GL |
| 2015-01-05 | AFL | -0.0261952 | GS | -0.0312227 | AFL, GS |
| 2015-01-05 | AFL | -0.0261952 | HBAN | -0.0295238 | AFL, HBAN |
| 2015-01-05 | AFL | -0.0261952 | HIG | -0.0210577 | AFL, HIG |
| 2015-01-05 | AFL | -0.0261952 | ICE | 0.0054677 | AFL, ICE |
| 2015-01-05 | AFL | -0.0261952 | IVZ | -0.0291262 | AFL, IVZ |
| 2015-01-05 | AFL | -0.0261952 | JPM | -0.0310450 | AFL, JPM |
Next, we’ll use the brilliantly useful `slider` package and the function `slide2_dbl` to do the rolling correlation calculation (`slider` implements a number of rolling window calculation functions – we’ll explore it more in another post):

```r
period <- 60

pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x,
    .y = return.y,
    .f = ~cor(.x, .y),
    .before = period,
    .complete = TRUE)
  ) %>%
  select(date, tickers, rollingcor)

pairwise_corrs %>%
  na.omit() %>%
  head(30) %>%
  pretty_table()
```

| date | tickers | rollingcor |
|---|---|---|
| 2015-04-01 | AFL, AIG | 0.7818676 |
| 2015-04-02 | AFL, AIG | 0.7718580 |
| 2015-04-06 | AFL, AIG | 0.7678625 |
| 2015-04-07 | AFL, AIG | 0.7680022 |
| 2015-04-08 | AFL, AIG | 0.7813979 |
| 2015-04-09 | AFL, AIG | 0.7711979 |
| 2015-04-10 | AFL, AIG | 0.7678292 |
| 2015-04-13 | AFL, AIG | 0.7469418 |
| 2015-04-14 | AFL, AIG | 0.7423953 |
| 2015-04-15 | AFL, AIG | 0.7422602 |
| 2015-04-16 | AFL, AIG | 0.7375739 |
| 2015-04-17 | AFL, AIG | 0.7422152 |
| 2015-04-20 | AFL, AIG | 0.7391824 |
| 2015-04-21 | AFL, AIG | 0.7285547 |
| 2015-04-22 | AFL, AIG | 0.7220685 |
| 2015-04-23 | AFL, AIG | 0.7302781 |
| 2015-04-24 | AFL, AIG | 0.7220693 |
| 2015-04-27 | AFL, AIG | 0.6898930 |
| 2015-04-28 | AFL, AIG | 0.6830527 |
| 2015-04-29 | AFL, AIG | 0.6761612 |
| 2015-04-30 | AFL, AIG | 0.6472679 |
| 2015-05-01 | AFL, AIG | 0.5972614 |
| 2015-05-04 | AFL, AIG | 0.6544743 |
| 2015-05-05 | AFL, AIG | 0.6505913 |
| 2015-05-06 | AFL, AIG | 0.6460236 |
| 2015-05-07 | AFL, AIG | 0.6449847 |
| 2015-05-08 | AFL, AIG | 0.6497471 |
| 2015-05-11 | AFL, AIG | 0.6656422 |
| 2015-05-12 | AFL, AIG | 0.6721218 |
| 2015-05-13 | AFL, AIG | 0.6832579 |
The syntax of `slide2_dbl` might look odd if it’s the first time you’ve seen it, but it leverages the tidyverse’s functional programming tools to repeatedly apply a function (given by `.f = ~cor(.x, .y)`) over windows of our data specified by `.before` (the number of prior periods to include in each window) and `.complete` (whether to evaluate `.f` on complete windows only).
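To make those window rules concrete, here's a toy Python model of what `slide2_dbl` computes (this is an illustration of the semantics, not the real `slider` implementation):

```python
import math

def slide2(x, y, f, before, complete=True):
    """Toy model of slider::slide2_dbl's window rules: each output uses
    the current element plus `before` prior elements; incomplete windows
    yield NaN when `complete` is True."""
    out = []
    for i in range(len(x)):
        lo = i - before
        if complete and lo < 0:
            out.append(math.nan)   # window not yet complete
        else:
            lo = max(lo, 0)
            out.append(f(x[lo:i + 1], y[lo:i + 1]))
    return out

# Window length is before + 1 once enough history has accumulated
print(slide2([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], lambda a, b: len(a), before=2))
```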

The `~` notation might look odd too. In this case, it’s used as shorthand for an anonymous function: `function(.x, .y) { cor(.x, .y) }`.

```r
pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x,
    .y = return.y,
    .f = function(.x, .y) { cor(.x, .y) },  # long-hand anonymous function
    .before = period,
    .complete = TRUE)
  ) %>%
  select(date, tickers, rollingcor)
```

Now, the other confusing things about this transformation are the seemingly inconsistent arguments in `slide2_dbl`:

- we designate a `.x` and a `.y` argument
- but we also define a function with these arguments

Actually, the `.x` and `.y` names are conventions used throughout the tidyverse to designate variables that are subject to non-standard evaluation (more on what that means in another post – it’s not critical right now). In our `slide2_dbl` call, `.x` is passed as the first argument to `.f` and `.y` is passed as the second.

That means that we could equally write our transformation like this, and it would be equivalent:

```r
pairwise_corrs <- pairwise_combos %>%
  group_by(tickers) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(rollingcor = slider::slide2_dbl(
    .x = return.x,
    .y = return.y,
    .f = function(arg1, arg2) { cor(arg1, arg2) },  # the name of the args doesn't matter
    .before = period,
    .complete = TRUE)
  ) %>%
  select(date, tickers, rollingcor)
```

Finally, to get the mean rolling correlation of the ETF constituents, we simply group by date and take the mean of the group:

```r
mean_pw_cors <- pairwise_corrs %>%
  group_by(date) %>%
  summarise(mean_pw_corr = mean(rollingcor, na.rm = TRUE))

mean_pw_cors %>%
  na.omit() %>%
  ggplot(aes(x = date, y = mean_pw_corr)) +
  geom_line() +
  labs(
    x = "Date",
    y = "Mean Pairwise Correlation",
    title = "Rolling Mean Pairwise Correlation",
    subtitle = "XLF Constituents"
  )
```

In this post, we broke down our problem of calculating the rolling mean correlation of the constituents of an ETF into various chunks and solved them one at a time to get the desired output.

The tidy data manipulation snippets we used here will be useful for doing similar transformations, such as rolling beta calculations, as well as single-variable rolling calculations such as volatility.

One problem that we glossed over here is that our largest dataframe – the one containing the pairwise combinations of returns – consisted of just under 3 million rows. That means we can easily do this entire piece of analysis in memory.

Things get slightly more difficult if we want to calculate the mean rolling correlation of the constituents of a larger ETF or index.

In another post, we’ll solve this problem for the S&P 500 index. We’ll also consider how the index has changed over time.
