How to Fill Gaps in Large Stock Data Universes Using tidyr and dplyr

Posted on May 27, 2020 by Robot James
No Comments
272 Views

When you’re working with large universes of stock data you’ll come across a lot of challenges:

  • Stocks pay dividends and other distributions that have to be accounted for.
  • Stocks are subject to splits and other corporate actions which also have to be accounted for.
  • New stocks are listed all the time – you won’t have as much history for these stocks as for other stocks.
  • Stocks are delisted, and many datasets do not include the price history of delisted stocks
  • Stocks can be suspended or halted for a period of time, leading to trading gaps.
  • Companies grow and shrink: the “top 100 stocks by market cap” in 1990 looks very different to the same group in 2020; “growth stocks” in 1990 look very different to “growth stocks” in 2020 etc.

The challenges are well understood, but dealing with them is not always straightforward.

One significant challenge is gaps in data.

Quant analysis gets very hard if you have missing or misaligned data.

If you’re working with a universe of 1,000 stocks life is a lot easier if you have an observation for each stock for each trading date, regardless of whether it actually traded that day. That way:

  • you can always do look-ups by date
  • any grouped aggregations or rolling window aggregations will be operating on the date range for every ticker
  • you can easily sense check the size of your data to have trading_days * number_of_stocks rows.

If you work with “wide” matrix-like data, these challenges are obvious because you have one row for every date in your data set, and the columns represent an observation for each ticker.

We usually work with long or “tidy” data – where each observation is an observation for a stock for a given day.

How do we work productively in this data, whilst still ensuring that we fill in any gaps in our long data with NAs?

The tidyverse makes this very straightforward. Let me show you!

First, here’s some dummy data to illustrate the problem:

library(tidyverse)
testdata <- tibble(date = c(1,1,2,2,2,3,3),
                       ticker = c('AMZN','FB','AMZN','FB','TSLA','AMZN','TSLA'),
                       returns = 1:7 / 100)
testdata
## # A tibble: 7 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     2 AMZN      0.03
## 4     2 FB        0.04
## 5     2 TSLA      0.05
## 6     3 AMZN      0.06
## 7     3 TSLA      0.07
  • TSLA is missing from date 1 as it only started trading after the others
  • FB is missing from date 3 as it was put on trading halt after Citron Research hacked into Zuck’s memory banks

Ideally we want a row for every date for every stock – with returns set to NA in the case where data is missing.

That way we can always look up a price by date. And we can always be sure that any grouped operations by ticker return the same size data set.

Turns out that the tidyr::complete function is exactly what we’re looking for. It turns implicit missing values – like the returns for TSLA on date 1 and FB on date 3 – into explicit missing values:

tidydata <- testdata %>%
  complete(date, ticker)

tidydata
## # A tibble: 9 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     1 TSLA     NA   
## 4     2 AMZN      0.03
## 5     2 FB        0.04
## 6     2 TSLA      0.05
## 7     3 AMZN      0.06
## 8     3 FB       NA   
## 9     3 TSLA      0.07

Easy!

Now we have a row for every date for every stock.

Now we can safely do grouped aggregations by ticker, on the understanding that the data is the same size for all tickers, and we’ve removed one large source of potential analysis screw-up…

tidydata %>%
  group_by(ticker) %>%
  summarise(count = n())
## # A tibble: 3 x 2
##   ticker count
##   <chr>  <int>
## 1 AMZN       3
## 2 FB         3
## 3 TSLA       3

Another approach

There’s also a more verbose way to achieve our aim, and I’m showing it here because I think it’s useful to see how different functions and libraries connect and cross-over in the tidyverse (right now I’m fascinated by the intersection of the purrr::map functions and the dplyr::summarise_if, _at, _all functions…but that’s a story for another time).

The verbose approach is as follows:

  • use tidyr::pivot_wide to reshape the data to row per date, with a column for each stock
  • use tidyr::pivot_long to reshape it back to its longer format.

Let’s do it step by step…

First, we make it wide:

widedata <- testdata %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = returns)

widedata
## # A tibble: 3 x 4
##    date  AMZN    FB  TSLA
##   <dbl> <dbl> <dbl> <dbl>
## 1     1  0.01  0.02 NA   
## 2     2  0.03  0.04  0.05
## 3     3  0.06 NA     0.07

Where we had missing rows, we now have NA.

Now we make it long again:

tidydata <- widedata %>%
  pivot_longer(-date, names_to = 'ticker', values_to =  'returns')

tidydata
## # A tibble: 9 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     1 TSLA     NA   
## 4     2 AMZN      0.03
## 5     2 FB        0.04
## 6     2 TSLA      0.05
## 7     3 AMZN      0.06
## 8     3 FB       NA   
## 9     3 TSLA      0.07

And again we have a row for every date for every stock.

tidydata %>%
  group_by(ticker) %>%
  summarise(count = n())
## # A tibble: 3 x 2
##   ticker count
##   <chr>  <int>
## 1 AMZN       3
## 2 FB         3
## 3 TSLA       3

Here’s the complete pipeline:

testdata %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = returns) %>%
  pivot_longer(-date, names_to = 'ticker', values_to =  'returns')
## # A tibble: 9 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     1 TSLA     NA   
## 4     2 AMZN      0.03
## 5     2 FB        0.04
## 6     2 TSLA      0.05
## 7     3 AMZN      0.06
## 8     3 FB       NA   
## 9     3 TSLA      0.07

What if we have more than one variable in our orignal data?

One of the benefits of working with longer “tidy” data is that we can have multiple variables per date/stock observation.

testwider <- testdata %>%
  mutate(volume = 100:106,
         otherfeature = 200:206)

testwider
## # A tibble: 7 x 5
##    date ticker returns volume otherfeature
##   <dbl> <chr>    <dbl>  <int>        <int>
## 1     1 AMZN      0.01    100          200
## 2     1 FB        0.02    101          201
## 3     2 AMZN      0.03    102          202
## 4     2 FB        0.04    103          203
## 5     2 TSLA      0.05    104          204
## 6     3 AMZN      0.06    105          205
## 7     3 TSLA      0.07    106          206

Again, we’re missing data for TSLA on date 1 and FB on date 3, but now we’re also missing volume and otherfeature in addition to returns.

To use complete, nothing changes from earlier:

testwider %>%
  complete(date, ticker)
## # A tibble: 9 x 5
##    date ticker returns volume otherfeature
##   <dbl> <chr>    <dbl>  <int>        <int>
## 1     1 AMZN      0.01    100          200
## 2     1 FB        0.02    101          201
## 3     1 TSLA     NA        NA           NA
## 4     2 AMZN      0.03    102          202
## 5     2 FB        0.04    103          203
## 6     2 TSLA      0.05    104          204
## 7     3 AMZN      0.06    105          205
## 8     3 FB       NA        NA           NA
## 9     3 TSLA      0.07    106          206

However if we want to pivot back and forth, we do the following:

  • use pivot_wide to reshape the data to row per date, with a column for each stock
  • use pivot_long to reshape it back to its longer format
  • use left_join to recover the rest of the variables from the original data.
testwider %>%
  pivot_wider(id_cols = date, names_from = ticker, values_from = returns) %>%
  pivot_longer(-date, names_to = 'ticker', values_to =  'returns') %>%
  left_join(testwider, by = c('date', 'ticker', 'returns'))
## # A tibble: 9 x 5
##    date ticker returns volume otherfeature
##   <dbl> <chr>    <dbl>  <int>        <int>
## 1     1 AMZN      0.01    100          200
## 2     1 FB        0.02    101          201
## 3     1 TSLA     NA        NA           NA
## 4     2 AMZN      0.03    102          202
## 5     2 FB        0.04    103          203
## 6     2 TSLA      0.05    104          204
## 7     3 AMZN      0.06    105          205
## 8     3 FB       NA        NA           NA
## 9     3 TSLA      0.07    106          206

Conclusions

  • Missing values in financial data threaten the validity of quant analysis due to inadvertent misalignment
  • Wide data tends to highlight such missing data
  • Long data tends to hide it
  • tidyr::complete is a succinct and efficient way to ensure that missing observations are accounted for with NA
  • Like most tasks in R, there is more than one way to go about it. But complete should be your go-to function.

Want all the code?

All the code in this post is available in our github repo where you can find lots of other recipes and tools to make your life as a quant researcher easier.

If you liked this you’ll probably like these too…

Financial Data Manipulation in dplyr for Quant Traders

Handling a Large Universe of Stock Price Data in R: Profiling with profvis

 

How to Calculate Rolling Pairwise Correlations in the Tidyverse

 

Leave a Comment