How to Wrangle JSON Data in R with jsonlite, purr and dplyr

Posted on May 20, 2020 by Kris Longmore
1 Comment
123 Views

Working with modern APIs you will often have to wrangle with data in JSON format.

This article presents some tools and recipes for working with JSON data with R in the tidyverse.

We’ll use purrr::map functions to extract and transform our JSON data. And we’ll provide intuitive examples of the cross-overs and differences between purrr and dplyr.

library(tidyverse)
library(here)
library(kableExtra)

pretty_print <- function(df, num_rows) {
  df %>%
  head(num_rows) %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}

Load JSON as nested named lists

This data has been converted from raw JSON to nested named lists using jsonlite::fromJSON with the simplify argument set to FALSE (that is, all elements are converted to named lists).

The data consists of market data for SPY options with various strikes and expiries. We got it from the options data vendor Orats, whose data API I enjoy almost as much as their orange website.

If you want to follow along, you can sign-up for a free trial of the API, and load the data directly from the Orats API with the following code (just define your API key in the ORATS_token variable):

library(httr)

ORATS_token <- 'YOUR_KEY_HERE'
res <- GET('https://api.orats.io/data/strikes?tickers=SPY', add_headers(Authorization = ORATS_token))

if (http_type(res) == 'application/json') {
  strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE)
} else {
  stop('No json returned')
}

if (http_error(res)) {
  stop(paste('API request error:',status_code(res), odata$message, odata$documentation_url))
} 

Now, if you want to read this data directly into a nicely formatted dataframe, replace the line:

strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE)

with

strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = TRUE, flatten = TRUE)

However, you should know that it isn’t always possible to coerce JSON into nicely shaped dataframes this easily – often the raw JSON won’t contain primitive types, or will have nested key-value pairs on the same level as your desired dataframe columns, to name a couple of obstacles.

In that case, it’s useful to have some tools – like the ones in this post – for wrangling your source data.

So let’s look at that strikes object, and show how we can wrangle it into something useful…

Look inside JSON lists

str(strikes, max.level = 1)  
## List of 1
##  $ data:List of 2440

This tells us we have a component named “data”. Let’s look at that a little more closely:

str(strikes$data, max.level = 1, list.len = 10)
## List of 2440
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##   [list output truncated]

This suggests we have homogenous lists of 40 elements each (an assumption we’ll check shortly).

Let’s look at one of those lists:

str(strikes$data[[1]])
## List of 40
##  $ ticker          : chr "SPY"
##  $ tradeDate       : chr "2020-05-19"
##  $ expirDate       : chr "2020-05-29"
##  $ dte             : int 11
##  $ strike          : int 140
##  $ stockPrice      : num 293
##  $ callVolume      : int 0
##  $ callOpenInterest: int 0
##  $ callBidSize     : int 20
##  $ callAskSize     : int 23
##  $ putVolume       : int 0
##  $ putOpenInterest : int 2312
##  $ putBidSize      : int 0
##  $ putAskSize      : int 7117
##  $ callBidPrice    : num 152
##  $ callValue       : num 153
##  $ callAskPrice    : num 153
##  $ putBidPrice     : int 0
##  $ putValue        : num 1.12e-25
##  $ putAskPrice     : num 0.01
##  $ callBidIv       : int 0
##  $ callMidIv       : num 0.98
##  $ callAskIv       : num 1.96
##  $ smvVol          : num 0.476
##  $ putBidIv        : int 0
##  $ putMidIv        : num 0.709
##  $ putAskIv        : num 1.42
##  $ residualRate    : num -0.00652
##  $ delta           : int 1
##  $ gamma           : num 9.45e-16
##  $ theta           : num -0.00288
##  $ vega            : num 2e-11
##  $ rho             : num 0.0384
##  $ phi             : num -0.0802
##  $ driftlessTheta  : num -6.07e-09
##  $ extSmvVol       : num 0.478
##  $ extCallValue    : num 153
##  $ extPutValue     : num 1.77e-25
##  $ spotPrice       : num 293
##  $ updatedAt       : chr "2020-05-19 20:02:33"

All these elements look like they can be easily handled. For instance, I don’t see any more deeply nested lists, weird missing values, or anything else that looks difficult.

So now I’ll pull out the interesting bit:

strikes <- strikes[["data"]]

How many observations do we have?

length(strikes)
## [1] 2440

Are all strike sublists identically named?

This is where we’ll check that our sublists are indeed homogeneously named, as we assumed above:

strikes %>%
  map(names) %>%  # this applies the base R function names to each sublist, and returns a list of lists with the output
  unique() %>%
  length() == 1
## [1] TRUE

Make a dataframe

We should also check the variable types are consistent as we need single types in each column of a dataframe (although R will warn if it is forced to coerce one type to another).

Here’s an interesting thing. It uses a nested purrr::map to get the variable types for each element of each sublist. They’re actually not identical according to this:

strikes %>%
  map(.f = ~{map_chr(.x, .f = class)}) %>%
  unique() %>%
  length()
## [1] 39

This is actually a little puzzling. Inspecting the individual objects suggests that we do have identical types. If anyone has anything to say about this, I’d love to hear about it in the comments. In any event, after we make our dataframe, we should check that the variable types are as expected.

Now, to that dataframe…

purrr::flatten removes one level of hierarchy from a list (unlist removes them all). Here, flatten is applied to each sub-list in strikes via purrr::map_df.

We use the variant flatten_df which returns each sublist as a dataframe, which makes it compatible with purrr::map_df,which requires a function that returns a dataframe.

strikes_df <- strikes %>%
  map_df(flatten_df)

strikes_df %>%
  pretty_print(30)
ticker tradeDate expirDate dte strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume putOpenInterest putBidSize putAskSize callBidPrice callValue callAskPrice putBidPrice putValue putAskPrice callBidIv callMidIv callAskIv smvVol putBidIv putMidIv putAskIv residualRate delta gamma theta vega rho phi driftlessTheta extSmvVol extCallValue extPutValue spotPrice updatedAt
SPY 2020-05-19 2020-05-29 11 140 292.55 0 0 20 23 0 2312 0 7117 152.37 152.5790 152.83 0.00 0.0000000 0.01 0 0.980149 1.960300 0.476046 0.000000 0.708976 1.417950 -0.0065171 1.000000 0.0000000 -0.0028827 0.0000000 0.0383618 -0.0801790 0.0000000 0.478157 152.5790 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 145 292.55 0 0 20 23 0 2322 0 5703 147.37 147.5800 147.83 0.00 0.0000000 0.01 0 0.936511 1.873020 0.476046 0.000000 0.676907 1.353810 -0.0065171 1.000000 0.0000000 -0.0029856 0.0000000 0.0397319 -0.0801790 0.0000000 0.478157 147.5800 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 150 292.55 0 1 20 18 0 1912 0 5703 142.39 142.5810 142.83 0.00 0.0000000 0.01 0 0.894396 1.788790 0.476046 0.000000 0.645945 1.291890 -0.0065171 1.000000 0.0000000 -0.0030886 0.0000000 0.0411020 -0.0801790 0.0000000 0.478157 142.5810 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 155 292.55 0 0 22 23 0 1483 0 5583 137.36 137.5820 137.81 0.00 0.0000000 0.01 0 0.844663 1.689330 0.476046 0.000000 0.616016 1.232030 -0.0065171 1.000000 0.0000000 -0.0031915 0.0000000 0.0424720 -0.0801790 0.0000000 0.478157 137.5820 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 160 292.55 0 0 20 22 0 929 0 7016 132.39 132.5830 132.85 0.00 0.0000000 0.01 0 0.823228 1.646460 0.476046 0.000000 0.587053 1.174110 -0.0065171 1.000000 0.0000000 -0.0032945 0.0000000 0.0438421 -0.0801790 0.0000000 0.478157 132.5830 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 165 292.55 0 0 20 22 0 1874 0 5943 127.39 127.5840 127.85 0.00 0.0000000 0.01 0 0.784980 1.569960 0.476046 0.000000 0.558997 1.117990 -0.0065171 1.000000 0.0000000 -0.0033974 0.0000000 0.0452122 -0.0801790 0.0000000 0.478157 127.5840 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 170 292.55 0 0 22 23 0 4055 0 7407 122.37 122.5850 122.83 0.00 0.0000000 0.01 0 0.739266 1.478530 0.476046 0.000000 0.531711 1.063420 -0.0065171 1.000000 0.0000000 -0.0035004 0.0000000 0.0465822 -0.0801790 0.0000000 0.478157 122.5850 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 175 292.55 0 0 22 23 0 2992 0 5103 117.36 117.5860 117.81 0.00 0.0000000 0.01 0 0.694933 1.389870 0.476046 0.000000 0.504757 1.009510 -0.0065171 1.000000 0.0000000 -0.0036034 0.0000000 0.0479523 -0.0801790 0.0000000 0.478157 117.5860 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 180 292.55 0 0 20 31 0 4320 0 9302 112.39 112.5870 112.83 0.00 0.0000000 0.01 0 0.668237 1.336470 0.476046 0.000000 0.478572 0.957144 -0.0065171 1.000000 0.0000000 -0.0037063 0.0000000 0.0493224 -0.0801790 0.0000000 0.478157 112.5870 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 185 292.55 0 0 20 31 1200 5863 0 4686 107.39 107.5880 107.83 0.00 0.0000000 0.01 0 0.633658 1.267320 0.476046 0.000000 0.453113 0.906225 -0.0065171 1.000000 0.0000000 -0.0038093 0.0000000 0.0506924 -0.0801790 0.0000000 0.478157 107.5880 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 190 292.55 0 2 8 31 5 8253 0 4199 102.39 102.5890 102.83 0.00 0.0000000 0.01 0 0.600019 1.200040 0.476046 0.000000 0.428340 0.856680 -0.0065171 1.000000 0.0000000 -0.0039123 0.0000000 0.0520625 -0.0801790 -0.0000001 0.478157 102.5890 0.0000000 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 195 292.55 0 10 20 31 130 5417 0 5965 97.39 97.5902 97.83 0.00 0.0000002 0.01 0 0.567271 1.134540 0.476046 0.000000 0.404219 0.808438 -0.0065171 1.000000 0.0000000 -0.0040155 0.0000002 0.0534326 -0.0801790 -0.0000003 0.478157 97.5902 0.0000003 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 200 292.55 0 72 8 31 2139 11544 0 3657 92.39 92.5912 92.83 0.00 0.0000015 0.01 0 0.535369 1.070740 0.476046 0.000000 0.380715 0.761431 -0.0065171 1.000000 0.0000001 -0.0041200 0.0000008 0.0548026 -0.0801790 -0.0000019 0.478157 92.5912 0.0000016 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 205 292.55 0 20 23 20 0 2035 0 2949 87.38 87.5922 87.84 0.00 0.0000084 0.01 0 0.507409 1.014820 0.476046 0.000000 0.357803 0.715607 -0.0065171 0.999998 0.0000004 -0.0042307 0.0000037 0.0561726 -0.0801789 -0.0000097 0.478157 87.5922 0.0000092 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 210 292.55 0 7 23 18 177 2745 2433 6196 82.39 82.5933 82.83 0.01 0.0000393 0.02 0 0.473935 0.947870 0.476046 0.670012 0.691367 0.712722 -0.0065171 0.999992 0.0000015 -0.0043642 0.0000158 0.0575421 -0.0801784 -0.0000402 0.478157 82.5933 0.0000438 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 215 292.55 0 6 100 100 182 3439 5133 4919 77.40 77.5945 77.83 0.01 0.0001671 0.02 0 0.444328 0.888655 0.476046 0.625396 0.645686 0.665975 -0.0065171 0.999969 0.0000056 -0.0045768 0.0000572 0.0589103 -0.0801765 -0.0001500 0.478157 77.5945 0.0001791 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 220 292.55 0 7 23 31 253 67134 6636 4014 72.39 72.5959 72.84 0.01 0.0006089 0.02 0 0.417191 0.834383 0.476046 0.581817 0.600771 0.619725 -0.0065171 0.999894 0.0000181 -0.0050092 0.0001931 0.0602742 -0.0801705 -0.0004800 0.478157 72.5960 0.0006497 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 225 292.55 0 26 20 8 72 21021 4297 5090 67.40 67.5984 67.84 0.02 0.0020239 0.03 0 0.388099 0.776197 0.476046 0.574524 0.586509 0.598494 -0.0065171 0.999669 0.0000522 -0.0060175 0.0005729 0.0616259 -0.0801525 -0.0013866 0.478157 67.5985 0.0021449 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 230 292.55 1 3 22 20 496 60686 2857 5356 62.41 62.6033 62.86 0.03 0.0058940 0.04 0 0.364226 0.728453 0.476046 0.552809 0.561179 0.569550 -0.0065171 0.999099 0.0001322 -0.0082422 0.0015687 0.0629492 -0.0801068 -0.0035119 0.478157 62.6036 0.0062251 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 235 292.55 0 10 50 50 63 4115 3681 10023 57.43 57.6142 57.81 0.04 0.0157637 0.05 0 0.324739 0.649478 0.476046 0.523494 0.530320 0.537145 -0.0065171 0.997754 0.0003047 -0.0129200 0.0038339 0.0642087 -0.0799990 -0.0080951 0.478157 57.6150 0.0165731 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 240 292.55 1 279 50 50 1170 39193 3012 9390 52.46 52.6379 52.83 0.06 0.0384204 0.07 0 0.301680 0.603359 0.476046 0.501376 0.505950 0.510524 -0.0065171 0.994905 0.0006370 -0.0218349 0.0066480 0.0653442 -0.0797706 -0.0169246 0.478157 52.6393 0.0398239 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 245 292.55 0 438 50 50 119 10747 2182 9653 47.48 47.6773 47.89 0.09 0.0768502 0.10 0 0.285231 0.570462 0.469306 0.478714 0.482291 0.485867 -0.0065171 0.990302 0.0011422 -0.0344797 0.0139756 0.0663346 -0.0794014 -0.0294951 0.475861 47.6865 0.0860111 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 247 292.55 0 50 50 50 5 1290 5483 1760 45.50 45.6971 45.89 0.10 0.0962040 0.11 0 0.273995 0.547991 0.463901 0.466226 0.469087 0.471948 -0.0065171 0.987998 0.0013914 -0.0401191 0.0140348 0.0666926 -0.0792168 -0.0351075 0.473638 45.7138 0.1129540 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 248 292.55 0 15 50 50 16 2129 3666 700 44.52 44.7063 44.92 0.11 0.1052830 0.12 0 0.273081 0.546162 0.459372 0.462066 0.464921 0.467776 -0.0065171 0.986881 0.0015163 -0.0425395 0.0175110 0.0668746 -0.0791272 -0.0375142 0.471728 44.7280 0.1269210 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 249 292.55 0 4 50 50 10 13505 4517 800 43.51 43.7161 43.93 0.12 0.1148780 0.13 0 0.269065 0.538129 0.455004 0.457923 0.460772 0.463621 -0.0065171 0.985691 0.0016497 -0.0450826 0.0175471 0.0670505 -0.0790318 -0.0400441 0.469783 43.7437 0.1424310 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 250 292.55 2 426 50 50 1799 26307 4278 700 42.53 42.7269 42.95 0.13 0.1254170 0.14 0 0.266140 0.532280 0.451191 0.453797 0.456145 0.458493 -0.0065171 0.984389 0.0017949 -0.0478911 0.0175834 0.0672173 -0.0789274 -0.0428401 0.467820 42.7618 0.1603170 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 251 292.55 0 7 50 50 271 15028 4687 2012 41.53 41.7392 41.92 0.14 0.1375140 0.15 0 0.256440 0.512881 0.447223 0.448362 0.450653 0.452945 -0.0065171 0.982918 0.0019560 -0.0509297 0.0218205 0.0673700 -0.0788094 -0.0458673 0.465778 41.7812 0.1795280 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 252 292.55 0 10 50 50 104 661 4033 1857 40.55 40.7531 40.93 0.15 0.1512220 0.16 0 0.252133 0.504267 0.443405 0.442846 0.445133 0.447419 -0.0065171 0.981266 0.0021342 -0.0542692 0.0218653 0.0675078 -0.0786770 -0.0491964 0.463703 40.8038 0.2019230 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 253 292.55 0 38 50 50 1312 3752 582 7464 39.57 39.7627 39.94 0.17 0.1606130 0.18 0 0.247748 0.495497 0.437633 0.441328 0.443198 0.445068 -0.0065171 0.979986 0.0022891 -0.0564869 0.0219092 0.0676766 -0.0785743 -0.0514013 0.460124 39.8223 0.2202700 292.55 2020-05-19 20:02:33
SPY 2020-05-19 2020-05-29 11 254 292.55 0 22 50 50 67 2797 2927 2528 38.58 38.7807 38.96 0.18 0.1784280 0.19 0 0.244667 0.489333 0.434100 0.434687 0.436553 0.438419 -0.0065171 0.977890 0.0025081 -0.0605062 0.0267923 0.0677777 -0.0784063 -0.0554131 0.457933 38.8483 0.2459960 292.55 2020-05-19 20:02:33

Cross-over and differences between purrr and dplyr

Here are some other interesting things that we can do with the nested lists via purrr, and their equivalent operation on the strikes_df dataframe using dplyr.

The intent is to gain some intuition for purrr using what you already know about dplyr.

Get vector of column names

strikes %>%
  map(names) %>%
  unique() %>%
  unlist()
##  [1] "ticker"           "tradeDate"        "expirDate"        "dte"              "strike"           "stockPrice"      
##  [7] "callVolume"       "callOpenInterest" "callBidSize"      "callAskSize"      "putVolume"        "putOpenInterest" 
## [13] "putBidSize"       "putAskSize"       "callBidPrice"     "callValue"        "callAskPrice"     "putBidPrice"     
## [19] "putValue"         "putAskPrice"      "callBidIv"        "callMidIv"        "callAskIv"        "smvVol"          
## [25] "putBidIv"         "putMidIv"         "putAskIv"         "residualRate"     "delta"            "gamma"           
## [31] "theta"            "vega"             "rho"              "phi"              "driftlessTheta"   "extSmvVol"       
## [37] "extCallValue"     "extPutValue"      "spotPrice"        "updatedAt"

This is equivalent to the following dplyr operation on the strikes_df dataframe:

strikes_df %>%
  names
##  [1] "ticker"           "tradeDate"        "expirDate"        "dte"              "strike"           "stockPrice"      
##  [7] "callVolume"       "callOpenInterest" "callBidSize"      "callAskSize"      "putVolume"        "putOpenInterest" 
## [13] "putBidSize"       "putAskSize"       "callBidPrice"     "callValue"        "callAskPrice"     "putBidPrice"     
## [19] "putValue"         "putAskPrice"      "callBidIv"        "callMidIv"        "callAskIv"        "smvVol"          
## [25] "putBidIv"         "putMidIv"         "putAskIv"         "residualRate"     "delta"            "gamma"           
## [31] "theta"            "vega"             "rho"              "phi"              "driftlessTheta"   "extSmvVol"       
## [37] "extCallValue"     "extPutValue"      "spotPrice"        "updatedAt"

You can see the connection: map(strikes, names) applies names to each sublist in strikes, returning a list of names for each sublist, which we then check for a single unique case and convert to a charcter vector via unlist.

In the dataframe version, we’ve already mapped each sublist to a dataframe row. We can get the column names of the dataframe by calling names directly on this object.

Check that all elements have the same ticker

strikes %>%
  map_chr("ticker") %>%  # this makes a character vector of list elements "ticker"
  unique()
## [1] "SPY"

Calling the purrr::map functions on a list with the name of a common sub-element returns the value associated with each sub-element. map returns a list; here we use map_chr to return a character vector.

This only works if the thing being returned from the sub-element is indeed a character.

This is equivalent to the following dplyr operation on the strikes_df dataframe:

strikes_df %>%
  distinct(ticker) %>%
  pull()
## [1] "SPY"

In the dplyr dataframe version, we’ve already mapped our tickers to their own column. So we simply call distinct on that column to get the unique values. A pull converts the resulting tibble to a vector.

Get the strike prices, expiries and call and put mid-prices

In this case, the purrr solution is somewhat convoluted:

callBids <- strikes %>%
  map_dbl("callBidPrice")

callAsks <- strikes %>%
  map_dbl("callAskPrice")

putBids <- strikes %>%
  map_dbl("putBidPrice")

putAsks <- strikes %>%
  map_dbl("putAskPrice")

data.frame(
  strike = strikes %>% map_dbl("strike"),
  expirDate = strikes %>% map_chr("expirDate"),
  callMid = map2_dbl(.x = callBids, .y = callAsks, ~{(.x + .y)/2}),
  putMid = map2_dbl(.x = putBids, .y = putAsks, ~{(.x + .y)/2})
) %>%
  pretty_print(10)
strike expirDate callMid putMid
140 2020-05-29 152.600 0.005
145 2020-05-29 147.600 0.005
150 2020-05-29 142.610 0.005
155 2020-05-29 137.585 0.005
160 2020-05-29 132.620 0.005
165 2020-05-29 127.620 0.005
170 2020-05-29 122.600 0.005
175 2020-05-29 117.585 0.005
180 2020-05-29 112.610 0.005
185 2020-05-29 107.610 0.005

Since our mapping function requires two inputs, we need to use the map2 functions, and must set up the inputs as a first step.

The dplyr equivalent on the dataframe object is much more succinct:

strikes_df %>%
  mutate(
    callMid = (callBidPrice + callAskPrice)/2,
    putMid = (putBidPrice + putAskPrice)/2
  ) %>%
  select(strike, expirDate, callMid, putMid) %>%
  pretty_print(10)
strike expirDate callMid putMid
140 2020-05-29 152.600 0.005
145 2020-05-29 147.600 0.005
150 2020-05-29 142.610 0.005
155 2020-05-29 137.585 0.005
160 2020-05-29 132.620 0.005
165 2020-05-29 127.620 0.005
170 2020-05-29 122.600 0.005
175 2020-05-29 117.585 0.005
180 2020-05-29 112.610 0.005
185 2020-05-29 107.610 0.005

Leverage the dataframe’s structure

We can also leverage the fact that a dataframe is represented as a list of columns to use purrr functions directly on dataframes. These recipes are quite useful for quickly getting to know a dataframe.

For instance, we can get the type of each column:

strikes_df %>%
  map_chr(class)
##           ticker        tradeDate        expirDate              dte           strike       stockPrice       callVolume 
##      "character"      "character"      "character"        "integer"        "numeric"        "numeric"        "integer" 
## callOpenInterest      callBidSize      callAskSize        putVolume  putOpenInterest       putBidSize       putAskSize 
##        "integer"        "integer"        "integer"        "integer"        "integer"        "integer"        "integer" 
##     callBidPrice        callValue     callAskPrice      putBidPrice         putValue      putAskPrice        callBidIv 
##        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric" 
##        callMidIv        callAskIv           smvVol         putBidIv         putMidIv         putAskIv     residualRate 
##        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric" 
##            delta            gamma            theta             vega              rho              phi   driftlessTheta 
##        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric" 
##        extSmvVol     extCallValue      extPutValue        spotPrice        updatedAt 
##        "numeric"        "numeric"        "numeric"        "numeric"      "character"

Which is equivalent to a dplyr::summarise_all, except that this returns a tibble rather than a vector:

strikes_df %>%
  summarise_all(~class(.x)) 
## # A tibble: 1 x 40
##   ticker tradeDate expirDate dte   strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume
##   <chr>  <chr>     <chr>     <chr> <chr>  <chr>      <chr>      <chr>            <chr>       <chr>       <chr>    
## 1 chara~ character character inte~ numer~ numeric    integer    integer          integer     integer     integer  
## # ... with 29 more variables: putOpenInterest <chr>, putBidSize <chr>, putAskSize <chr>, callBidPrice <chr>,
## #   callValue <chr>, callAskPrice <chr>, putBidPrice <chr>, putValue <chr>, putAskPrice <chr>, callBidIv <chr>,
## #   callMidIv <chr>, callAskIv <chr>, smvVol <chr>, putBidIv <chr>, putMidIv <chr>, putAskIv <chr>, residualRate <chr>,
## #   delta <chr>, gamma <chr>, theta <chr>, vega <chr>, rho <chr>, phi <chr>, driftlessTheta <chr>, extSmvVol <chr>,
## #   extCallValue <chr>, extPutValue <chr>, spotPrice <chr>, updatedAt <chr>

We can also get the number of distinct values in each column using purrr functions:

strikes_df %>%
  map_dbl(n_distinct)
##           ticker        tradeDate        expirDate              dte           strike       stockPrice       callVolume 
##                1                1               31               31              151                1              162 
## callOpenInterest      callBidSize      callAskSize        putVolume  putOpenInterest       putBidSize       putAskSize 
##             1097              144              165              498             1691             1054              926 
##     callBidPrice        callValue     callAskPrice      putBidPrice         putValue      putAskPrice        callBidIv 
##             2151             2228             2199             1281             2419             1311             1907 
##        callMidIv        callAskIv           smvVol         putBidIv         putMidIv         putAskIv     residualRate 
##             2423             2430             2044             2345             2425             2428             2039 
##            delta            gamma            theta             vega              rho              phi   driftlessTheta 
##             2137             2238             2192             2211             2188             2133             2196 
##        extSmvVol     extCallValue      extPutValue        spotPrice        updatedAt 
##             2016             2168             2413                1                1

Again, this is equivalent to a dplyr::summarise_all, different return objects aside:

strikes_df %>%
  summarise_all(~n_distinct(.x))
## # A tibble: 1 x 40
##   ticker tradeDate expirDate   dte strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume
##    <int>     <int>     <int> <int>  <int>      <int>      <int>            <int>       <int>       <int>     <int>
## 1      1         1        31    31    151          1        162             1097         144         165       498
## # ... with 29 more variables: putOpenInterest <int>, putBidSize <int>, putAskSize <int>, callBidPrice <int>,
## #   callValue <int>, callAskPrice <int>, putBidPrice <int>, putValue <int>, putAskPrice <int>, callBidIv <int>,
## #   callMidIv <int>, callAskIv <int>, smvVol <int>, putBidIv <int>, putMidIv <int>, putAskIv <int>, residualRate <int>,
## #   delta <int>, gamma <int>, theta <int>, vega <int>, rho <int>, phi <int>, driftlessTheta <int>, extSmvVol <int>,
## #   extCallValue <int>, extPutValue <int>, spotPrice <int>, updatedAt <int>

If we wanted to put both of these things together, there’s an elegant purrr solution:

strikes_df %>%
  map_df(
    ~data.frame(num_distinct = n_distinct(.x), type = class(.x)),
      .id = "variable"
  )
##            variable num_distinct      type
## 1            ticker            1 character
## 2         tradeDate            1 character
## 3         expirDate           31 character
## 4               dte           31   integer
## 5            strike          151   numeric
## 6        stockPrice            1   numeric
## 7        callVolume          162   integer
## 8  callOpenInterest         1097   integer
## 9       callBidSize          144   integer
## 10      callAskSize          165   integer
## 11        putVolume          498   integer
## 12  putOpenInterest         1691   integer
## 13       putBidSize         1054   integer
## 14       putAskSize          926   integer
## 15     callBidPrice         2151   numeric
## 16        callValue         2228   numeric
## 17     callAskPrice         2199   numeric
## 18      putBidPrice         1281   numeric
## 19         putValue         2419   numeric
## 20      putAskPrice         1311   numeric
## 21        callBidIv         1907   numeric
## 22        callMidIv         2423   numeric
## 23        callAskIv         2430   numeric
## 24           smvVol         2044   numeric
## 25         putBidIv         2345   numeric
## 26         putMidIv         2425   numeric
## 27         putAskIv         2428   numeric
## 28     residualRate         2039   numeric
## 29            delta         2137   numeric
## 30            gamma         2238   numeric
## 31            theta         2192   numeric
## 32             vega         2211   numeric
## 33              rho         2188   numeric
## 34              phi         2133   numeric
## 35   driftlessTheta         2196   numeric
## 36        extSmvVol         2016   numeric
## 37     extCallValue         2168   numeric
## 38      extPutValue         2413   numeric
## 39        spotPrice            1   numeric
## 40        updatedAt            1 character

But the best I can do with dplyr is somewhat less elegant:

strikes_df %>%
  summarise_all(
    list(~n_distinct(.x), ~class(.x))
  ) 
## # A tibble: 1 x 80
##   ticker_n_distin~ tradeDate_n_dis~ expirDate_n_dis~ dte_n_distinct strike_n_distin~ stockPrice_n_di~ callVolume_n_di~
##              <int>            <int>            <int>          <int>            <int>            <int>            <int>
## 1                1                1               31             31              151                1              162
## # ... with 73 more variables: callOpenInterest_n_distinct <int>, callBidSize_n_distinct <int>,
## #   callAskSize_n_distinct <int>, putVolume_n_distinct <int>, putOpenInterest_n_distinct <int>,
## #   putBidSize_n_distinct <int>, putAskSize_n_distinct <int>, callBidPrice_n_distinct <int>, callValue_n_distinct <int>,
## #   callAskPrice_n_distinct <int>, putBidPrice_n_distinct <int>, putValue_n_distinct <int>, putAskPrice_n_distinct <int>,
## #   callBidIv_n_distinct <int>, callMidIv_n_distinct <int>, callAskIv_n_distinct <int>, smvVol_n_distinct <int>,
## #   putBidIv_n_distinct <int>, putMidIv_n_distinct <int>, putAskIv_n_distinct <int>, residualRate_n_distinct <int>,
## #   delta_n_distinct <int>, gamma_n_distinct <int>, theta_n_distinct <int>, vega_n_distinct <int>, rho_n_distinct <int>,
## #   phi_n_distinct <int>, driftlessTheta_n_distinct <int>, extSmvVol_n_distinct <int>, extCallValue_n_distinct <int>,
## #   extPutValue_n_distinct <int>, spotPrice_n_distinct <int>, updatedAt_n_distinct <int>, ticker_class <chr>,
## #   tradeDate_class <chr>, expirDate_class <chr>, dte_class <chr>, strike_class <chr>, stockPrice_class <chr>,
## #   callVolume_class <chr>, callOpenInterest_class <chr>, callBidSize_class <chr>, callAskSize_class <chr>,
## #   putVolume_class <chr>, putOpenInterest_class <chr>, putBidSize_class <chr>, putAskSize_class <chr>,
## #   callBidPrice_class <chr>, callValue_class <chr>, callAskPrice_class <chr>, putBidPrice_class <chr>,
## #   putValue_class <chr>, putAskPrice_class <chr>, callBidIv_class <chr>, callMidIv_class <chr>, callAskIv_class <chr>,
## #   smvVol_class <chr>, putBidIv_class <chr>, putMidIv_class <chr>, putAskIv_class <chr>, residualRate_class <chr>,
## #   delta_class <chr>, gamma_class <chr>, theta_class <chr>, vega_class <chr>, rho_class <chr>, phi_class <chr>,
## #   driftlessTheta_class <chr>, extSmvVol_class <chr>, extCallValue_class <chr>, extPutValue_class <chr>,
## #   spotPrice_class <chr>, updatedAt_class <chr>

Intuitively, you’d reach for something like this:

try(
  strikes_df %>%
    summarise_all(
      ~data.frame(num_distinct = n_distinct(.x), type = class(.x))
    )
)
## Error : Column `ticker` must be length 1 (a summary value), not 2

But we get an error related to the fact summarise wants to return a single value for each variable being summarised, that is, a dataframe with a single row.

There are probably better dplyr solutions out there, but this illustrates an important point: the purrr::map functions are highly customisable, able to apply a function to individual elements in a collection, returning a data object of your choosing. dplyr::summarise really shines when you need to aggregate or reduce variables to a single value.

Conclusion

In this post we explored the purrr::map functions for wrangling a data set consisting of nested lists, as you might have if you were reading in JSON data to R.

We also explored the cross-over and differences in use-cases for purrr and dplyr functions.

Comment (1)

[…] How to Wrangle JSON Data in R with jsonlite, purr and dplyr […]

Leave a Comment