Working with modern APIs you will often have to wrangle with data in JSON format.
This article presents some tools and recipes for working with JSON data with R in the tidyverse.
We’ll use purrr::map
functions to extract and transform our JSON data. And we’ll provide intuitive examples of the cross-overs and differences between purrr
and dplyr
.
library(tidyverse) library(here) library(kableExtra) pretty_print <- function(df, num_rows) { df %>% head(num_rows) %>% kable() %>% kable_styling(full_width = TRUE, position = 'center') %>% scroll_box(height = '300px') }
Load JSON as nested named lists
This data has been converted from raw JSON to nested named lists using jsonlite::fromJSON
with the simplify
argument set to FALSE
(that is, all elements are converted to named lists).
The data consists of market data for SPY options with various strikes and expiries. We got it from the options data vendor Orats, whose data API I enjoy almost as much as their orange website.
If you want to follow along, you can sign-up for a free trial of the API, and load the data directly from the Orats API with the following code (just define your API key in the ORATS_token
variable):
library(httr) ORATS_token <- 'YOUR_KEY_HERE' res <- GET('https://api.orats.io/data/strikes?tickers=SPY', add_headers(Authorization = ORATS_token)) if (http_type(res) == 'application/json') { strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE) } else { stop('No json returned') } if (http_error(res)) { stop(paste('API request error:',status_code(res), odata$message, odata$documentation_url)) }
Now, if you want to read this data directly into a nicely formatted dataframe, replace the line:
strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE)
with
strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = TRUE, flatten = TRUE)
However, you should know that it isn’t always possible to coerce JSON into nicely shaped dataframes this easily – often the raw JSON won’t contain primitive types, or will have nested key-value pairs on the same level as your desired dataframe columns, to name a couple of obstacles.
In that case, it’s useful to have some tools – like the ones in this post – for wrangling your source data.
So let’s look at that strikes
object, and show how we can wrangle it into something useful…
Look inside JSON lists
str(strikes, max.level = 1)
## List of 1 ## $ data:List of 2440
This tells us we have a component named “data”. Let’s look at that a little more closely:
str(strikes$data, max.level = 1, list.len = 10)
## List of 2440 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## $ :List of 40 ## [list output truncated]
This suggests we have homogenous lists of 40 elements each (an assumption we’ll check shortly).
Let’s look at one of those lists:
str(strikes$data[[1]])
## List of 40 ## $ ticker : chr "SPY" ## $ tradeDate : chr "2020-05-19" ## $ expirDate : chr "2020-05-29" ## $ dte : int 11 ## $ strike : int 140 ## $ stockPrice : num 293 ## $ callVolume : int 0 ## $ callOpenInterest: int 0 ## $ callBidSize : int 20 ## $ callAskSize : int 23 ## $ putVolume : int 0 ## $ putOpenInterest : int 2312 ## $ putBidSize : int 0 ## $ putAskSize : int 7117 ## $ callBidPrice : num 152 ## $ callValue : num 153 ## $ callAskPrice : num 153 ## $ putBidPrice : int 0 ## $ putValue : num 1.12e-25 ## $ putAskPrice : num 0.01 ## $ callBidIv : int 0 ## $ callMidIv : num 0.98 ## $ callAskIv : num 1.96 ## $ smvVol : num 0.476 ## $ putBidIv : int 0 ## $ putMidIv : num 0.709 ## $ putAskIv : num 1.42 ## $ residualRate : num -0.00652 ## $ delta : int 1 ## $ gamma : num 9.45e-16 ## $ theta : num -0.00288 ## $ vega : num 2e-11 ## $ rho : num 0.0384 ## $ phi : num -0.0802 ## $ driftlessTheta : num -6.07e-09 ## $ extSmvVol : num 0.478 ## $ extCallValue : num 153 ## $ extPutValue : num 1.77e-25 ## $ spotPrice : num 293 ## $ updatedAt : chr "2020-05-19 20:02:33"
All these elements look like they can be easily handled. For instance, I don’t see any more deeply nested lists, weird missing values, or anything else that looks difficult.
So now I’ll pull out the interesting bit:
strikes <- strikes[["data"]]
How many observations do we have?
length(strikes)
## [1] 2440
Are all strike sublists identically named?
This is where we’ll check that our sublists are indeed homogeneously named, as we assumed above:
strikes %>% map(names) %>% # this applies the base R function names to each sublist, and returns a list of lists with the output unique() %>% length() == 1
## [1] TRUE
Make a dataframe
We should also check the variable types are consistent as we need single types in each column of a dataframe (although R will warn if it is forced to coerce one type to another).
Here’s an interesting thing. It uses a nested purrr::map
to get the variable types for each element of each sublist. They’re actually not identical according to this:
strikes %>% map(.f = ~{map_chr(.x, .f = class)}) %>% unique() %>% length()
## [1] 39
This is actually a little puzzling. Inspecting the individual objects suggests that we do have identical types. If anyone has anything to say about this, I’d love to hear about it in the comments. In any event, after we make our dataframe, we should check that the variable types are as expected.
Now, to that dataframe…
purrr::flatten
removes one level of hierarchy from a list (unlist
removes them all). Here, flatten
is applied to each sub-list in strikes
via purrr::map_df
.
We use the variant flatten_df
which returns each sublist as a dataframe, which makes it compatible with purrr::map_df
,which requires a function that returns a dataframe.
strikes_df <- strikes %>% map_df(flatten_df) strikes_df %>% pretty_print(30)
ticker | tradeDate | expirDate | dte | strike | stockPrice | callVolume | callOpenInterest | callBidSize | callAskSize | putVolume | putOpenInterest | putBidSize | putAskSize | callBidPrice | callValue | callAskPrice | putBidPrice | putValue | putAskPrice | callBidIv | callMidIv | callAskIv | smvVol | putBidIv | putMidIv | putAskIv | residualRate | delta | gamma | theta | vega | rho | phi | driftlessTheta | extSmvVol | extCallValue | extPutValue | spotPrice | updatedAt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SPY | 2020-05-19 | 2020-05-29 | 11 | 140 | 292.55 | 0 | 0 | 20 | 23 | 0 | 2312 | 0 | 7117 | 152.37 | 152.5790 | 152.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.980149 | 1.960300 | 0.476046 | 0.000000 | 0.708976 | 1.417950 | -0.0065171 | 1.000000 | 0.0000000 | -0.0028827 | 0.0000000 | 0.0383618 | -0.0801790 | 0.0000000 | 0.478157 | 152.5790 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 145 | 292.55 | 0 | 0 | 20 | 23 | 0 | 2322 | 0 | 5703 | 147.37 | 147.5800 | 147.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.936511 | 1.873020 | 0.476046 | 0.000000 | 0.676907 | 1.353810 | -0.0065171 | 1.000000 | 0.0000000 | -0.0029856 | 0.0000000 | 0.0397319 | -0.0801790 | 0.0000000 | 0.478157 | 147.5800 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 150 | 292.55 | 0 | 1 | 20 | 18 | 0 | 1912 | 0 | 5703 | 142.39 | 142.5810 | 142.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.894396 | 1.788790 | 0.476046 | 0.000000 | 0.645945 | 1.291890 | -0.0065171 | 1.000000 | 0.0000000 | -0.0030886 | 0.0000000 | 0.0411020 | -0.0801790 | 0.0000000 | 0.478157 | 142.5810 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 155 | 292.55 | 0 | 0 | 22 | 23 | 0 | 1483 | 0 | 5583 | 137.36 | 137.5820 | 137.81 | 0.00 | 0.0000000 | 0.01 | 0 | 0.844663 | 1.689330 | 0.476046 | 0.000000 | 0.616016 | 1.232030 | -0.0065171 | 1.000000 | 0.0000000 | -0.0031915 | 0.0000000 | 0.0424720 | -0.0801790 | 0.0000000 | 0.478157 | 137.5820 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 160 | 292.55 | 0 | 0 | 20 | 22 | 0 | 929 | 0 | 7016 | 132.39 | 132.5830 | 132.85 | 0.00 | 0.0000000 | 0.01 | 0 | 0.823228 | 1.646460 | 0.476046 | 0.000000 | 0.587053 | 1.174110 | -0.0065171 | 1.000000 | 0.0000000 | -0.0032945 | 0.0000000 | 0.0438421 | -0.0801790 | 0.0000000 | 0.478157 | 132.5830 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 165 | 292.55 | 0 | 0 | 20 | 22 | 0 | 1874 | 0 | 5943 | 127.39 | 127.5840 | 127.85 | 0.00 | 0.0000000 | 0.01 | 0 | 0.784980 | 1.569960 | 0.476046 | 0.000000 | 0.558997 | 1.117990 | -0.0065171 | 1.000000 | 0.0000000 | -0.0033974 | 0.0000000 | 0.0452122 | -0.0801790 | 0.0000000 | 0.478157 | 127.5840 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 170 | 292.55 | 0 | 0 | 22 | 23 | 0 | 4055 | 0 | 7407 | 122.37 | 122.5850 | 122.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.739266 | 1.478530 | 0.476046 | 0.000000 | 0.531711 | 1.063420 | -0.0065171 | 1.000000 | 0.0000000 | -0.0035004 | 0.0000000 | 0.0465822 | -0.0801790 | 0.0000000 | 0.478157 | 122.5850 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 175 | 292.55 | 0 | 0 | 22 | 23 | 0 | 2992 | 0 | 5103 | 117.36 | 117.5860 | 117.81 | 0.00 | 0.0000000 | 0.01 | 0 | 0.694933 | 1.389870 | 0.476046 | 0.000000 | 0.504757 | 1.009510 | -0.0065171 | 1.000000 | 0.0000000 | -0.0036034 | 0.0000000 | 0.0479523 | -0.0801790 | 0.0000000 | 0.478157 | 117.5860 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 180 | 292.55 | 0 | 0 | 20 | 31 | 0 | 4320 | 0 | 9302 | 112.39 | 112.5870 | 112.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.668237 | 1.336470 | 0.476046 | 0.000000 | 0.478572 | 0.957144 | -0.0065171 | 1.000000 | 0.0000000 | -0.0037063 | 0.0000000 | 0.0493224 | -0.0801790 | 0.0000000 | 0.478157 | 112.5870 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 185 | 292.55 | 0 | 0 | 20 | 31 | 1200 | 5863 | 0 | 4686 | 107.39 | 107.5880 | 107.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.633658 | 1.267320 | 0.476046 | 0.000000 | 0.453113 | 0.906225 | -0.0065171 | 1.000000 | 0.0000000 | -0.0038093 | 0.0000000 | 0.0506924 | -0.0801790 | 0.0000000 | 0.478157 | 107.5880 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 190 | 292.55 | 0 | 2 | 8 | 31 | 5 | 8253 | 0 | 4199 | 102.39 | 102.5890 | 102.83 | 0.00 | 0.0000000 | 0.01 | 0 | 0.600019 | 1.200040 | 0.476046 | 0.000000 | 0.428340 | 0.856680 | -0.0065171 | 1.000000 | 0.0000000 | -0.0039123 | 0.0000000 | 0.0520625 | -0.0801790 | -0.0000001 | 0.478157 | 102.5890 | 0.0000000 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 195 | 292.55 | 0 | 10 | 20 | 31 | 130 | 5417 | 0 | 5965 | 97.39 | 97.5902 | 97.83 | 0.00 | 0.0000002 | 0.01 | 0 | 0.567271 | 1.134540 | 0.476046 | 0.000000 | 0.404219 | 0.808438 | -0.0065171 | 1.000000 | 0.0000000 | -0.0040155 | 0.0000002 | 0.0534326 | -0.0801790 | -0.0000003 | 0.478157 | 97.5902 | 0.0000003 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 200 | 292.55 | 0 | 72 | 8 | 31 | 2139 | 11544 | 0 | 3657 | 92.39 | 92.5912 | 92.83 | 0.00 | 0.0000015 | 0.01 | 0 | 0.535369 | 1.070740 | 0.476046 | 0.000000 | 0.380715 | 0.761431 | -0.0065171 | 1.000000 | 0.0000001 | -0.0041200 | 0.0000008 | 0.0548026 | -0.0801790 | -0.0000019 | 0.478157 | 92.5912 | 0.0000016 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 205 | 292.55 | 0 | 20 | 23 | 20 | 0 | 2035 | 0 | 2949 | 87.38 | 87.5922 | 87.84 | 0.00 | 0.0000084 | 0.01 | 0 | 0.507409 | 1.014820 | 0.476046 | 0.000000 | 0.357803 | 0.715607 | -0.0065171 | 0.999998 | 0.0000004 | -0.0042307 | 0.0000037 | 0.0561726 | -0.0801789 | -0.0000097 | 0.478157 | 87.5922 | 0.0000092 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 210 | 292.55 | 0 | 7 | 23 | 18 | 177 | 2745 | 2433 | 6196 | 82.39 | 82.5933 | 82.83 | 0.01 | 0.0000393 | 0.02 | 0 | 0.473935 | 0.947870 | 0.476046 | 0.670012 | 0.691367 | 0.712722 | -0.0065171 | 0.999992 | 0.0000015 | -0.0043642 | 0.0000158 | 0.0575421 | -0.0801784 | -0.0000402 | 0.478157 | 82.5933 | 0.0000438 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 215 | 292.55 | 0 | 6 | 100 | 100 | 182 | 3439 | 5133 | 4919 | 77.40 | 77.5945 | 77.83 | 0.01 | 0.0001671 | 0.02 | 0 | 0.444328 | 0.888655 | 0.476046 | 0.625396 | 0.645686 | 0.665975 | -0.0065171 | 0.999969 | 0.0000056 | -0.0045768 | 0.0000572 | 0.0589103 | -0.0801765 | -0.0001500 | 0.478157 | 77.5945 | 0.0001791 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 220 | 292.55 | 0 | 7 | 23 | 31 | 253 | 67134 | 6636 | 4014 | 72.39 | 72.5959 | 72.84 | 0.01 | 0.0006089 | 0.02 | 0 | 0.417191 | 0.834383 | 0.476046 | 0.581817 | 0.600771 | 0.619725 | -0.0065171 | 0.999894 | 0.0000181 | -0.0050092 | 0.0001931 | 0.0602742 | -0.0801705 | -0.0004800 | 0.478157 | 72.5960 | 0.0006497 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 225 | 292.55 | 0 | 26 | 20 | 8 | 72 | 21021 | 4297 | 5090 | 67.40 | 67.5984 | 67.84 | 0.02 | 0.0020239 | 0.03 | 0 | 0.388099 | 0.776197 | 0.476046 | 0.574524 | 0.586509 | 0.598494 | -0.0065171 | 0.999669 | 0.0000522 | -0.0060175 | 0.0005729 | 0.0616259 | -0.0801525 | -0.0013866 | 0.478157 | 67.5985 | 0.0021449 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 230 | 292.55 | 1 | 3 | 22 | 20 | 496 | 60686 | 2857 | 5356 | 62.41 | 62.6033 | 62.86 | 0.03 | 0.0058940 | 0.04 | 0 | 0.364226 | 0.728453 | 0.476046 | 0.552809 | 0.561179 | 0.569550 | -0.0065171 | 0.999099 | 0.0001322 | -0.0082422 | 0.0015687 | 0.0629492 | -0.0801068 | -0.0035119 | 0.478157 | 62.6036 | 0.0062251 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 235 | 292.55 | 0 | 10 | 50 | 50 | 63 | 4115 | 3681 | 10023 | 57.43 | 57.6142 | 57.81 | 0.04 | 0.0157637 | 0.05 | 0 | 0.324739 | 0.649478 | 0.476046 | 0.523494 | 0.530320 | 0.537145 | -0.0065171 | 0.997754 | 0.0003047 | -0.0129200 | 0.0038339 | 0.0642087 | -0.0799990 | -0.0080951 | 0.478157 | 57.6150 | 0.0165731 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 240 | 292.55 | 1 | 279 | 50 | 50 | 1170 | 39193 | 3012 | 9390 | 52.46 | 52.6379 | 52.83 | 0.06 | 0.0384204 | 0.07 | 0 | 0.301680 | 0.603359 | 0.476046 | 0.501376 | 0.505950 | 0.510524 | -0.0065171 | 0.994905 | 0.0006370 | -0.0218349 | 0.0066480 | 0.0653442 | -0.0797706 | -0.0169246 | 0.478157 | 52.6393 | 0.0398239 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 245 | 292.55 | 0 | 438 | 50 | 50 | 119 | 10747 | 2182 | 9653 | 47.48 | 47.6773 | 47.89 | 0.09 | 0.0768502 | 0.10 | 0 | 0.285231 | 0.570462 | 0.469306 | 0.478714 | 0.482291 | 0.485867 | -0.0065171 | 0.990302 | 0.0011422 | -0.0344797 | 0.0139756 | 0.0663346 | -0.0794014 | -0.0294951 | 0.475861 | 47.6865 | 0.0860111 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 247 | 292.55 | 0 | 50 | 50 | 50 | 5 | 1290 | 5483 | 1760 | 45.50 | 45.6971 | 45.89 | 0.10 | 0.0962040 | 0.11 | 0 | 0.273995 | 0.547991 | 0.463901 | 0.466226 | 0.469087 | 0.471948 | -0.0065171 | 0.987998 | 0.0013914 | -0.0401191 | 0.0140348 | 0.0666926 | -0.0792168 | -0.0351075 | 0.473638 | 45.7138 | 0.1129540 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 248 | 292.55 | 0 | 15 | 50 | 50 | 16 | 2129 | 3666 | 700 | 44.52 | 44.7063 | 44.92 | 0.11 | 0.1052830 | 0.12 | 0 | 0.273081 | 0.546162 | 0.459372 | 0.462066 | 0.464921 | 0.467776 | -0.0065171 | 0.986881 | 0.0015163 | -0.0425395 | 0.0175110 | 0.0668746 | -0.0791272 | -0.0375142 | 0.471728 | 44.7280 | 0.1269210 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 249 | 292.55 | 0 | 4 | 50 | 50 | 10 | 13505 | 4517 | 800 | 43.51 | 43.7161 | 43.93 | 0.12 | 0.1148780 | 0.13 | 0 | 0.269065 | 0.538129 | 0.455004 | 0.457923 | 0.460772 | 0.463621 | -0.0065171 | 0.985691 | 0.0016497 | -0.0450826 | 0.0175471 | 0.0670505 | -0.0790318 | -0.0400441 | 0.469783 | 43.7437 | 0.1424310 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 250 | 292.55 | 2 | 426 | 50 | 50 | 1799 | 26307 | 4278 | 700 | 42.53 | 42.7269 | 42.95 | 0.13 | 0.1254170 | 0.14 | 0 | 0.266140 | 0.532280 | 0.451191 | 0.453797 | 0.456145 | 0.458493 | -0.0065171 | 0.984389 | 0.0017949 | -0.0478911 | 0.0175834 | 0.0672173 | -0.0789274 | -0.0428401 | 0.467820 | 42.7618 | 0.1603170 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 251 | 292.55 | 0 | 7 | 50 | 50 | 271 | 15028 | 4687 | 2012 | 41.53 | 41.7392 | 41.92 | 0.14 | 0.1375140 | 0.15 | 0 | 0.256440 | 0.512881 | 0.447223 | 0.448362 | 0.450653 | 0.452945 | -0.0065171 | 0.982918 | 0.0019560 | -0.0509297 | 0.0218205 | 0.0673700 | -0.0788094 | -0.0458673 | 0.465778 | 41.7812 | 0.1795280 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 252 | 292.55 | 0 | 10 | 50 | 50 | 104 | 661 | 4033 | 1857 | 40.55 | 40.7531 | 40.93 | 0.15 | 0.1512220 | 0.16 | 0 | 0.252133 | 0.504267 | 0.443405 | 0.442846 | 0.445133 | 0.447419 | -0.0065171 | 0.981266 | 0.0021342 | -0.0542692 | 0.0218653 | 0.0675078 | -0.0786770 | -0.0491964 | 0.463703 | 40.8038 | 0.2019230 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 253 | 292.55 | 0 | 38 | 50 | 50 | 1312 | 3752 | 582 | 7464 | 39.57 | 39.7627 | 39.94 | 0.17 | 0.1606130 | 0.18 | 0 | 0.247748 | 0.495497 | 0.437633 | 0.441328 | 0.443198 | 0.445068 | -0.0065171 | 0.979986 | 0.0022891 | -0.0564869 | 0.0219092 | 0.0676766 | -0.0785743 | -0.0514013 | 0.460124 | 39.8223 | 0.2202700 | 292.55 | 2020-05-19 20:02:33 |
SPY | 2020-05-19 | 2020-05-29 | 11 | 254 | 292.55 | 0 | 22 | 50 | 50 | 67 | 2797 | 2927 | 2528 | 38.58 | 38.7807 | 38.96 | 0.18 | 0.1784280 | 0.19 | 0 | 0.244667 | 0.489333 | 0.434100 | 0.434687 | 0.436553 | 0.438419 | -0.0065171 | 0.977890 | 0.0025081 | -0.0605062 | 0.0267923 | 0.0677777 | -0.0784063 | -0.0554131 | 0.457933 | 38.8483 | 0.2459960 | 292.55 | 2020-05-19 20:02:33 |
Cross-over and differences between purrr
and dplyr
Here are some other interesting things that we can do with the nested lists via purrr
, and their equivalent operation on the strikes_df
dataframe using dplyr
.
The intent is to gain some intuition for purrr
using what you already know about dplyr
.
Get vector of column names
strikes %>% map(names) %>% unique() %>% unlist()
## [1] "ticker" "tradeDate" "expirDate" "dte" "strike" "stockPrice" ## [7] "callVolume" "callOpenInterest" "callBidSize" "callAskSize" "putVolume" "putOpenInterest" ## [13] "putBidSize" "putAskSize" "callBidPrice" "callValue" "callAskPrice" "putBidPrice" ## [19] "putValue" "putAskPrice" "callBidIv" "callMidIv" "callAskIv" "smvVol" ## [25] "putBidIv" "putMidIv" "putAskIv" "residualRate" "delta" "gamma" ## [31] "theta" "vega" "rho" "phi" "driftlessTheta" "extSmvVol" ## [37] "extCallValue" "extPutValue" "spotPrice" "updatedAt"
This is equivalent to the following dplyr
operation on the strikes_df
dataframe:
strikes_df %>% names
## [1] "ticker" "tradeDate" "expirDate" "dte" "strike" "stockPrice" ## [7] "callVolume" "callOpenInterest" "callBidSize" "callAskSize" "putVolume" "putOpenInterest" ## [13] "putBidSize" "putAskSize" "callBidPrice" "callValue" "callAskPrice" "putBidPrice" ## [19] "putValue" "putAskPrice" "callBidIv" "callMidIv" "callAskIv" "smvVol" ## [25] "putBidIv" "putMidIv" "putAskIv" "residualRate" "delta" "gamma" ## [31] "theta" "vega" "rho" "phi" "driftlessTheta" "extSmvVol" ## [37] "extCallValue" "extPutValue" "spotPrice" "updatedAt"
You can see the connection: map(strikes, names)
applies names
to each sublist in strikes
, returning a list of names for each sublist, which we then check for a single unique case and convert to a charcter vector via unlist
.
In the dataframe version, we’ve already mapped each sublist to a dataframe row. We can get the column names of the dataframe by calling names
directly on this object.
Check that all elements have the same ticker
strikes %>% map_chr("ticker") %>% # this makes a character vector of list elements "ticker" unique()
## [1] "SPY"
Calling the purrr::map
functions on a list with the name of a common sub-element returns the value associated with each sub-element. map
returns a list; here we use map_chr
to return a character vector.
This only works if the thing being returned from the sub-element is indeed a character.
This is equivalent to the following dplyr
operation on the strikes_df
dataframe:
strikes_df %>% distinct(ticker) %>% pull()
## [1] "SPY"
In the dplyr
dataframe version, we’ve already mapped our tickers to their own column. So we simply call distinct
on that column to get the unique values. A pull
converts the resulting tibble to a vector.
Get the strike prices, expiries and call and put mid-prices
In this case, the purrr
solution is somewhat convoluted:
callBids <- strikes %>% map_dbl("callBidPrice") callAsks <- strikes %>% map_dbl("callAskPrice") putBids <- strikes %>% map_dbl("putBidPrice") putAsks <- strikes %>% map_dbl("putAskPrice") data.frame( strike = strikes %>% map_dbl("strike"), expirDate = strikes %>% map_chr("expirDate"), callMid = map2_dbl(.x = callBids, .y = callAsks, ~{(.x + .y)/2}), putMid = map2_dbl(.x = putBids, .y = putAsks, ~{(.x + .y)/2}) ) %>% pretty_print(10)
strike | expirDate | callMid | putMid |
---|---|---|---|
140 | 2020-05-29 | 152.600 | 0.005 |
145 | 2020-05-29 | 147.600 | 0.005 |
150 | 2020-05-29 | 142.610 | 0.005 |
155 | 2020-05-29 | 137.585 | 0.005 |
160 | 2020-05-29 | 132.620 | 0.005 |
165 | 2020-05-29 | 127.620 | 0.005 |
170 | 2020-05-29 | 122.600 | 0.005 |
175 | 2020-05-29 | 117.585 | 0.005 |
180 | 2020-05-29 | 112.610 | 0.005 |
185 | 2020-05-29 | 107.610 | 0.005 |
Since our mapping function requires two inputs, we need to use the map2
functions, and must set up the inputs as a first step.
The dplyr
equivalent on the dataframe object is much more succinct:
strikes_df %>% mutate( callMid = (callBidPrice + callAskPrice)/2, putMid = (putBidPrice + putAskPrice)/2 ) %>% select(strike, expirDate, callMid, putMid) %>% pretty_print(10)
strike | expirDate | callMid | putMid |
---|---|---|---|
140 | 2020-05-29 | 152.600 | 0.005 |
145 | 2020-05-29 | 147.600 | 0.005 |
150 | 2020-05-29 | 142.610 | 0.005 |
155 | 2020-05-29 | 137.585 | 0.005 |
160 | 2020-05-29 | 132.620 | 0.005 |
165 | 2020-05-29 | 127.620 | 0.005 |
170 | 2020-05-29 | 122.600 | 0.005 |
175 | 2020-05-29 | 117.585 | 0.005 |
180 | 2020-05-29 | 112.610 | 0.005 |
185 | 2020-05-29 | 107.610 | 0.005 |
Leverage the dataframe’s structure
We can also leverage the fact that a dataframe is represented as a list of columns to use purrr
functions directly on dataframes. These recipes are quite useful for quickly getting to know a dataframe.
For instance, we can get the type of each column:
strikes_df %>% map_chr(class)
## ticker tradeDate expirDate dte strike stockPrice callVolume ## "character" "character" "character" "integer" "numeric" "numeric" "integer" ## callOpenInterest callBidSize callAskSize putVolume putOpenInterest putBidSize putAskSize ## "integer" "integer" "integer" "integer" "integer" "integer" "integer" ## callBidPrice callValue callAskPrice putBidPrice putValue putAskPrice callBidIv ## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" ## callMidIv callAskIv smvVol putBidIv putMidIv putAskIv residualRate ## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" ## delta gamma theta vega rho phi driftlessTheta ## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" ## extSmvVol extCallValue extPutValue spotPrice updatedAt ## "numeric" "numeric" "numeric" "numeric" "character"
Which is equivalent to a dplyr::summarise_all
, except that this returns a tibble rather than a vector:
strikes_df %>% summarise_all(~class(.x))
## # A tibble: 1 x 40 ## ticker tradeDate expirDate dte strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 chara~ character character inte~ numer~ numeric integer integer integer integer integer ## # ... with 29 more variables: putOpenInterest <chr>, putBidSize <chr>, putAskSize <chr>, callBidPrice <chr>, ## # callValue <chr>, callAskPrice <chr>, putBidPrice <chr>, putValue <chr>, putAskPrice <chr>, callBidIv <chr>, ## # callMidIv <chr>, callAskIv <chr>, smvVol <chr>, putBidIv <chr>, putMidIv <chr>, putAskIv <chr>, residualRate <chr>, ## # delta <chr>, gamma <chr>, theta <chr>, vega <chr>, rho <chr>, phi <chr>, driftlessTheta <chr>, extSmvVol <chr>, ## # extCallValue <chr>, extPutValue <chr>, spotPrice <chr>, updatedAt <chr>
We can also get the number of distinct values in each column using purrr
functions:
strikes_df %>% map_dbl(n_distinct)
## ticker tradeDate expirDate dte strike stockPrice callVolume ## 1 1 31 31 151 1 162 ## callOpenInterest callBidSize callAskSize putVolume putOpenInterest putBidSize putAskSize ## 1097 144 165 498 1691 1054 926 ## callBidPrice callValue callAskPrice putBidPrice putValue putAskPrice callBidIv ## 2151 2228 2199 1281 2419 1311 1907 ## callMidIv callAskIv smvVol putBidIv putMidIv putAskIv residualRate ## 2423 2430 2044 2345 2425 2428 2039 ## delta gamma theta vega rho phi driftlessTheta ## 2137 2238 2192 2211 2188 2133 2196 ## extSmvVol extCallValue extPutValue spotPrice updatedAt ## 2016 2168 2413 1 1
Again, this is equivalent to a dplyr::summarise_all
, different return objects aside:
strikes_df %>% summarise_all(~n_distinct(.x))
## # A tibble: 1 x 40 ## ticker tradeDate expirDate dte strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume ## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> ## 1 1 1 31 31 151 1 162 1097 144 165 498 ## # ... with 29 more variables: putOpenInterest <int>, putBidSize <int>, putAskSize <int>, callBidPrice <int>, ## # callValue <int>, callAskPrice <int>, putBidPrice <int>, putValue <int>, putAskPrice <int>, callBidIv <int>, ## # callMidIv <int>, callAskIv <int>, smvVol <int>, putBidIv <int>, putMidIv <int>, putAskIv <int>, residualRate <int>, ## # delta <int>, gamma <int>, theta <int>, vega <int>, rho <int>, phi <int>, driftlessTheta <int>, extSmvVol <int>, ## # extCallValue <int>, extPutValue <int>, spotPrice <int>, updatedAt <int>
If we wanted to put both of these things together, there’s an elegant purrr
solution:
strikes_df %>% map_df( ~data.frame(num_distinct = n_distinct(.x), type = class(.x)), .id = "variable" )
## variable num_distinct type ## 1 ticker 1 character ## 2 tradeDate 1 character ## 3 expirDate 31 character ## 4 dte 31 integer ## 5 strike 151 numeric ## 6 stockPrice 1 numeric ## 7 callVolume 162 integer ## 8 callOpenInterest 1097 integer ## 9 callBidSize 144 integer ## 10 callAskSize 165 integer ## 11 putVolume 498 integer ## 12 putOpenInterest 1691 integer ## 13 putBidSize 1054 integer ## 14 putAskSize 926 integer ## 15 callBidPrice 2151 numeric ## 16 callValue 2228 numeric ## 17 callAskPrice 2199 numeric ## 18 putBidPrice 1281 numeric ## 19 putValue 2419 numeric ## 20 putAskPrice 1311 numeric ## 21 callBidIv 1907 numeric ## 22 callMidIv 2423 numeric ## 23 callAskIv 2430 numeric ## 24 smvVol 2044 numeric ## 25 putBidIv 2345 numeric ## 26 putMidIv 2425 numeric ## 27 putAskIv 2428 numeric ## 28 residualRate 2039 numeric ## 29 delta 2137 numeric ## 30 gamma 2238 numeric ## 31 theta 2192 numeric ## 32 vega 2211 numeric ## 33 rho 2188 numeric ## 34 phi 2133 numeric ## 35 driftlessTheta 2196 numeric ## 36 extSmvVol 2016 numeric ## 37 extCallValue 2168 numeric ## 38 extPutValue 2413 numeric ## 39 spotPrice 1 numeric ## 40 updatedAt 1 character
But the best I can do with dplyr
is somewhat less elegant:
strikes_df %>% summarise_all( list(~n_distinct(.x), ~class(.x)) )
## # A tibble: 1 x 80 ## ticker_n_distin~ tradeDate_n_dis~ expirDate_n_dis~ dte_n_distinct strike_n_distin~ stockPrice_n_di~ callVolume_n_di~ ## <int> <int> <int> <int> <int> <int> <int> ## 1 1 1 31 31 151 1 162 ## # ... with 73 more variables: callOpenInterest_n_distinct <int>, callBidSize_n_distinct <int>, ## # callAskSize_n_distinct <int>, putVolume_n_distinct <int>, putOpenInterest_n_distinct <int>, ## # putBidSize_n_distinct <int>, putAskSize_n_distinct <int>, callBidPrice_n_distinct <int>, callValue_n_distinct <int>, ## # callAskPrice_n_distinct <int>, putBidPrice_n_distinct <int>, putValue_n_distinct <int>, putAskPrice_n_distinct <int>, ## # callBidIv_n_distinct <int>, callMidIv_n_distinct <int>, callAskIv_n_distinct <int>, smvVol_n_distinct <int>, ## # putBidIv_n_distinct <int>, putMidIv_n_distinct <int>, putAskIv_n_distinct <int>, residualRate_n_distinct <int>, ## # delta_n_distinct <int>, gamma_n_distinct <int>, theta_n_distinct <int>, vega_n_distinct <int>, rho_n_distinct <int>, ## # phi_n_distinct <int>, driftlessTheta_n_distinct <int>, extSmvVol_n_distinct <int>, extCallValue_n_distinct <int>, ## # extPutValue_n_distinct <int>, spotPrice_n_distinct <int>, updatedAt_n_distinct <int>, ticker_class <chr>, ## # tradeDate_class <chr>, expirDate_class <chr>, dte_class <chr>, strike_class <chr>, stockPrice_class <chr>, ## # callVolume_class <chr>, callOpenInterest_class <chr>, callBidSize_class <chr>, callAskSize_class <chr>, ## # putVolume_class <chr>, putOpenInterest_class <chr>, putBidSize_class <chr>, putAskSize_class <chr>, ## # callBidPrice_class <chr>, callValue_class <chr>, callAskPrice_class <chr>, putBidPrice_class <chr>, ## # putValue_class <chr>, putAskPrice_class <chr>, callBidIv_class <chr>, callMidIv_class <chr>, callAskIv_class <chr>, ## # smvVol_class <chr>, putBidIv_class <chr>, putMidIv_class <chr>, putAskIv_class <chr>, residualRate_class <chr>, ## # delta_class <chr>, gamma_class <chr>, theta_class <chr>, vega_class <chr>, rho_class <chr>, phi_class <chr>, ## # driftlessTheta_class <chr>, extSmvVol_class <chr>, extCallValue_class <chr>, extPutValue_class <chr>, ## # spotPrice_class <chr>, updatedAt_class <chr>
Intuitively, you’d reach for something like this:
try( strikes_df %>% summarise_all( ~data.frame(num_distinct = n_distinct(.x), type = class(.x)) ) )
## Error : Column `ticker` must be length 1 (a summary value), not 2
But we get an error related to the fact summarise
wants to return a single value for each variable being summarised, that is, a dataframe with a single row.
There are probably better dplyr
solutions out there, but this illustrates an important point: the purrr::map
functions are highly customisable, able to apply a function to individual elements in a collection, returning a data object of your choosing. dplyr::summarise
really shines when you need to aggregate or reduce variables to a single value.
Conclusion
In this post we explored the purrr::map
functions for wrangling a data set consisting of nested lists, as you might have if you were reading in JSON data to R.
We also explored the cross-over and differences in use-cases for purrr
and dplyr
functions.
1 thought on “How to Wrangle JSON Data in R with jsonlite, purr and dplyr”