How to Wrangle JSON Data in R with jsonlite, purr and dplyr

Working with modern APIs you will often have to wrangle with data in JSON format.

This article presents some tools and recipes for working with JSON data with R in the tidyverse.

We’ll use purrr::map functions to extract and transform our JSON data. And we’ll provide intuitive examples of the cross-overs and differences between purrr and dplyr.

library(tidyverse)
library(here)
library(kableExtra)

pretty_print <- function(df, num_rows) {
  df %>%
  head(num_rows) %>%
    kable() %>%
    kable_styling(full_width = TRUE, position = 'center') %>%
    scroll_box(height = '300px')
}

Load JSON as nested named lists

This data has been converted from raw JSON to nested named lists using jsonlite::fromJSON with the simplify argument set to FALSE (that is, all elements are converted to named lists).

The data consists of market data for SPY options with various strikes and expiries. We got it from the options data vendor Orats, whose data API I enjoy almost as much as their orange website.

If you want to follow along, you can sign-up for a free trial of the API, and load the data directly from the Orats API with the following code (just define your API key in the ORATS_token variable):

library(httr)

ORATS_token <- 'YOUR_KEY_HERE'
res <- GET('https://api.orats.io/data/strikes?tickers=SPY', add_headers(Authorization = ORATS_token))

if (http_type(res) == 'application/json') {
  strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE)
} else {
  stop('No json returned')
}

if (http_error(res)) {
  stop(paste('API request error:',status_code(res), odata$message, odata$documentation_url))
}

Now, if you want to read this data directly into a nicely formatted dataframe, replace the line:

strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = FALSE)

with

strikes <- jsonlite::fromJSON(content(res, 'text'), simplifyVector = TRUE, flatten = TRUE)

However, you should know that it isn’t always possible to coerce JSON into nicely shaped dataframes this easily – often the raw JSON won’t contain primitive types, or will have nested key-value pairs on the same level as your desired dataframe columns, to name a couple of obstacles.

In that case, it’s useful to have some tools – like the ones in this post – for wrangling your source data.

So let’s look at that strikes object, and show how we can wrangle it into something useful…

Look inside JSON lists

str(strikes, max.level = 1)

## List of 1
##  $ data:List of 2440

This tells us we have a component named “data”. Let’s look at that a little more closely:

str(strikes$data, max.level = 1, list.len = 10)

## List of 2440
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##  $ :List of 40
##   [list output truncated]

This suggests we have homogenous lists of 40 elements each (an assumption we’ll check shortly).

Let’s look at one of those lists:

str(strikes$data[[1]])

## List of 40
##  $ ticker          : chr "SPY"
##  $ tradeDate       : chr "2020-05-19"
##  $ expirDate       : chr "2020-05-29"
##  $ dte             : int 11
##  $ strike          : int 140
##  $ stockPrice      : num 293
##  $ callVolume      : int 0
##  $ callOpenInterest: int 0
##  $ callBidSize     : int 20
##  $ callAskSize     : int 23
##  $ putVolume       : int 0
##  $ putOpenInterest : int 2312
##  $ putBidSize      : int 0
##  $ putAskSize      : int 7117
##  $ callBidPrice    : num 152
##  $ callValue       : num 153
##  $ callAskPrice    : num 153
##  $ putBidPrice     : int 0
##  $ putValue        : num 1.12e-25
##  $ putAskPrice     : num 0.01
##  $ callBidIv       : int 0
##  $ callMidIv       : num 0.98
##  $ callAskIv       : num 1.96
##  $ smvVol          : num 0.476
##  $ putBidIv        : int 0
##  $ putMidIv        : num 0.709
##  $ putAskIv        : num 1.42
##  $ residualRate    : num -0.00652
##  $ delta           : int 1
##  $ gamma           : num 9.45e-16
##  $ theta           : num -0.00288
##  $ vega            : num 2e-11
##  $ rho             : num 0.0384
##  $ phi             : num -0.0802
##  $ driftlessTheta  : num -6.07e-09
##  $ extSmvVol       : num 0.478
##  $ extCallValue    : num 153
##  $ extPutValue     : num 1.77e-25
##  $ spotPrice       : num 293
##  $ updatedAt       : chr "2020-05-19 20:02:33"

All these elements look like they can be easily handled. For instance, I don’t see any more deeply nested lists, weird missing values, or anything else that looks difficult.

So now I’ll pull out the interesting bit:

strikes <- strikes[["data"]]

How many observations do we have?

length(strikes)

## [1] 2440

Are all strike sublists identically named?

This is where we’ll check that our sublists are indeed homogeneously named, as we assumed above:

strikes %>%
  map(names) %>%  # this applies the base R function names to each sublist, and returns a list of lists with the output
  unique() %>%
  length() == 1

## [1] TRUE

Make a dataframe

We should also check the variable types are consistent as we need single types in each column of a dataframe (although R will warn if it is forced to coerce one type to another).

Here’s an interesting thing. It uses a nested purrr::map to get the variable types for each element of each sublist. They’re actually not identical according to this:

strikes %>%
  map(.f = ~{map_chr(.x, .f = class)}) %>%
  unique() %>%
  length()

## [1] 39

This is actually a little puzzling. Inspecting the individual objects suggests that we do have identical types. If anyone has anything to say about this, I’d love to hear about it in the comments. In any event, after we make our dataframe, we should check that the variable types are as expected.

Now, to that dataframe…

purrr::flatten removes one level of hierarchy from a list (unlist removes them all). Here, flatten is applied to each sub-list in strikes via purrr::map_df.

We use the variant flatten_df which returns each sublist as a dataframe, which makes it compatible with purrr::map_df,which requires a function that returns a dataframe.

strikes_df <- strikes %>%
  map_df(flatten_df)

strikes_df %>%
  pretty_print(30)

ticker	tradeDate	expirDate	dte	strike	stockPrice	callVolume	callOpenInterest	callBidSize	callAskSize	putVolume	putOpenInterest	putBidSize	putAskSize	callBidPrice	callValue	callAskPrice	putBidPrice	putValue	putAskPrice	callMidIv	callAskIv	smvVol	putBidIv	putMidIv	putAskIv	residualRate	delta	gamma	theta	vega	rho	phi	driftlessTheta	extSmvVol	extCallValue	extPutValue	spotPrice	updatedAt
SPY	2020-05-19	2020-05-29	11	140	292.55	0	0	20	23	0	2312	0	7117	152.37	152.5790	152.83	0.00	0.0000000	0.01	0.980149	1.960300	0.476046	0.000000	0.708976	1.417950	-0.0065171	1.000000	0.0000000	-0.0028827	0.0000000	0.0383618	-0.0801790	0.0000000	0.478157	152.5790	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	145	292.55	0	0	20	23	0	2322	0	5703	147.37	147.5800	147.83	0.00	0.0000000	0.01	0.936511	1.873020	0.476046	0.000000	0.676907	1.353810	-0.0065171	1.000000	0.0000000	-0.0029856	0.0000000	0.0397319	-0.0801790	0.0000000	0.478157	147.5800	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	150	292.55	0	1	20	18	0	1912	0	5703	142.39	142.5810	142.83	0.00	0.0000000	0.01	0.894396	1.788790	0.476046	0.000000	0.645945	1.291890	-0.0065171	1.000000	0.0000000	-0.0030886	0.0000000	0.0411020	-0.0801790	0.0000000	0.478157	142.5810	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	155	292.55	0	0	22	23	0	1483	0	5583	137.36	137.5820	137.81	0.00	0.0000000	0.01	0.844663	1.689330	0.476046	0.000000	0.616016	1.232030	-0.0065171	1.000000	0.0000000	-0.0031915	0.0000000	0.0424720	-0.0801790	0.0000000	0.478157	137.5820	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	160	292.55	0	0	20	22	0	929	0	7016	132.39	132.5830	132.85	0.00	0.0000000	0.01	0.823228	1.646460	0.476046	0.000000	0.587053	1.174110	-0.0065171	1.000000	0.0000000	-0.0032945	0.0000000	0.0438421	-0.0801790	0.0000000	0.478157	132.5830	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	165	292.55	0	0	20	22	0	1874	0	5943	127.39	127.5840	127.85	0.00	0.0000000	0.01	0.784980	1.569960	0.476046	0.000000	0.558997	1.117990	-0.0065171	1.000000	0.0000000	-0.0033974	0.0000000	0.0452122	-0.0801790	0.0000000	0.478157	127.5840	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	170	292.55	0	0	22	23	0	4055	0	7407	122.37	122.5850	122.83	0.00	0.0000000	0.01	0.739266	1.478530	0.476046	0.000000	0.531711	1.063420	-0.0065171	1.000000	0.0000000	-0.0035004	0.0000000	0.0465822	-0.0801790	0.0000000	0.478157	122.5850	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	175	292.55	0	0	22	23	0	2992	0	5103	117.36	117.5860	117.81	0.00	0.0000000	0.01	0.694933	1.389870	0.476046	0.000000	0.504757	1.009510	-0.0065171	1.000000	0.0000000	-0.0036034	0.0000000	0.0479523	-0.0801790	0.0000000	0.478157	117.5860	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	180	292.55	0	0	20	31	0	4320	0	9302	112.39	112.5870	112.83	0.00	0.0000000	0.01	0.668237	1.336470	0.476046	0.000000	0.478572	0.957144	-0.0065171	1.000000	0.0000000	-0.0037063	0.0000000	0.0493224	-0.0801790	0.0000000	0.478157	112.5870	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	185	292.55	0	0	20	31	1200	5863	0	4686	107.39	107.5880	107.83	0.00	0.0000000	0.01	0.633658	1.267320	0.476046	0.000000	0.453113	0.906225	-0.0065171	1.000000	0.0000000	-0.0038093	0.0000000	0.0506924	-0.0801790	0.0000000	0.478157	107.5880	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	190	292.55	0	2	8	31	5	8253	0	4199	102.39	102.5890	102.83	0.00	0.0000000	0.01	0.600019	1.200040	0.476046	0.000000	0.428340	0.856680	-0.0065171	1.000000	0.0000000	-0.0039123	0.0000000	0.0520625	-0.0801790	-0.0000001	0.478157	102.5890	0.0000000	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	195	292.55	0	10	20	31	130	5417	0	5965	97.39	97.5902	97.83	0.00	0.0000002	0.01	0.567271	1.134540	0.476046	0.000000	0.404219	0.808438	-0.0065171	1.000000	0.0000000	-0.0040155	0.0000002	0.0534326	-0.0801790	-0.0000003	0.478157	97.5902	0.0000003	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	200	292.55	0	72	8	31	2139	11544	0	3657	92.39	92.5912	92.83	0.00	0.0000015	0.01	0.535369	1.070740	0.476046	0.000000	0.380715	0.761431	-0.0065171	1.000000	0.0000001	-0.0041200	0.0000008	0.0548026	-0.0801790	-0.0000019	0.478157	92.5912	0.0000016	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	205	292.55	0	20	23	20	0	2035	0	2949	87.38	87.5922	87.84	0.00	0.0000084	0.01	0.507409	1.014820	0.476046	0.000000	0.357803	0.715607	-0.0065171	0.999998	0.0000004	-0.0042307	0.0000037	0.0561726	-0.0801789	-0.0000097	0.478157	87.5922	0.0000092	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	210	292.55	0	7	23	18	177	2745	2433	6196	82.39	82.5933	82.83	0.01	0.0000393	0.02	0.473935	0.947870	0.476046	0.670012	0.691367	0.712722	-0.0065171	0.999992	0.0000015	-0.0043642	0.0000158	0.0575421	-0.0801784	-0.0000402	0.478157	82.5933	0.0000438	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	215	292.55	0	6	100	100	182	3439	5133	4919	77.40	77.5945	77.83	0.01	0.0001671	0.02	0.444328	0.888655	0.476046	0.625396	0.645686	0.665975	-0.0065171	0.999969	0.0000056	-0.0045768	0.0000572	0.0589103	-0.0801765	-0.0001500	0.478157	77.5945	0.0001791	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	220	292.55	0	7	23	31	253	67134	6636	4014	72.39	72.5959	72.84	0.01	0.0006089	0.02	0.417191	0.834383	0.476046	0.581817	0.600771	0.619725	-0.0065171	0.999894	0.0000181	-0.0050092	0.0001931	0.0602742	-0.0801705	-0.0004800	0.478157	72.5960	0.0006497	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	225	292.55	0	26	20	8	72	21021	4297	5090	67.40	67.5984	67.84	0.02	0.0020239	0.03	0.388099	0.776197	0.476046	0.574524	0.586509	0.598494	-0.0065171	0.999669	0.0000522	-0.0060175	0.0005729	0.0616259	-0.0801525	-0.0013866	0.478157	67.5985	0.0021449	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	230	292.55	1	3	22	20	496	60686	2857	5356	62.41	62.6033	62.86	0.03	0.0058940	0.04	0.364226	0.728453	0.476046	0.552809	0.561179	0.569550	-0.0065171	0.999099	0.0001322	-0.0082422	0.0015687	0.0629492	-0.0801068	-0.0035119	0.478157	62.6036	0.0062251	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	235	292.55	0	10	50	50	63	4115	3681	10023	57.43	57.6142	57.81	0.04	0.0157637	0.05	0.324739	0.649478	0.476046	0.523494	0.530320	0.537145	-0.0065171	0.997754	0.0003047	-0.0129200	0.0038339	0.0642087	-0.0799990	-0.0080951	0.478157	57.6150	0.0165731	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	240	292.55	1	279	50	50	1170	39193	3012	9390	52.46	52.6379	52.83	0.06	0.0384204	0.07	0.301680	0.603359	0.476046	0.501376	0.505950	0.510524	-0.0065171	0.994905	0.0006370	-0.0218349	0.0066480	0.0653442	-0.0797706	-0.0169246	0.478157	52.6393	0.0398239	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	245	292.55	0	438	50	50	119	10747	2182	9653	47.48	47.6773	47.89	0.09	0.0768502	0.10	0.285231	0.570462	0.469306	0.478714	0.482291	0.485867	-0.0065171	0.990302	0.0011422	-0.0344797	0.0139756	0.0663346	-0.0794014	-0.0294951	0.475861	47.6865	0.0860111	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	247	292.55	0	50	50	50	5	1290	5483	1760	45.50	45.6971	45.89	0.10	0.0962040	0.11	0.273995	0.547991	0.463901	0.466226	0.469087	0.471948	-0.0065171	0.987998	0.0013914	-0.0401191	0.0140348	0.0666926	-0.0792168	-0.0351075	0.473638	45.7138	0.1129540	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	248	292.55	0	15	50	50	16	2129	3666	700	44.52	44.7063	44.92	0.11	0.1052830	0.12	0.273081	0.546162	0.459372	0.462066	0.464921	0.467776	-0.0065171	0.986881	0.0015163	-0.0425395	0.0175110	0.0668746	-0.0791272	-0.0375142	0.471728	44.7280	0.1269210	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	249	292.55	0	4	50	50	10	13505	4517	800	43.51	43.7161	43.93	0.12	0.1148780	0.13	0.269065	0.538129	0.455004	0.457923	0.460772	0.463621	-0.0065171	0.985691	0.0016497	-0.0450826	0.0175471	0.0670505	-0.0790318	-0.0400441	0.469783	43.7437	0.1424310	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	250	292.55	2	426	50	50	1799	26307	4278	700	42.53	42.7269	42.95	0.13	0.1254170	0.14	0.266140	0.532280	0.451191	0.453797	0.456145	0.458493	-0.0065171	0.984389	0.0017949	-0.0478911	0.0175834	0.0672173	-0.0789274	-0.0428401	0.467820	42.7618	0.1603170	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	251	292.55	0	7	50	50	271	15028	4687	2012	41.53	41.7392	41.92	0.14	0.1375140	0.15	0.256440	0.512881	0.447223	0.448362	0.450653	0.452945	-0.0065171	0.982918	0.0019560	-0.0509297	0.0218205	0.0673700	-0.0788094	-0.0458673	0.465778	41.7812	0.1795280	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	252	292.55	0	10	50	50	104	661	4033	1857	40.55	40.7531	40.93	0.15	0.1512220	0.16	0.252133	0.504267	0.443405	0.442846	0.445133	0.447419	-0.0065171	0.981266	0.0021342	-0.0542692	0.0218653	0.0675078	-0.0786770	-0.0491964	0.463703	40.8038	0.2019230	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	253	292.55	0	38	50	50	1312	3752	582	7464	39.57	39.7627	39.94	0.17	0.1606130	0.18	0.247748	0.495497	0.437633	0.441328	0.443198	0.445068	-0.0065171	0.979986	0.0022891	-0.0564869	0.0219092	0.0676766	-0.0785743	-0.0514013	0.460124	39.8223	0.2202700	292.55	2020-05-19 20:02:33
SPY	2020-05-19	2020-05-29	11	254	292.55	0	22	50	50	67	2797	2927	2528	38.58	38.7807	38.96	0.18	0.1784280	0.19	0.244667	0.489333	0.434100	0.434687	0.436553	0.438419	-0.0065171	0.977890	0.0025081	-0.0605062	0.0267923	0.0677777	-0.0784063	-0.0554131	0.457933	38.8483	0.2459960	292.55	2020-05-19 20:02:33

Cross-over and differences between `purrr` and `dplyr`

Here are some other interesting things that we can do with the nested lists via purrr, and their equivalent operation on the strikes_df dataframe using dplyr.

The intent is to gain some intuition for purrr using what you already know about dplyr.

Get vector of column names

strikes %>%
  map(names) %>%
  unique() %>%
  unlist()

##  [1] "ticker"           "tradeDate"        "expirDate"        "dte"              "strike"           "stockPrice"      
##  [7] "callVolume"       "callOpenInterest" "callBidSize"      "callAskSize"      "putVolume"        "putOpenInterest" 
## [13] "putBidSize"       "putAskSize"       "callBidPrice"     "callValue"        "callAskPrice"     "putBidPrice"     
## [19] "putValue"         "putAskPrice"      "callBidIv"        "callMidIv"        "callAskIv"        "smvVol"          
## [25] "putBidIv"         "putMidIv"         "putAskIv"         "residualRate"     "delta"            "gamma"           
## [31] "theta"            "vega"             "rho"              "phi"              "driftlessTheta"   "extSmvVol"       
## [37] "extCallValue"     "extPutValue"      "spotPrice"        "updatedAt"

This is equivalent to the following dplyr operation on the strikes_df dataframe:

strikes_df %>%
  names

##  [1] "ticker"           "tradeDate"        "expirDate"        "dte"              "strike"           "stockPrice"      
##  [7] "callVolume"       "callOpenInterest" "callBidSize"      "callAskSize"      "putVolume"        "putOpenInterest" 
## [13] "putBidSize"       "putAskSize"       "callBidPrice"     "callValue"        "callAskPrice"     "putBidPrice"     
## [19] "putValue"         "putAskPrice"      "callBidIv"        "callMidIv"        "callAskIv"        "smvVol"          
## [25] "putBidIv"         "putMidIv"         "putAskIv"         "residualRate"     "delta"            "gamma"           
## [31] "theta"            "vega"             "rho"              "phi"              "driftlessTheta"   "extSmvVol"       
## [37] "extCallValue"     "extPutValue"      "spotPrice"        "updatedAt"

You can see the connection: map(strikes, names) applies names to each sublist in strikes, returning a list of names for each sublist, which we then check for a single unique case and convert to a charcter vector via unlist.

In the dataframe version, we’ve already mapped each sublist to a dataframe row. We can get the column names of the dataframe by calling names directly on this object.

Check that all elements have the same ticker

strikes %>%
  map_chr("ticker") %>%  # this makes a character vector of list elements "ticker"
  unique()

## [1] "SPY"

Calling the purrr::map functions on a list with the name of a common sub-element returns the value associated with each sub-element. map returns a list; here we use map_chr to return a character vector.

This only works if the thing being returned from the sub-element is indeed a character.

This is equivalent to the following dplyr operation on the strikes_df dataframe:

strikes_df %>%
  distinct(ticker) %>%
  pull()

## [1] "SPY"

In the dplyr dataframe version, we’ve already mapped our tickers to their own column. So we simply call distinct on that column to get the unique values. A pull converts the resulting tibble to a vector.

Get the strike prices, expiries and call and put mid-prices

In this case, the purrr solution is somewhat convoluted:

callBids <- strikes %>%
  map_dbl("callBidPrice")

callAsks <- strikes %>%
  map_dbl("callAskPrice")

putBids <- strikes %>%
  map_dbl("putBidPrice")

putAsks <- strikes %>%
  map_dbl("putAskPrice")

data.frame(
  strike = strikes %>% map_dbl("strike"),
  expirDate = strikes %>% map_chr("expirDate"),
  callMid = map2_dbl(.x = callBids, .y = callAsks, ~{(.x + .y)/2}),
  putMid = map2_dbl(.x = putBids, .y = putAsks, ~{(.x + .y)/2})
) %>%
  pretty_print(10)

strike	expirDate	callMid	putMid
140	2020-05-29	152.600	0.005
145	2020-05-29	147.600	0.005
150	2020-05-29	142.610	0.005
155	2020-05-29	137.585	0.005
160	2020-05-29	132.620	0.005
165	2020-05-29	127.620	0.005
170	2020-05-29	122.600	0.005
175	2020-05-29	117.585	0.005
180	2020-05-29	112.610	0.005
185	2020-05-29	107.610	0.005

Since our mapping function requires two inputs, we need to use the map2 functions, and must set up the inputs as a first step.

The dplyr equivalent on the dataframe object is much more succinct:

strikes_df %>%
  mutate(
    callMid = (callBidPrice + callAskPrice)/2,
    putMid = (putBidPrice + putAskPrice)/2
  ) %>%
  select(strike, expirDate, callMid, putMid) %>%
  pretty_print(10)

strike	expirDate	callMid	putMid
140	2020-05-29	152.600	0.005
145	2020-05-29	147.600	0.005
150	2020-05-29	142.610	0.005
155	2020-05-29	137.585	0.005
160	2020-05-29	132.620	0.005
165	2020-05-29	127.620	0.005
170	2020-05-29	122.600	0.005
175	2020-05-29	117.585	0.005
180	2020-05-29	112.610	0.005
185	2020-05-29	107.610	0.005

Leverage the dataframe’s structure

We can also leverage the fact that a dataframe is represented as a list of columns to use purrr functions directly on dataframes. These recipes are quite useful for quickly getting to know a dataframe.

For instance, we can get the type of each column:

strikes_df %>%
  map_chr(class)

##           ticker        tradeDate        expirDate              dte           strike       stockPrice       callVolume 
##      "character"      "character"      "character"        "integer"        "numeric"        "numeric"        "integer" 
## callOpenInterest      callBidSize      callAskSize        putVolume  putOpenInterest       putBidSize       putAskSize 
##        "integer"        "integer"        "integer"        "integer"        "integer"        "integer"        "integer" 
##     callBidPrice        callValue     callAskPrice      putBidPrice         putValue      putAskPrice        callBidIv 
##        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric" 
##        callMidIv        callAskIv           smvVol         putBidIv         putMidIv         putAskIv     residualRate 
##        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric" 
##            delta            gamma            theta             vega              rho              phi   driftlessTheta 
##        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric"        "numeric" 
##        extSmvVol     extCallValue      extPutValue        spotPrice        updatedAt 
##        "numeric"        "numeric"        "numeric"        "numeric"      "character"

Which is equivalent to a dplyr::summarise_all, except that this returns a tibble rather than a vector:

strikes_df %>%
  summarise_all(~class(.x))

## # A tibble: 1 x 40
##   ticker tradeDate expirDate dte   strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume
##   <chr>  <chr>     <chr>     <chr> <chr>  <chr>      <chr>      <chr>            <chr>       <chr>       <chr>    
## 1 chara~ character character inte~ numer~ numeric    integer    integer          integer     integer     integer  
## # ... with 29 more variables: putOpenInterest <chr>, putBidSize <chr>, putAskSize <chr>, callBidPrice <chr>,
## #   callValue <chr>, callAskPrice <chr>, putBidPrice <chr>, putValue <chr>, putAskPrice <chr>, callBidIv <chr>,
## #   callMidIv <chr>, callAskIv <chr>, smvVol <chr>, putBidIv <chr>, putMidIv <chr>, putAskIv <chr>, residualRate <chr>,
## #   delta <chr>, gamma <chr>, theta <chr>, vega <chr>, rho <chr>, phi <chr>, driftlessTheta <chr>, extSmvVol <chr>,
## #   extCallValue <chr>, extPutValue <chr>, spotPrice <chr>, updatedAt <chr>

We can also get the number of distinct values in each column using purrr functions:

strikes_df %>%
  map_dbl(n_distinct)

##           ticker        tradeDate        expirDate              dte           strike       stockPrice       callVolume 
##                1                1               31               31              151                1              162 
## callOpenInterest      callBidSize      callAskSize        putVolume  putOpenInterest       putBidSize       putAskSize 
##             1097              144              165              498             1691             1054              926 
##     callBidPrice        callValue     callAskPrice      putBidPrice         putValue      putAskPrice        callBidIv 
##             2151             2228             2199             1281             2419             1311             1907 
##        callMidIv        callAskIv           smvVol         putBidIv         putMidIv         putAskIv     residualRate 
##             2423             2430             2044             2345             2425             2428             2039 
##            delta            gamma            theta             vega              rho              phi   driftlessTheta 
##             2137             2238             2192             2211             2188             2133             2196 
##        extSmvVol     extCallValue      extPutValue        spotPrice        updatedAt 
##             2016             2168             2413                1                1

Again, this is equivalent to a dplyr::summarise_all, different return objects aside:

strikes_df %>%
  summarise_all(~n_distinct(.x))

## # A tibble: 1 x 40
##   ticker tradeDate expirDate   dte strike stockPrice callVolume callOpenInterest callBidSize callAskSize putVolume
##    <int>     <int>     <int> <int>  <int>      <int>      <int>            <int>       <int>       <int>     <int>
## 1      1         1        31    31    151          1        162             1097         144         165       498
## # ... with 29 more variables: putOpenInterest <int>, putBidSize <int>, putAskSize <int>, callBidPrice <int>,
## #   callValue <int>, callAskPrice <int>, putBidPrice <int>, putValue <int>, putAskPrice <int>, callBidIv <int>,
## #   callMidIv <int>, callAskIv <int>, smvVol <int>, putBidIv <int>, putMidIv <int>, putAskIv <int>, residualRate <int>,
## #   delta <int>, gamma <int>, theta <int>, vega <int>, rho <int>, phi <int>, driftlessTheta <int>, extSmvVol <int>,
## #   extCallValue <int>, extPutValue <int>, spotPrice <int>, updatedAt <int>

If we wanted to put both of these things together, there’s an elegant purrr solution:

strikes_df %>%
  map_df(
    ~data.frame(num_distinct = n_distinct(.x), type = class(.x)),
      .id = "variable"
  )

##            variable num_distinct      type
## 1            ticker            1 character
## 2         tradeDate            1 character
## 3         expirDate           31 character
## 4               dte           31   integer
## 5            strike          151   numeric
## 6        stockPrice            1   numeric
## 7        callVolume          162   integer
## 8  callOpenInterest         1097   integer
## 9       callBidSize          144   integer
## 10      callAskSize          165   integer
## 11        putVolume          498   integer
## 12  putOpenInterest         1691   integer
## 13       putBidSize         1054   integer
## 14       putAskSize          926   integer
## 15     callBidPrice         2151   numeric
## 16        callValue         2228   numeric
## 17     callAskPrice         2199   numeric
## 18      putBidPrice         1281   numeric
## 19         putValue         2419   numeric
## 20      putAskPrice         1311   numeric
## 21        callBidIv         1907   numeric
## 22        callMidIv         2423   numeric
## 23        callAskIv         2430   numeric
## 24           smvVol         2044   numeric
## 25         putBidIv         2345   numeric
## 26         putMidIv         2425   numeric
## 27         putAskIv         2428   numeric
## 28     residualRate         2039   numeric
## 29            delta         2137   numeric
## 30            gamma         2238   numeric
## 31            theta         2192   numeric
## 32             vega         2211   numeric
## 33              rho         2188   numeric
## 34              phi         2133   numeric
## 35   driftlessTheta         2196   numeric
## 36        extSmvVol         2016   numeric
## 37     extCallValue         2168   numeric
## 38      extPutValue         2413   numeric
## 39        spotPrice            1   numeric
## 40        updatedAt            1 character

But the best I can do with dplyr is somewhat less elegant:

strikes_df %>%
  summarise_all(
    list(~n_distinct(.x), ~class(.x))
  )

## # A tibble: 1 x 80
##   ticker_n_distin~ tradeDate_n_dis~ expirDate_n_dis~ dte_n_distinct strike_n_distin~ stockPrice_n_di~ callVolume_n_di~
##              <int>            <int>            <int>          <int>            <int>            <int>            <int>
## 1                1                1               31             31              151                1              162
## # ... with 73 more variables: callOpenInterest_n_distinct <int>, callBidSize_n_distinct <int>,
## #   callAskSize_n_distinct <int>, putVolume_n_distinct <int>, putOpenInterest_n_distinct <int>,
## #   putBidSize_n_distinct <int>, putAskSize_n_distinct <int>, callBidPrice_n_distinct <int>, callValue_n_distinct <int>,
## #   callAskPrice_n_distinct <int>, putBidPrice_n_distinct <int>, putValue_n_distinct <int>, putAskPrice_n_distinct <int>,
## #   callBidIv_n_distinct <int>, callMidIv_n_distinct <int>, callAskIv_n_distinct <int>, smvVol_n_distinct <int>,
## #   putBidIv_n_distinct <int>, putMidIv_n_distinct <int>, putAskIv_n_distinct <int>, residualRate_n_distinct <int>,
## #   delta_n_distinct <int>, gamma_n_distinct <int>, theta_n_distinct <int>, vega_n_distinct <int>, rho_n_distinct <int>,
## #   phi_n_distinct <int>, driftlessTheta_n_distinct <int>, extSmvVol_n_distinct <int>, extCallValue_n_distinct <int>,
## #   extPutValue_n_distinct <int>, spotPrice_n_distinct <int>, updatedAt_n_distinct <int>, ticker_class <chr>,
## #   tradeDate_class <chr>, expirDate_class <chr>, dte_class <chr>, strike_class <chr>, stockPrice_class <chr>,
## #   callVolume_class <chr>, callOpenInterest_class <chr>, callBidSize_class <chr>, callAskSize_class <chr>,
## #   putVolume_class <chr>, putOpenInterest_class <chr>, putBidSize_class <chr>, putAskSize_class <chr>,
## #   callBidPrice_class <chr>, callValue_class <chr>, callAskPrice_class <chr>, putBidPrice_class <chr>,
## #   putValue_class <chr>, putAskPrice_class <chr>, callBidIv_class <chr>, callMidIv_class <chr>, callAskIv_class <chr>,
## #   smvVol_class <chr>, putBidIv_class <chr>, putMidIv_class <chr>, putAskIv_class <chr>, residualRate_class <chr>,
## #   delta_class <chr>, gamma_class <chr>, theta_class <chr>, vega_class <chr>, rho_class <chr>, phi_class <chr>,
## #   driftlessTheta_class <chr>, extSmvVol_class <chr>, extCallValue_class <chr>, extPutValue_class <chr>,
## #   spotPrice_class <chr>, updatedAt_class <chr>

Intuitively, you’d reach for something like this:

try(
  strikes_df %>%
    summarise_all(
      ~data.frame(num_distinct = n_distinct(.x), type = class(.x))
    )
)

## Error : Column `ticker` must be length 1 (a summary value), not 2

But we get an error related to the fact summarise wants to return a single value for each variable being summarised, that is, a dataframe with a single row.

There are probably better dplyr solutions out there, but this illustrates an important point: the purrr::map functions are highly customisable, able to apply a function to individual elements in a collection, returning a data object of your choosing. dplyr::summarise really shines when you need to aggregate or reduce variables to a single value.

Conclusion

In this post we explored the purrr::map functions for wrangling a data set consisting of nested lists, as you might have if you were reading in JSON data to R.

We also explored the cross-over and differences in use-cases for purrr and dplyr functions.

How to Wrangle JSON Data in R with jsonlite, purr and dplyr

Load JSON as nested named lists

Look inside JSON lists

How many observations do we have?

Are all strike sublists identically named?

Make a dataframe

Cross-over and differences between `purrr` and `dplyr`

Get vector of column names

Check that all elements have the same ticker

Get the strike prices, expiries and call and put mid-prices

Leverage the dataframe’s structure

Conclusion

1 thought on “How to Wrangle JSON Data in R with jsonlite, purr and dplyr”

Leave a Comment Cancel reply

© 2024 Robot Wealth Pty LtD

Privacy Policy | Terms Of Use | FAQ | [email protected]

Load JSON as nested named lists

Look inside JSON lists

How many observations do we have?

Are all strike sublists identically named?

Make a dataframe

Cross-over and differences between purrr and dplyr

Get vector of column names

Check that all elements have the same ticker

Get the strike prices, expiries and call and put mid-prices

Leverage the dataframe’s structure

Conclusion

1 thought on “How to Wrangle JSON Data in R with jsonlite, purr and dplyr”

Leave a Comment Cancel reply

© 2024 Robot Wealth Pty LtD

Privacy Policy | Terms Of Use | FAQ | [email protected]

Cross-over and differences between `purrr` and `dplyr`