More Intuitive Joins in dplyr 1.1.0 - how to do an asof join on trades and quotes data

dplyr 1.1.0 was a significant release that makes several common data operations more syntactically intuitive.

The most significant changes relate to joins and grouping/aggregating operations.

In this post we’ll look at the changes to joins.

First, install and load the latest version of dplyr:

install.packages("dplyr")
library(dplyr)

A new approach to joins

The best way to explore the new syntax is through examples.

Say we had a table of stock prices and a table of industry sectors, and we wanted to join them. The prices table has the field ticker, which is equivalent to the field symbol on the sectors table.

An example of this data might look like this:

prices <- data.frame(list(
  ticker = c("AAPL", "TSLA", "AAPL", "TSLA", "AAPL", "TSLA"), 
  date = structure(c(18716, 18716, 18717, 18717, 18718, 18718), class = "Date"), 
  open = c(120.11, 601.75, 121.65, 646.62, 123.66, 688.37), 
  high = c(120.40305, 637.66, 123.52, 672, 124.18, 692.42015), 
  low = c(118.86, 591.01, 121.15, 641.11, 122.49, 659.42), 
  close = c(119.9, 635.62, 122.15, 667.93, 123, 661.75), 
  unadjusted_close = c(119.9, 635.62, 122.15, 667.93, 123, 661.75), 
  volume = c(85671920, 39432360, 118323820, 33337288, 75089140, 35298376)
))

prices %>% 
  pretty_table_1()

ticker	date	open	high	low	close	unadjusted_close	volume
AAPL	2021-03-30	120.11	120.4030	118.86	119.90	119.90	85671920
TSLA	2021-03-30	601.75	637.6600	591.01	635.62	635.62	39432360
AAPL	2021-03-31	121.65	123.5200	121.15	122.15	122.15	118323820
TSLA	2021-03-31	646.62	672.0000	641.11	667.93	667.93	33337288
AAPL	2021-04-01	123.66	124.1800	122.49	123.00	123.00	75089140
TSLA	2021-04-01	688.37	692.4202	659.42	661.75	661.75	35298376

sectors <- data.frame(list(
  symbol = c("AAPL", "TSLA"), name = c("Apple Inc", "Tesla Inc"), 
  sector = c("Technology", "Consumer Cyclical"), 
  industry = c("Consumer Electronics", "Auto Manufacturers"), 
  sicindustry = c("Electronic Computers", "Motor Vehicles & Passenger Car Bodies")
))

sectors %>% 
  pretty_table_1()

symbol	name	sector	industry	sicindustry
AAPL	Apple Inc	Technology	Consumer Electronics	Electronic Computers
TSLA	Tesla Inc	Consumer Cyclical	Auto Manufacturers	Motor Vehicles & Passenger Car Bodies

The legacy approach to doing this join would look like this:

prices %>% 
  left_join(sectors, by = c(ticker = "symbol")) %>% 
  pretty_table_2()

ticker	date	open	high	low	close	unadjusted_close	volume	name	sector	industry	sicindustry
AAPL	2021-03-30	120.11	120.4030	118.86	119.90	119.90	85671920	Apple Inc	Technology	Consumer Electronics	Electronic Computers
TSLA	2021-03-30	601.75	637.6600	591.01	635.62	635.62	39432360	Tesla Inc	Consumer Cyclical	Auto Manufacturers	Motor Vehicles & Passenger Car Bodies
AAPL	2021-03-31	121.65	123.5200	121.15	122.15	122.15	118323820	Apple Inc	Technology	Consumer Electronics	Electronic Computers
TSLA	2021-03-31	646.62	672.0000	641.11	667.93	667.93	33337288	Tesla Inc	Consumer Cyclical	Auto Manufacturers	Motor Vehicles & Passenger Car Bodies
AAPL	2021-04-01	123.66	124.1800	122.49	123.00	123.00	75089140	Apple Inc	Technology	Consumer Electronics	Electronic Computers
TSLA	2021-04-01	688.37	692.4202	659.42	661.75	661.75	35298376	Tesla Inc	Consumer Cyclical	Auto Manufacturers	Motor Vehicles & Passenger Car Bodies

Have you ever felt like that by = c(ticker = "symbol") argument is a bit off? I know it’s only semantics, but ticker = "symbol" feels like an assignment operation rather than an equivalence operation. It’s unintuitive.

The new approach uses a helper function: join_by.

join_by takes an expression that more naturally conveys the intent behind the join:

join_by(ticker == symbol)

## Join By:
## - ticker == symbol

Here it is in action:

prices %>% 
  left_join(sectors, by = join_by(ticker == symbol)) %>% 
  pretty_table_2()

ticker	date	open	high	low	close	unadjusted_close	volume	name	sector	industry	sicindustry
AAPL	2021-03-30	120.11	120.4030	118.86	119.90	119.90	85671920	Apple Inc	Technology	Consumer Electronics	Electronic Computers
TSLA	2021-03-30	601.75	637.6600	591.01	635.62	635.62	39432360	Tesla Inc	Consumer Cyclical	Auto Manufacturers	Motor Vehicles & Passenger Car Bodies
AAPL	2021-03-31	121.65	123.5200	121.15	122.15	122.15	118323820	Apple Inc	Technology	Consumer Electronics	Electronic Computers
TSLA	2021-03-31	646.62	672.0000	641.11	667.93	667.93	33337288	Tesla Inc	Consumer Cyclical	Auto Manufacturers	Motor Vehicles & Passenger Car Bodies
AAPL	2021-04-01	123.66	124.1800	122.49	123.00	123.00	75089140	Apple Inc	Technology	Consumer Electronics	Electronic Computers
TSLA	2021-04-01	688.37	692.4202	659.42	661.75	661.75	35298376	Tesla Inc	Consumer Cyclical	Auto Manufacturers	Motor Vehicles & Passenger Car Bodies

That feels much more natural to me.

Inequality joins with `join_by`

join_by has some useful features that were hard to implement in the previous version of dplyr.

Say I have a bunch of trades and quotes data. I want to match each trade with the previous quote that’s closest in time.

Here are some trades and quotes data:

Trades:

options("digits.secs" = 6)

trades <- data.frame(list(
  time = structure(c(1464197400.023, 1464197400.038, 1464197400.048, 1464197400.048, 
1464197400.048), class = c("POSIXct", "POSIXt"), tzone = "America/New_York"), 
  ticker = c("MSFT", "MSFT", "GOOG", "GOOG", "AAPL"), 
  price = c(51.95, 51.95, 720.77, 720.92, 98), 
  quantity = c(75, 155, 100, 100, 100)
))

trades %>% 
  pretty_table_1()

time	ticker	price	quantity
2016-05-25 13:30:00.023	MSFT	51.95	75
2016-05-25 13:30:00.038	MSFT	51.95	155
2016-05-25 13:30:00.048	GOOG	720.77	100
2016-05-25 13:30:00.048	GOOG	720.92	100
2016-05-25 13:30:00.048	AAPL	98.00	100

Quotes:

quotes <- data.frame(list(
  time = structure(c(1464197400.023, 1464197400.023, 1464197400.03, 1464197400.041, 
1464197400.048, 1464197400.049, 1464197400.072, 1464197400.075), class = c("POSIXct", "POSIXt"), tzone = "America/New_York", class = c("POSIXct", "POSIXt"), tzone = "America/New_York"), 
  ticker = c("GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"), 
  bid = c(720.5, 51.95, 51.97, 51.99, 720.5, 97.99, 720.5, 52.01), 
  ask = c(720.93, 51.96, 51.98, 52, 720.93, 98.01, 720.88, 52.03)
))

quotes %>% 
  pretty_table_1()

time	ticker	bid	ask
2016-05-25 13:30:00.023	GOOG	720.50	720.93
2016-05-25 13:30:00.023	MSFT	51.95	51.96
2016-05-25 13:30:00.029	MSFT	51.97	51.98
2016-05-25 13:30:00.040	MSFT	51.99	52.00
2016-05-25 13:30:00.048	GOOG	720.50	720.93
2016-05-25 13:30:00.049	AAPL	97.99	98.01
2016-05-25 13:30:00.072	GOOG	720.50	720.88
2016-05-25 13:30:00.075	MSFT	52.01	52.03

We can find the most recent quote prior to each trade by passing an equality expression to join_by, which we wrap in the helper function closest:

join_by(ticker, closest(time >= time))

## Join By:
## - ticker
## - closest(time >= time)

trades %>% 
  left_join(quotes, by = join_by(ticker, closest(time >= time)), suffix = c(".trade", ".quote")) %>% 
  pretty_table_3()

time.trade	ticker	price	quantity	time.quote	bid	ask
2016-05-25 13:30:00.023	MSFT	51.95	75	2016-05-25 13:30:00.023	51.95	51.96
2016-05-25 13:30:00.038	MSFT	51.95	155	2016-05-25 13:30:00.029	51.97	51.98
2016-05-25 13:30:00.048	GOOG	720.77	100	2016-05-25 13:30:00.048	720.50	720.93
2016-05-25 13:30:00.048	GOOG	720.92	100	2016-05-25 13:30:00.048	720.50	720.93
2016-05-25 13:30:00.048	AAPL	98.00	100	NA	NA	NA

This is functionally equivalent to the SQL operation ASOF and the merge_asof function in pandas (in fact, I took the data for this example from the pandas docs).

Note in the above example, we got a trade in MSFT outside the quoted bid-ask, which can happen.

It might be reasonable to not match when the times are equal, in which case we can change the >= equality to > (here we have millisecond precision, and it’s entirely reasonable that a quote and a trade could occur in the same millisecond, but ignore that for the sake of the example):

trades %>% 
  left_join(quotes, by = join_by(ticker, closest(time > time)), suffix = c(".trade", ".quote")) %>% 
  pretty_table_3()

time.trade	ticker	price	quantity	time.quote	bid	ask
2016-05-25 13:30:00.023	MSFT	51.95	75	NA	NA	NA
2016-05-25 13:30:00.038	MSFT	51.95	155	2016-05-25 13:30:00.029	51.97	51.98
2016-05-25 13:30:00.048	GOOG	720.77	100	2016-05-25 13:30:00.023	720.50	720.93
2016-05-25 13:30:00.048	GOOG	720.92	100	2016-05-25 13:30:00.023	720.50	720.93
2016-05-25 13:30:00.048	AAPL	98.00	100	NA	NA	NA

We could also impose a 2ms gap between our quote and our matched trade, but to do this we need to make a new column, as join_by can’t take a computed column:

library(lubridate)

trades %>%
  left_join(
    # create new column before doing the join
    quotes %>% mutate(time_buffer = time + lubridate::seconds(2/1e3)), 
    by = join_by(ticker, closest(time >= time_buffer)), 
    suffix = c(".trade", ".quote")
  ) %>% 
  pretty_table_4()

time.trade	ticker	price	quantity	time.quote	bid	ask	time_buffer
2016-05-25 13:30:00.023	MSFT	51.95	75	NA	NA	NA	NA
2016-05-25 13:30:00.038	MSFT	51.95	155	2016-05-25 13:30:00.029	51.97	51.98	2016-05-25 13:30:00.032
2016-05-25 13:30:00.048	GOOG	720.77	100	2016-05-25 13:30:00.023	720.50	720.93	2016-05-25 13:30:00.025
2016-05-25 13:30:00.048	GOOG	720.92	100	2016-05-25 13:30:00.023	720.50	720.93	2016-05-25 13:30:00.025
2016-05-25 13:30:00.048	AAPL	98.00	100	NA	NA	NA	NA

Lastly, join_by can take helpers other than closest to create overlapping joins:

between
within
overlaps

For example, we can match trades to all quotes that occur within the prior 15ms:

trades %>%
  left_join(
    quotes %>% mutate(time_buffer = time + lubridate::seconds(15/1e3)), 
    by = join_by(ticker, between(x = time, y_lower = time, y_upper = time_buffer, bounds = "[]")), 
    suffix = c(".trade", ".quote")
  ) %>% 
  pretty_table_4()

time.trade	ticker	price	quantity	time.quote	bid	ask	time_buffer
2016-05-25 13:30:00.023	MSFT	51.95	75	2016-05-25 13:30:00.023	51.95	51.96	2016-05-25 13:30:00.038
2016-05-25 13:30:00.038	MSFT	51.95	155	2016-05-25 13:30:00.023	51.95	51.96	2016-05-25 13:30:00.038
2016-05-25 13:30:00.038	MSFT	51.95	155	2016-05-25 13:30:00.029	51.97	51.98	2016-05-25 13:30:00.045
2016-05-25 13:30:00.048	GOOG	720.77	100	2016-05-25 13:30:00.048	720.50	720.93	2016-05-25 13:30:00.063
2016-05-25 13:30:00.048	GOOG	720.92	100	2016-05-25 13:30:00.048	720.50	720.93	2016-05-25 13:30:00.063
2016-05-25 13:30:00.048	AAPL	98.00	100	NA	NA	NA	NA

Note how this method matches the second MSFT trade at 13:30:00.038 to two quotes at 13:30:00.023 and 13:30:00.029.

Will the old approaches still work?

Yes, for now. The dplyr docs state that both approaches to joining are supported, and there is no mention of deprecating the legacy approach. Good news for backwards compatibility.

However, I’m not even sure how you’d implement an ASOF or an overlapping join using that approach. The new approach opens up a whole new set of joining operations.

Other changes in dplyr 1.1.0

The other big change in dplyr 1.1.0 is the approach to grouping and aggregating data. I’ll take a look at that in another post.

Your thoughts?

What do you think? Do you prefer this new approach to joining data in dplyr? Does it feel more natural, or more aligned with what you might have done in SQL? Let me know in the comments.

More Intuitive Joins in dplyr 1.1.0 – how to do an asof join on trades and quotes data

A new approach to joins

Inequality joins with `join_by`

Will the old approaches still work?

Other changes in dplyr 1.1.0

Your thoughts?

1 thought on “More Intuitive Joins in dplyr 1.1.0 – how to do an asof join on trades and quotes data”

Leave a Comment Cancel reply

© 2024 Robot Wealth Pty LtD

Privacy Policy | Terms Of Use | FAQ | [email protected]

A new approach to joins

Inequality joins with join_by

Will the old approaches still work?

Other changes in dplyr 1.1.0

Your thoughts?

1 thought on “More Intuitive Joins in dplyr 1.1.0 – how to do an asof join on trades and quotes data”

Leave a Comment Cancel reply

© 2024 Robot Wealth Pty LtD

Privacy Policy | Terms Of Use | FAQ | [email protected]

Inequality joins with `join_by`