dplyr 1.1.0 was a significant release that makes several common data operations more syntactically intuitive.
The most significant changes relate to joins and grouping/aggregating operations.
In this post we’ll look at the changes to joins.
First, install and load the latest version of dplyr:
install.packages("dplyr")
library(dplyr)
A new approach to joins
The best way to explore the new syntax is through examples.
Say we had a table of stock prices and a table of industry sectors, and we wanted to join them. The prices table has the field ticker, which is equivalent to the field symbol on the sectors table.
An example of this data might look like this:
prices <- data.frame(list(
ticker = c("AAPL", "TSLA", "AAPL", "TSLA", "AAPL", "TSLA"),
date = structure(c(18716, 18716, 18717, 18717, 18718, 18718), class = "Date"),
open = c(120.11, 601.75, 121.65, 646.62, 123.66, 688.37),
high = c(120.40305, 637.66, 123.52, 672, 124.18, 692.42015),
low = c(118.86, 591.01, 121.15, 641.11, 122.49, 659.42),
close = c(119.9, 635.62, 122.15, 667.93, 123, 661.75),
unadjusted_close = c(119.9, 635.62, 122.15, 667.93, 123, 661.75),
volume = c(85671920, 39432360, 118323820, 33337288, 75089140, 35298376)
))
prices %>%
pretty_table_1()
| ticker | date | open | high | low | close | unadjusted_close | volume |
|---|---|---|---|---|---|---|---|
| AAPL | 2021-03-30 | 120.11 | 120.4030 | 118.86 | 119.90 | 119.90 | 85671920 |
| TSLA | 2021-03-30 | 601.75 | 637.6600 | 591.01 | 635.62 | 635.62 | 39432360 |
| AAPL | 2021-03-31 | 121.65 | 123.5200 | 121.15 | 122.15 | 122.15 | 118323820 |
| TSLA | 2021-03-31 | 646.62 | 672.0000 | 641.11 | 667.93 | 667.93 | 33337288 |
| AAPL | 2021-04-01 | 123.66 | 124.1800 | 122.49 | 123.00 | 123.00 | 75089140 |
| TSLA | 2021-04-01 | 688.37 | 692.4202 | 659.42 | 661.75 | 661.75 | 35298376 |
sectors <- data.frame(list(
symbol = c("AAPL", "TSLA"), name = c("Apple Inc", "Tesla Inc"),
sector = c("Technology", "Consumer Cyclical"),
industry = c("Consumer Electronics", "Auto Manufacturers"),
sicindustry = c("Electronic Computers", "Motor Vehicles & Passenger Car Bodies")
))
sectors %>%
pretty_table_1()
| symbol | name | sector | industry | sicindustry |
|---|---|---|---|---|
| AAPL | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
| TSLA | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
The legacy approach to doing this join would look like this:
prices %>%
left_join(sectors, by = c(ticker = "symbol")) %>%
pretty_table_2()
| ticker | date | open | high | low | close | unadjusted_close | volume | name | sector | industry | sicindustry |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAPL | 2021-03-30 | 120.11 | 120.4030 | 118.86 | 119.90 | 119.90 | 85671920 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
| TSLA | 2021-03-30 | 601.75 | 637.6600 | 591.01 | 635.62 | 635.62 | 39432360 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
| AAPL | 2021-03-31 | 121.65 | 123.5200 | 121.15 | 122.15 | 122.15 | 118323820 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
| TSLA | 2021-03-31 | 646.62 | 672.0000 | 641.11 | 667.93 | 667.93 | 33337288 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
| AAPL | 2021-04-01 | 123.66 | 124.1800 | 122.49 | 123.00 | 123.00 | 75089140 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
| TSLA | 2021-04-01 | 688.37 | 692.4202 | 659.42 | 661.75 | 661.75 | 35298376 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
Have you ever felt like that by = c(ticker = "symbol") argument is a bit off? I know it’s only semantics, but ticker = "symbol" feels like an assignment operation rather than an equivalence operation. It’s unintuitive.
The new approach uses a helper function: join_by.
join_by takes an expression that more naturally conveys the intent behind the join:
join_by(ticker == symbol)
## Join By: ## - ticker == symbol
Here it is in action:
prices %>%
left_join(sectors, by = join_by(ticker == symbol)) %>%
pretty_table_2()
| ticker | date | open | high | low | close | unadjusted_close | volume | name | sector | industry | sicindustry |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAPL | 2021-03-30 | 120.11 | 120.4030 | 118.86 | 119.90 | 119.90 | 85671920 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
| TSLA | 2021-03-30 | 601.75 | 637.6600 | 591.01 | 635.62 | 635.62 | 39432360 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
| AAPL | 2021-03-31 | 121.65 | 123.5200 | 121.15 | 122.15 | 122.15 | 118323820 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
| TSLA | 2021-03-31 | 646.62 | 672.0000 | 641.11 | 667.93 | 667.93 | 33337288 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
| AAPL | 2021-04-01 | 123.66 | 124.1800 | 122.49 | 123.00 | 123.00 | 75089140 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
| TSLA | 2021-04-01 | 688.37 | 692.4202 | 659.42 | 661.75 | 661.75 | 35298376 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
That feels much more natural to me.
Inequality joins with join_by
join_by has some useful features that were hard to implement in the previous version of dplyr.
Say I have a bunch of trades and quotes data. I want to match each trade with the previous quote that’s closest in time.
Here are some trades and quotes data:
Trades:
options("digits.secs" = 6)
trades <- data.frame(list(
time = structure(c(1464197400.023, 1464197400.038, 1464197400.048, 1464197400.048,
1464197400.048), class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
ticker = c("MSFT", "MSFT", "GOOG", "GOOG", "AAPL"),
price = c(51.95, 51.95, 720.77, 720.92, 98),
quantity = c(75, 155, 100, 100, 100)
))
trades %>%
pretty_table_1()
| time | ticker | price | quantity |
|---|---|---|---|
| 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 |
| 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 |
| 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 |
| 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 |
| 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 |
Quotes:
quotes <- data.frame(list(
time = structure(c(1464197400.023, 1464197400.023, 1464197400.03, 1464197400.041,
1464197400.048, 1464197400.049, 1464197400.072, 1464197400.075), class = c("POSIXct", "POSIXt"), tzone = "America/New_York", class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
ticker = c("GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"),
bid = c(720.5, 51.95, 51.97, 51.99, 720.5, 97.99, 720.5, 52.01),
ask = c(720.93, 51.96, 51.98, 52, 720.93, 98.01, 720.88, 52.03)
))
quotes %>%
pretty_table_1()
| time | ticker | bid | ask |
|---|---|---|---|
| 2016-05-25 13:30:00.023 | GOOG | 720.50 | 720.93 |
| 2016-05-25 13:30:00.023 | MSFT | 51.95 | 51.96 |
| 2016-05-25 13:30:00.029 | MSFT | 51.97 | 51.98 |
| 2016-05-25 13:30:00.040 | MSFT | 51.99 | 52.00 |
| 2016-05-25 13:30:00.048 | GOOG | 720.50 | 720.93 |
| 2016-05-25 13:30:00.049 | AAPL | 97.99 | 98.01 |
| 2016-05-25 13:30:00.072 | GOOG | 720.50 | 720.88 |
| 2016-05-25 13:30:00.075 | MSFT | 52.01 | 52.03 |
We can find the most recent quote prior to each trade by passing an equality expression to join_by, which we wrap in the helper function closest:
join_by(ticker, closest(time >= time))
## Join By: ## - ticker ## - closest(time >= time)
trades %>%
left_join(quotes, by = join_by(ticker, closest(time >= time)), suffix = c(".trade", ".quote")) %>%
pretty_table_3()
| time.trade | ticker | price | quantity | time.quote | bid | ask |
|---|---|---|---|---|---|---|
| 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 2016-05-25 13:30:00.023 | 51.95 | 51.96 |
| 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 |
| 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 |
| 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 |
| 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA |
This is functionally equivalent to the SQL operation ASOF and the merge_asof function in pandas (in fact, I took the data for this example from the pandas docs).
Note in the above example, we got a trade in MSFT outside the quoted bid-ask, which can happen.
It might be reasonable to not match when the times are equal, in which case we can change the >= equality to > (here we have millisecond precision, and it’s entirely reasonable that a quote and a trade could occur in the same millisecond, but ignore that for the sake of the example):
trades %>%
left_join(quotes, by = join_by(ticker, closest(time > time)), suffix = c(".trade", ".quote")) %>%
pretty_table_3()
| time.trade | ticker | price | quantity | time.quote | bid | ask |
|---|---|---|---|---|---|---|
| 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | NA | NA | NA |
| 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 |
| 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 |
| 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 |
| 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA |
We could also impose a 2ms gap between our quote and our matched trade, but to do this we need to make a new column, as join_by can’t take a computed column:
library(lubridate)
trades %>%
left_join(
# create new column before doing the join
quotes %>% mutate(time_buffer = time + lubridate::seconds(2/1e3)),
by = join_by(ticker, closest(time >= time_buffer)),
suffix = c(".trade", ".quote")
) %>%
pretty_table_4()
| time.trade | ticker | price | quantity | time.quote | bid | ask | time_buffer |
|---|---|---|---|---|---|---|---|
| 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | NA | NA | NA | NA |
| 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 | 2016-05-25 13:30:00.032 |
| 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 | 2016-05-25 13:30:00.025 |
| 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 | 2016-05-25 13:30:00.025 |
| 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA | NA |
Lastly, join_by can take helpers other than closest to create overlapping joins:
betweenwithinoverlaps
For example, we can match trades to all quotes that occur within the prior 15ms:
trades %>%
left_join(
quotes %>% mutate(time_buffer = time + lubridate::seconds(15/1e3)),
by = join_by(ticker, between(x = time, y_lower = time, y_upper = time_buffer, bounds = "[]")),
suffix = c(".trade", ".quote")
) %>%
pretty_table_4()
| time.trade | ticker | price | quantity | time.quote | bid | ask | time_buffer |
|---|---|---|---|---|---|---|---|
| 2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 2016-05-25 13:30:00.023 | 51.95 | 51.96 | 2016-05-25 13:30:00.038 |
| 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.023 | 51.95 | 51.96 | 2016-05-25 13:30:00.038 |
| 2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 | 2016-05-25 13:30:00.045 |
| 2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 | 2016-05-25 13:30:00.063 |
| 2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 | 2016-05-25 13:30:00.063 |
| 2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA | NA |
Note how this method matches the second MSFT trade at 13:30:00.038 to two quotes at 13:30:00.023 and 13:30:00.029.
Will the old approaches still work?
Yes, for now. The dplyr docs state that both approaches to joining are supported, and there is no mention of deprecating the legacy approach. Good news for backwards compatibility.
However, I’m not even sure how you’d implement an ASOF or an overlapping join using that approach. The new approach opens up a whole new set of joining operations.
Other changes in dplyr 1.1.0
The other big change in dplyr 1.1.0 is the approach to grouping and aggregating data. I’ll take a look at that in another post.
Your thoughts?
What do you think? Do you prefer this new approach to joining data in dplyr? Does it feel more natural, or more aligned with what you might have done in SQL? Let me know in the comments.

1 thought on “More Intuitive Joins in dplyr 1.1.0 – how to do an asof join on trades and quotes data”