dplyr 1.1.0
was a significant release that makes several common data operations more syntactically intuitive.
The most significant changes relate to joins and grouping/aggregating operations.
In this post we’ll look at the changes to joins.
First, install and load the latest version of dplyr
:
install.packages("dplyr")
library(dplyr)
A new approach to joins
The best way to explore the new syntax is through examples.
Say we had a table of stock prices and a table of industry sectors, and we wanted to join them. The prices table has the field ticker
, which is equivalent to the field symbol
on the sectors table.
An example of this data might look like this:
prices <- data.frame(list(
ticker = c("AAPL", "TSLA", "AAPL", "TSLA", "AAPL", "TSLA"),
date = structure(c(18716, 18716, 18717, 18717, 18718, 18718), class = "Date"),
open = c(120.11, 601.75, 121.65, 646.62, 123.66, 688.37),
high = c(120.40305, 637.66, 123.52, 672, 124.18, 692.42015),
low = c(118.86, 591.01, 121.15, 641.11, 122.49, 659.42),
close = c(119.9, 635.62, 122.15, 667.93, 123, 661.75),
unadjusted_close = c(119.9, 635.62, 122.15, 667.93, 123, 661.75),
volume = c(85671920, 39432360, 118323820, 33337288, 75089140, 35298376)
))
prices %>%
pretty_table_1()
ticker | date | open | high | low | close | unadjusted_close | volume |
---|---|---|---|---|---|---|---|
AAPL | 2021-03-30 | 120.11 | 120.4030 | 118.86 | 119.90 | 119.90 | 85671920 |
TSLA | 2021-03-30 | 601.75 | 637.6600 | 591.01 | 635.62 | 635.62 | 39432360 |
AAPL | 2021-03-31 | 121.65 | 123.5200 | 121.15 | 122.15 | 122.15 | 118323820 |
TSLA | 2021-03-31 | 646.62 | 672.0000 | 641.11 | 667.93 | 667.93 | 33337288 |
AAPL | 2021-04-01 | 123.66 | 124.1800 | 122.49 | 123.00 | 123.00 | 75089140 |
TSLA | 2021-04-01 | 688.37 | 692.4202 | 659.42 | 661.75 | 661.75 | 35298376 |
sectors <- data.frame(list(
symbol = c("AAPL", "TSLA"), name = c("Apple Inc", "Tesla Inc"),
sector = c("Technology", "Consumer Cyclical"),
industry = c("Consumer Electronics", "Auto Manufacturers"),
sicindustry = c("Electronic Computers", "Motor Vehicles & Passenger Car Bodies")
))
sectors %>%
pretty_table_1()
symbol | name | sector | industry | sicindustry |
---|---|---|---|---|
AAPL | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
TSLA | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
The legacy approach to doing this join would look like this:
prices %>%
left_join(sectors, by = c(ticker = "symbol")) %>%
pretty_table_2()
ticker | date | open | high | low | close | unadjusted_close | volume | name | sector | industry | sicindustry |
---|---|---|---|---|---|---|---|---|---|---|---|
AAPL | 2021-03-30 | 120.11 | 120.4030 | 118.86 | 119.90 | 119.90 | 85671920 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
TSLA | 2021-03-30 | 601.75 | 637.6600 | 591.01 | 635.62 | 635.62 | 39432360 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
AAPL | 2021-03-31 | 121.65 | 123.5200 | 121.15 | 122.15 | 122.15 | 118323820 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
TSLA | 2021-03-31 | 646.62 | 672.0000 | 641.11 | 667.93 | 667.93 | 33337288 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
AAPL | 2021-04-01 | 123.66 | 124.1800 | 122.49 | 123.00 | 123.00 | 75089140 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
TSLA | 2021-04-01 | 688.37 | 692.4202 | 659.42 | 661.75 | 661.75 | 35298376 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
Have you ever felt like that by = c(ticker = "symbol")
argument is a bit off? I know it’s only semantics, but ticker = "symbol"
feels like an assignment operation rather than an equivalence operation. It’s unintuitive.
The new approach uses a helper function: join_by
.
join_by
takes an expression that more naturally conveys the intent behind the join:
join_by(ticker == symbol)
## Join By: ## - ticker == symbol
Here it is in action:
prices %>%
left_join(sectors, by = join_by(ticker == symbol)) %>%
pretty_table_2()
ticker | date | open | high | low | close | unadjusted_close | volume | name | sector | industry | sicindustry |
---|---|---|---|---|---|---|---|---|---|---|---|
AAPL | 2021-03-30 | 120.11 | 120.4030 | 118.86 | 119.90 | 119.90 | 85671920 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
TSLA | 2021-03-30 | 601.75 | 637.6600 | 591.01 | 635.62 | 635.62 | 39432360 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
AAPL | 2021-03-31 | 121.65 | 123.5200 | 121.15 | 122.15 | 122.15 | 118323820 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
TSLA | 2021-03-31 | 646.62 | 672.0000 | 641.11 | 667.93 | 667.93 | 33337288 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
AAPL | 2021-04-01 | 123.66 | 124.1800 | 122.49 | 123.00 | 123.00 | 75089140 | Apple Inc | Technology | Consumer Electronics | Electronic Computers |
TSLA | 2021-04-01 | 688.37 | 692.4202 | 659.42 | 661.75 | 661.75 | 35298376 | Tesla Inc | Consumer Cyclical | Auto Manufacturers | Motor Vehicles & Passenger Car Bodies |
That feels much more natural to me.
Inequality joins with join_by
join_by
has some useful features that were hard to implement in the previous version of dplyr
.
Say I have a bunch of trades and quotes data. I want to match each trade with the previous quote that’s closest in time.
Here are some trades and quotes data:
Trades:
options("digits.secs" = 6)
trades <- data.frame(list(
time = structure(c(1464197400.023, 1464197400.038, 1464197400.048, 1464197400.048,
1464197400.048), class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
ticker = c("MSFT", "MSFT", "GOOG", "GOOG", "AAPL"),
price = c(51.95, 51.95, 720.77, 720.92, 98),
quantity = c(75, 155, 100, 100, 100)
))
trades %>%
pretty_table_1()
time | ticker | price | quantity |
---|---|---|---|
2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 |
2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 |
2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 |
2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 |
2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 |
Quotes:
quotes <- data.frame(list(
time = structure(c(1464197400.023, 1464197400.023, 1464197400.03, 1464197400.041,
1464197400.048, 1464197400.049, 1464197400.072, 1464197400.075), class = c("POSIXct", "POSIXt"), tzone = "America/New_York", class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
ticker = c("GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"),
bid = c(720.5, 51.95, 51.97, 51.99, 720.5, 97.99, 720.5, 52.01),
ask = c(720.93, 51.96, 51.98, 52, 720.93, 98.01, 720.88, 52.03)
))
quotes %>%
pretty_table_1()
time | ticker | bid | ask |
---|---|---|---|
2016-05-25 13:30:00.023 | GOOG | 720.50 | 720.93 |
2016-05-25 13:30:00.023 | MSFT | 51.95 | 51.96 |
2016-05-25 13:30:00.029 | MSFT | 51.97 | 51.98 |
2016-05-25 13:30:00.040 | MSFT | 51.99 | 52.00 |
2016-05-25 13:30:00.048 | GOOG | 720.50 | 720.93 |
2016-05-25 13:30:00.049 | AAPL | 97.99 | 98.01 |
2016-05-25 13:30:00.072 | GOOG | 720.50 | 720.88 |
2016-05-25 13:30:00.075 | MSFT | 52.01 | 52.03 |
We can find the most recent quote prior to each trade by passing an equality expression to join_by
, which we wrap in the helper function closest
:
join_by(ticker, closest(time >= time))
## Join By: ## - ticker ## - closest(time >= time)
trades %>%
left_join(quotes, by = join_by(ticker, closest(time >= time)), suffix = c(".trade", ".quote")) %>%
pretty_table_3()
time.trade | ticker | price | quantity | time.quote | bid | ask |
---|---|---|---|---|---|---|
2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 2016-05-25 13:30:00.023 | 51.95 | 51.96 |
2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 |
2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 |
2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 |
2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA |
This is functionally equivalent to the SQL operation ASOF
and the merge_asof
function in pandas
(in fact, I took the data for this example from the pandas
docs).
Note in the above example, we got a trade in MSFT outside the quoted bid-ask, which can happen.
It might be reasonable to not match when the times are equal, in which case we can change the >=
equality to >
(here we have millisecond precision, and it’s entirely reasonable that a quote and a trade could occur in the same millisecond, but ignore that for the sake of the example):
trades %>%
left_join(quotes, by = join_by(ticker, closest(time > time)), suffix = c(".trade", ".quote")) %>%
pretty_table_3()
time.trade | ticker | price | quantity | time.quote | bid | ask |
---|---|---|---|---|---|---|
2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | NA | NA | NA |
2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 |
2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 |
2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 |
2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA |
We could also impose a 2ms gap between our quote and our matched trade, but to do this we need to make a new column, as join_by
can’t take a computed column:
library(lubridate)
trades %>%
left_join(
# create new column before doing the join
quotes %>% mutate(time_buffer = time + lubridate::seconds(2/1e3)),
by = join_by(ticker, closest(time >= time_buffer)),
suffix = c(".trade", ".quote")
) %>%
pretty_table_4()
time.trade | ticker | price | quantity | time.quote | bid | ask | time_buffer |
---|---|---|---|---|---|---|---|
2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | NA | NA | NA | NA |
2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 | 2016-05-25 13:30:00.032 |
2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 | 2016-05-25 13:30:00.025 |
2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.023 | 720.50 | 720.93 | 2016-05-25 13:30:00.025 |
2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA | NA |
Lastly, join_by
can take helpers other than closest
to create overlapping joins:
between
within
overlaps
For example, we can match trades to all quotes that occur within the prior 15ms:
trades %>%
left_join(
quotes %>% mutate(time_buffer = time + lubridate::seconds(15/1e3)),
by = join_by(ticker, between(x = time, y_lower = time, y_upper = time_buffer, bounds = "[]")),
suffix = c(".trade", ".quote")
) %>%
pretty_table_4()
time.trade | ticker | price | quantity | time.quote | bid | ask | time_buffer |
---|---|---|---|---|---|---|---|
2016-05-25 13:30:00.023 | MSFT | 51.95 | 75 | 2016-05-25 13:30:00.023 | 51.95 | 51.96 | 2016-05-25 13:30:00.038 |
2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.023 | 51.95 | 51.96 | 2016-05-25 13:30:00.038 |
2016-05-25 13:30:00.038 | MSFT | 51.95 | 155 | 2016-05-25 13:30:00.029 | 51.97 | 51.98 | 2016-05-25 13:30:00.045 |
2016-05-25 13:30:00.048 | GOOG | 720.77 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 | 2016-05-25 13:30:00.063 |
2016-05-25 13:30:00.048 | GOOG | 720.92 | 100 | 2016-05-25 13:30:00.048 | 720.50 | 720.93 | 2016-05-25 13:30:00.063 |
2016-05-25 13:30:00.048 | AAPL | 98.00 | 100 | NA | NA | NA | NA |
Note how this method matches the second MSFT trade at 13:30:00.038 to two quotes at 13:30:00.023 and 13:30:00.029.
Will the old approaches still work?
Yes, for now. The dplyr
docs state that both approaches to joining are supported, and there is no mention of deprecating the legacy approach. Good news for backwards compatibility.
However, I’m not even sure how you’d implement an ASOF
or an overlapping join using that approach. The new approach opens up a whole new set of joining operations.
Other changes in dplyr 1.1.0
The other big change in dplyr
1.1.0 is the approach to grouping and aggregating data. I’ll take a look at that in another post.
Your thoughts?
What do you think? Do you prefer this new approach to joining data in dplyr
? Does it feel more natural, or more aligned with what you might have done in SQL? Let me know in the comments.
1 thought on “More Intuitive Joins in dplyr 1.1.0 – how to do an asof join on trades and quotes data”