Almost all joins between 2 data.table
s use a notation where one of them is used as i
in a frame applied to the other, and the joining columns are specified with the on
parameter. However, in addition to the “basic” joins, data.table
allows for special cases like rolling joins, summarizing while joining, non-equi joins, etc. This vignette will describe the notation to apply these joins with verbs defined in table.express
, which, like the single-table verbs, build data.table
expressions.
We’ll consider most of the dplyr
joining verbs in this section:
inner_join
left_join
right_join
anti_join
semi_join
full_join
A <- data.table::data.table(x = rep(c("b", "a", "c"), each = 3),
y = c(1, 3, 6),
v = 1:9)
B <- data.table::data.table(x = c("c", "b"),
v2 = 8:7,
foo = c(4, 2))
A
#> x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9
B
#> x v2 foo
#> 1: c 8 4
#> 2: b 7 2
The methods defined in table.express
accept the on
part of the expression in their ellipsis:
A %>%
inner_join(B, x)
#> x y v v2 foo
#> 1: c 1 7 8 4
#> 2: c 3 8 8 4
#> 3: c 6 9 8 4
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
A %>%
inner_join(B, x, v = v2)
#> x y v foo
#> 1: c 3 8 4
An important thing to note in the second example above is the order in which the columns are given, i.e. that v
is written before v2
, since the order is relevant for data.table
. We can remember the correct order simply by looking at which data.table
appears first in the expression, and knowing that said data.table
’s columns must appear first in the on
expressions. In this case, A
appears before B
, so writing v2 = v
would not work.
In order to maintain consistency in the ordering just described, left_join
automatically swaps expression elements internally in order to build the expression:
#> .DT_[.DT_0_, on = list(x, v2 = v), nomatch = , mult = , roll = ,
#> rollends = ]
#> x v2 foo y
#> 1: b 1 NA 1
#> 2: b 2 NA 3
#> 3: b 3 NA 6
#> 4: a 4 NA 1
#> 5: a 5 NA 3
#> 6: a 6 NA 6
#> 7: c 7 NA 1
#> 8: c 8 4 3
#> 9: c 9 NA 6
We can also see an extra .DT_0_
pronoun in the expression. These special .DT_*_
pronouns hold the different data.table
s that have entered the expression so far, and are automatically assigned to the evaluation’s environment. In this case, .DT_
holds B
and .DT_0_
holds A
.
No additional considerations are required to use right_join
or anti_join
:
A %>%
right_join(B, x, v = v2)
#> x y v foo
#> 1: c 3 8 4
#> 2: b NA 7 2
#> x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 6 9
A semi_join
is essentially a right_join
with nomatch = NULL
, and where j
is set to unique(.SD)
:
#> x y v
#> 1: c 1 7
#> 2: c 3 8
#> 3: c 6 9
#> 4: b 1 1
#> 5: b 3 2
#> 6: b 6 3
Finally, full_join
is basically a wrapper for merge
specifying all = TRUE
:
#> x y v.x v.y foo
#> 1: a 1 4 NA NA
#> 2: a 3 5 NA NA
#> 3: a 6 6 NA NA
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
#> 7: c 1 7 8 4
#> 8: c 3 8 8 4
#> 9: c 6 9 8 4
Something to keep in mind is that most joins specify i
and on
inside the frame, so any subsequent verbs that specify j
, by
, or keyby
would still be possible. In order to enable this, many joining verbs have an .expr
parameter that indicates that the expression should be kept, delaying evaluation until a verb that sets j
is used. This can be useful if only a subset of the joined columns is desired:
#> x y foo
#> 1: b 1 2
#> 2: b 3 2
#> 3: b 6 2
#> 4: a 1 NA
#> 5: a 3 NA
#> 6: a 6 NA
#> 7: c 1 4
#> 8: c 3 4
#> 9: c 6 4
But, when working lazily, this would require explicit chaining for expressions that should be applied to the joined table:
A %>%
start_expr %>%
left_join(B, x) %>%
chain %>%
group_by(x) %>%
mutate(y = cumsum(y)) %>%
end_expr %>% {
invisible(print(.))
}
#> x v foo y i.v
#> 1: b 7 2 1 1
#> 2: b 7 2 4 2
#> 3: b 7 2 10 3
#> 4: a NA NA 1 4
#> 5: a NA NA 4 5
#> 6: a NA NA 10 6
#> 7: c 8 4 1 7
#> 8: c 8 4 4 8
#> 9: c 8 4 10 9
This is particularly important if the selection expressions call any function, e.g. tidyselect
helpers or even :
with non-numerics, because data.table
does not support that kind of syntax for j
in the same frame as a join.
A special data.table
idiom is described here as an update join. In order to highlight the fact that it modifies the left-hand side table by reference, the mutate_join
verb is defined in table.express
. Said verb accepts the columns to be added in its .SDcols
parameter, possibly with new names:
A %>%
mutate_join(B, x, .SDcols = c("foo", v_from_B = "v"))
print(A)
#> x y v foo v_from_B
#> 1: b 1 1 2 7
#> 2: b 3 2 2 7
#> 3: b 6 3 2 7
#> 4: a 1 4 NA NA
#> 5: a 3 5 NA NA
#> 6: a 6 6 NA NA
#> 7: c 1 7 4 8
#> 8: c 3 8 4 8
#> 9: c 6 9 4 8
A particularity of this idiom is that the number of rows from the resulting join must match the left-hand side exactly or not at all, so this won’t work:
B %>%
mutate_join(A, x, .SDcols = "y")
In these cases, we must either use mult
if appropriate, or specify a summarizing expression in .SDcols
:
B %>%
mutate_join(A, x, .SDcols = "y", mult = "first")
print(B)
#> x v foo y
#> 1: c 8 4 1
#> 2: b 7 2 1
B %>%
mutate_join(A, x, .SDcols = .(y = mean(y)))
print(B)
#> x v foo y
#> 1: c 8 4 3.333333
#> 2: b 7 2 3.333333
The last example specifies by = .EACHI
in the joining expression.
A nice blog post describing rolling joins can be found at R-bloggers, so almost the same website
and paypal
tables will be used for the examples below. Another short description with animated depictions can also be found here.
print(website)
#> name session_start_time session_id
#> 1: Erica 2016-01-04 19:12:00 1
#> 2: Erica 2016-01-04 21:05:00 2
#> 3: Francis 2016-01-02 13:09:00 3
#> 4: Francis 2016-01-03 19:22:00 4
#> 5: Francis 2016-01-08 08:44:00 5
#> 6: Francis 2016-01-08 20:22:00 6
#> 7: Francis 2016-01-10 17:36:00 7
#> 8: Francis 2016-01-15 16:56:00 8
#> 9: Isabel 2016-01-01 11:01:00 9
#> 10: Isabel 2016-01-02 08:59:00 10
#> 11: Isabel 2016-01-05 18:18:00 11
#> 12: Isabel 2016-01-07 19:03:00 12
#> 13: Isabel 2016-01-08 19:01:00 13
#> 14: Sally 2016-01-03 10:00:00 14
#> 15: Vivian 2016-01-01 09:10:00 15
#> 16: Vivian 2016-01-09 02:15:00 16
print(paypal)
#> name purchase_time payment_id
#> 1: Erica 2016-01-03 08:02:00 1
#> 2: Francis 2016-01-03 19:28:00 2
#> 3: Francis 2016-01-08 20:33:00 3
#> 4: Francis 2016-01-10 17:46:00 4
#> 5: Isabel 2016-01-08 19:10:00 5
#> 6: Mom 2015-12-02 17:58:00 6
#> 7: Sally 2016-01-03 10:06:00 7
#> 8: Sally 2016-01-03 10:15:00 8
In contrast to the blog post, no join_time
is added to the tables. This is done on purpose in order to show what happens with the columns that are rolled.
Let’s use a left rolling join to obtain the session_id
that immediately preceded a purchase, if any:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 4 2
#> 3: Francis 2016-01-08 20:33:00 6 3
#> 4: Francis 2016-01-10 17:46:00 7 4
#> 5: Isabel 2016-01-08 19:10:00 13 5
#> 6: Mom 2015-12-02 17:58:00 NA 6
#> 7: Sally 2016-01-03 10:06:00 14 7
#> 8: Sally 2016-01-03 10:15:00 14 8
We can see that the rows returned are from the left-hand side (paypal
), and since neither Mom nor Erica visited the website before their purchases, their session_id
ended as NA
.
The order of the columns in the on
expressions is the same as above. The tricky part is that the rolled column ended up with the name from the right-hand side, but keeping the values from the left-hand side. If we “invert” the join, the result is the same, but the rolled column’s name is now from the expression’s left-hand side.
website %>%
right_join(paypal, name, session_start_time = purchase_time, roll = Inf)
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 4 2
#> 3: Francis 2016-01-08 20:33:00 6 3
#> 4: Francis 2016-01-10 17:46:00 7 4
#> 5: Isabel 2016-01-08 19:10:00 13 5
#> 6: Mom 2015-12-02 17:58:00 NA 6
#> 7: Sally 2016-01-03 10:06:00 14 7
#> 8: Sally 2016-01-03 10:15:00 14 8
Note, however, that roll
stayed equal to Inf
. This is because even though the column order in the expressions changed, we could understand the rolling expressions as follows:
left_join
, the rolling column on the left is purchase_time
, so with roll = Inf
, the values from session_start_time
are rolled forward onto purchase_time
to find a match while joining.right_join
, the rolling column on the right is purchase_time
, so roll
must stay as Inf
to keep the same semantics.Now let’s say we want to keep all the rows from website
and find the closest payment_id
that occurred after the visit. This could be expressed as:
#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 NA 1
#> 2: Erica 2016-01-04 21:05:00 NA 2
#> 3: Francis 2016-01-02 13:09:00 2 3
#> 4: Francis 2016-01-03 19:22:00 2 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 NA 8
#> 9: Isabel 2016-01-01 11:01:00 5 9
#> 10: Isabel 2016-01-02 08:59:00 5 10
#> 11: Isabel 2016-01-05 18:18:00 5 11
#> 12: Isabel 2016-01-07 19:03:00 5 12
#> 13: Isabel 2016-01-08 19:01:00 5 13
#> 14: Sally 2016-01-03 10:00:00 7 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
In order to simplify the meaning of rollends
a bit, we could think of it as missing or being a single TRUE
/FALSE
. If it’s missing, rolling works according to the value of roll
, otherwise:
rollends = TRUE
, the value of roll
is inverted only for those rows that would have no match otherwise.rollends = FALSE
, a matching roll will only occur if the column’s value falls in a gap with values both before and after.
website %>%
left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = TRUE)
#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 1 1
#> 2: Erica 2016-01-04 21:05:00 1 2
#> 3: Francis 2016-01-02 13:09:00 2 3
#> 4: Francis 2016-01-03 19:22:00 2 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 4 8
#> 9: Isabel 2016-01-01 11:01:00 5 9
#> 10: Isabel 2016-01-02 08:59:00 5 10
#> 11: Isabel 2016-01-05 18:18:00 5 11
#> 12: Isabel 2016-01-07 19:03:00 5 12
#> 13: Isabel 2016-01-08 19:01:00 5 13
#> 14: Sally 2016-01-03 10:00:00 7 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
website %>%
left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = FALSE)
#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 NA 1
#> 2: Erica 2016-01-04 21:05:00 NA 2
#> 3: Francis 2016-01-02 13:09:00 NA 3
#> 4: Francis 2016-01-03 19:22:00 NA 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 NA 8
#> 9: Isabel 2016-01-01 11:01:00 NA 9
#> 10: Isabel 2016-01-02 08:59:00 NA 10
#> 11: Isabel 2016-01-05 18:18:00 NA 11
#> 12: Isabel 2016-01-07 19:03:00 NA 12
#> 13: Isabel 2016-01-08 19:01:00 NA 13
#> 14: Sally 2016-01-03 10:00:00 NA 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
Vivian’s payment_id
s are always NA
because she has never purchased anything. On the other hand, no one except Francis has visited the website both before and after a purchase.
Non-equi joins are similar to rolling joins, but instead of rolling a single row’s value, they can return several values per row.
Using the same data as before, we could find all the session_id
s that preceded a payment_id
, giving “priority” to paypal
’s rows:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 3 2
#> 3: Francis 2016-01-03 19:28:00 4 2
#> 4: Francis 2016-01-08 20:33:00 3 3
#> 5: Francis 2016-01-08 20:33:00 4 3
#> 6: Francis 2016-01-08 20:33:00 5 3
#> 7: Francis 2016-01-08 20:33:00 6 3
#> 8: Francis 2016-01-10 17:46:00 3 4
#> 9: Francis 2016-01-10 17:46:00 4 4
#> 10: Francis 2016-01-10 17:46:00 5 4
#> 11: Francis 2016-01-10 17:46:00 6 4
#> 12: Francis 2016-01-10 17:46:00 7 4
#> 13: Isabel 2016-01-08 19:10:00 9 5
#> 14: Isabel 2016-01-08 19:10:00 10 5
#> 15: Isabel 2016-01-08 19:10:00 11 5
#> 16: Isabel 2016-01-08 19:10:00 12 5
#> 17: Isabel 2016-01-08 19:10:00 13 5
#> 18: Mom 2015-12-02 17:58:00 NA 6
#> 19: Sally 2016-01-03 10:06:00 14 7
#> 20: Sally 2016-01-03 10:15:00 14 8
Priority above simply means that all rows from paypal
are returned, even if they don’t have a match in website
. Even though a column session_start_time
appears in the result, the values contained therein are from paypal
’s purchase_time
.
A corresponding right non-equi join would yield the same result, expecting only a different order in the columns that are part of the comparisons:
website %>%
right_join(paypal, name, session_start_time <= purchase_time)
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 3 2
#> 3: Francis 2016-01-03 19:28:00 4 2
#> 4: Francis 2016-01-08 20:33:00 3 3
#> 5: Francis 2016-01-08 20:33:00 4 3
#> 6: Francis 2016-01-08 20:33:00 5 3
#> 7: Francis 2016-01-08 20:33:00 6 3
#> 8: Francis 2016-01-10 17:46:00 3 4
#> 9: Francis 2016-01-10 17:46:00 4 4
#> 10: Francis 2016-01-10 17:46:00 5 4
#> 11: Francis 2016-01-10 17:46:00 6 4
#> 12: Francis 2016-01-10 17:46:00 7 4
#> 13: Isabel 2016-01-08 19:10:00 9 5
#> 14: Isabel 2016-01-08 19:10:00 10 5
#> 15: Isabel 2016-01-08 19:10:00 11 5
#> 16: Isabel 2016-01-08 19:10:00 12 5
#> 17: Isabel 2016-01-08 19:10:00 13 5
#> 18: Mom 2015-12-02 17:58:00 NA 6
#> 19: Sally 2016-01-03 10:06:00 14 7
#> 20: Sally 2016-01-03 10:15:00 14 8
In case a self join were necessary, perhaps while using a rolling or non-equi join, the way magrittr
’s pipe handles the .
outside of nested calls wouldn’t allow calling a joining verb with .
both in x
and y
. To work around this, the following verbs default to an eager self join when y
is missing:
full_join
left_join
mutate_join
As a somewhat contrived example, we could add a rolling count of weekly visits per user to the website
data introduced above:
website %>%
mutate(window_start = session_start_time - as.difftime(7, units = "days")) %>%
mutate_join(, name, window_start <= session_start_time, session_start_time >= session_start_time,
.SDcols = .(weekly_visits = .N),
.by_each = TRUE) %>%
mutate(window_start = NULL)
print(website)
#> name session_start_time session_id weekly_visits
#> 1: Erica 2016-01-04 19:12:00 1 1
#> 2: Erica 2016-01-04 21:05:00 2 2
#> 3: Francis 2016-01-02 13:09:00 3 1
#> 4: Francis 2016-01-03 19:22:00 4 2
#> 5: Francis 2016-01-08 08:44:00 5 3
#> 6: Francis 2016-01-08 20:22:00 6 4
#> 7: Francis 2016-01-10 17:36:00 7 4
#> 8: Francis 2016-01-15 16:56:00 8 3
#> 9: Isabel 2016-01-01 11:01:00 9 1
#> 10: Isabel 2016-01-02 08:59:00 10 2
#> 11: Isabel 2016-01-05 18:18:00 11 3
#> 12: Isabel 2016-01-07 19:03:00 12 4
#> 13: Isabel 2016-01-08 19:01:00 13 4
#> 14: Sally 2016-01-03 10:00:00 14 1
#> 15: Vivian 2016-01-01 09:10:00 15 1
#> 16: Vivian 2016-01-09 02:15:00 16 1