feat: Partial support for `across()` in `mutate()` and `summarise()` #296

DavisVaughan · 2024-10-28T14:08:37Z

Joint work with @lionel-

This PR adds partial support for inlining of across() in mutate() and summarise(). If we can successfully inline the across() call, then we don't have to fall back to a dplyr implementation.

It currently has the following restrictions:

You can select multiple cols, i.e. across(c(a, b),)
But you can only supply 1 function, i.e. across(cols, fn)
- You cannot do across(cols, list(fn1 = fn1, fn2 = fn2))
- You cannot even do across(cols, list(fn = fn))
Inlined lambdas can be expanded like across(cols, \(x) mean(x, na.rm = TRUE))
- But they can't be stored in a variable first, like fn <- \(x) mean(x, na.rm = TRUE) followed by across(cols, fn)
Using .names is supported, but not particularly useful right now with just 1 function
Using .unpack forces a fallback (like dplyr)
Using the ... forces a fallback (like dplyr)

In the future on the dplyr side we'd like to fully deprecate the usage of ... in across(), which currently accepts extra arguments for the .fns and is very confusing. After it is deprecated fully we can instead allow it to accept multiple functions, rather than using a list, so you can do: across(cols, mean, median, \(x) mean(x, na.rm = TRUE)). This is when duckplyr could more easily start accepting >1 functions.

In the details section below, there are a number of examples about when we do vs don't expand.

Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
library(duckplyr)
library(dplyr)

df <- tibble(x = 1:5, y = 6:10)
df <- as_duckplyr_df(df)

# yay both of these work
mutate(df, across(x:y, mean))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [mean(x) OVER () as x, mean(y) OVER () as y]
#>   r_dataframe_scan(0x11330e9c8)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (DOUBLE)
#> 
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

mutate(df, across(x:y, \(arg) mean(arg)))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [mean(x) OVER () as x, mean(y) OVER () as y]
#>   r_dataframe_scan(0x114daab48)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (DOUBLE)
#> 
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

# cool (x_mean as new name)
mutate(df, across(c(x_mean = x, y_mean = y), \(arg) mean(arg)))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [x as x, y as y, mean(x) OVER () as x_mean, mean(y) OVER () as y_mean]
#>   r_dataframe_scan(0x112346ec8)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (INTEGER)
#> - y (INTEGER)
#> - x_mean (DOUBLE)
#> - y_mean (DOUBLE)
#> 
#> # A tibble: 5 × 4
#>       x     y x_mean y_mean
#>   <int> <int>  <dbl>  <dbl>
#> 1     1     6      3      8
#> 2     2     7      3      8
#> 3     3     8      3      8
#> 4     4     9      3      8
#> 5     5    10      3      8

# yep, that's right (x_mean_1 as new name)
mutate(df, across(c(x_mean = x, y_mean = y), \(arg) mean(arg), .names = "{.col}_{.fn}"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [x as x, y as y, mean(x) OVER () as x_mean_1, mean(y) OVER () as y_mean_1]
#>   r_dataframe_scan(0x112532708)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (INTEGER)
#> - y (INTEGER)
#> - x_mean_1 (DOUBLE)
#> - y_mean_1 (DOUBLE)
#> 
#> # A tibble: 5 × 4
#>       x     y x_mean_1 y_mean_1
#>   <int> <int>    <dbl>    <dbl>
#> 1     1     6        3        8
#> 2     2     7        3        8
#> 3     3     8        3        8
#> 4     4     9        3        8
#> 5     5    10        3        8

# this falls back due to `.unpack`
mutate(df, across(x:y, \(arg) mean(arg), .unpack = TRUE))
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

# this triggers the fallback because it cant translate `na.rm`
mutate(df, across(x:y, \(arg) mean(arg, na.rm = TRUE)))
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

# updates from the `a` expression are usable and no fallback is triggered
mutate(df, a = y, across(c(a, x), function(col) col + 1))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection ["+"(x, 1.0) as x, y as y, "+"(a, 1.0) as a]
#>   Projection [x as x, y as y, y as a]
#>     r_dataframe_scan(0x10725d948)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (INTEGER)
#> - a (DOUBLE)
#> 
#> # A tibble: 5 × 3
#>       x     y     a
#>   <dbl> <int> <dbl>
#> 1     2     6     7
#> 2     3     7     8
#> 3     4     8     9
#> 4     5     9    10
#> 5     6    10    11

# looks like `NULL` triggers the fallback-to-dplyr case, so we don't have to
# worry about that i guess
mutate(df, x = NULL)
#> # A tibble: 5 × 1
#>       y
#>   <int>
#> 1     6
#> 2     7
#> 3     8
#> 4     9
#> 5    10

# Not expanded - `list()` not supported even with 1 fn
mutate(df, across(x:y, list(mean = mean)))
#> # A tibble: 5 × 4
#>       x     y x_mean y_mean
#>   <int> <int>  <dbl>  <dbl>
#> 1     1     6      3      8
#> 2     2     7      3      8
#> 3     3     8      3      8
#> 4     4     9      3      8
#> 5     5    10      3      8

# Not expanded - falls back to dplyr
mutate(df, across(x:y, list(mean = mean, median = median)))
#> # A tibble: 5 × 6
#>       x     y x_mean x_median y_mean y_median
#>   <int> <int>  <dbl>    <int>  <dbl>    <int>
#> 1     1     6      3        3      8        8
#> 2     2     7      3        3      8        8
#> 3     3     8      3        3      8        8
#> 4     4     9      3        3      8        8
#> 5     5    10      3        3      8        8

# Not expanded - falls back to dplyr
mutate(df, packed = across(x:y, mean))
#> # A tibble: 5 × 3
#>       x     y packed$x    $y
#>   <int> <int>    <dbl> <dbl>
#> 1     1     6        3     8
#> 2     2     7        3     8
#> 3     3     8        3     8
#> 4     4     9        3     8
#> 5     5    10        3     8

# Expanded - then falls back to dplyr after duckplyr checks if `mean()`
# matches `base::mean()`
mean <- function(...) stop("oh no")
mutate(df, across(x:y, mean))
#> Error in `mutate()`:
#> ℹ In argument: `across(x:y, mean)`.
#> Caused by error in `across()`:
#> ! Can't compute column `x`.
#> Caused by error:
#> ! oh no
rm(mean)

# All expressions generated by `across()` need to evaluate within the
# same projection (this should return 2s in all columns)
df <- tibble(x = 1, y = 2, z = 3)
df <- as_duckplyr_df(df)
mutate(df, across(x:z, ~ 1 + x))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection ["+"(1.0, x) as x, "+"(1.0, x) as y, "+"(1.0, x) as z]
#>   r_dataframe_scan(0x112572b48)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (DOUBLE)
#> - z (DOUBLE)
#> 
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     2     2     2

Some rationale for why we got here:

Why can't we inline across() in dplyr's generics directly?

Method dispatch makes this hard, S3 signatures currently look like this:

mutate <- function(.data, ...) {
  UseMethod("mutate")
}

mutate.data.frame <- function(.data, ..., .keep, .more) {
  ...
}

mutate.method <- function(.data, ..., .keep, .method_specific) {
  ...
}

We could try something like:

mutate <- function(.data, ...) {
  dots <- across_expand(enquos(...))
  mutate_dispatch(.data, !!!dots)
}
mutate_dispatch <- function(.data, ...) {
  UseMethod("mutate")
}

But how do we know to not apply across_expand() to .method_specific? It gets captured in the ... of the generic when the user calls something like:

mutate(df, across(), .method_specific = "first")

We would end up transforming all incoming arguments to quosures, even things like .keep that are not meant to be evaluated with eval_tidy().

So why not export `dplyr::partial_across_expand()` for methods to call manually?

We could try and make the following approach work:

mutate <- function(.data, ...) {
  UseMethod("mutate")
}

mutate.data.frame <- function(.data, ..., .keep, .more) {
  # dplyr specific expanding, fully featured `across_expand()`
}

mutate.method <- function(.data, ..., .keep, .method_specific) {
  # Alternative to `dots <- enquos(...)`
  dots <- dplyr::partial_across_expand(...)
}

This gets around the ... problem from above, because we've matched the dots to the specific S3 method at this point. But there are a number of reasons we don't necessarily want to do this right now:

We aren't sure how fully featured we can make "partial" expansion while still doing something all backends support. dbplyr and dtplyr already have their own across() support, so it is also possible our "partial" support might not capture all the use cases they already support - in which case they'd still need their own version.
The returned dots would likely be "dplyr_quosures" which are currently internal dplyr details that we are free to change.
dplyr itself would not use this, because it would have full across expansion support instead, which unfortunately relies on many internal dplyr details that we would not want to try and expose in any way

Since we don't have enough experience with this new variant yet, we currently want duckplyr to be the only consumer for now. We can revisit exporting it from dplyr once we get a little experience with it and do a cross comparison between duckplyr, dbplyr, and dtplyr to see if we can find some common utility that everyone can happily use.

krlmlr · 2024-11-01T13:35:33Z

Thanks, this looks good. I'll play with it a bit before merging.

IIUC, the main argument for not adding it to the mutate() generic is that we can't distinguish new columns from .method_specific arguments. What if dplyr warned on any new column that starts with a dot and doesn't use the := syntax, and if we expected all methods to start their custom arguments with a dot?

…ingly

krlmlr · 2024-11-02T17:29:09Z

Thanks. I added tests, and have an extension ready that I'll open a new PR for.

lionel- · 2024-11-04T08:44:52Z

What if dplyr warned on any new column that starts with a dot and doesn't use the := syntax, and if we expected all methods to start their custom arguments with a dot?

Unfortunately I don't think we can expect downstream dependencies to adhere to this convention at this point?

Interesting point about requiring := for dotted column names. That said having to use := is only a workaround for the parser restrictions on the LHS of = and ideally we'd not use it at all. I don't think we'd want to find new usages for it as it's a confusing aspect of tidyeval / tidyverse syntax.

krlmlr force-pushed the feature/across branch from 1eb360c to 896315d Compare November 2, 2024 11:21

lionel- and others added 16 commits November 2, 2024 12:23

Draft expand_across()

48b1b0d

Inline lambdas / functions where possible

aea1264

Fix minor buglets

730d2b0

Remove outdated comments

8e8911e

Fix symbol case

231c996

Rename helper to avoid collision

01cbdae

Use in mutate()

b481903

Return NULL if we bail on expansion so we can tweak behavior accord…

364f1d7

…ingly

Use names_out for the "current state of the world"

ddf0e87

Add a big ole TODO

26fd657

Clean up a bit more

bbad9a6

Create 1 projection per across() call

789cb70

Use reduce() for clarity

f597407

Expand across() in summarise()

8680849

Clean up expr$.fns usage

7380ed3

Remove forbid_new entirely

137d1fa

krlmlr force-pushed the feature/across branch from 896315d to 137d1fa Compare November 2, 2024 11:23

krlmlr added 2 commits November 2, 2024 13:10

Comment

978f775

Rename

bce9ff9

github-actions bot mentioned this pull request Nov 2, 2024

test: Snapshot updates for rcc-smoke (null) #303

Closed

krlmlr added 2 commits November 2, 2024 15:04

Reduce complexity

8093de7

Add tests

9743558

krlmlr force-pushed the feature/across branch 4 times, most recently from 5eb3aef to 9743558 Compare November 2, 2024 17:28

krlmlr marked this pull request as ready for review November 2, 2024 17:28

krlmlr enabled auto-merge November 2, 2024 17:29

krlmlr changed the title ~~Partial support for across() in mutate() and summarise()~~ feat: Partial support for across() in mutate() and summarise() Nov 2, 2024

krlmlr added this pull request to the merge queue Nov 2, 2024

Merged via the queue into main with commit fc68cfa Nov 2, 2024
17 checks passed

krlmlr deleted the feature/across branch November 2, 2024 18:01

krlmlr mentioned this pull request Nov 2, 2024

feat: More complete across() #306

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Partial support for `across()` in `mutate()` and `summarise()` #296

feat: Partial support for `across()` in `mutate()` and `summarise()` #296

DavisVaughan commented Oct 28, 2024 •

edited

Loading

krlmlr commented Nov 1, 2024

krlmlr commented Nov 2, 2024

lionel- commented Nov 4, 2024

feat: Partial support for across() in mutate() and summarise() #296

feat: Partial support for across() in mutate() and summarise() #296

Conversation

DavisVaughan commented Oct 28, 2024 • edited Loading

Why can't we inline across() in dplyr's generics directly?

So why not export dplyr::partial_across_expand() for methods to call manually?

krlmlr commented Nov 1, 2024

krlmlr commented Nov 2, 2024

lionel- commented Nov 4, 2024

feat: Partial support for `across()` in `mutate()` and `summarise()` #296

feat: Partial support for `across()` in `mutate()` and `summarise()` #296

DavisVaughan commented Oct 28, 2024 •

edited

Loading

So why not export `dplyr::partial_across_expand()` for methods to call manually?