Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Partial support for across() in mutate() and summarise() #296

Merged
merged 20 commits into from
Nov 2, 2024

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Oct 28, 2024

Joint work with @lionel-

This PR adds partial support for inlining of across() in mutate() and summarise(). If we can successfully inline the across() call, then we don't have to fall back to a dplyr implementation.

It currently has the following restrictions:

  • You can select multiple cols, i.e. across(c(a, b),)

  • But you can only supply 1 function, i.e. across(cols, fn)

    • You cannot do across(cols, list(fn1 = fn1, fn2 = fn2))
    • You cannot even do across(cols, list(fn = fn))
  • Inlined lambdas can be expanded like across(cols, \(x) mean(x, na.rm = TRUE))

    • But they can't be stored in a variable first, like fn <- \(x) mean(x, na.rm = TRUE) followed by across(cols, fn)
  • Using .names is supported, but not particularly useful right now with just 1 function

  • Using .unpack forces a fallback (like dplyr)

  • Using the ... forces a fallback (like dplyr)

In the future on the dplyr side we'd like to fully deprecate the usage of ... in across(), which currently accepts extra arguments for the .fns and is very confusing. After it is deprecated fully we can instead allow it to accept multiple functions, rather than using a list, so you can do: across(cols, mean, median, \(x) mean(x, na.rm = TRUE)). This is when duckplyr could more easily start accepting >1 functions.

In the details section below, there are a number of examples about when we do vs don't expand.

Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
library(duckplyr)
library(dplyr)

df <- tibble(x = 1:5, y = 6:10)
df <- as_duckplyr_df(df)

# yay both of these work
mutate(df, across(x:y, mean))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [mean(x) OVER () as x, mean(y) OVER () as y]
#>   r_dataframe_scan(0x11330e9c8)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (DOUBLE)
#> 
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

mutate(df, across(x:y, \(arg) mean(arg)))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [mean(x) OVER () as x, mean(y) OVER () as y]
#>   r_dataframe_scan(0x114daab48)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (DOUBLE)
#> 
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

# cool (x_mean as new name)
mutate(df, across(c(x_mean = x, y_mean = y), \(arg) mean(arg)))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [x as x, y as y, mean(x) OVER () as x_mean, mean(y) OVER () as y_mean]
#>   r_dataframe_scan(0x112346ec8)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (INTEGER)
#> - y (INTEGER)
#> - x_mean (DOUBLE)
#> - y_mean (DOUBLE)
#> 
#> # A tibble: 5 × 4
#>       x     y x_mean y_mean
#>   <int> <int>  <dbl>  <dbl>
#> 1     1     6      3      8
#> 2     2     7      3      8
#> 3     3     8      3      8
#> 4     4     9      3      8
#> 5     5    10      3      8

# yep, that's right (x_mean_1 as new name)
mutate(df, across(c(x_mean = x, y_mean = y), \(arg) mean(arg), .names = "{.col}_{.fn}"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [x as x, y as y, mean(x) OVER () as x_mean_1, mean(y) OVER () as y_mean_1]
#>   r_dataframe_scan(0x112532708)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (INTEGER)
#> - y (INTEGER)
#> - x_mean_1 (DOUBLE)
#> - y_mean_1 (DOUBLE)
#> 
#> # A tibble: 5 × 4
#>       x     y x_mean_1 y_mean_1
#>   <int> <int>    <dbl>    <dbl>
#> 1     1     6        3        8
#> 2     2     7        3        8
#> 3     3     8        3        8
#> 4     4     9        3        8
#> 5     5    10        3        8

# this falls back due to `.unpack`
mutate(df, across(x:y, \(arg) mean(arg), .unpack = TRUE))
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

# this triggers the fallback because it cant translate `na.rm`
mutate(df, across(x:y, \(arg) mean(arg, na.rm = TRUE)))
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     3     8
#> 2     3     8
#> 3     3     8
#> 4     3     8
#> 5     3     8

# updates from the `a` expression are usable and no fallback is triggered
mutate(df, a = y, across(c(a, x), function(col) col + 1))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection ["+"(x, 1.0) as x, y as y, "+"(a, 1.0) as a]
#>   Projection [x as x, y as y, y as a]
#>     r_dataframe_scan(0x10725d948)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (INTEGER)
#> - a (DOUBLE)
#> 
#> # A tibble: 5 × 3
#>       x     y     a
#>   <dbl> <int> <dbl>
#> 1     2     6     7
#> 2     3     7     8
#> 3     4     8     9
#> 4     5     9    10
#> 5     6    10    11

# looks like `NULL` triggers the fallback-to-dplyr case, so we don't have to
# worry about that i guess
mutate(df, x = NULL)
#> # A tibble: 5 × 1
#>       y
#>   <int>
#> 1     6
#> 2     7
#> 3     8
#> 4     9
#> 5    10

# Not expanded - `list()` not supported even with 1 fn
mutate(df, across(x:y, list(mean = mean)))
#> # A tibble: 5 × 4
#>       x     y x_mean y_mean
#>   <int> <int>  <dbl>  <dbl>
#> 1     1     6      3      8
#> 2     2     7      3      8
#> 3     3     8      3      8
#> 4     4     9      3      8
#> 5     5    10      3      8

# Not expanded - falls back to dplyr
mutate(df, across(x:y, list(mean = mean, median = median)))
#> # A tibble: 5 × 6
#>       x     y x_mean x_median y_mean y_median
#>   <int> <int>  <dbl>    <int>  <dbl>    <int>
#> 1     1     6      3        3      8        8
#> 2     2     7      3        3      8        8
#> 3     3     8      3        3      8        8
#> 4     4     9      3        3      8        8
#> 5     5    10      3        3      8        8

# Not expanded - falls back to dplyr
mutate(df, packed = across(x:y, mean))
#> # A tibble: 5 × 3
#>       x     y packed$x    $y
#>   <int> <int>    <dbl> <dbl>
#> 1     1     6        3     8
#> 2     2     7        3     8
#> 3     3     8        3     8
#> 4     4     9        3     8
#> 5     5    10        3     8

# Expanded - then falls back to dplyr after duckplyr checks if `mean()`
# matches `base::mean()`
mean <- function(...) stop("oh no")
mutate(df, across(x:y, mean))
#> Error in `mutate()`:
#> ℹ In argument: `across(x:y, mean)`.
#> Caused by error in `across()`:
#> ! Can't compute column `x`.
#> Caused by error:
#> ! oh no
rm(mean)

# All expressions generated by `across()` need to evaluate within the
# same projection (this should return 2s in all columns)
df <- tibble(x = 1, y = 2, z = 3)
df <- as_duckplyr_df(df)
mutate(df, across(x:z, ~ 1 + x))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection ["+"(1.0, x) as x, "+"(1.0, x) as y, "+"(1.0, x) as z]
#>   r_dataframe_scan(0x112572b48)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - x (DOUBLE)
#> - y (DOUBLE)
#> - z (DOUBLE)
#> 
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     2     2     2

Some rationale for why we got here:

Why can't we inline across() in dplyr's generics directly?

Method dispatch makes this hard, S3 signatures currently look like this:

mutate <- function(.data, ...) {
  UseMethod("mutate")
}

mutate.data.frame <- function(.data, ..., .keep, .more) {
  ...
}

mutate.method <- function(.data, ..., .keep, .method_specific) {
  ...
}

We could try something like:

mutate <- function(.data, ...) {
  dots <- across_expand(enquos(...))
  mutate_dispatch(.data, !!!dots)
}
mutate_dispatch <- function(.data, ...) {
  UseMethod("mutate")
}

But how do we know to not apply across_expand() to .method_specific? It gets captured in the ... of the generic when the user calls something like:

mutate(df, across(), .method_specific = "first")

We would end up transforming all incoming arguments to quosures, even things like .keep that are not meant to be evaluated with eval_tidy().

So why not export dplyr::partial_across_expand() for methods to call manually?

We could try and make the following approach work:

mutate <- function(.data, ...) {
  UseMethod("mutate")
}

mutate.data.frame <- function(.data, ..., .keep, .more) {
  # dplyr specific expanding, fully featured `across_expand()`
}

mutate.method <- function(.data, ..., .keep, .method_specific) {
  # Alternative to `dots <- enquos(...)`
  dots <- dplyr::partial_across_expand(...)
}

This gets around the ... problem from above, because we've matched the dots to the specific S3 method at this point. But there are a number of reasons we don't necessarily want to do this right now:

  • We aren't sure how fully featured we can make "partial" expansion while still doing something all backends support. dbplyr and dtplyr already have their own across() support, so it is also possible our "partial" support might not capture all the use cases they already support - in which case they'd still need their own version.
  • The returned dots would likely be "dplyr_quosures" which are currently internal dplyr details that we are free to change.
  • dplyr itself would not use this, because it would have full across expansion support instead, which unfortunately relies on many internal dplyr details that we would not want to try and expose in any way

Since we don't have enough experience with this new variant yet, we currently want duckplyr to be the only consumer for now. We can revisit exporting it from dplyr once we get a little experience with it and do a cross comparison between duckplyr, dbplyr, and dtplyr to see if we can find some common utility that everyone can happily use.

@krlmlr
Copy link
Member

krlmlr commented Nov 1, 2024

Thanks, this looks good. I'll play with it a bit before merging.

IIUC, the main argument for not adding it to the mutate() generic is that we can't distinguish new columns from .method_specific arguments. What if dplyr warned on any new column that starts with a dot and doesn't use the := syntax, and if we expected all methods to start their custom arguments with a dot?

@krlmlr krlmlr force-pushed the feature/across branch 4 times, most recently from 5eb3aef to 9743558 Compare November 2, 2024 17:28
@krlmlr krlmlr marked this pull request as ready for review November 2, 2024 17:28
@krlmlr
Copy link
Member

krlmlr commented Nov 2, 2024

Thanks. I added tests, and have an extension ready that I'll open a new PR for.

@krlmlr krlmlr enabled auto-merge November 2, 2024 17:29
@krlmlr krlmlr changed the title Partial support for across() in mutate() and summarise() feat: Partial support for across() in mutate() and summarise() Nov 2, 2024
@krlmlr krlmlr added this pull request to the merge queue Nov 2, 2024
Merged via the queue into main with commit fc68cfa Nov 2, 2024
17 checks passed
@krlmlr krlmlr deleted the feature/across branch November 2, 2024 18:01
@lionel-
Copy link
Member

lionel- commented Nov 4, 2024

What if dplyr warned on any new column that starts with a dot and doesn't use the := syntax, and if we expected all methods to start their custom arguments with a dot?

Unfortunately I don't think we can expect downstream dependencies to adhere to this convention at this point?

Interesting point about requiring := for dotted column names. That said having to use := is only a workaround for the parser restrictions on the LHS of = and ideally we'd not use it at all. I don't think we'd want to find new usages for it as it's a confusing aspect of tidyeval / tidyverse syntax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants