feat: New duckparquet(), duckcsv(), duckjson() and duckfile(), deprecating duckplyr_df_from_*() and df_from_*() functions #396

Merged
merged 4 commits into from
Dec 16, 2024
4 changes: 4 additions & 0 deletions NAMESPACE
@@ -151,6 +151,10 @@ export(df_from_file)
export(df_from_parquet)
export(df_to_parquet)
export(distinct)
export(duckcsv)
export(duckfile)
export(duckjson)
export(duckparquet)
export(duckplyr_df_from_csv)
export(duckplyr_df_from_file)
export(duckplyr_df_from_parquet)
3 changes: 2 additions & 1 deletion R/ducktbl.R
@@ -19,7 +19,8 @@
#'
#' @param ... For `ducktbl()`, passed on to [tibble()].
#' For `as_ducktbl()`, passed on to methods.
-#' @param .lazy Logical, whether to create a lazy duckplyr frame
+#' @param .lazy Logical, whether to create a lazy duckplyr frame.
+#'   If `TRUE`, [collect()] must be called before the data can be accessed.
#'
#' @return For `ducktbl()` and `as_ducktbl()`, an object with the following classes:
#' - `"lazy_duckplyr_df"` if `.lazy` is `TRUE`
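The `.lazy` contract documented in this hunk can be illustrated without DuckDB. The sketch below is base R only and does not reflect how duckplyr implements lazy frames; the class `lazy_sketch` and the functions `as_lazy_sketch()` and `collect_sketch()` are hypothetical names introduced here to show the idea that column access fails until the data is explicitly collected.

```r
# Hypothetical constructor: tag a data frame with an extra class.
as_lazy_sketch <- function(df) {
  stopifnot(is.data.frame(df))
  class(df) <- c("lazy_sketch", class(df))
  df
}

# Column access with `$` is refused for the tagged class.
`$.lazy_sketch` <- function(x, name) {
  stop("Materialization is disabled, call collect_sketch() first.")
}

# Hypothetical counterpart to collect(): drop the tag, keep the data.
collect_sketch <- function(x) {
  class(x) <- setdiff(class(x), "lazy_sketch")
  x
}

lazy <- as_lazy_sketch(data.frame(a = 1:3, b = letters[4:6]))

# Names remain available, column access raises an error.
print(names(lazy))
res <- try(lazy$a, silent = TRUE)
print(inherits(res, "try-error"))

# After collecting, columns can be read as usual.
print(collect_sketch(lazy)$a)
```

The real functions may behave differently in detail; the point is only that `.lazy = TRUE` trades convenient column access for protection against accidental materialization.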
6 changes: 5 additions & 1 deletion R/io-.R
@@ -1,5 +1,8 @@
#' Read Parquet, CSV, and other files using DuckDB
#'
#' @description
#' `r lifecycle::badge("deprecated")`
#'
#' `df_from_file()` uses arbitrary table functions to read data.
#' See <https://duckdb.org/docs/data/overview> for documentation
#' of the available functions and their options.
@@ -24,6 +27,7 @@
#' `duckplyr_df_from_file()`, extended by the provided `class`.
#'
#' @export
#' @keywords internal
df_from_file <- function(path,
table_function,
...,
@@ -53,7 +57,7 @@ df_from_file <- function(path,
options
)

-meta_rel_register_file(out, path, table_function, options)
+meta_rel_register_file(out, table_function, path, options)

out <- duckdb$rel_to_altrep(out)
class(out) <- unique(c(class, "data.frame"), fromLast = TRUE)
156 changes: 156 additions & 0 deletions R/io2.R
@@ -0,0 +1,156 @@
#' Read Parquet, CSV, and other files using DuckDB
#'
#' @description
#' These functions ingest data from a file.
#' In many cases, they return immediately because only the metadata is read.
#' The data itself is read only when it is processed.
#'
#' @name duckfile
NULL

#' @description
#' `duckparquet()` reads a Parquet file using DuckDB's `read_parquet()` table function.
#'
#' @rdname duckfile
#' @export
duckparquet <- function(path, ..., lazy = TRUE, options = list()) {
  check_dots_empty()

  duckfile(path, "read_parquet", lazy = lazy, options = options)
}

#' @description
#' `duckcsv()` reads a CSV file using DuckDB's `read_csv_auto()` table function.
#'
#' @rdname duckfile
#' @export
#' @examples
#' # Create simple CSV file
#' path <- tempfile("duckplyr_test_", fileext = ".csv")
#' write.csv(data.frame(a = 1:3, b = letters[4:6]), path, row.names = FALSE)
#'
#' # Reading is immediate
#' df <- duckcsv(path)
#'
#' # Names are always available
#' names(df)
#'
#' # Materialization upon access is turned off by default
#' try(print(df$a))
#'
#' # Materialize explicitly
#' collect(df)$a
#'
#' # Automatic materialization with lazy = FALSE
#' df <- duckcsv(path, lazy = FALSE)
#' df$a
#'
#' # Specify column types
#' duckcsv(
#'   path,
#'   options = list(delim = ",", types = list(c("DOUBLE", "VARCHAR")))
#' )
duckcsv <- function(path, ..., lazy = TRUE, options = list()) {
  check_dots_empty()

  duckfile(path, "read_csv_auto", lazy = lazy, options = options)
}

#' @description
#' `duckjson()` reads a JSON file using DuckDB's `read_json()` table function.
#'
#' @rdname duckfile
#' @export
#' @examples
#'
#' # Create and read a simple JSON file
#' path <- tempfile("duckplyr_test_", fileext = ".json")
#' writeLines('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]', path)
#'
#' # Reading needs the json extension
#' duckplyr_execute("INSTALL json")
#' duckplyr_execute("LOAD json")
#' duckjson(path)
duckjson <- function(path, ..., lazy = TRUE, options = list()) {
  check_dots_empty()

  duckfile(path, "read_json", lazy = lazy, options = options)
}

#' @description
#' `duckfile()` uses arbitrary readers to read data.
#' See <https://duckdb.org/docs/data/overview> for documentation
#' of the available functions and their options.
#' To read multiple files with the same schema,
#' pass a wildcard or a character vector to the `path` argument.
#'
#' @details
#' By default, a lazy duckplyr frame is created.
#' This means that all the data can be shown and all dplyr verbs can be used,
#' but attempting to access the columns of the data frame, or using an unsupported verb,
#' data type, or function, will result in an error.
#' Pass `lazy = FALSE` to transparently switch to local processing as needed,
#' or use [collect()] to explicitly materialize and continue local processing.
#'
#' @inheritParams rlang::args_dots_empty
#'
#' @param path Path to files; the glob patterns `*` and `?` are supported.
#' @param table_function The name of a table-valued
#' DuckDB function such as `"read_parquet"`,
#' `"read_csv"`, `"read_csv_auto"` or `"read_json"`.
#' @param lazy Logical, whether to create a lazy duckplyr frame.
#' If `TRUE` (the default), [collect()] must be called before the data can be accessed.
#' @param options Arguments to the DuckDB function
#' indicated by `table_function`.
#'
#' @return A duckplyr frame, see [as_ducktbl()] for details.
#'
#' @rdname duckfile
#' @export
duckfile <- function(
  path,
  table_function,
  ...,
  lazy = TRUE,
  options = list()
) {
  check_dots_empty()

  if (!rlang::is_character(path)) {
    cli::cli_abort("{.arg path} must be a character vector.")
  }

  if (length(path) != 1) {
    path <- list(path)
  }

  duckfun(table_function, c(list(path), options), lazy = lazy)
}

duckfun <- function(table_function, args, ..., lazy = TRUE) {
  if (!is.list(args)) {
    cli::cli_abort("{.arg args} must be a list.")
  }
  if (length(args) == 0) {
    cli::cli_abort("{.arg args} must not be empty.")
  }

  # FIXME: For some reason, it's important to create an alias here
  con <- get_default_duckdb_connection()

  # FIXME: Provide better duckdb API
  path <- args[[1]]
  options <- args[-1]

  rel <- duckdb$rel_from_table_function(
    con,
    table_function,
    list(path),
    options
  )

  meta_rel_register_file(rel, table_function, path, options)

  out <- duckdb$rel_to_altrep(rel)
  as_ducktbl(out, .lazy = lazy)
}
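
The argument handling in `duckfile()` and `duckfun()` can be traced without a DuckDB connection. The following is a minimal sketch in base R that mirrors only the packing and unpacking steps visible in the diff; `pack_args_sketch()` and `unpack_args_sketch()` are hypothetical names introduced for illustration and are not part of duckplyr.

```r
# Hypothetical helper mirroring the packing step in duckfile():
# a path vector with more than one element is wrapped in a list so that
# it travels as one argument, then it is prepended to the options.
pack_args_sketch <- function(path, options = list()) {
  stopifnot(is.character(path))
  if (length(path) != 1) {
    path <- list(path)
  }
  c(list(path), options)
}

# Hypothetical helper mirroring the unpacking step in duckfun():
# the first element is the path, the rest are the reader options.
unpack_args_sketch <- function(args) {
  stopifnot(is.list(args), length(args) > 0)
  list(path = args[[1]], options = args[-1])
}

single <- unpack_args_sketch(pack_args_sketch("a.csv", list(delim = ",")))
multi <- unpack_args_sketch(pack_args_sketch(c("a.csv", "b.csv")))

# Inspect the structure of both results.
str(single)
str(multi)
```

In this sketch a single path stays a plain string, while several paths arrive as a one-element list holding the whole character vector; the options keep their names through the round trip.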
2 changes: 1 addition & 1 deletion R/meta.R
@@ -213,7 +213,7 @@ meta_rel_register_df <- function(rel, df) {
meta_rel_register(rel, rel_expr)
}

-meta_rel_register_file <- function(rel, path, table_function, options) {
+meta_rel_register_file <- function(rel, table_function, path, options) {
if (Sys.getenv("DUCKPLYR_META_SKIP") == "TRUE") {
return(invisible())
}
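This hunk moves `table_function` ahead of `path` in the signature of `meta_rel_register_file()`, and the call sites in `R/io-.R` and `R/io2.R` are updated to match. The base R sketch below shows why every positional call site has to change with such a reorder, and how named arguments would sidestep the issue; `register_old_sketch()` and `register_new_sketch()` are hypothetical stand-ins, not duckplyr functions.

```r
# Hypothetical stand-ins for the signature before and after the change.
register_old_sketch <- function(rel, path, table_function, options) {
  list(path = path, table_function = table_function)
}
register_new_sketch <- function(rel, table_function, path, options) {
  list(path = path, table_function = table_function)
}

# A positional call written for the old order silently swaps the values
# when it reaches the new signature.
stale <- register_new_sketch(NULL, "data.csv", "read_csv_auto", list())

# A call with named arguments gives the same result under either order.
old_named <- register_old_sketch(
  NULL, path = "data.csv", table_function = "read_csv_auto", options = list()
)
new_named <- register_new_sketch(
  NULL, path = "data.csv", table_function = "read_csv_auto", options = list()
)

print(stale$path)
print(identical(old_named, new_named))
```

Both arguments are character strings, so a stale positional call would not raise a type error on its own, which is presumably why the call sites are updated in the same commit.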
5 changes: 1 addition & 4 deletions _pkgdown.yml
@@ -12,6 +12,7 @@ reference:
- title: Using duckplyr
contents:
- ducktbl
+  - duckfile

- title: dplyr verbs

@@ -62,10 +63,6 @@ reference:
contents:
- methods_overwrite

-- title: Data ingestion
-  contents:
-  - df_from_file

- title: Configuration, telemetry, and internals
contents:
- config
3 changes: 3 additions & 0 deletions man/df_from_file.Rd

96 changes: 96 additions & 0 deletions man/duckfile.Rd

3 changes: 2 additions & 1 deletion man/ducktbl.Rd