diff --git a/docs-v2/advanced/distributed.md b/docs-v2/advanced/distributed.md
deleted file mode 100644
index 9f78b4e0f9..0000000000
--- a/docs-v2/advanced/distributed.md
+++ /dev/null
@@ -1,72 +0,0 @@
-# Distributed Computing
-
-By default, Daft runs using your local machine's resources, so your operations are limited by the CPUs, memory and GPUs available on your single local development machine.
-
-However, Daft integrates tightly with [Ray](https://www.ray.io), a distributed computing framework for running computations across a cluster of machines. Here is a snippet showing how you can connect Daft to a Ray cluster:
-
-=== "๐Ÿ Python"
-
-    ```python
-    import daft
-
-    daft.context.set_runner_ray()
-    ```
-
-By default, if no address is specified Daft will spin up a Ray cluster locally on your machine. If you are running Daft on a powerful machine (such as an AWS P3 machine, which is equipped with multiple GPUs) this is already very useful, because Daft can parallelize its execution of computation across your CPUs and GPUs. However, if you already have your own Ray cluster running remotely, you can connect Daft to it by supplying an address:
-
-=== "๐Ÿ Python"
-
-    ```python
-    daft.context.set_runner_ray(address="ray://url-to-mycluster")
-    ```
-
-For more information about the `address` keyword argument, please see the [Ray documentation on initialization](https://docs.ray.io/en/latest/ray-core/api/doc/ray.init.html).
-
-
-If you want to start a single-node Ray cluster on your local machine, you can do the following:
-
-```bash
-> pip install ray[default]
-> ray start --head --port=6379
-```
-
-This should output something like:
-
-```
-Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
-
-Local node IP: 127.0.0.1
-
---------------------
-Ray runtime started.
---------------------
-
-...
-```
-
-You can take the IP address and port and pass it to Daft:
-
-=== "๐Ÿ Python"
-
-    ```python
-    >>> import daft
-    >>> daft.context.set_runner_ray("127.0.0.1:6379")
-    DaftContext(_daft_execution_config=<daft.daft.PyDaftExecutionConfig object at 0x...>, _daft_planning_config=<daft.daft.PyDaftPlanningConfig object at 0x...>, _runner_config=_RayRunnerConfig(address='127.0.0.1:6379', max_task_backlog=None), _disallow_set_runner=True, _runner=None)
-    >>> df = daft.from_pydict({
-    ...     'text': ['hello', 'world']
-    ... })
-    2024-07-29 15:49:26,610 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
-    2024-07-29 15:49:26,622 INFO worker.py:1752 -- Connected to Ray cluster.
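-    >>> # The two INFO lines above are Ray's connection logs; exact timestamps and
-    >>> # addresses will differ on your machine. As a general rule, set the runner
-    >>> # once, before running any other Daft operations in the process.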
- >>> print(df) - โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ - โ”‚ text โ”‚ - โ”‚ --- โ”‚ - โ”‚ Utf8 โ”‚ - โ•žโ•โ•โ•โ•โ•โ•โ•โ•ก - โ”‚ hello โ”‚ - โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค - โ”‚ world โ”‚ - โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - - (Showing first 2 of 2 rows) - ``` diff --git a/docs-v2/core_concepts.md b/docs-v2/core_concepts.md index e748cfc4cf..86435df7a1 100644 --- a/docs-v2/core_concepts.md +++ b/docs-v2/core_concepts.md @@ -2330,6 +2330,6 @@ Letโ€™s turn the bytes into human-readable images using [`image.decode()`](https - [:material-memory: **Managing Memory Usage**](advanced/memory.md) - [:fontawesome-solid-equals: **Partitioning**](advanced/partitioning.md) -- [:material-distribute-vertical-center: **Distributed Computing**](advanced/distributed.md) +- [:material-distribute-vertical-center: **Distributed Computing**](distributed.md) diff --git a/docs-v2/core_concepts/aggregations.md b/docs-v2/core_concepts/aggregations.md deleted file mode 100644 index 4bb835f59d..0000000000 --- a/docs-v2/core_concepts/aggregations.md +++ /dev/null @@ -1,111 +0,0 @@ -# Aggregations and Grouping - -Some operations such as the sum or the average of a column are called **aggregations**. Aggregations are operations that reduce the number of rows in a column. - -## Global Aggregations - -An aggregation can be applied on an entire DataFrame, for example to get the mean on a specific column: - -=== "๐Ÿ Python" - ``` python - import daft - - df = daft.from_pydict({ - "class": ["a", "a", "b", "b"], - "score": [10, 20., 30., 40], - }) - - df.mean("score").show() - ``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ score โ”‚ -โ”‚ --- โ”‚ -โ”‚ Float64 โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 25 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 1 of 1 rows) -``` - -For a full list of available Dataframe aggregations, see [Aggregations](https://www.getdaft.io/projects/docs/en/stable/api_docs/dataframe.html#df-aggregations). - -Aggregations can also be mixed and matched across columns, via the `agg` method: - -=== "๐Ÿ Python" - ``` python - df.agg( - df["score"].mean().alias("mean_score"), - df["score"].max().alias("max_score"), - df["class"].count().alias("class_count"), - ).show() - ``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ mean_score โ”† max_score โ”† class_count โ”‚ -โ”‚ --- โ”† --- โ”† --- โ”‚ -โ”‚ Float64 โ”† Float64 โ”† UInt64 โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 25 โ”† 40 โ”† 4 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 1 of 1 rows) -``` - -For a full list of available aggregation expressions, see [Aggregation Expressions](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html#api-aggregation-expression) - -## Grouped Aggregations - -Aggregations can also be called on a "Grouped DataFrame". For the above example, perhaps we want to get the mean "score" not for the entire DataFrame, but for each "class". 
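-
-Conceptually, grouping buckets the rows by the values of the key column, and the aggregation then runs once per bucket. Here is a rough sketch of that computation in plain Python, purely for intuition (Daft executes the real thing in parallel over its columnar Arrow representation):
-
-``` python
-# Group scores by class, then take the mean of each group
-rows = {"class": ["a", "a", "b", "b"], "score": [10.0, 20.0, 30.0, 40.0]}
-groups: dict[str, list[float]] = {}
-for cls, score in zip(rows["class"], rows["score"]):
-    groups.setdefault(cls, []).append(score)
-means = {cls: sum(vals) / len(vals) for cls, vals in groups.items()}
-# means == {'a': 15.0, 'b': 35.0}
-```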
- -Let's run the mean of column "score" again, but this time grouped by "class": - -=== "๐Ÿ Python" - ``` python - df.groupby("class").mean("score").show() - ``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ class โ”† score โ”‚ -โ”‚ --- โ”† --- โ”‚ -โ”‚ Utf8 โ”† Float64 โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ a โ”† 15 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ b โ”† 35 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 2 of 2 rows) -``` - -To run multiple aggregations on a Grouped DataFrame, you can use the `agg` method: - -=== "๐Ÿ Python" - ``` python - df.groupby("class").agg( - df["score"].mean().alias("mean_score"), - df["score"].max().alias("max_score"), - ).show() - ``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ class โ”† mean_score โ”† max_score โ”‚ -โ”‚ --- โ”† --- โ”† --- โ”‚ -โ”‚ Utf8 โ”† Float64 โ”† Float64 โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ a โ”† 15 โ”† 20 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ b โ”† 35 โ”† 40 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 2 of 2 rows) -``` diff --git a/docs-v2/core_concepts/dataframe.md b/docs-v2/core_concepts/dataframe.md deleted file mode 100644 index 9ee1bb3320..0000000000 --- a/docs-v2/core_concepts/dataframe.md +++ /dev/null @@ -1,654 +0,0 @@ -# DataFrame - -!!! failure "todo(docs): Check that this page makes sense. Can we have a 1-1 mapping of "Common data operations that you would perform on DataFrames are: ..." to its respective section?" - -!!! failure "todo(docs): I reused some of these sections in the Quickstart (create df, execute df and view data, select rows, select columns) but the examples in the quickstart are different. Should we still keep those sections on this page?" - - -If you are coming from other DataFrame libraries such as Pandas or Polars, here are some key differences about Daft DataFrames: - -1. **Distributed:** When running in a distributed cluster, Daft splits your data into smaller "chunks" called *Partitions*. This allows Daft to process your data in parallel across multiple machines, leveraging more resources to work with large datasets. - -2. **Lazy:** When you write operations on a DataFrame, Daft doesn't execute them immediately. Instead, it creates a plan (called a query plan) of what needs to be done. This plan is optimized and only executed when you specifically request the results, which can lead to more efficient computations. - -3. **Multimodal:** Unlike traditional tables that usually contain simple data types like numbers and text, Daft DataFrames can handle complex data types in its columns. This includes things like images, audio files, or even custom Python objects. - -For a full comparison between Daft and other DataFrame Libraries, see [DataFrame Comparison](../resources/dataframe_comparison.md). - -Common data operations that you would perform on DataFrames are: - -1. [**Filtering rows:**](dataframe.md/#selecting-rows) Use [`df.where(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.where.html#daft.DataFrame.where) to keep only the rows that meet certain conditions. -2. 
**Creating new columns:** Use [`df.with_column(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html#daft.DataFrame.with_column) to add a new column based on calculations from existing ones. -3. [**Joining DataFrames:**](dataframe.md/#combining-dataframes) Use [`df.join(other_df, ...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.join.html#daft.DataFrame.join) to combine two DataFrames based on common columns. -4. [**Sorting:**](dataframe.md#reordering-rows) Use [`df.sort(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.sort.html#daft.DataFrame.sort) to arrange your data based on values in one or more columns. -5. **Grouping and aggregating:** Use [`df.groupby(...).agg(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.groupby.html#daft.DataFrame.groupby) to summarize your data by groups. - -## Creating a DataFrame - -!!! tip "See Also" - - [Reading/Writing Data](read_write.md) - a more in-depth guide on various options for reading/writing data to/from Daft DataFrames from in-memory data (Python, Arrow), files (Parquet, CSV, JSON), SQL Databases and Data Catalogs - -Let's create our first Dataframe from a Python dictionary of columns. - -=== "๐Ÿ Python" - - ```python - import daft - - df = daft.from_pydict({ - "A": [1, 2, 3, 4], - "B": [1.5, 2.5, 3.5, 4.5], - "C": [True, True, False, False], - "D": [None, None, None, None], - }) - ``` - -Examine your Dataframe by printing it: - -``` -df -``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ A โ”† B โ”† C โ”† D โ”‚ -โ”‚ --- โ”† --- โ”† --- โ”† --- โ”‚ -โ”‚ Int64 โ”† Float64 โ”† Boolean โ”† Null โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 1 โ”† 1.5 โ”† true โ”† None โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 2 โ”† 2.5 โ”† true โ”† None โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 3 โ”† 3.5 โ”† false โ”† None โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 4 โ”† 4.5 โ”† false โ”† None โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 4 of 4 rows) -``` - -Congratulations - you just created your first DataFrame! It has 4 columns, "A", "B", "C", and "D". Let's try to select only the "A", "B", and "C" columns: - -=== "๐Ÿ Python" - ``` python - df = df.select("A", "B", "C") - df - ``` - -=== "โš™๏ธ SQL" - ```python - df = daft.sql("SELECT A, B, C FROM df") - df - ``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ A โ”† B โ”† C โ”‚ -โ”‚ --- โ”† --- โ”† --- โ”‚ -โ”‚ Int64 โ”† Float64 โ”† Boolean โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(No data to display: Dataframe not materialized) -``` - -But wait - why is it printing the message `(No data to display: Dataframe not materialized)` and where are the rows of each column? 
-
-## Executing DataFrame and Viewing Data
-
-The reason that our DataFrame currently does not display its rows is that Daft DataFrames are **lazy**. This just means that Daft DataFrames will defer all their work until you tell them to execute.
-
-In this case, Daft is just deferring the work required to read the data and select columns; however, in practice this laziness can be very useful for helping Daft optimize your queries before execution!
-
-!!! info "Info"
-
-    When you call methods on a Daft Dataframe, it defers the work by adding to an internal "plan". You can examine the current plan of a DataFrame by calling [`df.explain()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.explain.html#daft.DataFrame.explain)!
-
-    Passing the `show_all=True` argument will show you the plan after Daft applies its query optimizations and the physical (lower-level) plan.
-
-    ```
-    Plan Output
-
-    == Unoptimized Logical Plan ==
-
-    * Project: col(A), col(B), col(C)
-    |
-    * Source:
-    |   Number of partitions = 1
-    |   Output schema = A#Int64, B#Float64, C#Boolean, D#Null
-
-
-    == Optimized Logical Plan ==
-
-    * Project: col(A), col(B), col(C)
-    |
-    * Source:
-    |   Number of partitions = 1
-    |   Output schema = A#Int64, B#Float64, C#Boolean, D#Null
-
-
-    == Physical Plan ==
-
-    * Project: col(A), col(B), col(C)
-    |   Clustering spec = { Num partitions = 1 }
-    |
-    * InMemoryScan:
-    |   Schema = A#Int64, B#Float64, C#Boolean, D#Null,
-    |   Size bytes = 65,
-    |   Clustering spec = { Num partitions = 1 }
-    ```
-
-We can tell Daft to execute our DataFrame and store the results in-memory using [`df.collect()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.collect.html#daft.DataFrame.collect):
-
-=== "๐Ÿ Python"
-    ``` python
-    df.collect()
-    df
-    ```
-
-``` {title="Output"}
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ A     โ”† B       โ”† C       โ”† D    โ”‚
-โ”‚ ---   โ”† ---     โ”† ---     โ”† ---  โ”‚
-โ”‚ Int64 โ”† Float64 โ”† Boolean โ”† Null โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 1     โ”† 1.5     โ”† true    โ”† None โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 2     โ”† 2.5     โ”† true    โ”† None โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 3     โ”† 3.5     โ”† false   โ”† None โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 4     โ”† 4.5     โ”† false   โ”† None โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-
-(Showing first 4 of 4 rows)
-```
-
-Now your DataFrame object `df` is **materialized** - Daft has executed all the steps required to compute the results, and has cached the results in memory so that it can display this preview.
-
-Any subsequent operations on `df` will avoid recomputations, and just use this materialized result!
-
-### When should I materialize my DataFrame?
-
-If you "eagerly" call [`df.collect()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.collect.html#daft.DataFrame.collect) immediately on every DataFrame, you may run into issues:
-
-1. If data is too large at any step, materializing all of it may cause memory issues
-2.
Optimizations are not possible since we cannot "predict future operations" - -However, data science is all about experimentation and trying different things on the same data. This means that materialization is crucial when working interactively with DataFrames, since it speeds up all subsequent experimentation on that DataFrame. - -We suggest materializing DataFrames using [`df.collect()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.collect.html#daft.DataFrame.collect) when they contain expensive operations (e.g. sorts or expensive function calls) and have to be called multiple times by downstream code: - -=== "๐Ÿ Python" - ``` python - df = df.sort("A") # expensive sort - df.collect() # materialize the DataFrame - - # All subsequent work on df avoids recomputing previous steps - df.sum("B").show() - df.mean("B").show() - df.with_column("try_this", df["A"] + 1).show(5) - ``` - -=== "โš™๏ธ SQL" - ```python - df = daft.sql("SELECT * FROM df ORDER BY A") - df.collect() - - # All subsequent work on df avoids recomputing previous steps - daft.sql("SELECT sum(B) FROM df").show() - daft.sql("SELECT mean(B) FROM df").show() - daft.sql("SELECT *, (A + 1) AS try_this FROM df").show(5) - ``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ B โ”‚ -โ”‚ --- โ”‚ -โ”‚ Float64 โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 12 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 1 of 1 rows) - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ B โ”‚ -โ”‚ --- โ”‚ -โ”‚ Float64 โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 3 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 1 of 1 rows) - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ A โ”† B โ”† C โ”† try_this โ”‚ -โ”‚ --- โ”† --- โ”† --- โ”† --- โ”‚ -โ”‚ Int64 โ”† Float64 โ”† Boolean โ”† Int64 โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 1 โ”† 1.5 โ”† true โ”† 2 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 2 โ”† 2.5 โ”† true โ”† 3 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 3 โ”† 3.5 โ”† false โ”† 4 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 4 โ”† 4.5 โ”† false โ”† 5 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ - -(Showing first 4 of 4 rows) -``` - -In many other cases however, there are better options than materializing your entire DataFrame with [`df.collect()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.collect.html#daft.DataFrame.collect): - -1. **Peeking with df.show(N)**: If you only want to "peek" at the first few rows of your data for visualization purposes, you can use [`df.show(N)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.show.html#daft.DataFrame.show), which processes and shows only the first `N` rows. -2. **Writing to disk**: The `df.write_*` methods will process and write your data to disk per-partition, avoiding materializing it all in memory at once. -3. 
**Pruning data**: You can materialize your DataFrame after performing a [`df.limit()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.limit.html#daft.DataFrame.limit), [`df.where()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.where.html#daft.DataFrame.where) or [`df.select()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.select.html#daft.DataFrame.select) operation, which processes your data or prunes it down to a smaller size.
-
-## Schemas and Types
-
-Notice also that when we printed our DataFrame, Daft displayed its **schema**. Each column of your DataFrame has a **name** and a **type**, and all data in that column will adhere to that type!
-
-Daft can display your DataFrame's schema without materializing it. Under the hood, it performs intelligent sampling of your data to determine the appropriate schema, and if you make any modifications to your DataFrame it can infer the resulting types based on the operation.
-
-!!! note "Note"
-
-    Under the hood, Daft represents data in the [Apache Arrow](https://arrow.apache.org/) format, which allows it to efficiently represent and work on data using high-performance kernels which are written in Rust.
-
-## Running Computation with Expressions
-
-To run computations on data in our DataFrame, we use Expressions.
-
-The following statement will [`df.show()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.show.html#daft.DataFrame.show) a DataFrame that has only one column - the column `A` from our original DataFrame, but with every row incremented by 1.
-
-=== "๐Ÿ Python"
-    ``` python
-    df.select(df["A"] + 1).show()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    daft.sql("SELECT A + 1 FROM df").show()
-    ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ A     โ”‚
-โ”‚ ---   โ”‚
-โ”‚ Int64 โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 2     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 3     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 4     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 5     โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-
-(Showing first 4 of 4 rows)
-```
-
-!!! info "Info"
-
-    A common pattern is to create a new column using [`DataFrame.with_column`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html):
-
-    === "๐Ÿ Python"
-        ``` python
-        # Creates a new column named "foo" which takes on values
-        # of column "A" incremented by 1
-        df = df.with_column("foo", df["A"] + 1)
-        df.show()
-        ```
-
-    === "โš™๏ธ SQL"
-        ```python
-        # Creates a new column named "foo" which takes on values
-        # of column "A" incremented by 1
-        df = daft.sql("SELECT *, A + 1 AS foo FROM df")
-        df.show()
-        ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ A     โ”† B       โ”† C       โ”† foo   โ”‚
-โ”‚ ---   โ”† ---     โ”† ---     โ”† ---   โ”‚
-โ”‚ Int64 โ”† Float64 โ”† Boolean โ”† Int64 โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 1     โ”† 1.5     โ”† true    โ”† 2     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 2     โ”† 2.5     โ”† true    โ”† 3     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 3     โ”† 3.5     โ”† false   โ”† 4     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 4     โ”† 4.5     โ”† false   โ”† 5     โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-
-(Showing first 4 of 4 rows)
-```
-
-Congratulations, you have just written your first **Expression**: `df["A"] + 1`! Expressions are a powerful way of describing computation on columns. For more details, check out the next section on [Expressions](expressions.md).
-
-## Selecting Rows
-
-We can limit the rows to the first ``N`` rows using [`df.limit(N)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.limit.html#daft.DataFrame.limit):
-
-=== "๐Ÿ Python"
-    ``` python
-    df = daft.from_pydict({
-        "A": [1, 2, 3, 4, 5],
-        "B": [6, 7, 8, 9, 10],
-    })
-
-    df.limit(3).show()
-    ```
-
-``` {title="Output"}
-
-+---------+---------+
-| A       | B       |
-| Int64   | Int64   |
-+=========+=========+
-| 1       | 6       |
-+---------+---------+
-| 2       | 7       |
-+---------+---------+
-| 3       | 8       |
-+---------+---------+
-(Showing first 3 rows)
-```
-
-We can also filter rows using [`df.where()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.where.html#daft.DataFrame.where), which takes as input a Logical Expression predicate:
-
-=== "๐Ÿ Python"
-    ``` python
-    df.where(df["A"] > 3).show()
-    ```
-
-``` {title="Output"}
-
-+---------+---------+
-| A       | B       |
-| Int64   | Int64   |
-+=========+=========+
-| 4       | 9       |
-+---------+---------+
-| 5       | 10      |
-+---------+---------+
-(Showing first 2 rows)
-```
-
-## Selecting Columns
-
-Select specific columns in a DataFrame using [`df.select()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.select.html#daft.DataFrame.select), which also takes Expressions as an input.
-
-=== "๐Ÿ Python"
-    ``` python
-    import daft
-
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
-
-    df.select("A").show()
-    ```
-
-``` {title="Output"}
-
-+---------+
-| A       |
-| Int64   |
-+=========+
-| 1       |
-+---------+
-| 2       |
-+---------+
-| 3       |
-+---------+
-(Showing first 3 rows)
-```
-
-A useful alias for [`df.select()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.select.html#daft.DataFrame.select) is indexing a DataFrame with a list of column names or Expressions:
-
-=== "๐Ÿ Python"
-    ``` python
-    df[["A", "B"]].show()
-    ```
-
-``` {title="Output"}
-
-+---------+---------+
-| A       | B       |
-| Int64   | Int64   |
-+=========+=========+
-| 1       | 4       |
-+---------+---------+
-| 2       | 5       |
-+---------+---------+
-| 3       | 6       |
-+---------+---------+
-(Showing first 3 rows)
-```
-
-Sometimes, it may be useful to exclude certain columns from a DataFrame. This can be done with [`df.exclude()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.exclude.html#daft.DataFrame.exclude):
-
-=== "๐Ÿ Python"
-    ``` python
-    df.exclude("A").show()
-    ```
-
-```{title="Output"}
-
-+---------+
-| B       |
-| Int64   |
-+=========+
-| 4       |
-+---------+
-| 5       |
-+---------+
-| 6       |
-+---------+
-(Showing first 3 rows)
-```
-
-Adding a new column can be achieved with [`df.with_column()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html#daft.DataFrame.with_column):
-
-=== "๐Ÿ Python"
-    ``` python
-    df.with_column("C", df["A"] + df["B"]).show()
-    ```
-
-``` {title="Output"}
-
-+---------+---------+---------+
-| A       | B       | C       |
-| Int64   | Int64   | Int64   |
-+=========+=========+=========+
-| 1       | 4       | 5       |
-+---------+---------+---------+
-| 2       | 5       | 7       |
-+---------+---------+---------+
-| 3       | 6       | 9       |
-+---------+---------+---------+
-(Showing first 3 rows)
-```
-
-### Selecting Columns Using Wildcards
-
-We can select multiple columns at once using wildcards. The expression [`col("*")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.col.html#daft.col) selects every column in a DataFrame, and you can operate on this expression in the same way as a single column:
-
-=== "๐Ÿ Python"
-    ``` python
-    from daft import col
-
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
-    df.select(col("*") * 3).show()
-    ```
-
-``` {title="Output"}
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ A     โ”† B     โ”‚
-โ”‚ ---   โ”† ---   โ”‚
-โ”‚ Int64 โ”† Int64 โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 3     โ”† 12    โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 6     โ”† 15    โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 9     โ”† 18    โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-```
-
-We can also select multiple columns within structs using `col("struct.*")`:
-
-=== "๐Ÿ Python"
-    ``` python
-    df = daft.from_pydict({
-        "A": [
-            {"B": 1, "C": 2},
-            {"B": 3, "C": 4}
-        ]
-    })
-    df.select(col("A.*")).show()
-    ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ B     โ”† C     โ”‚
-โ”‚ ---   โ”† ---   โ”‚
-โ”‚ Int64 โ”† Int64 โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 1     โ”† 2     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 3     โ”† 4     โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-```
-
-Under the hood, wildcards work by finding all of the columns that match, then copying the expression several times and replacing the wildcard.
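-
-For instance, given integer columns `A` and `B`, `col("*") * 3` behaves as if each matching column had been written out by hand:
-
-``` python
-import daft
-from daft import col
-
-df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
-
-# These two selections are equivalent: the wildcard expands into one copy per column
-df.select(col("*") * 3).show()
-df.select(col("A") * 3, col("B") * 3).show()
-```
-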
This means that there are some caveats: - -* Only one wildcard is allowed per expression tree. This means that `col("*") + col("*")` and similar expressions do not work. -* Be conscious about duplicated column names. Any code like `df.select(col("*"), col("*") + 3)` will not work because the wildcards expand into the same column names. - - For the same reason, `col("A") + col("*")` will not work because the name on the left-hand side is inherited, meaning all the output columns are named `A`, causing an error if there is more than one. - However, `col("*") + col("A")` will work fine. - -## Combining DataFrames - -Two DataFrames can be column-wise joined using [`df.join()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.join.html#daft.DataFrame.join). - -This requires a "join key", which can be supplied as the `on` argument if both DataFrames have the same name for their key columns, or the `left_on` and `right_on` argument if the key column has different names in each DataFrame. - -Daft also supports multi-column joins if you have a join key comprising of multiple columns! - -=== "๐Ÿ Python" - ``` python - df1 = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]}) - df2 = daft.from_pydict({"A": [1, 2, 3], "C": [7, 8, 9]}) - - df1.join(df2, on="A").show() - ``` - -``` {title="Output"} - -+---------+---------+---------+ -| A | B | C | -| Int64 | Int64 | Int64 | -+=========+=========+=========+ -| 1 | 4 | 7 | -+---------+---------+---------+ -| 2 | 5 | 8 | -+---------+---------+---------+ -| 3 | 6 | 9 | -+---------+---------+---------+ -(Showing first 3 rows) -``` - -## Reordering Rows - -Rows in a DataFrame can be reordered based on some column using [`df.sort()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.sort.html#daft.DataFrame.sort). Daft also supports multi-column sorts for sorting on multiple columns at once. - -=== "๐Ÿ Python" - ``` python - df = daft.from_pydict({ - "A": [1, 2, 3], - "B": [6, 7, 8], - }) - - df.sort("A", desc=True).show() - ``` - -```{title="Output"} - -+---------+---------+ -| A | B | -| Int64 | Int64 | -+=========+=========+ -| 3 | 8 | -+---------+---------+ -| 2 | 7 | -+---------+---------+ -| 1 | 6 | -+---------+---------+ -(Showing first 3 rows) -``` - -## Exploding Columns - -The [`df.explode()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.explode.html#daft.DataFrame.explode) method can be used to explode a column containing a list of values into multiple rows. All other rows will be **duplicated**. - -=== "๐Ÿ Python" - ``` python - df = daft.from_pydict({ - "A": [1, 2, 3], - "B": [[1, 2, 3], [4, 5, 6], [7, 8, 9]], - }) - - df.explode("B").show() - ``` - -``` {title="Output"} - -+---------+---------+ -| A | B | -| Int64 | Int64 | -+=========+=========+ -| 1 | 1 | -+---------+---------+ -| 1 | 2 | -+---------+---------+ -| 1 | 3 | -+---------+---------+ -| 2 | 4 | -+---------+---------+ -| 2 | 5 | -+---------+---------+ -| 2 | 6 | -+---------+---------+ -| 3 | 7 | -+---------+---------+ -| 3 | 8 | -+---------+---------+ -(Showing first 8 rows) -``` - - - diff --git a/docs-v2/core_concepts/datatypes.md b/docs-v2/core_concepts/datatypes.md deleted file mode 100644 index f932623806..0000000000 --- a/docs-v2/core_concepts/datatypes.md +++ /dev/null @@ -1,96 +0,0 @@ -# DataTypes - -All columns in a Daft DataFrame have a DataType (also often abbreviated as `dtype`). 
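-
-For example, printing any DataFrame displays the dtype of each column in its header:
-
-``` python
-import daft
-
-df = daft.from_pydict({"A": [1, 2, 3], "B": ["x", "y", "z"]})
-print(df)  # the header shows column A as Int64 and column B as Utf8
-```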
-
-All elements of a column are of the same dtype, or they can be the special Null value (indicating a missing value).
-
-Daft provides simple DataTypes that are ubiquitous in many DataFrames such as numbers, strings and dates - all the way up to more complex types like tensors and images.
-
-!!! tip "Tip"
-
-    For a full overview on all the DataTypes that Daft supports, see the [DataType API Reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html).
-
-
-## Numeric DataTypes
-
-Numeric DataTypes allow Daft to represent numbers. These numbers can differ in terms of the number of bits used to represent them (8, 16, 32 or 64 bits) and the semantic meaning of those bits (float vs integer vs unsigned integers).
-
-Examples:
-
-1. [`DataType.int8()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.int8): represents an 8-bit signed integer (-128 to 127)
-2. [`DataType.float32()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.float32): represents a 32-bit float (a float number with about 7 decimal digits of precision)
-
-Columns/expressions with these datatypes can be operated on with many numeric expressions such as `+` and `*`.
-
-See also: [Numeric Expressions](https://www.getdaft.io/projects/docs/en/stable/user_guide/expressions.html#userguide-numeric-expressions)
-
-## Logical DataTypes
-
-The [`Boolean`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.bool) DataType represents boolean values: `True`, `False` or the special `Null`.
-
-Columns/expressions with this dtype can be operated on using logical expressions such as ``&`` and [`.if_else()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.if_else.html#daft.Expression.if_else).
-
-See also: [Logical Expressions](https://www.getdaft.io/projects/docs/en/stable/user_guide/expressions.html#userguide-logical-expressions)
-
-## String Types
-
-Daft has string types, which represent a variable-length string of characters.
-
-As a convenience, string types also support the `+` Expression, which has been overloaded to support concatenation of elements between two [`DataType.string()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.string) columns.
-
-1. [`DataType.string()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.string): represents a string of UTF-8 characters
-2. [`DataType.binary()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.binary): represents a string of bytes
-
-See also: [String Expressions](https://www.getdaft.io/projects/docs/en/stable/user_guide/expressions.html#userguide-string-expressions)
-
-## Temporal
-
-Temporal dtypes represent time-related data.
-
-Examples:
-
-1. [`DataType.date()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.date): represents a Date (year, month and day)
-2. [`DataType.timestamp()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.timestamp): represents a Timestamp (a particular instant in time)
-
-See also: [Temporal Expressions](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html#api-expressions-temporal)
-
-## Nested
-
-Nested DataTypes wrap other DataTypes, allowing you to compose types into complex data structures.
-
-Examples:
-
-1. [`DataType.list(child_dtype)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.list): represents a list where each element is of the child `dtype`
-2. [`DataType.struct({"field_name": child_dtype})`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.struct): represents a structure that has children `dtype`s, each mapped to a field name
-
-## Python
-
-The [`DataType.python()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.python) dtype represents items that are Python objects.
-
-!!! warning "Warning"
-
-    Daft does not impose any invariants about what *Python types* these objects are. To Daft, these are just generic Python objects!
-
-Python is AWESOME because it's so flexible, but it's also slow and memory inefficient! Thus we recommend:
-
-1. **Cast early!**: Cast your Python data into native Daft DataTypes if possible - this results in much more efficient downstream data serialization and computation.
-2. **Use Python UDFs**: If there is no suitable Daft representation for your Python objects, use Python UDFs to process your Python data and extract the relevant data to be returned as native Daft DataTypes!
-
-!!! note "Note"
-
-    If you work with Python classes for a generalizable use-case (e.g. documents, protobufs), it may be that these types are good candidates for "promotion" into a native Daft type! Please get in touch with the Daft team and we would love to work together on building your type into canonical Daft types.
-
-## Complex Types
-
-Daft supports many more interesting complex DataTypes, for example:
-
-* [`DataType.tensor()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.tensor): Multi-dimensional (potentially uniformly-shaped) tensors of data
-* [`DataType.embedding()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.embedding): Lower-dimensional vector representation of data (e.g. words)
-* [`DataType.image()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.image): NHWC images
-
-Daft abstracts away the in-memory representation of your data and provides kernels for many common operations on top of these data types. For supported image operations see the [image expressions API reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html#api-expressions-images).
-
-For more complex algorithms, you can also drop into a Python UDF to process this data using your custom Python libraries.
-
-Please add suggestions for new DataTypes to our [Github Discussions page](https://github.com/Eventual-Inc/Daft/discussions)!
diff --git a/docs-v2/core_concepts/expressions.md b/docs-v2/core_concepts/expressions.md
deleted file mode 100644
index 81ba19bfc6..0000000000
--- a/docs-v2/core_concepts/expressions.md
+++ /dev/null
@@ -1,744 +0,0 @@
-# Expressions
-
-Expressions are how you can express computations that should be run over columns of data.
-
-## Creating Expressions
-
-### Referring to a column in a DataFrame
-
-Most commonly you will be creating expressions by using the [`daft.col`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.col.html#daft.col) function.
-
-=== "๐Ÿ Python"
-    ``` python
-    # Refers to column "A"
-    daft.col("A")
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    daft.sql_expr("A")
-    ```
-
-``` {title="Output"}
-
-col(A)
-```
-
-The above code creates an Expression that refers to a column named `"A"`.
-
-### Using SQL
-
-Daft can also parse valid SQL as expressions.
-
-=== "โš™๏ธ SQL"
-    ```python
-    daft.sql_expr("A + 1")
-    ```
-
-``` {title="Output"}
-
-col(A) + lit(1)
-```
-
-The above code will create an expression representing "the column named 'A' incremented by 1". For many APIs, [`sql_expr`](https://www.getdaft.io/projects/docs/en/stable/api_docs/sql.html#daft.sql_expr) will actually be applied for you as syntactic sugar!
-
-### Literals
-
-You may often find yourself needing to hardcode a "single value" as an expression. Daft provides a [`lit()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.lit.html) helper to do so:
-
-=== "๐Ÿ Python"
-    ``` python
-    from daft import lit
-
-    # Refers to an expression which always evaluates to 42
-    lit(42)
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    # Refers to an expression which always evaluates to 42
-    daft.sql_expr("42")
-    ```
-
-```{title="Output"}
-
-lit(42)
-```
-
-This special [`lit()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.lit.html) expression we just created always evaluates to the value `42`.
-
-### Wildcard Expressions
-
-You can create expressions on multiple columns at once using a wildcard. The expression [`col("*")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.col.html#daft.col) selects every column in a DataFrame, and you can operate on this expression in the same way as a single column:
-
-=== "๐Ÿ Python"
-    ``` python
-    import daft
-    from daft import col
-
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
-    df.select(col("*") * 3).show()
-    ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ A     โ”† B     โ”‚
-โ”‚ ---   โ”† ---   โ”‚
-โ”‚ Int64 โ”† Int64 โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 3     โ”† 12    โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 6     โ”† 15    โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 9     โ”† 18    โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-```
-
-Wildcards also work very well for accessing all members of a struct column:
-
-=== "๐Ÿ Python"
-    ``` python
-    import daft
-    from daft import col
-
-    df = daft.from_pydict({
-        "person": [
-            {"name": "Alice", "age": 30},
-            {"name": "Bob", "age": 25},
-            {"name": "Charlie", "age": 35}
-        ]
-    })
-
-    # Access all fields of the 'person' struct
-    df.select(col("person.*")).show()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    import daft
-
-    df = daft.from_pydict({
-        "person": [
-            {"name": "Alice", "age": 30},
-            {"name": "Bob", "age": 25},
-            {"name": "Charlie", "age": 35}
-        ]
-    })
-
-    # Access all fields of the 'person' struct using SQL
-    daft.sql("SELECT person.* FROM df").show()
-    ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ name     โ”† age   โ”‚
-โ”‚ ---      โ”† ---   โ”‚
-โ”‚ String   โ”† Int64 โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ Alice    โ”† 30    โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ Bob      โ”† 25    โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ Charlie  โ”† 35    โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-```
-
-In this example, we use the wildcard `*` to access all fields of the `person` struct column.
This is equivalent to selecting each field individually (`person.name`, `person.age`), but is more concise and flexible, especially when dealing with structs that have many fields. - - - -## Composing Expressions - -### Numeric Expressions - -Since column "A" is an integer, we can run numeric computation such as addition, division and checking its value. Here are some examples where we create new columns using the results of such computations: - -=== "๐Ÿ Python" - ``` python - # Add 1 to each element in column "A" - df = df.with_column("A_add_one", df["A"] + 1) - - # Divide each element in column A by 2 - df = df.with_column("A_divide_two", df["A"] / 2.) - - # Check if each element in column A is more than 1 - df = df.with_column("A_gt_1", df["A"] > 1) - - df.collect() - ``` - -=== "โš™๏ธ SQL" - ```python - df = daft.sql(""" - SELECT - *, - A + 1 AS A_add_one, - A / 2.0 AS A_divide_two, - A > 1 AS A_gt_1 - FROM df - """) - df.collect() - ``` - -```{title="Output"} - -+---------+-------------+----------------+-----------+ -| A | A_add_one | A_divide_two | A_gt_1 | -| Int64 | Int64 | Float64 | Boolean | -+=========+=============+================+===========+ -| 1 | 2 | 0.5 | false | -+---------+-------------+----------------+-----------+ -| 2 | 3 | 1 | true | -+---------+-------------+----------------+-----------+ -| 3 | 4 | 1.5 | true | -+---------+-------------+----------------+-----------+ -(Showing first 3 of 3 rows) -``` - -Notice that the returned types of these operations are also well-typed according to their input types. For example, calling ``df["A"] > 1`` returns a column of type [`Boolean`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.bool). - -Both the [`Float`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.float32) and [`Int`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType.int16) types are numeric types, and inherit many of the same arithmetic Expression operations. You may find the full list of numeric operations in the [Expressions API Reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html#numeric). - -### String Expressions - -Daft also lets you have columns of strings in a DataFrame. Let's take a look! - -=== "๐Ÿ Python" - ``` python - df = daft.from_pydict({"B": ["foo", "bar", "baz"]}) - df.show() - ``` - -``` {title="Output"} - -+--------+ -| B | -| Utf8 | -+========+ -| foo | -+--------+ -| bar | -+--------+ -| baz | -+--------+ -(Showing first 3 rows) -``` - -Unlike the numeric types, the string type does not support arithmetic operations such as `*` and `/`. The one exception to this is the `+` operator, which is overridden to concatenate two string expressions as is commonly done in Python. Let's try that! - -=== "๐Ÿ Python" - ``` python - df = df.with_column("B2", df["B"] + "foo") - df.show() - ``` - -=== "โš™๏ธ SQL" - ```python - df = daft.sql("SELECT *, B + 'foo' AS B2 FROM df") - df.show() - ``` - -``` {title="Output"} - -+--------+--------+ -| B | B2 | -| Utf8 | Utf8 | -+========+========+ -| foo | foofoo | -+--------+--------+ -| bar | barfoo | -+--------+--------+ -| baz | bazfoo | -+--------+--------+ -(Showing first 3 rows) -``` - -There are also many string operators that are accessed through a separate [`.str.*`](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html#strings) "method namespace". 
-
-For example, to check if each element in column "B" contains the substring "a", we can use the [`.str.contains`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.str.contains.html#daft.Expression.str.contains) method:
-
-=== "๐Ÿ Python"
-    ``` python
-    df = df.with_column("B2_contains_B", df["B2"].str.contains(df["B"]))
-    df.show()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    df = daft.sql("SELECT *, contains(B2, B) AS B2_contains_B FROM df")
-    df.show()
-    ```
-
-``` {title="Output"}
-
-+--------+--------+-----------------+
-| B      | B2     | B2_contains_B   |
-| Utf8   | Utf8   | Boolean         |
-+========+========+=================+
-| foo    | foofoo | true            |
-+--------+--------+-----------------+
-| bar    | barfoo | true            |
-+--------+--------+-----------------+
-| baz    | bazfoo | true            |
-+--------+--------+-----------------+
-(Showing first 3 rows)
-```
-
-You may find a full list of string operations in the [Expressions API Reference](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html).
-
-### URL Expressions
-
-One special case of a String column you may find yourself working with is a column of URL strings.
-
-Daft provides the [`.url.*`](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html) method namespace with functionality for working with URL strings. For example, to download data from URLs:
-
-=== "๐Ÿ Python"
-    ``` python
-    df = daft.from_pydict({
-        "urls": [
-            "https://www.google.com",
-            "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
-        ],
-    })
-    df = df.with_column("data", df["urls"].url.download())
-    df.collect()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    df = daft.from_pydict({
-        "urls": [
-            "https://www.google.com",
-            "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
-        ],
-    })
-    df = daft.sql("""
-        SELECT
-            urls,
-            url_download(urls) AS data
-        FROM df
-    """)
-    df.collect()
-    ```
-
-``` {title="Output"}
-
-+----------------------+------------------------------------+
-| urls                 | data                               |
-| Utf8                 | Binary                             |
-+======================+====================================+
-| https://www.google.c | b'<!doctype html>...               |
-+----------------------+------------------------------------+
-| s3://daft-public-dat | b'\xff\xd8\xff\xe0\x00\x10JFIF...  |
-+----------------------+------------------------------------+
-
-(Showing first 2 of 2 rows)
-```
-
-### If Else Pattern
-
-The [`.if_else()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.if_else.html#daft.Expression.if_else) method chooses between the values of two expressions based on a logical (Boolean) expression:
-
-=== "๐Ÿ Python"
-    ``` python
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [0, 2, 4]})
-
-    # Keep the value from column A if it is bigger, otherwise keep the value from B
-    df = df.with_column(
-        "A_if_bigger_else_B",
-        (df["A"] > df["B"]).if_else(df["A"], df["B"]),
-    )
-
-    df.collect()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [0, 2, 4]})
-
-    df = daft.sql("""
-        SELECT
-            A,
-            B,
-            CASE
-                WHEN A > B THEN A
-                ELSE B
-            END AS A_if_bigger_else_B
-        FROM df
-    """)
-
-    df.collect()
-    ```
-
-```{title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ A     โ”† B     โ”† A_if_bigger_else_B โ”‚
-โ”‚ ---   โ”† ---   โ”† ---                โ”‚
-โ”‚ Int64 โ”† Int64 โ”† Int64              โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 1     โ”† 0     โ”† 1                  โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 2     โ”† 2     โ”† 2                  โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 3     โ”† 4     โ”† 4                  โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-
-(Showing first 3 of 3 rows)
-```
-
-This is a useful expression for cleaning your data!
-
-
-### Temporal Expressions
-
-Daft provides rich support for working with temporal data types like Timestamp and Duration. Let's explore some common temporal operations:
-
-#### Basic Temporal Operations
-
-You can perform arithmetic operations with timestamps and durations, such as adding a duration to a timestamp or calculating the duration between two timestamps:
-
-=== "๐Ÿ Python"
-    ``` python
-    import datetime
-
-    df = daft.from_pydict({
-        "timestamp": [
-            datetime.datetime(2021, 1, 1, 0, 1, 1),
-            datetime.datetime(2021, 1, 1, 0, 1, 59),
-            datetime.datetime(2021, 1, 1, 0, 2, 0),
-        ]
-    })
-
-    # Add 10 seconds to each timestamp
-    df = df.with_column(
-        "plus_10_seconds",
-        df["timestamp"] + datetime.timedelta(seconds=10)
-    )
-
-    df.show()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    import datetime
-
-    df = daft.from_pydict({
-        "timestamp": [
-            datetime.datetime(2021, 1, 1, 0, 1, 1),
-            datetime.datetime(2021, 1, 1, 0, 1, 59),
-            datetime.datetime(2021, 1, 1, 0, 2, 0),
-        ]
-    })
-
-    # Add 10 seconds to each timestamp
-    df = daft.sql("""
-        SELECT
-            timestamp,
-            timestamp + INTERVAL '10 seconds' as plus_10_seconds
-        FROM df
-    """)
-
-    df.show()
-    ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ timestamp                     โ”† plus_10_seconds               โ”‚
-โ”‚ ---                           โ”† ---                           โ”‚
-โ”‚ Timestamp(Microseconds, None) โ”† Timestamp(Microseconds, None) โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 2021-01-01 00:01:01           โ”† 2021-01-01 00:01:11           โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 2021-01-01 00:01:59           โ”† 2021-01-01 00:02:09           โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 2021-01-01 00:02:00           โ”† 2021-01-01 00:02:10           โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-```
-
-#### Temporal Component Extraction
-
-The [`.dt.*`](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html#temporal) method namespace provides extraction methods for the components of a timestamp, such as year, month, day, hour, minute, and second:
-
-=== "๐Ÿ Python"
-    ``` python
-    df = daft.from_pydict({
-        "timestamp": [
-            datetime.datetime(2021, 1, 1, 0, 1, 1),
-            datetime.datetime(2021, 1, 1, 0, 1, 59),
-            datetime.datetime(2021, 1, 1, 0, 2, 0),
-        ]
-    })
-
-    # Extract year, month, day, hour, minute, and second from the timestamp
-    df = df.with_columns({
-        "year": df["timestamp"].dt.year(),
-        "month": df["timestamp"].dt.month(),
-        "day": df["timestamp"].dt.day(),
-        "hour": df["timestamp"].dt.hour(),
-        "minute": df["timestamp"].dt.minute(),
-        "second": df["timestamp"].dt.second()
-    })
-
-    df.show()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    df = daft.from_pydict({
-        "timestamp": [
-            datetime.datetime(2021, 1, 1, 0, 1, 1),
-            datetime.datetime(2021, 1, 1, 0, 1, 59),
-            datetime.datetime(2021, 1, 1, 0, 2, 0),
-        ]
-    })
-
-    # Extract year, month, day, hour, minute, and second from the timestamp
-    df = daft.sql("""
-        SELECT
-            timestamp,
-            year(timestamp) as year,
-            month(timestamp) as month,
-            day(timestamp) as day,
-            hour(timestamp) as hour,
-            minute(timestamp) as minute,
-            second(timestamp) as second
-        FROM df
-    """)
-
-    df.show()
-    ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ timestamp                     โ”† year  โ”† month  โ”† day    โ”† hour   โ”† minute โ”† second โ”‚
-โ”‚ ---                           โ”† ---   โ”† ---    โ”† ---    โ”† ---    โ”† ---    โ”† ---    โ”‚
-โ”‚ Timestamp(Microseconds, None) โ”† Int32 โ”† UInt32 โ”† UInt32 โ”† UInt32 โ”† UInt32 โ”† UInt32 โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ก
-โ”‚ 2021-01-01 00:01:01           โ”† 2021  โ”† 1      โ”† 1      โ”† 0      โ”† 1      โ”† 1      โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 2021-01-01 00:01:59           โ”† 2021  โ”† 1      โ”† 1      โ”† 0      โ”† 1      โ”† 59     โ”‚
-โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
-โ”‚ 2021-01-01 00:02:00           โ”† 2021  โ”† 1      โ”† 1      โ”† 0      โ”† 2      โ”† 0      โ”‚
-โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
-```
-
-#### Time Zone Operations
-
-You can parse strings as timestamps with time zones and convert between different time zones:
-
-=== "๐Ÿ Python"
-    ``` python
-    df = daft.from_pydict({
-        "timestamp_str": [
-            "2021-01-01 00:00:00.123 +0800",
-            "2021-01-02 12:30:00.456 +0800"
-        ]
-    })
-
-    # Parse the timestamp string with time zone and convert to New York time
-    df = df.with_column(
-        "ny_time",
-        df["timestamp_str"].str.to_datetime(
-            "%Y-%m-%d %H:%M:%S%.3f %z",
-            timezone="America/New_York"
-        )
-    )
-
-    df.show()
-    ```
-
-=== "โš™๏ธ SQL"
-    ```python
-    df = daft.from_pydict({
-        "timestamp_str": [
-            "2021-01-01 00:00:00.123 +0800",
-            "2021-01-02 12:30:00.456 +0800"
-        ]
-    })
-
-    # Parse the timestamp string with time zone and convert to New York time
-    df = daft.sql("""
-        SELECT
-            timestamp_str,
-            to_datetime(timestamp_str, '%Y-%m-%d %H:%M:%S%.3f %z', 'America/New_York') as ny_time
-        FROM df
-    """)
-
-    df.show()
-    ```
-
-``` {title="Output"}
-
-โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
-โ”‚ timestamp_str                 โ”† ny_time                                           โ”‚
-โ”‚ ---                           โ”† ---                                               โ”‚
-โ”‚ Utf8                          โ”† Timestamp(Milliseconds, Some("America/New_York")) โ”‚
-โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 2021-01-01 00:00:00.123 +0800 โ”† 2020-12-31 11:00:00.123 EST โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 2021-01-02 12:30:00.456 +0800 โ”† 2021-01-01 23:30:00.456 EST โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ -``` - -#### Temporal Truncation - -The [`.dt.truncate()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.dt.truncate.html#daft.Expression.dt.truncate) method allows you to truncate timestamps to specific time units. This can be useful for grouping data by time periods. For example, to truncate timestamps to the nearest hour: - -=== "๐Ÿ Python" - ``` python - df = daft.from_pydict({ - "timestamp": [ - datetime.datetime(2021, 1, 7, 0, 1, 1), - datetime.datetime(2021, 1, 8, 0, 1, 59), - datetime.datetime(2021, 1, 9, 0, 30, 0), - datetime.datetime(2021, 1, 10, 1, 59, 59), - ] - }) - - # Truncate timestamps to the nearest hour - df = df.with_column( - "hour_start", - df["timestamp"].dt.truncate("1 hour") - ) - - df.show() - ``` - -``` {title="Output"} - -โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ -โ”‚ timestamp โ”† hour_start โ”‚ -โ”‚ --- โ”† --- โ”‚ -โ”‚ Timestamp(Microseconds, None) โ”† Timestamp(Microseconds, None) โ”‚ -โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก -โ”‚ 2021-01-07 00:01:01 โ”† 2021-01-07 00:00:00 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 2021-01-08 00:01:59 โ”† 2021-01-08 00:00:00 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 2021-01-09 00:30:00 โ”† 2021-01-09 00:00:00 โ”‚ -โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค -โ”‚ 2021-01-10 01:59:59 โ”† 2021-01-10 01:00:00 โ”‚ -โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ -``` diff --git a/docs-v2/core_concepts/read_write.md b/docs-v2/core_concepts/read_write.md deleted file mode 100644 index fa3aeef066..0000000000 --- a/docs-v2/core_concepts/read_write.md +++ /dev/null @@ -1,142 +0,0 @@ -# Reading/Writing Data - -!!! 
failure "todo(docs): Should this page also have sql examples?" - -Daft can read data from a variety of sources, and write data to many destinations. - -## Reading Data - -### From Files - -DataFrames can be loaded from file(s) on some filesystem, commonly your local filesystem or a remote cloud object store such as AWS S3. Additionally, Daft can read data from a variety of container file formats, including CSV, line-delimited JSON and Parquet. - -Daft supports file paths to a single file, a directory of files, and wildcards. It also supports paths to remote object storage such as AWS S3. -=== "๐Ÿ Python" - ```python - import daft - - # You can read a single CSV file from your local filesystem - df = daft.read_csv("path/to/file.csv") - - # You can also read folders of CSV files, or include wildcards to select for patterns of file paths - df = daft.read_csv("path/to/*.csv") - - # Other formats such as parquet and line-delimited JSON are also supported - df = daft.read_parquet("path/to/*.parquet") - df = daft.read_json("path/to/*.json") - - # Remote filesystems such as AWS S3 are also supported, and can be specified with their protocols - df = daft.read_csv("s3://mybucket/path/to/*.csv") - ``` - -To learn more about each of these constructors, as well as the options that they support, consult the API documentation on [`creating DataFrames from files`](https://www.getdaft.io/projects/docs/en/stable/api_docs/creation.html#df-io-files). - -### From Data Catalogs - -If you use catalogs such as Apache Iceberg or Hive, you may wish to consult our user guide on integrations with Data Catalogs: [`Daft integration with Data Catalogs`](https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations.html). - -### From File Paths - -Daft also provides an easy utility to create a DataFrame from globbing a path. You can use the [`daft.from_glob_path`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.from_glob_path.html#daft.from_glob_path) method which will read a DataFrame of globbed filepaths. - -=== "๐Ÿ Python" - ``` python - df = daft.from_glob_path("s3://mybucket/path/to/images/*.jpeg") - - # +----------+------+-----+ - # | name | size | ... | - # +----------+------+-----+ - # ... - ``` - -This is especially useful for reading things such as a folder of images or documents into Daft. A common pattern is to then download data from these files into your DataFrame as bytes, using the [`.url.download()`](https://getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.url.download.html#daft.Expression.url.download) method. - - -### From Memory - -For testing, or small datasets that fit in memory, you may also create DataFrames using Python lists and dictionaries. - -=== "๐Ÿ Python" - ``` python - # Create DataFrame using a dictionary of {column_name: list_of_values} - df = daft.from_pydict({"A": [1, 2, 3], "B": ["foo", "bar", "baz"]}) - - # Create DataFrame using a list of rows, where each row is a dictionary of {column_name: value} - df = daft.from_pylist([{"A": 1, "B": "foo"}, {"A": 2, "B": "bar"}, {"A": 3, "B": "baz"}]) - ``` - -To learn more, consult the API documentation on [`creating DataFrames from in-memory data structures`](https://www.getdaft.io/projects/docs/en/stable/api_docs/creation.html#df-io-in-memory). 
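-
-Daft can also create DataFrames from other common in-memory formats. As a short sketch (assuming you have `pandas` and `pyarrow` installed), the `daft.from_pandas` and `daft.from_arrow` constructors follow the same pattern:
-
-=== "🐍 Python"
-    ``` python
-    import pandas as pd
-    import pyarrow as pa
-
-    # Create a DataFrame from a pandas DataFrame
-    df = daft.from_pandas(pd.DataFrame({"A": [1, 2, 3], "B": ["foo", "bar", "baz"]}))
-
-    # Create a DataFrame from a PyArrow table
-    df = daft.from_arrow(pa.table({"A": [1, 2, 3], "B": ["foo", "bar", "baz"]}))
-    ```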
-
-### From Databases
-
-Daft can also read data from a variety of databases, including PostgreSQL, MySQL, Trino, and SQLite using the [`daft.read_sql`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_sql.html#daft.read_sql) method. In order to partition the data, you can specify a partition column, which will allow Daft to read the data in parallel.
-
-=== "🐍 Python"
-    ``` python
-    # Read from a PostgreSQL database
-    uri = "postgresql://user:password@host:port/database"
-    df = daft.read_sql("SELECT * FROM my_table", uri)
-
-    # Read with a partition column
-    df = daft.read_sql("SELECT * FROM my_table", uri, partition_col="date")
-    ```
-
-To learn more, consult the [`SQL User Guide`](https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations/sql.html) or the API documentation on [`daft.read_sql`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_sql.html#daft.read_sql).
-
-## Reading a column of URLs
-
-Daft provides a convenient way to read data from a column of URLs using the [`.url.download()`](https://getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.url.download.html#daft.Expression.url.download) method. This is particularly useful when you have a DataFrame with a column containing URLs pointing to external resources that you want to fetch and incorporate into your DataFrame.
-
-Here's an example of how to use this feature:
-
-=== "🐍 Python"
-    ```python
-    # Assume we have a DataFrame with a column named 'image_urls'
-    df = daft.from_pydict({
-        "image_urls": [
-            "https://example.com/image1.jpg",
-            "https://example.com/image2.jpg",
-            "https://example.com/image3.jpg"
-        ]
-    })
-
-    # Download the content from the URLs and create a new column 'image_data'
-    df = df.with_column("image_data", df["image_urls"].url.download())
-    df.show()
-    ```
-
-``` {title="Output"}
-
-+------------------------------------+------------------------------------+
-| image_urls | image_data |
-| Utf8 | Binary |
-+====================================+====================================+
-| https://example.com/image1.jpg | b'\xff\xd8\xff\xe0\x00\x10JFIF...' |
-+------------------------------------+------------------------------------+
-| https://example.com/image2.jpg | b'\xff\xd8\xff\xe0\x00\x10JFIF...' |
-+------------------------------------+------------------------------------+
-| https://example.com/image3.jpg | b'\xff\xd8\xff\xe0\x00\x10JFIF...' |
-+------------------------------------+------------------------------------+
-
-(Showing first 3 of 3 rows)
-```
-
-This approach allows you to efficiently download and process data from a large number of URLs in parallel, leveraging Daft's distributed computing capabilities.
-
-## Writing Data
-
-Writing data will execute your DataFrame and write the results out to the specified backend. The [`df.write_*(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/dataframe.html#df-write-data) methods are used to write DataFrames to files or other destinations.
-
-=== "🐍 Python"
-    ``` python
-    # Write to various file formats in a local folder
-    df.write_csv("path/to/folder/")
-    df.write_parquet("path/to/folder/")
-
-    # Write DataFrame to a remote filesystem such as AWS S3
-    df.write_csv("s3://mybucket/path/")
-    ```
-
-!!! note "Note"
-
-    Because Daft is a distributed DataFrame library, by default it will produce multiple files (one per partition) at your specified destination. 
Writing your dataframe is a **blocking** operation that executes your DataFrame. It will return a new `DataFrame` that contains the filepaths to the written data.
diff --git a/docs-v2/core_concepts/sql.md b/docs-v2/core_concepts/sql.md
deleted file mode 100644
index 55ebde486e..0000000000
--- a/docs-v2/core_concepts/sql.md
+++ /dev/null
@@ -1,224 +0,0 @@
-# SQL
-
-Daft supports Structured Query Language (SQL) as a way of constructing query plans (represented in Python as a [`daft.DataFrame`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.html#daft.DataFrame)) and expressions ([`daft.Expression`](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html)).
-
-SQL is a human-readable way of constructing these query plans, and can often be more ergonomic than using DataFrames for writing queries.
-
-!!! tip "Daft's SQL support is new and is constantly being improved!"
-
-    Please give us feedback or submit an [issue](https://github.com/Eventual-Inc/Daft/issues) and we'd love to hear more about what you would like.
-
-
-## Running SQL on DataFrames
-
-Daft's [`daft.sql`](https://www.getdaft.io/projects/docs/en/stable/api_docs/sql.html#daft.sql) function will automatically detect any [`daft.DataFrame`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.html#daft.DataFrame) objects in your current Python environment to let you query them easily by name.
-
-=== "⚙️ SQL"
-    ```python
-    # Note the variable name `my_special_df`
-    my_special_df = daft.from_pydict({"A": [1, 2, 3], "B": [1, 2, 3]})
-
-    # Use the SQL table name "my_special_df" to refer to the above DataFrame!
-    sql_df = daft.sql("SELECT A, B FROM my_special_df")
-
-    sql_df.show()
-    ```
-
-``` {title="Output"}
-
-╭───────┬───────╮
-│ A ┆ B │
-│ --- ┆ --- │
-│ Int64 ┆ Int64 │
-╞═══════╪═══════╡
-│ 1 ┆ 1 │
-├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
-│ 2 ┆ 2 │
-├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
-│ 3 ┆ 3 │
-╰───────┴───────╯
-
-(Showing first 3 of 3 rows)
-```
-
-In the above example, we query the DataFrame called `"my_special_df"` by simply referring to it in the SQL command. This produces a new DataFrame `sql_df` which can natively integrate with the rest of your Daft query.
-
-## Reading data from SQL
-
-!!! warning "Warning"
-
-    This feature is a WIP and will be coming soon! We will support reading common datasources directly from SQL:
-
-    === "🐍 Python"
-
-        ```python
-        daft.sql("SELECT * FROM read_parquet('s3://...')")
-        daft.sql("SELECT * FROM read_delta_lake('s3://...')")
-        ```
-
-    Today, a workaround for this is to construct your dataframe in Python first and use it from SQL instead:
-
-    === "🐍 Python"

-        ```python
-        df = daft.read_parquet("s3://...")
-        daft.sql("SELECT * FROM df")
-        ```
-
-    We appreciate your patience with us and hope to deliver this crucial feature soon!
-
-## SQL Expressions
-
-SQL has the concept of expressions as well. Here is an example of a simple addition expression, adding columns "A" and "B" in SQL to produce a new column "C".
-
-We also show the equivalent query in both SQL and DataFrame form. Notice how similar the concepts are!
-
-=== "⚙️ SQL"
-    ```python
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [1, 2, 3]})
-    df = daft.sql("SELECT A + B as C FROM df")
-    df.show()
-    ```
-
-=== "🐍 Python"
-    ``` python
-    expr = (daft.col("A") + daft.col("B")).alias("C")
-
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [1, 2, 3]})
-    df = df.select(expr)
-    df.show()
-    ```
-
-``` {title="Output"}
-
-╭───────╮
-│ C │
-│ --- │
-│ Int64 │
-╞═══════╡
-│ 2 │
-├╌╌╌╌╌╌╌┤
-│ 4 │
-├╌╌╌╌╌╌╌┤
-│ 6 │
-╰───────╯
-
-(Showing first 3 of 3 rows)
-```
-
-In the above query, both the SQL and DataFrame versions of the query produce the same result.
-
-Under the hood, they run the same Expression `col("A") + col("B")`!
-
-One really cool trick you can do is to use the [`daft.sql_expr`](https://www.getdaft.io/projects/docs/en/stable/api_docs/sql.html#daft.sql_expr) function as a helper to easily create Expressions. The following are equivalent:
-
-=== "⚙️ SQL"
-    ```python
-    sql_expr = daft.sql_expr("A + B as C")
-    print("SQL expression:", sql_expr)
-    ```
-
-=== "🐍 Python"
-    ``` python
-    py_expr = (daft.col("A") + daft.col("B")).alias("C")
-    print("Python expression:", py_expr)
-    ```
-
-``` {title="Output"}
-
-SQL expression: col(A) + col(B) as C
-Python expression: col(A) + col(B) as C
-```
-
-This means that you can pretty much use SQL anywhere you use Python expressions, making Daft extremely versatile at mixing workflows which leverage both SQL and Python.
-
-As an example, consider the filter query below and compare the two equivalent Python and SQL queries:
-
-=== "⚙️ SQL"
-    ```python
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [1, 2, 3]})
-
-    # Daft automatically converts this string using `daft.sql_expr`
-    df = df.where("A < 2")
-
-    df.show()
-    ```
-
-=== "🐍 Python"
-    ``` python
-    df = daft.from_pydict({"A": [1, 2, 3], "B": [1, 2, 3]})
-
-    # Using Daft's Python Expression API
-    df = df.where(df["A"] < 2)
-
-    df.show()
-    ```
-
-``` {title="Output"}
-
-╭───────┬───────╮
-│ A ┆ B │
-│ --- ┆ --- │
-│ Int64 ┆ Int64 │
-╞═══════╪═══════╡
-│ 1 ┆ 1 │
-╰───────┴───────╯
-
-(Showing first 1 of 1 rows)
-```
-
-Pretty sweet! Of course, this support for running Expressions on your columns extends well beyond arithmetic as we'll see in the next section on SQL Functions.
-
-## SQL Functions
-
-SQL also has access to all of Daft's powerful [`daft.Expression`](https://www.getdaft.io/projects/docs/en/stable/api_docs/expressions.html) functionality through SQL functions.
-
-However, unlike the Python Expression API, which encourages method-chaining (e.g. `col("a").url.download().image.decode()`), in SQL you have to do function nesting instead (e.g. `"image_decode(url_download(a))"`).
-
-!!! note "Note"
-
-    A full catalog of the available SQL Functions in Daft is available in the [SQL API Documentation](https://www.getdaft.io/projects/docs/en/stable/api_docs/sql.html).
-
-    Note that it closely mirrors the Python API, with some naming differences from the available Python methods.
-    We also provide aliased functions for ANSI SQL compliance, and so that users coming from other common SQL dialects such as PostgreSQL and SparkSQL can easily find familiar functionality.
-
-Here is an example of an equivalent function call in SQL vs Python:
-
-=== "⚙️ SQL"
-    ```python
-    df = daft.from_pydict({"urls": [
-        "https://user-images.githubusercontent.com/17691182/190476440-28f29e87-8e3b-41c4-9c28-e112e595f558.png",
-        "https://user-images.githubusercontent.com/17691182/190476440-28f29e87-8e3b-41c4-9c28-e112e595f558.png",
-        "https://user-images.githubusercontent.com/17691182/190476440-28f29e87-8e3b-41c4-9c28-e112e595f558.png",
-    ]})
-    df = daft.sql("SELECT image_decode(url_download(urls)) FROM df")
-    df.show()
-    ```
-
-=== "🐍 Python"
-    ``` python
-    df = daft.from_pydict({"urls": [
-        "https://user-images.githubusercontent.com/17691182/190476440-28f29e87-8e3b-41c4-9c28-e112e595f558.png",
-        "https://user-images.githubusercontent.com/17691182/190476440-28f29e87-8e3b-41c4-9c28-e112e595f558.png",
-        "https://user-images.githubusercontent.com/17691182/190476440-28f29e87-8e3b-41c4-9c28-e112e595f558.png",
-    ]})
-    df = df.select(daft.col("urls").url.download().image.decode())
-    df.show()
-    ```
-
-``` {title="Output"}
-
-╭──────────────╮
-│ urls │
-│ --- │
-│ Image[MIXED] │
-╞══════════════╡
-│ │
-├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
-│ │
-├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
-│ │
-╰──────────────╯
-
-(Showing first 3 of 3 rows)
-```
diff --git a/docs-v2/core_concepts/udf.md b/docs-v2/core_concepts/udf.md
deleted file mode 100644
index a57913a05f..0000000000
--- a/docs-v2/core_concepts/udf.md
+++ /dev/null
@@ -1,213 +0,0 @@
-# User-Defined Functions (UDF)
-
-A key piece of functionality in Daft is the ability to flexibly define custom functions that can run on any data in your dataframe. This guide walks you through the different types of UDFs that Daft allows you to run.
-
-Let's first create a dataframe that will be used as a running example throughout this tutorial!
-
-=== "🐍 Python"
-    ``` python
-    import daft
-    import numpy as np
-
-    df = daft.from_pydict({
-        # the `image` column contains images represented as 2D numpy arrays
-        "image": [np.ones((128, 128)) for i in range(16)],
-        # the `crop` column contains a box to crop from our image, represented as a list of integers: [x1, x2, y1, y2]
-        "crop": [[0, 1, 0, 1] for i in range(16)],
-    })
-    ```
-
-
-## Per-column per-row functions using [`.apply`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.apply.html)
-
-You can use [`.apply`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.apply.html) to run a Python function on every row in a column.
-
-For example, the following creates a new `flattened_image` column by calling `.flatten()` on every object in the `image` column.
-
-=== "🐍 Python"
-    ``` python
-    df.with_column(
-        "flattened_image",
-        df["image"].apply(lambda img: img.flatten(), return_dtype=daft.DataType.python())
-    ).show(2)
-    ```
-
-``` {title="Output"}
-
-+----------------------+---------------+---------------------+
-| image | crop | flattened_image |
-| Python | List[Int64] | Python |
-+======================+===============+=====================+
-| [[1. 1. 1. ... 1. 1. | [0, 1, 0, 1] | [1. 1. 1. ... 1. 1. |
-| 1.] [1. 1. 1. ... | | 1.] |
-| 1. 1. 1.] [1. 1.... | | |
-+----------------------+---------------+---------------------+
-| [[1. 1. 1. ... 1. 1. | [0, 1, 0, 1] | [1. 1. 1. ... 1. 1. |
-| 1.] [1. 1. 1. ... | | 1.] |
-| 1. 1. 1.] [1. 1....
| | |
-+----------------------+---------------+---------------------+
-(Showing first 2 rows)
-```
-
-Note here that we use the `return_dtype` keyword argument to specify that our returned column type is a Python column!
-
-## Multi-column per-partition functions using [`@udf`](https://www.getdaft.io/projects/docs/en/stable/api_docs/udf.html#creating-udfs)
-
-[`.apply`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.apply.html) is great for convenience, but has two main limitations:
-
-1. It can only run on single columns
-2. It can only run on single items at a time
-
-Daft provides the [`@udf`](https://www.getdaft.io/projects/docs/en/stable/api_docs/udf.html#creating-udfs) decorator for defining your own UDFs that process multiple columns or multiple rows at a time.
-
-For example, let's try writing a function that will crop all the images in the `image` column by their corresponding values in the `crop` column:
-
-=== "🐍 Python"
-    ``` python
-    @daft.udf(return_dtype=daft.DataType.python())
-    def crop_images(images, crops, padding=0):
-        cropped = []
-        for img, crop in zip(images.to_pylist(), crops.to_pylist()):
-            x1, x2, y1, y2 = crop
-            cropped_img = img[x1:x2 + padding, y1:y2 + padding]
-            cropped.append(cropped_img)
-        return cropped
-
-    df = df.with_column(
-        "cropped",
-        crop_images(df["image"], df["crop"], padding=1),
-    )
-    df.show(2)
-    ```
-
-``` {title="Output"}
-
-+----------------------+---------------+--------------------+
-| image | crop | cropped |
-| Python | List[Int64] | Python |
-+======================+===============+====================+
-| [[1. 1. 1. ... 1. 1. | [0, 1, 0, 1] | [[1. 1.] [1. 1.]] |
-| 1.] [1. 1. 1. ... | | |
-| 1. 1. 1.] [1. 1.... | | |
-+----------------------+---------------+--------------------+
-| [[1. 1. 1. ... 1. 1. | [0, 1, 0, 1] | [[1. 1.] [1. 1.]] |
-| 1.] [1. 1. 1. ... | | |
-| 1. 1. 1.] [1. 1.... | | |
-+----------------------+---------------+--------------------+
-(Showing first 2 rows)
-```
-
-There are a few things happening here, so let's break it down:
-
-1. `crop_images` is a normal Python function. It takes as input:
-
-    a. A list of images: `images`
-
-    b. A list of cropping boxes: `crops`
-
-    c. An integer indicating how much padding to apply to the right and bottom of the cropping: `padding`
-
-2. To allow Daft to pass column data into the `images` and `crops` arguments, we decorate the function with [`@udf`](https://www.getdaft.io/projects/docs/en/stable/api_docs/udf.html#creating-udfs)
-
-    a. `return_dtype` defines the returned data type. In this case, we return a column containing Python objects of numpy arrays
-
-    b. At runtime, because we call the UDF on the `image` and `crop` columns, the UDF will receive a [`daft.Series`](https://www.getdaft.io/projects/docs/en/stable/api_docs/series.html#daft.Series) object for each argument.
-
-3. We can create a new column in our DataFrame by applying our UDF on the `"image"` and `"crop"` columns inside of a [`df.with_column()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html#daft.DataFrame.with_column) call.
-
-### UDF Inputs
-
-
-When you specify an Expression as an input to a UDF, Daft will calculate the result of that Expression and pass it into your function as a [`daft.Series`](https://www.getdaft.io/projects/docs/en/stable/api_docs/series.html#daft.Series) object. 
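-
-For instance, here is a minimal sketch of a UDF that converts its [`daft.Series`](https://www.getdaft.io/projects/docs/en/stable/api_docs/series.html#daft.Series) input into a Python list before processing it (reusing the `crop` column from the running example; `crop_widths` is a hypothetical helper, not part of Daft):
-
-=== "🐍 Python"
-    ``` python
-    @daft.udf(return_dtype=daft.DataType.int64())
-    def crop_widths(crops):
-        # `crops` arrives as a daft.Series; `.to_pylist()` converts it into a Python list
-        return [x2 - x1 for x1, x2, y1, y2 in crops.to_pylist()]
-
-    df = df.with_column("crop_width", crop_widths(df["crop"]))
-    ```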
-
-The Daft [`daft.Series`](https://www.getdaft.io/projects/docs/en/stable/api_docs/series.html#daft.Series) is just an abstraction on a "column" of data! You can obtain several different data representations from a [`daft.Series`](https://www.getdaft.io/projects/docs/en/stable/api_docs/series.html#daft.Series):
-
-1. PyArrow Arrays (`pa.Array`): [`s.to_arrow()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/series.html#daft.Series.to_arrow)
-2. Python lists (`list`): [`s.to_pylist()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/series.html#daft.Series.to_pylist)
-
-Depending on your application, you may choose a different data representation that is more performant or more convenient!
-
-!!! info "Info"
-
-    Certain array formats have some restrictions around the type of data that they can handle:
-
-    1. **Null Handling**: In Pandas and Numpy, nulls are represented as NaNs for numeric types, and Nones for non-numeric types. Additionally, the existence of nulls will trigger a type casting from integer to float arrays. If null handling is important to your use-case, we recommend using one of the other available options.
-
-    2. **Python Objects**: PyArrow array formats cannot support Python columns.
-
-    We recommend using Python lists if performance is not a major consideration, and using the arrow-native formats such as PyArrow arrays and numpy arrays if performance is important.
-
-### Return Types
-
-The `return_dtype` argument specifies what type of column your UDF will return. Types can be specified using the [`daft.DataType`](https://www.getdaft.io/projects/docs/en/stable/api_docs/datatype.html#daft.DataType) class.
-
-Your UDF function itself needs to return a batch of columnar data, and can do so as any one of the following array types:
-
-1. Numpy Arrays (`np.ndarray`)
-2. PyArrow Arrays (`pa.Array`)
-3. Python lists (`list`)
-
-Note that if the data you have returned is not castable to the `return_dtype` that you specify (e.g. if you return a list of floats when you've specified a `return_dtype=DataType.bool()`), Daft will throw a runtime error!
-
-## Class UDFs
-
-UDFs can also be created on classes, which allows for the initialization of some expensive state that can be shared between invocations of the class, for example downloading data or creating a model.
-
-=== "🐍 Python"
-    ``` python
-    @daft.udf(return_dtype=daft.DataType.int64())
-    class RunModel:
-
-        def __init__(self):
-            # Perform expensive initializations
-            self._model = create_model()
-
-        def __call__(self, features_col):
-            return self._model(features_col)
-    ```
-
-Running Class UDFs is exactly the same as running their functional cousins.
-
-=== "🐍 Python"
-    ``` python
-    df = df.with_column("image_classifications", RunModel(df["images"]))
-    ```
-
-## Resource Requests
-
-Sometimes, you may want to request specific resources for your UDF. For example, some UDFs need one GPU to run as they will load a model onto the GPU. 
- -To do so, you can create your UDF and assign it a resource request: - -=== "๐Ÿ Python" - ``` python - @daft.udf(return_dtype=daft.DataType.int64(), num_gpus=1) - class RunModelWithOneGPU: - - def __init__(self): - # Perform expensive initializations - self._model = create_model() - - def __call__(self, features_col): - return self._model(features_col) - ``` - - ``` python - df = df.with_column( - "image_classifications", - RunModelWithOneGPU(df["images"]), - ) - ``` - -In the above example, if Daft ran on a Ray cluster consisting of 8 GPUs and 64 CPUs, Daft would be able to run 8 replicas of your UDF in parallel, thus massively increasing the throughput of your UDF! - -UDFs can also be parametrized with new resource requests after being initialized. - -=== "๐Ÿ Python" - ``` python - RunModelWithTwoGPUs = RunModelWithOneGPU.override_options(num_gpus=2) - df = df.with_column( - "image_classifications", - RunModelWithTwoGPUs(df["images"]), - ) - ``` diff --git a/docs-v2/distributed.md b/docs-v2/distributed.md new file mode 100644 index 0000000000..455cd726cf --- /dev/null +++ b/docs-v2/distributed.md @@ -0,0 +1,303 @@ +# Distributed Computing + +!!! failure "todo(docs): add daft launcher docs and review order of information" + +By default, Daft runs using your local machine's resources and your operations are thus limited by the CPUs, memory and GPUs available to you in your single local development machine. + +However, Daft has strong integrations with [Ray](https://www.ray.io) which is a distributed computing framework for distributing computations across a cluster of machines. Here is a snippet showing how you can connect Daft to a Ray cluster: + +=== "๐Ÿ Python" + + ```python + import daft + + daft.context.set_runner_ray() + ``` + +By default, if no address is specified Daft will spin up a Ray cluster locally on your machine. If you are running Daft on a powerful machine (such as an AWS P3 machine which is equipped with multiple GPUs) this is already very useful because Daft can parallelize its execution of computation across your CPUs and GPUs. However, if instead you already have your own Ray cluster running remotely, you can connect Daft to it by supplying an address: + +=== "๐Ÿ Python" + + ```python + daft.context.set_runner_ray(address="ray://url-to-mycluster") + ``` + +For more information about the `address` keyword argument, please see the [Ray documentation on initialization](https://docs.ray.io/en/latest/ray-core/api/doc/ray.init.html). + + +If you want to start a single node ray cluster on your local machine, you can do the following: + +```bash +> pip install ray[default] +> ray start --head --port=6379 +``` + +This should output something like: + +``` +Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details. + +Local node IP: 127.0.0.1 + +-------------------- +Ray runtime started. +-------------------- + +... +``` + +You can take the IP address and port and pass it to Daft: + +=== "๐Ÿ Python" + + ```python + >>> import daft + >>> daft.context.set_runner_ray("127.0.0.1:6379") + DaftContext(_daft_execution_config=, _daft_planning_config=, _runner_config=_RayRunnerConfig(address='127.0.0.1:6379', max_task_backlog=None), _disallow_set_runner=True, _runner=None) + >>> df = daft.from_pydict({ + ... 'text': ['hello', 'world'] + ... 
})
+    2024-07-29 15:49:26,610 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
+    2024-07-29 15:49:26,622 INFO worker.py:1752 -- Connected to Ray cluster.
+    >>> print(df)
+    ╭───────╮
+    │ text  │
+    │ ---   │
+    │ Utf8  │
+    ╞═══════╡
+    │ hello │
+    ├╌╌╌╌╌╌╌┤
+    │ world │
+    ╰───────╯
+
+    (Showing first 2 of 2 rows)
+    ```
+
+## Daft Launcher
+
+Daft Launcher is a convenient command-line tool that provides simple abstractions over Ray, enabling users to quickly get up and running with Daft for distributed computations. Rather than worrying about the complexities of managing Ray, users can simply run a few CLI commands to spin up a cluster, submit a job, observe the status of jobs and clusters, and spin down a cluster.
+
+### Prerequisites
+
+The following should be installed on your machine:
+
+- The [AWS CLI](https://aws.amazon.com/cli) tool (assuming you're using AWS as your cloud provider).
+
+- A Python package manager. We recommend using `uv` to manage everything (i.e., dependencies, as well as the Python version itself). It's much cleaner and faster than `pip`.
+
+### Install Daft Launcher
+
+Run the following commands in your terminal to initialize your project:
+
+```bash
+# Create a project directory
+cd some/working/directory
+mkdir launch-test
+cd launch-test
+
+# Initialize the project
+uv init --python 3.12
+uv venv
+source .venv/bin/activate
+
+# Install Daft Launcher
+uv pip install "daft-launcher"
+```
+
+In your virtual environment, you should now have Daft launcher installed; you can verify this by running `daft --version`, which will print the installed version of Daft launcher. You should also have a basic working directory that looks something like this:
+
+```bash
+/
+|- .venv/
+|- hello.py
+|- pyproject.toml
+|- README.md
+|- .python-version
+```
+
+### Configure AWS Credentials
+
+Establish an SSO connection to configure your AWS credentials:
+
+```bash
+# Configure your SSO
+aws configure sso
+
+# Login to your SSO
+aws sso login
+```
+
+These commands should open your browser. Accept the prompted requests, then return to your terminal; you should see a success message from the AWS CLI tool. At this point, your AWS CLI tool has been configured and your environment is fully set up.
+
+### Initialize Configuration File
+
+Initialize a default configuration file to store values that you can later tune; its fields are marked as required or optional accordingly.
+
+```bash
+# Initialize the default .daft.toml configuration file
+daft init-config
+
+# Optionally you can also specify a custom name for your file
+daft init-config my-custom-config.toml
+```
+
+Fill out the required values in your `.daft.toml` file. Optional configurations will have a default value pre-defined.
+
+```toml
+[setup]
+
+# (required)
+# The name of the cluster.
+name = ...
+
+# (required)
+# The cloud provider that this cluster will be created in.
+# Has to be one of the following:
+# - "aws"
+# - "gcp"
+# - "azure"
+provider = ...
+
+# (optional; default = None)
+# The IAM instance profile ARN which will provide this cluster with the necessary permissions to perform its actions.
+# Please note that if you don't specify this field, Ray will create an automatic instance profile for you.
+# That instance profile will be minimal and may restrict some of the features of Daft.
+iam_instance_profile_arn = ...
+
+# (required)
+# The AWS region in which to place this cluster.
+region = ...
+
+# (optional; default = "ec2-user")
+# The ssh user name when connecting to the cluster.
+ssh_user = ...
+
+# (optional; default = 2)
+# The number of worker nodes to create in the cluster.
+number_of_workers = ...
+
+# (optional; default = "m7g.medium")
+# The instance type to use for the head and worker nodes.
+instance_type = ...
+
+# (optional; default = "ami-01c3c55948a949a52")
+# The AMI ID to use for the head and worker nodes.
+image_id = ...
+
+# (optional; default = [])
+# A list of dependencies to install on the head and worker nodes.
+# These will be installed using UV (https://docs.astral.sh/uv/).
+dependencies = [...]
+
+[run]
+
+# (optional; default = ['echo "Hello, World!"'])
+# Any post-setup commands that you want to invoke manually.
+# This is a good location to install any custom dependencies or run some arbitrary script.
+setup_commands = [...]
+
+```
+
+### Spin Up a Cluster
+
+`daft up` will spin up a cluster given the configuration file you initialized earlier. The configuration file contains all the information Daft launcher needs in order to spin up a cluster.
+
+```bash
+# Spin up a cluster using the default .daft.toml configuration file created earlier
+daft up
+
+# Alternatively spin up a cluster using a custom configuration file created earlier
+daft up -c my-custom-config.toml
+```
+
+This command will do a couple of things:
+
+1. First, it will reach into your cloud provider and spin up the necessary resources. This includes things such as the worker nodes, security groups, permissions, etc.
+
+2. When the nodes are spun up, the Ray and Daft dependencies will be downloaded into a Python virtual environment.
+
+3. Next, any other custom dependencies that you've specified in the configuration file will then be downloaded.
+
+4. Finally, the setup commands that you've specified in the configuration file will be run on the head node.
+
+!!! note "Note"
+
+    `daft up` will only return successfully when the head node is fully set up. Even though the command will request the worker nodes to also spin up, it will not wait for them to be spun up before returning. Therefore, when the command completes and you type `daft list`, the worker nodes may be in a “pending” state immediately after. Give it a few seconds and they should be fully running.
+
+### Submit a Job
+
+`daft submit` enables you to submit a "job" (a working directory and a command) to be run on the remote cluster.
+
+```bash
+# Submit a job using the default .daft.toml configuration file
+daft submit -i my-keypair.pem -w my-working-directory
+
+# Alternatively submit a job using a custom configuration file
+daft submit -c my-custom-config.toml -i my-keypair.pem -w my-working-directory
+```
+
+### Run a SQL Query
+
+Daft supports a SQL API, so you can use `daft sql` to run raw SQL queries against your data. The SQL dialect is the PostgreSQL standard.
+
+```bash
+# Run a sql query using the default .daft.toml configuration file
+daft sql -- "\"SELECT * FROM my_table\""
+
+# Alternatively you can run a sql query using a custom configuration file
+daft sql -c my-custom-config.toml -- "\"SELECT * FROM my_table\""
+```
+
+### View Ray Dashboard
+
+You can view the Ray dashboard of your running cluster with `daft connect`, which establishes a port-forward over SSH from your local machine to the head node of the cluster (connecting `localhost:8265` to the remote head's `8265`). 
+
+```bash
+# Establish the port-forward using the default .daft.toml configuration file
+daft connect -i my-keypair.pem
+
+# Alternatively establish the port-forward using a custom configuration file
+daft connect -c my-custom-config.toml -i my-keypair.pem
+```
+
+!!! note "Note"
+
+    `daft connect` will require you to have the appropriate SSH keypair to authenticate against the remote head's public SSH key. Make sure to pass this SSH keypair as an argument to the command.
+
+### Spin Down a Cluster
+
+`daft down` will spin down all instances of the cluster specified in the configuration file, not just the head node.
+
+```bash
+# Spin down a cluster using the default .daft.toml configuration file
+daft down
+
+# Alternatively spin down a cluster using a custom configuration file
+daft down -c my-custom-config.toml
+```
+
+### List Running and Terminated Clusters
+
+`daft list` allows you to view the current state of all clusters, running and terminated, and includes each instance's name and IP address (when the cluster is running). Here's an example output after running `daft list`:
+
+```
+Running:
+ - daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
+ - daft-demo, worker, i-00c340dc39d54772d
+ - daft-demo, worker, i-042a96ce1413c1dd6
+```
+
+Say we spun up another cluster, `new-cluster`, and then terminated it. Here's what the output of `daft list` would look like immediately after:
+
+```
+Running:
+ - daft-demo, head, i-053f9d4856d92ea3d, 35.94.91.91
+ - daft-demo, worker, i-00c340dc39d54772d, 44.234.112.173
+ - daft-demo, worker, i-042a96ce1413c1dd6, 35.94.206.130
+Shutting-down:
+ - new-cluster, head, i-0be0db9803bd06652, 35.86.200.101
+ - new-cluster, worker, i-056f46bd69e1dd3f1, 44.242.166.108
+ - new-cluster, worker, i-09ff0e1d8e67b8451, 35.87.221.180
+```
+
+A few seconds later, the state of `new-cluster` will be finalized to “Terminated”.
diff --git a/docs-v2/integrations/delta_lake.md b/docs-v2/integrations/delta_lake.md
index d0c92bed5c..e19f05bb88 100644
--- a/docs-v2/integrations/delta_lake.md
+++ b/docs-v2/integrations/delta_lake.md
@@ -4,7 +4,7 @@
Daft currently supports:
-1. **Parallel + Distributed Reads:** Daft parallelizes Delta Lake table reads over all cores of your machine, if using the default multithreading runner, or all cores + machines of your Ray cluster, if using the [distributed Ray runner](../advanced/distributed.md).
+1. **Parallel + Distributed Reads:** Daft parallelizes Delta Lake table reads over all cores of your machine, if using the default multithreading runner, or all cores + machines of your Ray cluster, if using the [distributed Ray runner](../distributed.md).
2. **Skipping Filtered Data:** Daft ensures that only data that matches your [`df.where(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.where.html#daft.DataFrame.where) filter will be read, often skipping entire files/partitions.
diff --git a/docs-v2/integrations/hudi.md b/docs-v2/integrations/hudi.md
index e0a28ec7da..6c5b33fb01 100644
--- a/docs-v2/integrations/hudi.md
+++ b/docs-v2/integrations/hudi.md
@@ -4,7 +4,7 @@
Daft currently supports:
-1. **Parallel + Distributed Reads:** Daft parallelizes Hudi table reads over all cores of your machine, if using the default multithreading runner, or all cores + machines of your Ray cluster, if using the [distributed Ray runner](../advanced/distributed.md).
+1. 
**Parallel + Distributed Reads:** Daft parallelizes Hudi table reads over all cores of your machine, if using the default multithreading runner, or all cores + machines of your Ray cluster, if using the [distributed Ray runner](../distributed.md). 2. **Skipping Filtered Data:** Daft ensures that only data that matches your [`df.where(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.where.html#daft.DataFrame.where) filter will be read, often skipping entire files/partitions. diff --git a/docs-v2/integrations/ray.md b/docs-v2/integrations/ray.md index 0517248aa7..55c334ba35 100644 --- a/docs-v2/integrations/ray.md +++ b/docs-v2/integrations/ray.md @@ -1,5 +1,8 @@ # Ray +!!! failure "todo(docs): add reference to daft launcher" + + [Ray](https://docs.ray.io/en/latest/ray-overview/index.html) is an open-source framework for distributed computing. Daft's native support for Ray enables you to run distributed DataFrame workloads at scale. ## Usage diff --git a/docs-v2/integrations/sql.md b/docs-v2/integrations/sql.md index 95676d0082..ce32e8ca65 100644 --- a/docs-v2/integrations/sql.md +++ b/docs-v2/integrations/sql.md @@ -6,7 +6,7 @@ Daft currently supports: 1. **20+ SQL Dialects:** Daft supports over 20 databases, data warehouses, and query engines by using [SQLGlot](https://sqlglot.com/sqlglot.html) to convert SQL queries across dialects. See the full list of supported dialects [here](https://sqlglot.com/sqlglot/dialects.html). -2. **Parallel + Distributed Reads:** Daft parallelizes SQL reads by using all local machine cores with its default multithreading runner, or all cores across multiple machines if using the [distributed Ray runner](../advanced/distributed.md). +2. **Parallel + Distributed Reads:** Daft parallelizes SQL reads by using all local machine cores with its default multithreading runner, or all cores across multiple machines if using the [distributed Ray runner](../distributed.md). 3. **Skipping Filtered Data:** Daft ensures that only data that matches your [`df.select(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.select.html#daft.DataFrame.select), [`df.limit(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.limit.html#daft.DataFrame.limit), and [`df.where(...)`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.where.html#daft.DataFrame.where) expressions will be read, often skipping entire partitions/columns. @@ -80,7 +80,7 @@ You can also directly provide a SQL alchemy connection via a **connection factor ## Parallel + Distributed Reads -For large datasets, Daft can parallelize SQL reads by using all local machine cores with its default multithreading runner, or all cores across multiple machines if using the [distributed Ray runner](../advanced/distributed.md). +For large datasets, Daft can parallelize SQL reads by using all local machine cores with its default multithreading runner, or all cores across multiple machines if using the [distributed Ray runner](../distributed.md). Supply the [`daft.read_sql()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_sql.html#daft.read_sql) function with a **partition column** and optionally the **number of partitions** to enable parallel reads. 
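
For illustration, here is a minimal sketch of such a parallel read (assuming a hypothetical PostgreSQL table `my_table` with a `date` column to partition on):

```python
import daft

uri = "postgresql://user:password@host:port/database"

# Partitioning on the "date" column allows Daft to issue the reads in parallel
df = daft.read_sql(
    "SELECT * FROM my_table",
    uri,
    partition_col="date",
    num_partitions=8,
)
```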
diff --git a/docs-v2/migration/dask_migration.md b/docs-v2/migration/dask_migration.md index 359d2d91e5..af2d91b7ce 100644 --- a/docs-v2/migration/dask_migration.md +++ b/docs-v2/migration/dask_migration.md @@ -117,7 +117,7 @@ Dask supports the same data types as pandas. Daft is built to support many more ## Distributed Computing and Remote Clusters -Both Dask and Daft support distributed computing on remote clusters. In Dask, you create a Dask cluster either locally or remotely and perform computations in parallel there. Currently, Daft supports distributed cluster computing [with Ray](../advanced/distributed.md). Support for running Daft computations on Dask clusters is on the roadmap. +Both Dask and Daft support distributed computing on remote clusters. In Dask, you create a Dask cluster either locally or remotely and perform computations in parallel there. Currently, Daft supports distributed cluster computing [with Ray](../distributed.md). Support for running Daft computations on Dask clusters is on the roadmap. Cloud support for both Dask and Daft is the same. diff --git a/docs-v2/quickstart.md b/docs-v2/quickstart.md index c744d99c9e..2372b868fe 100644 --- a/docs-v2/quickstart.md +++ b/docs-v2/quickstart.md @@ -4,6 +4,11 @@ !!! failure "todo(docs): Incorporate SQL examples" +!!! failure "todo(docs): Add link to notebook to DIY (notebook is in docs-v2 dir, but idk how to host on colab)." + +!!! failure "todo(docs): What does the actual output look like for some of these examples?" + + In this quickstart, you will learn the basics of Daft's DataFrame and SQL API and the features that set it apart from frameworks like Pandas, PySpark, Dask, and Ray. @@ -57,6 +62,7 @@ See also [DataFrame Creation](https://www.getdaft.io/projects/docs/en/stable/api (Showing first 4 of 4 rows) + ``` You just created your first DataFrame! diff --git a/mkdocs.yml b/mkdocs.yml index cadf521f08..9e80ca1af6 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -6,6 +6,12 @@ site_name: Daft Documentation docs_dir: docs-v2 +# Scarf pixel for tracking analytics +image: + referrerpolicy: "no-referrer-when-downgrade" + src: "https://static.scarf.sh/a.png?x-pxid=c9065f3a-a090-4243-8f69-145d5de7bfca" + + # Repository repo_name: Daft repo_url: https://github.com/Eventual-Inc/Daft @@ -18,17 +24,10 @@ nav: - Installation: install.md - Quickstart: quickstart.md - Core Concepts: core_concepts.md - # - DataFrame: core_concepts/dataframe.md - # - Expressions: core_concepts/expressions.md - # - Reading/Writing Data: core_concepts/read_write.md - # - DataTypes: core_concepts/datatypes.md - # - SQL: core_concepts/sql.md - # - Aggregations and Grouping: core_concepts/aggregations.md - # - User-Defined Functions (UDF): core_concepts/udf.md + - Distributed Computing: distributed.md - Advanced: - Managing Memory Usage: advanced/memory.md - Partitioning: advanced/partitioning.md - - Distributed Computing: advanced/distributed.md - Integrations: - Ray: integrations/ray.md - Unity Catalog: integrations/unity_catalog.md