diff --git a/web/book/src/internals/type-system.md b/web/book/src/internals/type-system.md new file mode 100644 index 000000000000..23c5134b6f62 --- /dev/null +++ b/web/book/src/internals/type-system.md @@ -0,0 +1,332 @@ +# Type system + +> For variables, the type system determines the allowed values of that term. +> +> -- Wikipedia + +## Purpose + +Each of the SQL DBMSs has their own type system. Thanks to SQL standard, they +are very similar, but have key differences. For example, SQLite does not have a +type for date or time or timestamps, but it has functions for handling date and +time that take ISO 8601 strings or integers that represent Unix timestamps. So +it does support most of what is possible to do with dates in other dialects, +even though it stores data with a different physical layout and uses different +functions to achieve that. + +PRQL's task is to define common description of _data formats_, just as how it +already defines common _data transformations_. + +I believe this should best be done in two steps: + +1. Define PRQL's Type System (PTS), following principles we think a relational + language should have (and not focus on what existing SQL DBMSs have). + +2. Define a mapping between SQL Type System (STS) and PTS, for each of the + DBMSs. Ideally we'd want that to be a bijection, so each type in PTS would be + represented by a single type in STS and vice-versa. Unfortunately this is not + entirely possible, as shown below. + +In practical terms, we want for a user to be able to: + +- ... express types of their database with PRQL (map their STS into PTS). In + some cases, we can allow to say "your database is not representable with PRQL, + change it or use only a subset of it". An example of what we don't want to + support are arrays with arbitrary indexes in Postgres (i.e. 2-based index for + arrays). + + This task of mapping to PTS could be automated by LSP server, by introspecting + user's SQL database and generating PRQL source. + +- ... express their SQL queries in PRQL. Again, using mapping from STS to PTS, + one should be able to express an SQL operation in PRQL. + + For example, translate MSSQL `DATEDIFF` to subtraction operator `-` in PRQL. + + For now, this mapping is manual, but should be documented and may be + automated. + +- ... use any PRQL feature in their database. Here we are mapping back from PTS + into STS. Note that STS may have changed to a different dialect. + + For example, translate PRQL's datetime operations to use TEXT in SQLite. + + As of now, prql-compiler already does a good job of automatically doing this + mapping. + +Example of the mapping between PTS and two STSs: + +| PTS | STS Postgres | STS SQLite | +| --------- | ------------ | ---------- | +| int32 | integer | INTEGER | +| int64 | bigint | INTEGER | +| timestamp | timestamp | TEXT | + +## Principles + +**Algebraic types** - have a way of expressing sum and product types. In Rust, +sum would be an enum and product would be tuple or a struct. In SQL, product +would be a row, since it can contain different types, all at once. Sum would be +harder to express, see (this +post)[https://www.parsonsmatt.org/2019/03/19/sum_types_in_sql.html ]. + +The value proposition here is that algebraic types give a lot modeling +flexibility, all while being conceptually simple. + +**Composable** - as with transformation, we'd want types to compose together. + +Using Python, JavaScript, C++ or Rust, one could define many different a data +structure that would correspond to our idea of "relation". Most of them would be +an object/struct that has column names and types and then a generic array of +arrays for rows. + +PRQL's type system should also be able to express relations as composed from +primitive types, but have only one idiomatic way of doing so. + +In practice this means that builtin types include only primitives (int, text, +bool, float), tuple (for product), enum (for sum) and array (for repeating). + +An SQL row would translate to tuple, and a relation would translate to an array +of tuples. + +I would also strive for the type system to be minimal - don't differentiate +between tuples, objects and structs. Choose one and stick to it. + +**Type constraints** - constrain a type with a predicate. For example, have a +type of int64 that are equal or greater than 10. Postgres +[does support this](https://news.ycombinator.com/item?id=34835063). The primary +value of using constrained types would not be validation (as it is used in +linked article), but when matching the type. + +Say, for example, that we have a pipeline like this: + +``` +derive color = switch [x => 'red', true => 'green'] +derive is_red = switch [color == 'red' => true, color == 'green' => false] +``` + +It should be possible to infer that `color` is of type `text`, but only when +equal to `'red'` or `'green'`. This means that the second switch covers all +possible cases and `is_red` cannot be `null`. + +## Theory + +> For any undefined terms used in this section, refer to set theory and +> mathematical definitions in general. + +A "type of a variable" is a "set of all possible values of that variable". This +means that terms "type" and "set" are equivalent in this context. + +Types (sets) can be expressions. For example, a union of two types is a type +itself. This means a type expression is equivalent to any other expression whose +type is a "set of sets". + +So let's introduce a "set" as a PRQL expression construct (alongside existing +idents, literals, ranges and so on). For now, it does not need any special +syntax. Because sets are normal expressions, existing syntax can be repurposed +to define operations on sets: + +- Binary operation `or` of two sets represents a union of those two sets: + + ``` + let number = int or float + ``` + + With algebraic types, this is named "a sum type". + +- Literals can be coerced into a singleton set (i.e. `false` is converted into a + set with only one element `false`): + + ``` + let int_or_null = int or null + ``` + +- A list of set expressions can be coerced into a set of tuples, where entries + of the tuples correspond to elements of the set expressions in the list: + + ``` + let my_row = [id = int, bool, name = str] + ``` + +- An array of set expressions with exactly one entry can be coerced into a set + of arrays of that set expression: + + ``` + let array_of_int = {int} # proposed syntax for arrays + ``` + +- A function that takes set as params and returns a set is converted into a set + of functions. + + ``` + let floor_signature = (float -> int) + # using a proposed syntax for lambda functions + ``` + +Module `std` defines built-in sets `int`, `float`, `bool`, `text` and `set`. +Other built-in sets will be added in the future. + +## Type annotations + +Let's extend the syntax for declaration of variable `a`, whose value can be +computed by evaluating `x`, with a type annotation: + +``` +let a = x +``` + +This extended syntax applies following assertions: + +- `t` can be evaluated statically (at compile time), +- `t` can be coerced into a set, +- value of `x` (and `a`) must be an element of `t`. This assertion must be + possible to evaluate statically. + +Similar rules apply to type annotations of return types of functions and +function parameter definitions. + +## Type definitions + +As shown, types can be defined by defining expressions and coercing them to set +expressions by using `< >`. + +But similar to how both `func` and `let` can be used to define functions (when +we introduce lambda function syntax), let's also define syntactic sugar for type +definitions: + +``` +# these two are equivalent +let my_type = set_expr +type my_type = set_expr +``` + +## Container types + +> Terminology is under discussion + +**Tuple** is the only product type in PTS. It contains n ordered fields, where n +is known at compile-time. Each field has a type itself and an optional name. +Fields are not necessarily of the same type. + +In other languages, similar constructs are named struct, tuple, named tuple or +(data)class. + +**Array** is a container type that contains n ordered fields, where n is not +known at compile-time. All fields are of the same type and cannot be named. + +**Relation** is an array of tuples. + +The first argument of transforms `select` and `derive` contains a known number +of entries, which can be of different types. Thus, it is a tuple. + +``` +select [1.4, false, "foo"] +``` + +## Physical layout + +_Logical type_ is user-facing the notion of a type that is the building block of +the type system. + +_Physical layout_ is the underlying memory layout of the data represented by a +variable. + +In many programming languages, physical layout of a logical type is dependent on +the target platform. Similarly, physical layout of a PRQL logical type is +dependent on representation of that type in the target STS. + +``` +PTS logical type ---> STS logical type ---> STS physical layout +``` + +Note that STS types do not have a single physical layout. Postgres has a logical +(pseudo)type `anyelement`, which is a super type of any data type. It can be +used as a function parameter type, but does not have a single physical layout so +it cannot be used in a column declaration. + +For now, PRQL does not define a physical layout of any type. It is not needed +since PRQL is not used for DDL (see section "Built-in primitives") or does not +support raw access to underlying memory. + +As a consequence, results of a PRQL query cannot be robustly compared across +DBMSs, since the physical layout of the result will vary. + +In the future, PRQL may define a common physical layout of types, probably using +Apache Arrow. + + + +## Examples + +``` +type my_relation = {[ + id = int, + title = text, + age = int +]} + +type invoices = {[ + invoice_id = int64, + issued_at = timestamp, + labels = {text} + + #[repr(json)] + items = [{ + article_id = int64, + count = int16 where x -> x >= 1, + }], + paid_by_user_id = int64 or null, + status = status, +]} +``` + +## Appendix + +### Built-in primitives + +This document mentions `int32` and `int64` as distinct types, but there is no +need for that in the initial implementation. The built-in `int` can associate +with all operations on integers and translate PRQL to valid SQL regardless of +the size of the integer. Later, `int` be replaced by +`type int = int8 or int16 or int32 or int64`. + +The general rule for "when to make a distinction between types" would be "as +soon as the types carry different information and we find an operation that +would be expressed differently". In this example, that would require some +operation on `int32` to have different syntax than same operation over `int64`. + +We can have such relaxed rule because PRQL is not aiming to be a Data Definition +Language and does not have to bother with exact physical layout of types. + +### Non-bijection cases between PTS and STS + +There are cases where a PTS construct has multiple possible and valid +representations in some STSs. + +For such cases, we'd want to have something similar to Rust's `#[repr(X)]` which +says "data in this type is represented as X" (we'd probably want a different +syntax). + +This is needed because translation from PRQL operations to SQL may depend on the +representation. + +Using SQLite as an example again, users may have some data stored as INTEGER and +some as TEXT, but would want to define both of them as PTS `timestamp`. They +would attach `#[repr(INTEGER)]` or `#[repr(TEXT)]` to the type. This would +affect how `timestamp - timestamp` is translated into SQL. INTEGER can use +normal int subtraction, but TEXT must apply `unixepoch` first. + +A similar example is a "string array type" in PTS that could be represented by a +`text[]` (if DBMS supports arrays) or `json` or it's variant `jsonb` in +Postgres. Again, the representation would affect operators: in Postgres arrays +can be access with `my_array[1]` and json uses `my_json_array -> 1`. This +example may not be applicable, if we decide that we want a separate JSON type in +PST.