docs: RFC for type system (#1964)

PRQL · Mar 14, 2023 · cb13428 · cb13428
1 parent d08b13a
commit cb13428
Showing 1 changed file with 332 additions and 0 deletions.
diff --git a/web/book/src/internals/type-system.md b/web/book/src/internals/type-system.md
@@ -0,0 +1,332 @@
+# Type system
+
+> For variables, the type system determines the allowed values of that term.
+>
+> -- Wikipedia
+
+## Purpose
+
+Each of the SQL DBMSs has their own type system. Thanks to SQL standard, they
+are very similar, but have key differences. For example, SQLite does not have a
+type for date or time or timestamps, but it has functions for handling date and
+time that take ISO 8601 strings or integers that represent Unix timestamps. So
+it does support most of what is possible to do with dates in other dialects,
+even though it stores data with a different physical layout and uses different
+functions to achieve that.
+
+PRQL's task is to define common description of _data formats_, just as how it
+already defines common _data transformations_.
+
+I believe this should best be done in two steps:
+
+1. Define PRQL's Type System (PTS), following principles we think a relational
+   language should have (and not focus on what existing SQL DBMSs have).
+
+2. Define a mapping between SQL Type System (STS) and PTS, for each of the
+   DBMSs. Ideally we'd want that to be a bijection, so each type in PTS would be
+   represented by a single type in STS and vice-versa. Unfortunately this is not
+   entirely possible, as shown below.
+
+In practical terms, we want for a user to be able to:
+
+- ... express types of their database with PRQL (map their STS into PTS). In
+  some cases, we can allow to say "your database is not representable with PRQL,
+  change it or use only a subset of it". An example of what we don't want to
+  support are arrays with arbitrary indexes in Postgres (i.e. 2-based index for
+  arrays).
+
+  This task of mapping to PTS could be automated by LSP server, by introspecting
+  user's SQL database and generating PRQL source.
+
+- ... express their SQL queries in PRQL. Again, using mapping from STS to PTS,
+  one should be able to express an SQL operation in PRQL.
+
+  For example, translate MSSQL `DATEDIFF` to subtraction operator `-` in PRQL.
+
+  For now, this mapping is manual, but should be documented and may be
+  automated.
+
+- ... use any PRQL feature in their database. Here we are mapping back from PTS
+  into STS. Note that STS may have changed to a different dialect.
+
+  For example, translate PRQL's datetime operations to use TEXT in SQLite.
+
+  As of now, prql-compiler already does a good job of automatically doing this
+  mapping.
+
+Example of the mapping between PTS and two STSs:
+
+| PTS       | STS Postgres | STS SQLite |
+| --------- | ------------ | ---------- |
+| int32     | integer      | INTEGER    |
+| int64     | bigint       | INTEGER    |
+| timestamp | timestamp    | TEXT       |
+
+## Principles
+
+**Algebraic types** - have a way of expressing sum and product types. In Rust,
+sum would be an enum and product would be tuple or a struct. In SQL, product
+would be a row, since it can contain different types, all at once. Sum would be
+harder to express, see (this
+post)[https://www.parsonsmatt.org/2019/03/19/sum_types_in_sql.html ].
+
+The value proposition here is that algebraic types give a lot modeling
+flexibility, all while being conceptually simple.
+
+**Composable** - as with transformation, we'd want types to compose together.
+
+Using Python, JavaScript, C++ or Rust, one could define many different a data
+structure that would correspond to our idea of "relation". Most of them would be
+an object/struct that has column names and types and then a generic array of
+arrays for rows.
+
+PRQL's type system should also be able to express relations as composed from
+primitive types, but have only one idiomatic way of doing so.
+
+In practice this means that builtin types include only primitives (int, text,
+bool, float), tuple (for product), enum (for sum) and array (for repeating).
+
+An SQL row would translate to tuple, and a relation would translate to an array
+of tuples.
+
+I would also strive for the type system to be minimal - don't differentiate
+between tuples, objects and structs. Choose one and stick to it.
+
+**Type constraints** - constrain a type with a predicate. For example, have a
+type of int64 that are equal or greater than 10. Postgres
+[does support this](https://news.ycombinator.com/item?id=34835063). The primary
+value of using constrained types would not be validation (as it is used in
+linked article), but when matching the type.
+
+Say, for example, that we have a pipeline like this:
+
+```
+derive color = switch [x => 'red', true => 'green']
+derive is_red = switch [color == 'red' => true, color == 'green' => false]
+```
+
+It should be possible to infer that `color` is of type `text`, but only when
+equal to `'red'` or `'green'`. This means that the second switch covers all
+possible cases and `is_red` cannot be `null`.
+
+## Theory
+
+> For any undefined terms used in this section, refer to set theory and
+> mathematical definitions in general.
+
+A "type of a variable" is a "set of all possible values of that variable". This
+means that terms "type" and "set" are equivalent in this context.
+
+Types (sets) can be expressions. For example, a union of two types is a type
+itself. This means a type expression is equivalent to any other expression whose
+type is a "set of sets".
+
+So let's introduce a "set" as a PRQL expression construct (alongside existing
+idents, literals, ranges and so on). For now, it does not need any special
+syntax. Because sets are normal expressions, existing syntax can be repurposed
+to define operations on sets:
+
+- Binary operation `or` of two sets represents a union of those two sets:
+
+  ```
+  let number = int or float
+  ```
+
+  With algebraic types, this is named "a sum type".
+
+- Literals can be coerced into a singleton set (i.e. `false` is converted into a
+  set with only one element `false`):
+
+  ```
+  let int_or_null = int or null
+  ```
+
+- A list of set expressions can be coerced into a set of tuples, where entries
+  of the tuples correspond to elements of the set expressions in the list:
+
+  ```
+  let my_row = [id = int, bool, name = str]
+  ```
+
+- An array of set expressions with exactly one entry can be coerced into a set
+  of arrays of that set expression:
+
+  ```
+  let array_of_int = {int} # proposed syntax for arrays
+  ```
+
+- A function that takes set as params and returns a set is converted into a set
+  of functions.
+
+  ```
+  let floor_signature = (float -> int)
+  # using a proposed syntax for lambda functions
+  ```
+
+Module `std` defines built-in sets `int`, `float`, `bool`, `text` and `set`.
+Other built-in sets will be added in the future.
+
+## Type annotations
+
+Let's extend the syntax for declaration of variable `a`, whose value can be
+computed by evaluating `x`, with a type annotation:
+
+```
+let a <t> = x
+```
+
+This extended syntax applies following assertions:
+
+- `t` can be evaluated statically (at compile time),
+- `t` can be coerced into a set,
+- value of `x` (and `a`) must be an element of `t`. This assertion must be
+  possible to evaluate statically.
+
+Similar rules apply to type annotations of return types of functions and
+function parameter definitions.
+
+## Type definitions
+
+As shown, types can be defined by defining expressions and coercing them to set
+expressions by using `< >`.
+
+But similar to how both `func` and `let` can be used to define functions (when
+we introduce lambda function syntax), let's also define syntactic sugar for type
+definitions:
+
+```
+# these two are equivalent
+let my_type <set> = set_expr
+type my_type = set_expr
+```
+
+## Container types
+
+> Terminology is under discussion
+
+**Tuple** is the only product type in PTS. It contains n ordered fields, where n
+is known at compile-time. Each field has a type itself and an optional name.
+Fields are not necessarily of the same type.
+
+In other languages, similar constructs are named struct, tuple, named tuple or
+(data)class.
+
+**Array** is a container type that contains n ordered fields, where n is not
+known at compile-time. All fields are of the same type and cannot be named.
+
+**Relation** is an array of tuples.
+
+The first argument of transforms `select` and `derive` contains a known number
+of entries, which can be of different types. Thus, it is a tuple.
+
+```
+select [1.4, false, "foo"]
+```
+
+## Physical layout
+
+_Logical type_ is user-facing the notion of a type that is the building block of
+the type system.
+
+_Physical layout_ is the underlying memory layout of the data represented by a
+variable.
+
+In many programming languages, physical layout of a logical type is dependent on
+the target platform. Similarly, physical layout of a PRQL logical type is
+dependent on representation of that type in the target STS.
+
+```
+PTS logical type  --->  STS logical type  ---> STS physical layout
+```
+
+Note that STS types do not have a single physical layout. Postgres has a logical
+(pseudo)type `anyelement`, which is a super type of any data type. It can be
+used as a function parameter type, but does not have a single physical layout so
+it cannot be used in a column declaration.
+
+For now, PRQL does not define a physical layout of any type. It is not needed
+since PRQL is not used for DDL (see section "Built-in primitives") or does not
+support raw access to underlying memory.
+
+As a consequence, results of a PRQL query cannot be robustly compared across
+DBMSs, since the physical layout of the result will vary.
+
+In the future, PRQL may define a common physical layout of types, probably using
+Apache Arrow.
+
+<!-- ## Enums
+
+```
+# user-defined enum
+type open
+type pending
+type closed
+type status = open or pending or closed
+``` -->
+
+## Examples
+
+```
+type my_relation = {[
+	id = int,
+	title = text,
+	age = int
+]}
+
+type invoices = {[
+    invoice_id = int64,
+    issued_at = timestamp,
+    labels = {text}
+
+    #[repr(json)]
+    items = [{
+        article_id = int64,
+        count = int16 where x -> x >= 1,
+    }],
+    paid_by_user_id = int64 or null,
+    status = status,
+]}
+```
+
+## Appendix
+
+### Built-in primitives
+
+This document mentions `int32` and `int64` as distinct types, but there is no
+need for that in the initial implementation. The built-in `int` can associate
+with all operations on integers and translate PRQL to valid SQL regardless of
+the size of the integer. Later, `int` be replaced by
+`type int = int8 or int16 or int32 or int64`.
+
+The general rule for "when to make a distinction between types" would be "as
+soon as the types carry different information and we find an operation that
+would be expressed differently". In this example, that would require some
+operation on `int32` to have different syntax than same operation over `int64`.
+
+We can have such relaxed rule because PRQL is not aiming to be a Data Definition
+Language and does not have to bother with exact physical layout of types.
+
+### Non-bijection cases between PTS and STS
+
+There are cases where a PTS construct has multiple possible and valid
+representations in some STSs.
+
+For such cases, we'd want to have something similar to Rust's `#[repr(X)]` which
+says "data in this type is represented as X" (we'd probably want a different
+syntax).
+
+This is needed because translation from PRQL operations to SQL may depend on the
+representation.
+
+Using SQLite as an example again, users may have some data stored as INTEGER and
+some as TEXT, but would want to define both of them as PTS `timestamp`. They
+would attach `#[repr(INTEGER)]` or `#[repr(TEXT)]` to the type. This would
+affect how `timestamp - timestamp` is translated into SQL. INTEGER can use
+normal int subtraction, but TEXT must apply `unixepoch` first.
+
+A similar example is a "string array type" in PTS that could be represented by a
+`text[]` (if DBMS supports arrays) or `json` or it's variant `jsonb` in
+Postgres. Again, the representation would affect operators: in Postgres arrays
+can be access with `my_array[1]` and json uses `my_json_array -> 1`. This
+example may not be applicable, if we decide that we want a separate JSON type in
+PST.