Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: RFC for type system #1964

Merged
merged 12 commits into from
Mar 14, 2023
323 changes: 323 additions & 0 deletions book/src/internals/type-system.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,323 @@
# Type system

> For variables, the type system determines the allowed values of that term.
>
> -- Wikipedia

## Purpose

Each of the SQL DBMSs has their own type system. Thanks to SQL standard, they
are very similar, but have key differences. For example, SQLite does not have a
type for date or time or timestamps, but it has functions for handling date and
time that take ISO 8601 strings or integers that represent Unix timestamps. So
it does support most of what is possible to do with dates in other dialects,
even though it stores data with a different physical layout and uses different
functions to achieve that.

PRQL's task is to define common description of _data formats_, just as how it
already defines common _data transformations_.

I believe this should best be done in two steps:

1. Define PRQL's Type System (PTS), following principles we think a relational
language should have (and not focus on what existing SQL DBMSs have).

2. Define a mapping between SQL Type System (STS) and PTS, for each of the
DBMSs. Ideally we'd want that to be a bijection, so each type in PTS would be
represented by a single type in STS and vice-versa. Unfortunately this is not
entirely possible, as shown below.

In practical terms, we want for a user to be able to:

- ... express types of their database with PRQL (map their STS into PTS). In
some cases, we can allow to say "your database is not representable with PRQL,
change it or use only a subset of it". An example of what we don't want to
support are arrays with arbitrary indexes in Postgres (i.e. 2-based index for
arrays).

This task of mapping to PTS could be automated by LSP server, by introspecting
user's SQL database and generating PRQL source.

- ... express their SQL queries in PRQL. Again, using mapping from STS to PTS,
one should be able to express an SQL operation in PRQL.

For example, translate MSSQL `DATEDIFF` to subtraction operator `-` in PRQL.

For now, this mapping is manual, but should be documented and may be
automated.

- ... use any PRQL feature in their database. Here we are mapping back from PTS
into STS. Note that STS may have changed to a different dialect.

For example, translate PRQL's datetime operations to use TEXT in SQLite.

As of now, prql-compiler already does a good job of automatically doing this
mapping.

Example of the mapping between PTS and two STSs:

| PTS | STS Postgres | STS SQLite |
| --------- | ------------ | ---------- |
| int32 | integer | INTEGER |
| int64 | bigint | INTEGER |
| timestamp | timestamp | TEXT |

## Principles

**Algebraic types** - have a way of expressing sum and product types. In Rust,
sum would be an enum and product would be tuple or a struct. In SQL, product
would be a row, since it can contain different types, all at once. Sum would be
harder to express, see (this
post)[https://www.parsonsmatt.org/2019/03/19/sum_types_in_sql.html ].

The value proposition here is that algebraic types give a lot modeling
flexibility, all while being conceptually simple.

**Composable** - as with transformation, we'd want types to compose together.

Using Python, JavaScript, C++ or Rust, one could define many different a data
structure that would correspond to our idea of "relation". Most of them would be
an object/struct that has column names and types and then a generic array of
arrays for rows.

PRQL's type system should also be able to express relations as composed from
primitive types, but have only one idiomatic way of doing so.

In practice this means that builtin types include only primitives (int, text,
bool, float), tuple (for product), enum (for sum) and array (for repeating).

An SQL row would translate to tuple, and a relation would translate to an array
of tuples.

I would also strive for the type system to be minimal - don't differentiate
between tuples, objects and structs. Choose one and stick to it.

**Type constraints** - constrain a type with a predicate. For example, have a
type of int64 that are equal or greater than 10. Postgres
[does support this](https://news.ycombinator.com/item?id=34835063). The primary
value of using constrained types would not be validation (as it is used in
linked article), but when matching the type.

Say, for example, that we have a pipeline like this:

```
derive color = switch [x => 'red', true => 'green']
derive is_red = switch [color == 'red' => true, color == 'green' => false]
```

It should be possible to infer that `color` is of type `text`, but only when
equal to `'red'` or `'green'`. This means that the second switch covers all
possible cases and `is_red` cannot be `null`.

## Theory

> For any undefined terms used in this section, refer to set theory and
> mathematical definitions in general.

A "type of a variable" is a "set of all possible values of that variable". This
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion, but isn't the word type more familiar than the word set? I get the theory here. For the sake of our docs, I'd favor going back to type when these become docs....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going back to type

Changing to "or of two types represents a union of those two types"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My focus here iz the discussion below where we start to refer to types exclusively as sets. I support making the link between the concepts, I just think when we start documenting this, we should use the term "type" rather than "set".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I'm using term "set", because it's more suitable here in theory section and because it will be implemented this way - type declarations will be just variable declarations of sets.

Let's hold on with this until we convert this to documentation.

means that terms "type" and "set" are equivalent in this context.

Types (sets) can be expressions. For example, a union of two types is a type
itself. This means a type expression is equivalent to any other expression whose
type is a "set of sets".

So let's introduce a "set" as a PRQL expression construct (alongside existing
idents, literals, ranges and so on). For now, it does not need any special
syntax. Because sets are normal expressions, existing syntax can be repurposed
to define operations on sets:

- Binary operation `or` of two sets represents a union of those two sets:

```
let number = int or float
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate issue filed at #2170

```

With algebraic types, this is named "a sum type".

- Literals can be coerced into a singleton set (i.e. `false` is converted into a
set with only one element `false`):

```
let int_or_null = int or null
```

- A list of set expressions can be coerced into a set of tuples, where entries
of the tuples correspond to elements of the set expressions in the list:

```
let my_row = [id = int, bool, name = str]
```

- An array of set expressions with exactly one entry can be coerced into a set
of arrays of that set expression:
```
let array_of_int = {int} # proposed syntax for arrays
```

Module `std` defines built-in sets `int`, `float`, `bool`, `text` and `set`.
Other built-in sets will be added in the future.

## Type annotations

Let's extend the syntax for declaration of variable `a`, whose value can be
computed by evaluating `x`, with a type annotation:

```
let a <t> = x
```

This extended syntax applies following assertions:

- `t` can be evaluated statically (at compile time),
- `t` can be coerced into a set,
- value of `x` (and `a`) must be an element of `t`. This assertion must be
possible to evaluate statically.

Similar rules apply to type annotations of return types of functions and
function parameter definitions.

## Type definitions

As shown, types can be defined by defining expressions and coercing them to set
expressions by using `< >`.

But similar to how both `func` and `let` can be used to define functions (when
we introduce lambda function syntax), let's also define syntactic sugar for type
definitions:

```
# these two are equivalent
let my_type <set> = set_expr
type my_type = set_expr
```
Copy link
Member

@max-sixty max-sixty Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I'm increasingly in favor of this approach — i.e. functions can be let add = a b -> a + b

OTOH having two ways of defining these is not ideal. So we could only do the first, and lint towards including the type annotation for set/types & functions...


## Container types

> Terminology is under discussion

**Tuple** is the only product type in PTS. It contains n ordered fields, where n
is known at compile-time. Each field has a type itself and an optional name.
Fields are not necessarily of the same type.

In other languages, similar constructs are named struct, tuple, named tuple or
(data)class.

**Array** is a container type that contains n ordered fields, where n is not
known at compile-time. All fields are of the same type and cannot be named.

**Relation** is an array of tuples.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clear, thanks!


The first argument of transforms `select` and `derive` contains a known number
of entries, which can be of different types. Thus, it is a tuple.

```
select [1.4, false, "foo"]
```

## Physical layout

_Logical type_ is user-facing the notion of a type that is the building block of
the type system.

_Physical layout_ is the underlying memory layout of the data represented by a
variable.

In many programming languages, physical layout of a logical type is dependent on
the target platform. Similarly, physical layout of a PRQL logical type is
dependent on representation of that type in the target STS.

```
PTS logical type ---> STS logical type ---> STS physical layout
```

Note that STS types do not have a single physical layout. Postgres has a logical
(pseudo)type `anyelement`, which is a super type of any data type. It can be
used as a function parameter type, but does not have a single physical layout so
it cannot be used in a column declaration.

For now, PRQL does not define a physical layout of any type. It is not needed
since PRQL is not used for DDL (see section "Built-in primitives") or does not
support raw access to underlying memory.

As a consequence, results of a PRQL query cannot be robustly compared across
DBMSs, since the physical layout of the result will vary.

In the future, PRQL may define a common physical layout of types, probably using
Apache Arrow.

<!-- ## Enums

```
# user-defined enum
type open
type pending
type closed
type status = open or pending or closed
``` -->

## Examples

```
type my_relation = {[
id = int,
title = text,
age = int
]}

type invoices = {[
invoice_id = int64,
issued_at = timestamp,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanillajonathan
datetime instead of timestamp?

As for timestamp maybe this should be named datetime?
It is datetime in Microsoft SQL Server (documentation) which also offers the date and timedata types, so if we were to offer a date and time data type (which I think we should) then datetime would be much more suitable, coherent and predictable than timestamp.

I'm torn on this question.

datetime is more user-friendly, but also a bit misleading in that it refers to a date and a time including a timezone, when in fact it is refering to a moment, which is timezone independent.

My feeling here is that we should find a good (modern) API that deals with time and build on top of their findings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A solid API that hasn't disappointed me yet is chrono crate

Also Java 8 java.time package

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agreed on using a good modern API for datetime logic. A major source of logic bugs in data transformations comes from datetime handling.

Chrono is definitely a great reference and personally I think there should definitely be at least two native datetime types for sql usage: datetime (with TZ) and naive_datetime (without TZ).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed here: #2125

labels = {text}

#[repr(json)]
items = [{
article_id = int64,
count = int16 where x -> x >= 1,
}],
paid_by_user_id = int64 or null,
status = status,
]}
```

## Appendix

### Built-in primitives

This document mentions `int32` and `int64` as distinct types, but there is no
need for that in the initial implementation. The built-in `int` can associate
with all operations on integers and translate PRQL to valid SQL regardless of
the size of the integer. Later, `int` be replaced by
`type int = int8 or int16 or int32 or int64`.

The general rule for "when to make a distinction between types" would be "as
soon as the types carry different information and we find an operation that
would be expressed differently". In this example, that would require some
operation on `int32` to have different syntax than same operation over `int64`.

We can have such relaxed rule because PRQL is not aiming to be a Data Definition
Language and does not have to bother with exact physical layout of types.

### Non-bijection cases between PTS and STS

There are cases where a PTS construct has multiple possible and valid
representations in some STSs.

For such cases, we'd want to have something similar to Rust's `#[repr(X)]` which
says "data in this type is represented as X" (we'd probably want a different
syntax).

This is needed because translation from PRQL operations to SQL may depend on the
representation.

Using SQLite as an example again, users may have some data stored as INTEGER and
some as TEXT, but would want to define both of them as PTS `timestamp`. They
would attach `#[repr(INTEGER)]` or `#[repr(TEXT)]` to the type. This would
affect how `timestamp - timestamp` is translated into SQL. INTEGER can use
normal int subtraction, but TEXT must apply `unixepoch` first.

A similar example is a "string array type" in PTS that could be represented by a
`text[]` (if DBMS supports arrays) or `json` or it's variant `jsonb` in
Postgres. Again, the representation would affect operators: in Postgres arrays
can be access with `my_array[1]` and json uses `my_json_array -> 1`. This
example may not be applicable, if we decide that we want a separate JSON type in
PST.