-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: RFC for type system #1964
Conversation
1ee2190
to
a5b360b
Compare
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall great! And thanks for writing it up!
Very much agree with the approach of starting simple — e.g. just int
rather than int64
— and seeing where we get to. At least in SQL, that's going to give us almost all the benefits, like an error on trying to add an int & a string.
One thing we could consider — though maybe you've considered it and rejected it, is to model the type system on something like DuckDB. This is what @ghalimi suggested. To the extent that the industry coalesces on that standard, it'd be good. To the extent it's just another standard, it's less good, but could still be a helpful template.
That said, I think the approaches aren't that different — both would start with some basic types and attempt to map in other type systems. And there isn't an easy way of using DuckDB's types in code (it's not in rust), so we're just using the spec.
My guess is that a core challenge here will be adding in types to the library gradually — e.g. can we get an error on 5 + "foo"
with reasonable work? There's so much we could do here — and likely we'll get lots of utility from the first 20% of work.
DuckDB or Iceberg. They're highly compatible with each other, but we like DuckDB better because it offers more control regarding encoding sizes. Here is a mapping table between the two: https://github.com/sutoiku/puffin/blob/main/TYPES.md |
I like the comparisons with Postgres, SQLite and DuckDB. I think it would also be interesting to compare to Apache Arrow |
Apache Arrow is a great candidate for that as well, and is really close to DuckDB. You can't go wrong with Arrow. |
Here is a mapping table between DuckDB, Arrow, and Iceberg types: https://github.com/sutoiku/puffin/blob/main/TYPES.md Apache Arrow is probably more "universal", but DuckDB's types look a bit cleaner to me. It's purely cosmetic though. |
To what extent should we have a single type system for:
My impulse is that they should be common, at least at the "front-end". If so, possibly we should add those to the doc. (Am happy to do it if you'd like me to) |
I've added a section on physical layout that talks about this (if I understood correctly what you meant). In summary, we don't have to define "data types" right now. |
Regarding DuckDB and Apache Iceberg: we will surely use them as a base for defining primitives, but also steer away from their handling of null and a few other quirks. |
Nulls are painful... |
Why not just use standard Arrow data types? I believe those are more common and with more databases and systems like Data Fusion and Polars providing native arrow and parquet data support, plus Arrow Flight and ADBC, PRQL target system submodules could just convert arrow data types to the target system defined/supported data types. See this api for example or arrow data types defined for other languages on their site: |
@RandomFractals as I've written, this is not talking about physical layout of types, but about logical type system in PRQL. For physical layout we will use a standard, probably Apache Arrow, but this is not needed right now. I am strongly against just adopting type system that is vaguely defined for Apache Arrow Python client or the type system of DuckDB, because it does not cohere principles described in the proposal. Again, for example, handling of null values. |
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
I really like the data types But what about Note: Interestingly enough Rust have a
|
book/src/internals/type-system.md
Outdated
type int | ||
type float | ||
type bool | ||
type text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vanillajonathan
A string data type?Most programming languages have a string datatype so that is what developers, especially front-end developers and NoSQL developers are accustomed to, but SQL uses varchar (and char). Oh, and Microsoft SQL Server also have nvarchar and nchar for Unicode.
I've used "text" here, because we found term "string" too programmer-like, which may be confusing to non-tech-savy people (see #286 (comment)). Apart from being consistent in the language, I don't have a strong preference here.
Whatever we choose, this type does not have a specific physical layout, by which I mean that I could be implemented by VARCHAR in one DBMS and by NVARCHAR in another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this was me that suggested from_text
rather than from_string
.
I'm actually +1 on using string
here, since this part is much more engineer-y, and the "Excel poweruser archetype user" won't be interacting with these.
book/src/internals/type-system.md
Outdated
# built-in types | ||
type int | ||
type float | ||
type bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vanillajonathan
A boolean data type?Maybe there could be a boolean data type?
But keep in mind that MySQL have a built-in boolean data type, while Microsoft SQL Server does not and instead uses the bit data type.
There is one proposed. And we really need it, because it's the type of all comparison expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor point — I was the person who thought from_string
was too programmer-y. I'm less against it here, since types will predominantly be handled by the programmer-y folks rather than those more used to Excel)
Yes I think so. My main point was that we already have a type system for "PRQL types" such as functions and relations — attempting to compile I'm guessing that will be part of this type system, even though our |
book/src/internals/type-system.md
Outdated
A similar example is a "string array type" in PTS that could be represented by | ||
an `text[]` (if DBMS supports arrays) or `json` or it's variant `jsonb` in | ||
Postgres. Again, the representation would affect operators: in Postgres arrays | ||
can be access with `my_array[1]` and json uses `my_json_array -> 1`. This | ||
example may not be applicable, if we decide that we want a separate JSON type in | ||
PST. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vanillajonathan
Vendor-specific data types?Some databases have non-standard vendor-specific data types, such as jsonbin PostgreSQL. One thing that bothers me about this is that just by looking at a database schema it is difficult to discern whether a data type is a standard data type or a non-standard data type, and to which DBMS it belongs to.
CSS deals with this with vendor prefixes, example Chrome might have their vendor-specific CSS property called -webkit-user-drag and Firefox might have a vendor-specific CSS property named -moz-radial-gradient, then these are denoted as such by the -webkit- and -moz- prefix respectively.
Media types addresses this with "vendor trees", example application/vnd.microsoft.portable-executable.
What I meant here, is that we may be able to support vendor-specific data types jsonb in Postgres by defining them as normal structs and lists in PRQL and mark them as having a "Postgres JSON" representation.
This would tell the compiler to use postgres's operators (i.e. ->
) for implementing standard PRQL operations (in this case field indirection - accessing a field of a struct).
book/src/internals/type-system.md
Outdated
|
||
type invoices = {[ | ||
invoice_id = int64, | ||
issued_at = timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vanillajonathan
datetime instead of timestamp?As for timestamp maybe this should be named datetime?
It is datetime in Microsoft SQL Server (documentation) which also offers the date and timedata types, so if we were to offer a date and time data type (which I think we should) then datetime would be much more suitable, coherent and predictable than timestamp.
I'm torn on this question.
datetime is more user-friendly, but also a bit misleading in that it refers to a date and a time including a timezone, when in fact it is refering to a moment, which is timezone independent.
My feeling here is that we should find a good (modern) API that deals with time and build on top of their findings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A solid API that hasn't disappointed me yet is chrono crate
Also Java 8 java.time package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fully agreed on using a good modern API for datetime logic. A major source of logic bugs in data transformations comes from datetime handling.
Chrono is definitely a great reference and personally I think there should definitely be at least two native datetime types for sql usage: datetime (with TZ) and naive_datetime (without TZ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed here: #2125
Yes, this is meant to replace or extend the type system we have now. It was built without much design work, so I lacks the big picture view. And yes, I completely forgot about the types of functions :D |
book/src/internals/type-system.md
Outdated
type my_int = int | ||
|
||
# sum type (union or enum) | ||
type number = int | float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a naive question, but I think it's hard to understand if pipe and OR use the same symbol |
.
Couldn't we use some other symbol here?
(I don't know of any other symbol that would be better suited for OR except ||
, so it might be better to leave it as is.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree - using same syntax as pipe it not ideal.
I think having ||
is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I just realized that we already use or
for logical disjunction. Can we use it here?
Let's say we have two sets:
let a = {1, 2, 3}
let b = bool
# does this make sense?
let union_of_a_and_b = a or b
# or is the better?
let union_of_a_and_b = a + b
P.S. Python uses |
(binary or) for unioning sets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it is better to unify either "&&
and ||
" or "and
and or
".
I prefer the former, but given that it is easier to distinguish from |
, maybe the latter is better.
I disagree. What exactly really is It is ambiguous.
I believe introducing |
It is an integer number. Without specified size, but with many arithmetic operations defined. That is enough for now, as the size does not affect any of our translations. If needed, we can define it as a union of all I wouldn't say this is ambiguous - type definition and all of its operations are well defined. But it may be too "dynamic"... |
Perhaps a large number of types like Apache Arrow data types will be needed when implementing something like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added suggestions
As hinted at before, I have some suggestions for the syntax of the type system. The principles guiding my suggestions are:
We currently use angle brackets
Other syntax that we have already firmly established is (square) brackets Moreover, we have already established Therefore I propose rewriting the Syntax section as follows: Syntax# built-in types
type int
type float
type bool
type text
type char
type null
# users-defined types
type my_int = int
# sum type (union or enum)
type number = int || float
type scalar = number || bool || str || char
# by default types are not nullable
type my_nullable_int = int || null
# user-defined enum
type open
type pending
type closed
type status = open || pending || closed
# product type
type my_struct = {id: int_nullable, name: str}
type my_timestamp_with_timezone = {int, text}
# list type
type list = [T]
# relations are list of structs
type my_relation = [{
id: int,
title: text,
age: int
}]
type invoices = [{
invoice_id: int64,
issued_at: timestamp,
labels: [text]
#[repr(json)]
items = [{
article_id: int64,
count: int16 where x -> x >= 1,
}],
paid_by_user_id: int64 || null,
status: status,
}] WDYT? EDIT: I didn't remember about the |
book/src/internals/type-system.md
Outdated
type status = open | pending | closed | ||
|
||
# product type | ||
type my_struct = [id = int_nullable, name = str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@snth
Other syntax that we have already firmly established is (square) brackets [] for lists. I think we should carry this over into the type system rather than introducing {} for this. Moreover, in terms of point 2. above, {} is quite commonly used for struct types in other languages such as Javascript, JSON, Python, ... .Moreover, we have already established : as syntax for named parameters, as for example in the interp function definition above. So then if we view {} as a "struct constructor", then it would make sense that it takes named parameters with : and purely positional ones (for tuples) simply separated by ,. This has the nice consequence that the resulting syntax is very similar, if not identical to, JSON.
There is a clash between what we now call "lists" and what list is in this PR.
Our select
and derive
take something that has a fixed number elements, they can be of different types and they can be named. This definition collides with what this PR proposes under the name "struct". So effectively, this PR renames "lists" to "structs".
E.g.:
select [1.4, false, "foo"] # this is now named "struct"
It also proposes a new construct named "list" which acts as "a repeated type" - number of repetitions is unknown at compile time, all elements are of the same type and they cannot be named. Think of a column in a relation.
Now, there are two options for the syntax:
- (what this PR currently proposes) try to preserve current syntax: use
[]
for structs and{}
for lists, - (what you described) mimic JSON and basically all other languages: use
{}
for structs and[]
for lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definition collides with what this PR proposes under the name "struct". So effectively, this PR renames "lists" to "structs".
E.g.:
select [1.4, false, "foo"] # this is now named "struct"
I hadn't thought about it like that, but I think that's quite accurate.
IIUC most DBs name their List type as Array, so we could have
I'm keen to keep the PRQL construct of select [a, b, c]
be called something simple, like List (and not Struct), since it's a very simple type that everyone would encounter very early.
Whereas the "repeated type" is more advanced and OK to call something technical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd then have lists and arrays. And lists would represent relation rows.
What about tuples and arrays? Is the name "tuple" approachable for non-programmer people? Would it be confusing that []
is used for tuples?
I think that tuple is very close to relational algebra and would be a good fit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that tuple is very close to relational algebra and would be a good fit here.
It's a bit foreign but I agree it's very close to relational algebra. Definitely fine in the typing docs; I'm less keen on having the opening example explaining PRQL be "Then we pass a tuple of [1.4, false, "foo"]
to select
"...
It's also unfortunate that our []
syntax is not standard tuple syntax. Though this could give us an opportunity to make our parentheses syntax even more complicated! (no)
And lists would represent relation rows.
I didn't have this in my mind — if anything the List would be closer to columns, or values. (Or maybe that's what you mean? Values in one row?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This naming gets confusing fast :D Let me clarify:
- lists (currently expressed with
[]
) or tuples (proposed) correspond to rows. - arrays (proposed) correspond to columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(edit: possibly we should push the syntax changes off to a different thread, unless it's required for typing decisions, sorry for sidetracking us)
from my_table select {num = 1.4, truthy = false, bar = "foo", double_col = col * 2} group {col_2} (derive a = sum double_col) sort {num, truthy}
I agree with the internal elegance of this. And we can say "select {a}
is equivalent to select {a=a}
.
But sort {num, truthy}
is still more alien than sort [num, truthy]
, so I think we should pause before making this change. For most people who aren't programmers, the squiggly brackets are barely used!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to further clarify what we mean by "correspond to {rows,columns}"?
Right now, this is the way I'd model our relations:
relation = array of tuples
tuple = a list of named types
From this I hope it's obvious what I mean when I say that a relation row is expressed as a tuple.
For most people who aren't programmers, the squiggly brackets are barely used!
I'd say the same thing for square brackets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @max-sixty that we should pause for a while before a change like this (if we went ahead) because it's quite fundamental. I'm actually pro it and would also favour bringing things around to be more consistent with JSON type syntax but IMHO we should take care to try and think through as many of the repercussions as we can because we don't want to make changes like this to often.
So, in terms of current release numbers I would propose something like the following (numbers are just indicative and could be incremented appropriately depending on when the work gets done):
- 0.6.0: release imminent, pretty much in "feature freeze" mode
- 0.7.0: more QOL improvements, (settle syntax deliberations)
- 0.8.0: last release with old syntax, advertise upcoming syntax change
- 0.9.0 or 0.10.0: release new syntax
My main point is that given that it might be a fair amount of work to rewrite existing queries, people should be given adequate warning and they should still be given the opportunity to benefit from the next round or two of QOL improvements before having to rewrite their queries for the new syntax.
Maybe I'm too conservative but that's how I would go about it.
I agree that we should probably move this to the Github Discussions section since we seem to have stumbled onto a bigger topic than we anticipated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuple = a list of named types
OK, makes sense.
(Let's move the syntax conversation elsewhere, I'm muddying the discussion...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's continue the discussion here: #2124
book/src/internals/type-system.md
Outdated
type null | ||
|
||
# users-defined types | ||
type my_int = int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@snth
We currently use angle brackets <> for type annotations. I have mostly seen this used in discussions in Github issues and in the PRQL std lib. To me it seems like <> acts as a "type constructor" operator (I'm not sure if that's completely correct in the most technical sense of the word) in the sense that it takes an expression and turns it into a type. I think we should try to keep that and ensure that whatever syntax we decide on stays consistent with that. What I mean by that is the following:
- While for function defintions we write
this is equivalent tofunc interp lower:0 higher x -> (x - lower) / (higher - lower)
but the first form is preferred for readability.let interp = lower:0 higher x -> (x - lower) / (higher - lower)
- Similarly for types we could(/should) have that
is equivalent totype number = int | float
where again the first is preferred for readability but they are equivalent.let number = <int | float>
This is a great point: I haven't really defined how types interact with values and what < >
means.
I think we should define a type of a variable to be "a set of all possible values of that variable". And if types are just sets, any type expression are just set expressions (e.g. int || null
is a union of two sets).
param <X>
would then mean "param can accept any value from set X".
I'd say that the syntax for your last example should then be:
type number = int || float
let number = int || float
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should define a type of a variable to be "a set of all possible values of that variable".
Nice, I like this!
And if types are just sets, any type expression are just set expressions (e.g. int || null is a union of two sets).
Minor nit: do we distinguish between singleton sets and the value they contain or are values automatically upgraded to singleton sets? For example technically it should probably be
nullable_int = int || {null}
but I think it might make sense to treat values the same as singleton sets containing them as long as that doesn't cause any problems elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we distinguish between singleton sets and the value they contain
Technically we should distinguish, but "automatic coercion to singletons" you propose would work.
The thing is that apart from this, we don't need syntax for sets so we don't need to use {}
and it remains available for other applications.
@snth For types declared over multiple lines, I think it would be nice have to use squrae brackets, so that curly braces would be enough, just like in the majority of other languages. |
I think there should be a |
Frankly, if we have this many comments in PR, maybe we should start/use design/feature ticket or discussions next time to get consensus and iron out the details. PRs this long should never happen mainly b/c not many github users actually read PR comments, but they do read issue comments and discussions. And if you want to involve your dev/users community in these discussions, maybe move them out to discussions next time. :) |
At the risk of getting meta / having another thread of discussion here — I do think anchoring in a specific proposal is very helpful; rather than a discussion on "what does anyone want to see in our type system". I find this a very difficult problem. There are tradeoffs between gathering more ideas vs. having a focused discussion; between collecting a wider range of viewpoints vs. having a more informed discussion. A couple of things I've found helpful:
We get (1) & (3) by working on this doc and then merging it, and then having another discussion if we need one. I'm fine to have that discussion in GH Discussion if folks prefer... |
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> Co-authored-by: Tobias Brandt <Tobias.Brandt@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps missed replacements?
Co-authored-by: eitsupi <50911393+eitsupi@users.noreply.github.com>
I've updated the RFC with discussed changes. Mainly these are under sections Theory, Type annotations, Type definitions and Container types. I still have to expand definitions to include functions. |
book/src/internals/type-system.md
Outdated
> For any undefined terms used in this section, refer to set theory and | ||
> mathematical definitions in general. | ||
|
||
A "type of a variable" is a "set of all possible values of that variable". This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion, but isn't the word type
more familiar than the word set
? I get the theory here. For the sake of our docs, I'd favor going back to type
when these become docs....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going back to type
Changing to "or of two types represents a union of those two types"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My focus here iz the discussion below where we start to refer to types exclusively as sets. I support making the link between the concepts, I just think when we start documenting this, we should use the term "type" rather than "set".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I'm using term "set", because it's more suitable here in theory section and because it will be implemented this way - type declarations will be just variable declarations of sets.
Let's hold on with this until we convert this to documentation.
book/src/internals/type-system.md
Outdated
But similar to how both `func` and `let` can be used to define functions (when | ||
we introduce lambda function syntax), let's also define syntactic sugar for type | ||
definitions: | ||
|
||
``` | ||
# these two are equivalent | ||
let my_type <set> = set_expr | ||
type my_type = set_expr | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I'm increasingly in favor of this approach — i.e. functions can be let add = a b -> a + b
OTOH having two ways of defining these is not ideal. So we could only do the first, and lint towards including the type annotation for set/types & functions...
book/src/internals/type-system.md
Outdated
**Tuple** is the only product type in PTS. It contains n ordered fields, where n | ||
is known at compile-time. Each field has a type itself and an optional name. | ||
Fields are not necessarily of the same type. | ||
|
||
In other languages, similar constructs are named struct, tuple, named tuple or | ||
(data)class. | ||
|
||
**Array** is a container type that contains n ordered fields, where n is not | ||
known at compile-time. All fields are of the same type and cannot be named. | ||
|
||
**Relation** is an array of tuples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clear, thanks!
book/src/internals/type-system.md
Outdated
- Binary operation `or` of two sets represents a union of those two sets: | ||
|
||
``` | ||
let number = int or float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate issue filed at #2170
I'm merging this now, as it is generally finished. That does not mean that all details described here are set in stone, but rather that the outline of the type system is ready for further discussion in separate threads. Thank you all for comments! |
No description provided.