Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: RFC for type system #1964

Merged
merged 12 commits into from
Mar 14, 2023
Merged

docs: RFC for type system #1964

merged 12 commits into from
Mar 14, 2023

Conversation

aljazerzen
Copy link
Member

No description provided.

@aljazerzen aljazerzen mentioned this pull request Feb 26, 2023
Copy link
Member

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall great! And thanks for writing it up!

Very much agree with the approach of starting simple — e.g. just int rather than int64 — and seeing where we get to. At least in SQL, that's going to give us almost all the benefits, like an error on trying to add an int & a string.


One thing we could consider — though maybe you've considered it and rejected it, is to model the type system on something like DuckDB. This is what @ghalimi suggested. To the extent that the industry coalesces on that standard, it'd be good. To the extent it's just another standard, it's less good, but could still be a helpful template.

That said, I think the approaches aren't that different — both would start with some basic types and attempt to map in other type systems. And there isn't an easy way of using DuckDB's types in code (it's not in rust), so we're just using the spec.


My guess is that a core challenge here will be adding in types to the library gradually — e.g. can we get an error on 5 + "foo" with reasonable work? There's so much we could do here — and likely we'll get lots of utility from the first 20% of work.

@ghalimi
Copy link

ghalimi commented Feb 26, 2023

Overall great! And thanks for writing it up!

Very much agree with the approach of starting simple — e.g. just int rather than int64 — and seeing where we get to. At least in SQL, that's going to give us almost all the benefits, like an error on trying to add an int & a string.

One thing we could consider — though maybe you've considered it and rejected it, is to model the type system on something like DuckDB. This is what @ghalimi suggested. To the extent that the industry coalesces on that standard, it'd be good. To the extent it's just another standard, it's less good, but could still be a helpful template.

That said, I think the approaches aren't that different — both would start with some basic types and attempt to map in other type systems. And there isn't an easy way of using DuckDB's types in code (it's not in rust), so we're just using the spec.

My guess is that a core challenge here will be adding in types to the library gradually — e.g. can we get an error on 5 + "foo" with reasonable work? There's so much we could do here — and likely we'll get lots of utility from the first 20% of work.

DuckDB or Iceberg. They're highly compatible with each other, but we like DuckDB better because it offers more control regarding encoding sizes. Here is a mapping table between the two: https://github.com/sutoiku/puffin/blob/main/TYPES.md

@adriangb
Copy link

I like the comparisons with Postgres, SQLite and DuckDB. I think it would also be interesting to compare to Apache Arrow

@ghalimi
Copy link

ghalimi commented Feb 26, 2023

I like the comparisons with Postgres, SQLite and DuckDB. I think it would also be interesting to compare to Apache Arrow

Apache Arrow is a great candidate for that as well, and is really close to DuckDB. You can't go wrong with Arrow.

@ghalimi
Copy link

ghalimi commented Feb 27, 2023

I like the comparisons with Postgres, SQLite and DuckDB. I think it would also be interesting to compare to Apache Arrow

Here is a mapping table between DuckDB, Arrow, and Iceberg types:

https://github.com/sutoiku/puffin/blob/main/TYPES.md

Apache Arrow is probably more "universal", but DuckDB's types look a bit cleaner to me. It's purely cosmetic though.

@max-sixty
Copy link
Member

To what extent should we have a single type system for:

  • "PRQL types", such as relation / column / function, and
  • "Data types", such as int / string?

My impulse is that they should be common, at least at the "front-end". If so, possibly we should add those to the doc. (Am happy to do it if you'd like me to)

@aljazerzen
Copy link
Member Author

@max-sixty

To what extent should we have a single type system for:

I've added a section on physical layout that talks about this (if I understood correctly what you meant).

In summary, we don't have to define "data types" right now.

@aljazerzen
Copy link
Member Author

Regarding DuckDB and Apache Iceberg: we will surely use them as a base for defining primitives, but also steer away from their handling of null and a few other quirks.

@ghalimi
Copy link

ghalimi commented Feb 27, 2023

Regarding DuckDB and Apache Iceberg: we will surely use them as a base for defining primitives, but also steer away from their handling of null and a few other quirks.

Nulls are painful...

@RandomFractals
Copy link

RandomFractals commented Feb 27, 2023

Why not just use standard Arrow data types?

I believe those are more common and with more databases and systems like Data Fusion and Polars providing native arrow and parquet data support, plus Arrow Flight and ADBC, PRQL target system submodules could just convert arrow data types to the target system defined/supported data types.

See this api for example or arrow data types defined for other languages on their site:

https://arrow.apache.org/docs/python/api/datatypes.html

@aljazerzen
Copy link
Member Author

@RandomFractals as I've written, this is not talking about physical layout of types, but about logical type system in PRQL. For physical layout we will use a standard, probably Apache Arrow, but this is not needed right now.

I am strongly against just adopting type system that is vaguely defined for Apache Arrow Python client or the type system of DuckDB, because it does not cohere principles described in the proposal. Again, for example, handling of null values.

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
@vanillajonathan
Copy link
Collaborator

I really like the data types int32 and int64. I've been thinking about database type systems and those two data types are exactly what I had in mind.

But what about int8 and int16?
We would need something to map to tinyint and smallint.

Note: Interestingly enough Rust have a i128 and a u128 data type.

datetime instead of timestamp?

As for timestamp maybe this should be named datetime?
It is datetime in Microsoft SQL Server (documentation) which also offers the date and timedata types, so if we were to offer a date and time data type (which I think we should) then datetime would be much more suitable, coherent and predictable than timestamp.

A boolean data type?

Maybe there could be a boolean data type?
But keep in mind that MySQL have a built-in boolean data type, while Microsoft SQL Server does not and instead uses the bit data type.

A string data type?

Most programming languages have a string datatype so that is what developers, especially front-end developers and NoSQL developers are accustomed to, but SQL uses varchar (and char). Oh, and Microsoft SQL Server also have nvarchar and nchar for Unicode.

Vendor-specific data types?

Some databases have non-standard vendor-specific data types, such as jsonbin PostgreSQL. One thing that bothers me about this is that just by looking at a database schema it is difficult to discern whether a data type is a standard data type or a non-standard data type, and to which DBMS it belongs to.

CSS deals with this with vendor prefixes, example Chrome might have their vendor-specific CSS property called -webkit-user-drag and Firefox might have a vendor-specific CSS property named -moz-radial-gradient, then these are denoted as such by the -webkit- and -moz- prefix respectively.

Media types addresses this with "vendor trees", example application/vnd.microsoft.portable-executable.

type int
type float
type bool
type text
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanillajonathan
A string data type?

Most programming languages have a string datatype so that is what developers, especially front-end developers and NoSQL developers are accustomed to, but SQL uses varchar (and char). Oh, and Microsoft SQL Server also have nvarchar and nchar for Unicode.

I've used "text" here, because we found term "string" too programmer-like, which may be confusing to non-tech-savy people (see #286 (comment)). Apart from being consistent in the language, I don't have a strong preference here.

Whatever we choose, this type does not have a specific physical layout, by which I mean that I could be implemented by VARCHAR in one DBMS and by NVARCHAR in another.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this was me that suggested from_text rather than from_string.

I'm actually +1 on using string here, since this part is much more engineer-y, and the "Excel poweruser archetype user" won't be interacting with these.

# built-in types
type int
type float
type bool
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanillajonathan
A boolean data type?

Maybe there could be a boolean data type?
But keep in mind that MySQL have a built-in boolean data type, while Microsoft SQL Server does not and instead uses the bit data type.

There is one proposed. And we really need it, because it's the type of all comparison expressions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor point — I was the person who thought from_string was too programmer-y. I'm less against it here, since types will predominantly be handled by the programmer-y folks rather than those more used to Excel)

@max-sixty
Copy link
Member

I've added a section on physical layout that talks about this (if I understood correctly what you meant).

Yes I think so. My main point was that we already have a type system for "PRQL types" such as functions and relations — attempting to compile select (f a) will fail if f isn't a function`.

I'm guessing that will be part of this type system, even though our function type will never map to a DB type, because it's evaluated before it gets to the DB...

Comment on lines 219 to 323
A similar example is a "string array type" in PTS that could be represented by
an `text[]` (if DBMS supports arrays) or `json` or it's variant `jsonb` in
Postgres. Again, the representation would affect operators: in Postgres arrays
can be access with `my_array[1]` and json uses `my_json_array -> 1`. This
example may not be applicable, if we decide that we want a separate JSON type in
PST.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanillajonathan
Vendor-specific data types?

Some databases have non-standard vendor-specific data types, such as jsonbin PostgreSQL. One thing that bothers me about this is that just by looking at a database schema it is difficult to discern whether a data type is a standard data type or a non-standard data type, and to which DBMS it belongs to.

CSS deals with this with vendor prefixes, example Chrome might have their vendor-specific CSS property called -webkit-user-drag and Firefox might have a vendor-specific CSS property named -moz-radial-gradient, then these are denoted as such by the -webkit- and -moz- prefix respectively.

Media types addresses this with "vendor trees", example application/vnd.microsoft.portable-executable.

What I meant here, is that we may be able to support vendor-specific data types jsonb in Postgres by defining them as normal structs and lists in PRQL and mark them as having a "Postgres JSON" representation.

This would tell the compiler to use postgres's operators (i.e. ->) for implementing standard PRQL operations (in this case field indirection - accessing a field of a struct).


type invoices = {[
invoice_id = int64,
issued_at = timestamp,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanillajonathan
datetime instead of timestamp?

As for timestamp maybe this should be named datetime?
It is datetime in Microsoft SQL Server (documentation) which also offers the date and timedata types, so if we were to offer a date and time data type (which I think we should) then datetime would be much more suitable, coherent and predictable than timestamp.

I'm torn on this question.

datetime is more user-friendly, but also a bit misleading in that it refers to a date and a time including a timezone, when in fact it is refering to a moment, which is timezone independent.

My feeling here is that we should find a good (modern) API that deals with time and build on top of their findings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A solid API that hasn't disappointed me yet is chrono crate

Also Java 8 java.time package

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agreed on using a good modern API for datetime logic. A major source of logic bugs in data transformations comes from datetime handling.

Chrono is definitely a great reference and personally I think there should definitely be at least two native datetime types for sql usage: datetime (with TZ) and naive_datetime (without TZ).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed here: #2125

@aljazerzen
Copy link
Member Author

I'm guessing that will be part of this type system, even though our function type will never map to a DB type, because it's evaluated before it gets to the DB...

Yes, this is meant to replace or extend the type system we have now. It was built without much design work, so I lacks the big picture view.

And yes, I completely forgot about the types of functions :D

type my_int = int

# sum type (union or enum)
type number = int | float
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a naive question, but I think it's hard to understand if pipe and OR use the same symbol |.
Couldn't we use some other symbol here?
(I don't know of any other symbol that would be better suited for OR except ||, so it might be better to leave it as is.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree - using same syntax as pipe it not ideal.

I think having || is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, I just realized that we already use or for logical disjunction. Can we use it here?

Let's say we have two sets:

let a = {1, 2, 3}
let b = bool

# does this make sense?
let union_of_a_and_b = a or b

# or is the better?
let union_of_a_and_b = a + b

P.S. Python uses | (binary or) for unioning sets.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is better to unify either "&& and ||" or "and and or".
I prefer the former, but given that it is easier to distinguish from |, maybe the latter is better.

@vanillajonathan
Copy link
Collaborator

Very much agree with the approach of starting simple — e.g. just int rather than int64.

I disagree.

What exactly really is int?

It is ambiguous.

  • Is it an alias to int32?
  • Is it an alias to int64?
  • Is it an union type that means either int32 or int64?

I believe introducing int would be a mistake. Ambiguity is bad.

@aljazerzen
Copy link
Member Author

What exactly really is int?

It is an integer number. Without specified size, but with many arithmetic operations defined. That is enough for now, as the size does not affect any of our translations. If needed, we can define it as a union of all intXX types in the future.

I wouldn't say this is ambiguous - type definition and all of its operations are well defined. But it may be too "dynamic"...

@eitsupi
Copy link
Member

eitsupi commented Mar 4, 2023

Perhaps a large number of types like Apache Arrow data types will be needed when implementing something like cast function?

Copy link
Member

@snth snth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added suggestions

@snth
Copy link
Member

snth commented Mar 5, 2023

As hinted at before, I have some suggestions for the syntax of the type system.

The principles guiding my suggestions are:

  1. Try to reuse whatever syntax we have already.
  2. Where we need to introduce new syntax, try to keep this consistent with other popular conventions out there.

We currently use angle brackets <> for type annotations. I have mostly seen this used in discussions in Github issues and in the PRQL std lib. To me it seems like <> acts as a "type constructor" operator (I'm not sure if that's completely correct in the most technical sense of the word) in the sense that it takes an expression and turns it into a type. I think we should try to keep that and ensure that whatever syntax we decide on stays consistent with that. What I mean by that is the following:

  • While for function defintions we write
    func interp lower:0 higher x -> (x - lower) / (higher - lower)
    this is equivalent to
    let interp = lower:0 higher x -> (x - lower) / (higher - lower)
    but the first form is preferred for readability.
  • Similarly for types we could(/should) have that
    type number = int | float
    is equivalent to
    let number = <int | float>
    where again the first is preferred for readability but they are equivalent.

Other syntax that we have already firmly established is (square) brackets [] for lists. I think we should carry this over into the type system rather than introducing {} for this. Moreover, in terms of point 2. above, {} is quite commonly used for struct types in other languages such as Javascript, JSON, Python, ... .

Moreover, we have already established : as syntax for named parameters, as for example in the interp function definition above. So then if we view {} as a "struct constructor", then it would make sense that it takes named parameters with : and purely positional ones (for tuples) simply separated by ,. This has the nice consequence that the resulting syntax is very similar, if not identical to, JSON.

Therefore I propose rewriting the Syntax section as follows:

Syntax

# built-in types
type int
type float
type bool
type text
type char
type null

# users-defined types
type my_int = int

# sum type (union or enum)
type number = int || float
type scalar = number || bool || str || char

# by default types are not nullable
type my_nullable_int = int || null

# user-defined enum
type open
type pending
type closed
type status = open || pending || closed

# product type
type my_struct = {id: int_nullable, name: str}
type my_timestamp_with_timezone = {int, text}

# list type
type list = [T]

# relations are list of structs
type my_relation = [{
	id: int,
	title: text,
	age: int
}]

type invoices = [{
    invoice_id: int64,
    issued_at: timestamp,
    labels: [text]

    #[repr(json)]
    items = [{
        article_id: int64,
        count: int16 where x -> x >= 1,
    }],
    paid_by_user_id: int64 || null,
    status: status,
}]

WDYT?

EDIT: I didn't remember about the #[repr(json)] bit and will have to reread that and comment on that another time.

type status = open | pending | closed

# product type
type my_struct = [id = int_nullable, name = str]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snth
Other syntax that we have already firmly established is (square) brackets [] for lists. I think we should carry this over into the type system rather than introducing {} for this. Moreover, in terms of point 2. above, {} is quite commonly used for struct types in other languages such as Javascript, JSON, Python, ... .

Moreover, we have already established : as syntax for named parameters, as for example in the interp function definition above. So then if we view {} as a "struct constructor", then it would make sense that it takes named parameters with : and purely positional ones (for tuples) simply separated by ,. This has the nice consequence that the resulting syntax is very similar, if not identical to, JSON.

There is a clash between what we now call "lists" and what list is in this PR.

Our select and derive take something that has a fixed number elements, they can be of different types and they can be named. This definition collides with what this PR proposes under the name "struct". So effectively, this PR renames "lists" to "structs".

E.g.:

select [1.4, false, "foo"] # this is now named "struct"

It also proposes a new construct named "list" which acts as "a repeated type" - number of repetitions is unknown at compile time, all elements are of the same type and they cannot be named. Think of a column in a relation.

Now, there are two options for the syntax:

  • (what this PR currently proposes) try to preserve current syntax: use [] for structs and {} for lists,
  • (what you described) mimic JSON and basically all other languages: use {} for structs and [] for lists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definition collides with what this PR proposes under the name "struct". So effectively, this PR renames "lists" to "structs".

E.g.:

select [1.4, false, "foo"] # this is now named "struct"

I hadn't thought about it like that, but I think that's quite accurate.

IIUC most DBs name their List type as Array, so we could have

I'm keen to keep the PRQL construct of select [a, b, c] be called something simple, like List (and not Struct), since it's a very simple type that everyone would encounter very early.

Whereas the "repeated type" is more advanced and OK to call something technical.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd then have lists and arrays. And lists would represent relation rows.

What about tuples and arrays? Is the name "tuple" approachable for non-programmer people? Would it be confusing that [] is used for tuples?

I think that tuple is very close to relational algebra and would be a good fit here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that tuple is very close to relational algebra and would be a good fit here.

It's a bit foreign but I agree it's very close to relational algebra. Definitely fine in the typing docs; I'm less keen on having the opening example explaining PRQL be "Then we pass a tuple of [1.4, false, "foo"] to select"...

It's also unfortunate that our [] syntax is not standard tuple syntax. Though this could give us an opportunity to make our parentheses syntax even more complicated! (no)

And lists would represent relation rows.

I didn't have this in my mind — if anything the List would be closer to columns, or values. (Or maybe that's what you mean? Values in one row?)

Copy link
Member Author

@aljazerzen aljazerzen Mar 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming gets confusing fast :D Let me clarify:

  • lists (currently expressed with []) or tuples (proposed) correspond to rows.
  • arrays (proposed) correspond to columns.

Copy link
Member

@max-sixty max-sixty Mar 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(edit: possibly we should push the syntax changes off to a different thread, unless it's required for typing decisions, sorry for sidetracking us)

from my_table
select {num = 1.4, truthy = false, bar = "foo", double_col = col * 2}
group {col_2} (derive a = sum double_col)
sort {num, truthy}

I agree with the internal elegance of this. And we can say "select {a} is equivalent to select {a=a}.

But sort {num, truthy} is still more alien than sort [num, truthy], so I think we should pause before making this change. For most people who aren't programmers, the squiggly brackets are barely used!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to further clarify what we mean by "correspond to {rows,columns}"?

Right now, this is the way I'd model our relations:

relation = array of tuples
tuple = a list of named types

From this I hope it's obvious what I mean when I say that a relation row is expressed as a tuple.


For most people who aren't programmers, the squiggly brackets are barely used!

I'd say the same thing for square brackets.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @max-sixty that we should pause for a while before a change like this (if we went ahead) because it's quite fundamental. I'm actually pro it and would also favour bringing things around to be more consistent with JSON type syntax but IMHO we should take care to try and think through as many of the repercussions as we can because we don't want to make changes like this to often.

So, in terms of current release numbers I would propose something like the following (numbers are just indicative and could be incremented appropriately depending on when the work gets done):

  • 0.6.0: release imminent, pretty much in "feature freeze" mode
  • 0.7.0: more QOL improvements, (settle syntax deliberations)
  • 0.8.0: last release with old syntax, advertise upcoming syntax change
  • 0.9.0 or 0.10.0: release new syntax

My main point is that given that it might be a fair amount of work to rewrite existing queries, people should be given adequate warning and they should still be given the opportunity to benefit from the next round or two of QOL improvements before having to rewrite their queries for the new syntax.

Maybe I'm too conservative but that's how I would go about it.


I agree that we should probably move this to the Github Discussions section since we seem to have stumbled onto a bigger topic than we anticipated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tuple = a list of named types

OK, makes sense.


(Let's move the syntax conversation elsewhere, I'm muddying the discussion...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue the discussion here: #2124

type null

# users-defined types
type my_int = int
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snth
We currently use angle brackets <> for type annotations. I have mostly seen this used in discussions in Github issues and in the PRQL std lib. To me it seems like <> acts as a "type constructor" operator (I'm not sure if that's completely correct in the most technical sense of the word) in the sense that it takes an expression and turns it into a type. I think we should try to keep that and ensure that whatever syntax we decide on stays consistent with that. What I mean by that is the following:

  • While for function defintions we write
    func interp lower:0 higher x -> (x - lower) / (higher - lower)
    
    this is equivalent to
    let interp = lower:0 higher x -> (x - lower) / (higher - lower)
    
    but the first form is preferred for readability.
  • Similarly for types we could(/should) have that
    type number = int | float
    
    is equivalent to
    let number = <int | float>
    
    where again the first is preferred for readability but they are equivalent.

This is a great point: I haven't really defined how types interact with values and what < > means.

I think we should define a type of a variable to be "a set of all possible values of that variable". And if types are just sets, any type expression are just set expressions (e.g. int || null is a union of two sets).

param <X> would then mean "param can accept any value from set X".

I'd say that the syntax for your last example should then be:

type number = int || float
let number = int || float

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should define a type of a variable to be "a set of all possible values of that variable".

Nice, I like this!

And if types are just sets, any type expression are just set expressions (e.g. int || null is a union of two sets).

Minor nit: do we distinguish between singleton sets and the value they contain or are values automatically upgraded to singleton sets? For example technically it should probably be

nullable_int = int || {null}

but I think it might make sense to treat values the same as singleton sets containing them as long as that doesn't cause any problems elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we distinguish between singleton sets and the value they contain

Technically we should distinguish, but "automatic coercion to singletons" you propose would work.

The thing is that apart from this, we don't need syntax for sets so we don't need to use {} and it remains available for other applications.

@vanillajonathan
Copy link
Collaborator

@snth For types declared over multiple lines, I think it would be nice have to use squrae brackets, so that curly braces would be enough, just like in the majority of other languages.

@vanillajonathan
Copy link
Collaborator

I think there should be a uuid type. Example 123e4567-e89b-12d3-a456-426614174000.

@RandomFractals
Copy link

RandomFractals commented Mar 7, 2023

Frankly, if we have this many comments in PR, maybe we should start/use design/feature ticket or discussions next time to get consensus and iron out the details. PRs this long should never happen mainly b/c not many github users actually read PR comments, but they do read issue comments and discussions. And if you want to involve your dev/users community in these discussions, maybe move them out to discussions next time. :)

@max-sixty
Copy link
Member

max-sixty commented Mar 7, 2023

if we have this many comments in PR, maybe we should start/use design/feature ticket or discussions next time to get consensus and iron out the details.

At the risk of getting meta / having another thread of discussion here — I do think anchoring in a specific proposal is very helpful; rather than a discussion on "what does anyone want to see in our type system".

I find this a very difficult problem. There are tradeoffs between gathering more ideas vs. having a focused discussion; between collecting a wider range of viewpoints vs. having a more informed discussion.

A couple of things I've found helpful:

  1. Anchor in an actual proposal
  2. Move other threads elsewhere, to keep the discussion focused
  3. Produce a summary, so folks don't need to read the whole thing to catch up on the conclusions. Can do midway through to outline the current state of the the discussion and then restart.

We get (1) & (3) by working on this doc and then merging it, and then having another discussion if we need one. I'm fine to have that discussion in GH Discussion if folks prefer...

aljazerzen and others added 3 commits March 11, 2023 11:48
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Co-authored-by: Tobias Brandt <Tobias.Brandt@gmail.com>
Copy link
Member

@eitsupi eitsupi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps missed replacements?

Co-authored-by: eitsupi <50911393+eitsupi@users.noreply.github.com>
@aljazerzen
Copy link
Member Author

I've updated the RFC with discussed changes. Mainly these are under sections Theory, Type annotations, Type definitions and Container types. I still have to expand definitions to include functions.

@aljazerzen aljazerzen mentioned this pull request Mar 12, 2023
> For any undefined terms used in this section, refer to set theory and
> mathematical definitions in general.

A "type of a variable" is a "set of all possible values of that variable". This
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion, but isn't the word type more familiar than the word set? I get the theory here. For the sake of our docs, I'd favor going back to type when these become docs....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going back to type

Changing to "or of two types represents a union of those two types"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My focus here iz the discussion below where we start to refer to types exclusively as sets. I support making the link between the concepts, I just think when we start documenting this, we should use the term "type" rather than "set".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I'm using term "set", because it's more suitable here in theory section and because it will be implemented this way - type declarations will be just variable declarations of sets.

Let's hold on with this until we convert this to documentation.

Comment on lines 184 to 192
But similar to how both `func` and `let` can be used to define functions (when
we introduce lambda function syntax), let's also define syntactic sugar for type
definitions:

```
# these two are equivalent
let my_type <set> = set_expr
type my_type = set_expr
```
Copy link
Member

@max-sixty max-sixty Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I'm increasingly in favor of this approach — i.e. functions can be let add = a b -> a + b

OTOH having two ways of defining these is not ideal. So we could only do the first, and lint towards including the type annotation for set/types & functions...

Comment on lines 198 to 208
**Tuple** is the only product type in PTS. It contains n ordered fields, where n
is known at compile-time. Each field has a type itself and an optional name.
Fields are not necessarily of the same type.

In other languages, similar constructs are named struct, tuple, named tuple or
(data)class.

**Array** is a container type that contains n ordered fields, where n is not
known at compile-time. All fields are of the same type and cannot be named.

**Relation** is an array of tuples.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clear, thanks!

- Binary operation `or` of two sets represents a union of those two sets:

```
let number = int or float
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate issue filed at #2170

@aljazerzen
Copy link
Member Author

I'm merging this now, as it is generally finished. That does not mean that all details described here are set in stone, but rather that the outline of the type system is ready for further discussion in separate threads.

Thank you all for comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants