Syntax for lists and arrays #2124

aljazerzen · 2023-03-12T11:32:45Z

This is continuation of discussion in #1964 (comment)

As of the latest update to RFC about type system (see linked PR), PRQL will have two container types:

tuples, which have a known number of entries that can have different types,
arrays, which have an unknown number of entries all of the same type.

Currently PRQL has "lists" with same semantics as the ones described for tuples. To limit confusion in this thread, let's use terms "tuple", "array" and "list" as I described it here and be clear when proposing changes to the names.

With terminology out of the way, let's talk about the problem: syntax.

As I see it, there are two options:

Option 1 ({} for arrays). Try to preserve current syntax. Use [] for tuples, {} for arrays,
Option 2 (JSON-like). Use {} for tuples and [] for arrays.

The text was updated successfully, but these errors were encountered:

max-sixty · 2023-03-13T04:52:42Z

How are you thinking about using Arrays? Are they just the underlying datatype? Or they would have some function in the PRQL language itself?

Are you thinking a map / dict is also a tuple? i.e. "(JSON-like)"?

For me the central tradeoff is:

Almost every language uses parentheses () for tuples (is there any that doesn't? Maybe this is the same question as map / dict above then)
We already have way too much resting on parentheses! https://prql-lang.org/book/syntax.html#precedence-and-parentheses

aljazerzen · 2023-04-13T11:35:10Z

How are you thinking about using Arrays?

Primarily, we need them to be the building block of what a "relation" is.

In the future, we will probably want to support actual arrays as a column data type.

Are you thinking a map / dict is also a tuple? i.e. "(JSON-like)"?

No, a tuple has known number of entries at the compile time. Map / dict would be another primitive data structure, which would need its own syntax.

I don't see a use-case for maps now, but I do think that we should support them eventually. Which means it would be wise to think about their syntax now.

For me the central tradeoff is:

Exactly. I'm leaning toward violating the "Almost every language uses parentheses", but I recognize that it's far from ideal.

max-sixty · 2023-05-19T07:03:21Z

As of #2585, we have:

[tuple]
{array} (not used much, only just introduced)
no struct / map / dict / record type (I'll use the term "record")

The main issue I see here is that these are very different from almost every other language, for both tuples & arrays! And there's an open question of what we can do when we want to represent a struct.

One hard-backward-incompatible change would be to do:

{tuple}
[array]
Treat a record as a tuple with named fields

I'm worried that it's very backward incompatible. (Though maybe this is the one upside of not having many users yet...). But apart from that, it has a few advantages:

It uses the familiar brackets syntax for arrays.
The only lang that uses something other than parentheses for tuples is Erlang, and it uses braces
A record is arguably equivalent to a tuple in relational algebra, but with the names inline. And a record does generally use braces.

It would also serve our current ambiguity around whether lists contain assignments:

group {title, country} (
  aggregate {  
    average gross_salary,
    sum_gross_cost = sum gross_cost,    # this was in a list but contains an assignment?
  }
)

The commented item above is instead an item in a tuple, which can contain assignments.

What do others think? One alternative is to retain backward compatibility, but use [tuple] and {array}, and we can still keep the record <> tuple part.

aljazerzen · 2023-05-19T08:30:02Z

An important point that you make is that tuples are semantically equivalent to records and structs. I'm +2 on that.

I'd treat maps / dicts distinct from tuples, as their names are not known at compile time.

With that out of the way, the syntax. I've been advocating for the {tuple} and [array] in the original type system discussion, and I still think that's the best choice. That mostly because that's what we are used to in most other languages.

I'm not that concerned with backward compatibility that much - the long-term benefits outweigh the transition cost now.

I also believe that @snth once noted that he would also prefer [arrays]. This is a big change so we should wait for him or @eitsupi to weight in before making a change.

eitsupi · 2023-05-19T11:01:40Z

{[a = 5, b = false], [a = 6, b = true]}
filter b == true

to

[{a = 5, b = false}, {a = 6, b = true}]
filter b == true

looks good!

The change to a syntax closer to the commonly used json would be worthwhile, I think.

max-sixty · 2023-05-21T17:26:46Z

I discussed on the dev call with @snth — he's keen too...

max-sixty · 2023-05-23T08:13:36Z

Would this be a straight swap? Is there anything that would remain []? How about lists of join conditions?

(I think anything but a straight swap would be more difficult, though it's still helpful to think through examples and confirm the theory corresponds to the vision)

aljazerzen · 2023-05-23T11:10:18Z

It should be a straight swap. For arrays of join conditions too.

(this is close to another thing I'd like to change: join should not take an array of conditions if we changed filter not to do so. I'd be clearer what's happening if we had join a (==id_1 && ==id_2) instead of join a {==id_1, ==id_2})

aljazerzen · 2023-05-23T11:28:09Z

Do we also rename "list" to "tuple"? I mean in documentation and code.

Having {tuple} is probably easier to remember than {list}.

aljazerzen · 2023-05-25T08:27:11Z

In #2612 @max-sixty mentioned that it would be possible to change {tuple} into (tuple):

group (e.emp_no, e.gender) (
  aggregate (
    emp_salary = average salaries.salary
  )
)

See a few comments there about pros and cons.

I'll post a few more examples here:

sort (first_name)             # scalar, no tuple
sort (first_name,)            # tuple of one element
sort (first_name, last_name)  # tuple of two elements

# all of these work because sort can take either a column directly, or a tuple of columns

filter (total == 0)   # scalar, filter is fine with it
filter (total == 0,)  # tuple, error: filter expected a scalar found a tuple

select (
   height_cm = height_inch * 2.54 | floor,  # tuple with a name and a pipeline
)
select (
   height_cm = height_inch * 2.54 | floor   # also a tuple, because there is a name, even
                                            # though there is no comma?
                                            # or maybe an error saying missing comma?
)
select (!(first_name))   # error: cannot negate a column
select (!(first_name,))  # select all columns but first_name

aljazerzen · 2023-05-25T16:09:36Z

Just today, I'm hitting into this error frequently:

    .query::<String, _>(
        "SELECT ...",
        &(state.user_id),
    );

error[E0277]: the trait bound `edgedb_protocol::model::Uuid: QueryArgs` is not satisfied
   --> src/main.rs:128:25
    |
126 |                     .query::<String, _>(
    |                      ----- required by a bound introduced by this call
127 |                         "SELECT Project.name FILTER .owner = <uuid>$0",
128 |                         &(state.user_id),
    |                         ^^^^^^^^^^^^^^^^ the trait `QueryArgs` is not implemented for `edgedb_protocol::model::Uuid`
    |
    = help: the following other types implement trait `QueryArgs`:
              ()
              (T0, T1)
              (T0, T1, T2)
              (T0, T1, T2, T3)
              (T0, T1, T2, T3, T4)
              (T0, T1, T2, T3, T4, T5)
              (T0, T1, T2, T3, T4, T5, T6)
              (T0, T1, T2, T3, T4, T5, T6, T7)
            and 6 others

... and it's not obvious that I'm missing a tuple so this becomes a tuple:

    .query::<String, _>(
        "SELECT ...",
-       &(state.user_id),
+       &(state.user_id,),
    );

Thus I'm -1 for (tuple).

To lessen the naming problem, we could call it {row} or {struct}.

snth · 2023-05-25T19:44:45Z

As I commented on #2612 , I'm also -1 on (tuple). I like the {row} naming proposal.

snth · 2023-05-25T20:23:15Z

It be clearer what's happening if we had join a (==id_1 && ==id_2) instead of join a {==id_1, ==id_2}.

I agree with this (see also #2612 (comment)).

What about having a keyword to delineate the optional join predicate instead of the parentheses?

One could go with on as in SQL, e.g. join a on ==id_1 && ==id_2. Pro: only 2 characters, Con: another reserved word.

Alternatively use filter, i.e. join a filter ==id_1 && ==id_2. Pro: leverages orthogonality and reuses existing keyword, Con: more characters.

I'm not particularly attached to this proposal but thought I would just throw it out there and type up some examples to see what it would look like and to see what others think.

snth · 2023-05-25T20:35:34Z

While we're making sweeping changes, what about using where instead of filter? I know filter is the traditional term in the FP world but realistically our target audience are data scientists and analytics engineers who are more likely to be familiar with SQL, pandas and polars and filter seems like an unnecessary departure from SQL. pandas and dplyr do use filter as well though and while pandas does have where it actually does something different. polars and kusto allow both filter and where. So I'd say it's probably balanced based on familiarity.

Another line of argument is that I've sometimes been a bit confused for a few seconds as to whether filter "filters out" rows where the condition is true or filters the dataset to keep only those where the predicate is true. Of course it's the latter but I think where makes that inherently clearer.

Also, the join example would read better with where, i.e. join a where ==id_1 && ==id_2.

aljazerzen · 2023-05-26T07:02:57Z

Let's not start making sweeping changes only because we have just done some :D We have not reached 1.0 yet, so we can make them anytime - if they are justified.

I've opened two issues about your last two comments. Let's keep this issue about syntax of tuples and arrays.

aljazerzen · 2023-05-30T16:55:55Z

@max-sixty @eitsupi what are your opinions on:

change syntax {tuple} to (tuple), or
rename {tuple} to {row}

max-sixty · 2023-05-30T17:39:20Z

change syntax {tuple} to (tuple), or

If no one else had commented, I'd be +0.1. Others make good points, and I think it's dangerous, so I'm -0.5. We can also add it later, but would be very expensive to remove, given we'd be removing only half of the usages.

rename {tuple} to {row}

Nothing is perfect:

I think {tuple} is a bit too geeky. It's perfect for "our type" of people, not for the Excel-crossover users.
{row} works in many circumstances, e.g. select {row}. But does it work with group {row}? Or even derive {row}?
{list} is easy but doesn't have the conceptual meaning

WDYT?

aljazerzen · 2023-05-30T18:11:53Z

I dislike name {tuple} the least, mostly because it comes from formal math.

snth · 2023-05-30T20:39:00Z

I really like the {tuple} syntax.

As to whether it should be {tuple}, {row} or {record} is less important to me but I guess I'm also for {tuple} as that is what they are usually called in database theory and papers (e.g. see https://en.wikipedia.org/wiki/Relational_model).

snth · 2023-05-30T21:18:27Z

For me one sign of good notation is when it brings out relationships or patterns that might not have been apparent before. I had that experience during the last dev call when @max-sixty was taking me through the [] to {} syntax change proposal.

Given the use of {} for structs in languages like Rust and C as well as for objects in Javascript and JSON, it seems natural that each element of the tuple can also have a name. We currently have this for select (e.g. select {new_name = old_name}) and aggregate (aggregate {total = sum amount}) as was pointed out above.

Now what about the following query?

from invoices
group {customer_id} (
  aggregate {
    my_total = sum total
    }
  )

Seeing group {customer_id} straight away made me wonder whether we should allow defining an alias in the group transform like group {c_id = customer_id}?

The query above translates to the following SQL

SELECT
  customer_id,
  SUM(total) AS my_total
FROM
  invoices
GROUP BY
  customer_id

which does allow defining an alias for customer_id in the SELECT clause so I think this should be possible and be allowed.

I think in PRQL you currently would need to have another select transform below the group where you alias customer_id but you would also need to repeat all the other columns unnecessarily.

This is something I had never noticed before until I saw it written with the {} syntax so to me this is a positive sign that we are on the right track with this change.

aljazerzen · 2023-05-31T06:22:45Z

Actually, this query:

from invoices
group {c_id = customer_id} (
  aggregate {
    my_total = sum total
  }
)

... produces:

SELECT
  customer_id AS c_id,
  SUM(total) AS my_total
FROM
  invoices
GROUP BY
  customer_id

... on the current version of the compiler already! This really is validation of good language design.

Ok, I think we can say that these proposals:

change syntax {tuple} to (tuple), or

rename {tuple} to {row}

... are rejected.

max-sixty · 2023-05-31T06:31:33Z

... on the current version of the compiler already! This really is validation of good language design.

Very impressive! I didn't even know it could do that...

aljazerzen · 2023-05-31T06:44:51Z

When you make language constructs independent, the orthogonality kicks in :D

snth · 2023-05-31T06:45:53Z

Yes, that's amazing! I thought I tried it in the playground last night and it didn't work but I just tried again and it also works with the old syntax in 0.8.1.

Just shows that because of the list association with [], none of us thought to try it. At least that's the way it seems to me.

aljazerzen added language-design Changes to PRQL-the-language needs-discussion Undecided dilemma labels Mar 12, 2023

aljazerzen mentioned this issue Mar 12, 2023

docs: RFC for type system #1964

Merged

aljazerzen mentioned this issue May 18, 2023

feat: arrays and type system consolidation #2585

Merged

aljazerzen mentioned this issue May 23, 2023

feat!: swap array and tuple syntax #2612

Merged

2 tasks

aljazerzen closed this as completed in #2612 May 25, 2023

aljazerzen reopened this May 25, 2023

This was referenced May 26, 2023

Join tuple equivalence #2641

Closed

Rename filter to where #2642

Closed

aljazerzen closed this as completed May 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Syntax for lists and arrays #2124

Syntax for lists and arrays #2124

aljazerzen commented Mar 12, 2023

max-sixty commented Mar 13, 2023

aljazerzen commented Apr 13, 2023

max-sixty commented May 19, 2023 •

edited

Loading

aljazerzen commented May 19, 2023 •

edited

Loading

eitsupi commented May 19, 2023

max-sixty commented May 21, 2023

max-sixty commented May 23, 2023

aljazerzen commented May 23, 2023

aljazerzen commented May 23, 2023

aljazerzen commented May 25, 2023

aljazerzen commented May 25, 2023 •

edited

Loading

snth commented May 25, 2023

snth commented May 25, 2023

snth commented May 25, 2023

aljazerzen commented May 26, 2023

aljazerzen commented May 30, 2023

max-sixty commented May 30, 2023 •

edited

Loading

aljazerzen commented May 30, 2023 •

edited

Loading

snth commented May 30, 2023

snth commented May 30, 2023

aljazerzen commented May 31, 2023

max-sixty commented May 31, 2023

aljazerzen commented May 31, 2023

snth commented May 31, 2023

Syntax for lists and arrays #2124

Syntax for lists and arrays #2124

Comments

aljazerzen commented Mar 12, 2023

max-sixty commented Mar 13, 2023

aljazerzen commented Apr 13, 2023

max-sixty commented May 19, 2023 • edited Loading

aljazerzen commented May 19, 2023 • edited Loading

eitsupi commented May 19, 2023

max-sixty commented May 21, 2023

max-sixty commented May 23, 2023

aljazerzen commented May 23, 2023

aljazerzen commented May 23, 2023

aljazerzen commented May 25, 2023

aljazerzen commented May 25, 2023 • edited Loading

snth commented May 25, 2023

snth commented May 25, 2023

snth commented May 25, 2023

aljazerzen commented May 26, 2023

aljazerzen commented May 30, 2023

max-sixty commented May 30, 2023 • edited Loading

aljazerzen commented May 30, 2023 • edited Loading

snth commented May 30, 2023

snth commented May 30, 2023

aljazerzen commented May 31, 2023

max-sixty commented May 31, 2023

aljazerzen commented May 31, 2023

snth commented May 31, 2023

max-sixty commented May 19, 2023 •

edited

Loading

aljazerzen commented May 19, 2023 •

edited

Loading

aljazerzen commented May 25, 2023 •

edited

Loading

max-sixty commented May 30, 2023 •

edited

Loading

aljazerzen commented May 30, 2023 •

edited

Loading