Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax for lists and arrays #2124

Closed
aljazerzen opened this issue Mar 12, 2023 · 24 comments · Fixed by #2612
Closed

Syntax for lists and arrays #2124

aljazerzen opened this issue Mar 12, 2023 · 24 comments · Fixed by #2612
Labels
language-design Changes to PRQL-the-language needs-discussion Undecided dilemma

Comments

@aljazerzen
Copy link
Member

This is continuation of discussion in #1964 (comment)

As of the latest update to RFC about type system (see linked PR), PRQL will have two container types:

  • tuples, which have a known number of entries that can have different types,

  • arrays, which have an unknown number of entries all of the same type.

Currently PRQL has "lists" with same semantics as the ones described for tuples. To limit confusion in this thread, let's use terms "tuple", "array" and "list" as I described it here and be clear when proposing changes to the names.

With terminology out of the way, let's talk about the problem: syntax.

As I see it, there are two options:

  • Option 1 ({} for arrays). Try to preserve current syntax. Use [] for tuples, {} for arrays,
  • Option 2 (JSON-like). Use {} for tuples and [] for arrays.
@aljazerzen aljazerzen added language-design Changes to PRQL-the-language needs-discussion Undecided dilemma labels Mar 12, 2023
@max-sixty
Copy link
Member

How are you thinking about using Arrays? Are they just the underlying datatype? Or they would have some function in the PRQL language itself?

Are you thinking a map / dict is also a tuple? i.e. "(JSON-like)"?


For me the central tradeoff is:

@aljazerzen
Copy link
Member Author

How are you thinking about using Arrays?

Primarily, we need them to be the building block of what a "relation" is.

In the future, we will probably want to support actual arrays as a column data type.

Are you thinking a map / dict is also a tuple? i.e. "(JSON-like)"?

No, a tuple has known number of entries at the compile time. Map / dict would be another primitive data structure, which would need its own syntax.

I don't see a use-case for maps now, but I do think that we should support them eventually. Which means it would be wise to think about their syntax now.

For me the central tradeoff is:

Exactly. I'm leaning toward violating the "Almost every language uses parentheses", but I recognize that it's far from ideal.

@max-sixty
Copy link
Member

max-sixty commented May 19, 2023

As of #2585, we have:

  • [tuple]
  • {array} (not used much, only just introduced)
  • no struct / map / dict / record type (I'll use the term "record")

The main issue I see here is that these are very different from almost every other language, for both tuples & arrays! And there's an open question of what we can do when we want to represent a struct.

One hard-backward-incompatible change would be to do:

  • {tuple}
  • [array]
  • Treat a record as a tuple with named fields

I'm worried that it's very backward incompatible. (Though maybe this is the one upside of not having many users yet...). But apart from that, it has a few advantages:

  • It uses the familiar brackets syntax for arrays.
  • The only lang that uses something other than parentheses for tuples is Erlang, and it uses braces
  • A record is arguably equivalent to a tuple in relational algebra, but with the names inline. And a record does generally use braces.

It would also serve our current ambiguity around whether lists contain assignments:

group {title, country} (
  aggregate {  
    average gross_salary,
    sum_gross_cost = sum gross_cost,    # this was in a list but contains an assignment?
  }
)

The commented item above is instead an item in a tuple, which can contain assignments.


What do others think? One alternative is to retain backward compatibility, but use [tuple] and {array}, and we can still keep the record <> tuple part.

@aljazerzen
Copy link
Member Author

aljazerzen commented May 19, 2023

An important point that you make is that tuples are semantically equivalent to records and structs. I'm +2 on that.

I'd treat maps / dicts distinct from tuples, as their names are not known at compile time.


With that out of the way, the syntax. I've been advocating for the {tuple} and [array] in the original type system discussion, and I still think that's the best choice. That mostly because that's what we are used to in most other languages.

I'm not that concerned with backward compatibility that much - the long-term benefits outweigh the transition cost now.


I also believe that @snth once noted that he would also prefer [arrays]. This is a big change so we should wait for him or @eitsupi to weight in before making a change.

@eitsupi
Copy link
Member

eitsupi commented May 19, 2023

{[a = 5, b = false], [a = 6, b = true]}
filter b == true

to

[{a = 5, b = false}, {a = 6, b = true}]
filter b == true

looks good!

The change to a syntax closer to the commonly used json would be worthwhile, I think.

@max-sixty
Copy link
Member

I discussed on the dev call with @snth — he's keen too...

@max-sixty
Copy link
Member

Would this be a straight swap? Is there anything that would remain []? How about lists of join conditions?

(I think anything but a straight swap would be more difficult, though it's still helpful to think through examples and confirm the theory corresponds to the vision)

@aljazerzen
Copy link
Member Author

It should be a straight swap. For arrays of join conditions too.

(this is close to another thing I'd like to change: join should not take an array of conditions if we changed filter not to do so. I'd be clearer what's happening if we had join a (==id_1 && ==id_2) instead of join a {==id_1, ==id_2})

@aljazerzen
Copy link
Member Author

Do we also rename "list" to "tuple"? I mean in documentation and code.

Having {tuple} is probably easier to remember than {list}.

@aljazerzen
Copy link
Member Author

In #2612 @max-sixty mentioned that it would be possible to change {tuple} into (tuple):

group (e.emp_no, e.gender) (
  aggregate (
    emp_salary = average salaries.salary
  )
)

See a few comments there about pros and cons.

I'll post a few more examples here:

sort (first_name)             # scalar, no tuple
sort (first_name,)            # tuple of one element
sort (first_name, last_name)  # tuple of two elements

# all of these work because sort can take either a column directly, or a tuple of columns
filter (total == 0)   # scalar, filter is fine with it
filter (total == 0,)  # tuple, error: filter expected a scalar found a tuple
select (
   height_cm = height_inch * 2.54 | floor,  # tuple with a name and a pipeline
)
select (
   height_cm = height_inch * 2.54 | floor   # also a tuple, because there is a name, even
                                            # though there is no comma?
                                            # or maybe an error saying missing comma?
)
select (!(first_name))   # error: cannot negate a column
select (!(first_name,))  # select all columns but first_name

@aljazerzen
Copy link
Member Author

aljazerzen commented May 25, 2023

Just today, I'm hitting into this error frequently:

    .query::<String, _>(
        "SELECT ...",
        &(state.user_id),
    );
error[E0277]: the trait bound `edgedb_protocol::model::Uuid: QueryArgs` is not satisfied
   --> src/main.rs:128:25
    |
126 |                     .query::<String, _>(
    |                      ----- required by a bound introduced by this call
127 |                         "SELECT Project.name FILTER .owner = <uuid>$0",
128 |                         &(state.user_id),
    |                         ^^^^^^^^^^^^^^^^ the trait `QueryArgs` is not implemented for `edgedb_protocol::model::Uuid`
    |
    = help: the following other types implement trait `QueryArgs`:
              ()
              (T0, T1)
              (T0, T1, T2)
              (T0, T1, T2, T3)
              (T0, T1, T2, T3, T4)
              (T0, T1, T2, T3, T4, T5)
              (T0, T1, T2, T3, T4, T5, T6)
              (T0, T1, T2, T3, T4, T5, T6, T7)
            and 6 others

... and it's not obvious that I'm missing a tuple so this becomes a tuple:

    .query::<String, _>(
        "SELECT ...",
-       &(state.user_id),
+       &(state.user_id,),
    );

Thus I'm -1 for (tuple).

To lessen the naming problem, we could call it {row} or {struct}.

@snth
Copy link
Member

snth commented May 25, 2023

As I commented on #2612 , I'm also -1 on (tuple). I like the {row} naming proposal.

@snth
Copy link
Member

snth commented May 25, 2023

It be clearer what's happening if we had join a (==id_1 && ==id_2) instead of join a {==id_1, ==id_2}.

I agree with this (see also #2612 (comment)).

What about having a keyword to delineate the optional join predicate instead of the parentheses?

One could go with on as in SQL, e.g. join a on ==id_1 && ==id_2. Pro: only 2 characters, Con: another reserved word.

Alternatively use filter, i.e. join a filter ==id_1 && ==id_2. Pro: leverages orthogonality and reuses existing keyword, Con: more characters.

I'm not particularly attached to this proposal but thought I would just throw it out there and type up some examples to see what it would look like and to see what others think.

@snth
Copy link
Member

snth commented May 25, 2023

While we're making sweeping changes, what about using where instead of filter? I know filter is the traditional term in the FP world but realistically our target audience are data scientists and analytics engineers who are more likely to be familiar with SQL, pandas and polars and filter seems like an unnecessary departure from SQL. pandas and dplyr do use filter as well though and while pandas does have where it actually does something different. polars and kusto allow both filter and where. So I'd say it's probably balanced based on familiarity.

Another line of argument is that I've sometimes been a bit confused for a few seconds as to whether filter "filters out" rows where the condition is true or filters the dataset to keep only those where the predicate is true. Of course it's the latter but I think where makes that inherently clearer.

Also, the join example would read better with where, i.e. join a where ==id_1 && ==id_2.

This was referenced May 26, 2023
@aljazerzen
Copy link
Member Author

Let's not start making sweeping changes only because we have just done some :D We have not reached 1.0 yet, so we can make them anytime - if they are justified.

I've opened two issues about your last two comments. Let's keep this issue about syntax of tuples and arrays.

@aljazerzen
Copy link
Member Author

@max-sixty @eitsupi what are your opinions on:

  1. change syntax {tuple} to (tuple), or
  2. rename {tuple} to {row}

@max-sixty
Copy link
Member

max-sixty commented May 30, 2023

  • change syntax {tuple} to (tuple), or

If no one else had commented, I'd be +0.1. Others make good points, and I think it's dangerous, so I'm -0.5. We can also add it later, but would be very expensive to remove, given we'd be removing only half of the usages.

  • rename {tuple} to {row}

Nothing is perfect:

  • I think {tuple} is a bit too geeky. It's perfect for "our type" of people, not for the Excel-crossover users.
  • {row} works in many circumstances, e.g. select {row}. But does it work with group {row}? Or even derive {row}?
  • {list} is easy but doesn't have the conceptual meaning

WDYT?

@aljazerzen
Copy link
Member Author

aljazerzen commented May 30, 2023

I dislike name {tuple} the least, mostly because it comes from formal math.

@snth
Copy link
Member

snth commented May 30, 2023

I really like the {tuple} syntax.

As to whether it should be {tuple}, {row} or {record} is less important to me but I guess I'm also for {tuple} as that is what they are usually called in database theory and papers (e.g. see https://en.wikipedia.org/wiki/Relational_model).

@snth
Copy link
Member

snth commented May 30, 2023

For me one sign of good notation is when it brings out relationships or patterns that might not have been apparent before. I had that experience during the last dev call when @max-sixty was taking me through the [] to {} syntax change proposal.

Given the use of {} for structs in languages like Rust and C as well as for objects in Javascript and JSON, it seems natural that each element of the tuple can also have a name. We currently have this for select (e.g. select {new_name = old_name}) and aggregate (aggregate {total = sum amount}) as was pointed out above.

Now what about the following query?

from invoices
group {customer_id} (
  aggregate {
    my_total = sum total
    }
  )

Seeing group {customer_id} straight away made me wonder whether we should allow defining an alias in the group transform like group {c_id = customer_id}?

The query above translates to the following SQL

SELECT
  customer_id,
  SUM(total) AS my_total
FROM
  invoices
GROUP BY
  customer_id

which does allow defining an alias for customer_id in the SELECT clause so I think this should be possible and be allowed.

I think in PRQL you currently would need to have another select transform below the group where you alias customer_id but you would also need to repeat all the other columns unnecessarily.

This is something I had never noticed before until I saw it written with the {} syntax so to me this is a positive sign that we are on the right track with this change.

@aljazerzen
Copy link
Member Author

Actually, this query:

from invoices
group {c_id = customer_id} (
  aggregate {
    my_total = sum total
  }
)

... produces:

SELECT
  customer_id AS c_id,
  SUM(total) AS my_total
FROM
  invoices
GROUP BY
  customer_id

... on the current version of the compiler already! This really is validation of good language design.


Ok, I think we can say that these proposals:

  1. change syntax {tuple} to (tuple), or
  2. rename {tuple} to {row}

... are rejected.

@max-sixty
Copy link
Member

... on the current version of the compiler already! This really is validation of good language design.

Very impressive! I didn't even know it could do that...

@aljazerzen
Copy link
Member Author

When you make language constructs independent, the orthogonality kicks in :D

@snth
Copy link
Member

snth commented May 31, 2023

Yes, that's amazing! I thought I tried it in the playground last night and it didn't work but I just tried again and it also works with the old syntax in 0.8.1.

Just shows that because of the list association with [], none of us thought to try it. At least that's the way it seems to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language needs-discussion Undecided dilemma
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants