Indexing a nested list with 0 or an index larger than list size is not handled correctly #5310

Closed
ahmedriza opened this issue Feb 16, 2023 · 0 comments · Fixed by #5311
Labels
bug Something isn't working

Describe the bug
Indexing into a nested list works correctly for indices from 1 up to the length of the list. With index 0, or an index larger than the length of the list, the query fails with an error similar to the following:

Error: Arrow error: Invalid argument error: column types must match schema types, expected Float64 but found List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 1

To Reproduce
Use the attached parquet file, list.parquet.gz

This Parquet file contains a single row of data as follows:

+------------------+---+
|a                 |id |
+------------------+---+
|[1.71, 2.71, 3.71]|1  |
+------------------+---+

Example code that demonstrates the bug (after uncompressing the file):

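use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "list.parquet", ParquetReadOptions::default()).await?;
    // Index 0 is out of range, so this query triggers the error above.
    let df = ctx.sql("select id, a[0] from t").await?;
    df.show().await?;
    Ok(())
}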

Expected behavior
We expect to get a null value when the index is out of range. For example, the above code should produce the following output:

+----+--------+
| id | t.a[0] |
+----+--------+
| 1  |        |
+----+--------+

Additional context

Valid indices should return the corresponding element, and any invalid index (0 or past the end of the list) should return null. Example:

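use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "list.parquet", ParquetReadOptions::default()).await?;
    // Indices 0 and 100 are out of range and should yield nulls.
    let df = ctx.sql("select id, a[0], a[1], a[2], a[3], a[100] from t").await?;
    df.show().await?;
    Ok(())
}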

This should produce the following output:

+----+--------+--------+--------+--------+----------+
| id | t.a[0] | t.a[1] | t.a[2] | t.a[3] | t.a[100] |
+----+--------+--------+--------+--------+----------+
| 1  |        | 1.71   | 2.71   | 3.71   |          |
+----+--------+--------+--------+--------+----------+
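
The expected semantics in the table above can be sketched in plain Rust, independent of DataFusion. This is only an illustration of the 1-based lookup with nulls for invalid indices, not the actual fix; `list_element` is a hypothetical helper name, and treating negative indices as out of range is an assumption that this issue does not cover.

```rust
/// Hypothetical helper (not part of DataFusion): models the expected
/// result of `a[index]` for a single list value, using 1-based indexing.
/// `None` stands in for a SQL null.
fn list_element(list: &[f64], index: i64) -> Option<f64> {
    // Index 0 is out of range. Negative indices are also rejected here,
    // which is an assumption; the issue only mentions 0 and indices
    // larger than the list.
    if index < 1 {
        return None;
    }
    // Convert the 1-based index to a 0-based offset. `get` returns None
    // when the offset is past the end of the slice.
    list.get((index - 1) as usize).copied()
}
```

For the row `[1.71, 2.71, 3.71]`, indices 1 through 3 map to the three elements, while 0 and 100 both map to `None`, matching the empty cells in the expected output.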