Bug in streaming engine involving Parquet #1262
Comments
I can reproduce in R but not Python. The Parquet-related code has changed a lot upstream, so it's likely they have already fixed it, but the fix isn't yet available in a release of rust-polars (same as #1246). I'll mark this upstream for now and we'll see if this is resolved after the next rust-polars release.
How about Python Polars 1.7.1?
I can reproduce with 1.7.1. This must have been fixed upstream.
Thanks for the quick response. In that case I'll try to work around the problems and eagerly await the new releases of Rust Polars and R Polars.
I have looked at the new Rust release and it appears that there are about 70 compile errors that need significant changes to fix.
I'm confused about your message: do you prefer pushing the rewrite in the next release, or having another "standard" release here with rust-polars 0.44?
I wanted to update the current main branch to polars 0.44 if possible, but I gave up because it seemed like too much work.
My question was more about whether we have another release before the one containing the rewrite. If we don't, I don't see the point of updating main here if it's gonna be replaced anyway. I'd rather work on the rewrite as well.
I see. In that case I was going to do the release, because there is nothing preventing me from doing so.
@Columbus240 Could you try the new version? |
The problem is gone with
Thank you for the update. I can confirm that the problem does not appear for me when I use 0.21.0.
Similar to #1246, I detected another problem in the streaming engine. This time, for the bug to appear, it is essential that the pipeline starts with `$scan_parquet(…)` and ends with `$sink_parquet(…)`. If the whole dataframe is materialized in memory, the problem does not appear.

A minimal example follows. Please play around with the number 3 670 015. On my machine, this is the lowest number where the problem still occurs.
I believe that some Parquet-related code is involved in the bug, because the following code does not exhibit the problem, even if the number of repetitions is increased tenfold.
On the other hand, if the input is read as Parquet and the output is stored in CSV, the problem persists. This can be seen in the following example:
The output of `polars_info()`:

Edit: fixed the reference to the other issue.
Edit2: It is possible to replace `$group_by('A')$agg()` by `unique()` and detect the same problem.
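
For readers who want to try the pipeline shape described in this issue, a minimal sketch might look like the following. It is not the original reproduction. The column name `A` and the calls `$scan_parquet()`, `$group_by('A')$agg()`, and `$sink_parquet()` come from the issue text, while the data, the row count `n`, and the temporary file names are assumptions chosen for illustration. Comparing row counts is only one possible check, since the thread does not say how the discrepancy shows up.

```r
library(polars)

# Hypothetical inputs: the row count and the data are assumptions,
# not the values used in the original report.
n <- 1000000L
input <- tempfile(fileext = ".parquet")
output <- tempfile(fileext = ".parquet")

# Write a single column "A" holding many repeated values.
pl$DataFrame(A = rep(1:10, length.out = n))$write_parquet(input)

# Lazy path from the issue: scan Parquet, aggregate, sink to Parquet.
pl$scan_parquet(input)$group_by("A")$agg()$sink_parquet(output)

# In-memory path: the same query, collected instead of sunk.
in_memory <- pl$scan_parquet(input)$group_by("A")$agg()$collect()

# Compare the two results; they should agree if nothing goes wrong.
sunk <- pl$read_parquet(output)
cat("rows via sink_parquet:", sunk$height, "\n")
cat("rows via collect:     ", in_memory$height, "\n")
```

Raising `n` toward the size mentioned in the issue, or swapping `$group_by("A")$agg()` for `$unique()`, gives the variants discussed above.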