Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in streaming engine involving Parquet #1262

Closed
Columbus240 opened this issue Oct 18, 2024 · 12 comments
Closed

Bug in streaming engine involving Parquet #1262

Columbus240 opened this issue Oct 18, 2024 · 12 comments
Labels

Comments

@Columbus240
Copy link

Columbus240 commented Oct 18, 2024

Similar to #1246, I detected another problem in the streaming engine. This time it is very important for the bug to appear, to start with $scan_parquet(…) and to end with $sink_parquet(…). If the whole dataframe is materialized in memory, the problem does not appear.

A minimal example follows. Please play around with the number 3 670 015. On my machine, this is the lowest number where the problem still occurs.

library(polars)

# 3'670'014 is too low. Computation is correct.
# 3'670'015 is just high enough. Computation is wrong.
df_part1 = pl$DataFrame(A = rep('1', 3670015))
df_part2 = pl$DataFrame(A = c('2'))

pl$concat(df_part1, df_part2)$write_parquet('test_0.parquet')


df_agg = pl$scan_parquet('test_0.parquet')$group_by('A')$agg()
df_agg$sink_parquet('test_1.parquet')
df_agg_out = pl$read_parquet('test_1.parquet')
df_agg_c = df_agg$collect()

df_agg_out$height # returns 1 but should be 2
df_agg_c$height # returns 2
df_agg_out
df_agg_c

I believe, that some code related to Parquet is involved in the bug, because the following code does not exhibit the problem, even if the number of repetitions is increased tenfold.

library(polars)

df_part1 = pl$DataFrame(A = rep('1', 3670015))
df_part2 = pl$DataFrame(A = c('2'))

pl$concat(df_part1, df_part2)$write_csv('test_0.csv')

df_agg = pl$scan_csv('test_0.csv')$group_by('A')$agg()
df_agg$sink_parquet('test_1.parquet')
df_agg_out = pl$read_parquet('test_1.parquet')
df_agg_c = df_agg$collect()

df_agg_out$height # returns 2
df_agg_c$height # returns 2
df_agg_out
df_agg_c

On the other hand, if the input is read as Parquet and the output is stored in CSV, the problem persists. This can be seen in the following example:

library(polars)

df_part1 = pl$DataFrame(A = rep('1', 3670015))
df_part2 = pl$DataFrame(A = c('2'))

pl$concat(df_part1, df_part2)$write_parquet('test_0.parquet')

df_agg = pl$scan_parquet('test_0.parquet')$group_by('A')$agg()
df_agg$sink_csv('test_1.csv')
df_agg_out = pl$read_csv('test_1.csv')
df_agg_c = df_agg$collect()

df_agg_out$height # returns 1 but should be 2
df_agg_c$height # returns 2
df_agg_out
df_agg_c

The output of polars_info():

Polars R package version : 0.20.0
Rust Polars crate version: 0.43.1

Thread pool size: 14 

Features:                               
default                    TRUE
full_features              TRUE
disable_limit_max_threads  TRUE
nightly                    TRUE
sql                        TRUE
rpolars_debug_print       FALSE

Code completion: deactivated

Edit: fixed the reference to the other issue.
Edit2: It is possible to replace $group_by('A')$agg() by unique() and detect the same problem.

@Columbus240 Columbus240 changed the title Bug in streaming engine involving Parquet and group_by Bug in streaming engine involving Parquet Oct 18, 2024
@etiennebacher
Copy link
Collaborator

I can reproduce in R but not python. The code related to Parquet files has changed a lot upstream so it's likely they have already fixed it but it hasn't been available in a release of rust-polars yet (same as #1246). I'll mark this upstream for now and we'll see if this is resolved after the next rust-polars release.

@eitsupi
Copy link
Collaborator

eitsupi commented Oct 18, 2024

I can reproduce in R but not python.

How about Python Polars 1.7.1?
That should be the same code base of R Polars 0.20.0.

@etiennebacher
Copy link
Collaborator

I can reproduce with 1.7.1. This must have been fixed upstream

@Columbus240
Copy link
Author

Thanks for the quick response. In that case I'll try to work around the problems and eagerly await the new releases of Rust Polars and R Polars.

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 2, 2024

I have looked at the new Rust release and it appears that there are about 70 compile errors that need significant changes to fix.
I probably won't work on it so if anyone could do it it would be greatly appreciated.
(The current code base is so far removed from Python Polars that fixing it in this situation would be very difficult, so work on #1152 is a priority.)

@etiennebacher
Copy link
Collaborator

etiennebacher commented Nov 2, 2024

I'm confused about your message, do you prefer pushing the rewrite in the next release or have another "standard" release here with rust polars 0.44?

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 2, 2024

I'm confused about your message, do you prefer pushing the rewrite in the next release or have another "standard" release here with rust polars 0.44?

I wanted to update the current main branch to polars 0.44 if possible, but I gave up because it seemed like too much work.

@etiennebacher
Copy link
Collaborator

etiennebacher commented Nov 2, 2024

My question was more about whether we have another release before the one containing the rewrite. If we don't, I don't see the point of updating main here if it's gonna be replaced anyway. I'd rather work on the rewrite as well.

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 2, 2024

I see, in that case I was going to do the release because there is nothing preventing me from doing so.
It's just that I don't know if I can include the Rust Polars update there.

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 19, 2024

@Columbus240 Could you try the new version?

@etiennebacher
Copy link
Collaborator

The problem is gone with polars 0.21.0

@Columbus240
Copy link
Author

Thank you for the update. I can confirm, that the problem does not appear for me, when I use 0.21.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants