0.25.2 z-order slower than 0.25.1 #3269
Comments
I guess our TableProvider scan is slower than `ctx.read_parquet`.
@alamb sorry for pinging you, but I wonder whether you might understand how these performance differences could happen? `ctx.read_parquet` uses a `ListingTableProvider`, while our own `DeltaTableProvider` uses a `ParquetExec` scan, but I don't know the exact internals of `ListingTableProvider`, to be honest.
Thanks for the ping @ion-elgreco -- I don't have any ideas for a report at this level. I think getting a reproducer would be the next step, and we could look at the explain plans to see if anything obvious is going on.
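For reference, a minimal sketch (not from the thread) of how one could pull an explain plan for the plain-Parquet path via the datafusion Python bindings; the file path is a placeholder, and comparing it against the `DeltaTableProvider` scan would still need the reproducer discussed below:

```python
from datafusion import SessionContext

ctx = SessionContext()

# ListingTable-based scan, analogous to ctx.read_parquet in Rust;
# the path below is a placeholder for a real Parquet file from the table
df = ctx.read_parquet("tmp/foo=0/part-00000.parquet")

# Print the logical and physical plans for inspection
df.explain()
```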
@aldder can you create a reproducible example with some sample data? That would make it easier for me to debug :)
@ion-elgreco mmh, unfortunately I can't seem to reproduce the error with a brand new table, no matter what I try.
@aldder how big is that existing table: how many versions, size of the log files, size of the checkpoint, number of rows and columns?
The table has ~25k versions right now. The schema is:

I don't know how many rows it has, to be honest, but I would say hundreds of millions.
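For reference, a minimal sketch (not part of the thread) of how those table statistics could be gathered with the deltalake Python API, assuming `'tmp'` stands in for the real table path:

```python
from deltalake import DeltaTable

dt = DeltaTable('tmp')  # placeholder path for the real table

print("current version:", dt.version())
print("active data files:", len(dt.files()))
print("schema:", dt.schema())

# Approximate row count from the Add-action statistics, assuming the
# writers recorded num_records in the transaction log
adds = dt.get_add_actions(flatten=True)
print("rows:", sum(adds.column("num_records").to_pylist()))
```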
Ok, I'll make a test branch; I have a theory that might help. Could you then compile that wheel and test it with your table?
I have never used Rust before, so I don't know the procedure to compile the library. If you can give me steps to follow, I would really appreciate it.
Ok, finally I was able to reproduce the issue. Apparently it's something related to the number of partitions, since I just needed to increase this number to obtain the desired effect.

```python
import polars as pl
import deltalake
from deltalake import DeltaTable, Schema, Field
from time import perf_counter
from tqdm.auto import tqdm

print(deltalake.__version__)

# Table partitioned by 'foo', which takes 500 distinct values below
DeltaTable.create(
    'tmp',
    schema=Schema([
        Field('foo', 'string'),
        Field('num', 'double'),
        Field('time', 'timestamp')
    ]),
    partition_by=['foo']
)

# 50 appends of 500 rows each, one row per partition value
for i in tqdm(range(50)):
    df = pl.DataFrame({
        'foo': pl.int_range(0, 500, eager=True).cast(pl.String),
        'num': pl.int_range(i, i + 500, eager=True),
        'time': pl.datetime_range(pl.datetime(2021, 1, 1), pl.datetime(2021, 1, 1) + pl.duration(hours=499), '1h', time_zone='UTC', eager=True)
    })
    df.write_delta(
        target='tmp',
        mode='append'
    )

dt = DeltaTable('tmp')

tic = perf_counter()
dt.optimize.z_order(['time'])
toc = perf_counter()
print(f"Optimization time: {toc-tic}")
```

With version 0.25.2:
With version 0.25.1:
Thanks, this is very useful for narrowing the issue down! I'll try to dive deeper into which part of the DeltaScan makes this slower.
Actually, I can't reproduce this. This is main in release mode: Optimization time: 1.1585120419913437
And 0.25.1:
@aldder I have a small fix that brings it slightly closer to 0.25.1 for me. You can try it out if you want: https://github.com/ion-elgreco/delta-rs/tree/fix/optimize_scan
Make sure to build the wheel in `--release` mode.
@ion-elgreco I tested the changes, but they don't seem to have changed much in substance.
0.25.1:
0.25.2 (fix):
I did several tests, and they confirm about 30% more slowness than the previous version.
@aldder what kind of CPU are you using? Because for me the differences are negligible on an M4 chip.
Intel i7 on my local PC, but I see the same effect on the AWS cloud (Lambda and Batch jobs).
Environment
Delta-rs version: 0.25.2
Environment:
Bug
What happened:
I have a process that runs z-order compaction on a table every hour, and I noticed that since the release of 0.25.2 the operation is much slower compared to the previous version, as you can see in the following monitoring chart.
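For context, a minimal sketch (not from the original report) of the kind of hourly job described; the table URI and z-order column are placeholders:

```python
from time import perf_counter
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/table")  # placeholder table URI

tic = perf_counter()
metrics = dt.optimize.z_order(["time"])  # placeholder z-order column
print(f"z-order took {perf_counter() - tic:.1f}s, metrics: {metrics}")
```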