
0.25.2 z-order slower than 0.25.1 #3269

Open · aldder opened this issue Feb 26, 2025 · 16 comments
Labels: bug (Something isn't working)

Comments

@aldder commented Feb 26, 2025

Environment

Delta-rs version: 0.25.2

Environment:

  • Cloud provider: AWS

Bug

What happened:
I have a process that runs z-order compaction on a table every hour, and I noticed that since the release of 0.25.2 the operation is much slower compared to the previous version, as you can see in the following monitoring chart:

[Image: monitoring chart of the hourly z-order run time, showing a clear increase after upgrading to 0.25.2]
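(For context, such an hourly job boils down to a single z-order optimize call on the table. A minimal sketch, with a placeholder table URI and z-order column:)

from deltalake import DeltaTable

# open the Delta table (placeholder URI) and z-order it on a timestamp column
dt = DeltaTable("s3://my-bucket/my-table")
dt.optimize.z_order(["x2"])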

aldder added the bug label on Feb 26, 2025
@ion-elgreco (Collaborator)

I guess our TableProvider scan is slower than ctx.read_parquet

@ion-elgreco (Collaborator)

@alamb sorry for pinging you, but I wonder whether you might understand how these performance differences could happen?

ctx.read_parquet uses a ListingTableProvider, while our own DeltaTableProvider uses a ParquetExec scan, but I don't know the exact internals of ListingTableProvider tbh

@alamb (Contributor) commented Feb 26, 2025

> @alamb sorry for pinging you, but I wonder whether you might understand how these performance differences could happen?
>
> ctx.read_parquet uses a ListingTableProvider, while our own DeltaTableProvider uses a ParquetExec scan, but I don't know the exact internals of ListingTableProvider tbh

Thanks for the ping @ion-elgreco -- I don't have any ideas from a report at this level of detail

I think getting a reproducer would be the next step, and we could look at the explain plans to see if anything obvious is going on

@ion-elgreco (Collaborator)

@aldder can you create a reproducible example with some sample data? Easier for me to debug :)

@aldder (Author) commented Feb 27, 2025

@ion-elgreco hmm, unfortunately I can't seem to reproduce the issue with a brand new table, no matter what I try

@ion-elgreco (Collaborator)

@aldder how big is that existing table? How many versions, what is the size of the log files and of the checkpoint, and roughly how many rows and columns?

@aldder (Author) commented Feb 27, 2025

@ion-elgreco

the table has ~25k versions right now
the checkpoint size is ~25-30 MB

schema is:
x1: string
x2: timestamp[us, tz=UTC]
x3: timestamp[us, tz=UTC]
x4: double

I don't know exactly how many rows it has tbh, but I would say hundreds of millions
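(For reference, most of these numbers can be gathered directly with the Python bindings. A rough sketch, assuming the table path is filled in and that per-file statistics were written when the data was appended:)

from deltalake import DeltaTable

dt = DeltaTable("path/to/table")              # placeholder path
print(dt.version())                           # latest table version (~25k here)
print(len(dt.files()))                        # number of active data files
print(dt.schema())                            # table schema
adds = dt.get_add_actions(flatten=True)       # per-file add actions as a pyarrow RecordBatch
row_counts = adds.column("num_records").to_pylist()
print(sum(n for n in row_counts if n is not None))  # total rows, where stats are present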

@ion-elgreco (Collaborator)

@ion-elgreco

the table has ~25k versions right now
checkpoints size is ~25-30MB

schema is:
x1: string
x2: timestamp[us, tz=UTC]
x3: timestamp[us, tz=UTC]
x4: double

I don't know how many rows it has tbh, but i would say hundred millions

Ok, I'll make a test branch; I have a theory that might help. Could you then compile that wheel and test it with your table?

@aldder (Author) commented Feb 27, 2025

I have never used Rust before, so I don't know the procedure for compiling the library. If you can give me the steps to follow, I would really appreciate it.

@aldder (Author) commented Feb 27, 2025

Ok, I was finally able to reproduce the issue. Apparently it's related to the number of partitions, since I just needed to increase the number of partition values to obtain the desired effect.

import polars as pl
import deltalake
from deltalake import DeltaTable, Schema, Field
from time import perf_counter
from tqdm.auto import tqdm

print(deltalake.__version__)

# create a table partitioned by 'foo'; the 500 distinct 'foo' values written below produce 500 partitions
DeltaTable.create('tmp', schema=Schema([
        Field('foo', 'string'),
        Field('num', 'double'),
        Field('time', 'timestamp')
    ]),
    partition_by=['foo']
)

# append 50 batches of 500 rows each (one row per partition value)
for i in tqdm(range(50)):
    df = pl.DataFrame({
        'foo': pl.int_range(0, 500, eager=True).cast(pl.String),
        'num': pl.int_range(i, i + 500, eager=True),
        'time': pl.datetime_range(pl.datetime(2021, 1, 1), pl.datetime(2021, 1, 1) + pl.duration(hours=499), '1h', time_zone='UTC', eager=True)
    })

    df.write_delta(
        target='tmp',
        mode='append'
    )

dt = DeltaTable('tmp')
tic = perf_counter()
dt.optimize.z_order(['time'])
toc = perf_counter()

print(f"Optimization time: {toc-tic}")

With version 0.25.2:

0.25.2
100%|█████████████████████████████████████████████████████████| 50/50 [00:53<00:00,  1.07s/it]
Optimization time: 29.001007099999697

With version 0.25.1:

0.25.1
100%|█████████████████████████████████████████████████████████| 50/50 [00:52<00:00,  1.05s/it]
Optimization time: 19.724042600006214

@ion-elgreco (Collaborator)

> Ok, I was finally able to reproduce the issue. Apparently it's related to the number of partitions, since I just needed to increase the number of partition values to obtain the desired effect. [...]

Thanks, this is very useful to narrow the issue down! I'll try to dive deeper into which part of the DeltaScan makes this slower.

@ion-elgreco (Collaborator)

Actually, I can't reproduce this... this is main built in release mode:

Optimization time: 1.1585120419913437

And 0.25.1:
Optimization time: 1.0465130830125418

@ion-elgreco (Collaborator)

@aldder I have a small fix that slightly brings it closer to 0.25.1 for me. You can try it out if you want: https://github.com/ion-elgreco/delta-rs/tree/fix/optimize_scan

Make sure to build the wheel in --release mode
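(For anyone following along: this is not an official build guide, but assuming a Rust toolchain is installed and that the Python bindings still build with maturin from the python/ directory of the repo, compiling and installing the wheel from that branch looks roughly like this:)

git clone https://github.com/ion-elgreco/delta-rs.git
cd delta-rs && git checkout fix/optimize_scan
cd python
pip install maturin
maturin develop --release    # or `maturin build --release` to produce a wheel under target/wheels/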

@aldder (Author) commented Feb 28, 2025

@ion-elgreco I tested the changes, but they don't seem to change much in practice:

0.25.1
Optimization time: 45.88432119999197

0.25.2 (fix)
Optimization time: 59.264429899980314

I ran several tests, and they consistently show about 30% more slowness compared to the previous version.
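(A note on measurement: a second z_order on the same table has much less work left to do, so repeated runs only compare fairly if the table is rebuilt before each one. A minimal sketch of such a loop, assuming the create-and-append part of the repro above has been factored into a hypothetical build_table() helper:)

import shutil
import statistics
from time import perf_counter

from deltalake import DeltaTable

def bench_z_order(runs: int = 5) -> float:
    times = []
    for _ in range(runs):
        shutil.rmtree('tmp', ignore_errors=True)  # start from a fresh table each run
        build_table()                             # hypothetical helper: the create/append loop from the repro above
        dt = DeltaTable('tmp')
        tic = perf_counter()
        dt.optimize.z_order(['time'])
        times.append(perf_counter() - tic)
    return statistics.median(times)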

@ion-elgreco (Collaborator)

@aldder what kind of CPU are you using? For me the differences are negligible on an M4 chip.

@aldder (Author) commented Feb 28, 2025

Intel i7 on my local PC, but I see the same effect on AWS (Lambda and Batch jobs)
