Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slowdown in ClickBench Q36-Q37 between DataFusion 43.0.0 and 44.0.0 #14481

Open
Tracked by #14482 ...
alamb opened this issue Feb 4, 2025 · 5 comments
Open
Tracked by #14482 ...

Slowdown in ClickBench Q36-Q37 between DataFusion 43.0.0 and 44.0.0 #14481

alamb opened this issue Feb 4, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 4, 2025

Is your feature request related to a problem or challenge?

@pmcgleenon ran ClickBench on DataFusion 44 ❤

Here are the results of ClickBench across several DataFusion versions:
clickbench-latest.html.zip

Q36 and Q37 look like they got slower

Image

Describe the solution you'd like

Investigate (and hopefully restore) the performance in Q36 and Q37

Here are the queries (note the queries are numbered starting at 0 but the line numbers start at 1):

SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10;
SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10;

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

I got flamegraphs from them using https://github.com/flamegraph-rs/flamegraph

Q36

./datafusion-cli-44 -c "SELECT \"URL\", COUNT(*) AS PageViews FROM 'hits.parquet' WHERE \"CounterID\" = 62 AND \"EventDate\"::INT::DATE >= '2013-07-01' AND \"EventDate\"::INT::DATE <= '2013-07-31' AND \"DontCountHits\" = 0 AND \"IsRefresh\" = 0 AND \"URL\" <> '' GROUP BY \"URL\" ORDER BY PageViews DESC LIMIT 10; "

And made flamegraphs with

sudo flamegraph -- ./datafusion-cli-43 -c "SELECT \"URL\", COUNT(*) AS PageViews FROM 'hits.parquet' WHERE \"CounterID\" = 62 AND \"EventDate\"::INT::DATE >= '2013-07-01' AND \"EventDate\"::INT::DATE <= '2013-07-31' AND \"DontCountHits\" = 0 AND \"IsRefresh\" = 0 AND \"URL\" <> '' GROUP BY \"URL\" ORDER BY PageViews DESC LIMIT 10;

Here is DataFusion 43:

Image

Here is DataFusion 44:

Image

A largre amount of the time is spent decoding ParquetMetadata
Image

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

Also when I tried in DataFusion 45 (pre-release) the speed seems to have gotten better again...

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

Given how much time is spent decoding ParquetMetadata, maybe it would be good to add some sort of small built in cache for parquet metadata 🤔 I think @Ted-Jiang made hooks to do this a long time ago but we don't have anything in by default

@AdamGS
Copy link
Contributor

AdamGS commented Feb 25, 2025

Would love to help on this issue. We built something similar for Vortex based on moka and it also saves on roundtrips during infer_schema/infer_stats. It can probably be generalized in some way but I'm open to any thoughts you have.

@alamb
Copy link
Contributor Author

alamb commented Feb 26, 2025

Would love to help on this issue.

It can probably be generalized in some way but I'm open to any thoughts you have.

Thank you @AdamGS that would be amazing

I suggest the following:

  1. Write up an example showing how to enable the parquet metadata cahce
  2. Figure out how to implement a simple in memory cache (Maybe default to 5MB or something)

Ideally 2 would use the existing APIs for doing this

The APIs I was referring to are:

You can see there are two APIs there for statistics and metadata but I don't know if they are still hooked up

We built something similar for Vortex based on moka

I think moka is likely overkill and too large a dependnecy for datafusion itself, but being able to connect up a moka based cache would be super helpful.

and it also saves on roundtrips during infer_schema/infer_stats.

Indeed -- and making the schema / stats more efficient in general would be really good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants