Slowdown in ClickBench Q36-Q37 between DataFusion 43.0.0 and 44.0.0 #14481
Comments
I got flamegraphs from them using https://github.com/flamegraph-rs/flamegraph
Q36:
./datafusion-cli-44 -c "SELECT \"URL\", COUNT(*) AS PageViews FROM 'hits.parquet' WHERE \"CounterID\" = 62 AND \"EventDate\"::INT::DATE >= '2013-07-01' AND \"EventDate\"::INT::DATE <= '2013-07-31' AND \"DontCountHits\" = 0 AND \"IsRefresh\" = 0 AND \"URL\" <> '' GROUP BY \"URL\" ORDER BY PageViews DESC LIMIT 10;"
And made flamegraphs with
Here is DataFusion 43: [flamegraph image]
Here is DataFusion 44: [flamegraph image]
A large amount of the time is spent decoding ParquetMetadata
Also, when I tried DataFusion 45 (pre-release), the speed seems to have gotten better again...
Given how much time is spent decoding ParquetMetadata, maybe it would be good to add some sort of small built-in cache for Parquet metadata 🤔 I think @Ted-Jiang made hooks to do this a long time ago, but we don't have anything in by default.
Would love to help on this issue. We built something similar for Vortex based on moka, and it also saves on roundtrips during infer_schema/infer_stats. It can probably be generalized in some way, but I'm open to any thoughts you have.
"It can probably be generalized in some way but I'm open to any thoughts you have."
Thank you @AdamGS, that would be amazing. I suggest the following:
Ideally 2 would use the existing APIs for doing this.
The APIs I was referring to are:
You can see there are two APIs there for statistics and metadata, but I don't know if they are still hooked up.
I think moka is likely overkill and too large a dependency for DataFusion itself, but being able to connect up a moka-based cache would be super helpful.
Indeed -- and making the schema / stats more efficient in general would be really good
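To make the cache idea from the comments above concrete, here is a minimal sketch of a small, pluggable metadata cache. It is not DataFusion's or the parquet crate's API: `FileMetadata`, `FileKey`, `MetadataCache`, `SmallCache`, and `get_or_decode` are all hypothetical names invented for illustration, and the eviction policy (drop everything when full) is deliberately crude. It assumes that decoded footer metadata can be shared behind an `Arc` and that path, size, and modification time are enough to identify a file.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for decoded Parquet footer metadata.
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache key. Size and modification time are included so that
/// a file rewritten at the same path does not return stale metadata.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct FileKey {
    pub path: String,
    pub size: u64,
    pub last_modified: u64,
}

/// Pluggable interface: a default implementation can live in the engine,
/// while users could plug in something heavier (for example moka-backed).
pub trait MetadataCache: Send + Sync {
    fn get(&self, key: &FileKey) -> Option<Arc<FileMetadata>>;
    fn put(&self, key: FileKey, value: Arc<FileMetadata>);
}

/// Small bounded cache using only the standard library.
pub struct SmallCache {
    capacity: usize,
    entries: Mutex<HashMap<FileKey, Arc<FileMetadata>>>,
}

impl SmallCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            entries: Mutex::new(HashMap::new()),
        }
    }
}

impl MetadataCache for SmallCache {
    fn get(&self, key: &FileKey) -> Option<Arc<FileMetadata>> {
        self.entries.lock().unwrap().get(key).cloned()
    }

    fn put(&self, key: FileKey, value: Arc<FileMetadata>) {
        let mut entries = self.entries.lock().unwrap();
        // Crude eviction: when full and the key is new, drop all entries.
        if entries.len() >= self.capacity && !entries.contains_key(&key) {
            entries.clear();
        }
        entries.insert(key, value);
    }
}

/// Return cached metadata if present; otherwise run `decode` once and store it.
pub fn get_or_decode<F>(cache: &dyn MetadataCache, key: &FileKey, decode: F) -> Arc<FileMetadata>
where
    F: FnOnce() -> FileMetadata,
{
    if let Some(hit) = cache.get(key) {
        return hit;
    }
    let decoded = Arc::new(decode());
    cache.put(key.clone(), Arc::clone(&decoded));
    decoded
}
```

With something like this in place, repeated queries against the same file would call `get_or_decode` and pay the decoding cost only on the first access. A real implementation would need a proper eviction policy and a size bound in bytes rather than entries.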
Is your feature request related to a problem or challenge?
@pmcgleenon ran ClickBench on DataFusion 44 ❤
44.0.0 #13983 (comment)
Here are the results of ClickBench across several DataFusion versions:
clickbench-latest.html.zip
Q36 and Q37 look like they got slower
Describe the solution you'd like
Investigate (and hopefully restore) the performance in Q36 and Q37
Here are the queries (note the queries are numbered starting at 0 but the line numbers start at 1):
datafusion/benchmarks/queries/clickbench/queries.sql
Lines 37 to 38 in 0d9f845
Describe alternatives you've considered
No response
Additional context
No response