asv comparison tests pandas dataframe memory creation vs pandas.dataframe read/write and arcticdb_lmdb read/writes #2137
Conversation
LGTM
SYMBOL = "dataframe"
NUMBER_ROWS = 3000000

def setup_cache(self):
I think this should just return the df, given that the df and the dict are really the same thing.
Returning the dict is needed for the dataframe-creation test, which is important for understanding all the later numbers and their dynamics, so it cannot be avoided.
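The point being made can be sketched as follows. This is a hypothetical minimal version (the `make_data` helper and column contents are assumptions, not taken from the PR): `setup_cache` returns the raw dict rather than a ready DataFrame, so the cost of constructing the DataFrame itself can be benchmarked separately.

```python
import numpy as np
import pandas as pd

NUMBER_ROWS = 3_000_000  # matches the constant in the diff above

def make_data(size):
    # Hypothetical generator standing in for the PR's column helpers
    rng = np.random.default_rng(0)
    return {
        "value": rng.random(size),
        "dtype": np.full(size, "x" * 10),
    }

class ComparisonBenchmarks:
    def setup_cache(self):
        # Returning the dict (not the DataFrame) lets the creation
        # benchmark below measure DataFrame construction itself.
        return make_data(NUMBER_ROWS)

    def peakmem_create_dataframe(self, data):
        # asv's peakmem_ prefix records the peak memory of this call
        pd.DataFrame(data)
```

Had `setup_cache` returned the DataFrame, the construction cost would already be paid before any `peakmem_*` method ran, and the creation baseline would be lost.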
"dtype": str_col(10, size),
}

def peakmem_create_dataframe(self, tpl):
I get the idea with this but it's odd for us to have a benchmark that we have no control over - would be very odd for a Pandas regression to block our PRs.
Creation of the dataframe is the basis for comparison of how effective we are on the other graphs. It also shows that although create and write each have certain memory requirements when done separately, the memory requirement of doing both is not equal to the sum of each one (same for read/write etc.).
I would not say block, but warn and require action if the memory usage suddenly grows. Why did it grow? Is it OK? etc. Such questions will pop into our heads, which otherwise would not. So see this as a feedback loop that reminds us to think when something unusual happens.
Overall, seeing all the processes in one test and graph will make a difference over time, as it could improve our knowledge and our management of performance and requirements.
Reference Issues/PRs
These asv tests aim to compare memory usage across the above-mentioned operations. Seeing all of them on one graph can tell us how efficient we are in memory management.
If we consider that with time we will also be able to run the same tests for Polars and Amazon S3, such benchmarks could provide insights into our performance.
[20.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_create_dataframe 3.16G
[40.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_read_dataframe_arctic 4.95G
[60.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_read_dataframe_parquet 5.4G
[80.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_write_dataframe_arctic 3.45G
[100.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_write_dataframe_parquet 2.77G
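The figures above can be read against the creation baseline. A small sketch (numbers copied from the run above, in GiB) normalising each peak against `peakmem_create_dataframe`:

```python
# Peak memory figures from the asv run above, in GiB
results = {
    "create_dataframe": 3.16,
    "read_arctic": 4.95,
    "read_parquet": 5.40,
    "write_arctic": 3.45,
    "write_parquet": 2.77,
}

baseline = results["create_dataframe"]
# Ratio of each operation's peak memory to the creation baseline
ratios = {name: round(peak / baseline, 2) for name, peak in results.items()}
```

Tracking these ratios rather than the raw numbers helps separate "the test data got bigger" from "an operation's memory overhead regressed", which is the feedback loop described above.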
What does this implement or fix?
Any other comments?
Checklist
Checklist for code changes...