Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asv comparison tests pandas dataframe memory creation vs pandas.dataframe read/write and arcticdb_lmdb read/writes #2137

Merged
merged 3 commits into from
Feb 24, 2025

Conversation

grusev
Copy link
Collaborator

@grusev grusev commented Jan 23, 2025

Reference Issues/PRs

Those asv tests aim to compare memory usage between above mentioned operations. Seeing all of them on one graph could tell us how efficient we are in memory management.

If we consider that we will have also the ability to make same tests for Polars and Amazon S3 with time, such benchmarks could provide insights for our performance

         OUTPUT -------->
         DF generated 88.47616171836853                                                                                                 ok

[20.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_create_dataframe 3.16G
[40.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_read_dataframe_arctic 4.95G
[60.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_read_dataframe_parquet 5.4G
[80.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_write_dataframe_arctic 3.45G
[100.00%] ··· comparison_benchmarks.ComparisonBenchmarks.peakmem_write_dataframe_parquet 2.77G

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

Copy link
Collaborator

@G-D-Petrov G-D-Petrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

SYMBOL = "dataframe"
NUMBER_ROWS = 3000000

def setup_cache(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should just return the df given that the df and the dict are really the same thing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return of dict is needed for dataframe creation test, which is important for understanging all later numbers and their dynamics, thus this cannot be avoided

"dtype" : str_col(10, size),
}

def peakmem_create_dataframe(self, tpl):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the idea with this but it's odd for us to have a benchmark that we have no control over - would be very odd for a Pandas regression to block our PRs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creation of the dataframe is basis for comparison how effective we will be on other graphs. Also it shows that although create and write done separetly have certain mem requirements, the mem requirements of both of them are not equal to the sum of each one. (same for read/write ets.

I would not say block, but warn and require action if the memory usage suddently grows. Why it grew? IS it ok? wtc. Such questions will pop in our heads which otherwise will not ... So see this as a feedback loop, that will remind us to thinl when something unusual happens.

Overall seing all processes in one test and graph will make a difference over time as this coul improve our knowledge and ways of management performance and requirements

@grusev grusev merged commit 71ebe0b into master Feb 24, 2025
151 of 152 checks passed
@grusev grusev deleted the asv_test_comparison branch February 24, 2025 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants