WIP, DEBUG: Python derived metrics #848

Conversation

tylerjereddy
Collaborator

@tylerjereddy tylerjereddy commented Nov 9, 2022

  • this is a Python/pandas-only version of doing some simple derived metrics; I don't think we'll
    actually do this, but I was exploring a bit because of the difficulties in ENH: Python derived/accum interface #839

  • this matches pretty well with the perl-based reports for total bytes, but even simple cases can sometimes disagree on bandwidth (see Query: bandwidth calculation for derived metrics #847), so now I'm curious what is going on

  • one problem with doing this is that we'd have the same algorithms implemented in two different languages; the advantages include:

    • not reading all the records in a second time, one at a time, crossing the CFFI boundary each time (because all the records are already stored in the DataFrame from the first time we did this)
    • easier to debug/maintain because bounds checking/no segfaults, etc.
    • likely easier to regex-filter directly on the pandas data structures than through interop with C/CFFI, e.g., for custom file-based derived metrics
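
To make the trade-off above concrete, here is a minimal sketch of what a pandas-only derived metric could look like. It is not the code in this branch and not the algorithm used by the C or perl tooling: the function name `simple_derived_metrics`, the flat DataFrame layout with `rank` and `file_name` columns, the sample values, and the "slowest rank" bandwidth definition are all assumptions made for illustration. Only the counter names follow the Darshan POSIX naming style, and special cases such as shared-file records are ignored.

```python
import pandas as pd

# Hypothetical, hand-made records; in practice the DataFrame would already
# hold the records read from the log the first time around.
records = pd.DataFrame(
    {
        "rank": [0, 1, 2, 3],
        "file_name": [
            "/scratch/out.h5",
            "/scratch/out.h5",
            "/home/u/run.log",
            "/scratch/ckpt.h5",
        ],
        "POSIX_BYTES_READ": [0, 0, 0, 1048576],
        "POSIX_BYTES_WRITTEN": [2097152, 2097152, 4096, 0],
        "POSIX_F_READ_TIME": [0.0, 0.0, 0.0, 0.5],
        "POSIX_F_WRITE_TIME": [1.0, 1.5, 0.01, 0.0],
        "POSIX_F_META_TIME": [0.5, 0.5, 0.01, 0.5],
    }
)


def simple_derived_metrics(df, file_pattern=None):
    """Return total bytes and a rough bandwidth estimate (illustrative only)."""
    if file_pattern is not None:
        # Regex-filter directly on the DataFrame; no C/CFFI round trip needed.
        df = df[df["file_name"].str.contains(file_pattern, regex=True)]

    total_bytes = int((df["POSIX_BYTES_READ"] + df["POSIX_BYTES_WRITTEN"]).sum())

    # Assumption: I/O time per record is read + write + metadata time.
    io_time = (
        df["POSIX_F_READ_TIME"]
        + df["POSIX_F_WRITE_TIME"]
        + df["POSIX_F_META_TIME"]
    )
    # Assumption: the slowest rank bounds the overall I/O time.
    per_rank_time = io_time.groupby(df["rank"]).sum()
    slowest = float(per_rank_time.max()) if len(per_rank_time) else 0.0

    total_mib = total_bytes / 2**20
    bandwidth = total_mib / slowest if slowest > 0 else 0.0
    return {
        "total_bytes": total_bytes,
        "slowest_rank_time": slowest,
        "bandwidth_mib_per_s": bandwidth,
    }


all_files = simple_derived_metrics(records)
hdf5_only = simple_derived_metrics(records, file_pattern=r"\.h5$")
```

Restricting the metrics to a subset of files is then a single extra argument, which is the kind of custom file filtering that would otherwise need to cross the CFFI boundary.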

@tylerjereddy
Collaborator Author

Ok, CI issues notwithstanding, the new test_perf_estimate() does pass for two log files here. I won't carry this any further for now because it looks like things can get more complex with MPI-IO involvement and the logic is already written in C.

Still, we may do this someday for some of the reasons outlined above, especially if it is more performant (no re-reading records across language barriers) and more maintainable.

* this is a Python/pandas-only version of doing
some simple derived metrics; I don't think we'll
actually do this, but I was exploring a bit because
of the difficulties in darshan-hpcgh-839

* this matches pretty well with the `perl` based reports
for total bytes, but even simple cases can sometimes
disagree on bandwidth per darshan-hpcgh-847, so now I'm curious
what is going on

* one problem with doing this is that we'd have the same
algorithms implemented in two different languages; the advantages
include:
- not reading all the records in a second time, crossing the CFFI
boundary each time
- easier to debug/maintain because bounds checking/no segfaults, etc.
* simplify `perf_estimate()` by removing `mod_name_adjusted`
and correct total time calculation to include `META`
@tylerjereddy tylerjereddy force-pushed the treddy_derived_metrics_in_python branch from bf52be3 to 73c4c7e Compare November 9, 2022 21:03
@tylerjereddy
Collaborator Author

We're not doing this for now, though long-term it is likely far easier to maintain via pandas or something high-level.
