WIP, DEBUG: Python derived metrics #848

tylerjereddy · 2022-11-09T19:03:33Z

* This is a Python/pandas-only version of some simple derived metrics. I don't think we'll actually do this, but I was exploring a bit because of the difficulties in ENH: Python derived/accum interface #839.
* This matches the perl-based reports pretty well for total bytes, but even simple cases can sometimes disagree on bandwidth (see Query: bandwidth calculation for derived metrics #847), so now I'm curious what is going on.
* One problem with doing this is that we'd have the same algorithms implemented in two different languages; the advantages include:
  - no need to read all the records a second time, one at a time, crossing the CFFI boundary each time (all the records are already stored in the DataFrame from the first read)
  - easier to debug/maintain because of bounds checking, no segfaults, etc.
  - likely easier to regex-filter directly on the pandas data structures than via interop with C/CFFI, e.g., for custom file derived metrics
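To make the last point concrete, here is a minimal sketch of summing total bytes with an optional regex filter on file names, done entirely in pandas. The DataFrame below is synthetic: the column names mirror the Darshan POSIX counters `POSIX_BYTES_READ` and `POSIX_BYTES_WRITTEN`, but the `file_name` column, the `total_bytes()` helper, and all values are assumptions for illustration, not the code in this PR.

```python
import pandas as pd

# Synthetic per-record table (made-up paths and values); in practice the
# records would already be in a DataFrame from the first read of the log.
records = pd.DataFrame(
    {
        "rank": [0, 1, 0, 1],
        "file_name": [
            "/scratch/out/a.h5",
            "/scratch/out/a.h5",
            "/home/u/log.txt",
            "/scratch/out/b.h5",
        ],
        "POSIX_BYTES_READ": [1024, 2048, 0, 512],
        "POSIX_BYTES_WRITTEN": [4096, 0, 100, 512],
    }
)


def total_bytes(df, pattern=None):
    """Sum read + written bytes, optionally only for matching file names.

    `pattern` is a regular expression applied to the (hypothetical)
    `file_name` column; no C/CFFI round trip is involved.
    """
    if pattern is not None:
        df = df[df["file_name"].str.contains(pattern, regex=True)]
    return int(df["POSIX_BYTES_READ"].sum() + df["POSIX_BYTES_WRITTEN"].sum())


all_bytes = total_bytes(records)
hdf5_bytes = total_bytes(records, r"\.h5$")
```

Because the filter is just a boolean mask on the DataFrame, swapping in a different pattern or a different grouping is a one-line change.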

darshan-util/pydarshan/darshan/tests/test_derived_metrics.py

tylerjereddy · 2022-11-09T20:51:46Z

OK, CI issues notwithstanding, the new test_perf_estimate() does pass for two log files here. I won't carry this any further for now because it looks like things can get more complex with MPI-IO involvement and the logic is already written in C.

Still, we may do this someday for some of the reasons outlined above, especially if it is more performant (no re-reading records across language barriers) and more maintainable.

* this is a Python/pandas-only version of doing some simple derived metrics; I don't think we'll actually do this, but I was exploring a bit because of the difficulties in darshan-hpcgh-839
* this matches pretty well with the `perl` based reports for total bytes, but even simple cases can sometimes disagree on bandwidth per darshan-hpcgh-847, so now I'm curious what is going on
* one problem with doing this is that we'd have the same algorithms implemented in two different languages; the advantages include:
  - not reading all the records in a second time, crossing the CFFI boundary each time
  - easier to debug/maintain because bounds checking/no segfaults, etc.

* simplify `perf_estimate()` by removing `mod_name_adjusted` and correct total time calculation to include `META`
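For context on the `META` change, a rough sketch of a bandwidth estimate whose total time includes metadata time might look like the following. This is not the `perf_estimate()` from this PR or the C logic in darshan-util: the function name `perf_estimate_sketch()`, the "slowest rank" choice, and the synthetic data are assumptions, and it ignores complications such as shared records and MPI-IO. The column names follow the Darshan POSIX counters `POSIX_F_READ_TIME`, `POSIX_F_WRITE_TIME`, and `POSIX_F_META_TIME`.

```python
import pandas as pd

MIB = 2**20

# Synthetic records: 8 MiB moved in total across two ranks (made-up values).
records = pd.DataFrame(
    {
        "rank": [0, 0, 1],
        "POSIX_BYTES_READ": [2 * MIB, 0, 1 * MIB],
        "POSIX_BYTES_WRITTEN": [1 * MIB, 2 * MIB, 2 * MIB],
        "POSIX_F_READ_TIME": [1.0, 0.0, 0.5],
        "POSIX_F_WRITE_TIME": [0.5, 1.5, 1.0],
        "POSIX_F_META_TIME": [0.5, 0.5, 0.5],
    }
)


def perf_estimate_sketch(df):
    """Return an estimated bandwidth in MiB/s (simplified assumption).

    Total time per record is read + write + META time; the estimate
    divides the total bytes by the summed time of the slowest rank.
    """
    io_time = (
        df["POSIX_F_READ_TIME"]
        + df["POSIX_F_WRITE_TIME"]
        + df["POSIX_F_META_TIME"]
    )
    # Sum the time spent by each rank, then take the slowest rank.
    slowest = io_time.groupby(df["rank"]).sum().max()
    total_mib = (df["POSIX_BYTES_READ"] + df["POSIX_BYTES_WRITTEN"]).sum() / MIB
    return float(total_mib / slowest)


bandwidth = perf_estimate_sketch(records)
```

In this toy data, leaving out the `META` column would shorten the slowest rank's time and so inflate the estimate, which is the kind of discrepancy the commit above corrects.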

tylerjereddy · 2022-12-05T17:40:55Z

We're not doing this for now, though long-term it is likely far easier to maintain via pandas or something high-level.

github-actions bot added the pydarshan label Nov 9, 2022


tylerjereddy mentioned this pull request Nov 9, 2022

ENH: Python derived/accum interface #839

Merged

tylerjereddy added 2 commits November 9, 2022 14:03

MAINT: PR 848

73c4c7e


tylerjereddy force-pushed the treddy_derived_metrics_in_python branch from bf52be3 to 73c4c7e Compare November 9, 2022 21:03

tylerjereddy closed this Dec 5, 2022
