
WIP, ENH: add df->rec converter #886

Merged

Conversation

@tylerjereddy (Collaborator) commented on Jan 5, 2023

  • add a per-record converter from DataFrame to the darshan-util C record format that passes tests for single records (rows) with counters and fcounters (but not rank & id yet)

  • add a full df->multi-record converter; this one still leads to segfaults in the C inject function (probably corrupted data):

```
../darshan-logutils-accumulator.c:145: darshan_accumulator_inject: Assertion `rank < acc->job_nprocs' failed.
```
  • one of the two added tests passes and the other xfails; feel free to push in fixes

  • as discussed in the meeting earlier today (Rob R. seems to agree), this likely shouldn't be a long-term solution; it's just a stopgap until we can hoist all of the analysis infrastructure to the Python level

  • some of the accumulator API prototypes are copied in here from other PRs for testing purposes only, since the main purpose of doing something like this is to allow filtering DataFrames and then passing the raw records to the C accum interface (it also avoids re-reading the log file records)

  • I'll leave backwards compatibility and support beyond STDIO/POSIX drafted here to the Argonne folks, most likely
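The per-record repacking idea in the first bullet can be sketched with the standard-library `struct` module. The field layout below is purely illustrative; the real POSIX/STDIO record layouts come from the darshan-util C headers:

```python
import struct

# Hypothetical mini-record: id (uint64), rank (int64),
# 3 integer counters (int64) and 2 float counters (double).
# The actual darshan module layouts are defined in the C headers.
REC_FMT = "<Qq3q2d"

def pack_record(rec_id, rank, counters, fcounters):
    """Pack one DataFrame row's worth of values into C-layout bytes."""
    return struct.pack(REC_FMT, rec_id, rank, *counters, *fcounters)

buf = pack_record(42, 0, [1, 2, 3], [0.5, 1.5])
# the buffer is exactly the size the C side would expect for one record
assert len(buf) == struct.calcsize(REC_FMT)
```

The C inject call then only needs a pointer to this buffer, which is why getting id/rank packed correctly matters (the `rank < acc->job_nprocs` assertion reads those bytes directly).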

@tylerjereddy added the enhancement label on Jan 5, 2023
* `_df_to_rec()` has been simplified and is far more
robust/correct now that we use a `recarray`; a separate
function for multiple records was no longer needed, so it
was purged

* cleanups in `test_df_to_rec()`, and add checks
for id/rank as well now that they are properly re-packed

* `test_reverse_record_array()` now demonstrates the
Python filter df -> C accum workflow much more clearly; it
still has some kind of memory issue/discrepancy with
darshan-hpcgh-867, but much closer now...
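The `recarray` simplification can be sketched as follows; the field names and dtypes here are illustrative stand-ins, not the actual darshan-util layout:

```python
import numpy as np

# Illustrative record dtype: id, rank, then fixed-size counter arrays.
rec_dtype = np.dtype([
    ("id", "<u8"),
    ("rank", "<i8"),
    ("counters", "<i8", (3,)),
    ("fcounters", "<f8", (2,)),
])

rec_arr = np.zeros(2, dtype=rec_dtype)
rec_arr["id"] = [101, 202]
rec_arr["rank"] = [0, 1]
rec_arr["counters"][0] = [1, 2, 3]

# tobytes() yields one contiguous buffer of packed records, so a single
# converter covers both the one-record and multi-record cases.
buf = rec_arr.tobytes()
assert len(buf) == 2 * rec_dtype.itemsize
```

Because the buffer is just `n_records * itemsize` bytes laid out back to back, the multi-record converter falls out of the single-record one for free.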
@tylerjereddy (Collaborator, Author)

This is much closer now; `test_reverse_record_array` even shows the C accumulator machinery operating on a pre-filtered DataFrame that has been converted into a buffer of records, and it retrieves somewhat-reasonable file count values locally from the derived metrics. If I don't pre-filter, I see the same total file count described in gh-867 (1026 total files), so that's a good sign.

There's a memory access/free-related issue caused by something in `test_reverse_record_array`; feel free to give me a hand with that. Otherwise I think this is pretty close.
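The filter-then-accumulate workflow described above looks roughly like this; the column names and the packing step are illustrative stand-ins for the PyDarshan DataFrame columns and `_df_to_rec()`:

```python
import pandas as pd

# Toy per-record DataFrame, one row per file record.
df = pd.DataFrame({
    "id": [101, 202, 303],
    "rank": [0, 1, 1],
    "POSIX_OPENS": [4, 0, 7],
})

# Python-side filtering: keep only records that actually opened files.
filtered = df[df["POSIX_OPENS"] > 0]

# Stand-in for _df_to_rec(): pack the filtered rows into one contiguous
# bytes buffer that could then be handed to the C accumulator inject call.
rec_bytes = filtered.to_records(index=False).tobytes()
assert len(filtered) == 2
```

The point of the design is that all of the filtering flexibility stays in pandas, and only the final packed buffer crosses the CFFI boundary.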

* fix up `test_reverse_record_array` memory issues, parametrize
it over scenarios that do/don't match the Perl infrastructure, and
clean up that test
* support older `pandas` versions with the slicing
index in `_df_to_rec()`
@tylerjereddy (Collaborator, Author)

Ok, CI is passing now, so this is getting to be in pretty solid shape.

I'll summarize some useful things reviewers could help with:

  • check for any residual memory leaks in the new function/tests by using, e.g., memory_profiler to stress-test repeated calls
  • help me figure out whether we should be worried about the mismatch in accumulator file count values for the stat-filter scenario (see the one xfailed test case in this PR and the discussion in "file category count mismatch between darshan-parser and darshan-job-summary.pl" #867). That will take some digging, I suspect, but since we match darshan-parser perfectly when we don't filter the stat'd files, I'm not sure we have any problem on the Python side; still, it would be good to get an explanation for the filtered case
  • help deal with this warning, maybe:

```
tests/test_cffi_misc.py: 10 warnings
  /home/tyler/python_310_darshan_venv/lib/python3.10/site-packages/darshan/backend/cffi_backend.py:716: UserWarning: implicit cast from 'char *' to a different pointer type: will be forbidden in the future (check that the types are as you expect; use an explicit ffi.cast() if they are correct)
    buf[0] = ffi.from_buffer(rec_arr)
```
  • determine if there are any modules supported by the accumulator machinery that wouldn't automatically work with the infrastructure that is working for STDIO and POSIX here; for example, LUSTRE may have variable-length fields that require additional shims, but I don't think LUSTRE is supported for accumulation anyway. Related to this: any more relevant test cases you'd like to see.
  • obviously, design feedback on the repacking workflow in general; while I still maintain we should eventually hoist all of the analysis up to Python (which would still use C/C++ under the hood, of course, but only within an ecosystem that accepts Python objects/buffers by design), "eventually" might mean a few months or years, which is why I'm providing this option

@tylerjereddy (Collaborator, Author)

Another thing I thought of: it may be possible to reduce the complexity of `test_reverse_record_array` if we could replace calls like `darshan_accumulator_create()` with an equivalent struct initialization on the CFFI/Python/NumPy recarray side. Whether that's worth it depends on how complex the logic is, and how annoying it is to do portably in Python across the different modules.
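As a rough sketch of that idea, a pure-Python stand-in for `darshan_accumulator_create()` plus inject could look like the following. This is purely hypothetical structure for illustration, not the real C accumulator layout:

```python
from dataclasses import dataclass, field

@dataclass
class PyAccumulator:
    # Hypothetical Python-side analogue of the C accumulator state.
    job_nprocs: int
    total_count: int = 0
    counter_sums: dict = field(default_factory=dict)

    def inject(self, record):
        # Mirror the C-side sanity check that tripped the assertion
        # (rank < acc->job_nprocs) earlier in this PR.
        if not record["rank"] < self.job_nprocs:
            raise ValueError("rank >= job_nprocs: likely corrupted record")
        self.total_count += 1
        for name, val in record["counters"].items():
            self.counter_sums[name] = self.counter_sums.get(name, 0) + val

acc = PyAccumulator(job_nprocs=4)
acc.inject({"rank": 0, "counters": {"POSIX_OPENS": 3}})
acc.inject({"rank": 1, "counters": {"POSIX_OPENS": 2}})
assert acc.counter_sums["POSIX_OPENS"] == 5
```

Whether this beats calling into the C machinery depends on how much per-module derived-metric logic would need to be reimplemented.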

@shanedsnyder (Contributor)

I still need to review the tests, but the conversion code looks good to me. I don't have any suggestions yet, but maybe I will after I work through some of the issues you mention for reviewers. I'm not sure it's a big deal, but conceptually I would think you could always return a buffer of the underlying record type, as opposed to a `char[]` as you do in the multi-record case; I wonder if that has something to do with the warning you mention.

> determine if there are any modules supported by the accumulator machinery that wouldn't automatically work with the infrastructure that is working for STDIO and POSIX here; for example, LUSTRE may have variable-length fields that require additional shims, but I don't think LUSTRE is supported for accumulation anyway. Related to this: any more relevant test cases you'd like to see.

Luckily, all of the modules that would ultimately support accumulation follow the same structure as what you already have implemented for POSIX/STDIO, with minor tweaks. Lustre is one that won't support accumulation. The only issue I can think of is that the H5D and PNETCDF_VAR modules have an additional field in the structure that will need to be packed, but that's probably not a big hassle.

* modified expected values for case where filtering is applied,
  and removed xfail tag
    - the expected values from the Perl summary report sum the
      derived metrics across the POSIX and STDIO modules, which
      is why the old expected values are larger than the values
      that are now passing
* make sure to call accumulator_destroy to avoid mem leaks
* cache nprocs/modules so we don't need to reopen the log
@shanedsnyder (Contributor)

> help me figure out whether we should be worried about the mismatch in accumulator file count values for the stat-filter scenario (see the one xfailed test case in this PR and the discussion in "file category count mismatch between darshan-parser and darshan-job-summary.pl" #867). That will take some digging, I suspect, but since we match darshan-parser perfectly when we don't filter the stat'd files, I'm not sure we have any problem on the Python side; still, it would be good to get an explanation for the filtered case

I just pushed a fix for this and a couple of other smaller things for this test. The short story is that the derived metrics presented in the old Perl report are actually summed across both the POSIX and STDIO modules, so we needed to account for that difference in this test since it's POSIX-only. I modified the expected values to reflect POSIX-module-only values. I started to add STDIO support for the test, but it was getting messy, and I'm not sure there's much value in us testing the expected values for combined modules.

For a little more background: in the past we have often combined POSIX/STDIO data when doing some analyses, since they are both found at the lowest layer of the I/O stack (that we can instrument) and they don't layer on top of each other, so we don't have to worry about double-counting statistics. If you recall, we had a similar issue with the "file access by category" plots, where we needed to update them to additionally include STDIO data. That said, I'm fine with our current approach of providing these stats on a per-module basis in the new PyDarshan job summary tool, so I don't think there's much to follow up on here.

* always return a Python bytes object (return value from recarray
  tobytes() method) from _df_to_rec
* adjust `test_df_to_rec()` to "cast" from returned bytes to
  a record pointer we can use to sanity check record fields
* fix `test_reverse_record_array()` usage of 'summation_record'
  argument to C library `accumulator_emit()` routine
    - 'summation_record' is a pointer to a record structure for
      the corresponding module to write an aggregate record
      (i.e., a record accumulated from all injected records)
    - explicitly allocate a 'summation_record' and pass it to
      the emit routine, similar to what is done with the
      derived_metrics structure
    - remove unnecessary secondary call into `_df_to_rec()`
@shanedsnyder (Contributor)

I just pushed another commit that, among other things, addresses the warning you mentioned related to an implicit cast.

I think it simplifies `_df_to_rec()` a little bit as an added bonus; it now always returns a bytes object, whereas previously it returned different things depending on whether you gave it one record or multiple records.

The tests have been updated a bit, too, and are passing for me locally:

  • test_df_to_rec() now casts the returned bytes object into the appropriate record pointer type so it can sanity-check specific fields
  • test_reverse_record_array() was not handling one of the arguments to the accumulator_emit() routine properly, so I fixed that, though we aren't using this value yet
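The bytes-to-record-pointer cast can be illustrated with the standard-library `ctypes` in place of CFFI; the struct layout here is a made-up stand-in, since the real one comes from the darshan CFFI definitions:

```python
import ctypes

class MiniRecord(ctypes.Structure):
    # Hypothetical stand-in for a darshan module record struct.
    _fields_ = [
        ("id", ctypes.c_uint64),
        ("rank", ctypes.c_int64),
        ("counters", ctypes.c_int64 * 3),
    ]

# Bytes as returned by a tobytes()-style converter.
raw = bytes(MiniRecord(7, 1, (ctypes.c_int64 * 3)(10, 20, 30)))

# "Cast" the bytes back into a typed record we can sanity-check field
# by field, analogous to an explicit ffi.cast on the CFFI side.
rec = MiniRecord.from_buffer_copy(raw)
assert rec.id == 7 and rec.rank == 1 and rec.counters[2] == 30
```

This round trip (struct to bytes and back) is also a convenient way to verify in tests that a packed buffer really has the layout the C side expects.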

@tylerjereddy (Collaborator, Author)

I checked all 4 of your commits and they look like clear improvements to me. I suppose you could also update the return-value docstring for `_df_to_rec` now that it simply returns bytes, but it is a private function anyway.

I guess a decision will have to be made on whether to proceed with this PR first and then modify some of the older aggregation PRs to use this machinery, instead of the previous approach of hand-feeding records from Python to the accumulator.

@tylerjereddy (Collaborator, Author)

mypy 1.0.0 was released two days ago; maybe just pin to mypy<1.0.0 to silence CI for now (the proper fix is probably easy too, but not a priority).

@shanedsnyder (Contributor)

I think it all looks good here. Not an ideal solution, but our hands are tied until we can move more of this into Python. In the meantime, this will let us use the existing C aggregation machinery while giving us flexibility (e.g., for filtering) via pandas.

@shanedsnyder shanedsnyder merged commit adbe40e into darshan-hpc:main Feb 8, 2023