
Have rdfentry_ represent the global entry number in the TChain even in MT runs #12190

eguiraud opened this issue Jan 31, 2023 · 13 comments

eguiraud (Contributor) commented Jan 31, 2023

Currently, RDF's special column rdfentry_, despite its name, does not correspond to the global TChain entry number in multi-thread (MT) runs (see also the relevant docs).

This is surprising for users (hence the big warning in the docs linked above) and makes it unnecessarily difficult to e.g. attach a numpy array as an additional column (because it's hard to index into it correctly without stable global row numbers).
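
For illustration, here is a minimal C++ sketch of that use case, with made-up tree/file names and a std::vector standing in for the numpy array: the lookup is only correct if rdfentry_ is the global entry number.

```cpp
#include <ROOT/RDataFrame.hxx>
#include <vector>

void attach_external_column()
{
   // Hypothetical external buffer with one value per entry in the whole chain
   // (the C++ analogue of a numpy array attached from Python).
   std::vector<double> weights(200000, 1.); // made-up size

   ROOT::RDataFrame df("Events", {"f1.root", "f2.root"}); // made-up dataset

   // This lookup is only correct if rdfentry_ is the global entry number:
   // in MT runs it currently restarts within each task, so the wrong
   // elements of `weights` would be read.
   auto df2 = df.Define("weight",
                        [&weights](ULong64_t entry) { return weights[entry]; },
                        {"rdfentry_"});
}
```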

We could instead make rdfentry_ always match the "real" (global) entry number in the dataset; the missing piece is that each MT task would need to know the offset of its current tree w.r.t. all the other trees in the chain.

Proposed solution

  • have TTreeProcessorMT tell each MT task which tree it is processing w.r.t. the global chain (#1, #2, #3, ...)
  • have each task calculate its tree's offset by going over a list of per-tree entry counts, filling in missing values as needed (the list would be implemented as a fixed-size array of nTrees atomic elements; since threads only need to write to an atomic element if they see its value has not been calculated yet, thread contention should be minimal). A sketch of this idea is shown after this list.
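
A minimal sketch of this lazy-fill idea, assuming hypothetical names (kUnknown and GetTreeOffset are illustrative, not actual TTreeProcessorMT code) and omitting error handling:

```cpp
#include <TFile.h>
#include <TTree.h>
#include <atomic>
#include <memory>
#include <string>
#include <vector>

// entryCounts must be pre-sized to nTrees and initialized to kUnknown.
constexpr Long64_t kUnknown = -1;

Long64_t GetTreeOffset(std::size_t treeIdx,
                       const std::vector<std::string> &fileNames,
                       const std::string &treeName,
                       std::vector<std::atomic<Long64_t>> &entryCounts)
{
   Long64_t offset = 0;
   // The offset of tree #treeIdx is the sum of the entry counts of all
   // previous trees in the chain.
   for (std::size_t i = 0; i < treeIdx; ++i) {
      Long64_t n = entryCounts[i].load();
      if (n == kUnknown) {
         // Missing value: retrieve it ourselves; subsequent tasks that need
         // it will then find it in the shared list.
         std::unique_ptr<TFile> f{TFile::Open(fileNames[i].c_str())};
         n = f->Get<TTree>(treeName.c_str())->GetEntries();
         entryCounts[i].store(n); // benign race: every writer stores the same value
      }
      offset += n;
   }
   return offset;
}
```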

Other solutions considered

  • we could always build a global TChain, for every task, and always use global entry numbers everywhere. However, this would require TTreeProcessorMT to read the number of entries in each tree before the tasks even start, because it first needs to come up with entry ranges for each task. My intuition is that this would have a larger performance impact than the proposed solution: we know from DistRDF that (redundantly) opening O(1k) remote files at startup is a significant cost.
  • we could do nothing: rdfentry_ would be unstable and it could not be relied upon to e.g. index into manually added "friend columns" or to fill TEntryLists (as this user would have liked to do)

eguiraud self-assigned this on Jan 31, 2023
eguiraud changed the title from "Fix rdfentry_: it should represent the global entry number in the TChain even in MT runs" to "Have rdfentry_ represent the global entry number in the TChain even in MT runs" on Jan 31, 2023

pcanal (Member) commented Jan 31, 2023

I suppose the proposed solution can work, but it requires a synchronization step of sorts. The thread handling file number n must wait until files [0, n-1] have been opened before starting to process entries; the file opening can be (somewhat) parallel, but still, if an arbitrary file inside the chain is much smaller than the others, the thread processing it will have to wait a bit.

eguiraud (Contributor, Author) commented:

In the proposed solution, if a task needs an entry count that is not there yet, it goes and retrieves it itself (subsequent tasks that need that entry count will then find it in the shared list).

eguiraud added a commit to eguiraud/root that referenced this issue Jan 31, 2023

pcanal (Member) commented Jan 31, 2023

If we are not careful, this might lead to a situation where, at startup with n cores/n tasks, we issue n(n+1)/2 file opens (e.g. at the very least the first file would be requested to be opened n times).


pcanal (Member) commented Jan 31, 2023

we could do nothing: rdfentry_ would be unstable and it could not be relied upon to e.g. index into manually added "friend columns"

Indeed, the global entry number is needed to load the proper friend. For example, we could have a friend that is a chain whose files have different lengths (numbers of entries) than the files in the main chain, even though the total length is the same; consequently, a single file in the main chain may have to use/open two or more files from the friend chain.

I.e. we would also need to keep a running total for the friends.
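
To illustrate with made-up numbers: if the main chain is split as 100+100 entries and the friend chain as 150+50, the totals match, but global entry 120 falls in the second main-chain file while still living in the first friend file, so locating it requires the friend chain's running totals:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
   std::vector<long long> mainEntries{100, 100};  // main chain: 2 files
   std::vector<long long> friendEntries{150, 50}; // friend chain: same total, different split

   // Cumulative end-of-file boundaries for the friend chain: {150, 200}.
   std::vector<long long> friendBoundaries;
   long long total = 0;
   for (auto n : friendEntries)
      friendBoundaries.push_back(total += n);

   // Global entry 120 is entry 20 of the second main-chain file, but it lives
   // in the *first* friend file, at local entry 120.
   long long globalEntry = 120;
   auto it = std::upper_bound(friendBoundaries.begin(), friendBoundaries.end(), globalEntry);
   long long friendFile = it - friendBoundaries.begin();
   long long localEntry = globalEntry - (friendFile == 0 ? 0 : friendBoundaries[friendFile - 1]);
   std::printf("global entry %lld -> friend file #%lld, local entry %lld\n",
               globalEntry, friendFile, localEntry);
}
```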


Axel-Naumann (Member) commented Feb 1, 2023

Sounds good! What might help - especially during startup, where I agree with @pcanal that things can get a bit wild - is to tell the workers: "report your tree's entries back, and I will then - at some point in the near future - let you know your global offset, once I know it". I.e. the sync step @pcanal was referring to. It makes sense to parallelize that!


eguiraud (Contributor, Author) commented Feb 1, 2023

If we are not careful, this might lead to a situation where, at startup with n cores/n tasks, we issue n(n+1)/2 file opens (e.g. at the very least the first file would be requested to be opened n times).

Mmmh that's right...we'll have to be careful.

the global number is needed to load the proper friend

What I'm saying only applies when there are no friends. When there are friend trees (or a TEntryList), TTreeProcessorMT currently opens all files once at the beginning to recover all tree entry counts, and each task builds the full chain (and then processes a certain range of global entry numbers).


eguiraud (Contributor, Author) commented Feb 1, 2023

report your tree's entries back, and I will then - at some point in the near future - let you know your global offset

Ah, good idea, this avoids the problem Philippe mentioned above with all tasks trying to recover the number of entries of the first file at the same time.

vepadulano (Member) commented:

I think the proposed idea (plus the discussion so far) makes sense; I have nothing to add on the spot. If in the end it still proves to have a tangible startup cost, we could also think about doing it only if the rdfentry_ column is actually requested in the application.


pcanal (Member) commented Feb 1, 2023

We could also think about doing it only if the rdfentry_ column is actually requested in the application

+1


eguiraud (Contributor, Author) commented Feb 1, 2023

report your tree's entries back, and I will then - at some point in the near future - let you know your global offset

@Axel-Naumann on second thought this is quite complicated... at the point where a task (which might e.g. be processing tree #4) needs to know the number of entries in trees #1, #2 and #3, there is no guarantee that the corresponding tasks are even running.

doing it only if the rdfentry_ column is actually requested in the application

I don't think that at the moment RDF has any logic that "reflects" at a global level on which columns are used, but I guess we could add something ad-hoc.


Axel-Naumann (Member) commented Feb 2, 2023

Good point.

I just dislike the submission computer having to open all files: it potentially saturates the bandwidth, it is not as parallel as it could be (given there might be a whole cluster waiting), and the storage might be optimized for the cluster more than for the submission computer (a laptop?).

Can we - as a first task - submit to the workers the opening of all files / the reporting of their entry counts? Only once they have reported back would the main task start, possibly only if it uses friends or rdfentry_.


eguiraud (Contributor, Author) commented Feb 2, 2023

I just dislike the submission computer having to open all files

Distributed RDF employs yet another strategy that doesn't require opening all files before submitting data processing tasks but potentially produces more unbalanced tasks or some empty tasks (it's unclear to me if that has any visible performance impact or not).

For local, multi-thread RDF, I guess we can either do the same as distributed RDF or come up with a way to schedule this graph of tasks efficiently (with data processing task #N depending on ttree-entry-retrieval tasks #1, #2, ..., #N), e.g. via TBB task graphs.
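
Purely as an illustration of that task-graph idea (the node bodies below are placeholders, not ROOT code), a TBB flow-graph sketch in which processing task #i only fires once the entry-retrieval tasks for trees #0..#i have completed could look like this:

```cpp
#include <tbb/flow_graph.h>
#include <cstdio>
#include <vector>

int main()
{
   const int nTrees = 4;
   std::vector<long long> entries(nTrees, 0); // filled by the retrieval tasks

   tbb::flow::graph g;
   using node_t = tbb::flow::continue_node<tbb::flow::continue_msg>;
   std::vector<node_t> retrieve, process;
   retrieve.reserve(nTrees);
   process.reserve(nTrees);

   for (int i = 0; i < nTrees; ++i) {
      retrieve.emplace_back(g, [i, &entries](const tbb::flow::continue_msg &) {
         entries[i] = 100 * (i + 1); // placeholder for opening file #i and reading its entry count
      });
      process.emplace_back(g, [i, &entries](const tbb::flow::continue_msg &) {
         long long offset = 0;
         for (int j = 0; j < i; ++j)
            offset += entries[j]; // safe: all retrieval dependencies have completed
         std::printf("tree #%d starts at global entry %lld\n", i, offset);
      });
   }

   // Processing task #i depends on retrieval tasks #0..#i.
   for (int i = 0; i < nTrees; ++i)
      for (int j = 0; j <= i; ++j)
         tbb::flow::make_edge(retrieve[j], process[i]);

   for (auto &r : retrieve)
      r.try_put(tbb::flow::continue_msg{});
   g.wait_for_all();
}
```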

eguiraud (Contributor, Author) commented:

After further discussion with @Axel-Naumann and @vepadulano we converged on the following strategy:

  • check whether rdfentry_ is used in the computation graph
  • if yes, check whether the RDatasetSpec provides the number of entries for each of the input trees
  • if there is at least one input tree without an entry count, log a warning and open the necessary files once at the beginning of the event loop to retrieve the missing entry counts

Now we have enough information to tell each task what the global offset of its TTree is.
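
A minimal sketch of that logic, with hypothetical names (ComputeTreeOffsets and the std::optional-based entry list are illustrative, not the actual RDatasetSpec/RDF interface):

```cpp
#include <TFile.h>
#include <TTree.h>
#include <iostream>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Entry counts the spec already provides are reused, missing ones are
// retrieved by opening the corresponding file once before the event loop,
// and the per-tree global offsets follow as a prefix sum.
std::vector<Long64_t> ComputeTreeOffsets(const std::vector<std::string> &fileNames,
                                         const std::string &treeName,
                                         std::vector<std::optional<Long64_t>> entryCounts)
{
   for (std::size_t i = 0; i < fileNames.size(); ++i) {
      if (!entryCounts[i]) {
         std::cerr << "Warning: entry count for " << fileNames[i]
                   << " not provided, opening the file to retrieve it.\n";
         std::unique_ptr<TFile> f{TFile::Open(fileNames[i].c_str())};
         entryCounts[i] = f->Get<TTree>(treeName.c_str())->GetEntries();
      }
   }
   // Prefix sum: the offset of tree #i is the total number of entries before it.
   std::vector<Long64_t> offsets(fileNames.size(), 0);
   for (std::size_t i = 1; i < fileNames.size(); ++i)
      offsets[i] = offsets[i - 1] + *entryCounts[i - 1];
   return offsets;
}
```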

We could also automatically generate a "patched up" version of the dataset spec after this first run (one that contains all TTree entry numbers) and suggest to users that they switch to that one.
