
Have rdfentry_ represent the global entry number in the TChain even in MT runs #12190

eguiraud opened this issue Jan 31, 2023 · 13 comments

eguiraud (Contributor) commented Jan 31, 2023

Currently, RDF's special column rdfentry_, despite its name, does not correspond to the global TChain entry number in multi-thread (MT) runs (see also the relevant docs).

This is surprising for users (hence the big warning in the docs linked above) and makes it unnecessarily difficult to e.g. attach a numpy array as an additional column (because it's hard to index into it correctly without stable global row numbers).
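
For illustration, here is a minimal C++ sketch of that use case, with made-up tree/file names and a std::vector standing in for the numpy array: the lookup is only correct if rdfentry_ is the global entry number.

```cpp
#include <ROOT/RDataFrame.hxx>
#include <vector>

void attach_external_column()
{
   // Hypothetical external buffer with one value per entry in the whole chain
   // (the C++ analogue of a numpy array attached from Python).
   std::vector<double> weights(200000, 1.); // made-up size

   ROOT::RDataFrame df("Events", {"f1.root", "f2.root"}); // made-up dataset

   // This lookup is only correct if rdfentry_ is the global entry number:
   // in MT runs it currently restarts within each task, so the wrong
   // elements of `weights` would be read.
   auto df2 = df.Define("weight",
                        [&weights](ULong64_t entry) { return weights[entry]; },
                        {"rdfentry_"});
}
```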

We could instead make rdfentry_ always match the "real" (global) entry number in the dataset; the missing piece is that each MT task would need to know the offset of its current tree w.r.t. all the other trees in the chain.

Proposed solution

  • have TTreeProcessorMT tell each MT task which tree it is processing w.r.t. the global chain (#1, #2, #3, ...)
  • have each task calculate its tree's offset by going over a list of per-tree entry counts, filling in missing values as needed (the list would be implemented as a fixed-size array of nTrees atomic elements; since threads only need to write to an atomic element if they see its value has not been calculated yet, thread contention should be minimal). A sketch of this idea is shown after this list.
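
A minimal sketch of this lazy-fill idea, assuming hypothetical names (kUnknown and GetTreeOffset are illustrative, not actual TTreeProcessorMT code) and omitting error handling:

```cpp
#include <TFile.h>
#include <TTree.h>
#include <atomic>
#include <memory>
#include <string>
#include <vector>

// entryCounts must be pre-sized to nTrees and initialized to kUnknown.
constexpr Long64_t kUnknown = -1;

Long64_t GetTreeOffset(std::size_t treeIdx,
                       const std::vector<std::string> &fileNames,
                       const std::string &treeName,
                       std::vector<std::atomic<Long64_t>> &entryCounts)
{
   Long64_t offset = 0;
   // The offset of tree #treeIdx is the sum of the entry counts of all
   // previous trees in the chain.
   for (std::size_t i = 0; i < treeIdx; ++i) {
      Long64_t n = entryCounts[i].load();
      if (n == kUnknown) {
         // Missing value: retrieve it ourselves; subsequent tasks that need
         // it will then find it in the shared list.
         std::unique_ptr<TFile> f{TFile::Open(fileNames[i].c_str())};
         n = f->Get<TTree>(treeName.c_str())->GetEntries();
         entryCounts[i].store(n); // benign race: every writer stores the same value
      }
      offset += n;
   }
   return offset;
}
```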

Other solutions considered

  • we could always build a global TChain, for every task, and always use global entry numbers everywhere. However, this would require TTreeProcessorMT to read the number of entries in each tree before the tasks even start, because it first needs to come up with entry ranges for each task. My intuition is that this would have a larger performance impact than the proposed solution: we know from DistRDF that (redundantly) opening O(1k) remote files at startup is a significant cost.
  • we could do nothing: rdfentry_ would be unstable and it could not be relied upon to e.g. index into manually added "friend columns" or to fill TEntryLists (as this user would have liked to do)

eguiraud self-assigned this on Jan 31, 2023
eguiraud changed the title from "Fix rdfentry_: it should represent the global entry number in the TChain even in MT runs" to "Have rdfentry_ represent the global entry number in the TChain even in MT runs" on Jan 31, 2023

pcanal (Member) commented Jan 31, 2023

I suppose the proposed solution can work, but it requires a synchronization step of sorts. The thread handling file number n must wait until files [0, n-1] have been opened before starting to process entries; the file opening can be (somewhat) parallel, but still, if an arbitrary file inside the chain is much smaller than the others, the thread processing it will have to wait a bit.

eguiraud (Contributor, Author) commented:

In the proposed solution, if a task needs an entry count that is not there yet, it goes and retrieves it itself (subsequent tasks that need that entry count will then find it in the shared list).

eguiraud added a commit to eguiraud/root that referenced this issue Jan 31, 2023

pcanal (Member) commented Jan 31, 2023

If we are not careful, this might lead to a situation where, at startup with n cores/n tasks, we issue n(n+1)/2 file opens (e.g. at the very least the first file would be requested to be opened n times).


pcanal (Member) commented Jan 31, 2023

we could do nothing: rdfentry_ would be unstable and it could not be relied upon to e.g. index into manually added "friend columns"

Indeed, the global entry number is needed to load the proper friend. For example, we could have a friend that is a chain whose files have different lengths (numbers of entries) than the files in the main chain, even though the total length is the same; consequently, a single file in the main chain may have to use/open two or more files from the friend chain.

I.e. we would also need to keep a running total for the friends.
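
To illustrate with made-up numbers: if the main chain is split as 100+100 entries and the friend chain as 150+50, the totals match, but global entry 120 falls in the second main-chain file while still living in the first friend file, so locating it requires the friend chain's running totals:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
   std::vector<long long> mainEntries{100, 100};  // main chain: 2 files
   std::vector<long long> friendEntries{150, 50}; // friend chain: same total, different split

   // Cumulative end-of-file boundaries for the friend chain: {150, 200}.
   std::vector<long long> friendBoundaries;
   long long total = 0;
   for (auto n : friendEntries)
      friendBoundaries.push_back(total += n);

   // Global entry 120 is entry 20 of the second main-chain file, but it lives
   // in the *first* friend file, at local entry 120.
   long long globalEntry = 120;
   auto it = std::upper_bound(friendBoundaries.begin(), friendBoundaries.end(), globalEntry);
   long long friendFile = it - friendBoundaries.begin();
   long long localEntry = globalEntry - (friendFile == 0 ? 0 : friendBoundaries[friendFile - 1]);
   std::printf("global entry %lld -> friend file #%lld, local entry %lld\n",
               globalEntry, friendFile, localEntry);
}
```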


Axel-Naumann (Member) commented Feb 1, 2023

Sounds good! What might help - especially during startup, where I agree with @pcanal that things can get a bit wild - is to tell the workers: "report your tree's entries back, and I will then - at some point in the near future - let you know your global offset, once I know it". I.e. the sync step @pcanal was referring to. It makes sense to parallelize that!


eguiraud (Contributor, Author) commented Feb 1, 2023

If we are not careful, this might lead to a situation where, at startup with n cores/n tasks, we issue n(n+1)/2 file opens (e.g. at the very least the first file would be requested to be opened n times).

Mmmh that's right...we'll have to be careful.

the global number is needed to load the proper friend

What I'm saying only applies when there are no friends. When there are friend trees (or a TEntryList), TTreeProcessorMT currently opens all files once at the beginning to recover all tree entry counts, and each task builds the full chain (and then processes a certain range of global entry numbers).


eguiraud (Contributor, Author) commented Feb 1, 2023

report your tree's entries back, and I will then - at some point in the near future - let you know your global offset

Ah, good idea, this avoids the problem Philippe mentioned above with all tasks trying to recover the number of entries of the first file at the same time.

vepadulano (Member) commented:

I think the proposed idea (plus the discussion so far) makes sense; I have nothing to add on the spot. If in the end it still proves to have a tangible startup cost, we could also think about doing it only if the rdfentry_ column is actually requested in the application.


pcanal (Member) commented Feb 1, 2023

We could also think about doing it only if the rdfentry_ column is actually requested in the application

+1


eguiraud (Contributor, Author) commented Feb 1, 2023

report your tree's entries back, and I will then - at some point in the near future - let you know your global offset

@Axel-Naumann on second thought this is quite complicated... at the point where a task (which might e.g. be processing tree #4) needs to know the number of entries in trees #1, #2 and #3, there is no guarantee that the corresponding tasks are even running.

doing it only if the rdfentry_ column is actually requested in the application

I don't think that at the moment RDF has any logic that "reflects" at a global level on which columns are used, but I guess we could add something ad-hoc.


Axel-Naumann (Member) commented Feb 2, 2023

Good point.

I just dislike the submission computer having to open all files: it potentially saturates the bandwidth, it is not as parallel as it could be (given there might be a whole cluster waiting), and the storage might be optimized for the cluster more than for the submission computer (a laptop?).

Can we - as a first task - submit to the workers the opening of all files / the reporting of their entry counts? Only once they have reported back would the main task start, possibly only if it uses friends or rdfentry_.


eguiraud (Contributor, Author) commented Feb 2, 2023

I just dislike the submission computer having to open all files

Distributed RDF employs yet another strategy that doesn't require opening all files before submitting data processing tasks but potentially produces more unbalanced tasks or some empty tasks (it's unclear to me if that has any visible performance impact or not).

For local, multi-thread RDF, I guess we can either do the same as distributed RDF or come up with a way to schedule this graph of tasks efficiently (with data processing task #N depending on ttree-entry-retrieval tasks #1, #2, ..., #N), e.g. via TBB task graphs.
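
Purely as an illustration of that task-graph idea (the node bodies below are placeholders, not ROOT code), a TBB flow-graph sketch in which processing task #i only fires once the entry-retrieval tasks for trees #0..#i have completed could look like this:

```cpp
#include <tbb/flow_graph.h>
#include <cstdio>
#include <vector>

int main()
{
   const int nTrees = 4;
   std::vector<long long> entries(nTrees, 0); // filled by the retrieval tasks

   tbb::flow::graph g;
   using node_t = tbb::flow::continue_node<tbb::flow::continue_msg>;
   std::vector<node_t> retrieve, process;
   retrieve.reserve(nTrees);
   process.reserve(nTrees);

   for (int i = 0; i < nTrees; ++i) {
      retrieve.emplace_back(g, [i, &entries](const tbb::flow::continue_msg &) {
         entries[i] = 100 * (i + 1); // placeholder for opening file #i and reading its entry count
      });
      process.emplace_back(g, [i, &entries](const tbb::flow::continue_msg &) {
         long long offset = 0;
         for (int j = 0; j < i; ++j)
            offset += entries[j]; // safe: all retrieval dependencies have completed
         std::printf("tree #%d starts at global entry %lld\n", i, offset);
      });
   }

   // Processing task #i depends on retrieval tasks #0..#i.
   for (int i = 0; i < nTrees; ++i)
      for (int j = 0; j <= i; ++j)
         tbb::flow::make_edge(retrieve[j], process[i]);

   for (auto &r : retrieve)
      r.try_put(tbb::flow::continue_msg{});
   g.wait_for_all();
}
```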

eguiraud (Contributor, Author) commented:

After further discussion with @Axel-Naumann and @vepadulano we converged on the following strategy:

  • check whether rdfentry_ is used in the computation graph
  • if yes, check whether the RDatasetSpec provides the number of entries for each of the input trees
  • if there is at least one input tree without an entry count, log a warning and open the necessary files once at the beginning of the event loop to retrieve the missing entry counts

Now we have enough information to tell each task what the global offset of its TTree is.
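
A minimal sketch of that logic, with hypothetical names (ComputeTreeOffsets and the std::optional-based entry list are illustrative, not the actual RDatasetSpec/RDF interface):

```cpp
#include <TFile.h>
#include <TTree.h>
#include <iostream>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Entry counts the spec already provides are reused, missing ones are
// retrieved by opening the corresponding file once before the event loop,
// and the per-tree global offsets follow as a prefix sum.
std::vector<Long64_t> ComputeTreeOffsets(const std::vector<std::string> &fileNames,
                                         const std::string &treeName,
                                         std::vector<std::optional<Long64_t>> entryCounts)
{
   for (std::size_t i = 0; i < fileNames.size(); ++i) {
      if (!entryCounts[i]) {
         std::cerr << "Warning: entry count for " << fileNames[i]
                   << " not provided, opening the file to retrieve it.\n";
         std::unique_ptr<TFile> f{TFile::Open(fileNames[i].c_str())};
         entryCounts[i] = f->Get<TTree>(treeName.c_str())->GetEntries();
      }
   }
   // Prefix sum: the offset of tree #i is the total number of entries before it.
   std::vector<Long64_t> offsets(fileNames.size(), 0);
   for (std::size_t i = 1; i < fileNames.size(); ++i)
      offsets[i] = offsets[i - 1] + *entryCounts[i - 1];
   return offsets;
}
```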

We could also automatically generate a "patched up" version of the dataset spec after this first run (one that contains all TTree entry numbers) and suggest to users that they switch to that one.
