Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ExternalGeneratorFilter to handle a stream skipping a lumi #43783

Merged

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Jan 24, 2024

PR description:

Improve ExternalGeneratorFilter to handle a stream skipping a lumi. Allowing a stream to skip a lumi is a proposed new Framework behavior. See PR #43522, which is not merged yet. This PR is part of an effort to improve code outside the Framework so all modules will be able properly handle the new Framework behavior.

The previous version of the code in this module selected a particular stream to use to run beginLumiProduce in globalBeginLuminosityBlockProduce. For other streams, beginLumiProduce is called in streamBeginLuminosityBlock.

On the selected stream, the pointer to the stream cache is set to null in streamBeginLuminosityBlock. If the stream skips the lumi, then that never happens. Then the compare_exchange_strong in streamEndLuminosityBlockSummary never executes because it is expecting the value to be null and it is not. The same stream cache would be used for the next globalBeginLuminosityBlock. That stream might not be done yet with the preceding lumi. And at that point we would be in a bad state.

There is a second pre-existing problem unrelated to a stream skipping a lumi. This problem probably happens very rarely or never. After a global begin lumi transition occurs and before the selected stream runs streamBeginLuminosityBlock and sets the stream cache pointer to null, a different stream could finish and try to move to the next lumi. The next global begin lumi will wait. That wait depends on the stream cache pointer being set to null before you get there. If that pointer has not been set null yet, then the next lumi could start early and then we would also be in a bad state. This problem could occur even without a stream skipping a lumi.

The fixes in this PR address both issues.

There was also a compiler warning about a std::move call that accomplishes nothing and I fixed that also (one line).

PR validation:

There are some nice and thorough unit tests of this module in a different package, GeneratorInterface/Pythia8Interface. I checked that package out also and ran those tests and they passed. I also temporarily edited the code in my working area to force it to use the new code path in globalEndLuminosityBlockProduce and ran the same unit tests again and they still pass.

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 24, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43783/38536

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • GeneratorInterface/Core (generators)

@SiewYan, @mkirsano, @cmsbuild, @menglu21, @bbilin, @GurpreetSinghChahal, @alberto-sanchez can you please review it and eventually sign? Thanks.
@alberto-sanchez, @mkirsano this is something you requested to watch as well.
@sextonkennedy, @antoniovilela, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@wddgit
Copy link
Contributor Author

wddgit commented Jan 24, 2024

please test

FYI @makortel @Dr15Jones

@wddgit
Copy link
Contributor Author

wddgit commented Jan 24, 2024

Some extra explanation of how this fix works. The new code in the global end lumi transition is only used when the selected stream skips the lumi, which is probably a somewhat rare thing. In that case it takes care of the actions normally handled in streamEndLuminosityBlockSummary.

@@ -287,13 +287,14 @@ void ExternalGeneratorFilter::streamEndRun(edm::StreamID iID, edm::Run const& iR
}
}
void ExternalGeneratorFilter::globalEndRunProduce(edm::Run& iRun, edm::EventSetup const&) const {
iRun.emplace(runInfoToken_, std::move(runCache(iRun.index())->runInfo_));
iRun.emplace(runInfoToken_, runCache(iRun.index())->runInfo_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For anyone else who was wondering the removal of the std::move, the GenRunInfoProduct consists of 7 doubles, so moving brings no benefits over copying.

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-12008b/37028/summary.html
COMMIT: 65b37cf
CMSSW: CMSSW_14_0_X_2024-01-24-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/43783/37028/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@@ -155,6 +155,11 @@ namespace externalgen {
//Only stream 0 sets this at stream end Lumi and it is read at global end Lumi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The comment is definitely wrong and I will delete it. I'm still looking at this. I think I can make the data member that follows const or constexpr...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted the comment and changed randomState_ in the LumiCache from CMS_THREAD_SAFE mutable to const.

}
state_.store(State::kNotReadyForNextGlobalBeginLumi);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be a compare_exchange_strong just incase there is a race for 2 different Lumis calling globalBeginLuminosityBlockProduce. (maybe one Lumi is empty and the next has something and the empty Lumi call was delayed by the OS?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the globalBeginLuminosityBlockProduce function calls for a LuminosityBlock (and the other functions in the global begin lumi transition) must complete before the EventProcessor can move to the next lumi. The EventProcessor does not even look to see what is next before the entire transition for a lumi is complete. We have to get to the call to handleNextEventForStreamAsync for another lumi to start. See:

https://cmssdt.cern.ch/dxr/CMSSW/source/FWCore/Framework/src/EventProcessor.cc#1742

I think this is OK.

@@ -341,16 +340,25 @@ void ExternalGeneratorFilter::streamEndLuminosityBlockSummary(edm::StreamID iID,
GenLumiInfoProduct* iProduct) const {
iProduct->mergeProduct(*streamCache(iID)->endLumiProduce(iLuminosityBlock.run()));

if (lastLumiIndex_ == iLuminosityBlock.index()) {
if (!luminosityBlockCache(iLuminosityBlock.index())->selectedStreamTransitionsCompleted_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a race condition. Two streams could be calling streamEndLuminosityBlockSummary at the same time, one could be trying to read this value at the same time as one is on line 346 changing the value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to our documentation (https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkGlobalModuleInterface#edm_LuminosityBlockSummaryCacheT)

The framework guarantees that only one streamEndLuminosityBlockSummary is called at a time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Matti. There is a mutex lock around those function calls:

https://cmssdt.cern.ch/dxr/CMSSW/source/FWCore/Framework/interface/global/implementors.h#267

Since it is in a lumi cache, different lumis will use different instances and different memory locations. No conflict there. And also globalEndLuminosityBlock will execute after these functions are called. This variable isn't accessed or modified anywhere else so it should be OK.

}
}

void ExternalGeneratorFilter::globalEndLuminosityBlockProduce(edm::LuminosityBlock& iLuminosityBlock,
edm::EventSetup const&,
GenLumiInfoProduct const* iProduct) const {
if (!luminosityBlockCache(iLuminosityBlock.index())->selectedStreamTransitionsCompleted_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe a comment explaining why this is needed would be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment. I think that helps. Thanks.

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-12008b/37112/summary.html
COMMIT: c894f45
CMSSW: CMSSW_14_0_X_2024-01-30-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43783/37112/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2024

Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go in to CMSSW_14_0_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_14_0_X, CMSSW_14_1_X Feb 6, 2024
@makortel
Copy link
Contributor

makortel commented Feb 8, 2024

From core point of view this PR looks good now. @cms-sw/generators-l2 could you review and sign? Thanks!

@cmsbuild cmsbuild modified the milestones: CMSSW_14_0_X, CMSSW_14_1_X Feb 8, 2024
@makortel
Copy link
Contributor

@cms-sw/generators-l2 Could you please review and sign? Thanks!

@makortel
Copy link
Contributor

@cms-sw/generators-l2 ping

3 similar comments
@makortel
Copy link
Contributor

@cms-sw/generators-l2 ping

@makortel
Copy link
Contributor

@cms-sw/generators-l2 ping

@makortel
Copy link
Contributor

@cms-sw/generators-l2 ping

@menglu21
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 25a0225 into cms-sw:master Feb 29, 2024
11 checks passed
@wddgit wddgit deleted the improveExternalGeneratorFilterSkippingALumi branch March 7, 2024 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants