-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve ExternalGeneratorFilter to handle a stream skipping a lumi #43783
Improve ExternalGeneratorFilter to handle a stream skipping a lumi #43783
Conversation
cms-bot internal usage |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43783/38536
|
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages:
@SiewYan, @mkirsano, @cmsbuild, @menglu21, @bbilin, @GurpreetSinghChahal, @alberto-sanchez can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test FYI @makortel @Dr15Jones |
Some extra explanation of how this fix works. The new code in the global end lumi transition is only used when the selected stream skips the lumi, which is probably a somewhat rare thing. In that case it takes care of the actions normally handled in streamEndLuminosityBlockSummary. |
@@ -287,13 +287,14 @@ void ExternalGeneratorFilter::streamEndRun(edm::StreamID iID, edm::Run const& iR | |||
} | |||
} | |||
void ExternalGeneratorFilter::globalEndRunProduce(edm::Run& iRun, edm::EventSetup const&) const { | |||
iRun.emplace(runInfoToken_, std::move(runCache(iRun.index())->runInfo_)); | |||
iRun.emplace(runInfoToken_, runCache(iRun.index())->runInfo_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For anyone else who was wondering the removal of the std::move
, the GenRunInfoProduct
consists of 7 doubles, so moving brings no benefits over copying.
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-12008b/37028/summary.html Comparison SummarySummary:
|
@@ -155,6 +155,11 @@ namespace externalgen { | |||
//Only stream 0 sets this at stream end Lumi and it is read at global end Lumi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment still correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. The comment is definitely wrong and I will delete it. I'm still looking at this. I think I can make the data member that follows const or constexpr...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted the comment and changed randomState_ in the LumiCache from CMS_THREAD_SAFE mutable
to const
.
} | ||
state_.store(State::kNotReadyForNextGlobalBeginLumi); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a compare_exchange_strong
just incase there is a race for 2 different Lumis calling globalBeginLuminosityBlockProduce
. (maybe one Lumi is empty and the next has something and the empty Lumi call was delayed by the OS?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the globalBeginLuminosityBlockProduce
function calls for a LuminosityBlock (and the other functions in the global begin lumi transition) must complete before the EventProcessor
can move to the next lumi. The EventProcessor
does not even look to see what is next before the entire transition for a lumi is complete. We have to get to the call to handleNextEventForStreamAsync
for another lumi to start. See:
https://cmssdt.cern.ch/dxr/CMSSW/source/FWCore/Framework/src/EventProcessor.cc#1742
I think this is OK.
@@ -341,16 +340,25 @@ void ExternalGeneratorFilter::streamEndLuminosityBlockSummary(edm::StreamID iID, | |||
GenLumiInfoProduct* iProduct) const { | |||
iProduct->mergeProduct(*streamCache(iID)->endLumiProduce(iLuminosityBlock.run())); | |||
|
|||
if (lastLumiIndex_ == iLuminosityBlock.index()) { | |||
if (!luminosityBlockCache(iLuminosityBlock.index())->selectedStreamTransitionsCompleted_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a race condition. Two streams could be calling streamEndLuminosityBlockSummary
at the same time, one could be trying to read this value at the same time as one is on line 346 changing the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to our documentation (https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkGlobalModuleInterface#edm_LuminosityBlockSummaryCacheT)
The framework guarantees that only one
streamEndLuminosityBlockSummary
is called at a time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Matti. There is a mutex lock around those function calls:
https://cmssdt.cern.ch/dxr/CMSSW/source/FWCore/Framework/interface/global/implementors.h#267
Since it is in a lumi cache, different lumis will use different instances and different memory locations. No conflict there. And also globalEndLuminosityBlock will execute after these functions are called. This variable isn't accessed or modified anywhere else so it should be OK.
} | ||
} | ||
|
||
void ExternalGeneratorFilter::globalEndLuminosityBlockProduce(edm::LuminosityBlock& iLuminosityBlock, | ||
edm::EventSetup const&, | ||
GenLumiInfoProduct const* iProduct) const { | ||
if (!luminosityBlockCache(iLuminosityBlock.index())->selectedStreamTransitionsCompleted_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe a comment explaining why this is needed would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment. I think that helps. Thanks.
65b37cf
to
c894f45
Compare
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-12008b/37112/summary.html Comparison SummarySummary:
|
Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go in to CMSSW_14_0_X. |
From |
@cms-sw/generators-l2 Could you please review and sign? Thanks! |
@cms-sw/generators-l2 ping |
3 similar comments
@cms-sw/generators-l2 ping |
@cms-sw/generators-l2 ping |
@cms-sw/generators-l2 ping |
+1 |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
Improve ExternalGeneratorFilter to handle a stream skipping a lumi. Allowing a stream to skip a lumi is a proposed new Framework behavior. See PR #43522, which is not merged yet. This PR is part of an effort to improve code outside the Framework so all modules will be able properly handle the new Framework behavior.
The previous version of the code in this module selected a particular stream to use to run beginLumiProduce in globalBeginLuminosityBlockProduce. For other streams, beginLumiProduce is called in streamBeginLuminosityBlock.
On the selected stream, the pointer to the stream cache is set to null in streamBeginLuminosityBlock. If the stream skips the lumi, then that never happens. Then the compare_exchange_strong in streamEndLuminosityBlockSummary never executes because it is expecting the value to be null and it is not. The same stream cache would be used for the next globalBeginLuminosityBlock. That stream might not be done yet with the preceding lumi. And at that point we would be in a bad state.
There is a second pre-existing problem unrelated to a stream skipping a lumi. This problem probably happens very rarely or never. After a global begin lumi transition occurs and before the selected stream runs streamBeginLuminosityBlock and sets the stream cache pointer to null, a different stream could finish and try to move to the next lumi. The next global begin lumi will wait. That wait depends on the stream cache pointer being set to null before you get there. If that pointer has not been set null yet, then the next lumi could start early and then we would also be in a bad state. This problem could occur even without a stream skipping a lumi.
The fixes in this PR address both issues.
There was also a compiler warning about a std::move call that accomplishes nothing and I fixed that also (one line).
PR validation:
There are some nice and thorough unit tests of this module in a different package, GeneratorInterface/Pythia8Interface. I checked that package out also and ran those tests and they passed. I also temporarily edited the code in my working area to force it to use the new code path in globalEndLuminosityBlockProduce and ran the same unit tests again and they still pass.