-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[12_3_X] Backport of 37617 + 37798 #37933
Conversation
A new Pull Request was created by @AdrianoDee for CMSSW_12_3_X. It involves the following packages:
@emanueleusai, @makortel, @fwyzard, @ahmad3213, @cmsbuild, @jfernan2, @clacaputo, @slava77, @jpata, @pmandrik, @micsucmed, @rvenditti can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
test parameters:
|
type trk |
please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7d844a/24680/summary.html GPU Comparison SummarySummary:
Comparison Summary@slava77 comparisons for the following workflows were not done due to missing matrix map:
Summary:
|
+1 |
tested successfully at P5 |
+reconstruction
|
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next CMSSW_12_3_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_5_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
I suspect this PR is related to a crash seen on HLT GPU nodes earlier today at P5. Preliminary checks suggest that there is a missing protection against Could experts please have a look? |
Hi @missirol, is there a log for the crash? |
I see what you mean, let me add the safeguard anyway. |
Thanks, @AdrianoDee . This [*] avoids the crash in my local checks, but I don't know if it is the correct fix. IIuc from FOG, the fix is relevant for a situation (like earlier today) where Pixel is out of global, and the HLT runs on GPU nodes with triggers using Pixel reco. Ideally, the fix would be in the upcoming [*] diff --git a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc
index fda418320e7..2af4588f92f 100644
--- a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc
+++ b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc
@@ -82,7 +82,10 @@ void SiPixelRecHitSoAFromCUDA::acquire(edm::Event const& iEvent,
void SiPixelRecHitSoAFromCUDA::produce(edm::Event& iEvent, edm::EventSetup const& es) {
auto hmsp = std::make_unique<uint32_t[]>(nMaxModules_ + 1);
- std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + nMaxModules_ + 1, hmsp.get());
+
+ if(nHits_ > 0) {
+ std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + nMaxModules_ + 1, hmsp.get());
+ }
iEvent.emplace(hostPutToken_, std::move(hmsp));
iEvent.emplace(hitsPutTokenCPU_, store32_.get(), store16_.get(), hitsModuleStart_.get(), nHits_); |
|
@missirol I think your patch would leave uninitialised data in |
the collection that goes to |
OK, I admit I'm lost about whether If it does, the proposed fix is fine. diff --git a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc
index fda418320e7..2af4588f92f 100644
--- a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc
+++ b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromCUDA.cc
@@ -82,7 +82,11 @@ void SiPixelRecHitSoAFromCUDA::acquire(edm::Event const& iEvent,
void SiPixelRecHitSoAFromCUDA::produce(edm::Event& iEvent, edm::EventSetup const& es) {
auto hmsp = std::make_unique<uint32_t[]>(nMaxModules_ + 1);
- std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + nMaxModules_ + 1, hmsp.get());
+ if (nHits_ > 0) {
+ std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + nMaxModules_ + 1, hmsp.get());
+ } else {
+ std::fill(hmsp.get(), hmsp.get() + nMaxModules_ + 1, 0u);
+ }
iEvent.emplace(hostPutToken_, std::move(hmsp));
iEvent.emplace(hitsPutTokenCPU_, store32_.get(), store16_.get(), hitsModuleStart_.get(), nHits_); |
Thanks for following up, @AdrianoDee ! @fwyzard , thanks for the comments.
Interesting, I think I see it now here.
I was not sure either, all I did was to test and see no crash. :) |
AFAICT it should.
|
PR description:
In order to run properly, a backport to
12_3_X
of #37860 would need backports of #37798 and #37617. This PR is the combination of those backports.PR validation:
Running
39434.501
,39434.502
and11634.x
, with x=501
,502
and503
.