Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLT farm crash in Run 379530 #44759

Closed
trtomei opened this issue Apr 17, 2024 · 23 comments · Fixed by #44764
Closed

HLT farm crash in Run 379530 #44759

trtomei opened this issue Apr 17, 2024 · 23 comments · Fixed by #44764

Comments

@trtomei
Copy link
Contributor

trtomei commented Apr 17, 2024

Reporting a crash in Run 379530

Here's the recipe how to reproduce the crashes (tested with CMSSW_14_0_5_patch1 on lxplus8-gpu):

cmsrel CMSSW_14_0_5_patch1
cd CMSSW_14_0_5_patch1/src
cmsenv

then prepare the reproducer as:

#!/bin/bash -ex

# CMSSW_14_0_5_patch1

hltGetConfiguration run:379530 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/2024-04-17_run379530/run379530_ls0556.root  > hlt.py
  
cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

In the log, we will find the following message:

----- Begin Fatal Exception 17-Apr-2024 12:13:39 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 379530 lumi: 556 event: 1073222806 stream: 0
   [1] Running path 'DST_PFScouting_DoubleMuon_v1'
   [2] Calling method for module MuonHLTEcalPFClusterIsolationProducer/'hltMuonEcalMFPFClusterIsoForMuons'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 2)
----- End Fatal Exception -------------------------------------------------

@cms-sw/hlt-l2 FYI
@cms-sw/muon-pog-l2 FYI (sorry, tagged the DPG by mistake)

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 17, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @trtomei.

@smuzaffar, @Dr15Jones, @rappoccio, @sextonkennedy, @antoniovilela, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

assign hlt

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

@cms-sw/muon-pog-l2 FYI

@cmsbuild
Copy link
Contributor

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

the affected module actually resides in e/gamma package

typedef HLTEcalPFClusterIsolationProducer<reco::RecoChargedCandidate> MuonHLTEcalPFClusterIsolationProducer;

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

@cms-sw/egamma-pog-l2 FYI

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

So this happens because we have a edm::Ref<reco::RecoChargedCandidate> with eta exactly equal to 0.
In this loop:

if (doRhoCorrection_) {
int iEA = -1;
auto cEta = std::abs(candRef->eta());
for (int bIt = absEtaLowEdges_.size() - 1; bIt > -1; bIt--) {
if (cEta > absEtaLowEdges_.at(bIt)) {
iEA = bIt;
break;
}
}

we start with iEA = -1 and since cEta is exactly equal to 0, this condition if (cEta > absEtaLowEdges_.at(bIt)) is never verified and iEA stays -1 when hitting this:

sum = sum - rho * effectiveAreas_.at(iEA);

resulting in the bound-check failure for effectiveAreas_.

I tested that this simple change:

diff --git a/RecoEgamma/EgammaHLTProducers/plugins/HLTEcalPFClusterIsolationProducer.cc b/RecoEgamma/EgammaHLTProducers/plugins/HLTEcalPFClusterIsolationProducer.cc
index b4c4c8b9ff7..8c878698508 100644
--- a/RecoEgamma/EgammaHLTProducers/plugins/HLTEcalPFClusterIsolationProducer.cc
+++ b/RecoEgamma/EgammaHLTProducers/plugins/HLTEcalPFClusterIsolationProducer.cc
@@ -174,7 +174,7 @@ void HLTEcalPFClusterIsolationProducer<T1>::produce(edm::Event& iEvent, const ed
       int iEA = -1;
       auto cEta = std::abs(candRef->eta());
       for (int bIt = absEtaLowEdges_.size() - 1; bIt > -1; bIt--) {
-        if (cEta > absEtaLowEdges_.at(bIt)) {
+        if (cEta >= absEtaLowEdges_.at(bIt)) {
           iEA = bIt;
           break;
         }
diff --git a/RecoEgamma/EgammaHLTProducers/plugins/HLTHcalPFClusterIsolationProducer.cc b/RecoEgamma/EgammaHLTProducers/plugins/HLTHcalPFClusterIsolationProducer.cc
index 0b195f62b7a..b23763c10a3 100644
--- a/RecoEgamma/EgammaHLTProducers/plugins/HLTHcalPFClusterIsolationProducer.cc
+++ b/RecoEgamma/EgammaHLTProducers/plugins/HLTHcalPFClusterIsolationProducer.cc
@@ -200,7 +200,7 @@ void HLTHcalPFClusterIsolationProducer<T1>::produce(edm::StreamID sid,
       int iEA = -1;
       auto cEta = std::abs(candRef->eta());
       for (int bIt = absEtaLowEdges_.size() - 1; bIt > -1; bIt--) {
-        if (cEta > absEtaLowEdges_.at(bIt)) {
+        if (cEta >= absEtaLowEdges_.at(bIt)) {
           iEA = bIt;
           break;
         }

prevents the crash.

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

assign reconstruction

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

Incidentally I am wondering if we might shave off some (probably tiny) reconstruction time by using the [] operator instead that the slower variant .at() it the same classes.

@mandrenguyen
Copy link
Contributor

@mmusich The proposed solution makes sense. Would you like reco to make a PR or you'll take care of it?

@swagata87
Copy link
Contributor

So this happens because we have a edm::Refreco::RecoChargedCandidate with eta exactly equal to 0.

thanks for finding this, Marco. As egamma modules are written in similar fashion, similar unprotected codes are there as well.. so while fixing it maybe one can fix other such unprotected cases as well, if possible..

like for example

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

Would you like reco to make a PR or you'll take care of it?

feel free to make the PR (provided @cms-sw/egamma-pog-l2 agrees). I am wondering though - since these classes were not touched in a long time, if the triggering factor (candidate eta=0) is just a matter of chance or there is some other underlying problem upstream.

@RSalvatico
Copy link
Contributor

Incidentally I am wondering if we might shave off some (probably tiny) reconstruction time by using the [] operator instead that the slower variant .at() it the same classes.

Since we are checking the size of absEtaLowEdges one line above, switching to [] should be safe in my opinion.

@mandrenguyen
Copy link
Contributor

Would you like reco to make a PR or you'll take care of it?

feel free to make the PR (provided @cms-sw/egamma-pog-l2 agrees). I am wondering though - since these classes were not touched in a long time, if the triggering factor (candidate eta=0) is just a matter of chance or there is some other underlying problem upstream.

Ok, since (current and former) egamma experts are reacting in this thread, I'll leave it in their hands, unless someone asks me to do something :-)

@swagata87
Copy link
Contributor

swagata87 commented Apr 17, 2024

i mean since Marco have it set up already, may be he can do the PR.
Since its created a crash I'm assuming its urgent, so immediate fix can go in;
while

since these classes were not touched in a long time, if the triggering factor (candidate eta=0) is just a matter of chance or there is some other underlying problem upstream.

this needs some more thinking probably.. this is a good point

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

i mean since Marco have it set up already, may be he can do the PR.

OK, here's a minimal fix: #44764.

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

unassign reconstruction

  • I realize now that the code is actually only within HLT purview - sorry for the noise @mandrenguyen

@VinInn
Copy link
Contributor

VinInn commented Apr 18, 2024

A binary-search may even be faster and safer

@mmusich
Copy link
Contributor

mmusich commented May 7, 2024

+hlt

@cmsbuild
Copy link
Contributor

cmsbuild commented May 7, 2024

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants