Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vecgeom] Disable aggressive compiler optimizations #7992

Merged
merged 1 commit into from
Jul 19, 2022

Conversation

smuzaffar
Copy link
Contributor

as suggested by @hahnjo #7987 (comment)

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_12_5_X/master.

@cmsbuild, @smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @qliphy, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@smuzaffar
Copy link
Contributor Author

please test

@smuzaffar
Copy link
Contributor Author

assign simulation

@cmsbuild
Copy link
Contributor

New categories assigned: simulation

@mdhildreth,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a29b8e/26132/summary.html
COMMIT: c088ce8
CMSSW: CMSSW_12_5_X_2022-07-10-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7992/26132/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a29b8e/26132/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a29b8e/26132/git-merge-result

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 55557 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3655970
  • DQMHistoTests: Total failures: 305157
  • DQMHistoTests: Total nulls: 150
  • DQMHistoTests: Total successes: 3350641
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.020999999999999963 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): -0.352 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 11834.0 ): 0.464 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 250202.181 ): -0.117 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): -0.012 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: found differences in 14 / 49 workflows

@smuzaffar
Copy link
Contributor Author

please test

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a29b8e/26157/summary.html
COMMIT: c088ce8
CMSSW: CMSSW_12_5_X_2022-07-11-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7992/26157/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test runtestRecoEgammaElectronIdentification had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 55553 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3655970
  • DQMHistoTests: Total failures: 305155
  • DQMHistoTests: Total nulls: 149
  • DQMHistoTests: Total successes: 3350644
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.016999999999999963 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): -0.352 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 11834.0 ): 0.464 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 250202.181 ): -0.117 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): -0.012 KiB SiStrip/MechanicalView
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: found differences in 14 / 49 workflows

@smuzaffar
Copy link
Contributor Author

assign simulation
@civanch , building vecgeom with -O2 caused many comparison differences. Are these expected?

@civanch
Copy link
Contributor

civanch commented Jul 12, 2022

@smuzaffar , differences are uniformly distributed over WFs. All what I see is 0 entries to some histogram versus 1 entry. Differences are in tracker, in HLT, and in muons (am not sure). Much less in EGamma. My interpretation is that it is MC history difference - we have different MC events. The difference may be in a very local code but not easy to find where.

In such cases, we use to ask for PdmV validation - it is similar to compiler change. From common sense point of view -O2 should provide correct results but I afraid, we cannot change the master without PdmV limited test. In other IBs we may switch without a problem.

What I do not understand @hahnjo : are you sure that we should compile VecGeom with -O2 and keep Geant4 and DD4hep with -O3?

@civanch
Copy link
Contributor

civanch commented Jul 13, 2022

My previous comment was not precise. The number of differences in the case of sim history change should be much more. What we see likely is not history change - events are nearly the same but in some cases selection of data for histograms is different.

@hahnjo
Copy link
Contributor

hahnjo commented Jul 13, 2022

@smuzaffar @civanch I think the fact that there are differences shows how ubiquitous and critical this problem is. In theory, I would expect identical results when switching between -O2 and -O3, and I get this in practice on my system when compiling Geant4 and VecGeom with gcc8. However, because gcc10 at -O3 miscompiles VecGeom we get different results when switching to -O2 because it's actually doing different work.

are you sure that we should compile VecGeom with -O2 and keep Geant4 and DD4hep with -O3?

This should be fine, yes.

@fwyzard
Copy link
Contributor

fwyzard commented Jul 13, 2022

Out of curiosity, did you check which of the optimisation flags enabled by -O3 causes problems ?

According to the GCC 10 manual,

-O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:

-fgcse-after-reload 
-fipa-cp-clone
-floop-interchange 
-floop-unroll-and-jam 
-fpeel-loops 
-fpredictive-commoning 
-fsplit-loops 
-fsplit-paths 
-ftree-loop-distribution 
-ftree-loop-vectorize 
-ftree-partial-pre 
-ftree-slp-vectorize 
-funswitch-loops 
-fvect-cost-model 
-fvect-cost-model=dynamic 
-fversion-loops-for-strides

Also, does GCC 10.4 fixes the problem ?

@hahnjo
Copy link
Contributor

hahnjo commented Jul 13, 2022

Just to prove the point, I've now built Geant4 and VecGeom 1.1.18 (to avoid the crash with the change introduced with 1.1.19) and now I see slight differences between -O2 and -O3. So this does actually point to the miscompilation being solved.

@fwyzard

Out of curiosity, did you check which of the optimisation flags enabled by -O3 causes problems ?

No, I didn't so far...

Also, does GCC 10.4 fixes the problem ?

No, both vanilla GCC 10.2 and 10.4 show the problem in VecGeom, as does 10.3 used in CMSSW. GCC versions 9 and 11 are fine though...

@hahnjo
Copy link
Contributor

hahnjo commented Jul 13, 2022

@fwyzard

Out of curiosity, did you check which of the optimisation flags enabled by -O3 causes problems ?

According to the GCC 10 manual,

-O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:

-fgcse-after-reload 
-fipa-cp-clone
-floop-interchange 
-floop-unroll-and-jam 
-fpeel-loops 
-fpredictive-commoning 
-fsplit-loops 
-fsplit-paths 
-ftree-loop-distribution 
-ftree-loop-vectorize 
-ftree-partial-pre 
-ftree-slp-vectorize 
-funswitch-loops 
-fvect-cost-model 
-fvect-cost-model=dynamic 
-fversion-loops-for-strides

FWIW it's got something to do with -fsplit-loops, switching to -O3 -fno-split-loops is enough to make my very simplified example give identical results to -O2. But it's not working in isolation, ie -O2 -fsplit-loops does not (yet) expose the problem. It's likely some weird interaction between optimization passes... I still think that switching to plain -O2 is the best action right now.

@perrotta
Copy link
Contributor

+1

  • Merge on top of the supposedly already produced CMSSW_12_5_X_2022-07-19-"afternoon" IB, so that it can be compared with the 2300 IB for a quick validation, as suggested by @civanch

@perrotta
Copy link
Contributor

@aandvalenzuela

@aandvalenzuela
Copy link
Contributor

aandvalenzuela commented Jul 19, 2022

@perrotta Can you see the build in the IB page? Is this what you needed @civanch?

@perrotta
Copy link
Contributor

@aandvalenzuela now I see it, thank you: https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/ib/CMSSW_12_5_X

For some reason, there are two PRs that wern't picked in that IB, even if they were both merged before the ORP meeting:
vecgeom

However

  • #38775 includes some protection against corrupted CSC TPs when CCLUT data
  • #38777 removes some unneeded fwk classes

Therefore, I think that next CMSSW_12_5_X_2022-07-19-2300 plus this PR and the two above can be safely compared with CMSSW_12_5_G4VECGEOM_X_2022-07-19-1800 for what simulation is concerned

Let merge this one, then!

@perrotta
Copy link
Contributor

merge

@cmsbuild cmsbuild merged commit 848749c into IB/CMSSW_12_5_X/master Jul 19, 2022
@perrotta
Copy link
Contributor

Ah, wait: but that IB was in the VECGEOM stream....
Not clear how CMSSW_12_5_G4VECGEOM_X_2022-07-19-1800 compares with CMSSW_12_5_X_2022-07-19-1100.

Uhm... now this is merged. Maybe we will revert this one after 2300 IB, and tomorrow morning 1100 IB will be the one with "-O3" to compare with. I cannot follow any more by this evening, sorry

@civanch
Copy link
Contributor

civanch commented Jul 19, 2022

@perrotta , @bsunanda, CMSSW_12_5_G4VECGEOM_X_2022-07-19-1800 cannot be used for this purpose, because it is build on top of new Geant4, DD4hep, and VecGeom. I think, we should compare CMSSW_12_5_X_2022-07-19-2300 versus CMSSW_12_5_X_2022-07-19-1100

@perrotta
Copy link
Contributor

@perrotta , @bsunanda, CMSSW_12_5_G4VECGEOM_X_2022-07-19-1800 cannot be used for this purpose, because it is build on top of new Geant4, DD4hep, and VecGeom. I think, we should compare CMSSW_12_5_X_2022-07-19-2300 versus CMSSW_12_5_X_2022-07-19-1100

Indeed... Sorry, there was some confusion yesterday evening, and then I could not follow any more.
I have created a revert of this PR in #7998 that can be merged for CMSSW_12_5_X_2022-07-20-1100. Therefore the validation can compare:

  • CMSSW_12_5_X_2022-07-19-2300 with CXX_FLAGS_RELEASE="-O2 -DNDEBUG"
  • CMSSW_12_5_X_2022-07-20-1100 without that flag set, i.e. "-O3"

As soon as the 1100 IB will become available, we will re-revert the PR, in order to have the "non-aggressive" compiler flag for the remaining of the 12_5_X cycle.

@smuzaffar smuzaffar deleted the smuzaffar-patch-5 branch August 22, 2022 14:55
@hahnjo
Copy link
Contributor

hahnjo commented Dec 7, 2022

FWIW the problem resurfaced in #8211 with the next version 1.2.1 of VecGeom and gcc11. After some investigations, I opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008, let's see what the GCC people think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants