This is a repo for testing the simplified likelihood using third moments, as outlined in the paper https://arxiv.org/abs/1809.05548
The test is based on the datacards of the EFT tutorial.
The idea is to have the fit done with combine, and then compare the likelihood scan with the simplified likelihood approach. The logical flow is as follows:
- Run the fit with combine
- Take N replicas of the dataset using the bootstrapping technique
- Compute the first three statistical moments from the replicas
- Compute the likelihood scan with the simplified likelihood approach
The starting point is to write the test statistic
$$
q(\mathbf{x}) = -2\Delta\text{NLL} = -2\ln\frac{L(\mathbf{x})}{L(\mathbf{\hat{x}})},
$$
where $L$ is the likelihood and $\mathbf{\hat{x}}$ the point that maximizes it.
Suppose we have our measurement $\mathbf{\hat{x}} = (\hat{x}_1, \dots, \hat{x}_N)$, where the $x_I$ are our variables with a Normal distribution. One fact to keep in mind is that a Normal distribution has a vanishing third moment, so any skewness observed in the measurement is exactly the non-Gaussian information that the third-moment approach is designed to retain. Given the first three moments $m_1$, $m_2$, $m_3$, the parameters $a, b, c$ and the covariance matrix $\rho$ of the simplified likelihood are fixed (see the relations below). Using the replicas from the bootstrapping technique, I can compute a series of estimates of the parameters of interest, and from them the moments $m_1$, $m_2$, $m_3$. In our specific case, the simplified likelihood can be written as:
$$
-2\Delta\text{NLL} = (\mathbf{\hat{x}}-\mathbf{x})^T \rho^{-1} (\mathbf{\hat{x}}-\mathbf{x})
$$
where $\mathbf{x} = (r_\text{high}, r_\text{low})$, $\mathbf{\hat{x}}$ is the best-fit point, and $\rho$ is the covariance matrix estimated from the bootstrap replicas.
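For reference, assuming the quadratic parametrization of the paper, $x_I = a_I + b_I\,\theta_I + c_I\,\theta_I^2$ with $\theta_I$ standard Normal, the first three moments of each $x_I$ follow by direct computation:

$$
m_{1,I} = a_I + c_I , \qquad
m_{2,II} = b_I^2 + 2c_I^2 , \qquad
m_{3,I} = 6\,b_I^2 c_I + 8\,c_I^3 .
$$

Inverting these three relations component by component is how the moments measured on the replicas translate into the parameters $a, b, c$.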
This repo should be run inside Combine v10:
```
cmsrel CMSSW_14_1_0_pre4
cd CMSSW_14_1_0_pre4/src
cmsenv
git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit
cd HiggsAnalysis/CombinedLimit
git fetch origin
git checkout v10.0.2
scramv1 b clean; scramv1 b
```
After cloning the repo:
```
mkdir ws
```
You should also change the paths in `ws_combine/condor_fit.sh`, at lines 6 and 7.
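What those two lines contain is machine-specific; a purely hypothetical example (names and paths are placeholders, adapt them to your setup):

```
# lines 6-7 of ws_combine/condor_fit.sh (hypothetical content)
CMSSW_AREA=/path/to/CMSSW_14_1_0_pre4/src
OUTDIR=/path/to/this/repo/ws
```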
The inputs are stored in the `inputs` folder:
- `htt_tt.input.root`
- `htt_tt_125_8TeV.txt`
The only difference in the datacard, compared to the one used in the EFT tutorial, is that the numbers in the observation line are replaced by -1. This is done because the number of events is different in every bootstrap replica; with -1, combine reads the observed yields directly from the input shapes, which makes the following steps easier.
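For illustration, the observation block of the datacard then looks like this (three categories; the surrounding datacard lines are unchanged with respect to the tutorial):

```
bin          htt_tt_0_8TeV  htt_tt_1_8TeV  htt_tt_2_8TeV
observation  -1             -1             -1
```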
```
text2workspace.py inputs/htt_tt_125_8TeV.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=htt_tt_0_8TeV/ggH:r_high[1,-5,10]' --PO 'map=htt_tt_1_8TeV/ggH:r_low[1,-5,10]' --PO 'map=htt_tt_2_8TeV/ggH:r_low[1,-5,10]' -m 125 -o ws_mu_ggH.root
combine -M MultiDimFit --algo grid -d ws_mu_ggH.root -P r_high -n .high -m 125
combine -M MultiDimFit --algo grid -d ws_mu_ggH.root -P r_low -n .low -m 125
plot1DScan.py higgsCombine.high.MultiDimFit.mH125.root --POI r_high -o scan_ggH_high
plot1DScan.py higgsCombine.low.MultiDimFit.mH125.root --POI r_low -o scan_ggH_low
```
I scale the most populated category by `r_high` and the other two by `r_low`.
This represents our reference. The simplified likelihood should be able to reproduce this result.
The bootstrap replicas are extracted with the following ROOT macro:
```
root -l 'bootstrapping.C(1000)'
```
I tried to use PyROOT, but it runs into issues when opening many files; the old C++ macro works far more reliably.
The argument represents the number of bootstrap samples to be extracted.
The bootstrap is done by sampling the weight to associate to each event: for every event, one integer weight per replica is drawn from a Poisson distribution with mean 1.
Tricky part: the dataset is stored as a histogram. Thus, I decided to split the content of each bin into a set of unit-weight events, and then sample the weights to associate to each event, as sketched below.
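The following is a minimal numpy sketch of this procedure, not the actual `bootstrapping.C` code (function and variable names here are illustrative). It exploits the fact that the sum of $k$ independent Poisson(1) weights is itself a single Poisson($k$) draw, so the per-bin replica content can be sampled in one go:

```python
import numpy as np

def poisson_bootstrap(bin_contents, n_replicas, seed=0):
    """Poisson bootstrap of a binned dataset.

    Conceptually, each bin with content k is split into k unit-weight
    events and every event gets a Poisson(1) integer weight per replica;
    summing k Poisson(1) weights is equivalent to one Poisson(k) draw.
    """
    rng = np.random.default_rng(seed)
    replicas = np.empty((n_replicas, len(bin_contents)), dtype=np.int64)
    for i, k in enumerate(bin_contents):
        replicas[:, i] = rng.poisson(lam=k, size=n_replicas)
    return replicas

# Toy 5-bin histogram, 1000 replicas; each row is one replica histogram
print(poisson_bootstrap([12, 30, 7, 3, 1], 1000)[:3])
```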
[TO THINK ABOUT IT] The number of events in this dataset is not very large, especially in category 2. Is this an issue? Are my replicas really representative of the original dataset?
The fits are submitted to HTCondor:
```
cd ws_combine
condor_submit condor_fit.sub
```
The number of fits to perform is specified in the `queue` line of the `condor_fit.sub` file (see the hypothetical skeleton below). Of course, this number should be less than or equal to the number of bootstrap replicas generated before.
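The submit file itself is not reproduced here; a hypothetical minimal skeleton (the real executable, argument and log names may differ) would be:

```
executable = condor_fit.sh
arguments  = $(ProcId)
output     = fit_$(ProcId).out
error      = fit_$(ProcId).err
log        = fit.log
queue 500
```

where `queue 500` runs 500 fits, each job presumably picking one replica via its `$(ProcId)`.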
[ISSUE] Often, one of the fits fails and the job goes to HOLD. At the moment, I simply ignore the cases where this happens (implemented in the script used in the following step).
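To inspect the held jobs by hand, the standard HTCondor commands apply:

```
condor_q -hold          # list held jobs together with the hold reason
condor_release <JobId>  # put a held job back in the queue after fixing the cause
```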
```
python3 extract_values.py
```
This script does the magic:
- It extracts the values of `r_high` and `r_low` from the fits on the bootstrap replicas
- It computes the three moments of `r_high` and `r_low`
- It computes the parameters $a, b, c$ and the covariance matrix $\rho$
- It computes the simplified likelihood function $-2\Delta\text{NLL} = (\mathbf{\hat{x}}-\mathbf{x})^T \rho^{-1} (\mathbf{\hat{x}}-\mathbf{x})$ (a numerical sketch is given below)
- It compares the simplified likelihood with the combine fit
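For concreteness, here is a minimal numpy sketch of these steps; the array layout and the file name `bootstrap_fits.txt` are assumptions, not what `extract_values.py` actually uses:

```python
import numpy as np

# Assumed: one row per replica with columns (r_high, r_low)
fits = np.loadtxt("bootstrap_fits.txt")

m1 = fits.mean(axis=0)                # first moments (means)
m2 = np.cov(fits, rowvar=False)       # second moments (covariance matrix)
m3 = ((fits - m1) ** 3).mean(axis=0)  # diagonal third central moments,
                                      # input to the a, b, c parameters

def simplified_nll(x, xhat, cov):
    """Gaussian simplified likelihood: -2*DeltaNLL as a quadratic form."""
    d = xhat - x
    return d @ np.linalg.inv(cov) @ d

# 1D scan of r_high with r_low fixed to its mean, to overlay on the
# combine scan produced by plot1DScan.py
for r in np.linspace(-1.0, 3.0, 41):
    print(r, simplified_nll(np.array([r, m1[1]]), m1, m2))
```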