8: Likelihood inference

(note: in order to run this notebook, first run notebook 20, which creates the data, and notebook 30, which adds a BDT)

In most analyses, fitting a model to data is the crucial last step in the chain to extract information about the parameters of interest.

Model: First, this involves building a model that depends on various parameters: some are of immediate interest to us (the parameters of interest, POIs), while others help to describe the observables we expect in the data but carry no relevant physical meaning themselves, such as detector effects (nuisance parameters).

Loss: With the model and the data, a likelihood (or sometimes a \(\chi^2\) loss) is built. This is a function of the parameters and reflects the likelihood of finding the data given the parameters. The most likely parameter combination is obtained by maximising the likelihood; it is called the Maximum Likelihood (ML) estimator. Except for trivial cases, this maximisation has to be done numerically.
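As a toy illustration (not part of the analysis itself), numerically maximising a Gaussian likelihood with plain numpy recovers the well-known analytic ML estimator, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=3.1, scale=0.1, size=1000)  # toy "mass" values

def nll(mu, sigma=0.1):
    # negative log-likelihood of a Gaussian with fixed width
    # (constant terms dropped); maximising L = minimising the NLL
    return 0.5 * np.sum((data - mu) ** 2) / sigma ** 2

# crude numerical maximisation: scan mu on a fine grid
mus = np.linspace(3.0, 3.2, 2001)
mu_hat = mus[np.argmin([nll(m) for m in mus])]

# for a Gaussian, the ML estimator of mu is the sample mean
assert abs(mu_hat - data.mean()) < 1e-3
```

Real fits replace the grid scan with a proper minimizer, but the principle is the same.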

While this procedure returns the best-fit values, it does not say anything about the uncertainty of our measurement. For that, more advanced statistical methods are needed that perform toy studies: fits to generated samples under different conditions, such as fixing a parameter to a certain value.
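A minimal sketch of such a toy study, under the simplifying assumption of a Gaussian model with fixed width (where the ML fit of the mean reduces to taking the sample mean):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
mu_true, sigma, n = 3.1, 0.1, 500

# each "toy" is a fit to a freshly generated pseudo-sample; for a
# Gaussian with fixed width, the ML fit of mu is just the sample mean
fitted_mus = [rng.normal(mu_true, sigma, n).mean() for _ in range(2000)]

# the spread of the toy results estimates the statistical uncertainty
# of the measurement; here it is known analytically: sigma / sqrt(n)
toy_uncertainty = np.std(fitted_mus)
```

With a full likelihood model, each toy would be a complete fit instead of a mean, but the logic of "generate, fit, look at the distribution of results" is identical.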

Scope of this tutorial

Here, we focus on the libraries that are available to perform likelihood fits. For unbinned fits in Python, we will consider zfit (a likelihood model fitting library) and hepstats (higher-level statistical inference; can use zfit models), both of which are relatively young and written in pure Python.

An alternative worth mentioning is RooFit together with RooStats, an older, well-proven and reliable C++ framework that has Python bindings and acts as the standard in HEP for this kind of fit.

If your analysis involves purely binned, template-based fits, pyhf provides a library specialized for exactly that.

Getting started

In order to get started with zfit, it is recommended to have a look at the zfit-tutorials or the recorded tutorials, which also contain hepstats tutorials; more on the latter can be found here.

Before we continue, it is highly recommended to follow the zfit introduction tutorial.

%store -r bkg_df
%store -r mc_df
%store -r data_df
import hepstats
import matplotlib.pyplot as plt
import mplhep
import numpy as np
import zfit
# apply cuts
query = 'BDT > 0.4'
data_df.query(query, inplace=True)
mc_df.query(query, inplace=True)

# reduce the data size for this example to make the fit more interesting
fraction = 0.1  # fraction of the original data to keep
data_df = data_df.sample(frac=fraction)

Exercise: write a fit to the data using zfit

obs = zfit.Space('Jpsi_M', limits=(2.8, 3.5))  # defining the observable
# bkg = zfit.Data.from_pandas(bkg_df['Jpsi_M'], obs=obs)
# OR
# obs_bkg = zfit.Space('Jpsi_M', limits=(2.8, 3.0)) + zfit.Space('Jpsi_M', limits=(3.2, 3.5))
# bkg_two = zfit.Data.from_pandas(data_df['Jpsi_M'], obs=obs_bkg)
mc = zfit.Data.from_pandas(mc_df['Jpsi_M'], obs=obs)
data = zfit.Data.from_pandas(data_df['Jpsi_M'], obs=obs)

Difference of the two spaces

While the first space is defined over the whole range from 2.8 to 3.5, the second consists of two distinct regions. If we load the data with the second space, zfit applies the corresponding cut for us, the same one we applied manually to bkg_df before.

The difference comes when using the normalization in the PDF: we can either normalize it over the whole range or only over part of it.
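To make this concrete, here is a numerical sketch in plain numpy (with an assumed slope value) of how the same exponential shape picks up different normalisation constants depending on the range it is normalised over:

```python
import numpy as np

lam = -0.9  # assumed slope, similar in size to the fitted lambda

def integral(low, high):
    # numerical integral of the unnormalised exponential shape
    xs = np.linspace(low, high, 10001)
    return np.trapz(np.exp(lam * xs), xs)

# normalisation constant over the full space (2.8, 3.5) ...
norm_full = integral(2.8, 3.5)
# ... versus over the two sideband regions (2.8, 3.0) + (3.2, 3.5) only
norm_sidebands = integral(2.8, 3.0) + integral(3.2, 3.5)

# the sideband-only constant is smaller, so the pdf normalised over the
# sidebands takes larger values there than the full-range pdf does
assert norm_sidebands < norm_full
```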

lambd = zfit.Parameter('lambda', -0.1, -2, 2)
bkg_yield = zfit.Parameter('bkg_yield', 5000, 0, 200000, step_size=1)

mu = zfit.Parameter('mu', 3.1, 2.9, 3.3)
sigma = zfit.Parameter('sigma', 0.1, 0, 0.5)
sig_yield = zfit.Parameter('sig_yield', 200, 0, 10000, step_size=1)
bkg_pdf = zfit.pdf.Exponential(lambd, obs=obs)
bkg_pdf.set_yield(bkg_yield)  # attaching a yield makes the PDF extended
sig_pdf = zfit.pdf.Gauss(obs=obs, mu=mu, sigma=sigma)
sig_pdf.set_yield(sig_yield)
model = zfit.pdf.SumPDF([bkg_pdf, sig_pdf])


Plots can simply be made with matplotlib and mplhep.

def plot_fit(model, data, nbins=30, ax=None):
    # The function will be reused.
    if ax is None:
        ax = plt.gca()

    lower, upper = data.space.limit1d

    # Create a histogram of the data and plot it with mplhep.
    counts, bin_edges = np.histogram(data.unstack_x(), bins=nbins)
    mplhep.histplot(counts, bins=bin_edges, histtype="errorbar", yerr=True,
                    label="Data", ax=ax, color="black")

    binwidth = np.diff(bin_edges)[0]
    x = np.linspace(lower, upper, num=1000)  # or tf.linspace

    # Line plots of the total pdf and the sub-pdfs.
    y = model.ext_pdf(x) * binwidth
    ax.plot(x, y, label="total", color="royalblue")
    for m, l, c in zip(model.get_models(), ["background", "signal"], ["forestgreen", "crimson"]):
        ym = m.ext_pdf(x) * binwidth
        ax.plot(x, ym, label=l, color=c)

    ax.set_xlabel('$J/\\psi$ mass [GeV]')
    ax.set_xlim(lower, upper)

    return ax
plot_fit(model, data)  # before the fit


Now that we have the models and the datasets, we can, for example, pre-fit the signal PDF to the simulation.

sig_nll = zfit.loss.UnbinnedNLL(sig_pdf, mc)
/usr/share/miniconda/envs/analysis-essentials/lib/python3.8/site-packages/zfit/core/loss.py:533: AdvancedFeatureWarning: Either you're using an advanced feature OR causing unwanted behavior. To turn this warning off, use `zfit.settings.advanced_warnings['extended_in_UnbinnedNLL']` = False`  or 'all' (use with care) with `zfit.settings.advanced_warnings['all'] = False
Extended PDFs are given to a normal UnbinnedNLL. This won't take the yield into account and simply treat the PDFs as non-extended PDFs. To create an extended NLL, use the `ExtendedUnbinnedNLL`.
  warn_advanced_feature("Extended PDFs are given to a normal UnbinnedNLL. This won't take the yield "

It warns us that we are using a non-extended loss: the extended loss also fits the yield, while the normal one does not. Since we only want to fit the shape here, the non-extended one is what we want.
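The difference between the two losses can be sketched in plain numpy: up to a constant \(\log(n!)\), the extended NLL adds a Poisson term in the total yield to the shape-only NLL (the variable names below are illustrative, not zfit API):

```python
import numpy as np

def nll_shape(pdf_vals):
    # non-extended unbinned NLL: only the pdf shape enters
    return -np.sum(np.log(pdf_vals))

def nll_extended(pdf_vals, nu):
    # extended NLL: adds the Poisson yield term nu - n*log(nu)
    # (dropping the constant log(n!)), so the yield nu is fitted too
    n = len(pdf_vals)
    return nll_shape(pdf_vals) + nu - n * np.log(nu)

# the extended NLL is minimal when the yield equals the observed count
pdf_vals = np.full(100, 0.5)  # 100 events with some pdf value each
best_nu = min(range(50, 151), key=lambda nu: nll_extended(pdf_vals, nu))
assert best_nu == 100
```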

minimizer = zfit.minimize.Minuit()
# minimizer = zfit.minimize.NLoptLBFGSV1()  # can be changed, but maybe not as powerful as iminuit
# minimizer = zfit.minimize.ScipySLSQPV1()
result_mc = minimizer.minimize(sig_nll)
result_mc
FitResult of
<UnbinnedNLL model=[<zfit.<class 'zfit.models.dist_tfp.Gauss'>  params=[mu, sigma]] data=[<zfit.core.data.Data object at 0x7f214b887ac0>] constraints=[]>
<Minuit Minuit tol=0.001>

│ valid   │ converged   │ param at limit   │ edm     │ min value   │
│ True    │ True        │ False            │ 1.3e-07 │ -1.382e+05  │

name      value    at limit
------  -------  ----------
mu        3.097       False
sigma   0.01503       False

Fixing parameters

Sometimes we want to fix parameters to the values obtained from MC, such as tail parameters. Here we will fix sigma, just for demonstration purposes.

sigma.floating = False
nll = zfit.loss.ExtendedUnbinnedNLL(model, data)
result = minimizer.minimize(nll)
FitResult of
<ExtendedUnbinnedNLL model=[<zfit.<class 'zfit.models.functor.SumPDF'>  params=[Composed_autoparam_1, Composed_autoparam_2]] data=[<zfit.core.data.Data object at 0x7f219beb86a0>] constraints=[]>
<Minuit Minuit tol=0.001>

│ valid   │ converged   │ param at limit   │ edm     │ min value   │
│ True    │ True        │ False            │ 3.6e-06 │ 845.7       │

name         value    at limit
---------  -------  ----------
bkg_yield     5961       False
sig_yield    154.9       False
lambda     -0.9325       False
mu           3.098       False
result.hesse()  # calculate hessian error
result.errors()  # profile, using minos like uncertainty
FitResult of
<ExtendedUnbinnedNLL model=[<zfit.<class 'zfit.models.functor.SumPDF'>  params=[Composed_autoparam_1, Composed_autoparam_2]] data=[<zfit.core.data.Data object at 0x7f219beb86a0>] constraints=[]>
<Minuit Minuit tol=0.001>

│ valid   │ converged   │ param at limit   │ edm     │ min value   │
│ True    │ True        │ False            │ 3.6e-06 │ 845.7       │

name         value    minuit_hesse         minuit_minos    at limit
---------  -------  --------------  -------------------  ----------
bkg_yield     5961     +/-      81  -     80   +     81       False
sig_yield    154.9     +/-      26  -     26   +     26       False
lambda     -0.9325     +/-   0.065  -  0.065   +  0.065       False
mu           3.098     +/-  0.0033  - 0.0034   + 0.0033       False
plot_fit(model, data)

Exercise: use hepstats to check the significance of our signal

from hepstats.hypotests.calculators import AsymptoticCalculator
from hepstats.hypotests.parameters import POI

# the null hypothesis
sig_yield_poi = POI(sig_yield, 0)
calculator = AsymptoticCalculator(input=result, minimizer=minimizer)

There is another calculator in hepstats called FrequentistCalculator which constructs the test statistic distribution \(f(q_{0} |H_{0})\) with pseudo-experiments (toys), but it takes more time.

The Discovery class is a high-level class that takes a calculator and a POI instance representing the null hypothesis as input. It asks the calculator to compute the p-value and derives the significance as

\begin{equation} Z = \Phi^{-1}(1 - p_0). \end{equation}
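This conversion is a one-liner with scipy.stats (a sketch, independent of hepstats):

```python
from scipy.stats import norm

def significance(p0):
    # Z = Phi^-1(1 - p0); norm.isf (inverse survival function)
    # evaluates this accurately even for very small p-values
    return norm.isf(p0)

# the p-value reported below corresponds to a significance of about 6.5
z = significance(4.012390419916301e-11)
```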

from hepstats.hypotests import Discovery

discovery = Discovery(calculator=calculator, poinull=sig_yield_poi)
discovery.result()

p_value for the Null hypothesis = 4.012390419916301e-11
Significance (in units of sigma) = 6.50013529406255
(4.012390419916301e-11, 6.50013529406255)

Exercise: play around! First things first: repeat the fit. The differences you will see are statistical fluctuations from the resampling of the data; we take only a random fraction.

Change the fraction of data that we have, the BDT cut, the signal model…

Attention: it is easy to end up with a significance of inf here!
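The reason, in terms of the asymptotic formula above: once the signal is overwhelming, the p-value underflows to exactly 0 in double precision, and \(\Phi^{-1}(1 - 0)\) is infinite. A quick scipy check:

```python
import math
from scipy.stats import norm

# p-values below roughly 1e-308 underflow to 0.0 in double precision;
# the corresponding significance Phi^-1(1 - 0) is then +infinity
assert math.isinf(norm.isf(0.0))
```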