Skip to article frontmatterSkip to article content

Histogram manipulations and fitting

Histogram libraries

Mainstream Python has libraries for filling histograms.

NumPy

NumPy, for instance, has an np.histogram function.

import skhep_testdata, uproot

tree = uproot.open(skhep_testdata.data_path("uproot-Zmumu.root"))["events"]

import numpy as np

np.histogram(tree["M"].array())
Loading...

Because of NumPy’s prominence, this 2-tuple of arrays (bin contents and edges) is a widely recognized histogram format, though it lacks many of the features high-energy physicists expect (under/overflow, axis labels, uncertainties, etc.).

Matplotlib

Matplotlib also has a plt.hist function.

import matplotlib.pyplot as plt

plt.hist(tree["M"].array());
<Figure size 640x480 with 1 Axes>

In addition to the same bin contents and edges as NumPy, Matplotlib includes a plottable graphic.

Boost-histogram and hist

The main feature that these functions lack (without some effort) is refillability. High-energy physicists usually want to fill histograms with more data than can fit in memory, which means setting bin intervals on an empty container and filling it in batches (sequentially or in parallel).

Boost-histogram is a library designed for that purpose. It is intended as an infrastructure component. You can explore its “low-level” functionality upon importing it:

import boost_histogram as bh

A more user-friendly layer (with plotting, for instance) is provided by a library called “hist.”

import hist

h = hist.Hist(hist.axis.Regular(120, 60, 120, name="mass"))

h.fill(tree["M"].array())

h.plot();
<Figure size 640x480 with 1 Axes>

Universal Histogram Indexing (UHI)

There is an attempt within Scikit-HEP to standardize what array-like slices mean for a histogram. (See documentation.)

Naturally, integer slices should select a range of bins,

h[10:110].plot();
<Figure size 640x480 with 1 Axes>

but often you want to select bins by coordinate value

# Explicit version
# h[hist.loc(90) :].plot();

# Short version
h[90j:].plot();
<Figure size 640x480 with 1 Axes>

or rebin by a factor,

# Explicit version
# h[:: hist.rebin(2)].plot();

# Short version
h[::2j].plot();
<Figure size 640x480 with 1 Axes>

or sum over a range.

# Explicit version
# h[hist.loc(80) : hist.loc(100) : sum]

# Short version
h[90j:100j:sum]
1102.0

Things get more interesting when a histogram has multiple dimensions.

import uproot
import hist
import awkward as ak

picodst = uproot.open(
    "https://pivarski-princeton.s3.amazonaws.com/pythia_ppZee_run17emb.picoDst.root:PicoDst"
)

vertexhist = hist.Hist(
    hist.axis.Regular(600, -1, 1, label="x"),
    hist.axis.Regular(600, -1, 1, label="y"),
    hist.axis.Regular(40, -200, 200, label="z"),
)

vertex_data = picodst.arrays(filter_name="*mPrimaryVertex[XYZ]")

vertexhist.fill(
    ak.flatten(vertex_data["Event.mPrimaryVertexX"]),
    ak.flatten(vertex_data["Event.mPrimaryVertexY"]),
    ak.flatten(vertex_data["Event.mPrimaryVertexZ"]),
)

vertexhist[:, :, sum].plot2d_full()
vertexhist[-0.25j:0.25j, -0.25j:0.25j, sum].plot2d_full()
vertexhist[sum, sum, :].plot()
vertexhist[-0.25j:0.25j:sum, -0.25j:0.25j:sum, :].plot();
<Figure size 640x480 with 6 Axes>

A histogram object can have more dimensions than you can reasonably visualize—you can slice, rebin, and project it into something visual later.

Fitting histograms

By directly writing a loss function in Minuit:

import numpy as np
import iminuit.cost

norm = len(h.axes[0].widths) / (h.axes[0].edges[-1] - h.axes[0].edges[0]) / h.sum()


def f(x, background, mu, gamma):
    return (
        background
        + (1 - background) * gamma**2 / ((x - mu) ** 2 + gamma**2) / np.pi / gamma
    )


loss = iminuit.cost.LeastSquares(
    h.axes[0].centers, h.values() * norm, np.sqrt(h.variances()) * norm, f
)
loss.mask = h.variances() > 0

minimizer = iminuit.Minuit(loss, background=0, mu=91, gamma=4)

minimizer.migrad()
minimizer.hesse()

(h * norm).plot()
plt.plot(loss.x, f(loss.x, *minimizer.values));
<Figure size 640x480 with 1 Axes>

Or through zfit, a Pythonic RooFit-like fitter:

import zfit

binned_data = zfit.data.BinnedData.from_hist(h)

binning = zfit.binned.RegularBinning(120, 60, 120, name="mass")
space = zfit.Space("mass", binning=binning)

background = zfit.Parameter("background", 0)
mu = zfit.Parameter("mu", 91)
gamma = zfit.Parameter("gamma", 4)
unbinned_model = zfit.pdf.SumPDF(
    [zfit.pdf.Uniform(60, 120, space), zfit.pdf.Cauchy(mu, gamma, space)], [background]
)

model = zfit.pdf.BinnedFromUnbinnedPDF(unbinned_model, space)
loss = zfit.loss.BinnedNLL(model, binned_data)

minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(loss)

binned_data.to_hist().plot(density=1)
model.to_hist().plot(density=1);
/home/runner/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/zfit/__init__.py:68: UserWarning: TensorFlow warnings are by default suppressed by zfit. In order to show them, set the environment variable ZFIT_DISABLE_TF_WARNINGS=0. In order to suppress the TensorFlow warnings AND this warning, set ZFIT_DISABLE_TF_WARNINGS=1.
  warnings.warn(
/home/runner/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/mplhep/error_estimation.py:49: RuntimeWarning: divide by zero encountered in divide
  counts = sumw / scale
<Figure size 640x480 with 1 Axes>