Introduction

Overview

Teaching: 10 min
Exercises: 0 min

Questions

Why Python?

What is Uproot?

Why should I use it?

Objectives

Explain some advantages of Python and Uproot.

Python and HEP

For quite a while now, ROOT has been the main software used by particle physicists for data analysis. ROOT is written in C++, and at least until recently most analysis code has also been written in C++. Outside of HEP, Python is a more commonly used language for data analysis, which means that there are many popular and well-supported tools that can be utilized for common tasks. Python is also generally considered easier to learn than C++.

Uproot

Uproot is a Python module that can read and write files in the ROOT format. It’s part of the Scikit-HEP project, a collection of Python tools for HEP data analysis. Uproot isn’t the only way to read/write ROOT files in Python; there are other packages that allow you to do this. However, there are some significant advantages of Uproot. We’ll go through a few here.

Minimal dependencies

The only software required by Uproot (other than Python itself, of course) is NumPy. NumPy is by far the most popular Python package for handling arrays of data. It’s included in the Anaconda distribution by default, and in general most people using Python for any kind of data analysis will probably already have it installed. Awkward Array and Vector are not required but highly recommended for use with Uproot. Uproot, Awkward Array, and Vector are all part of the Scikit-HEP project. All of the above are Python modules, which can be installed with nothing more than pip or conda.

Most importantly, Uproot does not require any part of the ROOT software, which is quite large and can be non-trivial to install properly.

Speed

Uproot was designed for efficient reading of files. It uses NumPy for operations on data, which calls compiled vectorized functions to perform very fast calculations on arrays.

When you import a Python package that uses ROOT, there can be a delay of several seconds while all of the required libraries are loaded. This is required every time you restart the Python interpreter or run a script. By virtue of not using ROOT itself, Uproot doesn’t have this issue.

Jagged arrays

One issue with manipulating HEP data in Python is that it is often not rectangular. That is, the length of object collections is not constant. For example, different events will have different numbers of jets, and different jets will have different numbers of associated tracks. Such arrays are called jagged arrays. Packages like NumPy are only designed to work with rectangular arrays, where each dimension has a fixed length throughout the entire array.

Jagged vs. rectangular arrays

The name “jagged array” comes from the idea that if you have a 2D array in which there’s a different number of entries in each row and you line rows up by their first element, then the right edge of the array will be “jagged” as opposed to straight:

Rectangular array:
[[0, 1],
 [2, 3],
 [4, 5],
 [6, 7],
 [8, 9]]
Jagged array:
[[0, 1],
 [2, 3, 4],
 [5],
 [6, 7],
 [8, 9]]

Awkward Array is a package used to deal with jagged arrays so that they can be manipulated in ways almost identical to standard rectangular NumPy arrays. Other Python packages for reading ROOT files usually do not support jagged arrays, which means that you cannot do operations on multiple events at once if you have jagged data. But Uproot and Awkward Array can, and this is generally much faster than operating on one event at a time. We’ll come back to this topic in the “Columnar Analysis” section.

Key Points

Uproot is a fast and lightweight Python package for reading and writing ROOT files.

Uproot is designed to work efficiently with jagged arrays of data.

Opening files

Overview

Teaching: 20 min
Exercises: 0 min

Questions

How do I open a ROOT file with Uproot?

How can I tell what’s in the file?

How do I access the contents?

Objectives

Open a ROOT file and list its contents.

Access objects within a ROOT file.

Example motivation

In order to learn how to use Uproot, we’ll try to do a very short and simple analysis to look for resonances in a dimuon events. The example ROOT file is from real CMS data during proton-proton collisions in 2012. Make sure you’ve downloaded it from the Setup page and put it in your working directory. Our first goal is just to get to the data within Uproot, so we need to open the file and navigate to the muon information.

Opening a file

The first thing you must do whenever you want to use Uproot is import it, just like any other Python module:

import uproot

Import errors?

If you get something like:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'uproot'
then this means Uproot hasn’t been installed properly. Check the Setup page for detailed instructions. If you’re using conda (like Anaconda or Miniconda), make sure you’ve activated the same environment where you installed Uproot.

Now open the ROOT file and assign it to a variable (which I’ve named file here):

file = uproot.open('uproot-tutorial-file.root')

If you inspect file, you can see that it’s a ReadOnlyDirectory object:

file

<ReadOnlyDirectory '/' at 0x(some hexadecimal number here)>

File contents

Just like any other kind of directory, you can list the contents (of the file). The name of each item in the file is called a “key”.

file.keys()

['Events;1']

We can see that there is one key: “Events”. This doesn’t tell us what kind of object it refers to, though. ROOT files can contain many different types of objects, including subdirectories. The following function provides a way to inspect the types of each item:

file.classnames()

{'Events;1': 'TTree'}

The output contains pairs of the form name: type. Therefore the key Events refers to a TTree object. This is where all the data in this file is stored.

Why the ;1?

You may be wondering why there’s a ;1 after “Events”. This notation refers to the cycle number, which is a detail of the ROOT file format that we don’t need to care about. This ;1 shows up if you open the file in ROOT itself as well.

Accessing contents

Now we want to actually access the object inside the file. You can do this just as you would to get an item in an array:

file['Events']

<TTree 'Events' (6 branches) at 0x(hexadecimal number)>

This expression refers to the actual TTree object, which we will look at next.

Key Points

Opening and navigating files in Uproot isn’t that different from doing so in ROOT.

Use .keys() to see the contents of a file.

Use the form file['key'] to access an object inside a file.

Trees, Branches, and Events

Overview

Teaching: 20 min
Exercises: 5 min

Questions

How do I access a TTree?

How can I tell what branches are in a TTree?

How do I read the data from a TTree?

Objectives

List the branches in a tree.

Access the branches in a tree.

Create a table from tree branches.

Access data for a particular event.

Trees

Trees in ROOT are basically just tables of information. Trees are composed of branches, which are the columns of the table. The rows usually represent events (individual bunch crossings).

First we assign the tree to a variable (named tree here)

tree = file['Events']

In order to find out what information is in the tree, we need to know what the branches (columns) are. The term key is used (again) here to refer to the names of the branches.

tree.keys()

['nMuon', 'Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_mass', 'Muon_charge']

The above output is a list of the branch names. So we can see that for each event, we will have the number of muons in the event (nMuon) and the pT, eta, phi, mass, and charge of each muon.

But how do we get the actual data from the table? There are several ways with Uproot, but the simplest is with the arrays() function:

tree.arrays()

<Array [{nMuon: 2, Muon_pt: [10.8, ... -1, 1]}] type='100000 * {"nMuon": uint32,...'>

You can see some numbers in there, which indeed are from the data in the tree.

Branches

Now we assign this object (which contains both the names and contents of the branches) to another variable (branches):

branches = tree.arrays()

Next let’s just look at each branch individually. You can access a single branch from branches in a similar way to getting an item from a ROOT file object (array-like notation):

branches['nMuon']

<Array [2, 2, 1, 4, 4, 3, ... 0, 3, 2, 3, 2, 3] type='100000 * uint32'>

You can see the partial list of numbers in the output that represents the number of muons in each event. It’s abbreviated with an ellipsis (...) so that it doesn’t take up the whole page.

Error?

If you get something like:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'nMuon'
then you are almost certainly using an older version of Uproot that is not compatible with the rest of this tutorial. If this is the case, install the latest version of Uproot, restart the notebook’s kernel, and try again.

These Array objects are a special type provided by the Awkward Array package. The type=100000 * uint32 means that there are 100,000 entries and that each entry is a 32-bit unsigned integer. Each entry corresponds to one event.

Let’s look at another branch:

branches['Muon_pt']

<Array [[10.8, 15.7], ... 11.4, 3.08, 4.97]] type='100000 * var * float32'>

This is a jagged array because the number of entries is different for different events (because each event can have a different number of muons). Note that there are square brackets [] surrounding the list of entries for each event. The type='100000 * var * float32' means that there are 100,000 rows, each containing a variable number of 32-bit floating point numbers. This is basically an array of arrays (or a 2D array).

Events

If we want to focus on a particular event, we can index it just like a normal array:

branches['Muon_pt'][0]

<Array [10.8, 15.7] type='2 * float32'>

From the above output, the first event has two muons, and the two numbers in the list are the muons’ pT. It’s not specified anywhere in the file, but the units are GeV. Let’s look at the third event:

branches['Muon_pt'][2]

<Array [3.28] type='1 * float32'>

It only has one muon.

Exercise

Print out the pT of all muons that are in only the first 10 events. (There are many possible ways to do this.)
Solution

Here’s one way to do it. All that matters is that you get the same numbers (and number of numbers) in the output
for i in range(10):
    print(branches['Muon_pt'][i])
[10.8, 15.7]
[10.5, 16.3]
[3.28]
[11.4, 17.6, 9.62, 3.5]
[3.28, 3.64, 32.9, 23.7]
[3.57, 4.57, 4.37]
[57.6, 53]
[11.3, 23.9]
[10.2, 14.2]
[11.5, 3.47]

What if we want to get all of the information about a single event? So far we’ve accessed data in branches by providing a branch name, but we can also just use an event index:

branches[0]

<Record ... 0.106], Muon_charge: [-1, -1]} type='{"nMuon": uint32, "Muon_pt": va...'>

This is a Record object, which is another special type provided by Awkward Array. It functions basically the same way as a standard Python dictionary (dict). Unfortunately, most of the interesting information is still hidden in the above output to save space. A little trick we can use to force printing all the data is adding .tolist():

branches[0].tolist()

{'nMuon': 2,
 'Muon_pt': [10.763696670532227, 15.736522674560547],
 'Muon_eta': [1.0668272972106934, -0.563786506652832],
 'Muon_phi': [-0.03427272289991379, 2.5426154136657715],
 'Muon_mass': [0.10565836727619171, 0.10565836727619171],
 'Muon_charge': [-1, -1]}

There we go. Now we can see the whole picture for an individual event.

.tolist()

.tolist() is a NumPy function that has been extended to Awkward Array objects. As the name suggests, it converts NumPy arrays to Python lists. In the case of trees, which have named branches, it actually converts to a dictionary of lists. It can be very useful when you want to understand exactly what’s in an Array or Record. Be careful when using it, though–trying to print out an entire branch or tree could cause Python to crash if it’s large enough. Therefore it’s best to only use tolist() on one or a few events at a time to be safe.

Key Points

TTrees are tables of data.

Trees are made of branches, which are columns in the table.

Each row represents an event.

Histograms

Overview

Teaching: 20 min
Exercises: 5 min

Questions

How do I make a histogram in Python (without ROOT)?

How can I change histogram settings?

Objectives

Create a histogram of an array (regular or jagged).

Make a histogram’s axes logarithmic.

Histogramming basics

Histograms are the most important type of plots for particle physics. We’ll need to know how to make them with the tools we have. Matplotlib is the standard and most popular plotting package for Python, and it is quite powerful, so we’ll use it. First we import it:

import matplotlib.pyplot as plt

(It’s customary to import it abbreviated as plt as above, which saves some typing.)

The histogram function in Matplotlib is hist(). We can see what it does by just passing it our nMuon branch:

plt.hist(branches['nMuon'])

(array([8.7359e+04, 1.2253e+04, 3.5600e+02, 2.8000e+01, 2.0000e+00,
        1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00]),
 array([ 0. ,  3.9,  7.8, 11.7, 15.6, 19.5, 23.4, 27.3, 31.2, 35.1, 39. ]),
 <a list of 10 Patch objects>)

nMuon_hist_1

What’s with all the numbers above the plot?

hist() actually returns all the bin contents and bin edges in case you want to do something with them after creating the plot. We don’t need these return values, and they clutter up the notebook, so we should get rid of them. There are a few ways to do this, but I think the best practice is to add plt.show(), which is a way to tell Matplotlib when your plot is all set up and ready to be displayed. As a consequence of hist() not being the last line in the notebook cell, the bin values will no longer be printed. For example:
plt.hist(branches['nMuon'])
plt.show()
I’ll follow this convention from now on.

That created a histogram, but it’s not a very good one. You can’t really understand much about the distribution because the binning and scale are too large. These settings are controlled by the bins and range parameters:

plt.hist(branches['nMuon'], bins=10, range=(0, 10))
plt.show()

nMuon_hist_2

bins here is the total number of bins (of equal width), and range is a pair of numbers representing where the first bin starts and where the last bin ends.

Binning and range tips

Getting the binning and range right for a histogram is somewhat of an art, but I often find it helpful to know the mean, standard deviation, minimum, and maximum of the original distribution.

First, import NumPy via:
import numpy as np
Then the following functions calculate these values for an array:
np.mean(branches['nMuon'])
2.35286
np.std(branches['nMuon'])
1.19175912851549
np.min(branches['nMuon'])
0
np.max(branches['nMuon'])
39

Hmm, we’re still missing axis titles on the histogram. (Always label your plots!) We can do this with the xlabel() and ylabel() functions:

plt.hist(branches['nMuon'], bins=10, range=(0, 10))
plt.xlabel('Number of muons in event')
plt.ylabel('Number of events')
plt.show()

nMuon_hist_3

Histogramming a jagged array

We can make histograms of the other branches as well, but there’s one more step necessary because of their jaggedness. Matplotlib expects a series of data to be in a 1D array, so we need to convert or flatten the jagged 2D array into a 1D array. In order to do this, we need to import Awkward Array:

import awkward as ak

Then use ak.flatten() on the branch’s array:

plt.hist(ak.flatten(branches['Muon_pt']), bins=100, range=(0, 100))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of muons / 1 GeV')
plt.show()

Muon_pt_hist_1

Note that you can use LaTeX in Matplotlib labels (as I did above).

Logarithmic scales

Another important thing to know is how to set axes to a logarithmic scale. For the y-axis, this is as simple as adding a line with plt.yscale('log'):

plt.hist(ak.flatten(branches['Muon_pt']), bins=100, range=(0, 100))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of muons / 1 GeV')
plt.yscale('log')
plt.show()

Muon_pt_hist_2

As you might guess, plt.xscale('log') will make the x-axis scale logarithmic. The issue is that this doesn’t make the bin sizes logarithmic, so the plot will end up looking quite strange in most cases. The solution to this is to use a NumPy function called logspace to calculate logarithmic bin edges:

import numpy as np
plt.hist(ak.flatten(branches['Muon_pt']), bins=np.logspace(np.log10(1), np.log10(100), 100))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.xscale('log')
plt.ylabel('Number of muons')
plt.show()

Muon_pt_hist_3

In the above example, bins is being set to an array. If Matplotlib sees that bins is an array, it will use the values of the array to set the bin edges rather than try to evenly space them across range. Don’t worry too much if this seems confusing; the details of how this works isn’t important for this lesson. The important part is that, inside the logspace() call, you can modify the numbers to change where the bins start and end and how many bins there are. Don’t remove the np.log10 part, though.

Exercise

Make a histogram of the eta of all muons. Play around with the options described in this page to try to get a nice looking plot.
Solution

Your plot might look a bit different, but it’s fine as long as the binning is reasonable such that you can see the same distribution features.
plt.hist(ak.flatten(branches['Muon_eta']), bins=50, range=(-2.5, 2.5))
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()

Key Points

Use the bins and/or range parameters to improve histogram binning.

Make sure your axes are labeled.

Jagged arrays must be flattened before histogramming.

Columnar Analysis

Overview

Teaching: 25 min
Exercises: 5 min

Questions

What is columnar analysis?

What are its advantages over row-based analysis?

How do I perform selections and operations on columns?

Objectives

Count items passing some selection criteria.

Apply selection criteria to get passing events/objects.

Plot different selections for comparison.

Uproot is designed for columnar analysis, which means performing operations on entire columns (branches) at a time, rather than operating on every event individually.

Counting

The simplest task we need for analysis is counting (i.e., cutflow). To count the total number of events, we can use the Python built-in function len() on any of the following:

len(branches)
len(branches['nMuon'])
len(branches['Muon_pt']) # or any of the other branches...

So there are 100,000 events.

Exercise

Note that this is not the total number of muons, despite running len() on a branch that has a number for every single muon (Muon_pt)! Why is this? Can you write some code that does give the total number of muons in the file?
Solution

len() only looks at the length along the first dimension of the array. Muon_pt is a 2D array. In order to count every muon individually, we need to flatten the array to 1D.
len(ak.flatten(branches['Muon_pt']))
235286

Selections

Selections from 1D arrays

To do more interesting counting or plotting, we want to be able to select only events or muons that pass some criteria. Just as you can compare two numbers, like 1 < 2 and get a True or False value, you can put arrays in comparisons as well. Let’s look at this case:

branches['nMuon'] == 1

<Array [False, False, True, ... False, False] type='100000 * bool'>

This is checking for equality between an array and the number 1. Of course these are different types (array vs. scalar), so the Python objects are certainly not identical, but you can see the return value is not just False. What happens is that each element in the array is compared to the scalar value, and the return value is a new array (of the same shape) filled with all these comparison results. So we can interpret the output as the result of testing whether each event has exactly one muon or not. The first two events do not, the third does, and so on.

This array of Boolean values is called a mask because we can use it to pick out only the elements in the array that satisfy some criteria (like having exactly one muon). This is very useful, and we will save it to a variable to save typing later:

single_muon_mask = branches['nMuon'] == 1

Counting with selections

Now let’s say we want to know how many of these single-muon events there are. Note that len() won’t work because the length of the array is still 100,000. That is, there’s a value for every event (even if that value is False). We need the number of Trues in the array. There are multiple ways to do this, but my favorite is:

np.sum(single_muon_mask)

sum() adds all the array values together. True is interpreted as 1 and False is interpreted as 0, thus sum() is just the number of True entries. So there are 13,447 events with exactly one muon.

Applying a mask to an array

If we want to apply some selection to an array (i.e., cut out events or muons that don’t pass), we just act like the selection mask is an index. For example, if we want the pT of only those muons in events with exactly one muon (so they’re the only muon in that event):

branches['Muon_pt'][single_muon_mask]

<Array [[3.28], [3.84], ... [13.3], [9.48]] type='13447 * var * float32'>

We can check that we really are only looking at the events that pass the selection by looking the number of rows:

len(branches['Muon_pt'][single_muon_mask])

Yep, this matches the counting from single_muon_mask.sum() above.

Plotting with selections

We can also use masks to plot quantities after some selection. For example, let’s plot the muon pT for only the single-muon events:

plt.hist(ak.flatten(branches['Muon_pt'][single_muon_mask]), bins=100, range=(0, 100))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 1 GeV')
plt.yscale('log')
plt.show()

single_Muon_pt_hist

If you’re looking carefully, you might notice that this plot is missing the hump around 45 GeV that was in the pT plot before (for all muons). We accidentally ran into a hint of an effect from real physics. Neat!

Selections from a jagged array

Let’s look at a comparison for a jagged array, using the absolute value of muon eta:

eta_mask = abs(branches['Muon_eta']) < 2

eta_mask

<Array [[True, True], ... True, True, True]] type='100000 * var * bool'>

Again, the mask array has the same dimensions as the original array. There’s one Boolean value for each muon, corresponding to whether its eta is less than 2 in absolute value.

Counting

We can do counting and plotting just as before:

np.sum(eta_mask)

This is the number of muons that pass the eta cut.

Plotting

Let’s plot both the original eta distribution and the one after the cut to verify its effect:

plt.hist(ak.flatten(branches['Muon_eta']), bins=50, range=(-2.5, 2.5))
plt.title('No selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()

plt.hist(ak.flatten(branches['Muon_eta'][eta_mask]), bins=50, range=(-2.5, 2.5))
plt.title('With $|\eta| < 2$ selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()

Muon_eta_hist_1

Muon_eta_hist_2

You can see the second plot just has both ends past 2 cut off, demonstrating that we’ve cut those muons out.

Operations on selections

We can invert selections with ~ (the NOT operator):

~single_muon_mask

<Array [True, True, False, ... True, True] type='100000 * bool'>

This new mask is False only for events with exactly one muon and True otherwise.

We can get the intersection of selections with & (the AND operator):

single_muon_mask & eta_mask

<Array [[False, False], ... False, False]] type='100000 * var * bool'>

This mask is True only for muons with no other muons in their event and abs(eta) < 2.

Or we can get the union of selections with | (the OR operator):

single_muon_mask | eta_mask

<Array [[True, True], ... True, True, True]] type='100000 * var * bool'>

This mask is True for muons which are the only muon in their event or which have abs(eta) < 2.

Warning about combining selections

You have to be careful about combining comparisons with the operators above. Consider the following expression:
False == False & False
True
It’s a common mistake to assume that this expression would be False by interpreting it as:
(False == False) & False
False
The issue is that & has a higher precedence than ==, so the first expression is actually equivalent to:
False == (False & False)
True
What this means is that parentheses are necessary for expressions like this to have the correct meaning:
(branches['nMuon'] == 1) & (abs(branches['Muon_eta']) < 2)

Comparing histograms

Now we can use these operations to compare distributions for different selections. Let’s look at the pT of single-event muons split into two groups by whether or not abs(eta) < 2. All we have to do is provide a list of arrays as the first argument to hist rather than just one array. Note the square brackets around the two arrays:

plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
          ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
         bins=25, range=(0, 50))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.show()

single_Muon_pt_hist_eta_split_1

Ah, but it doesn’t actually say which histogram is which. For that, we need to add labels and a legend:

plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
          ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
         label=['$|\eta| < 2$', '$|\eta| \geq 2$'],
         bins=25, range=(0, 50))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.legend()
plt.show()

single_Muon_pt_hist_eta_split_2

label is a list of strings passed to hist, corresponding to the arrays (in the same order), and we have to add plt.legend() in order to actually draw a legend with those labels.

Another problem is that these histograms are on different scales because there are fewer large eta muons. Often we want to compare only the shapes of the distribution, so we normalize the integral of each to 1. We can achieve this by adding density=True to the hist() call:

plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
          ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
         label=['$|\eta| < 2$', '$|\eta| \geq 2$'],
         bins=25, range=(0, 50), density=True)
plt.title('Normalized')
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.legend()
plt.show()

single_Muon_pt_hist_eta_split_3

Now we can clearly see there’s a significantly higher fraction of muons with abs(eta) >= 2 at lower pT compared to muons with abs(eta) < 2. This makes geometric sense, since muons at higher abs(eta) are traveling in a direction less perpendicular to the beam.

(This name of the density parameter refers to the idea of interpreting the histogram as a probability density function, which always has an integral of 1.)

Columnar vs. row-based analysis

As an aside about columnar analysis, take a look at this comparison of speed for counting muons with abs(eta) < 2.

The first example is a row-based approach, using for loops over every event and every muon in the event:
%%time

eta_count = 0

for event in branches['Muon_eta']:
    for eta in event:
        if abs(eta) < 2:
            eta_count += 1

eta_count
CPU times: user 4.78 s, sys: 77.2 ms, total: 4.86 s
Wall time: 4.88 s

204564
The next example uses columnar operations, running on all muons at once:
%%time

np.sum(abs(branches['Muon_eta']) < 2)
CPU times: user 5.18 ms, sys: 1 ms, total: 6.18 ms
Wall time: 4.87 ms

204564
The columnar approach here is about 1000 times faster.

Key Points

Selections are created by comparisons on arrays and are represented by masks (arrays of Boolean values).

Selections are applied by acting like a mask is an array index.

Avoid for loops whenever possible in analysis (especially in Python).

Getting Physics-Relevant Information

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How can I use columnar analysis to do something useful for physics?

Objectives

Find dimuon invariant mass resonances.

Okay, we’re finally ready to look for resonances in dimuon events.

We need a mask that selects events with exactly two muons:

two_muons_mask = branches['nMuon'] == 2

Now we need to construct the four-momenta of each muon. Vector is a package that provides an interface to operate on 2D, 3D, and 4D vectors. We can get the four-momenta of all muons in the tree by using vector.zip and passing it the pT, eta, phi, and mass arrays:

import vector
muon_p4 = vector.zip({'pt': branches['Muon_pt'], 'eta': branches['Muon_eta'], 'phi': branches['Muon_phi'], 'mass': branches['Muon_mass']})

We’ll go ahead and filter out events that don’t contain exactly two muons:

two_muons_p4 = muon_p4[two_muons_mask]

Then let’s take a look at it:

two_muons_p4

<MomentumArray4D [[{rho: 10.8, ... tau: 0.106}]] type='48976 * var * Momentum4D[...'>

This is an array of 4D momenta. We can get the components back out with these properties:

two_muons_p4.pt
two_muons_p4.eta
two_muons_p4.phi
two_muons_p4.E
two_muons_p4.mass

What we want is the invariant mass of the two muons in each of these events. To do that, we need the sum of their four-vectors. First, we pick out the first muon in each event with 2D slice:

first_muon_p4 = two_muons_p4[:, 0]

In the notation [:, 0], : means “include every row in the first dimension” (i.e., all events in the array). The comma separates the selection along the first dimension from the selection along the second dimension. The second dimension is the muons in each event, so we want the first, or the one at the 0 index. Then we do the same to get the second muon in each event, just changing the 0 index to 1:

second_muon_p4 = two_muons_p4[:, 1]

DeltaR

Another useful feature of these four-vector arrays is being able to compute deltaR (= sqrt(deltaEta^2 + deltaPhi^2)):
first_muon_p4.deltaR(second_muon_p4)
plt.hist(first_muon_p4.deltaR(second_muon_p4), bins=100)
plt.xlabel('$\Delta R$ between muons')
plt.ylabel('Number of two-muon events')
plt.show()
In principle, we could use this to clean up our invariant mass distribution, but we’ll skip that for simplicity.

Adding the four-vectors of the first muon and the second muon for all events is really as easy as:

sum_p4 = first_muon_p4 + second_muon_p4
sum_p4

<MomentumArray4D [{rho: 8.79, phi: 1.83, ... tau: 16.5}] type='48976 * Momentum4...'>

This is a 1D array of the four-vector sum for each event.

The last thing we need to do before we’re ready to plot the spectrum is to select only pairs with opposite charges:

two_muons_charges = branches['Muon_charge'][two_muons_mask]
opposite_sign_muons_mask = two_muons_charges[:, 0] != two_muons_charges[:, 1]

We apply this selection to the four-vector sums to get the dimuon four-vectors:

dimuon_p4 = sum_p4[opposite_sign_muons_mask]

Exercise

Plot a histogram of the dimuon invariant mass on a log-log plot. Try to find all resonances (there are at least 7 visible). How many dimuon events are there?
Solution
plt.hist(dimuon_p4.mass, bins=np.logspace(np.log10(0.1), np.log10(1000), 200))
plt.xlabel('Dimuon invariant mass [GeV]')
plt.ylabel('Number of dimuon events')
plt.xscale('log')
plt.yscale('log')
plt.show()
Labeled resonances:
len(dimuon_p4)
37183

Key Points

Vector has allows you to manipulate arrays of four-vectors.

Uproot can be used to do real physics analyses.

Advanced topic: Fitting (optional)

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I fit a common function to a histogram?

Objectives

Fit a distribution to the Z peak.

I want to zoom in on the Z resonance:

plt.hist(dimuon_p4.mass, bins=40, range=(70, 110))
plt.xlabel('Dimuon invariant mass [GeV]')
plt.ylabel('Number of dimuon events / 1 MeV')
plt.show()

dimuon_invariant_mass_fit_1

Resonances are described by the relativistic Breit-Wigner distribution. We should be able to fit one to this peak:

from scipy.optimize import curve_fit

def relativistic_breit_wigner(x, resonance_mass, width, normalization):
    gamma = np.sqrt(resonance_mass ** 2 * (resonance_mass ** 2 + width ** 2))
    k = 2.0 * np.sqrt(2) * resonance_mass * width * gamma / (np.pi * np.sqrt(resonance_mass ** 2 + gamma))
    return normalization * k / ((x ** 2 - resonance_mass ** 2) ** 2 + resonance_mass ** 2 * width ** 2)

bin_contents, bin_edges = np.histogram(dimuon_p4.mass.to_numpy(), bins=20, range=(80, 100))
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0
popt, pcov = curve_fit(relativistic_breit_wigner, bin_centers, bin_contents, p0=[90, 10, 1000], sigma=np.sqrt(bin_contents))

plt.hist(dimuon_p4.mass, bins=40, range=(70, 110), label='Data')
x = np.linspace(80, 100, 200)
y = relativistic_breit_wigner(x, *popt)
plt.plot(x, y, label='Fit')
plt.xlabel('Dimuon invariant mass [GeV]')
plt.ylabel('Number of dimuon events / 1 MeV')
plt.legend()
plt.show()

dimuon_invariant_mass_fit_2

.to_numpy()

You may have noticed the .to_numpy() used on the mass Array above. As of Awkward Array 1.4.0, this is necessary for np.histogram() to work, but this should be fixed in the next version.

The peak position is stored in popt[0]:

popt[0]

90.77360288470875

Pretty close to the real mass, 91.1876 GeV.

Key Points

There are many packages for fitting, and this just one very simple example.

Further resources

Overview

Teaching: 5 min
Exercises: 0 min

Questions

Where can I find more advanced information on these topics?

What other Python packages exist with tools for more complicated analysis?

Objectives

Find tutorials and documentation on Python tools (both generic and HEP-specific).

Caveat

One important note that I did not address at all: The methods that I showed here work great for operating on one small-ish file (smaller than your system’s RAM) at a time. Uproot has specialized methods and options that are better when running on many data files or larger files. This is one of many topics covered in the Uproot documentation below.

Scikit-HEP project

https://github.com/scikit-hep

Uproot

Awkward Array

Vector

Other tools

Hist (creating histograms)
pyhf (histogram fitting and likelihoods)
mplhep (plotting utilities)

Standard packages

NumPy
Matplotlib
SciPy
pandas (data frames)

Key Points

Python and many Python packages have a huge userbase and are well supported by documentation, tutorials, and the community.

Uproot Tutorial

Introduction

Overview

Python and HEP

Uproot

Minimal dependencies

Speed

Jagged arrays

Jagged vs. rectangular arrays

Key Points

Opening files

Overview

Example motivation

Opening a file

Import errors?

File contents

Why the ;1?

Accessing contents

Key Points

Trees, Branches, and Events

Overview

Trees

Branches

Error?

Events

Exercise

Solution

.tolist()

Key Points

Histograms

Overview

Histogramming basics

What’s with all the numbers above the plot?

Binning and range tips

Histogramming a jagged array

Logarithmic scales

Exercise

Solution

Key Points

Columnar Analysis

Overview

Counting

Exercise

Solution

Selections

Selections from 1D arrays

Counting with selections

Applying a mask to an array

Plotting with selections

Selections from a jagged array

Counting

Plotting

Operations on selections

Warning about combining selections

Comparing histograms

Columnar vs. row-based analysis

Key Points

Getting Physics-Relevant Information

Overview

DeltaR

Exercise

Solution

Key Points

Advanced topic: Fitting (optional)

Overview

.to_numpy()

Key Points

Further resources

Overview

Caveat

Scikit-HEP project

Uproot

Awkward Array

Vector

Other tools

Standard packages

Key Points

Why the `;1`?

`.tolist()`

`.to_numpy()`