# Columnar Analysis

## Overview

Teaching:25 min

Exercises:5 minQuestions

What is columnar analysis?

What are its advantages over row-based analysis?

How do I perform selections and operations on columns?

Objectives

Count items passing some selection criteria.

Apply selection criteria to get passing events/objects.

Plot different selections for comparison.

Uproot is designed for columnar analysis, which means performing operations on entire columns (branches) at a time, rather than operating on every event individually.

# Counting

The simplest task we need for analysis is counting (i.e., cutflow).
To count the total number of events, we can use the Python built-in function `len()`

on any of the following:

```
len(branches)
len(branches['nMuon'])
len(branches['Muon_pt']) # or any of the other branches...
```

```
100000
```

So there are 100,000 events.

## Exercise

Note that this is

notthe total number of muons, despite running`len()`

on a branch that has a number for every single muon (`Muon_pt`

)! Why is this? Can you write some code thatdoesgive the total number of muons in the file?## Solution

`len()`

only looks at the length along the first dimension of the array.`Muon_pt`

is a 2D array. In order to count every muon individually, we need to flatten the array to 1D.`len(ak.flatten(branches['Muon_pt']))`

`235286`

# Selections

## Selections from 1D arrays

To do more interesting counting or plotting, we want to be able to select only events or muons that pass some criteria.
Just as you can compare two numbers, like `1 < 2`

and get a `True`

or `False`

value, you can put arrays in comparisons as well.
Let’s look at this case:

```
branches['nMuon'] == 1
```

```
<Array [False, False, True, ... False, False] type='100000 * bool'>
```

This is checking for equality between an array and the number `1`

.
Of course these are different types (array vs. scalar), so the Python objects are certainly not identical, but you can see the return value is not just `False`

.
What happens is that each element in the array is compared to the scalar value, and the return value is a new array (of the same shape) filled with all these comparison results.
So we can interpret the output as the result of testing whether each event has exactly one muon or not.
The first two events do not, the third does, and so on.

This array of Boolean values is called a *mask* because we can use it to pick out only the elements in the array that satisfy some criteria (like having exactly one muon).
This is very useful, and we will save it to a variable to save typing later:

```
single_muon_mask = branches['nMuon'] == 1
```

### Counting with selections

Now let’s say we want to know how many of these single-muon events there are.
Note that `len()`

won’t work because the length of the array is still 100,000.
That is, there’s a *value* for every event (even if that value is `False`

).
We need the number of `True`

s in the array.
There are multiple ways to do this, but my favorite is:

```
np.sum(single_muon_mask)
```

```
13447
```

`sum()`

adds all the array values together.
`True`

is interpreted as `1`

and `False`

is interpreted as `0`

, thus `sum()`

is just the number of `True`

entries.
So there are 13,447 events with exactly one muon.

### Applying a mask to an array

If we want to apply some selection to an array (i.e., cut out events or muons that don’t pass), we just act like the selection mask is an index. For example, if we want the pT of only those muons in events with exactly one muon (so they’re the only muon in that event):

```
branches['Muon_pt'][single_muon_mask]
```

```
<Array [[3.28], [3.84], ... [13.3], [9.48]] type='13447 * var * float32'>
```

We can check that we really are only looking at the events that pass the selection by looking the number of rows:

```
len(branches['Muon_pt'][single_muon_mask])
```

```
13447
```

Yep, this matches the counting from `single_muon_mask.sum()`

above.

### Plotting with selections

We can also use masks to plot quantities after some selection. For example, let’s plot the muon pT for only the single-muon events:

```
plt.hist(ak.flatten(branches['Muon_pt'][single_muon_mask]), bins=100, range=(0, 100))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 1 GeV')
plt.yscale('log')
plt.show()
```

If you’re looking carefully, you might notice that this plot is missing the hump around 45 GeV that was in the pT plot before (for all muons). We accidentally ran into a hint of an effect from real physics. Neat!

## Selections from a jagged array

Let’s look at a comparison for a jagged array, using the absolute value of muon eta:

```
eta_mask = abs(branches['Muon_eta']) < 2
```

```
eta_mask
```

```
<Array [[True, True], ... True, True, True]] type='100000 * var * bool'>
```

Again, the mask array has the same dimensions as the original array. There’s one Boolean value for each muon, corresponding to whether its eta is less than 2 in absolute value.

### Counting

We can do counting and plotting just as before:

```
np.sum(eta_mask)
```

```
204564
```

This is the number of muons that pass the eta cut.

### Plotting

Let’s plot both the original eta distribution and the one after the cut to verify its effect:

```
plt.hist(ak.flatten(branches['Muon_eta']), bins=50, range=(-2.5, 2.5))
plt.title('No selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()
plt.hist(ak.flatten(branches['Muon_eta'][eta_mask]), bins=50, range=(-2.5, 2.5))
plt.title('With $|\eta| < 2$ selection')
plt.xlabel('Muon $\eta$')
plt.ylabel('Number of muons')
plt.show()
```

You can see the second plot just has both ends past 2 cut off, demonstrating that we’ve cut those muons out.

## Operations on selections

We can invert selections with `~`

(the NOT operator):

```
~single_muon_mask
```

```
<Array [True, True, False, ... True, True] type='100000 * bool'>
```

This new mask is `False`

only for events with exactly one muon and `True`

otherwise.

We can get the intersection of selections with `&`

(the AND operator):

```
single_muon_mask & eta_mask
```

```
<Array [[False, False], ... False, False]] type='100000 * var * bool'>
```

This mask is `True`

only for muons with no other muons in their event *and* abs(eta) < 2.

Or we can get the union of selections with `|`

(the OR operator):

```
single_muon_mask | eta_mask
```

```
<Array [[True, True], ... True, True, True]] type='100000 * var * bool'>
```

This mask is `True`

for muons which are the only muon in their event *or* which have abs(eta) < 2.

## Warning about combining selections

You have to be careful about combining comparisons with the operators above. Consider the following expression:

`False == False & False`

`True`

It’s a common mistake to assume that this expression would be

`False`

by interpreting it as:`(False == False) & False`

`False`

The issue is that

`&`

has a higher precedence than`==`

, so the first expression is actually equivalent to:`False == (False & False)`

`True`

What this means is that parentheses are necessary for expressions like this to have the correct meaning:

`(branches['nMuon'] == 1) & (abs(branches['Muon_eta']) < 2)`

# Comparing histograms

Now we can use these operations to compare distributions for different selections.
Let’s look at the pT of single-event muons split into two groups by whether or not abs(eta) < 2.
All we have to do is provide a list of arrays as the first argument to `hist`

rather than just one array.
Note the square brackets around the two arrays:

```
plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
bins=25, range=(0, 50))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.show()
```

Ah, but it doesn’t actually say which histogram is which. For that, we need to add labels and a legend:

```
plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
label=['$|\eta| < 2$', '$|\eta| \geq 2$'],
bins=25, range=(0, 50))
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.legend()
plt.show()
```

`label`

is a list of strings passed to `hist`

, corresponding to the arrays (in the same order),
and we have to add `plt.legend()`

in order to actually draw a legend with those labels.

Another problem is that these histograms are on different scales because there are fewer large eta muons.
Often we want to compare only the shapes of the distribution, so we normalize the integral of each to 1.
We can achieve this by adding `density=True`

to the `hist()`

call:

```
plt.hist([ak.flatten(branches['Muon_pt'][single_muon_mask & eta_mask]),
ak.flatten(branches['Muon_pt'][single_muon_mask & ~eta_mask])],
label=['$|\eta| < 2$', '$|\eta| \geq 2$'],
bins=25, range=(0, 50), density=True)
plt.title('Normalized')
plt.xlabel('Muon $p_{\mathrm{T}}$ [GeV]')
plt.ylabel('Number of single muons / 2 GeV')
plt.legend()
plt.show()
```

Now we can clearly see there’s a significantly higher fraction of muons with abs(eta) >= 2 at lower pT compared to muons with abs(eta) < 2. This makes geometric sense, since muons at higher abs(eta) are traveling in a direction less perpendicular to the beam.

(This name of the `density`

parameter refers to the idea of interpreting the histogram as a probability density function, which always has an integral of 1.)

## Columnar vs. row-based analysis

As an aside about columnar analysis, take a look at this comparison of speed for counting muons with abs(eta) < 2.

The first example is a row-based approach, using

`for`

loops over every event and every muon in the event:`%%time eta_count = 0 for event in branches['Muon_eta']: for eta in event: if abs(eta) < 2: eta_count += 1 eta_count`

`CPU times: user 4.78 s, sys: 77.2 ms, total: 4.86 s Wall time: 4.88 s 204564`

The next example uses columnar operations, running on all muons at once:

`%%time np.sum(abs(branches['Muon_eta']) < 2)`

`CPU times: user 5.18 ms, sys: 1 ms, total: 6.18 ms Wall time: 4.87 ms 204564`

The columnar approach here is about 1000 times faster.

## Key Points

Selections are created by comparisons on arrays and are represented by masks (arrays of Boolean values).

Selections are applied by acting like a mask is an array index.

Avoid

`for`

loops whenever possible in analysis (especially in Python).