Skip to article frontmatterSkip to article content

Jagged, ragged, Awkward Arrays

What is Awkward Array?

The previous lesson included a tricky slice:

cut = muons["nMuon"] == 2

pt0 = muons["Muon_pt", cut, 0]

The three parts of muons["Muon_pt", cut, 0] slice

  1. selects the "Muon_pt" field of all records in the array,
  2. applies cut, a boolean array, to select only events with two muons,
  3. selects the first (0) muon from each of those pairs. Similarly for the second (1) muons.

NumPy would not be able to perform such a slice, or even represent an array of variable-length lists without resorting to arrays of objects.

import numpy as np

# generates a ValueError
np.array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 4
      1 import numpy as np
      3 # generates a ValueError
----> 4 np.array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

Awkward Array is intended to fill this gap:

import awkward as ak

ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])
Loading...

Arrays like this are sometimes called “jagged arrays” and sometimes “ragged arrays.”

Slicing in Awkward Array

Basic slices are a generalization of NumPy’s—what NumPy would do if it had variable-length lists.

array = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])
array.tolist()
[[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]]
array[2]
Loading...
array[-1, 1]
np.float64(7.7)
array[2:, 0]
Loading...
array[2:, 1:]
Loading...
array[:, 0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 array[:, 0]

File ~/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/awkward/highlevel.py:1104, in Array.__getitem__(self, where)
    675 def __getitem__(self, where):
    676     """
    677     Args:
    678         where (many types supported; see below): Index of positions to
   (...)   1102     have the same dimension as the array being indexed.
   1103     """
-> 1104     with ak._errors.SlicingErrorContext(self, where):
   1105         # Handle named axis
   1106         (_, ndim) = self._layout.minmax_depth
   1107         named_axis = _get_named_axis(self)

File ~/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     78     self._slate.__dict__.clear()
     79     # Handle caught exception
---> 80     raise self.decorate_exception(exception_type, exception_value)
     81 else:
     82     # Step out of the way so that another ErrorContext can become primary.
     83     if self.primary() is self:

File ~/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/awkward/highlevel.py:1112, in Array.__getitem__(self, where)
   1108 where = _normalize_named_slice(named_axis, where, ndim)
   1110 NamedAxis.mapping = named_axis
-> 1112 indexed_layout = prepare_layout(self._layout._getitem(where, NamedAxis))
   1114 if NamedAxis.mapping:
   1115     return ak.operations.ak_with_named_axis._impl(
   1116         indexed_layout,
   1117         named_axis=NamedAxis.mapping,
   (...)   1120         attrs=self._attrs,
   1121     )

File ~/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/awkward/contents/content.py:651, in Content._getitem(self, where, named_axis)
    642 named_axis.mapping = _named_axis
    644 next = ak.contents.RegularArray(
    645     this,
    646     this.length,
    647     1,
    648     parameters=None,
    649 )
--> 651 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
    653 if out.length is not unknown_length and out.length == 0:
    654     return out._getitem_nothing()

File ~/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/awkward/contents/regulararray.py:549, in RegularArray._getitem_next(self, head, tail, advanced)
    543 nextcontent = self._content._carry(nextcarry, True)
    545 if advanced is None or (
    546     advanced.length is not unknown_length and advanced.length == 0
    547 ):
    548     return RegularArray(
--> 549         nextcontent._getitem_next(nexthead, nexttail, advanced),
    550         nextsize,
    551         self.length,
    552         parameters=self._parameters,
    553     )
    554 else:
    555     nextadvanced = ak.index.Index64.empty(nextcarry.length, nplike)

File ~/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/awkward/contents/listarray.py:759, in ListArray._getitem_next(self, head, tail, advanced)
    753 head = ak._slicing.normalize_integer_like(head)
    754 assert (
    755     nextcarry.nplike is self._backend.nplike
    756     and self._starts.nplike is self._backend.nplike
    757     and self._stops.nplike is self._backend.nplike
    758 )
--> 759 self._maybe_index_error(
    760     self._backend[
    761         "awkward_ListArray_getitem_next_at",
    762         nextcarry.dtype.type,
    763         self._starts.dtype.type,
    764         self._stops.dtype.type,
    765     ](
    766         nextcarry.data,
    767         self._starts.data,
    768         self._stops.data,
    769         lenstarts,
    770         head,
    771     ),
    772     slicer=head,
    773 )
    774 nextcontent = self._content._carry(nextcarry, True)
    775 return nextcontent._getitem_next(nexthead, nexttail, advanced)

File ~/miniconda3/envs/skhep-tutorial/lib/python3.12/site-packages/awkward/contents/content.py:297, in Content._maybe_index_error(self, error, slicer)
    295 else:
    296     message = self._backend.format_kernel_error(error)
--> 297     raise ak._errors.index_error(self, slicer, message)

IndexError: cannot slice ListArray (of length 5) with array(0): index out of range while attempting to get index 0 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-50/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_next_at.cpp#L21)

This error occurred while attempting to slice

    <Array [[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]] type='5 * var * float64'>

with

    (:, 0)

Quick quiz: why does the last one raise an error?

Boolean and integer slices work, too:

array[[True, False, True, False, True]]
Loading...
array[[2, 3, 3, 1]]
Loading...

Like NumPy, boolean arrays for slices can be computed, and functions like ak.num are helpful for that.

ak.num(array)
Loading...
ak.num(array) > 0
Loading...
array[ak.num(array) > 0, 0]
Loading...
array[ak.num(array) > 1, 1]
Loading...

Now consider this (similar to an example from the first lesson):

cut = array * 10 % 2 == 0

array[cut]
Loading...

This array, cut, is not just an array of booleans. It’s a jagged array of booleans. All of its nested lists fit into array’s nested lists, so it can deeply select numbers, rather than selecting lists.

Application: selecting particles, rather than events

Returning to the big TTree from the previous lesson,

import uproot

file = uproot.open(
    "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
)
tree = file["Events"]

muon_pt = tree["Muon_pt"].array(entry_stop=10)

This jagged array of booleans selects all muons with at least 20 GeV:

particle_cut = muon_pt > 20

muon_pt[particle_cut]
Loading...

and this non-jagged array of booleans (made with ak.any) selects all events that have a muon with at least 20 GeV:

event_cut = ak.any(muon_pt > 20, axis=1)

muon_pt[event_cut]
Loading...

Quick quiz: construct exactly the same event_cut using ak.max.

Quick quiz: apply both cuts; that is, select muons with over 20 GeV from events that have them.

Hint: you’ll want to make a

cleaned = muon_pt[particle_cut]

intermediary and you can’t use the variable event_cut, as-is.

Hint: the final result should be a jagged array, just like muon_pt, but with fewer lists and fewer items in those lists.

Combinatorics in Awkward Array

Variable-length lists present more problems than just slicing and computing formulas array-at-a-time. Often, we want to combine particles in all possible pairs (within each event) to look for decay chains.

Pairs from two arrays, pairs from a single array

Awkward Array has functions that generate these combinations. For instance, ak.cartesian takes a Cartesian product per event (when axis=1, the default).

cartoon-cartesian
numbers = ak.Array([[1, 2, 3], [], [5, 7], [11]])
letters = ak.Array([["a", "b"], ["c"], ["d"], ["e", "f"]])

pairs = ak.cartesian((numbers, letters))

These pairs are 2-tuples, which are like records in how they’re sliced out of an array: using strings.

pairs["0"]
Loading...
pairs["1"]
Loading...

There’s also ak.unzip, which extracts every field into a separate array (opposite of ak.zip).

lefts, rights = ak.unzip(pairs)
lefts
Loading...
rights
Loading...

Note that these lefts and rights are not the original numbers and letters: they have been duplicated and have the same shape.

The Cartesian product is equivalent to this C++ for loop over two collections:

for (int i = 0; i < numbers.size(); i++) {
  for (int j = 0; j < letters.size(); j++) {
    // compute formula with numbers[i] and letters[j]
  }
}

Sometimes, though, we want to find all pairs within a single collection, without repetition. That would be equivalent to this C++ for loop:

for (int i = 0; i < numbers.size(); i++) {
  for (int j = i + 1; i < numbers.size(); j++) {
    // compute formula with numbers[i] and numbers[j]
  }
}

The Awkward function for this case is ak.combinations.

cartoon-combinations
pairs = ak.combinations(numbers, 2)
pairs

lefts, rights = ak.unzip(pairs)

lefts * rights  # they line up, so we can compute formulas
Loading...

Application to dimuons

The dimuon search in the previous lesson was a little naive in that we required exactly two muons to exist in every event and only computed the mass of that combination. If a third muon were present because it’s a complex electroweak decay or because something was mismeasured, we would be blind to the other two muons. They might be real dimuons.

A better procedure would be to look for all pairs of muons in an event and apply some criteria for selecting them.

In this example, we’ll ak.zip the muon variables together into records.

import uproot
import awkward as ak

file = uproot.open(
    "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
)
tree = file["Events"]

arrays = tree.arrays(filter_name="/Muon_(pt|eta|phi|charge)/", entry_stop=10000)

muons = ak.zip(
    {
        "pt": arrays["Muon_pt"],
        "eta": arrays["Muon_eta"],
        "phi": arrays["Muon_phi"],
        "charge": arrays["Muon_charge"],
    }
)
arrays.type
ArrayType(RecordType([ListType(NumpyType('float32')), ListType(NumpyType('float32')), ListType(NumpyType('float32')), ListType(NumpyType('int32'))], ['Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_charge']), 10000, None)
muons.type
ArrayType(ListType(RecordType([NumpyType('float32'), NumpyType('float32'), NumpyType('float32'), NumpyType('int32')], ['pt', 'eta', 'phi', 'charge'])), 10000, None)

The difference between arrays and muons is that arrays contains separate lists of "Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge", while muons contains lists of records with "pt", "eta", "phi", "charge" fields.

Now we can compute pairs of muon objects

pairs = ak.combinations(muons, 2)

pairs.type
ArrayType(ListType(RecordType([RecordType([NumpyType('float32'), NumpyType('float32'), NumpyType('float32'), NumpyType('int32')], ['pt', 'eta', 'phi', 'charge']), RecordType([NumpyType('float32'), NumpyType('float32'), NumpyType('float32'), NumpyType('int32')], ['pt', 'eta', 'phi', 'charge'])], None)), 10000, None)

and separate them into arrays of the first muon and the second muon in each pair.

mu1, mu2 = ak.unzip(pairs)

Quick quiz: how would you ensure that all lists of records in mu1 and mu2 have the same lengths? Hint: see ak.num and ak.all.

Since they do have the same lengths, we can use them in a formula.

import numpy as np

mass = np.sqrt(
    2 * mu1.pt * mu2.pt * (np.cosh(mu1.eta - mu2.eta) - np.cos(mu1.phi - mu2.phi))
)

Quick quiz: how many masses do we have in each event? How does this compare with muons, mu1, and mu2?

Plotting the jagged array

Since this mass is a jagged array, it can’t be directly histogrammed. Histograms take a set of numbers as inputs, but this array contains lists.

Supposing you just want to plot the numbers from the lists, you can use ak.flatten to flatten one level of list or ak.ravel to flatten all levels.

import hist

hist.Hist(hist.axis.Regular(120, 0, 120, label="mass [GeV]")).fill(
    ak.ravel(mass)
).plot();
<Figure size 640x480 with 1 Axes>

Alternatively, suppose you want to plot the maximum mass-candidate in each event, biasing it toward Z bosons? ak.max is a different function that picks one element from each list, when used with axis=1.

ak.max(mass, axis=1)
Loading...

Some values are None because there is no maximum of an empty list. ak.flatten/ak.ravel remove missing values (None) as well as squashing lists,

ak.flatten(ak.max(mass, axis=1), axis=0)
Loading...

but so does removing the empty lists in the first place.

ak.max(mass[ak.num(mass) > 0], axis=1)
Loading...