Lesson 3: Ragged and nested arrays#

So far, all the arrays we’ve dealt with have been rectangular (in \(n\) dimensions; “rectilinear”).

What if we had data like this?

[
  [[1.84, 0.324]],
  [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
  [[0.459, -1.517, 1.545], [0.33, 0.292]],
  [[-0.376, -1.46, -0.206], [0.65, 1.278]],
  [[], [], [1.617]],
  []
]
[
  [[-0.106, 0.611]],
  [[0.118, -1.788, 0.794, 0.658], [-0.105]]
]
[
  [[-0.384], [0.697, -0.856]],
  [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]
]
[
  [[0.205, -0.355], [-0.265], [1.042]],
  [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
  [[1.741, -0.199, 0.827]]
]

Or like this?

[
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 5.27453, "y": 1.03276},
    {"x": -3.51280, "y": 1.74849}]},
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 8.21630, "y": 4.07844},
    {"x": -0.79157, "y": 3.49478}, {"x": 16.38932, "y": 5.29399},
    {"x": 10.38641, "y": 0.10832}, {"x": -2.07070, "y": 14.07140},
    {"x": 9.57021, "y": -0.94823}, {"x": 1.97332, "y": 3.62380},
    {"x": 5.66760, "y": 11.38001}, {"x": 0.25497, "y": 3.39276},
    {"x": 3.86585, "y": 6.22051}, {"x": -0.67393, "y": 2.20572}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": 3.59528, "y": 7.37191},
    {"x": 0.59192, "y": 2.91503}, {"x": 4.02932, "y": -1.13601},
    {"x": -1.01593, "y": 1.95894}, {"x": 1.03666, "y": 0.05251}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": -8.78510, "y": -0.00497},
    {"x": -15.22688, "y": 3.90244}, {"x": 5.74593, "y": 4.12718}]},
  {"fill": "none", "stroke": "#000000", "points": [{"x": 4.40625, "y": -6.953125},
    {"x": 4.34375, "y": -7.09375}, {"x": 4.3125, "y": -7.140625},
    {"x": 4.140625, "y": -7.140625}]},
  {"fill": "none", "stroke": "#808080", "points": [{"x": 0.46875, "y": -0.09375},
    {"x": 0.46875, "y": -0.078125}, {"x": 0.46875, "y": 0.53125}]}
]

Or like this?

[
  {"movie": "Evil Dead", "year": 1981, "actors":
    ["Bruce Campbell", "Ellen Sandweiss", "Richard DeManincor", "Betsy Baker"]
  },
  {"movie": "Darkman", "year": 1900, "actors":
    ["Liam Neeson", "Frances McDormand", "Larry Drake", "Bruce Campbell"]
  },
  {"movie": "Army of Darkness", "year": 1992, "actors":
    ["Bruce Campbell", "Embeth Davidtz", "Marcus Gilbert", "Bridget Fonda",
     "Ted Raimi", "Patricia Tallman"]
  },
  {"movie": "A Simple Plan", "year": 1998, "actors":
    ["Bill Paxton", "Billy Bob Thornton", "Bridget Fonda", "Brent Briscoe"]
  },
  {"movie": "Spider-Man 2", "year": 2004, "actors":
    ["Tobey Maguire", "Kristen Dunst", "Alfred Molina", "James Franco",
     "Rosemary Harris", "J.K. Simmons", "Stan Lee", "Bruce Campbell"]
  },
  {"movie": "Drag Me to Hell", "year": 2009, "actors":
    ["Alison Lohman", "Justin Long", "Lorna Raver", "Dileep Rao", "David Paymer"]
  }
]

Or like this? (Hint: we do!)

[
  {"run": 1, "luminosityBlock": 156, "event": 46501,
   "PV": {"x": 0.243, "y": 0.393, "z": 1.451},
   "electron": [],
   "muon": [
     {"pt": 63.043, "eta": -0.718, "phi": 2.968, "mass": 0.105, "charge": 1},
     {"pt": 38.120, "eta": -0.879, "phi": -1.032, "mass": 0.105, "charge": -1},
     {"pt": 4.048, "eta": -0.320, "phi": 1.038, "mass": 0.105, "charge": 1}
   ],
   "MET": {"pt": 21.929, "phi": -2.730}
  },
  {"run": 1, "luminosityBlock": 156, "event": 46502,
   "PV": {"x": 0.244, "y": 0.395, "z": -2.879},
   "electron": [
     {"pt": 21.902, "eta": -0.702, "phi": 0.133, "mass": 0.005, "charge": 1},
     {"pt": 42.632, "eta": -0.979, "phi": -1.863, "mass": 0.008, "charge": 1},
     {"pt": 78.012, "eta": -0.933, "phi": -2.207, "mass": 0.018, "charge": -1},
     {"pt": 23.835, "eta": -1.362, "phi": -0.621, "mass": 0.008, "charge": -1}
   ],
   "muon": [],
   "MET": {"pt": 16.972, "phi": 2.866}}
]

It might be possible to turn these datasets into tabular form using surrogate keys and database normalization, but

  • they could be inconvenient or less efficient in that form, depending on what we want to do,

  • they were very likely delivered to us in this ragged/untidy form, and we can’t ignore the data-cleaning step!

Dealing with these datasets as JSON or Python objects is inefficient for the same reasons that Python lists of numbers are inefficient.

We want arbitrary data structures with an array-oriented interface and array-oriented performance…

Libraries for irregular arrays#

This is still a rather new field, but there are a few array libraries for arbitrary data structures.

  • Apache Arrow: In-memory format and an ecosystem of tools, an “exploded database” (database functionality provided as interchangeable pieces). Strong focus on delivering data, zero-copy, between processes.

  • Awkward Array: Library for array-oriented programming like NumPy, but for arbitrary data structures. Losslessly zero-copy convertible to/from Arrow and Parquet.

  • Parquet: Disk format for storing large datasets and (selectively) retrieving them.

  • ROOT RNTuple: A more flexible disk format than Parquet, in the ROOT framework.

Apache Arrow#

Arrow arrays support arbitrary data structures, but they’re intended to be used within some other interface. Increasingly, dataframe libraries are adopting Arrow as a column format.

import pyarrow as pa
arrow_array = pa.array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
])
arrow_array.type
ListType(list<item: struct<x: double, y: list<item: int64>>>)
arrow_array
<pyarrow.lib.ListArray object at 0x7f6813fe55a0>
[
  -- is_valid: all not null
  -- child 0 type: double
    [
      1.1,
      2.2,
      3.3
    ]
  -- child 1 type: list<item: int64>
    [
      [
        1
      ],
      [
        1,
        2
      ],
      [
        1,
        2,
        3
      ]
    ],
  -- is_valid: all not null
  -- child 0 type: double
    []
  -- child 1 type: list<item: int64>
    [],
  -- is_valid: all not null
  -- child 0 type: double
    [
      4.4,
      5.5
    ]
  -- child 1 type: list<item: int64>
    [
      [
        1,
        2,
        3,
        4
      ],
      [
        1,
        2,
        3,
        4,
        5
      ]
    ]
]
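
For instance, the same array can become a column of a pyarrow Table, the kind of object dataframe libraries typically exchange. A minimal sketch (the column name “points” is arbitrary):

arrow_table = pa.table({"points": arrow_array})   # a one-column Table wrapping the nested array
arrow_table.num_rows                              # 3, one entry per outer list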

Awkward Array#

Awkward Arrays are fully interchangeable with Arrow arrays, but unlike Arrow, they present a high-level user interface intended for data analysts.

import awkward as ak
awkward_array = ak.from_arrow(arrow_array)
awkward_array
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}]]
----------------------------------------------------------------
type: 3 * var * ?{
    x: ?float64,
    y: option[var * ?int64]
}
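
The conversion goes both ways: ak.to_arrow returns a pyarrow array again, reusing the same underlying buffers (the zero-copy property mentioned above). A minimal sketch:

arrow_again = ak.to_arrow(awkward_array)   # a pyarrow array again, sharing the same data buffers
arrow_again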

Parquet#

Awkward Arrays don’t lose any information when saved as Parquet files.

ak.to_parquet(awkward_array, "/tmp/file.parquet")
<pyarrow._parquet.FileMetaData object at 0x7f68137cec50>
  created_by: parquet-cpp-arrow version 18.1.0
  num_columns: 2
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 0
ak.from_parquet("/tmp/file.parquet")
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}]]
----------------------------------------------------------------
type: 3 * var * ?{
    x: ?float64,
    y: option[var * ?int64]
}
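
Because Parquet is a columnar format, you don’t have to read everything back. A minimal sketch of selective reading, assuming the columns argument of ak.from_parquet matches the x field by name:

ak.from_parquet("/tmp/file.parquet", columns=["x"])   # read only the x column, skipping y on disk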

RNTuple#

At the time of writing, RNTuple is very new and I can’t provide a similar example.

TO DO: Write this section when the tools are ready!

Array-oriented programming with Awkward Arrays#

Let’s start with an example to show slicing.

You can find more details in the Awkward Array documentation.

ragged = ak.Array([
    [
      [[1.84, 0.324]],
      [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
      [[0.459, -1.517, 1.545], [0.33, 0.292]],
      [[-0.376, -1.46, -0.206], [0.65, 1.278]],
      [[], [], [1.617]],
      []
    ],
    [
      [[-0.106, 0.611]],
      [[0.118, -1.788, 0.794, 0.658], [-0.105]]
    ],
    [
      [[-0.384], [0.697, -0.856]],
      [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]
    ],
    [
      [[0.205, -0.355], [-0.265], [1.042]],
      [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
      [[1.741, -0.199, 0.827]]
    ]
])

Multidimensional indexing#

ragged[3, 1, -1, 2]
np.float64(0.726)

Range-slicing#

ragged[3, 1:, -1, 1:3]
[[-0.054, 0.726],
 [-0.199, 0.827]]
-----------------------
type: 2 * var * float64
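
For comparison, here is what the same two lookups would look like on pure Python lists (via to_list); the array-oriented slices above replace this nested indexing and list comprehension.

pylist = ragged.to_list()
pylist[3][1][-1][2]                        # 0.726, same as ragged[3, 1, -1, 2]
[sub[-1][1:3] for sub in pylist[3][1:]]    # same as ragged[3, 1:, -1, 1:3]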

Advanced slicing#

ragged[[False, False, True, True], [0, -1, 0, -1], 0, -1]
[-0.384,
 0.827]
-----------------
type: 2 * float64

Awkward slicing#

Just as NumPy arrays can be sliced by NumPy arrays, Awkward Arrays can be sliced by Awkward Arrays. This lets you perform physics cuts on ragged data.

See the full Awkward slicing documentation for more.

ragged > 0
[[[[True, True]], [[False, False, True], [True, ..., True]], ..., [...], []],
 [[[False, True]], [[True, False, True, True], [False]]],
 [[[False], [True, False]], [[True, True, False, False], ..., [...]]],
 [[[True, False], [False], [True]], [[False], ...], [[True, False, True]]]]
-----------------------------------------------------------------------------
type: 4 * var * var * var * bool
ragged[ragged > 0]
[[[[1.84, 0.324]], [[0.005], [0.953, 0.011, 0.718]], [...], ..., [[], ...], []],
 [[[0.611]], [[0.118, 0.794, 0.658], []]],
 [[[], [0.697]], [[0.778, 0.023], [], [1.15, 0.305, 1.52]]],
 [[[0.205], [], [1.04]], [[], [0.726, 0.213]], [[1.74, 0.827]]]]
--------------------------------------------------------------------------------
type: 4 * var * var * var * float64
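
Any boolean array with the same ragged structure works as a cut. For example, a sketch that keeps only values within one unit of zero (NumPy ufuncs such as np.abs operate elementwise on Awkward Arrays):

import numpy as np
small = ragged[np.abs(ragged) < 1]   # keeps |value| < 1, preserving the list structure
small.type.show()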

Reductions#

ak.sum(ragged)
np.float64(2.8980000000000006)
ak.sum(ragged, axis=-1)
[[[2.16], [-2.32, 0.689], [0.487, 0.622], [-2.04, ...], [0, 0, 1.62], []],
 [[0.505], [-0.218, -0.105]],
 [[-0.384, -0.159], [-2.94, -0.67, 1.01]],
 [[-0.15, -0.265, 1.04], [-0.004, -0.282], [2.37]]]
--------------------------------------------------------------------------
type: 4 * var * var * float64
ak.sum(ragged, axis=0)
[[[1.56, 0.58], [0.432, -0.856], [1.04]],
 [[-0.717, -2.48, -0.656, -1.63], ..., [1.15, -1.67, 0.305, 1.52, -0.292]],
 [[2.2, -1.72, 2.37], [0.33, 0.292]],
 [[-0.376, -1.46, -0.206], [0.65, 1.28]],
 [[], [], [1.62]],
 []]
---------------------------------------------------------------------------
type: 6 * var * var * float64

Just as regular arrays can be reduced along different axes, ragged arrays (even with missing values like None) can be reduced along different axes. For a vertical reduction (axis=0), the lists are aligned at their left edges:

array = ak.Array([[   1,    2,    3,    4],
                  [  10, None,   30      ],
                  [ 100,  200            ]])
ak.sum(array, axis=0).tolist()
[111, 202, 33, 4]
ak.sum(array, axis=1).tolist()
[10, 40, 300]

In most particle physics applications, you’ll want to reduce over the deepest/maximum axis, which you can get with axis=-1.
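
As a quick check with the small array above (which is two-dimensional), axis=-1 picks out the same, deepest axis as axis=1:

ak.sum(array, axis=-1).tolist()   # [10, 40, 300], identical to the axis=1 result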

Awkward Arrays in particle physics#

Here is a bigger version of the JSON dataset, loaded from a ROOT file with Uproot.

import uproot
file = uproot.open("data/SMHiggsToZZTo4L.root")
file
<ReadOnlyDirectory '/' at 0x7f6814008c50>
file.classnames()
{'Events;1': 'TTree'}
tree = file["Events"]
tree
<TTree 'Events' (32 branches) at 0x7f6813701990>
tree.arrays()
[{run: 1, luminosityBlock: 156, event: 46501, PV_npvs: 15, PV_x: 0.244, ...},
 {run: 1, luminosityBlock: 156, event: 46502, PV_npvs: 13, PV_x: 0.244, ...},
 {run: 1, luminosityBlock: 156, event: 46503, PV_npvs: 11, PV_x: 0.242, ...},
 {run: 1, luminosityBlock: 156, event: 46504, PV_npvs: 22, PV_x: 0.243, ...},
 {run: 1, luminosityBlock: 156, event: 46505, PV_npvs: 18, PV_x: 0.24, ...},
 {run: 1, luminosityBlock: 156, event: 46506, PV_npvs: 5, PV_x: 0.243, ...},
 {run: 1, luminosityBlock: 156, event: 46507, PV_npvs: 11, PV_x: 0.242, ...},
 {run: 1, luminosityBlock: 156, event: 46508, PV_npvs: 25, PV_x: 0.245, ...},
 {run: 1, luminosityBlock: 156, event: 46509, PV_npvs: 12, PV_x: 0.246, ...},
 {run: 1, luminosityBlock: 156, event: 46510, PV_npvs: 18, PV_x: 0.244, ...},
 ...,
 {run: 1, luminosityBlock: 996, event: 298792, PV_npvs: 24, PV_x: 0.244, ...},
 {run: 1, luminosityBlock: 996, event: 298793, PV_npvs: 14, PV_x: 0.242, ...},
 {run: 1, luminosityBlock: 996, event: 298794, PV_npvs: 24, PV_x: 0.245, ...},
 {run: 1, luminosityBlock: 996, event: 298795, PV_npvs: 30, PV_x: 0.245, ...},
 {run: 1, luminosityBlock: 996, event: 298796, PV_npvs: 15, PV_x: 0.244, ...},
 {run: 1, luminosityBlock: 996, event: 298797, PV_npvs: 18, PV_x: 0.243, ...},
 {run: 1, luminosityBlock: 996, event: 298798, PV_npvs: 9, PV_x: 0.245, ...},
 {run: 1, luminosityBlock: 996, event: 298799, PV_npvs: 15, PV_x: 0.243, ...},
 {run: 1, luminosityBlock: 996, event: 298800, PV_npvs: 13, PV_x: 0.247, ...}]
------------------------------------------------------------------------------
type: 299973 * {
    run: int32,
    luminosityBlock: uint32,
    event: uint64,
    PV_npvs: int32,
    PV_x: float32,
    PV_y: float32,
    PV_z: float32,
    nMuon: uint32,
    Muon_pt: var * float32,
    Muon_eta: var * float32,
    Muon_phi: var * float32,
    Muon_mass: var * float32,
    Muon_charge: var * int32,
    Muon_pfRelIso03_all: var * float32,
    Muon_pfRelIso04_all: var * float32,
    Muon_dxy: var * float32,
    Muon_dxyErr: var * float32,
    Muon_dz: var * float32,
    Muon_dzErr: var * float32,
    nElectron: uint32,
    Electron_pt: var * float32,
    Electron_eta: var * float32,
    Electron_phi: var * float32,
    Electron_mass: var * float32,
    Electron_charge: var * int32,
    Electron_pfRelIso03_all: var * float32,
    Electron_dxy: var * float32,
    Electron_dxyErr: var * float32,
    Electron_dz: var * float32,
    Electron_dzErr: var * float32,
    MET_pt: float32,
    MET_phi: float32
}
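
Like Parquet, ROOT files are columnar, so Uproot can read a subset of branches instead of all 32. A minimal sketch (the variable name muon_branches is just for illustration):

muon_branches = tree.arrays(["Muon_pt", "Muon_eta"])   # read only these two ragged branches
muon_branches.fields                                   # ['Muon_pt', 'Muon_eta']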

Here it is again (with more JSON-like formatting) from a Parquet file.

events = ak.from_parquet("data/SMHiggsToZZTo4L.parquet")
events
[{run: 1, luminosityBlock: 156, event: 46501, PV: {...}, electron: [], ...},
 {run: 1, luminosityBlock: 156, event: 46502, PV: {...}, electron: [...], ...},
 {run: 1, luminosityBlock: 156, event: 46503, PV: {...}, electron: [...], ...},
 {run: 1, luminosityBlock: 156, event: 46504, PV: {...}, electron: ..., ...},
 {run: 1, luminosityBlock: 156, event: 46505, PV: {...}, electron: [...], ...},
 {run: 1, luminosityBlock: 156, event: 46506, PV: {...}, electron: ..., ...},
 {run: 1, luminosityBlock: 156, event: 46507, PV: {...}, electron: ..., ...},
 {run: 1, luminosityBlock: 156, event: 46508, PV: {...}, electron: ..., ...},
 {run: 1, luminosityBlock: 156, event: 46509, PV: {...}, electron: [...], ...},
 {run: 1, luminosityBlock: 156, event: 46510, PV: {...}, electron: [], ...},
 ...,
 {run: 1, luminosityBlock: 996, event: 298792, PV: {...}, electron: [...], ...},
 {run: 1, luminosityBlock: 996, event: 298793, PV: {...}, electron: ..., ...},
 {run: 1, luminosityBlock: 996, event: 298794, PV: {...}, electron: [], ...},
 {run: 1, luminosityBlock: 996, event: 298795, PV: {...}, electron: [...], ...},
 {run: 1, luminosityBlock: 996, event: 298796, PV: {...}, electron: [], ...},
 {run: 1, luminosityBlock: 996, event: 298797, PV: {...}, electron: ..., ...},
 {run: 1, luminosityBlock: 996, event: 298798, PV: {...}, electron: [], ...},
 {run: 1, luminosityBlock: 996, event: 298799, PV: {...}, electron: [...], ...},
 {run: 1, luminosityBlock: 996, event: 298800, PV: {...}, electron: [...], ...}]
--------------------------------------------------------------------------------
type: 299973 * {
    run: int32,
    luminosityBlock: int64,
    event: uint64,
    PV: Vector3D[
        x: float32,
        y: float32,
        z: float32
    ],
    electron: var * Momentum4D[
        pt: float32,
        eta: float32,
        phi: float32,
        mass: float32,
        charge: int32,
        pfRelIso03_all: float32,
        dxy: float32,
        dxyErr: float32,
        dz: float32,
        dzErr: float32
    ],
    muon: var * Momentum4D[
        pt: float32,
        eta: float32,
        phi: float32,
        mass: float32,
        charge: int32,
        pfRelIso03_all: float32,
        pfRelIso04_all: float32,
        dxy: float32,
        dxyErr: float32,
        dz: float32,
        dzErr: float32
    ],
    MET: Momentum2D[
        pt: float32,
        phi: float32
    ]
}

View the first event as Python lists and dicts (like JSON).

events[0].to_list()
{'run': 1,
 'luminosityBlock': 156,
 'event': 46501,
 'PV': {'x': 0.24369880557060242,
  'y': 0.3936990201473236,
  'z': 1.451307773590088},
 'electron': [],
 'muon': [{'pt': 63.04386901855469,
   'eta': -0.7186822295188904,
   'phi': 2.968005895614624,
   'mass': 0.10565836727619171,
   'charge': 1,
   'pfRelIso03_all': 0.0,
   'pfRelIso04_all': 0.0,
   'dxy': -0.004785160068422556,
   'dxyErr': 0.0060764215886592865,
   'dz': 0.09005985409021378,
   'dzErr': 0.044572051614522934},
  {'pt': 38.12034606933594,
   'eta': -0.8794569969177246,
   'phi': -1.0324749946594238,
   'mass': 0.10565836727619171,
   'charge': -1,
   'pfRelIso03_all': 0.0,
   'pfRelIso04_all': 0.0,
   'dxy': 0.0005746808601543307,
   'dxyErr': 0.0013040687190368772,
   'dz': -0.0032290113158524036,
   'dzErr': 0.003023269586265087},
  {'pt': 4.04868745803833,
   'eta': -0.320764422416687,
   'phi': 1.0385035276412964,
   'mass': 0.10565836727619171,
   'charge': 1,
   'pfRelIso03_all': 0.0,
   'pfRelIso04_all': 0.17997965216636658,
   'dxy': -0.00232272082939744,
   'dxyErr': 0.004343290813267231,
   'dz': -0.005162843037396669,
   'dzErr': 0.004190043080598116}],
 'MET': {'pt': 21.929929733276367, 'phi': -2.7301223278045654}}
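
To see what is available without printing whole records, the fields property lists the record field names at each level.

events.fields            # ['run', 'luminosityBlock', 'event', 'PV', 'electron', 'muon', 'MET']
events.electron.fields   # pt, eta, phi, mass, charge, and the other per-electron fields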

Get one numeric field (also known as “column”).

events.electron.pt
[[],
 [21.9, 42.6, 78, 23.8],
 [11.6, 6.69],
 [10.4],
 [30.7, 29.2, 6.38, 6.24],
 [16.1],
 [5.32],
 [6.99],
 [41, 6.5, 13.1, 52.1],
 [],
 ...,
 [19.1, 9.69],
 [30.2],
 [],
 [37, 7.36, 48],
 [],
 [12.3],
 [],
 [17.9, 23.2],
 [48.1, 38.7]]
----------------------------
type: 299973 * var * float32

Compute something (\(p_z = p_T \sinh\eta\)).

import numpy as np
events.electron.pt * np.sinh(events.electron.eta)
[[],
 [-16.7, -48.8, -83.9, -43.5],
 [-11.1, 26],
 [-0.237],
 [-19.9, 47.5, -18, -15.1],
 [5.58],
 [-3.22],
 [3.88],
 [-8.92, -10.5, -23.8, -17.3],
 [],
 ...,
 [26.4, 19.1],
 [92.2],
 [],
 [-193, -2.78, -43.4],
 [],
 [-3.4],
 [],
 [80.1, 99.3],
 [26.8, 74]]
------------------------------
type: 299973 * var * float32

The Vector library also has an Awkward Array backend (see the Vector documentation for details). To use it, call vector.register_awkward() after importing the library.

import vector
vector.register_awkward()

Now records with name="Momentum4D" and fields with coordinate names (px, py, pz, E or pt, phi, eta, m) automatically get Vector properties and methods.

events.electron.type.show()
299973 * var * Momentum4D[
    pt: float32,
    eta: float32,
    phi: float32,
    mass: float32,
    charge: int32,
    pfRelIso03_all: float32,
    dxy: float32,
    dxyErr: float32,
    dz: float32,
    dzErr: float32
]
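
The records in this Parquet file already carry the name Momentum4D. If you start from plain arrays instead, a sketch of attaching that name yourself is ak.zip with with_name (the variable name muon_vectors is just for illustration):

muon_vectors = ak.zip(
    {"pt": events.muon.pt, "eta": events.muon.eta, "phi": events.muon.phi, "mass": events.muon.mass},
    with_name="Momentum4D",   # Vector recognizes this name and adds momentum properties/methods
)
muon_vectors.pz   # computed on the fly from pt and eta
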
# implicitly computes pz = pt * sinh(eta)
events.electron.pz
[[],
 [-16.7, -48.8, -83.9, -43.5],
 [-11.1, 26],
 [-0.237],
 [-19.9, 47.5, -18, -15.1],
 [5.58],
 [-3.22],
 [3.88],
 [-8.92, -10.5, -23.8, -17.3],
 [],
 ...,
 [26.4, 19.1],
 [92.2],
 [],
 [-193, -2.78, -43.4],
 [],
 [-3.4],
 [],
 [80.1, 99.3],
 [26.8, 74]]
------------------------------
type: 299973 * var * float32

To make histograms or other plots, we need numbers without structure, so use ak.flatten to remove the list structure first.

from hist import Hist
Hist.new.Regular(100, 0, 100, name="electron pT (GeV)").Double().fill(
    ak.flatten(events.electron.pt)
).plot();
(figure: histogram of electron pT (GeV), 100 bins from 0 to 100)

Each event has a different number of electrons and muons (use ak.num to check).

ak.num(events.electron), ak.num(events.muon)
(<Array [0, 4, 2, 1, 4, 1, 1, 1, ..., 0, 3, 0, 1, 0, 2, 2] type='299973 * int64'>,
 <Array [3, 0, 0, 7, 0, 2, 1, 0, ..., 2, 0, 2, 2, 4, 0, 0] type='299973 * int64'>)

So what happens if we try to compute something with the electrons’ \(p_T\) and the muons’ \(\eta\)?

events.electron.pt * np.sinh(events.muon.eta)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[37], line 1
----> 1 events.electron.pt * np.sinh(events.muon.eta)

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_operators.py:54, in _binary_method.<locals>.func(self, other)
     51 if _disables_array_ufunc(other):
     52     return NotImplemented
---> 54 return ufunc(self, other)

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/highlevel.py:1594, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1592 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
   1593 with ak._errors.OperationErrorContext(name, inputs, kwargs):
-> 1594     return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_connect/numpy.py:469, in array_ufunc(ufunc, method, inputs, kwargs)
    461         raise TypeError(
    462             "no {}.{} overloads for custom types: {}".format(
    463                 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
    464             )
    465         )
    467     return None
--> 469 out = ak._broadcasting.broadcast_and_apply(
    470     inputs,
    471     action,
    472     depth_context=depth_context,
    473     lateral_context=lateral_context,
    474     allow_records=False,
    475     function_name=ufunc.__name__,
    476 )
    478 out_named_axis = functools.reduce(
    479     _unify_named_axis, lateral_context[NAMED_AXIS_KEY].named_axis
    480 )
    481 if len(out) == 1:

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:1200, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
   1198 backend = backend_of(*inputs, coerce_to_common=False)
   1199 isscalar = []
-> 1200 out = apply_step(
   1201     backend,
   1202     broadcast_pack(inputs, isscalar),
   1203     action,
   1204     0,
   1205     depth_context,
   1206     lateral_context,
   1207     {
   1208         "allow_records": allow_records,
   1209         "left_broadcast": left_broadcast,
   1210         "right_broadcast": right_broadcast,
   1211         "numpy_to_regular": numpy_to_regular,
   1212         "regular_to_jagged": regular_to_jagged,
   1213         "function_name": function_name,
   1214         "broadcast_parameters_rule": broadcast_parameters_rule,
   1215     },
   1216 )
   1217 assert isinstance(out, tuple)
   1218 return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:1178, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   1176     return result
   1177 elif result is None:
-> 1178     return continuation()
   1179 else:
   1180     raise AssertionError(result)

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:1147, in apply_step.<locals>.continuation()
   1145 # Any non-string list-types?
   1146 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1147     return broadcast_any_list()
   1149 # Any RecordArrays?
   1150 elif any(x.is_record for x in contents):

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:671, in apply_step.<locals>.broadcast_any_list()
    668         nextinputs.append(x)
    669         nextparameters.append(NO_PARAMETERS)
--> 671 outcontent = apply_step(
    672     backend,
    673     nextinputs,
    674     action,
    675     depth + 1,
    676     copy.copy(depth_context),
    677     lateral_context,
    678     options,
    679 )
    680 assert isinstance(outcontent, tuple)
    681 parameters = parameters_factory(nextparameters, len(outcontent))

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:1178, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   1176     return result
   1177 elif result is None:
-> 1178     return continuation()
   1179 else:
   1180     raise AssertionError(result)

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:1147, in apply_step.<locals>.continuation()
   1145 # Any non-string list-types?
   1146 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1147     return broadcast_any_list()
   1149 # Any RecordArrays?
   1150 elif any(x.is_record for x in contents):

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:722, in apply_step.<locals>.broadcast_any_list()
    718 for i, ((named_axis, ndim), x, x_is_string) in enumerate(
    719     zip(named_axes_with_ndims, inputs, input_is_string)
    720 ):
    721     if isinstance(x, listtypes) and not x_is_string:
--> 722         next_content = broadcast_to_offsets_avoiding_carry(x, offsets)
    723         nextinputs.append(next_content)
    724         nextparameters.append(x._parameters)

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/_broadcasting.py:385, in broadcast_to_offsets_avoiding_carry(list_content, offsets)
    383         return list_content.content[:next_length]
    384     else:
--> 385         return list_content._broadcast_tooffsets64(offsets).content
    386 elif isinstance(list_content, ListArray):
    387     # Is this list contiguous?
    388     if index_nplike.array_equal(
    389         list_content.starts.data[1:], list_content.stops.data[:-1]
    390     ):
    391         # Does this list match the offsets?

File ~/micromamba/envs/array-oriented-programming/lib/python3.11/site-packages/awkward/contents/listoffsetarray.py:420, in ListOffsetArray._broadcast_tooffsets64(self, offsets)
    415     next_content = self._content[this_start:]
    417 if index_nplike.known_data and not index_nplike.array_equal(
    418     this_zero_offsets, offsets.data
    419 ):
--> 420     raise ValueError("cannot broadcast nested list")
    422 return ListOffsetArray(
    423     offsets, next_content[: offsets[-1]], parameters=self._parameters
    424 )

ValueError: cannot broadcast nested list

This error occurred while calling

    numpy.multiply.__call__(
        <Array [[], [21.9, ...], ..., [48.1, 38.7]] type='299973 * var * fl...'>
        <Array [[-0.782, -0.997, -0.326], ..., []] type='299973 * var * flo...'>
    )

This is data structure-aware, array-oriented programming.

Application: Filtering events with an array of booleans#

events.MET.pt > 20 is a boolean array with the same length as events, so applying it as a slice selects events with MET \(p_T > 20\) GeV.

events.MET.pt, events.MET.pt > 20
(<Array [21.9, 17, 19.1, 30.9, ..., 17.7, 24, 12.9] type='299973 * float32'>,
 <Array [True, False, False, True, ..., False, True, False] type='299973 * bool'>)

After slicing, the number of events is smaller.

len(events), len(events[events.MET.pt > 20])
(299973, 163222)
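
Boolean event masks can be combined with & and | (parenthesized, because these operators bind tighter than comparisons) before slicing. A sketch requiring both missing transverse energy and at least two muons:

selected = events[(events.MET.pt > 20) & (ak.num(events.muon) >= 2)]
len(selected)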

Application: Filtering particles with an array of lists of booleans#

events.electron.pt > 30 is a ragged boolean array with the same length-per-list as events.electron. Applying it as a slice selects electrons with \(p_T > 30\) GeV.

events.electron.pt, events.electron.pt > 30
(<Array [[], [21.9, ..., 23.8], ..., [48.1, 38.7]] type='299973 * var * float32'>,
 <Array [[], [False, ..., False], ..., [True, True]] type='299973 * var * bool'>)

After slicing, the number of electrons in each event can be smaller.

ak.num(events.electron), ak.num(events.electron[events.electron.pt > 30])
(<Array [0, 4, 2, 1, 4, 1, 1, 1, ..., 0, 3, 0, 1, 0, 2, 2] type='299973 * int64'>,
 <Array [0, 2, 0, 0, 1, 0, 0, 0, ..., 0, 2, 0, 0, 0, 0, 2] type='299973 * int64'>)

Mini-quiz 1: Using the reducer ak.any, how would we select events in which any electron has \(p_T > 30\) GeV?

Ragged combinatorics#

Awkward Array has two combinatorial primitives, ak.cartesian and ak.combinations:

numbers = ak.Array([[1, 2, 3], [], [4]])
letters = ak.Array([["a", "b"], ["c"], ["d", "e"]])
ak.cartesian([numbers, letters])
[[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')],
 [],
 [(4, 'd'), (4, 'e')]]
--------------------------------------------------------------
type: 3 * var * (
    int64,
    string
)
values = ak.Array([[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]])
ak.combinations(values, 2)
[[(1.1, 2.2), (1.1, 3.3), (1.1, 4.4), (2.2, ...), (2.2, 4.4), (3.3, 4.4)],
 [],
 [(5.5, 6.6)]]
--------------------------------------------------------------------------
type: 3 * var * (
    float64,
    float64
)

Often, it’s useful to separate the left-hand sides and right-hand sides of these pairs with ak.unzip, so they can be used in mathematical expressions.

electron_muon_pairs = ak.cartesian([events.electron, events.muon])
electron_in_pair, muon_in_pair = ak.unzip(electron_muon_pairs)
electron_in_pair.pt, muon_in_pair.pt
(<Array [[], [], [], [10.4, ...], ..., [], [], []] type='299973 * var * float32'>,
 <Array [[], [], [], [54.3, ...], ..., [], [], []] type='299973 * var * float32'>)
ak.num(electron_in_pair), ak.num(muon_in_pair)
(<Array [0, 0, 0, 7, 0, 2, 1, 0, ..., 0, 0, 0, 2, 0, 0, 0] type='299973 * int64'>,
 <Array [0, 0, 0, 7, 0, 2, 1, 0, ..., 0, 0, 0, 2, 0, 0, 0] type='299973 * int64'>)

To use Vector’s deltaR method (\(\Delta R = \sqrt{\Delta\phi^2 + \Delta\eta^2}\)), we need to have the electrons and muons in separate arrays.

electron_in_pair, muon_in_pair = ak.unzip(ak.cartesian([events.electron, events.muon]))
electron_in_pair.deltaR(muon_in_pair)
[[],
 [],
 [],
 [1.07, 0.405, 1.97, 2.78, 1.01, 0.906, 2.79],
 [],
 [2.55, 0.8],
 [2.63],
 [],
 [],
 [],
 ...,
 [0.44, 3.16, 3.01, 1.05],
 [3.06, 1.52],
 [],
 [],
 [],
 [2.54, 0.912],
 [],
 [],
 []]
----------------------------------------------
type: 299973 * var * float32

Same for combinations from a single collection.

first_electron_in_pair, second_electron_in_pair = ak.unzip(ak.combinations(events.electron, 2))
first_electron_in_pair.deltaR(second_electron_in_pair)
[[],
 [2.02, 2.35, 1, 0.347, 1.3, 1.64],
 [2.96],
 [],
 [3.31, 1.23, 2.39, 3.8, 3.19, 2.61],
 [],
 [],
 [],
 [1.18, 2.05, 2.98, 2.26, 2.59, 1.9],
 [],
 ...,
 [2.87],
 [],
 [],
 [2.05, 2.92, 3.02],
 [],
 [],
 [],
 [2.96],
 [2.72]]
-------------------------------------
type: 299973 * var * float32

Application: Z mass computed from all combinations of electrons#

A rough Z mass plot can be made in three lines of code.

first_electron_in_pair, second_electron_in_pair = ak.unzip(ak.combinations(events.electron, 2))
z_mass = (first_electron_in_pair + second_electron_in_pair).mass
Hist.new.Reg(120, 0, 120, name="mass (GeV)").Double().fill(
    ak.flatten(z_mass, axis=-1)
).plot();
(figure: histogram of electron-pair mass (GeV), 120 bins from 0 to 120)
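
A possible refinement to explore (a sketch, not part of the lesson’s three lines): each electron record carries a charge field, so the pairs can be restricted to opposite charges before filling the histogram.

opposite_charge = first_electron_in_pair.charge != second_electron_in_pair.charge
z_mass_opposite = z_mass[opposite_charge]   # the boolean mask has the same ragged structure as z_mass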

Lesson 3 project: Higgs combinatorics from arrays#

As described in the intro, navigate to the notebooks directory and open lesson-3-project.ipynb, then follow its instructions.