Ragged data and Graph Neural Networks (GNNs)#

So far, all of the neural networks we’ve considered take and return fixed-size vectors as inputs and outputs. However, HEP data usually comes in variable-length lists:

import numpy as np
import awkward as ak
import uproot
event_data = uproot.open("data/SMHiggsToZZTo4L.root")["Events"].arrays()
event_data.Muon_pt
[[63, 38.1, 4.05],
 [],
 [],
 [54.3, 23.5, 52.9, 4.33, 5.35, 8.39, 3.49],
 [],
 [38.5, 47],
 [4.45],
 [],
 [],
 [],
 ...,
 [37.2, 50.1],
 [43.2, 24],
 [24.2, 79.5],
 [],
 [9.81, 25.5],
 [32.6, 43.1],
 [4.32, 4.36, 5.63, 4.75],
 [],
 []]
--------------------------------------------
backend: cpu
nbytes: 4.2 MB
type: 299973 * var * float32

This dataset has 299973 events with a different number of muons in each event, and therefore a different number of muon \(p_T\) values in each event. As an array, the second axis has a different number of components in each element of the first axis. This is called a ragged array or jagged array.

How can a neural network accept ragged data? If we’re talking about features as vectors \(\vec{x}_i\) for \(i \in [0, N)\) where \(N\) is the number of data points, each \(\vec{x}_i\) would have a different number of dimensions. How can we perform a linear transformation from a variable number of dimensions to some other space? Raggedness breaks the whole idea of how a neural network works.

Pad and truncate#

One option is to force the variable-length dimension to be regular. We can assert that the incoming data are \(n\)-dimensional for some fixed \(n\), remove dimensions if there are too many, and insert zeros if there are too few.

Following this section of the Awkward Array User Guide, it can be done with ak.pad_none and ak.fill_none.

ak.fill_none(ak.pad_none(event_data.Muon_pt, 2, clip=True), 0)
[[63, 38.1],
 [0, 0],
 [0, 0],
 [54.3, 23.5],
 [0, 0],
 [38.5, 47],
 [4.45, 0],
 [0, 0],
 [0, 0],
 [0, 0],
 ...,
 [37.2, 50.1],
 [43.2, 24],
 [24.2, 79.5],
 [0, 0],
 [9.81, 25.5],
 [32.6, 43.1],
 [4.32, 4.36],
 [0, 0],
 [0, 0]]
--------------------------
backend: cpu
nbytes: 4.8 MB
type: 299973 * 2 * float64

The clip=True argument of ak.pad_none truncates (removing the third and higher element from each list) while the function pads missing dimensions with None. The ak.fill_none function can then replace the None values with a chosen value, like 0. This makes all \(\vec{x}_i\) 2-dimensional, which you can see in the data type: 2 * float64 instead of var * float64.

Maybe now you want to convert it ak.to_numpy before passing it to PyTorch to make a tensor. (As of this writing, PyTorch doesn’t recognize Awkward Arrays and it slowly iterates over them, rather than converting them in compiled code.)

import torch
torch.tensor(ak.to_numpy(
    ak.fill_none(ak.pad_none(event_data.Muon_pt, 2, clip=True), 0)
), dtype=torch.float32)
tensor([[63.0439, 38.1203],
        [ 0.0000,  0.0000],
        [ 0.0000,  0.0000],
        ...,
        [ 4.3161,  4.3588],
        [ 0.0000,  0.0000],
        [ 0.0000,  0.0000]])

These two input features are the \(p_T\) of the first muon and the \(p_T\) of the second muon, imputing zeros if either does not exist. Usually, you’ll want to use all or many of the particle’s attributes as features, so you’ll need to np.stack them together (or, equivalently, create a length-1 axis with np.newaxis and np.concatenate along the new axis). You would have to do this whether the data are ragged or not.

muon_fields = [name for name in event_data.fields if name.startswith("Muon_")]
muon_fields
['Muon_pt',
 'Muon_eta',
 'Muon_phi',
 'Muon_mass',
 'Muon_charge',
 'Muon_pfRelIso03_all',
 'Muon_pfRelIso04_all',
 'Muon_dxy',
 'Muon_dxyErr',
 'Muon_dz',
 'Muon_dzErr']
np.concatenate([event_data[name, :, :, np.newaxis] for name in muon_fields], axis=-1)
[[[63, -0.719, 2.97, 0.106, 1, ..., 0, -0.00479, 0.00608, 0.0901, 0.0446], ...],
 [],
 [],
 [[54.3, -1.06, -0.362, 0.106, ..., 0.000475, 0.00132, 0.00248, 0.0031], ...],
 [],
 [[38.5, 0.315, 2.05, 0.106, ..., -0.000141, 0.000984, -0.00187, 0.00294], ...],
 [[4.45, -0.986, 1.12, 0.106, 1, ..., -999, -0.00387, 0.00323, 10.4, 0.00521]],
 [],
 [],
 [],
 ...,
 [[37.2, 1.1, -0.875, 0.106, ..., 2.29e-05, 0.00141, -0.00516, 0.00406], ...],
 [[43.2, 2.15, -1.3, 0.106, 1, ..., -0.00216, 0.00175, -0.00404, 0.00889], ...],
 [[24.2, 0.327, -0.997, 0.106, ..., -0.00159, 0.0017, 0.00121, 0.00284], ...],
 [],
 [[9.81, 2.07, 1.66, 0.106, ..., -0.000208, 0.00271, -0.00588, 0.00934], ...],
 [[32.6, 1.11, -0.981, 0.106, ..., -0.000776, 0.00137, -0.00712, 0.00354], ...],
 [[4.32, -2.09, 0.988, 0.106, 1, ..., 3.45, 0.227, 0.0158, -2.39, 0.0709], ...],
 [],
 []]
--------------------------------------------------------------------------------
backend: cpu
nbytes: 41.2 MB
type: 299973 * var * 11 * float64

The array above is 299973 events with a variable number of 11-dimensional feature vectors. You can pad and truncate the above as we did with a single attribute or you can pad and truncate first, then np.stack the rectangular arrays.

rectangular_arrays = [ak.fill_none(ak.pad_none(event_data[name], 2, clip=True), 0) for name in muon_fields]
rectangular_arrays
[<Array [[63, 38.1], [0, 0], ..., [0, ...], [0, 0]] type='299973 * 2 * float64'>,
 <Array [[-0.719, -0.879], [0, 0], ..., [0, 0]] type='299973 * 2 * float64'>,
 <Array [[2.97, -1.03], [0, 0], ..., [...], [0, 0]] type='299973 * 2 * float64'>,
 <Array [[0.106, 0.106], [0, 0], ..., [...], [0, 0]] type='299973 * 2 * float64'>,
 <Array [[1, -1], [0, 0], [...], ..., [0, 0], [0, 0]] type='299973 * 2 * int64'>,
 <Array [[0, 0], [0, 0], [...], ..., [0, 0], [0, 0]] type='299973 * 2 * float64'>,
 <Array [[0, 0], [0, 0], [...], ..., [0, 0], [0, 0]] type='299973 * 2 * float64'>,
 <Array [[-0.00479, 0.000575], [...], ..., [0, 0]] type='299973 * 2 * float64'>,
 <Array [[0.00608, 0.0013], [0, ...], ..., [0, 0]] type='299973 * 2 * float64'>,
 <Array [[0.0901, -0.00323], [0, ...], ..., [0, 0]] type='299973 * 2 * float64'>,
 <Array [[0.0446, 0.00302], [0, ...], ..., [0, 0]] type='299973 * 2 * float64'>]
np.stack(rectangular_arrays, axis=-1)
[[[63, -0.719, 2.97, 0.106, 1, ..., 0, -0.00479, 0.00608, 0.0901, 0.0446], ...],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 [[54.3, -1.06, -0.362, 0.106, ..., 0.000475, 0.00132, 0.00248, 0.0031], ...],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 [[38.5, 0.315, 2.05, 0.106, ..., -0.000141, 0.000984, -0.00187, 0.00294], ...],
 [[4.45, -0.986, 1.12, 0.106, 1, ..., -0.00387, 0.00323, 10.4, 0.00521], ...],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 ...,
 [[37.2, 1.1, -0.875, 0.106, ..., 2.29e-05, 0.00141, -0.00516, 0.00406], ...],
 [[43.2, 2.15, -1.3, 0.106, 1, ..., -0.00216, 0.00175, -0.00404, 0.00889], ...],
 [[24.2, 0.327, -0.997, 0.106, ..., -0.00159, 0.0017, 0.00121, 0.00284], ...],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 [[9.81, 2.07, 1.66, 0.106, ..., -0.000208, 0.00271, -0.00588, 0.00934], ...],
 [[32.6, 1.11, -0.981, 0.106, ..., -0.000776, 0.00137, -0.00712, 0.00354], ...],
 [[4.32, -2.09, 0.988, 0.106, 1, ..., 3.45, 0.227, 0.0158, -2.39, 0.0709], ...],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0]]]
--------------------------------------------------------------------------------
backend: cpu
nbytes: 52.8 MB
type: 299973 * 2 * 11 * float64

While working this out in your own data, check the first few values of the output to be sure that it’s stacking along the axis that you expect. The np.column_stack that we usually use applies to axis=1, and this dataset is one level deeper because the first level is “events,” the second is “particles,” and the third is “particle attributes.” (The same warning applies to image data in Convolutional Neural Networks (CNNs), which has row pixels, column pixels, and channels as extra levels.) Since we want to concatenate attributes together as the deepest level of nesting, we can use axis=-1 as a shorthand for “deepest level.”

Now that I’ve shown how to pad and truncate, is it a good idea?

  • Padding with zeros or some value that can’t appear in the data is not bad, because we’re feeding it into a neural network. After training, the neural network will learn to treat the pad value as a special value, possibly by dedicating some ReLU neurons to it to put it in a category by itself.

  • Truncating, however, throws away possibly useful information. Moreover, it always throws away values from one end of the list and not the other: if we truncate to 2 muons, the model will never see the 3rd or 4th, but will always see the 1st and 2nd. If the muons are sorted in descending \(p_T\) and you know that the \(n\)th muon is never relevant when you truncate at \(n\), then this can be okay.

Recurrent Neural Networks (RNNs)#

As an alternative, we can consider using a Recurrent Neural Network (RNN). In the first section of this course on Neural networks, I discussed the general case of a neural network with cycles—neurons that pass data to neurons that comes back to the original neuron—and then dismissed it. Since then, we’ve only considered graph topologies without cycles.

An RNN feeds data back to itself at a later time. For instance, a network may take \(\vec{x}_i\) as input and return \(\vec{y}_i\) as output, but then the next input would be some combination of \(\vec{x}_{i + 1}\) (the next data point) and \(\vec{y}_i\) (the previous output). This has been especially fruitful in text processing, such as an autocomplete engine that takes an 8-dimensional fixed-size vector like

TO BE OR

predicts  NOT and then is fed

E OR NOT

as the next input (which then generates random Shakespeare). Incidentally, this kind of autocomplete application was producing astonishingly life-like text in 2015, 7 years before ChatGPT.

The relevance for HEP is that sets of particles in an event can be thought of as characters in a sentence: a sequence of arbitrary length is ended by an end-of-event token (such as 0 for all attributes), just as a sentence ends with punctuation or a prescribed end-of-message token.

For example, instead of a ragged array like

event_data.Muon_pt
[[63, 38.1, 4.05],
 [],
 [],
 [54.3, 23.5, 52.9, 4.33, 5.35, 8.39, 3.49],
 [],
 [38.5, 47],
 [4.45],
 [],
 [],
 [],
 ...,
 [37.2, 50.1],
 [43.2, 24],
 [24.2, 79.5],
 [],
 [9.81, 25.5],
 [32.6, 43.1],
 [4.32, 4.36, 5.63, 4.75],
 [],
 []]
--------------------------------------------
backend: cpu
nbytes: 4.2 MB
type: 299973 * var * float32

we could end each list with 0,

ak.concatenate([event_data.Muon_pt, np.zeros((len(event_data), 1))], axis=-1)
[[63, 38.1, 4.05, 0],
 [0],
 [0],
 [54.3, 23.5, 52.9, 4.33, 5.35, 8.39, 3.49, 0],
 [0],
 [38.5, 47, 0],
 [4.45, 0],
 [0],
 [0],
 [0],
 ...,
 [37.2, 50.1, 0],
 [43.2, 24, 0],
 [24.2, 79.5, 0],
 [0],
 [9.81, 25.5, 0],
 [32.6, 43.1, 0],
 [4.32, 4.36, 5.63, 4.75, 0],
 [0],
 [0]]
-----------------------------------------------
backend: cpu
nbytes: 8.3 MB
type: 299973 * var * float64

and then ak.flatten them into a single stream:

ak.flatten(
    ak.concatenate([event_data.Muon_pt, np.zeros((len(event_data), 1))], axis=-1)
)
[63,
 38.1,
 4.05,
 0,
 0,
 0,
 54.3,
 23.5,
 52.9,
 4.33,
 ...,
 43.1,
 0,
 4.32,
 4.36,
 5.63,
 4.75,
 0,
 0,
 0]
----------------------
backend: cpu
nbytes: 5.9 MB
type: 741367 * float64

There are many different types of RNNs—as there are many ways of adding cycles to a graph—but they would all train on data with ragged structure replaced by sequential structure.

Permutation invariance and data augmentation#

The problem with the above approach is that HEP events contain sets of particles and RNNs train on sequences. When we give the model artificial information, such as the order of particles in an event, it will attempt to use that information to predict the target. In the best case, the model will learn that this extra information is irrelevant, but that might not happen. The common technique of sorting particles by descending \(p_T\) is helpful—the most relevant particles will be at the beginning of each list and the low-energy radiation or spurious, misreconstructed data will be at the end of each list—but small differences in \(p_T\) can swap the order of two particles, and the relevance of two particles to each other rarely depends on their having similar \(p_T\). For instance, two muons from the same \(Z\) boson decay probably have very different \(p_T\) values, and muons from a completely different decay will likely be in between.

If our data consists of sets and we represent them as sequences (arrays), then the meaning of those sequences is unchanged by permuting (reordering) their elements. This is a symmetry of the data, and we want it to be a symmetry of the model as well.

One way of imposing a symmetry is to randomly augment the training data. For instance, if we want a jet image model, such as the one from the section on Convolutional Neural Networks (CNNs), to not depend on the rotation angle of the jet image, we can randomly rotate images in the training data. In fact, we can increase the size of our training dataset by including copies of the same event with different rotations. Similarly, if we want an image model to be independent of translations (shifting left-right or up-down), we can randomly augment the training data with this symmetry operator as well. The symmetry operator for order is permutation.

Here’s how to randomly shuffle ragged arrays, using ak.num to get the number of elements in each list, np.random.uniform to generate random numbers, ak.unflatten to give the random numbers the same structure as the original data, and ak.argsort to sort the random numbers and apply that order to the original data.

counts = ak.num(event_data.Muon_pt)
counts
[3,
 0,
 0,
 7,
 0,
 2,
 1,
 0,
 0,
 0,
 ...,
 2,
 2,
 2,
 0,
 2,
 2,
 4,
 0,
 0]
--------------------
backend: cpu
nbytes: 2.4 MB
type: 299973 * int64
random_per_muon = ak.unflatten(np.random.uniform(0, 1, np.sum(counts)), counts)
random_per_muon
[[0.224, 0.984, 0.701],
 [],
 [],
 [0.337, 0.161, 0.222, 0.151, 0.971, 0.881, 0.834],
 [],
 [0.719, 0.624],
 [0.552],
 [],
 [],
 [],
 ...,
 [0.229, 0.559],
 [0.00202, 0.467],
 [0.774, 0.0636],
 [],
 [0.397, 0.758],
 [0.367, 0.837],
 [0.0249, 0.34, 0.51, 0.855],
 [],
 []]
---------------------------------------------------
backend: cpu
nbytes: 5.9 MB
type: 299973 * var * float64
indexes = ak.argsort(random_per_muon, axis=-1)
indexes
[[0, 2, 1],
 [],
 [],
 [3, 1, 2, 0, 6, 5, 4],
 [],
 [1, 0],
 [0],
 [],
 [],
 [],
 ...,
 [0, 1],
 [0, 1],
 [1, 0],
 [],
 [0, 1],
 [0, 1],
 [0, 1, 2, 3],
 [],
 []]
--------------------------
backend: cpu
nbytes: 5.9 MB
type: 299973 * var * int64
event_data.Muon_pt[indexes]
[[63, 4.05, 38.1],
 [],
 [],
 [4.33, 23.5, 52.9, 54.3, 3.49, 8.39, 5.35],
 [],
 [47, 38.5],
 [4.45],
 [],
 [],
 [],
 ...,
 [37.2, 50.1],
 [43.2, 24],
 [79.5, 24.2],
 [],
 [9.81, 25.5],
 [32.6, 43.1],
 [4.32, 4.36, 5.63, 4.75],
 [],
 []]
--------------------------------------------
backend: cpu
nbytes: 4.2 MB
type: 299973 * var * float32

The muon \(p_T\) values above are shuffled within each event (but not across events). To shuffle all of the other attributes (event_data.Muon_eta, event_data.Muon_phi, etc.) with the same order, slice them with the same indexes as above.

Building permutation invariance into the model#

Another way to ensure that the model has permutation symmetry is to compute it using only operations that preserve that symmetry. For instance, consider the following:

When we have \(n\) muons in an event, we can consider passing the attributes of each one of those \(n\) muons into the same neural network, then compute the sum (or minimum, maximum, mean, or other aggregation) of the network’s outputs to produce a single fixed-size vector for the next stage of processing. The next stage could be more neural network layers.

The next event might have a different number of muons, but they would each be passed through the network and aggregated the same way. Since the aggregation can take an arbitrary number of inputs and it doesn’t priviledge the inputs (the aggregation can’t be first or last or something), a model built from it is automatically permutation invariant.

You might be wondering about the loss of information inherent in aggregating \(n\) vectors into \(1\) vector. You can compensate for that by making the network increase the number of dimensions from \(\vec{x}^{L1}\) to the \(\vec{y}\) that gets aggregated. This gives the optimizer wiggle room to preserve information about multiple inputs. The number of dimensions that you should give \(\vec{y}\) depends on the average number of inputs \(n\), so you’ll need to be sure to tune this hyperparameter for a particular training dataset.

To implement this, we’ll need aggregation operators that propagate derivatives and can be used in a PyTorch model. PyTorch Geometric is an extension of PyTorch for Graph Neural Networks (GNNs).

from torch_geometric.nn import aggr

PyTorch Geometric wants ragged arrays to be expressed as two equal-length torch.Tensors: one contains the flattened data (dtype=torch.float32) and the other contains the index of the list that it originally belonged to (dtype=torch.int64).

That is, a ragged array like

event_data.Muon_pt
[[63, 38.1, 4.05],
 [],
 [],
 [54.3, 23.5, 52.9, 4.33, 5.35, 8.39, 3.49],
 [],
 [38.5, 47],
 [4.45],
 [],
 [],
 [],
 ...,
 [37.2, 50.1],
 [43.2, 24],
 [24.2, 79.5],
 [],
 [9.81, 25.5],
 [32.6, 43.1],
 [4.32, 4.36, 5.63, 4.75],
 [],
 []]
--------------------------------------------
backend: cpu
nbytes: 4.2 MB
type: 299973 * var * float32

would have to be expanded by broadcasting (ak.broadcast_arrays) increasing integers made by np.arange.

ragged_data, ragged_index = ak.broadcast_arrays(
    event_data.Muon_pt,
    np.arange(len(event_data))[:, np.newaxis],
)

flattened_data = torch.tensor(ak.to_numpy(ak.flatten(ragged_data)), dtype=torch.float32)
flattened_index = torch.tensor(ak.to_numpy(ak.flatten(ragged_index)), dtype=torch.int64)

After broadcasting, the ragged_index has the same list lengths as ragged_data, containing duplicated event numbers:

ragged_index
[[0, 0, 0],
 [],
 [],
 [3, 3, 3, 3, 3, 3, 3],
 [],
 [5, 5],
 [6],
 [],
 [],
 [],
 ...,
 [299964, 299964],
 [299965, 299965],
 [299966, 299966],
 [],
 [299968, 299968],
 [299969, 299969],
 [299970, 299970, 299970, 299970],
 [],
 []]
----------------------------------
backend: cpu
nbytes: 5.9 MB
type: 299973 * var * int64

While flattened_index tells aggr.SumAggregation how to add the values in flattened_data without crossing event boundaries:

aggr.SumAggregation()(flattened_data[:, np.newaxis], flattened_index)
tensor([[105.2129],
        [  0.0000],
        [  0.0000],
        ...,
        [ 35.3251],
        [ 75.6364],
        [ 19.0584]])

The [:, np.newaxis] is because PyTorch expects this tensor to be a feature tensor, with a fixed number of attributes in each value. Let’s make it one, pulling together both the process of interlacing attribute vectors into a feature vector and creating a ragged index.

ragged_features = [event_data[name, :, :, np.newaxis] for name in muon_fields]

_, ragged_index = ak.broadcast_arrays(event_data["Muon_pt"], np.arange(len(event_data))[:, np.newaxis])

ragged_data = ak.concatenate(ragged_features, axis=-1)

flattened_data = torch.tensor(ak.to_numpy(ak.flatten(ragged_data)), dtype=torch.float32)
flattened_index = torch.tensor(ak.to_numpy(ak.flatten(ragged_index)), dtype=torch.int64)

In the above,

  • ragged_features is a list of ragged arrays (all with the same shape) for each of the muon attributes,

  • ragged_index is event numbers broadcasted to one of the muon attributes ("Muon_pt"),

  • ragged_data is a ragged array of feature vectors,

  • flattened_data is a floating-point torch.Tensor of feature vectors without event boundaries,

  • flattened_index is an integer torch.Tensor of event numbers for each feature vector.

They can be used in a aggr.SumAggregation (or any other aggregation) without modification.

result = aggr.SumAggregation()(flattened_data, flattened_index)
result
tensor([[ 1.0521e+02, -1.9189e+00,  2.9740e+00,  ...,  1.1724e-02,
          8.1668e-02,  5.1785e-02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        ...,
        [ 3.5325e+01,  3.1119e+00, -1.4302e+00,  ...,  4.7087e-03,
         -9.4279e-03,  1.3053e-02],
        [ 7.5636e+01,  9.4951e-01,  1.2880e+00,  ...,  2.4561e-03,
         -4.6555e-03,  6.9863e-03],
        [ 1.9058e+01, -8.3898e+00,  3.5038e+00,  ..., -2.9970e+03,
         -2.9994e+03, -2.9969e+03]])

The result of the aggregation has dimensions of: number of events × number of muon attributes.

result.shape
torch.Size([299971, 11])

Here’s an example that passes feature vectors through a neural network, and then passes that into a 20-dimensional aggregation (larger than the 11 dimensions of the muon attributes):

from torch import nn
class NeuralNetworkThenSum(nn.Module):
    def __init__(self):
        super().__init__()   # let PyTorch do its initialization first
        self.neural_network = nn.Sequential(
            nn.Linear(11, 15),
            nn.Sigmoid(),
            nn.Linear(15, 20),
            nn.Sigmoid(),
        )
        self.sum = aggr.SumAggregation()

    def forward(self, flattened_data, flattened_index):
        return self.sum(self.neural_network(flattened_data), flattened_index)
model = NeuralNetworkThenSum()

model(flattened_data, flattened_index)
tensor([[1.4548, 1.2882, 1.8872,  ..., 1.2508, 1.2607, 1.3709],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [0.9667, 0.8344, 1.2813,  ..., 0.8125, 0.8170, 0.9291],
        [0.9667, 0.8109, 1.2953,  ..., 0.8182, 0.8282, 0.9295],
        [2.2665, 1.8713, 2.4060,  ..., 2.2153, 1.6398, 1.5451]],
       grad_fn=<ScatterAddBackward0>)

This fragment can be part of a larger network, such as DeepSets (ref), Attention (ref), or Transformers (ref).

DeepSets is the simplest: it just applies a few more layers of a standard neural network after the sum:

Applied to sets of particles, the first neural network transforms particle attributes to a latent space representing all particles in the event and the second one transforms this event data to a regression or classification.

Graph Neural Networks (GNNs)#

Just as linear fitting is the first step toward neural networks, DeepSets is the first step toward Graph Neural Networks (GNNs). Whereas a basic neural network takes fixed-size vectors as inputs and DeepSets takes sets of fixed-sized vectors as inputs, GNNs take graphs of fixed-sized vectors as inputs.

In this context, a “graph” is an abstract structure made of vertices and edges (not, for instance, a plot of data). Neural networks are computations described as graphs (see all the pictures above, on this page), but what’s special about GNNs is that they perform computations on data that are graphs.

DeepSets operates on data that are sets, and a set is just a graph without edges:

Set

Graph

DeepSets combines all objects in a single summation, treating all objects equally. GNNs, however, use the edges to assign weights in the summation or otherwise modify the aggregation in an edge-dependent way.

GNNs are a large field of study, and they are particularly useful in HEP because HEP objects are unordered with important interrelationships.