Data Discussion
Overview
Teaching: 10 min
Exercises: 5 min
Questions
What dataset is being used?
How do some of the variables look?
Objectives
Briefly describe the dataset.
Take a peek at some variables.
Code Example
Here we will import all the required libraries for the rest of the tutorial. All scikit-learn and PyTorch functions will be imported later on when they are required.
import pandas as pd # to store data as dataframe
import numpy as np # for numerical calculations such as histogramming
import matplotlib.pyplot as plt # for plotting
You can check the version of these packages by inspecting their __version__ attribute.
np.__version__
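The same works for the other packages. Note that for matplotlib the attribute lives on the top-level module rather than on pyplot, so one extra import is needed:
import matplotlib
print(pd.__version__, np.__version__, matplotlib.__version__)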
Let’s set the random seed that we’ll be using. This makes your results reproducible when you re-run the notebook.
seed_value = 420 # 42 is the answer to life, the universe and everything
from numpy.random import seed # import the function to set the random seed in NumPy
seed(seed_value) # set the seed value for random numbers in NumPy
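Note that numpy.random.seed only fixes NumPy’s global generator. Since we’ll import scikit-learn and PyTorch functions later, a fuller recipe (a sketch, assuming PyTorch is installed) would also seed the other common sources of randomness:
import random  # Python's built-in random module
random.seed(seed_value)  # seed Python's own generator

import torch
torch.manual_seed(seed_value)  # seed PyTorch's CPU (and default CUDA) generators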
Dataset Used
The dataset we will use in this tutorial is simulated ATLAS data. Each event corresponds to 4 detected leptons: some events correspond to a Higgs Boson decay (signal) and others do not (background). Various physical quantities such as lepton charge and transverse momentum are recorded for each event. The analysis in this tutorial loosely follows the discovery of the Higgs Boson.
# In this notebook we only process the main signal ggH125_ZZ4lep and the main background llll,
# for illustration purposes.
# You can add other backgrounds after if you wish.
samples = ["llll", "ggH125_ZZ4lep"]
Exploring the Dataset
Here we will format the dataset \((x_i, y_i)\) so we can explore! First, we need to open our data set and read it into pandas DataFrames.
# get data from files
DataFrames = {}  # define empty dictionary to hold dataframes
for s in samples:  # loop over samples
    DataFrames[s] = pd.read_csv("/kaggle/input/4lepton/" + s + ".csv")  # read .csv file
DataFrames["ggH125_ZZ4lep"]  # print signal data to take a look
Before diving into machine learning, think about whether your data needs any cleaning. In the case of this Higgs analysis, Higgs boson decays should produce 4 electrons, 4 muons, or 2 electrons and 2 muons. Let’s define a function to keep only events with one of those lepton combinations.
# cut on lepton type
def cut_lep_type(lep_type_0, lep_type_1, lep_type_2, lep_type_3):
    # first lepton is [0], 2nd lepton is [1] etc
    # for an electron lep_type is 11
    # for a muon lep_type is 13
    # only want to keep events which are one of eeee (44), eemumu (48), mumumumu (52)
    sum_lep_type = lep_type_0 + lep_type_1 + lep_type_2 + lep_type_3
    if sum_lep_type == 44 or sum_lep_type == 48 or sum_lep_type == 52:
        return True
    else:
        return False
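A quick sanity check of the function on hand-picked values, before applying it to the full dataset:
print(cut_lep_type(11, 11, 13, 13))  # eemumu: sum is 48 -> True
print(cut_lep_type(11, 11, 11, 13))  # 3 electrons + 1 muon: sum is 46 -> False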
We then need to apply this function to our DataFrames.
# apply cut on lepton type
for s in samples:
    # cut on lepton type using the function cut_lep_type defined above
    DataFrames[s] = DataFrames[s][
        np.vectorize(cut_lep_type)(
            DataFrames[s].lep_type_0,
            DataFrames[s].lep_type_1,
            DataFrames[s].lep_type_2,
            DataFrames[s].lep_type_3,
        )
    ]
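As an aside, np.vectorize is essentially a Python-level loop. On large DataFrames an equivalent pandas-native selection is usually faster; a sketch of the same cut written that way:
# equivalent cut without np.vectorize
for s in samples:
    sum_lep_type = (
        DataFrames[s].lep_type_0
        + DataFrames[s].lep_type_1
        + DataFrames[s].lep_type_2
        + DataFrames[s].lep_type_3
    )
    DataFrames[s] = DataFrames[s][sum_lep_type.isin([44, 48, 52])]  # eeee, eemumu, mumumumu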
DataFrames["ggH125_ZZ4lep"] # print signal data to take a look
Challenge
Another cut to clean our data is to keep only events where lep_charge_0 + lep_charge_1 + lep_charge_2 + lep_charge_3 == 0. Write out this function yourself, then apply it.
Solution
# cut on lepton charge
def cut_lep_charge(lep_charge_0, lep_charge_1, lep_charge_2, lep_charge_3):
    # only want to keep events where sum of lepton charges is 0
    sum_lep_charge = lep_charge_0 + lep_charge_1 + lep_charge_2 + lep_charge_3
    if sum_lep_charge == 0:
        return True
    else:
        return False

# apply cut on lepton charge
for s in samples:
    # cut on lepton charge using the function cut_lep_charge defined above
    DataFrames[s] = DataFrames[s][
        np.vectorize(cut_lep_charge)(
            DataFrames[s].lep_charge_0,
            DataFrames[s].lep_charge_1,
            DataFrames[s].lep_charge_2,
            DataFrames[s].lep_charge_3,
        )
    ]

DataFrames["ggH125_ZZ4lep"]  # print signal data to take a look
Plot some input variables
In any analysis searching for a signal, one wants to optimise the use of the various input variables. Often this optimisation aims to maximise the signal-to-background ratio. Here we define histograms for the variables that we’ll look to optimise.
lep_pt_2 = {  # dictionary containing plotting parameters for the lep_pt_2 histogram
    # change plotting parameters
    "bin_width": 1,  # width of each histogram bin
    "num_bins": 13,  # number of histogram bins
    "xrange_min": 7,  # minimum on x-axis
    "xlabel": r"$lep\_pt$[2] [GeV]",  # x-axis label
}
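These parameters describe fixed-width bins spanning xrange_min to xrange_min + num_bins * bin_width, i.e. 7 to 20 GeV here (that plot_SoverB builds its bins this way is our assumption; see its definition, linked below). A quick way to see the implied bin edges:
# bin edges implied by the plotting parameters above
bin_edges = np.arange(
    lep_pt_2["xrange_min"],  # start: first edge at 7 GeV
    lep_pt_2["xrange_min"] + (lep_pt_2["num_bins"] + 1) * lep_pt_2["bin_width"],  # stop: one past the last edge
    lep_pt_2["bin_width"],  # step: bin width
)
print(bin_edges)  # [ 7  8 ... 20], i.e. 13 bins covering 7-20 GeV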
Challenge
Write a dictionary of plotting parameters for lep_pt_1, using 28 bins. We’re using 28 bins here since lep_pt_1 extends to higher values than lep_pt_2.
Solution
lep_pt_1 = {  # dictionary containing plotting parameters for the lep_pt_1 histogram
    # change plotting parameters
    "bin_width": 1,  # width of each histogram bin
    "num_bins": 28,  # number of histogram bins
    "xrange_min": 7,  # minimum on x-axis
    "xlabel": r"$lep\_pt$[1] [GeV]",  # x-axis label
}
Now we define a dictionary for the histograms we want to plot.
SoverB_hist_dict = {
    "lep_pt_2": lep_pt_2,
    "lep_pt_1": lep_pt_1,
}  # add a histogram here if you want it plotted
Now let’s take a look at those variables. Because the code is a bit long, we pre-defined a function for you that illustrates the optimum cut value on individual variables, based on the signal-to-background ratio. You can check out the function definition here.
We’re not doing any machine learning just yet! We’re looking at the variables we’ll later use for machine learning.
from my_functions import plot_SoverB
plot_SoverB(DataFrames, SoverB_hist_dict)
Let’s talk through the lep_pt_2 plots.
- Imagine placing a cut at 7 GeV in the distributions of signal and background (1st plot). This means keeping all events above 7 GeV in the signal and background histograms.
- We then take the ratio of the number of signal events that pass this cut, to the number of background events that pass this cut. This gives us a starting value for S/B (2nd plot).
- We then increase this cut value to 8 GeV, 9 GeV, 10 GeV, 11 GeV, 12 GeV. Cuts at these values are throwing away more background than signal, so S/B increases.
- There comes a point around 13 GeV where we start throwing away too much signal, thus S/B starts to decrease.
- Our goal is to find the maximum in S/B, and place the cut there. You may have thought the maximum would be where the signal and background cross in the upper plot, but remember that the lower plot is the signal to background ratio of everything to the right of that x-value, not the signal to background ratio at that x-value.
The same logic applies to lep_pt_1.
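To make this concrete, here is a minimal sketch of how such an S/B scan could be computed by hand. The column name lep_pt_2 and the assumption that it is stored in GeV are taken from the plot labels above; plot_SoverB may differ in detail.
# minimal S/B scan sketch (assumes a lep_pt_2 column in GeV)
signal_pt = DataFrames["ggH125_ZZ4lep"]["lep_pt_2"]
background_pt = DataFrames["llll"]["lep_pt_2"]
for cut in range(7, 20):  # scan cut values from 7 to 19 GeV
    S = (signal_pt > cut).sum()  # signal events passing the cut
    B = (background_pt > cut).sum()  # background events passing the cut
    print(f"cut at {cut} GeV: S/B = {S / B:.3f}")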
In the ATLAS Higgs discovery paper, there are a number of numerical cuts applied, not just on lep_pt_1 and lep_pt_2.
Imagine having to separately optimise 5, 6, 7, 10… 20 variables! Not to mention that applying a cut on one variable can change the distribution of another, meaning you’d have to re-optimise… Nightmare.
This is where a machine learning model such as a neural network can come to the rescue. A machine learning model can optimise all variables at the same time.
A machine learning model not only optimises cuts, but can find correlations in many dimensions that will give better signal/background classification than individual cuts ever could.
That’s the end of the introduction to why one might want to use a machine learning model. If you’d like to try using one, just keep reading!
Key Points
You should know a bit about your data before any machine learning takes place.
It’s a good idea to visualise your data before any machine learning takes place.