Applying To Experimental Data

Overview

Teaching: 5 min
Exercises: 15 min

Questions

What about real, experimental data?

Are we there yet?

Objectives

Check that our machine learning models behave similarly with real experimental data.

Finish!

What about real, experimental data?

Notice that we’ve trained and tested our machine learning models on simulated data for signal and background. That’s why there are definite labels, y. This has been a case of supervised learning since we knew the labels (y) going into the game. Your machine learning models would then usually be applied to real experimental data once you’re happy with them.

To make sure our machine learning model makes sense when applied to real experimental data, we should check whether simulated data and real experimental data have the same shape in classifier threshold values.

We first need to get the real experimental data.

Challenge to end all challenges

Read data.csv like in the Data Discussion lesson. data.csv is in the same file folder as the files we’ve used so far.

Apply cut_lep_type and cut_lep_charge like in the Data Discussion lesson

Convert the data to a NumPy array, X_data, similar to the Data Preprocessing lesson. You may find the attribute .values useful to convert a pandas DataFrame to a Numpy array.

Don’t forget to transform using the scaler like in the Data Preprocessing lesson. Call the scaled data X_data_scaled.

Predict the labels your random forest classifier would assign to X_data_scaled. Call your predictions y_data_RF.
Solution to part 1
DataFrames['data'] = pd.read_csv('/kaggle/input/4lepton/data.csv') # read data.csv file
Solution to part 2
DataFrames["data"] = DataFrames["data"][
   np.vectorize(cut_lep_type)(
       DataFrames["data"].lep_type_0,
       DataFrames["data"].lep_type_1,
       DataFrames["data"].lep_type_2,
       DataFrames["data"].lep_type_3
   )
]
DataFrames["data"] = DataFrames["data"][
   np.vectorize(cut_lep_charge)(
       DataFrames["data"].lep_charge_0,
       DataFrames["data"].lep_charge_1,
       DataFrames["data"].lep_charge_2,
       DataFrames["data"].lep_charge_3
   )
]
Solution to part 3
X_data = DataFrames['data'][ML_inputs].values # .values converts straight to NumPy array
Solution to part 4
X_data_scaled = scaler.transform(X_data) # X_data now scaled same as training and testing sets
Solution to part 5
y_data_RF = RF_clf.predict(X_data_scaled) # make predictions on the data

Now we can overlay the real experimental data on the simulated data.

labels = ["background", "signal"]  # labels for simulated data
thresholds = (
    []
)  # define list to hold random forest classifier probability predictions for each sample
for s in samples:  # loop over samples
    thresholds.append(
        RF_clf.predict_proba(scaler.transform(DataFrames[s][ML_inputs]))[:, 1]
    )  # predict probabilities for each sample
plt.hist(
    thresholds, bins=np.arange(0, 0.8, 0.1), density=True, stacked=True, label=labels
)  # plot simulated data
data_hist = np.histogram(
    RF_clf.predict_proba(X_data_scaled)[:, 1], bins=np.arange(0, 0.8, 0.1), density=True
)[
    0
]  # histogram the experimental data
scale = sum(RF_clf.predict_proba(X_data_scaled)[:, 1]) / sum(
    data_hist
)  # get scale imposed by density=True
data_err = np.sqrt(data_hist * scale) / scale  # get error on experimental data
plt.errorbar(
    x=np.arange(0.05, 0.75, 0.1), y=data_hist, yerr=data_err, label="Data"
)  # plot the experimental data errorbars
plt.xlabel("Threshold")
plt.legend()

Within errors, the real experimental data errorbars agree with the simulated data histograms. Good news, our random forest classifier model makes sense with real experimental data!

At the end of the day

How many signal events is the random forest classifier predicting?

print(np.count_nonzero(y_data_RF == 1))  # signal

What about background?

print(np.count_nonzero(y_data_RF == 0))  # background

The random forest classifier is predicting how many real data events are signal and how many are background, how cool is that?!

Ready to machine learn to take over the world!

Hopefully you’ve enjoyed this brief discussion on machine learning! Try playing around with the hyperparameters of your random forest and neural network classifiers, such as the number of hidden layers and neurons, and see how they effect the results of your classifiers in Python!

Your feedback is very welcome! Most helpful for us is if you “Improve this page on GitHub”. If you prefer anonymous feedback, please fill this form.

Key Points

It’s a good idea to check whether our machine learning models behave well with real experimental data.

That’s it!

previous episode

Introduction to Machine Learning

next episode

Applying To Experimental Data

Overview

What about real, experimental data?

Challenge to end all challenges

Solution to part 1

Solution to part 2

Solution to part 3

Solution to part 4

Solution to part 5

At the end of the day

Ready to machine learn to take over the world!

Key Points

previous episode

next episode