Applying To Experimental Data
Overview
Teaching: 5 min
Exercises: 15 min
Questions
What about real, experimental data?
Are we there yet?
Objectives
Check that our machine learning models behave similarly with real experimental data.
Finish!
What about real, experimental data?
Notice that we’ve trained and tested our machine learning models on simulated data for signal and background. That’s why there are definite labels, y. This has been a case of supervised learning since we knew the labels (y) going into the game. Your machine learning models would then usually be applied to real experimental data once you’re happy with them.
To make sure our machine learning model makes sense when applied to real experimental data, we should check whether the simulated data and the real experimental data have similarly shaped distributions of the classifier’s output probability (the quantity we place a threshold on).
We first need to get the real experimental data.
Challenge to end all challenges
- Read data.csv like in the Data Discussion lesson. data.csv is in the same file folder as the files we’ve used so far.
- Apply cut_lep_type and cut_lep_charge like in the Data Discussion lesson.
- Convert the data to a NumPy array, X_data, similar to the Data Preprocessing lesson. You may find the attribute .values useful to convert a pandas DataFrame to a NumPy array.
- Don’t forget to transform using the scaler like in the Data Preprocessing lesson. Call the scaled data X_data_scaled.
- Predict the labels your random forest classifier would assign to X_data_scaled. Call your predictions y_data_RF.
Solution to part 1
DataFrames['data'] = pd.read_csv('/kaggle/input/4lepton/data.csv') # read data.csv file
Solution to part 2
DataFrames["data"] = DataFrames["data"][
    np.vectorize(cut_lep_type)(
        DataFrames["data"].lep_type_0,
        DataFrames["data"].lep_type_1,
        DataFrames["data"].lep_type_2,
        DataFrames["data"].lep_type_3,
    )
]  # keep only events passing the lepton type cut
DataFrames["data"] = DataFrames["data"][
    np.vectorize(cut_lep_charge)(
        DataFrames["data"].lep_charge_0,
        DataFrames["data"].lep_charge_1,
        DataFrames["data"].lep_charge_2,
        DataFrames["data"].lep_charge_3,
    )
]  # keep only events passing the lepton charge cut
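If the np.vectorize pattern above feels opaque, here is a minimal sketch of what it does. The function toy_cut and the four toy columns are hypothetical stand-ins invented for illustration; the real cut_lep_type and cut_lep_charge are defined in the Data Discussion lesson and may apply different conditions.

```python
import numpy as np


# Hypothetical stand-in for a cut function (NOT the lesson's real cut)
def toy_cut(a, b, c, d):
    # keep the event only if the four values sum to zero
    return a + b + c + d == 0


# Four toy columns, one entry per event (3 events in total)
col0 = np.array([1, 1, -1])
col1 = np.array([-1, 1, 1])
col2 = np.array([1, 1, -1])
col3 = np.array([-1, 1, 1])

# np.vectorize calls toy_cut once per event and collects the True/False results
mask = np.vectorize(toy_cut)(col0, col1, col2, col3)

# Indexing with the boolean mask keeps only the events that pass the cut
kept_col0 = col0[mask]

print(mask)
print(kept_col0)
```

Indexing a pandas DataFrame with a boolean mask of this kind keeps the passing rows in the same way.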
Solution to part 3
X_data = DataFrames['data'][ML_inputs].values # .values converts straight to NumPy array
Solution to part 4
X_data_scaled = scaler.transform(X_data) # X_data now scaled same as training and testing sets
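Why transform rather than fit again? The sketch below assumes the scaler standardises each variable using a mean and standard deviation learned from the training set (as scikit-learn's StandardScaler does). It uses NumPy only, and all the toy arrays and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-ins for the simulated training set and the experimental data
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))
X_data_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

# "Fit": learn the mean and standard deviation from the training set only
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)

# "Transform": apply the *training* statistics to the new data
X_data_toy_scaled = (X_data_toy - train_mean) / train_std

# Refitting on the data would use different statistics, so the classifier
# would see inputs on a slightly different scale than it was trained on
X_data_toy_refit = (X_data_toy - X_data_toy.mean(axis=0)) / X_data_toy.std(axis=0)

# Largest difference between the two scalings
print(np.abs(X_data_toy_scaled - X_data_toy_refit).max())
```

The classifier learned its decision boundaries in the coordinate system defined by the training statistics, so new data should be mapped into that same system.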
Solution to part 5
y_data_RF = RF_clf.predict(X_data_scaled) # make predictions on the data
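It can help to see how predicted labels relate to predicted probabilities. The sketch below assumes a two-class classifier whose predict step picks the class with the larger probability, which amounts to a threshold of 0.5 on the signal probability. The probabilities are invented for illustration, and the stricter threshold of 0.6 is an arbitrary choice.

```python
import numpy as np

# Hypothetical signal probabilities, standing in for
# RF_clf.predict_proba(X_data_scaled)[:, 1]
proba_signal = np.array([0.05, 0.40, 0.55, 0.90, 0.62])

# Assumed default behaviour for two classes: label 1 (signal) when the
# signal probability exceeds 0.5, otherwise label 0 (background)
labels_default = (proba_signal > 0.5).astype(int)

# A stricter, illustrative threshold keeps fewer events as signal
labels_strict = (proba_signal > 0.6).astype(int)

print(labels_default)
print(labels_strict)
```

Raising the threshold typically gives a purer but smaller signal selection.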
Now we can overlay the real experimental data on the simulated data.
labels = ["background", "signal"] # labels for simulated data
thresholds = []  # define list to hold random forest classifier probability predictions for each sample
for s in samples:  # loop over samples
    thresholds.append(
        RF_clf.predict_proba(scaler.transform(DataFrames[s][ML_inputs]))[:, 1]
    )  # predict probabilities for each sample
plt.hist(
    thresholds, bins=np.arange(0, 0.8, 0.1), density=True, stacked=True, label=labels
)  # plot simulated data
data_proba = RF_clf.predict_proba(X_data_scaled)[:, 1]  # signal probability for each data event
data_counts = np.histogram(data_proba, bins=np.arange(0, 0.8, 0.1))[0]  # raw event counts per bin
data_hist = np.histogram(data_proba, bins=np.arange(0, 0.8, 0.1), density=True)[
    0
]  # histogram the experimental data
scale = data_counts.sum() / data_hist.sum()  # get scale imposed by density=True (events per unit density)
data_err = np.sqrt(data_hist * scale) / scale  # get Poisson error on experimental data
plt.errorbar(
    x=np.arange(0.05, 0.75, 0.1), y=data_hist, yerr=data_err, label="Data"
)  # plot the experimental data errorbars
plt.xlabel("Threshold")
plt.legend()
Within errors, the real experimental data errorbars agree with the simulated data histograms. Good news, our random forest classifier model makes sense with real experimental data!
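If you’d like a number to go with “within errors”, one rough option is to compute a pull for each bin: the difference between data and simulation divided by the data error bar. The sketch below uses made-up bin values, not values from the lesson’s dataset, and it ignores any uncertainty on the simulation itself.

```python
import numpy as np

# Hypothetical per-bin values, NOT taken from the lesson's dataset
sim_density = np.array([2.0, 1.5, 1.0, 0.5])  # simulated (stacked) histogram heights
data_density = np.array([2.2, 1.4, 1.1, 0.4])  # experimental data points
data_err = np.array([0.2, 0.2, 0.1, 0.1])  # experimental data error bars

# Pull: how many error bars the data sits away from the simulation
pulls = (data_density - sim_density) / data_err

# Mean squared pull as a single rough summary number
chi2_per_bin = np.sum(pulls**2) / len(pulls)

print(pulls)
print(chi2_per_bin)
```

Pulls mostly within about ±1 to ±2, and a summary value of order 1, suggest reasonable agreement. Much larger values would be a hint to look more closely at the simulation or the inputs.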
At the end of the day
How many signal events is the random forest classifier predicting?
print(np.count_nonzero(y_data_RF == 1)) # signal
What about background?
print(np.count_nonzero(y_data_RF == 0)) # background
The random forest classifier is predicting how many real data events are signal and how many are background. How cool is that?!
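Both counts can also be obtained in one go. The sketch below uses a small made-up array in place of y_data_RF and assumes the labels are the integers 0 (background) and 1 (signal).

```python
import numpy as np

# Hypothetical predictions standing in for y_data_RF
y_pred_toy = np.array([0, 1, 0, 0, 1, 0])

# bincount returns [number of 0s, number of 1s]; minlength=2 guards against
# the case where one of the classes is never predicted
n_background, n_signal = np.bincount(y_pred_toy, minlength=2)

# Fraction of events predicted to be signal
signal_fraction = n_signal / len(y_pred_toy)

print(n_background, n_signal, signal_fraction)
```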
Ready to machine learn to take over the world!
Hopefully you’ve enjoyed this brief discussion on machine learning! Try playing around with the hyperparameters of your random forest and neural network classifiers, such as the number of hidden layers and neurons, and see how they affect the results of your classifiers in Python!
Your feedback is very welcome! Most helpful for us is if you “Improve this page on GitHub”. If you prefer anonymous feedback, please fill this form.
Key Points
It’s a good idea to check whether our machine learning models behave well with real experimental data.
That’s it!