# Applying To Experimental Data

## Overview

Teaching: 5 min
Exercises: 15 min

**Questions**

- What about real, experimental data?
- Are we there yet?

**Objectives**

- Check that our machine learning models behave similarly with real experimental data.
- Finish!

# What about *real, experimental* data?

Notice that we’ve trained and tested our machine learning models on simulated data for signal and background. That’s why there are definite labels, `y`. This has been a case of **supervised learning**, since we knew the labels (`y`) going into the game. Once you’re happy with your machine learning models, you would usually *apply* them to real experimental data.
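As a toy illustration of this workflow (the numbers here are synthetic, not the lesson's dataset): we fit on labelled simulation, then predict on unlabelled "data", where the model must assign the labels itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_sim = np.vstack([rng.normal(0, 1, (100, 2)),   # "background" events
                   rng.normal(3, 1, (100, 2))])  # "signal" events
y_sim = np.array([0] * 100 + [1] * 100)          # labels exist only for simulation

clf = RandomForestClassifier(random_state=0).fit(X_sim, y_sim)  # supervised training

X_unlabelled = rng.normal(1.5, 2, (50, 2))  # stands in for real data: no labels
y_pred = clf.predict(X_unlabelled)          # the model assigns labels itself
print(y_pred[:10])
```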

To make sure our machine learning model makes sense when applied to real experimental data, we should check whether simulated data and real experimental data have the same shape in classifier threshold values.

We first need to get the real experimental data.

## Challenge to end all challenges

- Read `data.csv` like in the Data Discussion lesson. `data.csv` is in the same file folder as the files we’ve used so far.
- Apply `cut_lep_type` and `cut_lep_charge` like in the Data Discussion lesson.
- Convert the data to a NumPy array, `X_data`, similar to the Data Preprocessing lesson. You may find the attribute `.values` useful to convert a pandas DataFrame to a NumPy array.
- Don’t forget to transform using the scaler like in the Data Preprocessing lesson. Call the scaled data `X_data_scaled`.
- Predict the labels your random forest classifier would assign to `X_data_scaled`. Call your predictions `y_data_RF`.

## Solution to part 1

```
DataFrames['data'] = pd.read_csv('/kaggle/input/4lepton/data.csv') # read data.csv file
```

## Solution to part 2

```
DataFrames["data"] = DataFrames["data"][
    np.vectorize(cut_lep_type)(
        DataFrames["data"].lep_type_0,
        DataFrames["data"].lep_type_1,
        DataFrames["data"].lep_type_2,
        DataFrames["data"].lep_type_3,
    )
]
DataFrames["data"] = DataFrames["data"][
    np.vectorize(cut_lep_charge)(
        DataFrames["data"].lep_charge_0,
        DataFrames["data"].lep_charge_1,
        DataFrames["data"].lep_charge_2,
        DataFrames["data"].lep_charge_3,
    )
]
```

## Solution to part 3

```
X_data = DataFrames['data'][ML_inputs].values # .values converts straight to NumPy array
```

## Solution to part 4

```
X_data_scaled = scaler.transform(X_data) # X_data now scaled same as training and testing sets
```
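A note on why `transform` (not `fit_transform`) is the right call here: it applies the mean and scale learned from the *training* set to the new data, so real data and simulation are scaled identically. A tiny synthetic sketch (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])  # stands in for the training set
scaler = StandardScaler().fit(X_train)     # learns the training mean (2.0) and scale

X_new = np.array([[2.0]])                  # stands in for X_data
print(scaler.transform(X_new))             # scaled with the *training* statistics -> [[0.]]
```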

## Solution to part 5

```
y_data_RF = RF_clf.predict(X_data_scaled) # make predictions on the data
```

Now we can overlay the real experimental data on the simulated data.

```
labels = ["background", "signal"]  # labels for simulated data
thresholds = []  # list to hold random forest classifier probability predictions for each sample
for s in samples:  # loop over samples
    # predict signal probabilities for each sample
    thresholds.append(RF_clf.predict_proba(scaler.transform(DataFrames[s][ML_inputs]))[:, 1])

# plot simulated data
plt.hist(thresholds, bins=np.arange(0, 0.8, 0.1), density=True, stacked=True, label=labels)

# histogram the experimental data
data_hist = np.histogram(
    RF_clf.predict_proba(X_data_scaled)[:, 1], bins=np.arange(0, 0.8, 0.1), density=True
)[0]

# get scale imposed by density=True
scale = sum(RF_clf.predict_proba(X_data_scaled)[:, 1]) / sum(data_hist)
data_err = np.sqrt(data_hist * scale) / scale  # get error on experimental data

# plot the experimental data error bars
plt.errorbar(x=np.arange(0.05, 0.75, 0.1), y=data_hist, yerr=data_err, label="Data")
plt.xlabel("Threshold")
plt.legend()
```
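To go beyond eyeballing the overlay, one can form a simple chi-square-like statistic between the data points and the total simulated histogram: a value of order one per bin indicates agreement within errors. The bin contents below are made up purely for illustration:

```python
import numpy as np

# Hypothetical bin contents: total simulated prediction vs. observed data counts
sim_total = np.array([50.0, 30.0, 15.0, 8.0, 4.0, 2.0, 1.0])
data_counts = np.array([48.0, 33.0, 14.0, 9.0, 3.0, 2.0, 2.0])
data_err = np.sqrt(data_counts)        # Poisson errors, as in the plot above
data_err[data_err == 0] = 1.0          # avoid division by zero in empty bins

chi2 = np.sum(((data_counts - sim_total) / data_err) ** 2)
ndf = len(data_counts)
print(f"chi2/ndf = {chi2 / ndf:.2f}")  # ~0.20 here: well within errors
```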

Within errors, the real experimental data error bars agree with the simulated data histograms. Good news: our random forest classifier model makes sense with real experimental data!

# At the end of the day

How many signal events is the random forest classifier predicting?

```
print(np.count_nonzero(y_data_RF == 1)) # signal
```

What about background?

```
print(np.count_nonzero(y_data_RF == 0)) # background
```
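Combining the two counts gives a predicted signal fraction. A small sketch with a hypothetical prediction array (your actual `y_data_RF` from above will be much longer):

```python
import numpy as np

y_data_RF = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])  # hypothetical predictions
n_signal = np.count_nonzero(y_data_RF == 1)
n_background = np.count_nonzero(y_data_RF == 0)
print(f"signal: {n_signal}, background: {n_background}, "
      f"signal fraction: {n_signal / len(y_data_RF):.0%}")
# -> signal: 4, background: 6, signal fraction: 40%
```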

The random forest classifier is *predicting* how many real data events are signal and how many are background. How cool is that?!

## Ready to machine learn to take over the world!

Hopefully you’ve enjoyed this brief discussion on machine learning! Try playing around with the hyperparameters of your random forest and neural network classifiers, such as the number of hidden layers and neurons, and see how they affect the results of your classifiers in Python!
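If your classifiers are scikit-learn objects like the ones used above, those hyperparameters are constructor arguments. The values below are arbitrary starting points for experimentation, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

RF_clf = RandomForestClassifier(n_estimators=200, max_depth=8)     # more, shallower trees
NN_clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)  # two hidden layers: 64 and 32 neurons
print(RF_clf.n_estimators, NN_clf.hidden_layer_sizes)
```

Retrain and re-plot after each change to see how the threshold distributions and predicted counts move.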

Your feedback is very welcome! Most helpful for us is if you “Improve this page on GitHub”. If you prefer anonymous feedback, please fill in this form.

## Key Points

It’s a good idea to check whether our machine learning models behave well with real experimental data.

That’s it!