Data Preprocessing


Teaching: 10 min
Exercises: 5 min
Questions

  • How must we organize our data such that it can be used in the machine learning libraries?

  • Are we ready for machine learning yet?!

Objectives

  • Prepare the dataset for machine learning.

  • Get excited for machine learning!

Format the data for machine learning

It’s almost time to build a machine learning model! First we choose the variables to use in our machine learning model.

ML_inputs = ["lep_pt_1", "lep_pt_2"]  # list of features for ML model

The data are currently stored in pandas DataFrames. We now need to convert them into NumPy arrays so that they can be used by scikit-learn and TensorFlow during the machine learning process. There are many ways to do this; in this tutorial we use NumPy's concatenate function to format our data set. For more information, please see the NumPy documentation on concatenate. We will briefly walk through the code in this tutorial.

#  Organise data ready for the machine learning model

# for sklearn data are usually organised
# into one 2D array of shape (n_samples x n_features)
# containing all the data and one array of categories
# of length n_samples

import numpy as np

# NOTE: the arguments to append/concatenate below are reconstructed from the comments;
# they assume `DataFrames` is a dict of pandas DataFrames keyed by sample name.
all_MC = []  # define empty list that will contain all features for the MC
for s in samples:  # loop over the different samples
    if s != "data":  # only MC should pass this
        all_MC.append(
            DataFrames[s][ML_inputs]
        )  # append the MC dataframe to the list containing all MC features
X = np.concatenate(
    all_MC
)  # concatenate the list of MC dataframes into a single 2D array of features, called X

all_y = []  # define empty list that will contain labels: is an event signal or background?
for s in samples:  # loop over the different samples
    if s != "data":  # only MC should pass this
        if "H125" in s:  # only signal MC should pass this
            all_y.append(
                np.ones(DataFrames[s].shape[0])
            )  # signal events are labelled with 1
        else:  # only background MC should pass this
            all_y.append(
                np.zeros(DataFrames[s].shape[0])
            )  # background events are labelled 0
y = np.concatenate(
    all_y
)  # concatenate the list of labels into a single 1D array of labels, called y

This code takes the DataFrames and produces a NumPy array X containing only the columns listed in ML_inputs, together with a 1D array y holding one label per event (1 for signal, 0 for background).
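To make the resulting shapes concrete, here is a minimal, self-contained sketch of the same idea on two tiny made-up samples. The dictionary `toy_frames`, the sample names, the extra `other` column, and all numerical values are hypothetical and exist only for illustration; only `ML_inputs` is taken from the tutorial.

```python
import numpy as np
import pandas as pd

ML_inputs = ["lep_pt_1", "lep_pt_2"]  # list of features for ML model

# Hypothetical toy samples with made-up values, keyed by sample name
toy_frames = {
    "H125_toy": pd.DataFrame(
        {"lep_pt_1": [50.0, 60.0], "lep_pt_2": [30.0, 25.0], "other": [1, 2]}
    ),
    "background_toy": pd.DataFrame(
        {
            "lep_pt_1": [40.0, 45.0, 55.0],
            "lep_pt_2": [20.0, 22.0, 28.0],
            "other": [3, 4, 5],
        }
    ),
}

# Keep only the ML_inputs columns and stack the samples row-wise
toy_X = np.concatenate([df[ML_inputs].to_numpy() for df in toy_frames.values()])

# One label per event: 1 for the signal sample, 0 for everything else
toy_y = np.concatenate(
    [
        np.ones(len(df)) if "H125" in name else np.zeros(len(df))
        for name, df in toy_frames.items()
    ]
)

print(toy_X.shape)  # (5, 2): 5 events, 2 features
print(toy_y)  # [1. 1. 0. 0. 0.]
```

The `other` column is dropped because it is not listed in `ML_inputs`, and the labels line up with the rows of `toy_X` because both are built by looping over the samples in the same order.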

Now we separate our data into a training and test set.

# This will split your data into train-test sets: 67%-33%.
# It will also shuffle entries so you will not get the first 67% of X for training
# and the last 33% for testing.
# This is particularly important in cases where you load all signal events first
# and then the background events.

# Here we split our data into two independent samples.
# The split is to create a training and testing set.
# The first will be used for classifier training and the second to evaluate its performance.

from sklearn.model_selection import train_test_split

# make train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=seed_value
)  # set the random seed for reproducibility
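As a rough illustration of what the split does, the sketch below runs train_test_split on made-up data in which all "signal" rows come first. The toy arrays and the seed 42 are arbitrary choices for this example, not values from the tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 events with 2 features each.
# The first 50 events are labelled signal (1), the last 50 background (0).
toy_X = np.arange(200, dtype=float).reshape(100, 2)
toy_y = np.concatenate([np.ones(50), np.zeros(50)])

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42  # 42 is an arbitrary seed
)

print(toy_X_train.shape, toy_X_test.shape)  # (67, 2) (33, 2)

# Rows are shuffled before splitting, so each set should contain a mix of
# signal and background rather than one block of each.
print(np.unique(toy_y_train), np.unique(toy_y_test))
```

Features and labels are shuffled together, so every row of `toy_X_train` still matches its entry in `toy_y_train`.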

Many machine learning models, neural networks in particular, may fail to converge within the maximum number of iterations allowed if the input features are not normalized. There are many different methods for normalizing data; here we use scikit-learn's built-in StandardScaler for standardization. The StandardScaler rescales each numerical feature so that, on the data it is fitted to, the feature has a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only, and the same fitted transformation must then be applied to the test set for the results to be meaningful (we apply it to the test set in the next step). This type of preprocessing is common before feeding data into machine learning models.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # initialise StandardScaler

# Fit only to the training data
scaler.fit(X_train)

Now we use the fitted scaler to transform the training data.

X_train_scaled = scaler.transform(X_train)


Apply the same scaler transformation to X_test and X.


X_test_scaled = scaler.transform(X_test)
X_scaled = scaler.transform(X)
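The following sketch checks, on randomly generated toy data, what the fitted scaler actually does: for each feature, `transform` subtracts the mean and divides by the standard deviation learned from the training set. The toy arrays, their sizes, and the seed are assumptions made for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)  # learn per-feature mean and standard deviation from training data only

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# transform computes (x - mean_) / scale_ for each feature
manual = (toy_train - toy_scaler.mean_) / toy_scaler.scale_
print(np.allclose(manual, toy_train_scaled))  # True

# The scaled training features have mean 0 and standard deviation 1
print(np.allclose(toy_train_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(toy_train_scaled.std(axis=0), 1.0))  # True

# The test set is scaled with the training statistics, so its mean and
# standard deviation are only approximately 0 and 1
print(toy_test_scaled.mean(axis=0), toy_test_scaled.std(axis=0))
```

Reusing the training statistics on the test set keeps the two sets on the same scale and avoids leaking information from the test data into the preprocessing.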

Now we are ready to examine various models \(f\) for predicting whether an event corresponds to a signal event or a background event.


Key Points

  • One must properly format data before any machine learning takes place.

  • Data can be formatted using scikit-learn functionality; using it effectively may take time to master.