Hyperparameters and validation#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
from torch import nn, optim

Some definitions#

As we’ve seen, the numbers in a model that a minimization algorithm optimizes are called “parameters” (or “weights”):

model = nn.Sequential(
    nn.Linear(5, 5),
    nn.ReLU(),
    nn.Linear(5, 5),
    nn.ReLU(),
    nn.Linear(5, 5),
)
list(model.parameters())
[Parameter containing:
 tensor([[-0.1815,  0.1838, -0.3431, -0.4076,  0.3901],
         [ 0.3900,  0.3509,  0.2209,  0.0922, -0.0143],
         [ 0.2411,  0.0895, -0.1387, -0.4435, -0.3326],
         [ 0.4248, -0.2901, -0.3991, -0.3465,  0.2081],
         [-0.4174,  0.1309,  0.1952,  0.2023,  0.3734]], requires_grad=True),
 Parameter containing:
 tensor([ 0.0198, -0.4174,  0.4239, -0.4139,  0.3706], requires_grad=True),
 Parameter containing:
 tensor([[-0.3731,  0.1971,  0.2448, -0.2416,  0.3231],
         [ 0.1146, -0.3466,  0.0792, -0.2414,  0.2860],
         [ 0.0343,  0.4375,  0.2489,  0.0457, -0.3055],
         [-0.2955, -0.3797, -0.4261, -0.0522,  0.1353],
         [ 0.1162, -0.3882, -0.2386, -0.0326, -0.0695]], requires_grad=True),
 Parameter containing:
 tensor([ 0.3435,  0.0010, -0.0161, -0.1160,  0.0748], requires_grad=True),
 Parameter containing:
 tensor([[-1.2969e-01,  7.2057e-02,  9.5328e-02,  1.9342e-01,  7.7676e-05],
         [-9.4580e-02,  4.0030e-02,  1.0286e-01,  2.4509e-01, -2.8612e-01],
         [ 2.7807e-01,  9.2804e-02, -1.8640e-01,  2.0941e-01,  1.6396e-02],
         [-2.4038e-01,  1.8209e-01,  4.0494e-01,  2.5653e-01, -2.8209e-01],
         [-4.3381e-01, -5.0719e-02, -2.5285e-01,  2.7850e-01,  1.4529e-01]],
        requires_grad=True),
 Parameter containing:
 tensor([ 0.1687,  0.3300, -0.4072, -0.1513, -0.2254], requires_grad=True)]

I’ve discussed other changeable aspects of models,

  • architecture (number of hidden layers, number of nodes in each, maybe other graph structures),

  • choice of activation function,

  • regularization techniques,

  • choice of input features,

as well as other changeable aspects of the training procedure,

  • minimization algorithm and its options (such as learning rate and momentum),

  • distribution of initial parameter values in each layer,

  • number of epochs and mini-batch size.

It would be confusing to call these choices “parameters,” so we call them “hyperparameters” (“hyper” means “over or above”). The problem of finding the best model is a collaboration: you, the human, choose the hyperparameters, and the optimizer chooses the parameters. Following the farming analogy from the Overview, the hyperparameters are the choices that the farmer gets to make—how much water, how much sun, etc. The parameters are the low-level details of how a plant grows: where its leaves branch, how its veins and roots organize themselves to survive. Generally, there are a lot more parameters than hyperparameters.
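To see the imbalance concretely, we can count the parameters of the small model above: each of its three Linear(5, 5) layers has 25 weights and 5 biases, for 90 parameters in total, already more than the handful of hyperparameters listed above (and realistic models have vastly more).

sum(parameter.numel() for parameter in model.parameters())
90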

If there are a lot of hyperparameters to tune, we might want to tune them algorithmically—maybe with a grid search, a random search, or Bayesian optimization. Technically, I suppose they then become parameters, or we get a three-level hierarchy: parameters, hyperparameters, and hyperhyperparameters! Practitioners might not use consistent terminology (taken strictly, “using ML to tune hyperparameters” is a contradiction in terms), but just don’t get confused about who is optimizing what: algorithm 1, algorithm 2, or the human. Even if some hyperparameters are tuned by an algorithm, some of them must still be chosen by hand. For instance, you choose the type of ML algorithm, maybe a neural network, maybe something else, and non-numerical choices about the network topology are generally hand-chosen. If a grid search, random search, or Bayesian optimization is choosing the rest, you still have to set the grid spacing for the grid search, the number of trials and the sampling distribution for the random search, or various options in the Bayesian search. Or a software package that you use chooses them for you.
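For example, a grid search is nothing more than trying every combination of hand-chosen candidate values. As a minimal sketch (the hyperparameter names and values below are made up for illustration), the human picks the lists and their spacing—the “hyperhyperparameters”—and the algorithm trains and evaluates one model per combination:

import itertools

# illustrative hyperparameter grid: the human chooses these lists (and their spacing);
# a grid search would train and compare one model per combination
hyperparameter_grid = list(itertools.product(
    [0.0001, 0.001, 0.01],    # learning rate
    [10, 100, 1000],          # number of nodes in the hidden layer
    ["relu", "sigmoid"],      # activation function
))
len(hyperparameter_grid)
18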

Partitioning data into training, validation, and test samples#

In the section on Regularization, we split a dataset into two samples and computed the loss function on each.

  • Training: loss computed from the training dataset is used to change the parameters of the model. Thus, the loss computed in training can get arbitrarily small as the model is adjusted to fit the training data points exactly (if it has enough parameters to be so flexible).

  • Test: loss computed from the test dataset acts as an independent measure of the model quality. A model generalizes well if it is a good fit (has minimal loss) on both the training data and data drawn from the same distribution: the test dataset.

Suppose that I set up an ML model with some hand-chosen hyperparameters, optimize it on the training dataset, and then I don’t like how it performs on the test dataset, so I adjust the hyperparameters and run again. And again. After many hyperparameter adjustments, I find a set that gives a small loss on both the training and the test datasets. Is the test dataset still an independent measure of the model quality?

It’s not a fair test because my hyperparameter optimization is the same kind of thing as the automated parameter optimization. When I adjust hyperparameters, look at how the loss changes, and use that information to either revert the hyperparameters or make another change, I am acting as a minimization algorithm—just a slow, low-dimensional one.

Since we do need to optimize (some of) the hyperparameters, we need a third data subsample:

  • Validation: loss computed from the validation dataset is used to change the hyperparameters of the model.

So we need to do a 3-way split of the original dataset. A common practice is to use 80% of the data for training, 10% of the data for validation, and hold 10% of the data for the final test—do not look at its loss value until you’re sure you won’t be changing hyperparameters anymore. This is similar to the practice, in particle physics, of performing a blinded analysis: you can’t look at the analysis result until you are no longer changing the analysis procedure (and then you’re stuck with it).

The fractions, 80%, 10%, 10%, are conventional. They’re not hyperparameters—you can’t change the proportions during model-tuning. Since the validation and test datasets are the smallest, their sizes set the resolution of the model evaluation, so you might need to increase them (to, say, 60%, 20%, 20%) if you know that 10% of your data won’t be enough to measure the loss precisely. But if you’re statistics-limited, neural networks might not be the best ML model (consider Boosted Decision Trees (BDTs) instead).
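Putting the three roles together, here is a minimal sketch of the whole workflow. It scans one made-up hyperparameter (the hidden-layer width) on random stand-in data, purely to show which dataset gets used where; the real dataset and a proper 3-way split follow below.

# random stand-in data, split 80%/10%/10% by hand just for this sketch
X = torch.randn(1000, 5)
y = torch.randn(1000, 1)
train_X, valid_X, test_X = X[:800], X[800:900], X[900:]
train_y, valid_y, test_y = y[:800], y[800:900], y[900:]

loss_function = nn.MSELoss()
best_width, best_model, best_valid_loss = None, None, float("inf")

for hidden_width in [10, 30, 100]:              # hand-chosen candidate hyperparameter values
    model = nn.Sequential(nn.Linear(5, hidden_width), nn.ReLU(), nn.Linear(hidden_width, 1))
    optimizer = optim.Adam(model.parameters())
    for epoch in range(100):                    # parameters are optimized on the training data only
        optimizer.zero_grad()
        loss_function(model(train_X), train_y).backward()
        optimizer.step()
    with torch.no_grad():                       # hyperparameters are judged on the validation data
        valid_loss = loss_function(model(valid_X), valid_y).item()
    if valid_loss < best_valid_loss:
        best_width, best_model, best_valid_loss = hidden_width, model, valid_loss

with torch.no_grad():                           # the test data is looked at once, at the very end
    test_loss = loss_function(best_model(test_X), test_y).item()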

PyTorch’s random_split function can split a dataset into 3 parts as easily as 2.

boston_prices_df = pd.read_csv(
    "data/boston-house-prices.csv", sep=r"\s+", header=None,
    names=["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"],
)
# standardize every column (features and target) to zero mean and unit variance
boston_prices_df = (boston_prices_df - boston_prices_df.mean()) / boston_prices_df.std()

from torch.utils.data import TensorDataset, DataLoader, random_split

features = boston_prices_df.drop(columns=["MEDV"])
targets = boston_prices_df["MEDV"]

features_tensor = torch.tensor(features.values, dtype=torch.float32)
targets_tensor = torch.tensor(targets.values[:, np.newaxis], dtype=torch.float32)

dataset = TensorDataset(features_tensor, targets_tensor)

train_size = int(np.floor(0.8 * len(dataset)))
valid_size = int(np.floor(0.1 * len(dataset)))
test_size = len(dataset) - train_size - valid_size
train_dataset, valid_dataset, test_dataset = random_split(dataset, [train_size, valid_size, test_size])

len(train_dataset), len(valid_dataset), len(test_dataset)
(404, 50, 52)
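The three pieces returned by random_split are themselves Dataset objects, so each can be wrapped in a DataLoader for mini-batch training (DataLoader was imported above but not yet used; the batch size here is just an illustrative choice).

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=64)
test_loader = DataLoader(test_dataset, batch_size=64)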

Oddly, Scikit-Learn’s equivalent, train_test_split, can only return 2 parts. If you use it, you have to apply it twice to get 3, like this:

from sklearn.model_selection import train_test_split
train_features, tmp_features, train_targets, tmp_targets = train_test_split(features.values, targets.values, train_size=0.8)
valid_features, test_features, valid_targets, test_targets = train_test_split(tmp_features, tmp_targets, train_size=0.5)

del tmp_features, tmp_targets

len(train_features), len(valid_features), len(test_features)
(404, 51, 51)

although Scikit-Learn does return NumPy arrays or Pandas DataFrames, which are more useful than PyTorch Datasets if you can fit everything into memory.
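If you do take the Scikit-Learn route, converting the resulting NumPy arrays into tensors follows the same pattern as above, one line per array:

train_features_tensor = torch.tensor(train_features, dtype=torch.float32)
train_targets_tensor = torch.tensor(train_targets[:, np.newaxis], dtype=torch.float32)
# ... and likewise for the validation and test arrays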

Cross-validation#

For completeness, I should mention an alternative to allocating a validation dataset: you can cross-validate on a larger subsample of the data. In this method, you still need to isolate a test sample for final evaluation, but you can optimize the parameters and hyperparameters using the same data. The following diagram from Scikit-Learn’s documentation illustrates it well:

After isolating a test sample, you

  1. subdivide the remaining sample into \(k\) subsamples,

  2. for each \(i \in [0, k)\), combine all data except for subsample \(i\) into a training dataset \(T_i\) and use subsample \(i\) as a validation dataset \(V_i\),

  3. train an independent model on each \(T_i\) and compute the validation loss \(L_i\) with the corresponding trained model and validation dataset \(V_i\),

  4. report the total validation loss, \(L = \sum_i L_i\) (or its average, \(L / k\)).

This is more computationally expensive, but it makes better use of smaller datasets.

Scikit-Learn provides a KFold object to help keep track of indexes when cross-validating. For \(k = 5\),

from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True)

By calling KFold.split on anything with a length (that is, any object you can call len on), you can iterate over the folds (\(i \in [0, k)\)) and get the random subsamples \(T_i\) and \(V_i\) as arrays of integer indexes.

for train_indexes, valid_indexes in kf.split(dataset):
    print(len(train_indexes), len(valid_indexes))
404 102
405 101
405 101
405 101
405 101
train_indexes[:20]
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 17, 18,
       19, 20, 21])
valid_indexes[:20]
array([ 15,  16,  45,  49,  62,  66,  67,  69,  87,  92,  95, 104, 117,
       127, 130, 135, 137, 145, 147, 153])

These integer indexes can slice arrays, Pandas DataFrames (via pd.DataFrame.iloc), and PyTorch Tensors.

train_features_i, train_targets_i = dataset[train_indexes]
valid_features_i, valid_targets_i = dataset[valid_indexes]
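For example, the same index arrays also select rows directly from the DataFrame and tensor defined earlier:

features.iloc[train_indexes]        # rows of the Pandas DataFrame
features_tensor[train_indexes]      # rows of the PyTorch Tensor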

Here’s a full example that computes the training loss and validation loss with cross-validation on the Boston House Prices dataset.

NUMBER_OF_FOLDS = 5
NUMBER_OF_EPOCHS = 1000

kf = KFold(n_splits=NUMBER_OF_FOLDS, shuffle=True)

# use a class so that we can generate new, independent models for every k-fold
class Model(nn.Module):
    def __init__(self):
        super().__init__()   # let PyTorch do its initialization first
        self.model = nn.Sequential(
            nn.Linear(13, 100),
            nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, x):
        return self.model(x)

# initialize loss-versus-epoch lists as zeros to update with each k-fold
train_loss_vs_epoch = [0] * NUMBER_OF_EPOCHS
valid_loss_vs_epoch = [0] * NUMBER_OF_EPOCHS

# for each k-fold
for train_indexes, valid_indexes in kf.split(dataset):
    train_features_i, train_targets_i = dataset[train_indexes]
    valid_features_i, valid_targets_i = dataset[valid_indexes]

    # generate a new, independent model, loss_function, and optimizer
    model = Model()
    loss_function = nn.MSELoss()
    optimizer = optim.Adam(model.parameters())

    # do a complete training loop
    for epoch in range(NUMBER_OF_EPOCHS):
        optimizer.zero_grad()

        train_loss = loss_function(model(train_features_i), train_targets_i)
        with torch.no_grad():  # the validation loss is only monitored, not minimized
            valid_loss = loss_function(model(valid_features_i), valid_targets_i)

        train_loss.backward()
        optimizer.step()

        # average loss over k-folds (could ignore NUMBER_OF_FOLDS to sum, instead)
        train_loss_vs_epoch[epoch] += train_loss.item() / NUMBER_OF_FOLDS
        valid_loss_vs_epoch[epoch] += valid_loss.item() / NUMBER_OF_FOLDS
fig, ax = plt.subplots()

ax.plot(range(1, len(train_loss_vs_epoch) + 1), train_loss_vs_epoch, label="training k-folds")
ax.plot(range(1, len(valid_loss_vs_epoch) + 1), valid_loss_vs_epoch, color="tab:blue", ls=":", label="validation k-folds")

ax.set_ylim(0, min(max(train_loss_vs_epoch), max(valid_loss_vs_epoch)))
ax.set_xlabel("epoch number")
ax.set_ylabel("loss")

ax.legend(loc="upper right")

plt.show()
[plot: training (solid) and validation (dotted) loss versus epoch number, averaged over the k-folds]

But since you’ll usually have large datasets (typically Monte Carlo, in HEP), you can generally just split the data 3 ways into training, validation, and test datasets, without mixing training and validation through cross-validation.