5: Boosting to Uniformity

In physics applications frequently we need to achieve uniformity of predictions along some features. For instance, when testing for the existence of a new particle, we need a classifier to be uniform in background along the mass (otherwise one can get false discovery due to so-called peaking background).

This notebook contains some comparison of classifiers. The target is to obtain flat effiency in signal (without significally loosing quality of classification) in Dalitz features.

The classifiers compared are

  • plain GradientBoosting

  • uBoost

  • gradient boosting with knn-Ada loss (UGB+knnAda)

  • gradient boosting with FlatnessLoss (UGB+FlatnessLoss)

We use dataset from paper about uBoost for demonstration purposes. We have a plenty of data here, so results are quite stable

[ ]:
import uproot
from hep_ml import gradientboosting as ugb
from hep_ml import losses, uboost
# this wrapper makes it possible to train on subset of features
from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Loading data

[ ]:
used_columns = ["Y1", "Y2", "Y3", "M2AB", "M2AC"]
with uproot.open('https://cern.ch/starterkit/data/advanced-python-2019/dalitzdata.root') as datafile:
    data = datafile['tree'].arrays(library='pd')
labels = data['labels']
data = data.drop('labels', axis=1)

Distributions in the Dalitz features for signal and background

As we can see, the background is distributed mostly in the corners of Dalitz plot, and for traditional classifiers this results in poor effieciency of signal in the corners.

[ ]:
def plot_distribution(data_frame, var_name1='M2AB', var_name2='M2AC', bins=40):
    """The function to plot 2D distribution histograms"""
    plt.hist2d(data_frame[var_name1], data_frame[var_name2], bins = 40, cmap=plt.cm.Blues)

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1), plt.title("signal"),       plot_distribution(data[labels==1])
plt.subplot(1, 2, 2), plt.title("background"),   plot_distribution(data[labels==0]);

Preparation of train/test datasets

[ ]:
trainX, testX, trainY, testY = train_test_split(data, labels, random_state=42, test_size=0.5)

Setting up classifiers, training

[ ]:
uniform_features  = ["M2AB", "M2AC"]
train_features = ["Y1", "Y2", "Y3"]
n_estimators = 150
base_estimator = DecisionTreeClassifier(max_depth=4)

uBoost training takes much time, so we reduce number of efficiency_steps, use prediction smoothing and run uBoost in threads

[ ]:
classifiers = {}

classifiers['AdaBoost'] = GradientBoostingClassifier(max_depth=4, n_estimators=n_estimators, learning_rate=0.1)

knnloss = ugb.KnnAdaLossFunction(uniform_features, knn=10, uniform_label=1)
classifiers['uGB+knnAda'] = ugb.UGradientBoostingClassifier(loss=knnloss, max_depth=4, n_estimators=n_estimators,
                                        learning_rate=0.4, train_features=train_features)

classifiers['uBoost'] = uboost.uBoostClassifier(uniform_features=uniform_features, uniform_label=1,
                                     n_estimators=n_estimators, train_features=train_features,
                                     efficiency_steps=12, n_jobs=4)

flatnessloss = ugb.KnnFlatnessLossFunction(uniform_features, fl_coefficient=3., power=1.3, uniform_label=1)
classifiers['uGB+FL'] = ugb.UGradientBoostingClassifier(loss=flatnessloss, max_depth=4,
                                       learning_rate=0.1, train_features=train_features)

for clf in classifiers.values():
    if clf==classifiers['AdaBoost']:
        clf.fit(trainX[train_features], trainY)
        clf.fit(trainX[train_features + uniform_features], trainY)

Let’s look at the results of training

dependence of classification quality on the number of trees built (ROC AUC - an area under the ROC curve, the more the better)

[ ]:
from sklearn.metrics import roc_auc_score

for clf in classifiers.values():
    if clf==classifiers['AdaBoost']:
        output = clf.predict_proba(trainX[train_features])
        output = clf.predict_proba(trainX[train_features + uniform_features])
    print(f'Area under curve: {roc_auc_score(trainY, output[:,1])}')

ROC curves after training

[ ]:
from sklearn.metrics import roc_curve

def plot_roc(bdt, testY, testX, training_columns, label=None):
    y_score = bdt.predict_proba(testX[training_columns])[:,1]
    fpr, tpr, thresholds = roc_curve(testY, y_score)
    area = roc_auc_score(testY, y_score)

    plt.plot([0, 1], [0, 1], color='grey', linestyle='--')
    if label:
        plt.plot(fpr, tpr, label=f'{label} (area = {area:.2f})')
        plt.plot(fpr, tpr, label=f'ROC curve (area = {area:.2f})')
    plt.xlim(0.0, 1.0)
    plt.ylim(0.0, 1.0)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    # We can make the plot look nicer by forcing the grid to be square
    plt.gca().set_aspect('equal', adjustable='box')

plot_roc(classifiers['AdaBoost'], testY, testX, train_features, 'AdaBoost')
plot_roc(classifiers['uGB+knnAda'], testY, testX, train_features+uniform_features, 'uGB+knnAda')
plot_roc(classifiers['uBoost'], testY, testX, train_features+uniform_features, 'uBoost')
plot_roc(classifiers['uGB+FL'], testY, testX, train_features+uniform_features, 'uGB+FL')
[ ]: