Advanced Python Tutorial
Welcome to the advanced Python tutorials of the starterkit. This lecture covers multiple topics, and the available notebooks may fill more than the scheduled lesson. However, they also serve as a knowledge base that one can always come back to in order to look things up.
- 1: Basics
- Advanced Python Concepts
- Advanced Classes
- Danger zone
- 2: First look at data
- 3: Multivariate Analysis
- 4: Extension on Classification
- 5: Boosting to Uniformity
- Model tuning setup
- Cross-validation
- \(k\)-folding & early stopping
- Hyperparameter optimisation
- 6: Histograms
- 7: Demonstration of distribution reweighting
- 8: Likelihood inference
- 9: sPlot
- Simple sPlot example
- Observed distributions
- Applying sWeights
- More complex case
- sPlot
- Alternative: Known probabilities
- Fitting doesn’t give us information about real labels
- Applying sPlot
- Using sWeights to reconstruct initial distribution
- An important requirement of sPlot
- Derivation of sWeights (optional)
- Conclusion
- 10: Scikit-HEP
Pure Python, advanced
Notebook 10 starts with a recap of the basics of Python.
In notebook 11, advanced concepts such as exceptions, context managers and the factory pattern built with decorators are introduced.
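A minimal sketch of two of these ideas, a decorator-based factory registry and a context manager; the names and the registry pattern here are illustrative, not taken from the notebook:

```python
import contextlib
import time

# Registry-based factory: the decorator registers classes under a name.
MODELS = {}

def register(name):
    def decorator(cls):
        MODELS[name] = cls
        return cls
    return decorator

@register("gauss")
class GaussModel:
    def __init__(self, mu=0.0, sigma=1.0):
        self.mu, self.sigma = mu, sigma

def make_model(name, **kwargs):
    try:
        cls = MODELS[name]
    except KeyError:
        raise ValueError(f"unknown model {name!r}") from None
    return cls(**kwargs)

# A simple context manager built with contextlib: times the enclosed block.
@contextlib.contextmanager
def timer(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f}s")

with timer("build"):
    model = make_model("gauss", mu=5.0)
```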
12 is a tutorial about classes, with a focus on dunder (`__meth__`) methods; it covers everything from the simpler `__len__` and `__add__` up to the advanced `__getattr__`.
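As a rough illustration, the toy class below implements those three dunder methods; it is made up for this example and is not the class used in the notebook:

```python
class Track:
    """Toy container illustrating a few dunder methods."""

    def __init__(self, hits, meta=None):
        self.hits = list(hits)
        self._meta = dict(meta or {})

    def __len__(self):              # enables len(track)
        return len(self.hits)

    def __add__(self, other):       # track1 + track2 merges the hit lists
        if not isinstance(other, Track):
            return NotImplemented
        return Track(self.hits + other.hits, {**self._meta, **other._meta})

    def __getattr__(self, name):    # fallback lookup in the metadata dict
        try:
            return self._meta[name]
        except KeyError:
            raise AttributeError(name) from None


t = Track([1, 2, 3], meta={"detector": "VELO"}) + Track([4, 5])
print(len(t), t.detector)   # 5 VELO
```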
Data loading and plotting
Notebook 20 introduces data loading with uproot, Pandas DataFrames as the default container for columnar data on which cuts can be applied, and the plotting libraries.
More on pandas can be found in its excellent documentation or by searching the web.
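A minimal loading-and-plotting sketch, assuming uproot 4 and matplotlib; the file, tree and branch names are placeholders to be adapted to the tutorial's dataset:

```python
import uproot
import matplotlib.pyplot as plt

# File, tree and branch names below are placeholders.
with uproot.open("data/B2HHH.root") as f:
    tree = f["DecayTree"]
    df = tree.arrays(["B_M", "B_PT"], library="pd")  # branches -> pandas DataFrame

# Apply a cut with a boolean mask and plot the selected events.
selected = df[df["B_PT"] > 2000]
selected["B_M"].plot.hist(bins=100, histtype="step")
plt.xlabel("B mass [MeV]")
plt.show()
```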
Multivariate Analysis and Machine Learning
These notebooks provide an introduction to more sophisticated cuts using machine learning techniques.
30 starts out with the basics of using a BDT and scikit-learn, the standard library for this kind of problem and algorithm.
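A minimal scikit-learn sketch on toy data; the features and settings are illustrative, not those of the notebook:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy data standing in for the tutorial's signal/background samples.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(1000, 3)), rng.normal(0.5, 1, size=(1000, 3))])
y = np.hstack([np.ones(1000), np.zeros(1000)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

bdt = GradientBoostingClassifier(n_estimators=100, max_depth=3)
bdt.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, bdt.predict_proba(X_test)[:, 1]))
```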
31 demonstrates another BDT library, XGBoost, a state-of-the-art implementation of boosted decision trees.
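XGBoost follows the scikit-learn estimator API, so a sketch looks almost identical; again this runs on toy data, not the notebook's sample:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Toy signal/background sample; XGBClassifier can be swapped in for
# the GradientBoostingClassifier above.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(1000, 3)), rng.normal(0.5, 1, size=(1000, 3))])
y = np.hstack([np.ones(1000), np.zeros(1000)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
xgb.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1]))
```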
32 contains a tutorial about a special kind of BDT that actively de-correlates the BDT response from chosen variables. This is often required in physics analyses in order not to bias the inference.
33 is a currently outdated introduction to Neural Networks.
More tutorials on general machine learning algorithms can be found in the scikit-learn tutorial section.
Reweighting
Reweighting a distribution can be a useful technique to apply corrections.
45 demonstrates two methods of non-parametric reweighting of distributions in order to correct for differences between MC and data. A histogram-based approach as well as the more powerful, yet harder to control, GradientBoostingReweighter are introduced.
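A short sketch with hep_ml, which provides both flavours; the gradient-boosted reweighter there is called `GBReweighter`, which is assumed here to be what the text refers to as GradientBoostingReweighter, and the toy samples stand in for MC and data:

```python
import numpy as np
from hep_ml.reweight import BinsReweighter, GBReweighter

# Toy 1D "MC" and "data" samples with slightly different shapes.
rng = np.random.default_rng(0)
mc = rng.normal(0.0, 1.0, size=20000).reshape(-1, 1)
data = rng.normal(0.2, 1.1, size=20000).reshape(-1, 1)

# Histogram-based reweighting.
bins_rw = BinsReweighter(n_bins=50)
bins_rw.fit(original=mc, target=data)
w_bins = bins_rw.predict_weights(mc)

# Gradient-boosting-based reweighting.
gb_rw = GBReweighter(n_estimators=50, max_depth=3)
gb_rw.fit(original=mc, target=data)
w_gb = gb_rw.predict_weights(mc)

print(w_bins.mean(), w_gb.mean())
```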
Statistical Inference
The last step in most analyses is inference through a likelihood-based method. This involves repeatedly fitting a model to data or to toy datasets in order to infer the physical parameters that we are interested in, as well as their uncertainties.
50 introduces likelihood model fitting with the zfit library. While the main introduction points towards an actual zfit tutorial, it provides a guide to implementing the fit to the data obtained in the previous tutorials. hepstats is also introduced, as it works with zfit models and is used to estimate the significance of the discovery.
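A minimal zfit sketch of such a fit, using a placeholder observable and toy data instead of the tutorial's dataset; the hepstats significance step is omitted here:

```python
import numpy as np
import zfit

# Placeholder observable range and toy data; in the tutorial the data
# comes from the previous notebooks.
obs = zfit.Space("B_M", limits=(5200, 5400))
data_np = np.random.normal(5280, 15, size=5000)
data = zfit.Data.from_numpy(obs=obs, array=data_np)

# A simple Gaussian signal model with free mean and width.
mu = zfit.Parameter("mu", 5280, 5250, 5310)
sigma = zfit.Parameter("sigma", 20, 1, 100)
model = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)

# Build the unbinned negative log-likelihood and minimise it.
nll = zfit.loss.UnbinnedNLL(model=model, data=data)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
result.hesse()  # parameter uncertainties
print(result.params)
```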
More tutorials on zfit as well as hepstats are available.
sPlot technique
The sPlot, or sWeights, technique is introduced: a technique to statistically subtract the background events in a variable. The technique as well as the library used to obtain the weights are demonstrated.
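The core of the technique fits into a few lines of numpy. The following is a schematic implementation of the sWeights formula for a two-component (signal plus background) fit, not the library used in the notebook; the pdf shapes and yields in the usage example are invented for illustration:

```python
import numpy as np

def sweights(f_sig, f_bkg, n_sig, n_bkg):
    """Signal sWeights from per-event pdf values and fitted yields."""
    F = np.vstack([f_sig, f_bkg])     # shape (2, n_events)
    yields = np.array([n_sig, n_bkg])
    total = yields @ F                # N_s f_s + N_b f_b per event
    # Inverse covariance matrix: V^-1_ij = sum_e f_i f_j / total^2
    Vinv = (F / total) @ (F / total).T
    V = np.linalg.inv(Vinv)
    # Signal weights: w_s(e) = sum_j V_sj f_j(e) / total(e)
    return (V[0] @ F) / total

# Toy example: Gaussian signal on an exponential background in a mass variable.
rng = np.random.default_rng(1)
m = np.concatenate([rng.normal(5280, 15, 1000), 5200 + rng.exponential(80, 2000)])
# Per-event pdf values (here taken from the true shapes; in practice from the fit).
f_s = np.exp(-0.5 * ((m - 5280) / 15) ** 2) / (15 * np.sqrt(2 * np.pi))
f_b = np.exp(-(m - 5200) / 80) / 80
w = sweights(f_s, f_b, n_sig=1000, n_bkg=2000)
print(w.sum())   # roughly the signal yield
```

Histogramming any other (uncorrelated) variable with these weights statistically removes the background contribution, which is exactly what the notebook demonstrates with its library of choice.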
Scikit-HEP
Many of the libraries seen in this tutorial are part of Scikit-HEP, the HEP Python ecosystem. Not all of its packages have been used, though, and in tutorial 70 a few smaller, yet useful, packages are presented to give an idea of what is available.
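For a flavour of these smaller packages, here is a quick example with particle and hepunits; they are picked as illustrations and are not necessarily the ones covered in notebook 70:

```python
# Two of the smaller Scikit-HEP packages.
from particle import Particle
from hepunits import GeV, MeV

pion = Particle.from_pdgid(211)   # look up particle properties by PDG ID
print(pion.name, pion.mass)       # mass in MeV

energy = 5.0 * GeV
print(energy / MeV)               # unit conversion: 5000.0
```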