This lesson is being piloted (Beta version)

Introduction: Python background

Overview

Teaching: 20 min
Exercises: 5 min
Questions
  • How much Python do we need to know?

  • What is “array-oriented” or “columnar” processing?

  • What is a “software ecosystem”?

Objectives
  • Get the background context to get started with the Uproot lessons.

How much Python do we need to know?

Basic Python

Most students start this module after a Python introduction or are already familiar with the basics of Python. For instance, you should be comfortable with the Python essentials such as assigning variables,

x = 5

if-statements,

if x < 5:
    print("small")
else:
    print("big")

and loops.

for i in range(x):
     print(i)

Your data analysis will likely be full of statements like these, though the kinds of operations we’ll be focusing on in this module have a form more like

import compiled_library
compiled_library.do_computationally_expensive_thing(big_array)

The trick is for the Python-side code to be expressive enough and the compiled code to be general enough that you don’t need a new compiled_library for each thing you want to do. The libraries presented in this module are designed with interfaces that let you express what you want to do in Python and have it run in compiled code.

Dict-like and array-like interfaces

The two most important data types for these interfaces are dicts

some_dict = {"word": 1, "another word": 2, "some other word": 3}
some_dict["some other word"]
for key in some_dict:
    print(key, some_dict[key])

and arrays

import numpy as np
some_array = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
some_array[4]
for x in some_array:
    print(x)

Although these data types are important, what’s more important is the interfaces to these types. They both represent functions from keys to values:

Most of the types we will see in this module also map strings or integers to data, and they use the same syntax as dicts and arrays. If you’re familiar with the dict and array interfaces, you usually won’t have to look up documentation on dict-like and array-like types unless you’re trying to do something special.

Review of the dict interface

For dicts, the things that can go in square brackets (its “domain,” as a function) are its keys.

some_dict = {"word": 1, "another word": 2, "some other word": 3}
some_dict.keys()

Nothing other than these keys can be used in square brackets

some_dict["something I made up"]

unless it has been added to the dict.

some_dict["something I made up"] = 123
some_dict["something I made up"]

The things that can come out of a dict (its “range,” as a function) are its values.

some_dict.values()

You can get keys and values as 2-tuples by asking for its items.

some_dict.items()

for key, value in some_dict.items():
    print(key, value)

Review of the array interface

For arrays, the things that can go in square brackets (its “domain,” as a function) are integers from zero up to but not including its length.

import numpy as np
some_array = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
some_array[0]   # okay
some_array[1]   # okay
some_array[9]   # okay
some_array[10]  # not okay

You can get the length of an array (and the number of keys in a dict) with len:

len(some_array)

It’s important to remember that index 0 corresponds to the first item in the array, 1 to the second, and so on, which is why len(some_array) is not a valid index.

Negative indexes are allowed, but they count from the end of the list to the beginning. For instance,

some_array[-1]

returns the last item. Quick quiz: which negative value returns the first item, equivalent to 0?

Arrays can also be “sliced” by putting a colon (:) between the starting and stopping index.

some_array[2:7]

Quick quiz: why is 7.7 not included in the output?

The above is common to all Python sequences. Arrays, however, can be multidimensional and this allows for more kinds of slicing.

array3d = np.arange(2*3*5).reshape(2, 3, 5)

Separating two slices in the square brackets with a comma

array3d[:, 1:, 1:]

selects the following:

array3d-highlight1

Quick quiz: how do you select the following?

array3d-highlight2

Filtering with booleans and integers: “cuts”

In addition to integers and slices, arrays can be included in the square brackets.

An array of booleans with the same length as the sliced array selects all items that line up with True.

some_array    = np.array([ 0.0,  1.1,  2.2,  3.3,  4.4,   5.5,  6.6,   7.7,  8.8,   9.9])
boolean_array = np.array([True, True, True, True, True, False, True, False, True, False])

some_array[boolean_array]

An array of integers selects items by index.

some_array    = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
integer_array = np.array([  0,   1,   2,   3,   4,        6,        8     ])

some_array[integer_array]

Integer-slicing is more general than boolean-slicing because an array of integers can also change the order of the data and repeat items.

some_array[np.array([4, 2, 2, 2, 9, 8, 3])]

Both come up in natural contexts. Boolean arrays often come from performing a calculation on all elements of an array that returns boolean values.

even_valued_items = (some_array * 10 % 2 == 0)

some_array[even_valued_items]

This is how we’ll be computing and applying cuts: expressions like

good_muon_cut = (muons.pt > 10) & (abs(muons.eta) < 2.4)

good_muons = muons[good_muon_cut]

Logical operators: &, |, ~, and parentheses

If you’re coming from C++, you might expect “and,” “or,” “not” to be &&, ||, !.

If you’re coming from non-array Python, you might expect them to be and, or, not.

In array expressions (unfortunately!), we have to use Python’s bitwise operators, &, |, ~, and ensure that comparisons are surrounded in parentheses. Python’s and, or, not are not applied across arrays and bitwise operators have a surprising operator-precedence.

x = 0

x > -10 & x < 10  # probably not what you expect!

(x > -10) & (x < 10)

bitwise-operator-parentheses

What is “array-oriented” or “columnar” processing?

Expressions like

even_valued_items = (some_array * 10 % 2 == 0)

perform the *, %, and == operations on every item of some_array and return arrays. Without NumPy, the above would have to be written as

even_valued_items = []

for x in some_array:
    even_valued_items.append(x * 10 % 2 == 0)

This is more cumbersome when you want to apply a mathematical formula to every item of a collection, but it is also considerably slower. Every step in a Python for loop performs sanity checks that are unnecessary for numerical values with uniform type, checks that would happen at compile-time in a compiled library. NumPy is a compiled library; its *, %, and == operators, as well as many other functions, are performed in fast, compiled loops.

This is how we get expressiveness and speed. Languages with operators that apply array at a time, rather than one scalar value at a time, are called “array-oriented” or “columnar” (referring to, for instance, Pandas DataFrame columns).

Quite a few interactive, data-analysis languages are array-oriented, deriving from the original APL. “Array-oriented” is a programming paradigm in the same sense as “functional” or “object-oriented.”

apl-timeline

What is a “software ecosystem”?

Some programming environments, like Mathematica, Emacs, and ROOT, attempt to provide you with everything you need in one package. There’s only one thing to install and components within the framework should work well together because they were developed together. However, it can be hard to use the framework with other, unrelated software packages.

Ecosystems, like UNIX, iOS App Store, and Python, consist of many small software packages that each do one thing and know how to communicate with other packages through conventions and protocols. There’s usually a centralized installation mechanism, and it is the user’s (your) responsibility to piece together what you need. However, the set of possibilities is open-ended and grows as needs develop.

In mainstream Python, this means that

Python packages for high-energy physics are being developed with a similar model:

To make things easier to find, they’re cataloged under a common name at scikit-hep.org.

scikit-hep-logos

Key Points

  • Be familiar with the syntax of Python dicts, NumPy arrays, slicing rules, and bitwise logic operators.

  • Large-scale computations in Python tend to be performed one array at a time, rather than one scalar operation at a time.

  • You, as a user, will likely be gluing together many packages in each data analysis.