Lesson 2: Array-oriented programming#

Programming paradigms, such as “imperative”, “object-oriented”, and “functional”, are general styles of programming. Some programming languages favor one paradigm over the others, or make one easier to use than the others. For instance, C++ extends C with features for object-oriented programming.

In array-oriented programming, the primary data type is an array and most functions perform one operation on all values of the array.

Simple example of imperative, functional, and array-oriented#

Below are three solutions to the following problem: compute the square of each value in a collection of numbers.

Imperative#

Imperative programming is likely to be the most familiar: you supply a sequence of commands and the computer follows them one by one.

original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

result = []
for x in original:
    result.append(x**2)

result
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Functional#

Functional programming takes seriously the idea that functions are objects that can be passed around like any other data. Functional programming environments have a suite of predefined functions-that-take-functions such as map (apply a given function to every member of a collection, returning a collection), filter (use a boolean function to determine which values to keep), and reduce (apply a function to pairs of values in the collection to return a scalar).

In the following, Python’s lambda keyword defines a function as an expression without assigning it to a variable name (as def does).

original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

result = list(map(lambda x: x**2, original))

result
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
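The other two functions-that-take-functions mentioned above, filter and reduce, work the same way; here is a quick sketch of my own (not part of the original example):

```python
from functools import reduce

original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# filter: keep only the values for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, original))   # [2, 4, 6, 8, 10]

# reduce: combine the values pairwise down to a single scalar
sum_of_squares = reduce(lambda a, b: a + b, map(lambda x: x**2, original))  # 385
```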

Python is not very friendly toward functional programming, since many of its functions (like list.append) change data in ways other than through the return value, and because the syntax for creating a lambda function is so long. (Compare lambda x: x**2 in Python with \x -> x**2 in Haskell.)

In Python, we often use list comprehensions where we would otherwise use map and filter.

original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

result = [x**2 for x in original]

result
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
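A comprehension with an if clause plays the role of filter as well; a small sketch of my own (not part of the lesson):

```python
original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# equivalent to map + filter: square only the even values
even_squares = [x**2 for x in original if x % 2 == 0]   # [4, 16, 36, 64, 100]
```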

Array-oriented#

Array-oriented programming is a little different from functional programming: instead of writing (lambda) functions and passing them to explicit calls of map, filter, and reduce, you use precompiled functions implicitly when you write expressions involving arrays.

import numpy as np
original = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

result = original**2

result
array([  1,   4,   9,  16,  25,  36,  49,  64,  81, 100])

This formalism has the property that any mathematical expression on scalar numbers can also be applied to equal-length arrays of numbers. For example,

def quadratic_formula(a, b, c):
    return (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)

computes the (positive) quadratic formula

\[ \frac{-b + \sqrt{b^2 - 4ac}}{2a} \]

once if a, b, and c are scalar numbers,

a = 5
b = 10
c = -0.1

quadratic_formula(a, b, c)
np.float64(0.009950493836207741)

or a million times if they are arrays of length 1 million.

a = np.random.uniform(5, 10, 1000000)
b = np.random.uniform(10, 20, 1000000)
c = np.random.uniform(-0.1, 0.1, 1000000)

quadratic_formula(a, b, c)
array([ 2.92790688e-03,  5.71328018e-03,  4.06160812e-05, ...,
       -1.42225626e-03,  5.70765555e-04,  1.24922728e-03])

Complex example of imperative, functional, and array-oriented#

With a simple example, it’s easy to see exactly what the code is doing, but a simple example doesn’t convey the “flavor” of how these paradigms differ in practice.

Here is a complex example. You don’t have to understand the details of how each solution works—just see how they follow different approaches and how the code generally looks. The problem is to compute gravitational forces among \(n\) particles in 3 dimensions.

Imperative#

def imperative_forces(m, x, p):
    total_force = np.zeros_like(x)

    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            mi, mj = m[i], m[j]
            xi, xj = x[i], x[j]
            pi, pj = p[i], p[j]
            displacement = [
                xj[0] - xi[0],
                xj[1] - xi[1],
                xj[2] - xi[2],
            ]
            distance = np.sqrt(displacement[0]**2 + displacement[1]**2 + displacement[2]**2)
            direction = [
                displacement[0] / distance,
                displacement[1] / distance,
                displacement[2] / distance,
            ]
            force = [
                G * mi * mj * direction[0] / distance**2,
                G * mi * mj * direction[1] / distance**2,
                G * mi * mj * direction[2] / distance**2,
            ]
            total_force[i, 0] += force[0]
            total_force[i, 1] += force[1]
            total_force[i, 2] += force[2]
            total_force[j, 0] += -force[0]
            total_force[j, 1] += -force[1]
            total_force[j, 2] += -force[2]

    return total_force

The imperative solution is a long sequence of commands in nested for loops.

Functional#

from functools import reduce
from itertools import combinations
def functional_forces(m, x, p):
    def negate(vector):
        return [-a for a in vector]

    def add(*vectors):
        return [reduce(lambda a, b: a + b, components) for components in zip(*vectors)]

    def subtract(vectorA, vectorB):
        return add(vectorA, negate(vectorB))

    def magnitude(vector):
        return np.sqrt(reduce(lambda a, b: a + b, map(lambda a: a**2, vector)))

    def force(mi, mj, xi, xj, pi, pj):
        displacement = subtract(xi, xj)
        distance = magnitude(displacement)
        direction = [a / distance for a in displacement]
        return [G * mi * mj * a / distance**2 for a in direction]

    pairwise_forces = [
        ((i, j), force(mi, mj, xi, xj, pi, pj))
        for ((i, (mi, xi, pi)), (j, (mj, xj, pj))) in combinations(enumerate(zip(m, x, p)), 2)
    ]

    def partial_forces(pairwise_forces, i):
        return (
            [force for (_, check), force in pairwise_forces if i == check] +
            [negate(force) for (check, _), force in pairwise_forces if i == check]
        )

    return np.array([add(*partial_forces(pairwise_forces, i)) for i in range(len(m))])

The functional solution defines many small functions and applies them in two list comprehensions.

Array-oriented#

def array_forces(m, x, p):
    i, j = np.triu_indices(len(x), k=1)
    pw_displacement = x[j] - x[i]
    pw_distance = np.sqrt(np.sum(pw_displacement**2, axis=-1))
    pw_direction = pw_displacement / pw_distance[:, np.newaxis]
    pw_force = G * m[i, np.newaxis] * m[j, np.newaxis] * pw_direction / pw_distance[:, np.newaxis]**2
    total_force = np.zeros_like(x)
    np.add.at(total_force, i, pw_force)
    np.add.at(total_force, j, -pw_force)
    return total_force

The array-oriented solution is very concise—maybe even “dense”—and it involves special NumPy functions such as np.triu_indices for the \(n (n - 1) / 2\) combinatorics and advanced slicing in square brackets ([ ]).
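Two NumPy idioms in array_forces deserve a note: `[:, np.newaxis]` reshapes a length-n array to shape (n, 1) so that broadcasting lines it up against an (n, 3) array, and np.add.at accumulates into repeated indices, which plain `total[i] += ...` would not. A minimal sketch with made-up values:

```python
import numpy as np

displacement = np.array([[1.0, 0.0, 0.0],
                         [0.0, 2.0, 0.0]])               # shape (2, 3)
distance = np.sqrt(np.sum(displacement**2, axis=-1))     # shape (2,): [1.0, 2.0]

# broadcasting: (2, 3) divided by (2, 1) scales each row by its own distance
direction = displacement / distance[:, np.newaxis]       # unit vectors

# np.add.at accumulates contributions even when an index repeats
total = np.zeros(3)
np.add.at(total, np.array([0, 0, 2]), np.array([1.0, 1.0, 5.0]))
# total is now [2.0, 0.0, 5.0]
```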

Comparison of the three solutions#

First, let’s just verify that they give the same result in a numerical example.

# three masses
m = np.array([2, 10, 1.1])

# initial position (x) and momentum (p) for each of the three
x = np.array([[0.1, 0.1, 0], [2, 0.9, 0.1], [-5, 1.5, -0.1]])
p = np.array([[3, -1, 0.5], [-13, 0, -0.2], [-10, 0.1, 0]])

G = 1
imperative_test = imperative_forces(m, x, p)
imperative_test
array([[ 4.24603602,  1.84053459,  0.22597908],
       [-4.54361383, -1.80071551, -0.23380178],
       [ 0.29757782, -0.03981909,  0.0078227 ]])
functional_test = functional_forces(m, x, p)
functional_test
array([[ 4.24603602,  1.84053459,  0.22597908],
       [-4.54361383, -1.80071551, -0.23380178],
       [ 0.29757782, -0.03981909,  0.0078227 ]])
array_test = array_forces(m, x, p)
array_test
array([[ 4.24603602,  1.84053459,  0.22597908],
       [-4.54361383, -1.80071551, -0.23380178],
       [ 0.29757782, -0.03981909,  0.0078227 ]])
np.allclose(imperative_test, functional_test)
True
np.allclose(functional_test, array_test)
True

(It could be more carefully tested at more points.)

Now compare the time to compute the forces among a sample of 500 planets.

m = np.ones(500)
x = np.random.normal(0, 1, (500, 3))
p = np.random.normal(0, 1, (500, 3))
%%timeit -n1 -r1

imperative_forces(m, x, p)
593 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n1 -r1

functional_forces(m, x, p)
4.39 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n1 -r1

array_forces(m, x, p)
23.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Note the units: the array-oriented solution is roughly 25× faster than the imperative one and nearly 200× faster than the functional one.

There are good reasons for this, which we’ll examine in the next section.

For now, though, which one would you choose if you wanted to calculate forces repeatedly, to make animations of planetary orbits?

import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML
def array_step(m, x, p, dt):
    # this is a numerically stable way of updating positions, momenta, and forces
    p += array_forces(m, x, p) * (dt/2)    # half kick
    x += p * dt / m[:, np.newaxis]         # full drift
    p += array_forces(m, x, p) * (dt/2)    # half kick
def plot(m, x, p, dt, num_frames=100, steps_per_frame=10):
    num_particles = len(m)

    history = np.empty((num_frames, num_particles, 2))
    for i in range(num_frames):
        history[i, :, 0] = x[:, 0]
        history[i, :, 1] = x[:, 1]
        for _ in range(steps_per_frame):
            array_step(m, x, p, dt)

    fig, ax = plt.subplots(figsize=(5, 5))

    lines = []
    for j in range(num_particles):
        lines.append(ax.plot(history[:1, j, 0], history[:1, j, 1])[0])
    dots = ax.scatter(history[0, :, 0], history[0, :, 1])

    ax.set_xlim(-2, 2)
    ax.set_ylim(-2, 2)

    def update(i):
        for j, line in enumerate(lines):
            line.set_xdata(history[:i, j, 0])
            line.set_ydata(history[:i, j, 1])
        dots.set_offsets(history[i, :, :])
        return [*lines, dots]

    ani = animation.FuncAnimation(fig=fig, func=update, frames=num_frames, interval=50, blit=True)
    out = HTML(ani.to_jshtml())
    plt.close()
    return out

Double-planet orbits a star:

m = np.array([100, 1, 1], np.float64)
x = np.array([[0, 0, 0], [0, 0.9, 0], [0, 1.1, 0]], np.float64)
p = np.array([[0, 0, 0], [-13, 0, 0], [-10, 0, 0]], np.float64)

plot(m, x, p, dt=0.001)

Solution to the three-body problem!

a = 0.347111
b = 0.532728
m = np.array([1, 1, 1], np.float64)
x = np.array([[-1, 0, 0], [1, 0, 0], [0, 0, 0]], np.float64)
p = np.array([[a, b, 0], [a, b, 0], [-2 * a, -2 * b, 0]], np.float64)

plot(m, x, p, dt=0.01)

Chaos!

m = np.ones(25)
x = np.random.normal(0, 1, (25, 3))
p = np.random.normal(0, 1, (25, 3))

plot(m, x, p, dt=0.0025)

Why is array-oriented programming fast?#

A better question is, “Why is Python code slow?”

Roughly speaking, there are two types of languages—the “compile-first” ones, which are fast, and interactive ones, which are slow. Execution times for 10 scientific problems implemented in 12 languages (from the Computer Language Benchmarks Game) show this split clearly.

Python, Ruby, Perl, Lua, and PHP all have the following in common:

  • they run instructions on an emulated virtual machine, rather than the physical computer,

  • they are dynamically typed, checking the types of each function call immediately before execution, rather than once in a compilation phase,

  • they all do many other dynamic, high-level things, like using garbage collectors, representing all values as objects, managing their own heap of memory, etc.

Thus, there’s a speed penalty for every line of Python code that is much larger than an equivalent line of compiled code.

However, a single NumPy function call that performs a million numerical calculations avoids Python’s speed trap without giving up on interactivity. In the gravitational force calculation on 500 planets (above),

  • imperative_forces makes Python loop over 500 × 499 / 2 pairs of planets, and on each pair it steps through 25 lines of Python code (214 virtual machine instructions),

  • functional_forces has a Python list comprehension over 500 × 499 / 2 pairs of planets, each of which calls 6 functions with a total of 12 lines of Python code (100 virtual machine instructions), plus all of the function call-stack overhead,

  • array_forces steps through 9 lines of Python code (103 virtual machine instructions) once. The loop over 500 × 499 / 2 pairs of planets is in compiled code, not Python code.

Here is a similar argument for the simple problem of squaring numbers:

original = np.random.poisson(3.14, 1000000)
def imperative_squares(original):
    result = []
    for x in original:
        result.append(x**2)
    return result
%%timeit

imperative_squares(original)
97.6 ms ± 469 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
import dis  # disassembler; shows us the Python virtual machine instructions
dis.dis(imperative_squares)
  1           0 RESUME                   0

  2           2 BUILD_LIST               0
              4 STORE_FAST               1 (result)

  3           6 LOAD_FAST                0 (original)
              8 GET_ITER
        >>   10 FOR_ITER                26 (to 64)
             12 STORE_FAST               2 (x)

  4          14 LOAD_FAST                1 (result)
             16 LOAD_METHOD              0 (append)
             38 LOAD_FAST                2 (x)
             40 LOAD_CONST               1 (2)
             42 BINARY_OP                8 (**)
             46 PRECALL                  1
             50 CALL                     1
             60 POP_TOP
             62 JUMP_BACKWARD           27 (to 10)

  5     >>   64 LOAD_FAST                1 (result)
             66 RETURN_VALUE

It steps from FOR_ITER to JUMP_BACKWARD a million times and has to verify that x is an integer every time it gets to BINARY_OP.

On the other hand, in the array-oriented solution:

def array_squares(original):
    return original**2
%%timeit

array_squares(original)
207 μs ± 1.44 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
dis.dis(array_squares)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (original)
              4 LOAD_CONST               1 (2)
              6 BINARY_OP                8 (**)
             10 RETURN_VALUE

It steps through the above once and runs NumPy’s compiled code when it gets to BINARY_OP. It checks array types once so that it can assume that the numeric types are correct in the loop over a million integers. The physical CPU applies even more optimizations (memory prefetching, instruction pipelining, and maybe vector instructions).

Why does it use less memory?#

Arrays also use less memory than Python lists.

from pympler.tracker import SummaryTracker
memory_tracker = SummaryTracker()
memory_tracker.diff()

list_of_numbers = list(range(1000000))
array_of_numbers = np.arange(1000000)

memory_tracker.print_diff()

del list_of_numbers, array_of_numbers
          types |   # objects |   total size
=============== | =========== | ============
            int |      999742 |     26.70 MB
           list |           2 |      7.63 MB
  numpy.ndarray |           1 |      7.63 MB
            str |           1 |     70     B

The base list and NumPy array are both 7.63 MB, but the Python list also needs 26.70 MB for the 999742 integer objects it points to.

Naively, 1000000 × 8-byte integers should be

1000000 * 8 / 1024**2
7.62939453125

MB. So what does Python need the extra 26.70 MB for?

Here’s the memory layout of a Python list of numbers:

and here’s the memory layout of a NumPy array of numbers:
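The layout diagrams are not reproduced here, but the overhead can be measured directly with sys.getsizeof; the sizes below are typical of 64-bit CPython and may vary by version and platform:

```python
import sys
import numpy as np

# every Python int is a full heap object (type pointer, reference count, digits)
int_size = sys.getsizeof(12345)               # typically 28 bytes

# a list stores 8-byte pointers to those objects, on top of the objects themselves
list_of_numbers = list(range(1000))
list_bytes = sys.getsizeof(list_of_numbers)   # roughly 8 bytes per pointer plus a header

# a NumPy array stores the raw values contiguously, with a single small header
array_of_numbers = np.arange(1000, dtype=np.int64)
array_bytes = array_of_numbers.nbytes         # exactly 8000 bytes of payload
```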

So if you’re dealing with large datasets in Python, putting them in arrays is a good idea!

Expressiveness of array-oriented programming#

The performance advantages of array-oriented programming would not be useful if it were hard to express what you need to calculate. Unlike functional programming, you cannot write an arbitrary function and map it over a dataset: you have to piece together the calculation you want to perform from the precompiled functions NumPy provides.

As an indication of what array-oriented programming is good for, here’s a timeline of all the array-oriented programming languages and libraries (that I know about):

All of them are interactive, data analysis environments. They all feature

  • an interactive REPL (read-evaluate-print loop),

  • concise notation,

  • a mathematical outlook.

This paradigm started mathematical and concise (perhaps too much): in the original language, APL,

np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

was

(⍳10) - 1
np.sum(np.arange(1, 11))
np.int64(55)

was

+/⍳10

and

np.cumsum(np.arange(1, 11))
array([ 1,  3,  6, 10, 15, 21, 28, 36, 45, 55])

was

+\⍳10

It had a special keyboard for mathematical symbols and writing a one-line program was hard enough to celebrate by printing it on a T-shirt.

Modern array-oriented environments like NumPy are a step back, away from extreme conciseness.

Example of an array-oriented conversation#

A programming paradigm brings some concepts into the foreground and pushes others into the background:

paradigm              | emphasizes
--------------------- | -----------------------------
imperative/procedural | low-level algorithms
object-oriented       | large-scale program structure
actor-based           | temporal locality
literate              | human instruction
event-driven          | cause-and-effect
declarative           | properties of desired result
symbolic              | formula transformations
functional            | data transformations
array-oriented        | data distributions

Array-oriented programming emphasizes data distributions. Each precompiled step, which you can’t pause to examine or debug, iterates over all values in an array; between steps, you interactively decide which operation to perform next. It’s one mathematical operation at a time, rather than one data point at a time.

Suppose that you have a large dataset.

dataset = np.random.normal(0, 1, 1000000)  # one MILLION data points

You’re not very interested in the first, or second, or third value in the dataset. You’re interested in their distribution, so you plot it.

plt.hist(dataset, bins=100, range=(-5, 5));

It’s a Gaussian (normal) distribution.

What happens if we square all values in the dataset?

dataset2 = dataset**2

plt.hist(dataset2, bins=100, range=(-5, 5));

Okay, what if we compute this function?

dataset3 = np.sin(1/dataset2)

What happens to the distribution? Can you guess? (I can’t.)

plt.hist(dataset3, bins=100, range=(-1, 1));

In a data analysis, you’ll be investigating some effect, such as a source of systematics. You’ll apply a cut and look at the result—maybe that makes things better. You’ll try to apply a transformation to narrow in on the effect—maybe that makes things worse, so you back up to what you had previously. Then you apply it step by step and discover your error on one of the steps. Then you put the whole expression together again and move on.

This combination of interactivity between operations and high performance within operations over a whole dataset is a perfect fit for investigative data analysis.

NumPy features#

The main features you’ll likely use from NumPy are:

  • reading and creating arrays,

  • elementwise arithmetic operations (+ - * /, etc.) and comparisons (== != < <= > >=),

  • universal functions (ufunc), which are precompiled elementwise functions like sqrt, sin, cos,

  • slicing in ranges (like Python lists), in multiple dimensions, and advanced slicing with arrays of booleans and integers,

  • reductions, such as sum, min/max, mean over an axis or over everything,

  • sorting, concatenation, linear algebra, …

For most of these, the documentation is the best source. Here, let’s focus on slicing and reductions.
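Before narrowing in on those two, here is a quick sketch of the elementwise operations, ufuncs, and reductions from the list above (arbitrary values of my choosing):

```python
import numpy as np

arr = np.array([1.0, 4.0, 9.0, 16.0])

plus = arr + 100      # elementwise arithmetic: [101., 104., 109., 116.]
mask = arr > 5        # elementwise comparison, a boolean array: [False, False, True, True]
roots = np.sqrt(arr)  # a ufunc applied to every element: [1., 2., 3., 4.]
total = np.sum(arr)   # a reduction, collapsing the array to a scalar: 30.0
```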

Slicing#

Python’s slicing syntax, which works on containers like list and tuple, is

container[start:stop:step]
  • If start or stop is non-negative, it counts from the beginning of the container with 0 as the first item.

  • If start or stop is negative, it counts from the end of the container, with -1 referring to the last item.

  • The step is the stride between items: 1 takes every item, 2 takes every other item, -1 goes backward, and 0 is not allowed.

  • Any of the three can be left out to get a reasonable default.

container = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
container[2:]
[2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
container[:5]
[0.0, 1.1, 2.2, 3.3, 4.4]
container[-6:-2]
[4.4, 5.5, 6.6, 7.7]
container[4:10:2]
[4.4, 6.6, 8.8]
container[::3]
[0.0, 3.3, 6.6, 9.9]

NumPy arrays can be sliced like this, but they can also include a comma (,) to slice over multiple dimensions.

arr = np.array([[1.1, 2.2, 3.3],
                [4.4, 5.5, 6.6],
                [7.7, 8.8, 9.9]])
arr
array([[1.1, 2.2, 3.3],
       [4.4, 5.5, 6.6],
       [7.7, 8.8, 9.9]])
arr[:2, 1:]
array([[2.2, 3.3],
       [5.5, 6.6]])
arr[2:, :]
array([[7.7, 8.8, 9.9]])
arr[:, :2]
array([[1.1, 2.2],
       [4.4, 5.5],
       [7.7, 8.8]])
arr[1:2, :2]
array([[4.4, 5.5]])

Mini-quiz 1: Given this 3D array,

array3d = np.arange(2 * 3 * 5).reshape(2, 3, 5)
array3d
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],

       [[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])

you can select items with

array3d[:, 1:, 1:]
array([[[ 6,  7,  8,  9],
        [11, 12, 13, 14]],

       [[21, 22, 23, 24],
        [26, 27, 28, 29]]])

Write a slice that selects these elements:

Mini-quiz 2: Use slicing and elementwise subtraction (-) together to find the sizes of the spaces between consecutive elements in the following array. (They’re all 1.1.)

array = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
array
array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

Hint:

Advanced slicing#

In addition, NumPy can slice arrays with arrays. If the array in square brackets ([ ]) has boolean values, it must have the same length (shape) as the array being sliced; items that line up with True are kept, and items that line up with False are dropped.

arr  = np.array([  0.0,   1.1,   2.2,   3.3,   4.4,  5.5,   6.6,  7.7,   8.8,  9.9])
mask = np.array([False, False, False, False, False, True, False, True, False, True])
#                                                    5.5          7.7          9.9
arr[mask]
array([5.5, 7.7, 9.9])

In particle physics, this is usually used to select events, as a cut.

cut = arr > 3
cut
array([False, False, False,  True,  True,  True,  True,  True,  True,
        True])
arr[cut]
array([3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

If the array in square brackets ([ ]) has integer values, it picks out elements by index.

arr[np.array([5, 7, -1])]
array([5.5, 7.7, 9.9])

Unlike boolean masking, integer slicing can select items out of order.

arr[np.array([-1, 7, 5])]
array([9.9, 7.7, 5.5])

And they can also include duplicates.

arr[np.array([-1, -1, -1, 7, 7, 5])]
array([9.9, 9.9, 9.9, 7.7, 7.7, 5.5])

Integer array slicing is the most general kind of slicing: if you think of an array as a function from index position (integers) to values, integer array slicing is function composition. Consider this example of text vectorization:

text = """
WOULD YOU LIKE GREEN EGGS AND HAM?

I DO NOT LIKE THEM, SAM-I-AM.
I DO NOT LIKE GREEN EGGS AND HAM.

WOULD YOU LIKE THEM HERE OR THERE?

I WOULD NOT LIKE THEM HERE OR THERE.
I WOULD NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU LIKE THEM IN A HOUSE?
WOULD YOU LIKE THEN WITH A MOUSE?

I DO NOT LIKE THEM IN A HOUSE.
I DO NOT LIKE THEM WITH A MOUSE.
I DO NOT LIKE THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU EAT THEM IN A BOX?
WOULD YOU EAT THEM WITH A FOX?

NOT IN A BOX. NOT WITH A FOX.
NOT IN A HOUSE. NOT WITH A MOUSE.
I WOULD NOT EAT THEM HERE OR THERE.
I WOULD NOT EAT THEM ANYWHERE.
I WOULD NOT EAT GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.
"""
# remove punctuation and make an array of words
words = np.array(text.replace(",", " ").replace(".", " ").replace("?", " ").replace("!", " ").replace("-", " ").split())

# find the unique words and get an index of where they are in the corpus
dictionary, index = np.unique(words, return_inverse=True)
dictionary
array(['A', 'AM', 'AND', 'ANYWHERE', 'BOX', 'DO', 'EAT', 'EGGS', 'FOX',
       'GREEN', 'HAM', 'HERE', 'HOUSE', 'I', 'IN', 'LIKE', 'MOUSE', 'NOT',
       'OR', 'SAM', 'THEM', 'THEN', 'THERE', 'WITH', 'WOULD', 'YOU'],
      dtype='<U8')
index
array([24, 25, 15,  9,  7,  2, 10, 13,  5, 17, 15, 20, 19, 13,  1, 13,  5,
       17, 15,  9,  7,  2, 10, 24, 25, 15, 20, 11, 18, 22, 13, 24, 17, 15,
       20, 11, 18, 22, 13, 24, 17, 15, 20,  3, 13,  5, 17, 15,  9,  7,  2,
       10, 13,  5, 17, 15, 20, 19, 13,  1, 24, 25, 15, 20, 14,  0, 12, 24,
       25, 15, 21, 23,  0, 16, 13,  5, 17, 15, 20, 14,  0, 12, 13,  5, 17,
       15, 20, 23,  0, 16, 13,  5, 17, 15, 20, 11, 18, 22, 13,  5, 17, 15,
       20,  3, 13,  5, 17, 15,  9,  7,  2, 10, 13,  5, 17, 15, 20, 19, 13,
        1, 24, 25,  6, 20, 14,  0,  4, 24, 25,  6, 20, 23,  0,  8, 17, 14,
        0,  4, 17, 23,  0,  8, 17, 14,  0, 12, 17, 23,  0, 16, 13, 24, 17,
        6, 20, 11, 18, 22, 13, 24, 17,  6, 20,  3, 13, 24, 17,  6,  9,  7,
        2, 10, 13,  5, 17, 15, 20, 19, 13,  1])

The index is a vector encoding the text corpus. The first few words are

dictionary[index[0]]
np.str_('WOULD')
dictionary[index[1]]
np.str_('YOU')
dictionary[index[2]]
np.str_('LIKE')

and the whole text is

dictionary[index]
array(['WOULD', 'YOU', 'LIKE', 'GREEN', 'EGGS', 'AND', 'HAM', 'I', 'DO',
       'NOT', 'LIKE', 'THEM', 'SAM', 'I', 'AM', 'I', 'DO', 'NOT', 'LIKE',
       'GREEN', 'EGGS', 'AND', 'HAM', 'WOULD', 'YOU', 'LIKE', 'THEM',
       'HERE', 'OR', 'THERE', 'I', 'WOULD', 'NOT', 'LIKE', 'THEM', 'HERE',
       'OR', 'THERE', 'I', 'WOULD', 'NOT', 'LIKE', 'THEM', 'ANYWHERE',
       'I', 'DO', 'NOT', 'LIKE', 'GREEN', 'EGGS', 'AND', 'HAM', 'I', 'DO',
       'NOT', 'LIKE', 'THEM', 'SAM', 'I', 'AM', 'WOULD', 'YOU', 'LIKE',
       'THEM', 'IN', 'A', 'HOUSE', 'WOULD', 'YOU', 'LIKE', 'THEN', 'WITH',
       'A', 'MOUSE', 'I', 'DO', 'NOT', 'LIKE', 'THEM', 'IN', 'A', 'HOUSE',
       'I', 'DO', 'NOT', 'LIKE', 'THEM', 'WITH', 'A', 'MOUSE', 'I', 'DO',
       'NOT', 'LIKE', 'THEM', 'HERE', 'OR', 'THERE', 'I', 'DO', 'NOT',
       'LIKE', 'THEM', 'ANYWHERE', 'I', 'DO', 'NOT', 'LIKE', 'GREEN',
       'EGGS', 'AND', 'HAM', 'I', 'DO', 'NOT', 'LIKE', 'THEM', 'SAM', 'I',
       'AM', 'WOULD', 'YOU', 'EAT', 'THEM', 'IN', 'A', 'BOX', 'WOULD',
       'YOU', 'EAT', 'THEM', 'WITH', 'A', 'FOX', 'NOT', 'IN', 'A', 'BOX',
       'NOT', 'WITH', 'A', 'FOX', 'NOT', 'IN', 'A', 'HOUSE', 'NOT',
       'WITH', 'A', 'MOUSE', 'I', 'WOULD', 'NOT', 'EAT', 'THEM', 'HERE',
       'OR', 'THERE', 'I', 'WOULD', 'NOT', 'EAT', 'THEM', 'ANYWHERE', 'I',
       'WOULD', 'NOT', 'EAT', 'GREEN', 'EGGS', 'AND', 'HAM', 'I', 'DO',
       'NOT', 'LIKE', 'THEM', 'SAM', 'I', 'AM'], dtype='<U8')
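The function-composition view of integer array slicing can be checked directly: slicing twice with integer arrays gives the same result as slicing once with the composed index. A small sketch (my own example):

```python
import numpy as np

arr = np.array([0.0, 1.1, 2.2, 3.3, 4.4])
f = np.array([4, 2, 0])   # outer index array
g = np.array([2, 0])      # inner index array

# slicing twice composes the index functions: arr[f][g] == arr[f[g]]
composed = arr[f[g]]      # f[g] is [0, 4], so this is [0.0, 4.4]
chained = arr[f][g]
```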

Reductions#

Reductions are operations that reduce the number of dimensions in an array.

arr = np.array([[  1,   2,   3,   4],
                [ 10,  20,  30,  40],
                [100, 200, 300, 400]])

In a sense, scalar-integer slicing is a reduction:

arr[2, :]
array([100, 200, 300, 400])
arr[:, 3]
array([  4,  40, 400])

A scalar is a zero-dimensional array, so this is also a reduction:

arr[2, 3]
np.int64(400)

But it’s more common to refer to operations like sum, min/max, and mean as reductions.

By default, they reduce to a scalar.

np.sum(arr)
np.int64(1110)
np.max(arr)
np.int64(400)

But they can also reduce along one of the dimensions.

np.sum(arr, axis=0)
array([111, 222, 333, 444])
np.sum(arr, axis=1)
array([  10,  100, 1000])

Mini-quiz 3: Using slicing, elementwise operations, and a reducer, find the length of this curve.

t = np.linspace(0, 2*np.pi, 10000)
x = np.sin(3*t)
y = np.sin(4*t)
plt.plot(x, y);

Lesson 2 project: Higgs decays from arrays#

As described in the intro, navigate to the notebooks directory and open lesson-2-project.ipynb, then follow its instructions.