
Reproducible analyses with REANA

Introduction

Overview

Teaching: 10 min
Exercises: 1 min
Questions
  • What makes research data analyses reproducible?

  • Is preserving code, data, and containers enough?

Objectives
  • Understand principles behind computational reproducibility

  • Understand the concept of serial and parallel computational workflow graphs

Computational reproducibility

A reproducibility quote

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

– Jonathan B. Buckheit and David L. Donoho, “WaveLab and Reproducible Research”, source

Computational reproducibility has many definitions. Terms such as reproducibility, replicability, and repeatability often have different meanings in different scientific disciplines.

One possible point of view on computational reproducibility is provided by The Turing Way: source

In other words, “same data + same analysis = reproducible results” (perhaps run on a different underlying computing architecture such as from Intel to ARM processors).

The more interesting use case is “reusable analyses”, i.e. testing the same theory on new data, or altering the theory and reinterpreting old data.

Another possible point of view on computational reproducibility is provided by the PRIMAD model: source

An analysis is reproducible, repeatable, reusable, and robust if one can “wiggle” various parameters entering the process. For example, change the platform (from Intel to ARM processors): is it portable? Or change the actor performing the analysis (from Alice to Bob): is it independent of the analyst?

Real life shows that it often is not.

Example: Monya Baker published the results of a survey of 1,500 scientists in Nature 533 (2016) 452-454: source

Half of the researchers could not reproduce even their own results, and physics is not doing visibly better than other scientific disciplines.

Slow uptake of best practices

Many guidelines with “best practices” for computational reproducibility have been published. For example, “Ten Simple Rules for Reproducible Computational Research” by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, Eivind Hovig (2013) DOI:10.1371/journal.pcbi.1003285:

  1. For every result, keep track of how it was produced
  2. Avoid manual data manipulation steps
  3. Archive the exact versions of all external programs used
  4. Version control all custom scripts
  5. Record all intermediate results, when possible in standardized formats
  6. For analyses that include randomness, note underlying random seeds
  7. Always store raw data behind plots
  8. Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
  9. Connect textual statements to underlying results
  10. Provide public access to scripts, runs, and results

Yet the uptake of good practices in real life has been slow, for a variety of reasons.

The change is being brought about by a combination of top-down approaches (e.g. funding bodies asking for Data Management Plans) and bottom-up approaches (building tools that integrate into daily research workflows).

A reproducibility quote

Your closest collaborator is you six months ago… and your younger self does not reply to emails.

Four questions

Four questions to aid assessing the robustness of analyses:

  1. Where is your input data? Specify all input data and input parameters that the analysis uses.
  2. Where is your analysis code? Specify the analysis code and the software frameworks that are being used to analyse the data.
  3. Which computing environment do you use? Specify the operating system platform that is used to run the analysis.
  4. What are the computational steps to achieve the results? Specify all the commands and GUI clicks necessary to arrive at the final results.

The input data for statistical analyses, such as CMS MiniAOD, is produced centrally and the locations are well understood.

The analysis code and the containerised computational environments were covered in the previous two days of this workshop.

Exercise

Are containers enough to capture your runtime environment? What else might be necessary in your typical physics analysis scenarios?

Solution

Any external resources, such as condition database calls, must also be thought about. Will the external database that you use still be there and answering queries in the future?

Computational steps

Today’s lesson will focus mostly on the fourth question, i.e. the preservation of running computational steps.

The use of interactive and graphical interfaces is not recommended, since one cannot easily capture and reproduce user clicks.

The use of custom helper scripts (e.g. run.sh shell scripts), or custom orchestration scripts (e.g. Python glue code) running the analysis is much better.

However, porting glue code to new usage scenarios (for example to scale up to a new computer centre cluster) may be tedious technical work that would be better spent doing physics instead.

Hence the birth of declarative workflow systems that express the computational steps more abstractly.

Example of a serial computational workflow graph typical for ATLAS RECAST analyses:

Example of a parallel computational workflow graph typical for Beyond Standard Model searches:

Many different computational data analysis workflow systems exist. Some are preferred to others because of the features they bring that others do not have, so there are fit-for-use and fit-for-purpose considerations. Some are preferred due to cultural differences in research teams or due to individual preferences.

In experimental particle physics, several such workflow systems are being used, for example Snakemake in LHCb or Yadage in ATLAS.

REANA

We shall use the REANA reproducible analysis platform to explore computational workflows in this lesson. REANA supports several workflow specification languages and compute backends, as we shall see throughout the lesson.

Analysis preservation ab initio

Preserving analysis code and processes only after publication often comes too late: the key information and knowledge of how to arrive at the results may get lost during the lengthy analysis process.

Hence the idea of making research reproducible from the start, in other words making research “preproducible”, so that analysis preservation becomes easy.

Preserve first and think about reusability later is the “blue pill” way.

Make analysis preproducible ab initio to facilitate its future preservation is the “red pill” way.

Key Points

  • Workflow is the new data.

  • Data + Code + Environment + Workflow = Reproducible Analyses

  • Before reproducibility comes preproducibility


First example

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How to run analyses on REANA cloud?

  • What are the basic REANA command-line client usage scenarios?

  • How to monitor my analysis using REANA web interface?

Objectives
  • Get hands-on experience with REANA command-line client

Overview

In this lesson we shall run our first simple REANA example and walk through the basic reana-client usage scenarios.

Checklist

Have you installed reana-client and/or have you logged into LXPLUS as described in Setup?

First REANA example

We shall get acquainted with REANA by means of running a sample analysis example, reana-demo-root6-roofit.

Let’s start by cloning it:

git clone https://github.com/reanahub/reana-demo-root6-roofit
cd reana-demo-root6-roofit

What does the example do? The example emulates a typical particle physics analysis where the signal and background data is processed and fitted against a model. The example uses the RooFit package of the ROOT framework.

Four questions:

  1. Where is your input data? There is no input data. We shall simulate them.
  2. Where is your analysis code? Two files: gendata.C macro generates signal and background data; fitdata.C macro makes a fit for the signal and the background data.
  3. Which computing environment do you use? ROOT 6.18.04 with RooFit.
  4. What are the computational steps to achieve the results? Simple sequential steps: first run gendata, then run fitdata.

Workflow definition:

           START
            |
            |
            V
+-------------------------+
| (1) generate data       |
|                         |
|    $ root gendata.C ... |
+-------------------------+
            |
            | data.root
            V
+-------------------------+
| (2) fit data            |
|                         |
|    $ root fitdata.C ... |
+-------------------------+
            |
            | plot.png
            V
           STOP

The four questions expressed in reana.yaml fully define our analysis:

inputs:
  files:
    - code/gendata.C
    - code/fitdata.C
  parameters:
    events: 20000
    data: results/data.root
    plot: results/plot.png
workflow:
  type: serial
  specification:
    steps:
      - name: gendata
        environment: 'docker.io/reanahub/reana-env-root6:6.18.04'
        commands:
        - mkdir -p results && root -b -q 'code/gendata.C(${events},"${data}")'
      - name: fitdata
        environment: 'docker.io/reanahub/reana-env-root6:6.18.04'
        commands:
        - root -b -q 'code/fitdata.C("${data}","${plot}")'
outputs:
  files:
    - results/plot.png

Note the basic structure of reana.yaml answering the Four Questions. (Where is input data? Where is analysis code? What compute environment to use? What are the computational steps to arrive at results?)

Exercise

Familiarise yourself with the RooFit demo example by studying the README file and looking at the gendata.C and fitdata.C source code.

Solution

firefox https://github.com/reanahub/reana-demo-root6-roofit

First steps with the REANA command-line client

First we need to make sure we can use the REANA command-line client. See the setup instructions if you haven't already installed it.
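
If you have not installed it yet, the client can typically be installed from PyPI into a Python virtual environment along these lines (a minimal sketch; the Setup page remains the authoritative reference):

# create and activate an isolated virtual environment, then install the client from PyPI
python3 -m venv reana
source reana/bin/activate
pip install reana-client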

The client will offer several commands which we shall go through in this tutorial:

reana-client --help
Usage: reana-client [OPTIONS] COMMAND [ARGS]...

  REANA client for interacting with REANA server.

Options:
  -l, --loglevel [DEBUG|INFO|WARNING]
                                  Sets log level
  --help                          Show this message and exit.

Quota commands:
  quota-show  Show user quota.

Configuration commands:
  info     List cluster general information.
  ping     Check connection to REANA server.
  version  Show version.

Workflow management commands:
  create  Create a new workflow.
  delete  Delete a workflow.
  diff    Show diff between two workflows.
  list    List all workflows and sessions.

Workflow execution commands:
  logs      Get workflow logs.
  restart   Restart previously run workflow.
  run       Shortcut to create, upload, start a new workflow.
  start     Start previously created workflow.
  status    Get status of a workflow.
  stop      Stop a running workflow.
  validate  Validate workflow specification file.

Workspace interactive commands:
  close  Close an interactive session.
  open   Open an interactive session inside the workspace.

Workspace file management commands:
  download  Download workspace files.
  du        Get workspace disk usage.
  ls        List workspace files.
  mv        Move files within workspace.
  prune     Prune workspace files.
  rm        Delete files from workspace.
  upload    Upload files and directories to workspace.

Workspace file retention commands:
  retention-rules-list  List the retention rules for a workflow.

Secret management commands:
  secrets-add     Add secrets from literal string or from file.
  secrets-delete  Delete user secrets by name.
  secrets-list    List user secrets.

You can use the --help option to learn more about any command, for example validate:

reana-client validate --help
Usage: reana-client validate [OPTIONS]

  Validate workflow specification file.

  The ``validate`` command allows to check syntax and validate the reana.yaml
  workflow specification file.

  Examples:

       $ reana-client validate -f reana.yaml

Options:
  -f, --file PATH          REANA specification file describing the workflow to
                           execute. [default=reana.yaml]
  --environments           If set, check all runtime environments specified in
                           REANA specification file. [default=False]
  --pull                   If set, try to pull remote environment image from
                           registry to perform validation locally. Requires
                           ``--environments`` flag. [default=False]
  --server-capabilities    If set, check the server capabilities such as
                           workspace validation. [default=False]
  -t, --access-token TEXT  Access token of the current user.
  --help                   Show this message and exit.

Exercise

Validate our reana.yaml file to discover any errors. Use the validate command to do so.

Solution

reana-client validate
==> Verifying REANA specification file... reana.yaml
  -> SUCCESS: Valid REANA specification file.
==> Verifying REANA specification parameters...
  -> SUCCESS: REANA specification parameters appear valid.
==> Verifying workflow parameters and commands...
  -> SUCCESS: Workflow parameters and commands appear valid.
==> Verifying dangerous workflow operations...
  -> SUCCESS: Workflow operations appear valid.

Connect REANA client to remote REANA cluster

The REANA client will interact with a remote REANA cluster. It knows to which REANA cluster it connects by means of the following environment variable:

export REANA_SERVER_URL=https://reana.cern.ch

In order to authenticate to REANA, you need to generate a token.

Exercise: Obtain a token.

In order to obtain your token, please go to https://reana.cern.ch and ask for it.

In your terminal, paste the line with your new access token as seen below.

export REANA_ACCESS_TOKEN=xxxxxx

It may be a good idea to create a reana-setup-environment.sh file to store these two export commands. That way, all you need to do to set up your environment is to run source reana-setup-environment.sh. Alternatively, you can add the two export commands to your .bashrc file.
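
A minimal sketch of such a file could look as follows (the token value is a placeholder to be replaced with your own):

# reana-setup-environment.sh: source this file before using reana-client
export REANA_SERVER_URL=https://reana.cern.ch
export REANA_ACCESS_TOKEN=xxxxxx   # replace with your personal access token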

The REANA client connection to remote REANA cluster can be verified via ping command:

reana-client ping
REANA server: https://reana.cern.ch
REANA server version: 0.9.1
REANA client version: 0.9.1
Authenticated as: John Doe <john.doe@example.org>
Status: Connected

Run example on REANA cluster

Now that we have defined and validated our reana.yaml, and connected to the REANA production cluster, we can run the example easily via:

reana-client run -w roofit
==> Creating a workflow...
==> Verifying REANA specification file... reana.yaml
  -> SUCCESS: Valid REANA specification file.
==> Verifying REANA specification parameters...
  -> SUCCESS: REANA specification parameters appear valid.
==> Verifying workflow parameters and commands...
  -> SUCCESS: Workflow parameters and commands appear valid.
==> Verifying dangerous workflow operations...
  -> SUCCESS: Workflow operations appear valid.
==> Verifying compute backends in REANA specification file...
  -> SUCCESS: Workflow compute backends appear to be valid.
roofit.1
==> SUCCESS: File /reana.yaml was successfully uploaded.
==> Uploading files...
==> Detected .gitignore file. Some files might get ignored.
==> SUCCESS: File /code/gendata.C was successfully uploaded.
==> SUCCESS: File /code/fitdata.C was successfully uploaded.
==> Starting workflow...
==> SUCCESS: roofit.1 is pending

Here, we use the run command, which creates a new workflow named roofit, uploads its inputs as specified in the workflow specification, and finally starts the workflow.

While the workflow is running, we can enquire about its status:

reana-client status -w roofit
NAME     RUN_NUMBER   CREATED               STARTED               STATUS    PROGRESS
roofit   1            2020-02-17T16:01:45   2020-02-17T16:01:48   running   1/2

After a minute or so, the workflow should finish:

reana-client status -w roofit
NAME     RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS     PROGRESS
roofit   1            2020-02-17T16:01:45   2020-02-17T16:01:48   2020-02-17T16:02:44   finished   2/2

We can list the output files in the remote workspace:

reana-client ls -w roofit
NAME                SIZE     LAST-MODIFIED
reana.yaml          687      2020-02-17T16:01:46
code/gendata.C      1937     2020-02-17T16:01:46
code/fitdata.C      1648     2020-02-17T16:01:47
results/plot.png    15450    2020-02-17T16:02:44
results/data.root   154457   2020-02-17T16:02:17

We can also inspect the logs:

reana-client logs -w roofit | less
# (Hit q to quit 'less')
==> Workflow engine logs
2020-02-17 16:02:10,859 | root | MainThread | INFO | Publishing step:0, cmd: mkdir -p results && root -b -q 'code/gendata.C(20000,"results/data.root")', total steps 2 to MQ
2020-02-17 16:02:23,002 | root | MainThread | INFO | Publishing step:1, cmd: root -b -q 'code/fitdata.C("results/data.root","results/plot.png")', total steps 2 to MQ
2020-02-17 16:02:50,093 | root | MainThread | INFO | Workflow 424bc949-b809-4782-ba96-bc8cfa3e1a89 finished. Files available at /var/reana/users/b57e902f-fd11-4681-8a94-4318ae05d2ca/workflows/424bc949-b809-4782-ba96-bc8cfa3e1a89.

==> Job logs
==> Step: gendata
==> Workflow ID: 424bc949-b809-4782-ba96-bc8cfa3e1a89
==> Compute backend: Kubernetes
==> Job ID: 53c97429-25e9-4b74-94f7-c665d93fdbc2
==> Docker image: reanahub/reana-env-root6:6.18.04
==> Command: mkdir -p results && root -b -q 'code/gendata.C(20000,"results/data.root")'
==> Status: finished
==> Logs:
...

We can download the resulting plot:

reana-client download results/plot.png -w roofit
==> SUCCESS: File results/plot.png downloaded to reana-demo-root6-roofit.

And display it:

firefox results/plot.png

Exercise

Run the example workflow on REANA cluster. Practice status, ls, logs, download commands. For example, can you get the logs of the gendata step only?

Solution

reana-client logs -w roofit --filter step=gendata

Key Points

  • Use the reana-client command-line client to run containerised workflows from your laptop on remote compute clouds

  • Before running an analysis remotely, check its correctness locally via the validate command

  • As always, when in doubt, use the --help command-line option


Developing serial workflows

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How to write serial workflows?

  • What is declarative programming?

  • How to develop workflows progressively?

  • Can I temporarily override workflow parameters?

  • Do I always have to build new Docker image when my code changes?

Objectives
  • Understand pros/cons between imperative and declarative programming styles

  • Get familiar with serial workflow development practices

  • Understand run numbers of your analysis

  • See how you can run only parts of the workflow

  • See how you can repeat workflow to fix a failed step

Overview

We have seen how to use REANA client to run containerised analyses on the REANA cloud.

In this lesson we shall see more use cases suitable for developing serial workflows.

Imperative vs declarative programming

Imperative programming feels natural: use a library and just write code. Example: C.

for (int i = 0; i < sizeof(people) / sizeof(people[0]); i++) {
  if (people[i].age < 20) {
    printf("%s\n", people[i].name);
  }
}

However, it also has its drawbacks. If you write scientific workflows imperatively and you need to port the code to run on different compute architectures, or to scale up, considerable code refactoring may be necessary. This is time spent not on writing science code, but on writing orchestration for the said science code across different deployment scenarios.

Enter declarative programming that “expresses the logic of a computation without describing its control flow”. Example: SQL.

SELECT name FROM people WHERE age<20

The idea of the declarative approach to scientific workflows is to express the research as a series of data analysis steps and to leave to an independent “orchestration tool” or “workflow system” the task of running things properly on various deployment architectures.

This achieves a better separation of concerns between physics code knowledge and computing orchestration glue code knowledge. However, development may feel less immediate. There are pros and cons. There is no silver bullet.

Imperative or declarative?

Imperative programming is about how you want to achieve something. Declarative programming is about what you want to achieve.

Developing workflows progressively

Developing workflows declaratively may feel less natural. How do we do that?

Start with earlier steps, run, debug, run, debug until satisfaction.

Continue with later steps only afterwards.

How do we run only the first step of our example workflow? Use the TARGET option:

reana-client run -w roofit -o TARGET=gendata
==> Creating a workflow...
==> Verifying REANA specification file... reana.yaml
  -> SUCCESS: Valid REANA specification file.
==> Verifying REANA specification parameters...
  -> SUCCESS: REANA specification parameters appear valid.
==> Verifying workflow parameters and commands...
  -> SUCCESS: Workflow parameters and commands appear valid.
==> Verifying dangerous workflow operations...
  -> SUCCESS: Workflow operations appear valid.
==> Verifying compute backends in REANA specification file...
  -> SUCCESS: Workflow compute backends appear to be valid.
roofit.2
==> SUCCESS: File /reana.yaml was successfully uploaded.
==> Uploading files...
==> Detected .gitignore file. Some files might get ignored.
==> SUCCESS: File /code/fitdata.C was successfully uploaded.
==> SUCCESS: File /code/gendata.C was successfully uploaded.
==> Starting workflow...
==> SUCCESS: roofit.2 has been queued

After a minute, let us check the status:

reana-client status -w roofit
NAME     RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS     PROGRESS
roofit   2            2020-02-17T16:07:29   2020-02-17T16:07:33   2020-02-17T16:08:48   finished   1/2

and the workspace content:

reana-client ls -w roofit
NAME                SIZE     LAST-MODIFIED
reana.yaml          687      2020-02-11T16:07:30
code/gendata.C      1937     2020-02-17T16:07:30
code/fitdata.C      1648     2020-02-17T16:07:31
results/data.root   154458   2020-02-17T16:08:43

As we can see, the workflow ran only the first command and the data.root file was generated as expected. The final fitting step was not run and the final plot was not produced.

Workflow runs

We have run the analysis example anew. Similar to Continuous Integration systems, the REANA platform runs each workflow in an independent workspace. To distinguish between various workflow runs of the same analysis, the REANA platform keeps an incremental “run number”. You can obtain the list of all your workflows by using the list command:

reana-client list
NAME     RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS
roofit   2            2020-02-17T16:07:29   2020-02-17T16:07:33   2020-02-17T16:08:48   finished
roofit   1            2020-02-17T16:01:45   2020-02-17T16:01:48   2020-02-17T16:02:50   finished

You can use myanalysis.myrunnumber to refer to a given run number of an analysis:

reana-client ls -w roofit.1
reana-client ls -w roofit.2

To quickly know the differences between various workflow runs, you can use the diff command:

reana-client diff roofit.1 roofit.2 --brief
==> No differences in REANA specifications.

==> Differences in workflow workspace
Files roofit.1/results/data.root and roofit.2/results/data.root differ
Only in roofit.1/results: plot.png

Workflow parameters

Another useful technique when developing a workflow is to use smaller data samples until the workflow is debugged. For example, instead of generating 20000 events, we can generate only 1000. While you could achieve this by simply modifying the workflow definition, REANA offers an option to run parametrised workflows, meaning that you can pass the wanted value on the command line:

reana-client run -w roofit -p events=1000
==> Creating a workflow...
==> Verifying REANA specification file... /home/tibor/private/project/reana/src/reana-demo-root6-roofit/reana.yaml
  -> SUCCESS: Valid REANA specification file.
==> Verifying REANA specification parameters...
  -> SUCCESS: REANA specification parameters appear valid.
==> Verifying workflow parameters and commands...
  -> SUCCESS: Workflow parameters and commands appear valid.
==> Verifying dangerous workflow operations...
  -> SUCCESS: Workflow operations appear valid.
==> Verifying compute backends in REANA specification file...
  -> SUCCESS: Workflow compute backends appear to be valid.
roofit.3
==> SUCCESS: File /reana.yaml was successfully uploaded.
==> Uploading files...
==> Detected .gitignore file. Some files might get ignored.
==> SUCCESS: File /code/gendata.C was successfully uploaded.
==> SUCCESS: File /code/fitdata.C was successfully uploaded.
==> Starting workflow...
==> SUCCESS: roofit.3 has been queued

The generated ROOT file is much smaller:

reana-client ls -w roofit.1 | grep data.root
results/data.root   154457   2020-02-17T16:02:17
reana-client ls -w roofit.3 | grep data.root
results/data.root   19216   2020-02-17T16:18:45

and the plot much coarser:

reana-client download results/plot.png -w roofit.3

Developing further steps

Now that we are happy with the beginning of the workflow, how do we continue to develop the rest? Running a new workflow from scratch every time could be very time consuming; the skimming step, for example, may take many more minutes than the statistical analysis.

In these situations, you can take advantage of the restart functionality. The REANA platform allows you to restart part of a workflow in the given workspace, starting from the workflow step specified by the FROM option:

reana-client restart -w roofit.3 -o FROM=fitdata
==> SUCCESS: roofit.3.1 is pending

Note that the run number got an extra digit, indicating the number of restarts of the given workflow. The full semantics of REANA run numbers is myanalysis.myrunnumber.myrestartnumber.
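
For example, the following commands address different workspaces of the same analysis:

reana-client ls -w roofit        # latest run of the "roofit" analysis
reana-client ls -w roofit.3      # third run
reana-client ls -w roofit.3.1    # first restart of the third run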

Let us enquire about the status of the restarted workflow:

reana-client status -w roofit.3.1
NAME     RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS     PROGRESS
roofit   3.1          2020-02-17T16:26:09   2020-02-17T16:26:10   2020-02-17T16:27:24   finished   1/2

Let us look at the number of steps of the 3.1 rerun and at the modification timestamps of the workspace files:

reana-client ls -w roofit.3.1
NAME                SIZE    LAST-MODIFIED
reana.yaml          687     2020-02-17T16:17:00
code/gendata.C      1937    2020-02-17T16:17:00
code/fitdata.C      1648    2020-02-17T16:17:01
results/plot.png    16754   2020-02-17T16:27:20
results/data.root   19216   2020-02-17T16:18:45

We can see that only the last step of the workflow was rerun, as wanted.

This technique is useful to debug later stages of the workflow without having to rerun the lengthy former stages of the workflow.

Exercise

Consider we would like to produce the final plot of the roofit example and change the title from “Fit example” to “RooFit example”. How do you do this in the most efficient way?

Solution

Amend fitdata.C, upload the changed file to the workspace, and rerun the past successful workflow starting from the fitdata step:

reana-client list
vim code/fitdata.C # edit title printing statement
reana-client upload ./code/fitdata.C -w roofit.3
reana-client restart -w roofit.3 -o FROM=fitdata
reana-client list
reana-client status -w roofit.3.2
reana-client download -w roofit.3.2

Compile-time vs runtime code changes

Sometimes you have to build a new container image when the code changes (e.g. compiled C++ code); sometimes you don't (e.g. Python code, ROOT macros). Prefer the latter for more productivity when developing workflows.
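
For example, a hedged sketch of the two development loops (the image name in the compiled case is purely illustrative):

# interpreted code (e.g. ROOT macros): upload the changed file and restart from the affected step
vim code/fitdata.C
reana-client upload ./code/fitdata.C -w roofit.3
reana-client restart -w roofit.3 -o FROM=fitdata

# compiled code (e.g. C++): rebuild and push the container image before restarting
docker build -t gitlab-registry.cern.ch/johndoe/myanalysis:latest .
docker push gitlab-registry.cern.ch/johndoe/myanalysis:latest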

Key Points

  • Develop workflows progressively; add steps as needed

  • When developing a workflow, stay on the same workspace

  • When developing interpreted code (e.g. ROOT macros or Python), you can keep using the same container image

  • Use smaller test data before scaling out

  • Use workflows as Continuous Integration; make atomic commits that always work


HiggsToTauTau analysis: serial

Overview

Teaching: 5 min
Exercises: 20 min
Questions
  • Challenge: write the HiggsToTauTau analysis workflow and run it on REANA

Objectives
  • Develop a full HiggsToTauTau analysis workflow using a simple serial language

  • Get acquainted with writing moderately complex REANA examples

Overview

In the previous two episodes we have practised writing and running workflows on REANA using a simple RooFit analysis example.

In this episode we shall go back to the HiggsToTauTau analysis example that you used throughout the workshop and we shall write a serial workflow to run this analysis on the REANA platform.

Recap

In the past two days of this workshop you have followed the lessons on containerisation and GitLab CI/CD.

The lessons used a HiggsToTauTau example analysis, described in detail in the workshop materials.

You have containerised this analysis by means of two GitLab repositories, awesome-analysis-eventselection and awesome-analysis-statistics.

You have used GitLab CI/CD to build the Docker images for these repositories and published them in the GitLab container registry.

You have run the containerised HiggsToTauTau analysis “manually” by using docker commands for the various analysis steps such as skimming, histogramming, plotting and fitting.

And you have produced the plots and the fit.

Objective

Let us write a serial computational workflow automating the previously-run manual steps and run the HiggsToTauTau example on REANA.

Note: Computing efficiency

Note that the serial workflow will not necessarily be efficient here, since it will run sequentially over the various dataset files rather than processing them in parallel. Do not pay attention to this inefficiency yet. We shall speed up the serial example via parallel processing in the forthcoming HiggsToTauTau analysis: parallel episode coming after the coffee break.

Note: Container directories and workspace directories

The awesome-analysis-eventselection and awesome-analysis-statistics repositories assume that you run code from certain absolute directories such as /analysis/skim. Recall that when REANA starts a new workflow run, it creates a unique “workspace directory” and uses it as the default directory for all the analysis steps throughout the workflow, allowing the steps to share read/write files.

It is good practice to consider the absolute directories in your container images, such as /analysis/skim, as read-only, and to use the dynamic workflow workspace for any writeable needs instead. In this way, we don't risk writing over any code or configuration files provided by the container, which is good for both reproducibility and security.

Moreover, we don't grow the running container by writing inside it. Writing to the dynamic workspace that is mounted inside the container keeps the container size small.

Note: REANA_WORKSPACE environment variable

The REANA platform provides a convenient set of environment variables that you can use in your scripts. One of them is REANA_WORKSPACE, which points to the workflow's workspace, uniquely allocated for each run. You can use the $$REANA_WORKSPACE environment variable in your reana.yaml recipe to share the output of the skimming, histogramming, plotting and fitting steps. (Note the use of two leading dollar signs to escape the workflow parameter expansion that you have used in the previous episodes.)
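
As a quick illustration of the escaping, a hypothetical step command mixing both kinds of expansion could look like this:

commands:
  # ${eosdir} is a workflow parameter expanded by REANA;
  # $$REANA_WORKSPACE is escaped so that the shell expands it at runtime
  - mkdir -p $$REANA_WORKSPACE/skimming && bash ./skim.sh ${eosdir} $$REANA_WORKSPACE/skimming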

OK, challenge time!

With the above hints in mind, please try to write the workflow, either individually or in pairs.

Exercise

Write reana.yaml representing HiggsToTauTau analysis and run it on the REANA cloud.

Solution

inputs:
  parameters:
    eosdir: root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced
workflow:
  type: serial
  specification:
    steps:
      - name: skimming
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/skimming && cd /analysis/skim && bash ./skim.sh ${eosdir} $$REANA_WORKSPACE/skimming
      - name: histogramming
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/histogramming && cd /analysis/skim && bash ./histograms_with_custom_output_location.sh $$REANA_WORKSPACE/skimming $$REANA_WORKSPACE/histogramming
      - name: plotting
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/plotting && cd /analysis/skim && bash ./plot.sh $$REANA_WORKSPACE/histogramming/histograms.root $$REANA_WORKSPACE/plotting 0.1
      - name: fitting
        environment: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3:master
        commands:
          - mkdir $$REANA_WORKSPACE/fitting && cd /fit && bash ./fit.sh $$REANA_WORKSPACE/histogramming/histograms.root $$REANA_WORKSPACE/fitting
outputs:
  files:
    - fitting/fit.png

Key Points

  • Writing serial workflows is like chaining shell script commands


Coffee break

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • Coffee break

Objectives
  • Refresh your mind

  • Discuss your experience

Key Points

  • Refresh your mind

  • Discuss your experience


Developing parallel workflows

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How to scale up and run thousands of jobs?

  • What is a DAG?

  • What is a Scatter-Gather paradigm?

  • How to run Yadage workflows on REANA?

Objectives
  • Learn about Directed Acyclic Graphs (DAG)

  • Understand Yadage workflow language

  • Practice running and inspecting parallel workflows

Overview

We now know how to develop reproducible analyses on a small scale using serial workflows.

In this lesson we shall learn how to scale up to real-life work, which usually requires parallel workflows.

Computational workflows as Directed Acyclic Graphs (DAG)

Computational workflows can be expressed as a set of computational steps where some steps depend on other steps before they can begin their computations. In other words, the computational steps constitute a Directed Acyclic Graph (DAG) where each graph vertex represents a unit of computation with its inputs and outputs, and the graph edges describe the interconnection of the various computational steps. For example:

The graph is “directed” and “acyclic” because it can be topologically ordered so that later steps depend on earlier steps without cyclic dependencies, the progress flowing steadily from former steps to latter steps during the analysis.

The REANA platform supports several DAG workflow specification languages, such as Yadage and Snakemake.

Yadage

In this lesson we shall mostly use the Yadage workflow specification language (used e.g. in ATLAS). Yadage makes it possible to describe even very complex computational workflows.

Let us start by having a look at the Yadage specification for the RooFit example we have used in the previous episodes:

stages:
  - name: gendata
    dependencies: [init]
    scheduler:
      scheduler_type: 'singlestep-stage'
      parameters:
        events: {step: init, output: events}
        gendata: {step: init, output: gendata}
        outfilename: '{workdir}/data.root'
      step:
        process:
          process_type: 'interpolated-script-cmd'
          script: root -b -q '{gendata}({events},"{outfilename}")'
        publisher:
          publisher_type: 'frompar-pub'
          outputmap:
            data: outfilename
        environment:
          environment_type: 'docker-encapsulated'
          image: 'docker.io/reanahub/reana-env-root6'
          imagetag: '6.18.04'
  - name: fitdata
    dependencies: [gendata]
    scheduler:
      scheduler_type: 'singlestep-stage'
      parameters:
        fitdata: {step: init, output: fitdata}
        data: {step: gendata, output: data}
        outfile: '{workdir}/plot.png'
      step:
        process:
          process_type: 'interpolated-script-cmd'
          script: root -b -q '{fitdata}("{data}","{outfile}")'
        publisher:
          publisher_type: 'frompar-pub'
          outputmap:
            plot: outfile
        environment:
          environment_type: 'docker-encapsulated'
          image: 'docker.io/reanahub/reana-env-root6'
          imagetag: '6.18.04'

We can see that the workflow consists of two stages: the gendata stage that does not depend on anything (this is denoted by the [init] dependency, which means the stage can run right after workflow initialisation), and the fitdata stage that depends on the completion of the gendata stage (this is denoted by [gendata]).

Note that each stage consists of a single workflow step (singlestep-stage), which represents the basic unit of computation of the workflow. (We shall see below an example of multi-step stages, which express the same basic unit of computation scattered over many inputs.)

The step consists of the description of the process to run, the containerised environment in which to run the process, as well as the mapping of its outputs to the stage.

This is how the Yadage workflow engine understands which stages can be run in which order, what commands to run in each stage, and how to pass inputs and outputs between steps.

Snakemake

Let us open a brief parenthesis about other workflow languages such as Snakemake (used e.g. in LHCb). The same computational graph concepts apply here as well. What differs is the syntax used to express the dependencies between steps and the processes to run in each step.

For example, Snakemake uses “rules” where each rule defines its inputs and outputs and the command to run to produce them. The Snakemake workflow engine then computes the dependencies between rules based on how the outputs from some rules are used as inputs to other rules.

rule all:
    input:
        "results/data.root",
        "results/plot.png"

rule gendata:
    output:
        "results/data.root"
    params:
        events=20000
    container:
        "docker://docker.io/reanahub/reana-env-root6:6.18.04"
    shell:
        "mkdir -p results && root -b -q 'code/gendata.C({params.events},\"{output}\")'"

rule fitdata:
    input:
        data="results/data.root"
    output:
        "results/plot.png"
    container:
        "docker://docker.io/reanahub/reana-env-root6:6.18.04"
    shell:
        "root -b -q 'code/fitdata.C(\"{input.data}\",\"{output}\")'"

We see that the final plot is produced by the “fitdata” rule, which needs the “data.root” file to be present, and it is the “gendata” rule that produces it. Hence Snakemake knows that it has to run the “gendata” rule first, and the computation of “fitdata” is deferred until “gendata” successfully completes. This process is very similar to how Makefiles are used in Unix software packages.
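
For completeness, a hedged sketch of how the above rules could be plugged into REANA, assuming they are stored in a file called Snakefile next to reana.yaml:

inputs:
  files:
    - code/gendata.C
    - code/fitdata.C
    - Snakefile
workflow:
  type: snakemake
  file: Snakefile
outputs:
  files:
    - results/plot.png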

After this parenthetical note about Snakemake, let us now return to our Yadage example.

Running Yadage workflows

Let us try to write and run the above Yadage workflow on REANA.

We have to instruct REANA that we are going to use Yadage as our workflow engine. We can do that by editing reana.yaml and specifying:

inputs:
  files:
    - code/gendata.C
    - code/fitdata.C
    - workflow.yaml
  parameters:
    events: 20000
    gendata: code/gendata.C
    fitdata: code/fitdata.C
workflow:
  type: yadage
  file: workflow.yaml
outputs:
  files:
    - fitdata/plot.png

Here, workflow.yaml is a new file with the same content as specified above.

We now can run this example on REANA in the usual way:

reana-client run -w roofityadage

Exercise

Run RooFit example using Yadage workflow engine on the REANA cloud. Upload code, run workflow, inspect status, check logs, download final plot.

Solution

Nothing changes in the usual user interaction with the REANA platform:

reana-client create -w roofityadage -f ./reana-yadage.yaml
reana-client upload ./code -w roofityadage
reana-client start -w roofityadage
reana-client status -w roofityadage
reana-client logs -w roofityadage
reana-client ls -w roofityadage
reana-client download plot.png -w roofityadage

Physics code vs orchestration code

Note that it wasn’t necessary to change anything in our research code: we simply modified the workflow definition from Serial to Yadage and we could run the RooFit code “as is” using another workflow engine. This is a simple demonstration of the separation of concerns between “physics code” and “orchestration code”.

Parallelism via step dependencies

We have seen how sequential workflows are expressed in the Yadage syntax using stage dependencies. Note that, had the stage dependency graph permitted it, workflow steps not depending on each other, or on the results of previous computations, would have been executed in parallel by the workflow engine out of the box. The physicist only has to supply the knowledge of which steps depend on which other steps, and the workflow engine takes care of efficiently starting and scheduling tasks as necessary.

HiggsToTauTau analysis: simple version

Let us demonstrate how to write a Yadage workflow for the HiggsToTauTau example analysis using simple step dependencies.

The workflow stages look like:

stages:
- name: skim
  dependencies: [init]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      input_dir: {step: init, output: input_dir}
      output_dir: '{workdir}/output'
    step: {$ref: 'steps.yaml#/skim'}

- name: histogram
  dependencies: [skim]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      input_dir: {step: skim, output: skimmed_dir}
      output_dir: '{workdir}/output'
    step: {$ref: 'steps.yaml#/histogram'}

- name: fit
  dependencies: [histogram]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      histogram_file: {step: histogram, output: histogram_file}
      output_dir: '{workdir}/output'
    step: {$ref: 'steps.yaml#/fit'}

- name: plot
  dependencies: [histogram]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      histogram_file: {step: histogram, output: histogram_file}
      output_dir: '{workdir}/output'
    step: {$ref: 'steps.yaml#/plot'}

where steps are expressed as:

skim:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      mkdir {output_dir}
      bash skim.sh {input_dir} {output_dir}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      skimmed_dir: '{output_dir}'

histogram:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      mkdir {output_dir}
      bash histograms_with_custom_output_location.sh {input_dir} {output_dir}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      histogram_file: '{output_dir}/histograms.root'

plot:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      mkdir {output_dir}
      bash plot.sh {histogram_file} {output_dir}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      datamc_plots: '{output_dir}'

fit:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      mkdir {output_dir}
      bash fit.sh {histogram_file} {output_dir}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fitting_plot: '{output_dir}/fit.png'

The workflow definition is similar to that of the Serial workflow created in the previous episode. As we can see, this already leads to certain parallelism, because the fitting stage and the plotting stage can actually run simultaneously once the histograms are produced. The graphical representation of the above workflow looks as follows:

Let us try to run it on REANA cloud.

Exercise

Write and run a HiggsToTauTau analysis example using the Yadage workflow version presented above. Take the workflow definition, the step definition, and write the corresponding reana.yaml. Afterwards run the example on REANA cloud.

Solution

mkdir awesome-analysis-yadage-simple
cd awesome-analysis-yadage-simple
vim workflow.yaml # take workflow definition contents above
vim steps.yaml    # take step definition contents above
vim reana.yaml    # the goal of the exercise is to create this content
cat reana.yaml
inputs:
  files:
    - steps.yaml
    - workflow.yaml
  parameters:
    input_dir: root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced
workflow:
  type: yadage
  file: workflow.yaml

Parallelism via scatter-gather paradigm

We have seen how to achieve a certain parallelism of workflow steps via a simple dependency graph expressing which workflow steps depend on which others.

We now introduce a more advanced concept for instructing the workflow engine to start many parallel computations. The paradigm is called “scatter-gather” and is used to instruct the workflow engine to run a certain parametrised command over an array of input values in parallel (the “scatter” operation) whilst assembling the results together afterwards (the “gather” operation). The “scatter-gather” paradigm allows scaling computations in a “map-reduce” fashion over input values with a minimal syntax, without having to duplicate workflow code or write loop statements.

Here is an example of the scatter-gather paradigm in the Yadage language. Note the use of a “multi-step” stage definition, expressing that the given stage is actually running multiple parametrised steps:

stages:
  - name: filter1
    dependencies: [init]
    scheduler:
      scheduler_type: multistep-stage
      parameters:
        input: {stages: init, output: input, unwrap: true}
      batchsize: 2
      scatter:
        method: zip
        parameters: [input]
      step: {$ref: steps.yaml#/filter}
  - name: filter2
    dependencies: [filter1]
    scheduler:
      scheduler_type: multistep-stage
      parameters:
        input: {stages: filter1, output: output, unwrap: true}
      batchsize: 2
      scatter:
        method: zip
        parameters: [input]
      step: {$ref: steps.yaml#/filter}
  - name: filter3
    dependencies: [filter2]
    scheduler:
      scheduler_type: singlestep-stage
      parameters:
        input: {stages: 'filter2', output: output}
      step: {$ref: steps.yaml#/filter}

The graphical representation of the computational graph looks like:

Note how the “scatter” operation automatically happens over the given “input” array with the desired batch size, processing files two by two irrespective of the number of input files. Note also the automatic “cascading” of the computations.

In the next episode we shall see how the scatter-gather paradigm can be used to speed up the HiggsToTauTau sequential workflow that we developed in the previous episode.

Key Points

  • Computational analysis is a graph of inter-dependent steps

  • Fully declare inputs and outputs for each step

  • Use dependencies between workflow steps to allow running jobs in parallel

  • Use scatter/gather paradigm to parallelise parametrised computations


HiggsToTauTau analysis: parallel

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • Challenge: write the HiggsToTauTau analysis parallel workflow and run it on REANA

Objectives
  • Develop a full HiggsToTauTau analysis workflow using a parallel language

Overview

We have seen an example of a full DAG-aware workflow language called Yadage and how it can be used to describe and run the RooFit example and a simple version of the HiggsToTauTau example.

In this episode we shall see how to efficiently apply parallelism to speed up the HiggsToTauTau example via the scatter-gather paradigm introduced in the previous episode.

HiggsToTauTau analysis

The overall reana.yaml looks like:

inputs:
  files:
    - steps.yaml
    - workflow.yaml
  parameters:
    files:
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/GluGluToHToTauTau.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/VBF_HToTauTau.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/DYJetsToLL.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/TTbar.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W1JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W2JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W3JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012B_TauPlusX.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012C_TauPlusX.root
    cross_sections:
      - 19.6
      - 1.55
      - 3503.7
      - 225.2
      - 6381.2
      - 2039.8
      - 612.5
      - 1.0
      - 1.0
    short_hands:
      - [ggH]
      - [qqH]
      - [ZLL,ZTT]
      - [TT]
      - [W1J]
      - [W2J]
      - [W3J]
      - [dataRunB]
      - [dataRunC]
workflow:
  type: yadage
  file: workflow.yaml
outputs:
  files:
    - fit/fit.png

Note that we define the input files, cross sections and short names as arrays. It is over these arrays that we shall be scattering.

HiggsToTauTau skimming

The skimming step definition looks like:

- name: skim
  dependencies: [init]
  scheduler:
    scheduler_type: multistep-stage
    parameters:
      input_file: {step: init, output: files}
      cross_section: {step: init, output: cross_sections}
      output_file: '{workdir}/skimmed.root'
    scatter:
       method: zip
       parameters: [input_file, cross_section]
    step: {$ref: 'steps.yaml#/skim'}

where the step is defined as:

skim:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      ./skim {input_file} {output_file} {cross_section} 11467.0 0.1
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      skimmed_file: '{output_file}'

Note the scatter paradigm that will create nine parallel jobs, one for each input dataset file.

HiggsToTauTau histogramming

The histograms can be produced as follows:

- name: histogram
  dependencies: [skim]
  scheduler:
    scheduler_type: multistep-stage
    parameters:
      input_file: {stages: skim, output: skimmed_file}
      output_names: {step: init, output: short_hands}
      output_dir: '{workdir}'
    scatter:
       method: zip
       parameters: [input_file, output_names]
    step: {$ref: 'steps.yaml#/histogram'}

with:

histogram:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      for x in {output_names}; do
        python histograms.py {input_file} $x {output_dir}/$x.root;
      done
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    glob: true
    publish:
      histogram_file: '{output_dir}/*.root'

HiggsToTauTau merging

Gather time! How do we merge scattered results?

- name: merge
  dependencies: [histogram]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      input_files: {stages: histogram, output: histogram_file, flatten: true}
      output_file: '{workdir}/merged.root'
    step: {$ref: 'steps.yaml#/merge'}

with:

merge:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      hadd {output_file} {input_files}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      merged_file: '{output_file}'

HiggsToTauTau fitting

The fit can be performed as follows:

- name: fit
  dependencies: [merge]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      histogram_file: {step: merge, output: merged_file}
      fit_outputs: '{workdir}'
    step: {$ref: 'steps.yaml#/fit'}

with:

fit:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      python fit.py {histogram_file} {fit_outputs}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fit_results: '{fit_outputs}/fit.png'

HiggsToTauTau plotting

Challenge time! Add plotting step to the workflow.

Exercise

Following the example above, write plotting step and plug it into the overall workflow.

Solution

The addition to the workflow specification is:

- name: plot
  dependencies: [merge]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      histogram_file: {step: merge, output: merged_file}
      plot_outputs: '{workdir}'
    step: {$ref: 'steps.yaml#/plot'}

The step is being defined as:

plot:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      python plot.py {histogram_file} {plot_outputs} 0.1
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fitting_plot: '{plot_outputs}'

Full workflow

We are now ready to assemble the previous stages together and run the example on the REANA cloud.

Exercise

Write and run the HiggsToTauTau parallel workflow on the REANA cloud. How many jobs does the workflow have? How much faster does it execute compared to the simple serial version?

Solution

The REANA specification file reana.yaml looks as follows:

inputs:
  files:
    - steps.yaml
    - workflow.yaml
  parameters:
    files:
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/GluGluToHToTauTau.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/VBF_HToTauTau.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/DYJetsToLL.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/TTbar.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W1JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W2JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/W3JetsToLNu.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012B_TauPlusX.root
      - root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/Run2012C_TauPlusX.root
    cross_sections:
      - 19.6
      - 1.55
      - 3503.7
      - 225.2
      - 6381.2
      - 2039.8
      - 612.5
      - 1.0
      - 1.0
    short_hands:
      - [ggH]
      - [qqH]
      - [ZLL,ZTT]
      - [TT]
      - [W1J]
      - [W2J]
      - [W3J]
      - [dataRunB]
      - [dataRunC]
workflow:
  type: yadage
  file: workflow.yaml
outputs:
  files:
    - fit/fit.png

The workflow definition file workflow.yaml is:

stages:
- name: skim
  dependencies: [init]
  scheduler:
    scheduler_type: multistep-stage
    parameters:
      input_file: {step: init, output: files}
      cross_section: {step: init, output: cross_sections}
      output_file: '{workdir}/skimmed.root'
    scatter:
       method: zip
       parameters: [input_file, cross_section]
    step: {$ref: 'steps.yaml#/skim'}

- name: histogram
  dependencies: [skim]
  scheduler:
    scheduler_type: multistep-stage
    parameters:
      input_file: {stages: skim, output: skimmed_file}
      output_names: {step: init, output: short_hands}
      output_dir: '{workdir}'
    scatter:
       method: zip
       parameters: [input_file, output_names]
    step: {$ref: 'steps.yaml#/histogram'}

- name: merge
  dependencies: [histogram]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      input_files: {stages: histogram, output: histogram_file, flatten: true}
      output_file: '{workdir}/merged.root'
    step: {$ref: 'steps.yaml#/merge'}

- name: fit
  dependencies: [merge]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      histogram_file: {step: merge, output: merged_file}
      fit_outputs: '{workdir}'
    step: {$ref: 'steps.yaml#/fit'}

- name: plot
  dependencies: [merge]
  scheduler:
    scheduler_type: singlestep-stage
    parameters:
      histogram_file: {step: merge, output: merged_file}
      plot_outputs: '{workdir}'
    step: {$ref: 'steps.yaml#/plot'}

The workflow steps defined in steps.yaml are:

skim:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      ./skim {input_file} {output_file} {cross_section} 11467.0 0.1
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      skimmed_file: '{output_file}'

histogram:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      for x in {output_names}; do
        python histograms.py {input_file} $x {output_dir}/$x.root;
      done
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    glob: true
    publish:
      histogram_file: '{output_dir}/*.root'

merge:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      hadd {output_file} {input_files}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      merged_file: '{output_file}'

fit:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      python fit.py {histogram_file} {fit_outputs}
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-statistics-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fit_results: '{fit_outputs}/fit.png'

plot:
  process:
    process_type: 'interpolated-script-cmd'
    script: |
      python plot.py {histogram_file} {plot_outputs} 0.1
  environment:
    environment_type: 'docker-encapsulated'
    image: gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3
    imagetag: master
  publisher:
    publisher_type: interpolated-pub
    publish:
      fitting_plot: '{plot_outputs}'
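
To run the workflow, the usual reana-client commands apply. A minimal sketch, assuming the workflow name higgstautau-yadage and that REANA_SERVER_URL and REANA_ACCESS_TOKEN are already set in your environment:

reana-client create -n higgstautau-yadage
export REANA_WORKON=higgstautau-yadage
reana-client upload      # uploads steps.yaml and workflow.yaml listed in inputs
reana-client start
reana-client status
reana-client download fit/fit.png

Since the skim and histogram stages each scatter over the nine input files, nine skimming jobs and nine histogramming jobs run in parallel, which is where the speed-up over the simple Serial version comes from.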

Results

The computational graph of the workflow looks like:

The workflow produces the following fit:

Key Points

  • Use step dependencies to express main analysis stages

  • Use the scatter-gather paradigm in stages to massively parallelise DAG workflow execution

  • REANA usage scenarios remain the same regardless of workflow language details


A glimpse on advanced topics

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • Can I publish workflow results on EOS?

  • Can I use Kerberos to access restricted resources?

  • Can I use CVMFS software repositories?

  • Can I dispatch heavy computations to HTCondor?

  • Can I dispatch heavy computations to Slurm?

  • Can I open Jupyter notebooks on my REANA workspace?

  • Can I connect my GitLab repositories with REANA?

Objectives
  • Learn about advanced possibilities of REANA platform

  • Learn how to use Kerberos secrets to access restricted resources

  • Learn how to interact with remote storage solutions (EOS)

  • Learn how to interact with remote compute backends (HTCondor, Slurm)

  • Learn how to interact with remote code repositories (CVMFS, GitLab)

  • Learn how to open interactive sessions (Jupyter notebooks)

Overview

We now know how to write serial and parallel workflows.

What more do we need in order to use the system for real-life physics analyses?

Let’s scratch the surface of some more advanced topics:

Publishing workflow results on EOS

REANA uses a shared filesystem for storing the results of your running workflows. These results may be garbage-collected after a certain period of time. You can use the reana-client download command to download the results of your workflows, as we have seen in Episode 2. Is there a more automatic way?

One possibility is to add a final step to your workflow that publishes the results of interest to an outside filesystem. For example, how can you publish all resulting plots to your personal EOS folder?

First, you have to provide the REANA platform with your Kerberos keytab so that writing to EOS can be authorised.

If you don’t have a keytab file yet, you can generate it on LXPLUS using the following command (assuming your login name is johndoe):

cern-get-keytab --keytab ~/.keytab --user --login johndoe

Check whether it works:

kdestroy; kinit -kt ~/.keytab johndoe; klist
Ticket cache: FILE:/tmp/krb5cc_1234_5678
Default principal: johndoe@CERN.CH

Valid starting       Expires              Service principal
07/05/2023 18:04:13  07/06/2023 19:04:13  krbtgt/CERN.CH@CERN.CH
    renew until 07/10/2023 18:04:13
07/05/2023 18:04:13  07/06/2023 19:04:13  afs/cern.ch@CERN.CH
    renew until 07/10/2023 18:04:13

Upload it to the REANA platform as “user secrets”:

reana-client secrets-add --env CERN_USER=johndoe \
                         --env CERN_KEYTAB=.keytab \
                         --file ~/.keytab
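
You can verify that the secrets have been registered with the secrets-list command:

reana-client secrets-list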

Second, once your Kerberos user secrets are uploaded to the REANA platform, you can modify your workflow to add a final data publishing step that copies the resulting plots to the desired EOS directories:

workflow:
  type: serial
  specification:
    steps:
      - name: myfirststep
        ...
      - name: mysecondstep
        ...
      - name: publish
        environment: 'docker.io/library/ubuntu:20.04'
        kerberos: true
        kubernetes_memory_limit: '256Mi'
        commands:
          - mkdir -p /eos/home-j/johndoe/myanalysis-outputs
          - cp myplots/*.png /eos/home-j/johndoe/myanalysis-outputs/

Note the presence of the kerberos: true clause in the final publishing step definition, which instructs the REANA system to initialise the Kerberos-based authentication process using the provided user secrets.

Exercise

Publish some of the produced HiggsToTauTau analysis plots to your EOS home directory.

Solution

Modify your workflow specification to add a final publishing step.

Hint: Use a previously finished analysis run and the restart command so that you don’t have to rerun the full analysis.
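
For instance, for the serial version of the workflow the publishing step could look roughly like the sketch below; the EOS target directory and the fit/fit.png output path are illustrative and should be adapted to your own setup:

      - name: publish
        environment: 'docker.io/library/ubuntu:20.04'
        kerberos: true
        kubernetes_memory_limit: '256Mi'
        commands:
          - mkdir -p /eos/home-j/johndoe/higgstautau-outputs
          - cp fit/fit.png /eos/home-j/johndoe/higgstautau-outputs/

After uploading the modified specification, the finished run can be continued rather than recomputed from scratch; this assumes your reana-client version supports restarting with a modified specification file (-f) and the FROM operational option:

reana-client restart -w higgstautau-serial -f reana.yaml -o FROM=publish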

If you need more assistance with creating and uploading keytab files, please see the REANA documentation on Keytab.

If you need more assistance with creating the final workflow publishing step, please see the REANA documentation on EOS.

Using CVMFS software repositories

Many physics analyses need software that lives in the CVMFS filesystem. Packaging this software into the container is possible, but it could make the container enormous. Can we access the CVMFS filesystem at runtime instead?

REANA allows you to declare custom resource needs in reana.yaml by means of a resources clause. An example:

workflow:
  type: serial
  resources:
    cvmfs:
      - fcc.cern.ch
  specification:
    steps:
      - environment: 'docker.io/cern/slc6-base'
        commands:
        - ls -l /cvmfs/fcc.cern.ch/sw/views/releases/

Exercise

Write a workflow that runs ROOT from the SFT repository on CVMFS and lists all configuration flags enabled in that executable.

Solution
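
One possible sketch, using the sft.cern.ch repository together with an LCG view; the concrete LCG release, platform string, and base image are assumptions that may need adapting, and root-config --features prints the configuration flags enabled in the ROOT build:

workflow:
  type: serial
  resources:
    cvmfs:
      - sft.cern.ch
  specification:
    steps:
      - environment: 'docker.io/cern/cc7-base'
        commands:
        - source /cvmfs/sft.cern.ch/lcg/views/LCG_104/x86_64-centos7-gcc11-opt/setup.sh && root-config --features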

See also REANA documentation on CVMFS.

Using high-throughput computing backends: HTCondor

REANA uses Kubernetes as its default compute backend.

Massively parallel physics analyses profit from high-throughput computing (HTC) systems such as HTCondor to launch the same procedure over many mutually independent data files.

If you would like to dispatch parts of the workflow to the HTCondor backend, you can use the compute_backend clause of the workflow specification, for example:

workflow:
  type: serial
  specification:
    steps:
      - name: gendata
        environment: 'docker.io/reanahub/reana-env-root6:6.18.04'
        compute_backend: htcondorcern
        htcondor_max_runtime: espresso
        commands:
        - mkdir -p results && root -b -q 'code/gendata.C(${events},"${data}")'

Note that access control is handled automatically via Kerberos, so this requires you to upload your keytab as in the EOS publishing example above.

Exercise

Modify the HiggsToTauTau analysis to run the skimming part on the HTCondor cluster.

Solution
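
For the Serial version of the analysis, a minimal sketch of the skimming step is shown below; the step name, container image, and command arguments are illustrative (the skim executable takes the input file, output file, cross-section, and the fixed numerical arguments used earlier in this lesson):

workflow:
  type: serial
  specification:
    steps:
      - name: skimming
        environment: 'gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master'
        compute_backend: htcondorcern
        htcondor_max_runtime: espresso
        commands:
        - ./skim GluGluToHToTauTau.root skimmed.root 19.6 11467.0 0.1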

See also REANA documentation on HTCondor.

Using high-performance computing backends: Slurm

Another useful compute backend architecture is high-performance computing (HPC), where the cluster nodes are inter-connected. This is useful for MPI and similar programming techniques.

REANA supports the Slurm job scheduler for sending jobs to HPC clusters. You can again use the compute_backend clause to specify the desired compute backend for each step:

workflow:
  type: serial
  specification:
    steps:
      - name: gendata
        environment: 'docker.io/reanahub/reana-env-root6:6.18.04'
        compute_backend: slurmcern
        commands:
        - mkdir -p results && root -b -q 'code/gendata.C(${events},"${data}")'

Exercise

Modify the HiggsToTauTau analysis to run the histogramming part on the Slurm cluster.

Solution
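
Analogously, a minimal sketch for the histogramming step of the Serial version; the step name and command arguments are illustrative, and only the compute_backend value needs to change:

      - name: histogramming
        environment: 'gitlab-registry.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage3:master'
        compute_backend: slurmcern
        commands:
        - python histograms.py skimmed.root ggH ggH.root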

See also REANA documentation on Slurm.

Opening interactive environments (notebooks) on workflow workspace

While your analysis workflows are running, you may want to open interactive sessions on the workflow workspace, for example to run a Jupyter notebook. This can be achieved via the open command:

reana-client open -w myanalysis.42

The command will generate a unique URL that becomes active after a minute or two, where you will be able to open a notebook or a remote terminal on your workflow workspace.

When the notebook is no longer needed, it can be brought down via the close command:

reana-client close -w myanalysis.42

Exercise

Open a Jupyter notebook on your HiggsToTauTau analysis example and inspect the ROOT files there.

Solution
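
A possible sketch: open a session on the finished workflow and inspect the merged histogram file from a notebook cell; the uproot installation step and the path of the merged file inside the workspace are assumptions to adapt to your run:

reana-client open -w higgstautau-yadage

Then, in a notebook cell:

# install uproot first if the notebook image does not provide it, e.g. !pip install uproot
import uproot
histograms = uproot.open("merge/merged.root")  # adjust to the actual path in your workspace
print(histograms.keys())                       # list the histograms stored in the file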

See also REANA documentation on running notebooks.

Bridging GitLab with REANA

When using GitLab for source code development, GitLab’s native Continuous Integration runners offer a comfortable testing environment for your analyses.

However, the CPU time is usually limited.

If you would like to run REANA workflows directly from GitLab, it is useful to bridge the REANA platform and the GitLab platform via OAuth.

This can be easily achieved from the “Your profile” page of the REANA user interface:

firefox https://reana.cern.ch/

Exercise

Connect your REANA account and your GitLab account and run an example analysis from GitLab on REANA cloud.

Solution

See also REANA documentation on GitLab integration.

Key Points

  • Workflow specification uses hints to hide implementation complexity

  • Use kerberos: true clause to automatically trigger Kerberos token initialisation

  • Use resources clause to access CVMFS repositories

  • Use compute_backend hint in your workflow steps to dispatch jobs to various HPC/HTC backends

  • Use open/close commands to open and close interactive sessions on your workspace

  • Enable the REANA application on GitLab to run long-running tasks that would time out in GitLab CI


Wrap-up

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What have we learned today?

  • Where to go from here?

Objectives
  • Learn about REANA community

Wrapping up

We have discussed the problem of reproducibility and reusability of scientific analyses.

We have practiced the declarative programming approach through several examples of containerised scientific workflows.

We have seen some of the capabilities of the REANA platform allowing you to easily run containerised scientific workflows on remote compute clouds.

Please experiment with the containerised scientific workflows and let us know what you think!

The REANA platform is in a pilot stage; early feedback on real-life use cases is greatly appreciated.

Get in touch:

Thanks and here’s to many preproducible analyses ;)

Key Points

  • Experiment with containerised workflows to advance scientific reproducibility in your research