
Introduction to Databases for HEP



Introduction

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • What is a database management system and why is it a good idea to use one?

  • What are the differences between relational and non-relational databases?

Objectives
  • Understand the concepts of a database management system

  • Learn about common database management systems used in HEP and Nuclear Physics

What is a database?

A database is a collection of data stored in a computer system.

The data is typically organized to model relevant characteristics of the information contained in the database.

Technically speaking, a database can be something as simple as a text file with data stored in a structured format like comma-separated values. For example, you may want to track the files being produced by an experiment in a text file like this:

Filename,          size,  date,       number of events, energy
disk01/file1.root, 1.2GB, 2023-10-01, 12000,            13TeV
disk03/file2.root, 1.3GB, 2023-10-02, 14000,            13TeV
...

This is usually known as metadata: data about data. Having metadata catalogs is a common practice in HEP and Nuclear Physics.

In practice, this is not a reliable or efficient way to store (meta)data. For example, if you want to know how many files were produced on a given date, you have to read the entire file and count the lines that match the date you are interested in. What if you have millions of files? That would be very slow! Similarly, if you want to know the total size of the files produced on a given date, you have to read the entire file and sum the sizes of the matching files. Things get much worse when updates are needed, as the information can easily get corrupted. And what if multiple people need to constantly enter information? What a nightmare to keep everything in sync!

Data organization

When developing software, how we organize data is a critical decision that affects the performance and scalability of the application.

In practice, databases are managed by software called database management systems (DBMS). A DBMS is a software application that interacts with the user, other applications, and the database itself to capture and analyze data. A general-purpose DBMS is specialized to handle databases in a scalable and reliable way, provides mechanisms for concurrent data access, and ensures that the data is not corrupted.

From now on, in this module we will refer to a “database” as an alias for a DBMS.

Relational Database

A relational database is a type of DBMS that organizes data into tables with rows and columns. Following our example above, it would look like this:

+----+------------+----------+-----------+------------+-------------------+--------+
| id | filename   | disk     | size      | date       | number of events  | energy |
+----+------------+----------+-----------+------------+-------------------+--------+
| 1  | file1.root | disk01   | 1.2GB     | 2023-10-01 | 12000             | 13TeV  |
| 2  | file2.root | disk03   | 1.3GB     | 2023-10-02 | 14000             | 13TeV  |
| 3  | file3.root | disk01   | 1.1GB     | 2023-10-02 | 13000             | 13TeV  |
...

In a relational database, data manipulation is performed via Structured Query Language (SQL). You can use SQL to create, read, update, and delete records.

In computer programming, these four functions are referred to by the acronym CRUD.
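As a quick illustration, here is how the four CRUD operations map to SQL statements, using a hypothetical files table with columns like those in the example above (the exact syntax is covered in the next episodes):

-- Create: add a new record
INSERT INTO files (filename, disk, size, date) VALUES ('file4.root', 'disk02', '1.4GB', '2023-10-03');
-- Read: retrieve records
SELECT filename, size FROM files WHERE date = '2023-10-02';
-- Update: modify an existing record
UPDATE files SET size = '1.5GB' WHERE filename = 'file4.root';
-- Delete: remove a record
DELETE FROM files WHERE filename = 'file4.root';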

In SQL, an attribute usually refers to a column or field within a table. For example, in the table above, some of the attributes are filename, size, number of events, and energy.

The database can have multiple tables, and the tables can be related to each other. For example, you can have a table with the information about the files and another table with the information about the location where the files are stored:

+----+----------+-----------+--------------------+
| id | disk     | capacity  | path               |
+----+----------+-----------+--------------------+
| 1  | disk01   | 100TB     | /fs01/exp/data01   |
| 2  | disk02   | 120TB     | /fs01/exp/data02   |
| 3  | disk03   | 150TB     | /fs01/exp/data03   |
...

We will look in detail how to create and manage tables in a relational database in the next episodes.

Examples of relational databases include MySQL, PostgreSQL, Oracle Database, and SQLite. In this training, we will focus on MySQL and a brief introduction to SQLite.

The performance of a relational database is usually very good for most use cases, and it is the most common type of database used in HEP and Nuclear Physics.

NoSQL Databases

There are use cases where a relational database is not the best choice, for example when the data is unstructured or semi-structured.

Non-relational databases, also known as NoSQL databases, do not use the tabular schema of rows and columns found in relational databases. Instead, they use a storage model optimized for the specific requirements of the type of data being stored. In other words, they don't have a fixed set of attributes that each record must have, and the schema is more flexible.
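For illustration, two records in a document-oriented NoSQL database might look like this (a purely hypothetical example; note that the second document has fields the first does not):

{ "filename": "file1.root", "size": "1.2GB", "energy": "13TeV" }
{ "filename": "file2.root", "size": "1.3GB", "energy": "13TeV", "trigger": "HLT_mu20", "comment": "reprocessed" }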

Relational vs NoSQL databases

NoSQL databases have become very popular due to the flexibility they offer. However, many-to-one and many-to-many relationships are more easily represented in a relational database. Which one to choose depends on the specific requirements and must be considered carefully.

In the last part of this training module, we will discuss NoSQL databases in more detail and take a quick look at OpenSearch, a distributed search engine that can be used as a NoSQL database.

Key Points

  • A database management system (DBMS) specializes in managing databases in a scalable and reliable way.

  • A relational database organizes data into tables with rows and columns.

  • A non-relational database does not use the tabular schema of columns found in relational databases, having a more flexible schema.


MySQL Basics

Overview

Teaching: x min
Exercises: x min
Questions
  • What are the core SQL commands you need to know to interact with a MySQL database?

  • How do you create, update, and delete records in a MySQL database?

Objectives
  • Understand the basic SQL commands used in MySQL.

  • Learn how to create a database and tables in MySQL.

  • Practice inserting, updating, and deleting records in a MySQL table.

SQL Commands

Here are some of the core SQL commands that you'll use in MySQL: CREATE DATABASE, CREATE TABLE, INSERT INTO, SELECT, UPDATE, and DELETE. We will go through each of them in this episode.

Setting up for SQL commands

In the terminal, run the following command:

docker exec -it metadata bash -c "mysql -uroot -pmypassword"

Then you will see the MySQL command prompt, mysql>. All the SQL commands below have to be typed at this prompt.

Create a database

We will first create a database named metadata in our MySQL server.

CREATE DATABASE metadata;

Case sensitivity in MySQL

In MySQL, SQL commands are case-insensitive. In other words, CREATE DATABASE metadata is the same as create database metadata.

However, it is a common practice to write SQL commands in uppercase to distinguish them from table and column names, which are case-sensitive.

You can list all the databases by using the command

SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metadata           |
| mysql              |
| performance_schema |
| sys                |
+--------------------+

It shows that the database "metadata" has been created.

What are those other databases?

By default, MySQL comes with several databases that serve specific purposes. We will not go into details of each database, but here is a brief overview:

  • mysql: This database contains user account information and privileges
  • information_schema: Contains metadata about all the other databases in the MySQL server
  • performance_schema: Contains performance metrics for the MySQL server
  • sys: Used for tuning and diagnosis use cases

You can read more about these databases in the MySQL documentation. For now it is not necessary to understand them in detail.

To work with a specific database, you can use the USE command. For instance, to select the “metadata” database:

USE metadata;
Database changed

Creating a table

In SQL, the CREATE TABLE command is used to define a new table within a database. When creating a table, you specify the structure of the table, including the names of its columns, the data types for each column, and any constraints that ensure data integrity.

Here's a breakdown of the components of the CREATE TABLE <table_name> (<column_name> <data_type> <constraints>) command: the table name, the name of each column, the data type of each column (such as INT, VARCHAR, or TEXT), and optional constraints on the column values (such as NOT NULL, UNIQUE, or PRIMARY KEY).

By combining these elements, you define the table’s structure and ensure data consistency and uniqueness. This structured approach to table creation is fundamental to relational databases and is a key part of database design. Keep in mind a database can have multiple tables.

Now, let's proceed with creating our "dataset" table in the "metadata" database.

CREATE TABLE dataset (
    id INT AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(255) NOT NULL UNIQUE,
    run_number INT NOT NULL,
    total_event INT NOT NULL,
    collision_type TEXT,
    data_type TEXT,
    collision_energy INT NOT NULL
);

Notice we have defined id as the primary key with the AUTO_INCREMENT attribute. This means that the id column will automatically generate a unique value for each new record added to the table.

The PRIMARY KEY means that the id column will be the unique identifier for each record in the table. Additionally, a primary key enhances the performance of cross-referencing between tables (we will look at this in more detail in the next episode).

You can see the table and corresponding columns by using the command

DESCRIBE dataset;
+------------------+--------------+------+-----+---------+----------------+
| Field            | Type         | Null | Key | Default | Extra          |
+------------------+--------------+------+-----+---------+----------------+
| id               | int          | NO   | PRI | NULL    | auto_increment |
| filename         | varchar(255) | NO   | UNI | NULL    |                |
| run_number       | int          | NO   |     | NULL    |                |
| total_event      | int          | NO   |     | NULL    |                |
| collision_type   | text         | YES  |     | NULL    |                |
| data_type        | text         | YES  |     | NULL    |                |
| collision_energy | int          | NO   |     | NULL    |                |
+------------------+--------------+------+-----+---------+----------------+

INSERT record into table

You can use the INSERT INTO command to add records to a table. This command has the structure INSERT INTO <table_name> (<column_names>) VALUES (<column_values>). Here's an example of inserting data into the "dataset" table:

INSERT INTO dataset (filename, run_number, total_event, collision_type, data_type, collision_energy)
VALUES ("expx.myfile1.root", 100, 1112, "pp", "data", 11275);
INSERT INTO dataset (filename, run_number, total_event, collision_type, data_type, collision_energy)
VALUES ("expx.myfile2.root", 55, 999, "pPb", "mc", 1127);

Search record in the table

The SELECT command allows you to retrieve records from a table. To retrieve all records from the “dataset” table, you can use:

SELECT * FROM dataset;
+----+-------------------+------------+-------------+----------------+-----------+------------------+
| id | filename          | run_number | total_event | collision_type | data_type | collision_energy |
+----+-------------------+------------+-------------+----------------+-----------+------------------+
|  1 | expx.myfile1.root |        100 |        1112 | pp             | data      |            11275 |
|  2 | expx.myfile2.root |         55 |         999 | pPb            | mc        |             1127 |
+----+-------------------+------------+-------------+----------------+-----------+------------------+

You can select specific columns by listing them after the SELECT statement:

SELECT filename FROM dataset;
+-------------------+
| filename          |
+-------------------+
| expx.myfile1.root |
| expx.myfile2.root |
+-------------------+

2 rows in set (0.00 sec)

Search with some condition

To filter data based on certain conditions, you can use the WHERE clause. This allows you to filter rows based on conditions that you specify. For example, to select filenames where the “collision_type” is ‘pp’:

SELECT filename FROM dataset WHERE collision_type='pp';

In addition you can use logical operators such as AND and OR to combine multiple conditions in the WHERE statement.

SELECT filename FROM dataset WHERE run_number > 50 AND collision_type='pp';
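For example, to select files recorded in either proton-proton or proton-lead collisions (an illustrative variation on the queries above):

SELECT filename FROM dataset WHERE collision_type='pp' OR collision_type='pPb';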

SELECT on different condition

Get the filenames where total_event > 1000 and data_type is "data".

Solution

SELECT filename FROM dataset WHERE total_event > 1000 AND data_type='data';
+-------------------+
| filename          |
+-------------------+
| expx.myfile1.root |
+-------------------+
1 row in set (0.00 sec)

UPDATE

The UPDATE command is used to make changes to existing records.

For example, if you want to update the “collision_type” and “collision_energy” for a specific record, you can use:

UPDATE dataset
SET collision_type = 'PbPb', collision_energy = 300
WHERE filename = 'expx.myfile1.root';

Update on a condition

Update the total_event of file "expx.myfile2.root" to 800.

Solution

UPDATE dataset
SET total_event = 800
WHERE filename = 'expx.myfile2.root';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

DELETE

The DELETE command is used to remove a row from a table.

For example, to delete a record with a specific filename you can use:

DELETE FROM dataset
WHERE filename = 'expx.myfile2.root';

Be careful with UPDATE and DELETE without WHERE!

Very important: if you omit the WHERE clause in an UPDATE or DELETE statement, you will update or delete ALL records in the table!

For example, the following command

DELETE FROM dataset;

will delete all records in the dataset table.

This can have unintended consequences, so be cautious when using these commands.
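One way to protect yourself is to wrap the change in a transaction, so you can inspect its effect before making it permanent. A minimal sketch (assuming an InnoDB table, which is the MySQL default):

START TRANSACTION;
DELETE FROM dataset WHERE filename = 'expx.myfile2.root';
-- inspect the result, e.g. with SELECT * FROM dataset;
ROLLBACK;   -- undo the change, or run COMMIT; instead to make it permanent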

Key Points

  • SQL commands are the building blocks for interacting with a MySQL database.

  • Creating a database and tables is the first step in structuring your data.

  • Inserting, updating, and deleting records allows you to manage data effectively.


Coffee Break

Overview

Teaching: 0 min
Exercises: 15 min
Questions
Objectives


Key Points


SQLAlchemy and MySQL: Exercises

Overview

Teaching: 5 min
Exercises: 6 min
Questions
  • How to perform CRUD operations using SQLAlchemy?

  • How to query and filter records in SQLAlchemy?

Objectives
  • Practice inserting records into a MySQL database using SQLAlchemy.

  • Perform queries and filtering on the dataset table.

  • Update and delete records in the dataset table.

Why Python with SQL?

SQL is a language perfectly suited to specifying database operations in a declarative, record-centred way, but this very design makes it unfamiliar to programmers used to writing object-oriented, imperative, or functional code. Worse, good performance in SQL can depend on the specifics of the database engine we're interacting with: MySQL, PostgreSQL and so on all have their own ideal performance approaches and extensions. Finally, it's easy to write inadvertently insecure SQL queries by forgetting to sanitise and quote inputs. As a consequence, it's much nicer, and safer, for us to use an interface that allows us to write object-oriented, functional implementations of the data manipulation we want, and have the interface "convert" this into efficient SQL queries behind the scenes. As a bonus, the abstraction layer can easily be switched to point at a different backend database with a simple configuration change, without us needing to tweak our (in this case, Python) code itself.

SQLAlchemy

SQLAlchemy is a powerful library that provides a high-level interface for interacting with databases, making database operations more Pythonic and abstracting SQL commands.

Let's create a new database for this chapter.

So as not to collide with the database created in the previous chapter, let's create a different database for this one, called metadata2. In another terminal, run the following command:

docker exec -it metadata bash -c "mysql -uroot -pmypassword"

Then you will see the MySQL command prompt, mysql>. Use the following command to create a database named metadata2:

CREATE DATABASE metadata2;

Installation

Make sure you have Python on your system. Let's create a virtual environment and install SQLAlchemy. First, create a directory to work in:

mkdir myhsfwork && cd myhsfwork

Create a virtual environment:

python -m venv venv

Activate the virtual environment:

source venv/bin/activate

Now install SQLAlchemy and the other dependencies:

pip install sqlalchemy
pip install cryptography
pip install pymysql
pip install jupyter

Bring up Jupyter notebook

For ease of testing, we will use a Jupyter notebook to run the subsequent commands. In your virtual environment, run the following command:

jupyter-notebook

Now create a new Python notebook and use it for the subsequent commands.

Setting Up the Database Connection:

SQLAlchemy facilitates database connections in Python by using an Engine, which acts as the interface between the application and the database. The Engine manages database communication, executing SQL commands, and transaction handling. It requires a connection URL to specify the database type, credentials, and other connection details, which allows SQLAlchemy to translate its functions to whatever database configuration we want to interact with. Sessions, managed by sessionmaker, handle interactions between the application and the database, allowing for transactions, queries, and data manipulation in a structured manner.

Let’s import necessary things.

from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy import Column, Integer, String, Text

Now let's create a URL for our database connection. We need the following URL components: the dialect (mysql), the driver (pymysql), the username, the password, the hostname, the port, and the database name.

So, the URL structure is: dialect+driver://username:password@hostname:port/databaseName or, if you use Apptainer with a socket file: dialect+driver://username:password@localhost/databaseName?unix_socket=filePath

And we create an engine using this db_url.

# Define the MySQL database connection URL
db_url = "mysql+pymysql://root:mypassword@localhost:3306/metadata2"
# if using Apptainer uncomment the next line to define the URL using the socket file:
# db_url = "mysql+pymysql://root:mypassword@localhost/metadata2?unix_socket=/var/run/mysqld/mysql.sock"

# Create an SQLAlchemy engine
engine = create_engine(db_url)

Session for each connection

sessionmaker is a factory function that creates a session factory in SQLAlchemy. Sessions are used to interact with a database in SQLAlchemy, providing a way to persist, retrieve, and manipulate data. sessionmaker generates a configurable factory for creating sessions, allowing you to customize settings such as autocommit behavior, autoflush, and more. Sessions obtained from the session factory represent a single "unit of work" with the database, encapsulating a series of operations that should be treated atomically. Let's open a session that will be used to perform database operations from Python, using the Engine that we created.

Session = sessionmaker(bind=engine)
session = Session()

Define a base for declarative class

Declarative Base is a factory function from SQLAlchemy that generates a base class for declarative data models. This base class allows you to define database tables as Python classes, making it easier to define models by mapping Python classes to database tables. The classes created with declarative_base() typically inherit from this base class, and they include table schema definitions as class attributes.

Base = declarative_base()

Define and Create a Table

Now we will define the table named dataset2.

Define the dataset table

Here we have a column named id, defined as an Integer type, which serves as the primary key for the table. It auto-increments, meaning its value automatically increases for each new row added to the table. We set the filename column as unique so that there is no duplication of filenames in the table. If the nullable option is set to False, the column must have a value.

class Dataset(Base):
    __tablename__ = "dataset2"

    id = Column(Integer, primary_key=True, autoincrement=True)
    filename = Column(String(255), unique=True, nullable=False)
    run_number = Column(Integer, nullable=False)
    total_event = Column(Integer, nullable=False)
    collision_type = Column(Text)
    data_type = Column(Text)
    collision_energy = Column(Integer, nullable=False)

Create the table

The following code, Base.metadata.create_all(engine), is an SQLAlchemy command that instructs the engine to create database tables based on the defined models (such as Dataset) mapped to SQLAlchemy classes (derived from Base) within the application. This command generates the corresponding SQL statements to create the tables in the specified database (referenced by engine) based on the model definitions provided in the code.

Base.metadata.create_all(engine)

Insert record

Let's insert two records into the table.

dataset1 = Dataset(
    filename="expx.myfile1.root",
    run_number=100,
    total_event=1112,
    collision_type="pp",
    data_type="data",
    collision_energy=11275,
)
dataset2 = Dataset(
    filename="expx.myfile2.root",
    run_number=55,
    total_event=999,
    collision_type="pPb",
    data_type="mc",
    collision_energy=1127,
)

session.add(dataset1)
session.add(dataset2)
session.commit()

session.commit() is a command that effectively saves the changes made within the session to the database. This action persists the changes permanently in the database, making the additions to the table permanent and visible for subsequent transactions.
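If a commit fails, for example because a filename already exists and violates the unique constraint, you can discard the pending changes with session.rollback() so the session stays usable. A minimal sketch, deliberately trying to re-insert a filename that is already in the table:

duplicate = Dataset(
    filename="expx.myfile1.root",  # already exists, so the commit below will fail
    run_number=1,
    total_event=1,
    collision_type="pp",
    data_type="data",
    collision_energy=1,
)
try:
    session.add(duplicate)
    session.commit()
except Exception:
    session.rollback()  # discard the failed insert so the session can be reused
    print("Insert failed, rolled back")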

Search the database.

session.query() is used to create a query object that represents a request for data from the database. In this case, session.query(Dataset.filename) selects the filename column from the Dataset table. The .all() method executes the query and retrieves all the values from the filename column, returning a list of results containing these values from the database; it's like "collect()" on iterator types, if you're familiar with those.

# Query the filename column from the dataset table
results = session.query(Dataset.filename).all()

# Print the results
for result in results:
    print(result.filename)

Search on different column

Retrieve and display all collision_type values.

Solution

results = session.query(Dataset.collision_type).all()
for result in results:
    print(result.collision_type)
pp
pPb

Search the database with a condition.

In SQLAlchemy, the filter() method is used within a query() to add conditions or criteria to the query. It narrows down the selection by applying specific constraints based on the given criteria.

results1 = session.query(Dataset.filename).filter(Dataset.collision_type == "pp").all()
# Print the results for the first query
print("Filenames where collision_type is 'pp':")
for result in results1:
    print(result.filename)

results2 = (
    session.query(Dataset.filename)
    .filter(Dataset.run_number > 50, Dataset.collision_type == "pp")
    .all()
)
# Print the results for the second query
print("\nFilenames where run_number > 50 and collision_type is 'pp':")
for result in results2:
    print(result.filename)

Search using OR statement

Retrieve and display filenames with data_type "mc" or collision_energy greater than 1000.

Solution

results = session.query(Dataset.filename).filter((Dataset.data_type == "mc") | (Dataset.collision_energy > 1000)).all()
for result in results:
    print(result.filename)
expx.myfile1.root
expx.myfile2.root

Update the database

To update a record in a table, you begin by querying for the specific record or records you want to update using query(). The filter() method is used to specify the conditions for the selection. Once you have the record(s) to update, modify the attributes as needed. Finally, calling session.commit() saves the changes made to the database. This ensures that the modifications are persisted permanently.

# Update a record in the table
record_to_update = (
    session.query(Dataset).filter(Dataset.filename == "expx.myfile1.root").first()
)
if record_to_update:
    record_to_update.collision_type = "PbPb"
    record_to_update.collision_energy = 300
    session.commit()

Update

Update run_number to 1000 for the record with data_type "mc".

Solution

record_to_update = session.query(Dataset).filter(Dataset.data_type == "mc").first()
if record_to_update:
    record_to_update.run_number = 1000
    session.commit()


Delete a record

Deleting works basically the same way: we first get the record to delete using query() and filter(), then delete the record and commit to make the change permanent.

# Delete a record from the table
record_to_delete = (
    session.query(Dataset).filter(Dataset.filename == "expx.myfile2.root").first()
)
if record_to_delete:
    session.delete(record_to_delete)
    session.commit()

Close the session

It’s essential to close the session after you’ve finished working with it to release the resources associated with it, such as database connections and transactional resources. By closing the session, you ensure that these resources are properly released, preventing potential issues like resource leakage and maintaining good database connection management practices.

session.close()
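As an aside, with SQLAlchemy 1.4 or newer you can also use the session as a context manager, so it is closed automatically when the block exits. A minimal sketch:

with Session() as session:
    # the session is closed automatically at the end of the with block
    results = session.query(Dataset.filename).all()
    for result in results:
        print(result.filename)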

Key Points

  • CRUD operations in SQLAlchemy: Create, Read, Update, Delete.

  • Querying and filtering records based on specific conditions.


A bit more on queries

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • How to perform SQL Join?

  • How to make use of different SQL joins?

Objectives
  • To retrieve data from multiple tables simultaneously when the data is related.

SQL JOIN

In SQL, JOIN operations are used to combine rows from two or more tables based on a common column. Different types of JOINs can be used depending on your needs.

In this tutorial, we’ll explore various JOIN types using the dataset and experiments tables.

Note: If you haven’t created these tables in your database yet, please refer to the basic MySQL queries in the introduction tutorial to set them up.

Consider the following two tables (dataset and experiments) for our tutorial:

Example:

mysql> SELECT * FROM dataset;
+----+--------------------+------------+-------------+----------------+-----------+------------------+
| id | filename           | run_number | total_event | collision_type | data_type | collision_energy |
+----+--------------------+------------+-------------+----------------+-----------+------------------+
|  1 | lhcb.myfile.root   |        101 |        1234 | pp             | data      |            13400 |
|  2 | atlas.myfile.root  |        202 |        5678 | pp             | mc        |            13000 |
|  3 | cms.myfile.root    |        303 |        9101 | pp             | data      |            13600 |
|  4 | alice.myfile.root  |        404 |        1123 | Pb-Pb          | mc        |            55000 |
|  5 | belle2.myfile.root |        505 |        3141 | e⁺e⁻           | data      |            10500 |
|  6 | babar.myfile.root  |        606 |        7890 | e⁺e⁻           | data      |            10000 |
+----+--------------------+------------+-------------+----------------+-----------+------------------+
mysql> SELECT * FROM experiments;
+----+-----------------+------+
| id | experiment_name | year |
+----+-----------------+------+
|  1 | LHCb            | 2009 |
|  2 | ATLAS           | 2008 |
|  3 | CMS             | 2008 |
|  4 | ALICE           | 2010 |
|  5 | BELLE2          | 2018 |
+----+-----------------+------+

INNER JOIN

An INNER JOIN returns records that have matching values in both tables.

Example: We want to get the filename and experiment_name for each dataset if there’s a corresponding experiment.

We’ll join on the id column to find the matching records.

SELECT d.filename, e.experiment_name
FROM dataset d
INNER JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+
| filename           | experiment_name |
+--------------------+-----------------+
| lhcb.myfile.root   | LHCb            |
| atlas.myfile.root  | ATLAS           |
| cms.myfile.root    | CMS             |
| alice.myfile.root  | ALICE           |
| belle2.myfile.root | BELLE2          |
+--------------------+-----------------+

One more example:

SELECT d.filename, e.experiment_name, d.total_event, d.collision_type, d.data_type, d.collision_energy
FROM dataset d
INNER JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+-------------+----------------+-----------+------------------+
| filename           | experiment_name | total_event | collision_type | data_type | collision_energy |
+--------------------+-----------------+-------------+----------------+-----------+------------------+
| lhcb.myfile.root   | LHCb            |        1234 | pp             | data      |            13400 |
| atlas.myfile.root  | ATLAS           |        5678 | pp             | mc        |            13000 |
| cms.myfile.root    | CMS             |        9101 | pp             | data      |            13600 |
| alice.myfile.root  | ALICE           |        1123 | Pb-Pb          | mc        |            55000 |
| belle2.myfile.root | BELLE2          |        3141 | e⁺e⁻           | data      |            10500 |
+--------------------+-----------------+-------------+----------------+-----------+------------------+

Note: INNER JOIN returns only the rows that have matching values in both tables.

LEFT JOIN

A LEFT JOIN returns all records from the left table, and the matched records from the right table. The result is NULL on the right side when there is no match.

Example:

SELECT d.filename, e.experiment_name
FROM dataset d
LEFT JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+
| filename           | experiment_name |
+--------------------+-----------------+
| alice.myfile.root  | ALICE           |
| atlas.myfile.root  | ATLAS           |
| babar.myfile.root  | NULL            |
| belle2.myfile.root | BELLE2          |
| cms.myfile.root    | CMS             |
| lhcb.myfile.root   | LHCb            |
+--------------------+-----------------+

RIGHT JOIN

A RIGHT JOIN returns all records from the right table, and the matched records from the left table. The result is NULL from the left side when there is no match.

Example:

SELECT e.experiment_name, d.filename
FROM experiments e
RIGHT JOIN dataset d
ON d.id = e.id;
+-----------------+--------------------+
| experiment_name | filename           |
+-----------------+--------------------+
| ALICE           | alice.myfile.root  |
| ATLAS           | atlas.myfile.root  |
| NULL            | babar.myfile.root  |
| BELLE2          | belle2.myfile.root |
| CMS             | cms.myfile.root    |
| LHCb            | lhcb.myfile.root   |
+-----------------+--------------------+

FULL OUTER JOIN

A FULL OUTER JOIN returns all records when there is a match in either left or right table. This type of join is not directly supported in MySQL, but you can achieve it using UNION.

Example:

SELECT d.filename, e.experiment_name
FROM dataset d
LEFT JOIN experiments e
ON d.id = e.id
UNION
SELECT d.filename, e.experiment_name
FROM experiments e
LEFT JOIN dataset d
ON d.id = e.id;
+--------------------+-----------------+
| filename           | experiment_name |
+--------------------+-----------------+
| alice.myfile.root  | ALICE           |
| atlas.myfile.root  | ATLAS           |
| babar.myfile.root  | NULL            |
| belle2.myfile.root | BELLE2          |
| cms.myfile.root    | CMS             |
| lhcb.myfile.root   | LHCb            |
+--------------------+-----------------+

Task: Joining Tables with Updated Schema

Scenario: You have updated the experiments table by adding a new run_number column, which matches some of the run_number values in the dataset table. Use this new column to retrieve combined data from both tables: write an SQL query that retrieves the filename, total event count, and experiment name for each dataset. Ensure that the query matches the datasets to the corresponding experiments using the run_number column.

Hint: Use an INNER JOIN to combine the dataset and experiments tables based on the run_number column

ALTER TABLE experiments ADD COLUMN run_number INT;
UPDATE experiments SET run_number = 101 WHERE experiment_name = 'LHCb';
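One possible query, assuming you have also filled in run_number for the other experiments (only LHCb is set in the snippet above):

SELECT d.filename, d.total_event, e.experiment_name
FROM dataset d
INNER JOIN experiments e
ON d.run_number = e.run_number;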

CROSS JOIN

A CROSS JOIN generates the Cartesian product of two tables, meaning every row from the first table is paired with every row from the second table. This can be useful for creating all possible combinations of rows, especially when crafting complex queries.

In MySQL, JOIN, CROSS JOIN, and INNER JOIN can often be used interchangeably. However, in standard SQL, they serve different purposes: INNER JOIN requires an ON clause to specify how tables should be matched, whereas CROSS JOIN creates a Cartesian product without any matching conditions.

To see how a CROSS JOIN works, let’s create an additional table named additional_info in the same database.

mysql> SELECT * FROM additional_info;
+----+-----------+------------+
| id | info_name | run_number |
+----+-----------+------------+
|  1 | Info A    |        101 |
|  2 | Info B    |        202 |
|  3 | Info C    |        303 |
|  4 | Info D    |        404 |
|  5 | Info E    |        505 |
+----+-----------+------------+
mysql> SELECT e.experiment_name, i.run_number FROM experiments e CROSS JOIN additional_info i;
+-----------------+------------+
| experiment_name | run_number |
+-----------------+------------+
| BELLE2          |        101 |
| ALICE           |        101 |
| CMS             |        101 |
| ATLAS           |        101 |
| LHCb            |        101 |
| BELLE2          |        202 |
| ALICE           |        202 |
| CMS             |        202 |
| ATLAS           |        202 |
| LHCb            |        202 |
| BELLE2          |        303 |
| ALICE           |        303 |
| CMS             |        303 |
| ATLAS           |        303 |
| LHCb            |        303 |
| BELLE2          |        404 |
| ALICE           |        404 |
| CMS             |        404 |
| ATLAS           |        404 |
| LHCb            |        404 |
| BELLE2          |        505 |
| ALICE           |        505 |
| CMS             |        505 |
| ATLAS           |        505 |
| LHCb            |        505 |
+-----------------+------------+

Try Yourself!

The two queries below have the same output. Can you explain why?

SELECT *
FROM dataset d
LEFT JOIN (experiments e, additional_info a)
ON (e.run_number = d.run_number AND a.run_number = d.run_number);
SELECT *
FROM dataset d
LEFT JOIN (experiments e CROSS JOIN additional_info a)
ON (e.run_number = d.run_number AND a.run_number = d.run_number);

Enjoy Your SQL Journey!

Key Points

  • Performing different joins depending on the related column.


Conditions Database Example Using SQLAlchemy

Overview

Teaching: x min
Exercises: x min
Questions
Objectives


This lesson demonstrates how to create a simple Conditions Database using SQLAlchemy in Python. We will define three tables: GlobalTag, PayloadType, and PayloadIOV, and establish relationships between them. We will then add example data and query the database to retrieve specific entries.

Imports

First, we import the necessary modules from SQLAlchemy.

from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, sessionmaker, relationship

Define ORM Models

We define our ORM models: GlobalTag, PayloadType, and PayloadIOV, along with the necessary relationships; the class definitions appear below, under "Define Tables". We also import func and and_ from sqlalchemy.sql, which are needed for the query at the end of this episode.

from sqlalchemy.sql import func, and_

Database Connection

Next, we define the MySQL database connection URL and create an SQLAlchemy engine, session, and declarative base.

# Define the MySQL database connection URL
db_url = "mysql+pymysql://root:pass@localhost:3306/cdb"

# Create an SQLAlchemy engine
engine = create_engine(db_url)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()

Define Tables

We define all the tables in the database.

class GlobalTag(Base):
    __tablename__ = "GlobalTag"

    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(255), unique=True, nullable=False)

    # Relationship to PayloadType
    payload_types = relationship("PayloadType", back_populates="global_tag")


class PayloadType(Base):
    __tablename__ = "PayloadType"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(255), unique=True, nullable=False)
    global_tag_id = Column(Integer, ForeignKey("GlobalTag.id"))

    # Relationship to GlobalTag and PayloadIOV
    global_tag = relationship("GlobalTag", back_populates="payload_types")
    payload_iovs = relationship("PayloadIOV", back_populates="payload_type")


class PayloadIOV(Base):
    __tablename__ = "PayloadIOV"
    id = Column(Integer, primary_key=True, autoincrement=True)
    payload_url = Column(String(255), nullable=False)
    iov = Column(Integer, nullable=False)
    payload_type_id = Column(Integer, ForeignKey("PayloadType.id"))

    # Relationship to PayloadType
    payload_type = relationship("PayloadType", back_populates="payload_iovs")

Create Tables

We drop any existing tables and then create all the tables in the database, so the example can be rerun cleanly.

# Drop any existing tables so the example starts from a clean state
Base.metadata.drop_all(engine)
# Create all tables in the database
Base.metadata.create_all(engine)

Adding Example Data

We add some example data to the database for GlobalTag, PayloadType, and PayloadIOV.

# Adding example data
global_tag = GlobalTag(name="DetectorConfiguration")
session.add(global_tag)

daq_payload_type = PayloadType(name="DAQSettings", global_tag=global_tag)
dcs_payload_type = PayloadType(name="DCSSettings", global_tag=global_tag)

session.add(daq_payload_type)
session.add(dcs_payload_type)

daq_payload_iovs = [
    PayloadIOV(
        payload_url="http://example.com/daq1", iov=1, payload_type=daq_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/daq2", iov=2, payload_type=daq_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/daq3", iov=3, payload_type=daq_payload_type
    ),
]

dcs_payload_iovs = [
    PayloadIOV(
        payload_url="http://example.com/dcs1", iov=1, payload_type=dcs_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/dcs2", iov=2, payload_type=dcs_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/dcs3", iov=3, payload_type=dcs_payload_type
    ),
]

session.add_all(daq_payload_iovs)
session.add_all(dcs_payload_iovs)
session.commit()

Query the Database

Finally, we query the database to get the latest PayloadIOV entries for each PayloadType for a specific GlobalTag and IOV.

# Query to get the last PayloadIOV entries for each PayloadType for a specific GlobalTag and IOV
requested_iov = 2
requested_gt = "DetectorConfiguration"

# Subquery to find the maximum IOV for each PayloadType
subquery = (
    session.query(PayloadIOV.payload_type_id, func.max(PayloadIOV.iov).label("max_iov"))
    .join(PayloadType, PayloadType.id == PayloadIOV.payload_type_id)
    .join(GlobalTag, GlobalTag.id == PayloadType.global_tag_id)
    .filter(GlobalTag.name == requested_gt, PayloadIOV.iov <= requested_iov)
    .group_by(PayloadIOV.payload_type_id)
    .subquery()
)

# Main query to get the latest PayloadIOV for each PayloadType
query = (
    session.query(
        GlobalTag.name.label("global_tag_name"),
        PayloadType.name.label("payload_type_name"),
        PayloadIOV.payload_url,
        subquery.c.max_iov,
    )
    .join(PayloadType, GlobalTag.id == PayloadType.global_tag_id)
    .join(subquery, subquery.c.payload_type_id == PayloadType.id)
    .join(
        PayloadIOV,
        and_(
            PayloadType.id == PayloadIOV.payload_type_id,
            subquery.c.max_iov == PayloadIOV.iov,
        ),
    )
    .filter(GlobalTag.name == requested_gt)
    .order_by(PayloadType.name)
).all()

for global_tag_name, payload_type_name, payload_url, max_iov in query:
    print(
        f"GlobalTag: {global_tag_name}, PayloadType: {payload_type_name}, PayloadIOV URL: {payload_url}, IOV: {max_iov}"
    )
GlobalTag: DetectorConfiguration, PayloadType: DAQSettings, PayloadIOV URL: http://example.com/daq2, IOV: 2
GlobalTag: DetectorConfiguration, PayloadType: DCSSettings, PayloadIOV URL: http://example.com/dcs2, IOV: 2

Key Points


Coffee Break

Overview

Teaching: 0 min
Exercises: 15 min
Questions
Objectives


Key Points


SQLite

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • What is SQLite?

  • How do I create an SQLite database?

Objectives
  • Creating SQLite database

  • Manipulating the database

Introduction to SQLite

Similar to MySQL, SQLite is a database management system. However, it is simpler, more flexible, and more user-friendly than the former. It implements a self-contained, serverless, zero-configuration, transactional SQL database engine, which makes SQLite a preferred choice for lightweight embedded databases. SQLite is available in both C and Python; here we will proceed with the Python implementation, which is user-friendly and ships with Python (2.6+) as the built-in sqlite3 module.

Particle Data Group database

Create a working directory hsf_sqlite_training:

mkdir hsf_sqlite_training
cd hsf_sqlite_training

Let's create a database that contains information from the Particle Data Group particle properties table. First, we need to download the particle properties table from 2023. You can do it manually and place it in your directory hsf_sqlite_training under the name particle_table.txt, or, if you use Python 3.7+, you can download the file automatically with the Python module requests.

Additional: using requests

To check the Python version on your machine do the following:

python3 --version

If the Python version is 3.7 or higher, you can install requests with python -m pip install requests. Then you can create a script download_particle_table.py using your favorite code editor:

import requests
particle_table = "https://pdg.lbl.gov/2023/mcdata/mass_width_2023.txt" #url of the file we want
response = requests.get(particle_table) #getting the response from the url.

if response.status_code == 200:
   with open("particle_table.txt", "wb") as file: #writing the response into the txt file `particle_table.txt` locally
       file.write(response.content)
       print("Particle table is downloaded!")
else:
   print("Failed to download the particle table.") #it can be that the server is down or you have a typo in the url

Save the download_particle_table.py script and execute with python3 download_particle_table.py. You should see the downloaded file particle_table.txt in your working directory.

Open the particle_table.txt in your favorite text editor and study the data structure inside. You should see:

* MASSES, WIDTHS, AND MC ID NUMBERS FROM 2023 EDITION OF RPP
*
* The following values were generated on 31-May-2023 by the Berkeley Particle
* Data Group from the Review of Particle Physics database and are intended
* for use in Monte Carlo programs.
*
* For questions regarding distribution or content of this file, contact
* the Particle Data Group at pdg@lbl.gov.
*
* To process the images in this file:
* 1) ignore documentation lines that begin with an asterisk
* 2) in a FORTRAN program, process data lines with
*    FORMAT (BN, 4I8, 2(1X,E18.0, 1X,E8.0, 1X,E8.0), 1X,A21)
* 3)    column
*       1 -  8 \ Monte Carlo particle numbers as described in the "Review of
*       9 - 16 | Particle Physics". Charge states appear, as appropriate,
*      17 - 24 | from left-to-right in the order -, 0, +, ++.
*      25 - 32 /
*           33   blank
*      34 - 51   central value of the mass (double precision)
*           52   blank
*      53 - 60   positive error
*           61   blank
*      62 - 69   negative error
*           70   blank
*      71 - 88   central value of the width (double precision)
*           89   blank
*      90 - 97   positive error
*           98   blank
*      99 -106   negative error
*          107   blank
*     108 -128   particle name left-justified in the field and
*                charge states right-justified in the field.
*                This field is for ease of visual examination of the file and
*                should not be taken as a standardized presentation of
*                particle names.
*
* Particle ID(s)                  Mass  (GeV)       Errors (GeV)       Width (GeV)       Errors (GeV)      Name          Charges
      21                          0.E+00            +0.0E+00 -0.0E+00  0.E+00            +0.0E+00 -0.0E+00 g                   0
      22                          0.E+00            +0.0E+00 -0.0E+00  0.E+00            +0.0E+00 -0.0E+00 gamma               0
      24                          8.0377E+01        +1.2E-02 -1.2E-02  2.08E+00          +4.0E-02 -4.0E-02 W                   +
      23                          9.11876E+01       +2.1E-03 -2.1E-03  2.4955E+00        +2.3E-03 -2.3E-03 Z                   0
      25                          1.2525E+02        +1.7E-01 -1.7E-01  3.2E-03           +2.4E-03 -1.7E-03 H                   0
...

Now that we have obtained the particle table, let’s build the SQlite database!

Create the sqlite3 database

Firstly, we need to connect to the sqlite3 database. In a file called create_database.py, type in:

import sqlite3

connection = sqlite3.connect(
    ":memory:"
)  # create or connect to database. Returns `Connection` object.

"""
Here one can perform operations on the data base, such as inserting, updating, deleting and selecting.
"""

connection.close()  # close database connection

The Connection object represents the database that, in this case, is stored in RAM using the special name :memory:. If you want to save the database locally, replace :memory: with <name>.db. If the file <name>.db already exists, connect will simply connect to that file; if it does not exist, connect will create a new file called <name>.db. After all the database operations are finished, one needs to close the connection using the close() method of the Connection object.

Fill in the sqlite3 database

Secondly, we need to fill the database with the inputs from the particle_table.txt we downloaded before. To fill it, we use SQL commands. The SQL commands can be passed to the database using a Cursor object, which can be created once the Connection object exists. The Cursor object's execute method takes in SQL commands, which allows one to work with databases directly in SQL. As a first step, we create the database table based on the data structure of particle_table.txt and fill it in. In create_database.py:

import sqlite3
import pandas as pd

connection = sqlite3.connect(
    ":memory:"
)  # create or connect to database. Returns `Connection` object.
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE particles(id INTEGER, mass FLOAT, masserrlow FLOAT, masserrup FLOAT, width FLOAT, widtherrlow FLOAT, widtherr_up FLOAT,  name TEXT, charge INTEGER)"
)
cursor.execute(
    "INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [21, 0.0e00, +0.0e00, -0.0e00, 0.0e00, +0.0e00, -0.0e00, "g", 0],
)  # gluon is filled in
connection.commit()  # commit to database
readout = pd.read_sql("SELECT * FROM particles", connection)  # read from database
print(readout)  # print the database
connection.close()  # close database connection

In the execute call we created a table called particles that contains a particle id (INTEGER), mass (FLOAT), upper and lower mass errors (FLOAT), width (FLOAT), upper and lower width errors (FLOAT), particle name (TEXT), and charge (INTEGER). Then we fill in the information on the gluon into our particles table. After that, we commit the changes to the database with the Connection object's commit method. To read the database, we use the pandas read_sql function. Now execute create_database.py with python3 create_database.py. You should see:

   id  mass  masserrlow  masserrup  width  widtherrlow  widtherr_up name charge
0  21   0.0         0.0        0.0    0.0          0.0          0.0    g      0

However, we would like to write the entire list of PDG particles into our database, not just the gluon! For this, we will use executemany instead of execute. But before that, we need to prepare the input data in particle_table.txt. First of all, the first 38 lines are taken up by header information that we do not need in our database, so we will skip these rows when copying. Secondly, the sizes of the columns differ in particle_table.txt, but luckily the header specifies the exact sizes of the table columns:

* 3)    column
*       1 -  8 \ Monte Carlo particle numbers as described in the "Review of
*       9 - 16 | Particle Physics". Charge states appear, as appropriate,
*      17 - 24 | from left-to-right in the order -, 0, +, ++.
*      25 - 32 /
*           33   blank
*      34 - 51   central value of the mass (double precision)
*           52   blank
*      53 - 60   positive error
*           61   blank
*      62 - 69   negative error
*           70   blank
*      71 - 88   central value of the width (double precision)
*           89   blank
*      90 - 97   positive error
*           98   blank
*      99 -106   negative error
*          107   blank
*     108 -128   particle name left-justified in the field and
*                charge states right-justified in the field.
*                This field is for ease of visual examination of the file and
*                should not be taken as a standardized presentation of
*                particle names.

You can check in your favorite text editor that the column definitions are correct. Note that particle_table.txt contains more information than we want to save in our database. For example, we are not writing down the charge states that are listed in columns 9 - 32, so we would like to skip these columns. Moreover, columns 108 - 128 contain both the name and the charge, but we would like to split these into two columns in our database. We can then prepare the data as follows:

import pandas as pd

colspecs = [
    (0, 8),
    (33, 51),
    (52, 60),
    (61, 69),
    (70, 88),
    (89, 97),
    (98, 106),
    (107, 128),
]
column_names = [
    "id",
    "mass",
    "masserrup",
    "masserrdown",
    "width",
    "widtherrup",
    "widtherrdown",
    "namecharge",
]

# Read the file with the specified column widths
data = pd.read_fwf(
    "particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)
data = data.values.tolist()

executemany expects an iterable input, so we transform the pandas DataFrame into nested lists. Now we can commit the entire particle table to the database.

cursor.executemany("INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)", data)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection)  # read from database
print(readout)  # print the database
connection.close()

Save the create_database.py and execute with python3 create_database.py, you should see the following output:

       id       mass  masserrlow  masserrup         width   widtherrlow  \
0      21    0.00000     0.00000    0.00000  0.000000e+00  0.000000e+00
1      22    0.00000     0.00000    0.00000  0.000000e+00  0.000000e+00
2      24   80.37700     0.01200   -0.01200  2.080000e+00  4.000000e-02
3      23   91.18760     0.00210   -0.00210  2.495500e+00  2.300000e-03
4      25  125.25000     0.17000   -0.17000  3.200000e-03  2.400000e-03
..    ...        ...         ...        ...           ...           ...
224  5114    5.83474     0.00030   -0.00030  1.040000e-02  8.000000e-04
225  5224    5.83032     0.00027   -0.00027  9.400000e-03  5.000000e-04
226  5132    5.79700     0.00060   -0.00060  4.190000e-13  1.100000e-14
227  5232    5.79190     0.00050   -0.00050  4.450000e-13  9.000000e-15
228  5332    6.04520     0.00120   -0.00120  4.000000e-13  5.000000e-14

      widtherr_up       name charge
0    0.000000e+00          g      0
1    0.000000e+00      gamma      0
2   -4.000000e-02          W      +
3   -2.300000e-03          Z      0
4   -1.700000e-03          H      0
..            ...        ...    ...
224 -8.000000e-04  Sigma(b)*      -
225 -5.000000e-04  Sigma(b)*      +
226 -1.100000e-14      Xi(b)      -
227 -9.000000e-15      Xi(b)      0
228 -4.000000e-14   Omega(b)      -

[229 rows x 9 columns]

Creating an SQL database directly from a pandas DataFrame

The pandas module has some useful methods for creating SQL databases. The example above can be replaced with:

import sqlite3
import pandas as pd

colspecs = [
    (0, 8),
    (33, 51),
    (52, 60),
    (61, 69),
    (70, 88),
    (89, 97),
    (98, 106),
    (107, 128),
]
column_names = [
    "id",
    "mass",
    "masserrup",
    "masserrdown",
    "width",
    "widtherrup",
    "widtherrdown",
    "namecharge",
]

data = pd.read_fwf(
    "particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)

connection = sqlite3.connect(":memory:")
data.to_sql("particles", connection, index=False)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection)  # read from database
print(readout)  # print the database
connection.close()

Manipulating the database with SQLite

Before learning how to manipulate the database, let's first save the database created in the previous steps not in RAM but in a file called particles.db. Replace the database name :memory: with particles.db and rerun create_database.py. You should see particles.db in your directory.

Exercise

Open the particles.db file and create a table that contains all neutral particles.

Solution

cursor.execute("CREATE TABLE neutral_particles AS SELECT * from particles WHERE charge = '0'")

Exercise

Open the particles.db file and select only neutral leptons.

Solution

readout = pd.read_sql("SELECT * FROM neutral_particles WHERE name LIKE 'nu%'", connection)
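For reference, a minimal self-contained script that opens particles.db and runs both exercise queries (assuming you created the file as described above) could look like this:

import sqlite3
import pandas as pd

connection = sqlite3.connect("particles.db")
cursor = connection.cursor()

# Create a table containing all neutral particles (skipped if it already exists)
cursor.execute(
    "CREATE TABLE IF NOT EXISTS neutral_particles AS SELECT * FROM particles WHERE charge = '0'"
)
connection.commit()

# Select only neutral leptons (names starting with 'nu')
readout = pd.read_sql(
    "SELECT * FROM neutral_particles WHERE name LIKE 'nu%'", connection
)
print(readout)
connection.close()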

Key Points

  • For lightweight applications, use SQLite.

  • Benefit from integration between sqlite3 and pandas.


Intro to NoSQL and Opensearch Queries

Overview

Teaching: x min
Exercises: 6 min
Questions
  • What is NoSQL database and Opensearch?

  • How to perform indexing in Opensearch?

  • How to query and filter records in opensearch?

Objectives
  • Understand the basic structure of Opensearch queries.

  • Learn how to create and manage indices in Opensearch.

  • Practice using different types of queries such as term queries, range queries, and compound queries.

  • Gain familiarity with updating and deleting documents in Opensearch.

NoSQL Databases

NoSQL databases diverge from the traditional table-based structure of RDBMSs and are designed to handle unstructured or semi-structured data. They offer flexibility in data modeling and storage, supporting various data formats. The main types of NoSQL database are listed below:

  • Key-Value Store: stores data as key-value pairs; simple and efficient for basic storage and retrieval operations. Examples: Redis, DynamoDB, Riak
  • Document-Oriented: stores data in flexible JSON-like documents, allowing nested structures and complex data modeling. Examples: MongoDB, Couchbase, CouchDB, OpenSearch, Elasticsearch
  • Column-Family Store: organizes data into columns rather than rows, suitable for analytical queries and data warehousing. Examples: Apache Cassandra, HBase, ScyllaDB
  • Graph Database: models data as nodes and edges, ideal for complex relationships and network analysis. Examples: Neo4j, ArangoDB, OrientDB
  • Wide-Column Store: similar to column-family stores but optimized for wide rows and scalable columnar data storage. Examples: Apache HBase, Apache Kudu, Google Bigtable

OpenSearch Databases

OpenSearch is a document-oriented NoSQL database: it stores data as JSON documents. It is also a distributed search and analytics engine designed for scalability, real-time data processing, and full-text search capabilities. It is often used for log analytics, monitoring, and exploring large volumes of structured and unstructured data.

In the following chapters, we will build a metadata search engine/database. We will exploit the functionality of OpenSearch to create a database where we can store files with their corresponding metadata, and look for the files that match metadata queries.

Opensearch Queries

Let's explore fundamental Opensearch queries and concepts. Opensearch provides powerful search capabilities; here are some core queries that you'll use.

Setting up

Make sure you have Python installed on your system. We will work inside a virtual environment. First, create a working directory:

mkdir myopenhsfwork && cd myopenhsfwork

Create a virtual environment:

python -m venv venv

Activate the virtual environment:

source venv/bin/activate

Install Jupyter and the OpenSearch Python client (opensearch-py):

pip install jupyter
pip install opensearch-py

Then bring up Jupyter. In your virtual environment, run the following command:

jupyter-notebook

Now create a new notebook and run the subsequent commands in its cells.

OpenSearch connection

We will use the OpenSearch class from opensearchpy to establish the connection and initialize the Opensearch client. We need to specify OPENSEARCH_HOST and OPENSEARCH_PORT, which were set during setup (localhost and 9200, respectively). We write OPENSEARCH_USERNAME and OPENSEARCH_PASSWORD (the same password you specified during setup) directly in the code for this tutorial only; don't store credentials in code. Two further options control transport security: use_ssl tells the OpenSearch client whether to use SSL/TLS (Secure Sockets Layer / Transport Layer Security), and verify_certs controls whether the client verifies the SSL certificate presented by the server. For this tutorial we keep use_ssl enabled but disable certificate verification; for a production instance, certificate verification should be enabled.

from opensearchpy import OpenSearch

OPENSEARCH_HOST = "localhost"
OPENSEARCH_PORT = 9200
OPENSEARCH_USERNAME = "admin"
OPENSEARCH_PASSWORD = "<custom-admin-password>"
# Initialize an Opensearch client
es = OpenSearch(
    hosts=[{"host": OPENSEARCH_HOST, "port": OPENSEARCH_PORT}],
    http_auth=(OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD),
    use_ssl=True,
    verify_certs=False,
)
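
As an optional sanity check, you can ask the cluster for its basic information; if the host, port, or credentials are wrong, this call raises an error:

# Confirm the client can reach the cluster; prints cluster name and version info
print(es.info())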

Create an index

An index is a logical namespace that holds a collection of documents. It defines the schema or structure of the documents it contains, including the fields and their data types. A mapping is the definition of how fields and data types are structured within the documents stored in an index: it is the schema or blueprint for the documents, specifying the characteristics of each field such as data type, indexing options, analysis settings, and more. If no mapping is provided, Opensearch infers one automatically. We will define a mapping for the metadata attributes. For strings there are two data type options: the "keyword" type is used for exact matching and filtering, and the "text" type is used for full-text search and analysis.

index_name = "metadata"
# Define mappings for the index
mapping = {
    "properties": {
        "filename": {"type": "text"},
        "run_number": {"type": "integer"},
        "total_event": {"type": "integer"},
        "collision_type": {"type": "keyword"},
        "data_type": {"type": "keyword"},
        "collision_energy": {"type": "integer"},
        "description": {"type": "text", "analyzer": "standard"},
    }
}
# define setting of index
index_body = {"settings": {"index": {"number_of_shards": 1}}, "mappings": mapping}
# create the index
es.indices.create(index=index_name, body=index_body)

# check if the index exists
exists = es.indices.exists(index=index_name)
if exists:
    print("Successfully created index %s" % index_name)
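
You can also read the mapping back from the cluster to confirm that each field was registered with the intended type:

# Retrieve the stored mapping and print the field definitions
current_mapping = es.indices.get_mapping(index=index_name)
print(current_mapping[index_name]["mappings"]["properties"])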

Index documents

We will index four documents into the "metadata" index. Each document represents a dataset with various fields such as filename, run number, total events, collision type, data type, collision energy, and description.

The _op_type parameter in the bulk indexing operation specifies the operation type for each document. In this case we use "create", which creates a new document only if one with the same ID does not already exist; if it does, the "create" operation fails. The other common operation type is "index", which overwrites the document if it already exists.

document1 = {
    "filename": "expx.myfile1.root",
    "run_number": 100,
    "total_event": 1112,
    "collision_type": "pp",
    "data_type": "data",
    "collision_energy": 250,
    "description": "This file is produced with L1 and L2 trigger.",
}
document2 = {
    "filename": "expx.myfile2.root",
    "run_number": 55,
    "total_event": 999,
    "collision_type": "pPb",
    "data_type": "mc",
    "collision_energy": 100,
    "description": "This file is produced without beam background.",
}
document3 = {
    "filename": "expx.myfile3.root",
    "run_number": 120,
    "total_event": 200,
    "collision_type": "PbPb",
    "data_type": "data",
    "collision_energy": 150,
    "description": "This file is produced without cherenkov detector",
}
document4 = {
    "filename": "expx.myfile4.root",
    "run_number": 360,
    "total_event": 1050,
    "collision_type": "pPb",
    "data_type": "mc",
    "collision_energy": 50,
    "description": "This file is produced with beam background",
}
documents = [document1, document2, document3, document4]
print("Total number of documents to be indexed: %s" % str(len(documents)))

There are two ways to index documents: one at a time, or in bulk.

Synchronous indexing

for document in documents:
    _id = document["filename"]
    res = es.index(index=index_name, id=_id, body=document, op_type="create")
    print(res)
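
Because we used op_type="create", running the cell above a second time fails, since documents with those IDs already exist. A minimal sketch of handling that case, assuming opensearchpy's ConflictError exception:

from opensearchpy.exceptions import ConflictError

for document in documents:
    _id = document["filename"]
    try:
        es.index(index=index_name, id=_id, body=document, op_type="create")
    except ConflictError:
        # "create" refuses to overwrite a document that already has this ID
        print("Document %s already exists, skipping" % _id)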

Bulk Indexing

The bulk function from the OpenSearch Python client is used to perform bulk indexing operations, which is more efficient than indexing documents one by one.

from opensearchpy import helpers

actions = []
duplicates = []
for document in documents:
    _id = document["filename"]

    # Check if document exists already
    if es.exists(index=index_name, id=_id):
        duplicates.append(document)
    else:
        actions.append(
            {
                "_index": index_name,
                "_id": _id,
                "_source": document,
                "_op_type": "create",
            }
        )

res = helpers.bulk(es, actions)
print("Total number of successfully indexed documents: %s" % (str(res[0])))
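
Newly indexed documents become searchable only after the index refreshes (which normally happens about once per second). To query right away and confirm how many documents the index holds, you can trigger a refresh and count the documents:

# Force a refresh so the new documents are searchable immediately, then count them
es.indices.refresh(index=index_name)
print(es.count(index=index_name)["count"])  # 4 if all four documents were indexed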

Search for documents

Term Level Query

A term-level query is a type of query used to search for documents that contain an exact term or value in a specific field. It can be applied to keyword and integer data types. This lets you search for documents by a single field (i.e. by a single metadata attribute in our case).

Term Query

Query structure looks like:

{
    "query": {"term": {<field>: <value>}}
}
search_query = {"query": {"term": {"collision_type": "pp"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])
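
The search response also carries the total number of matches, which can be handy to check before looping over the hits:

# The "hits" section contains both the matching documents and the total count
print(search_results["hits"]["total"]["value"])  # number of documents matching the term query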

Search for the filename of documents with data_type mc.

Retrieve and display filename

Solution

search_query = {"query": {"term": {"data_type": "mc"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
        print(hit["_source"]["filename"])
expx.myfile2.root
expx.myfile4.root

Range Query

It is also a term-level query, where we can apply a range of values to a field. The query structure looks like:

{
  "query": {
    "range": {
      <field>: {
        "gte": <lower_value>,
        "lte": <upper_value>
      }
    }
  }
}

Here we have a choice of operators: gte for greater than or equal to, gt for greater than, lte for less than or equal to, and lt for less than. Let's get the documents with run_number between 60 and 150, both inclusive.

search_query = {"query": {"range": {"run_number": {"gte": 60, "lte": 150}}}}

search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of all documents whose collision_energy ranges from 100 to 200 (both exclusive).

Retrieve and display filename with range query

Solution

search_query = {
   "query": {
       "range": {
            "collision_energy": {
                "gt": 100,
                "lt": 200
            }
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
       print(hit["_source"]["filename"])
expx.myfile3.root

Prefix Query

Another term-level query is the prefix query. As the name suggests, it searches for terms with a specific prefix. The format of this query is:

{
  "query": {
    "prefix": {
      <field>: {
        "value": <prefix>
        }
    }
  }
}

Let's get the documents whose collision_type has the prefix "p".

search_query = {"query": {"prefix": {"collision_type": {"value": "p"}}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Compound Query

There are different subcategories of compound queries, but here we will focus on the boolean (bool) query. These queries allow you to combine multiple conditions to filter and retrieve documents that meet specific criteria.

Must Query

The must query combines multiple conditions and finds documents that match all specified criteria; it is equivalent to the AND operator. Let's get the documents with collision_type pp and data_type data:

search_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"collision_type": "pp"}},
                {"term": {"data_type": "data"}},
            ]
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of documents with data_type data and collision_energy 150.

Retrieve and display filename

Solution

search_query = {
   "query": {
       "bool": {
           "must": [
               { "term": {"data_type": "data"} },
               { "term": { "collision_energy": 150 } }
           ]
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
       print(hit["_source"]["filename"])
expx.myfile3.root

Should Query

The should query searches for documents that match any of the specified conditions; it is equivalent to the OR operator. Let's get the documents whose collision_type is pp or whose collision_energy is 1127:

search_query = {
    "query": {
        "bool": {
            "should": [
                {"term": {"collision_type": "pp"}},
                {"term": {"collision_energy": 1127}},
            ]
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of documents with run_number 55 or collision_energy 150.

Retrieve and display filename

Solution

search_query = {
   "query": {
       "bool": {
           "should": [
               { "term": {"run_number": 55} },
               { "term": { "collision_energy": 150 } }
           ]
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
        print(hit["_source"]["filename"])
expx.myfile2.root
expx.myfile3.root

Must Not Query

The must_not query excludes documents that match the specified condition; it is equivalent to the NOT operator. Let's get the documents that do not have collision_energy 250:

search_query = {"query": {"bool": {"must_not": [{"term": {"collision_energy": 250}}]}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of all documents whose run_number is not 55.

Retrieve and display filename

Solution

search_query = {
   "query": {
       "bool": {
           "must_not": [
               { "term": {"run_number": 55} }
           ]
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
        print(hit["_source"]["filename"])
expx.myfile1.root
expx.myfile3.root
expx.myfile4.root

Search for the filename of all documents that must have total_event greater than 200 and run_number greater than 50, should have collision_type PbPb, and must not have collision_energy 150.

Retrieve and display filename combining must, should and must_not queries.

Solution

search_query = {
   "query": {
       "bool": {
           "must": [
               { "range": { "total_event": { "gt": 200 } } },
               { "range": { "run_number": { "gt": 50 } } }
           ],
           "should": {
               "term": { "collision_type": "PbPb" }
           },
           "must_not": {
               "term": { "collision_energy": 150 }
           }
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["filename"])
expx.myfile1.root
expx.myfile2.root
expx.myfile4.root

Update a document by filename

Let's update a field of a document. We can update a specific document by its document ID (_id). The following updates the "data_type" field of the document with ID "expx.myfile1.root" to "deriv".

_id = "expx.myfile1.root"
data = {"data_type": "deriv"}
es.update(index=index_name, id=_id, body={"doc": data})
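
To confirm the change, you can fetch the document back by its ID and inspect the field:

# Retrieve the updated document and check the modified field
doc = es.get(index=index_name, id=_id)
print(doc["_source"]["data_type"])  # now prints "deriv"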

Delete a document by filename (-> _id)

Let's delete a document by its document ID (which is the filename in our case):

_id = "expx.myfile1.root"
es.delete(index=index_name, id=_id)
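
You can verify the deletion by checking whether a document with that ID still exists:

# exists() returns False once the document has been removed
print(es.exists(index=index_name, id=_id))  # False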

Key Points

  • Opensearch queries can be used to search, update, and delete documents in an Opensearch index.

  • Indices in Opensearch define the structure and mapping of documents.

  • Term queries match exact terms or values in a specific field.

  • Range queries match documents within a specified range of values.

  • Compound queries combine multiple conditions using boolean logic.


Opensearch Text Based Queries

Overview

Teaching: x min
Exercises: 2 min
Questions
  • How to perform text-based search in Opensearch?

  • What are the ways to do text-based search in Opensearch?

Objectives
  • Understand the fundamental query types in Opensearch for text-based searches.

  • Learn how to construct and execute various text-based queries using Opensearch.

Text Based Queries

Let's first understand why Opensearch has advantages over MySQL (SQL) for full-text search.

MySQL/SQL Limitations:

  • Free-text search typically relies on LIKE '%...%' pattern matching, which cannot use ordinary indexes and scans large portions of the table as data grows.

  • Full-text features such as fuzzy matching, relevance tuning, and text analysis are limited compared to a dedicated search engine.

OpenSearch (NoSQL) Advantages:

  • Text fields are analyzed and stored in an inverted index, so full-text searches remain fast even on large volumes of data.

  • Queries natively support relevance scoring as well as phrase, wildcard, prefix, and fuzzy matching, and the engine scales horizontally across nodes.

Opensearch is a powerful search and analytics engine that excels in handling text-based queries efficiently. Understanding how to construct and utilize text-based queries in Opensearch is crucial for effective data retrieval and analysis.

This section will delve into the concepts and techniques involved in Opensearch text-based queries.


Match Phrase Query

Search for documents containing an exact phrase in the description field. Structure of query is:

{
    "query": {
        "match_phrase": {
            "<field>": "<phrase>"
        }
    }
}

Let's search for document(s) with the exact phrase "without beam background" in the description field.

search_query = {"query": {"match_phrase": {"description": "without beam background"}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for documents with the exact phrase "without cherenkov detector".

Retrieve documents with match phrase query.

Solution

search_query = {
    "query": {
       "match_phrase": {
            "description": "without cherenkov detector"
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])
{'filename': 'expx.myfile3.root', 'run_number': 120, 'total_event': 200, 'collision_type': 'PbPb', 'data_type': 'data', 'collision_energy': 150, 'description': 'This file is produced without cherenkov detector'}

Match query

The match query is a basic query type in Opensearch used to search for documents containing specific words in a specified field, such as the description field. The structure of the query is:

{
    "query": {
        "match": {
            "<field>": "<text>"
        }
    }
}

Let's search for documents containing the words "without" or "beam" in the description field. By default, the match query returns documents containing either of the words.

search_query = {"query": {"match": {"description": "without beam"}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

You should see three documents, with filenames expx.myfile2.root, expx.myfile3.root and expx.myfile4.root, as these documents contain either the word "without" or the word "beam". You can also add the operator "and" to the query so that all the words must be present in the field:

{
    "query": {
        "match": {
            "<field>": {
                "query": "<text>",
                "operator": "and"
            }
        }
    }
}

For example, to get the documents containing both the words "beam" and "cherenkov", you would do the following (note that in our sample documents no description contains both words, so this query returns no hits):

search_query = {
    "query": {"match": {"description": {"query": "beam cherenkov", "operator": "and"}}}
}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for documents with the words "cherenkov" or "trigger".

Retrieve documents with a match query.

Solution

search_query = {
    "query": {
       "match": {
            "description": "cherenkov trigger"
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])
{'filename': 'expx.myfile3.root', 'run_number': 120, 'total_event': 200, 'collision_type': 'PbPb', 'data_type': 'data', 'collision_energy': 150, 'description': 'This file is produced without cherenkov detector'}
{'filename': 'expx.myfile1.root', 'run_number': 100, 'total_event': 1112, 'collision_type': 'pp', 'data_type': 'data', 'collision_energy': 250, 'description': 'This file is produced with L1 and L2 trigger.'}

query_string

Wildcard

Wildcard queries are used to search for documents based on patterns or partial matches within a field. In the example below, a query_string query with a wildcard is used to search for documents where the description field contains a term starting with "L", i.e. L1 or L2.

search_query = {
    "query": {"query_string": {"default_field": "description", "query": "L*"}}
}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["description"])

You should see a single document with filename expx.myfile1.root.

Prefix Query:

Prefix queries are used to search for documents where a specified field starts with a specific prefix. For example, the prefix query below searches for documents where the filename field starts with “expx.”:

search_query = {"query": {"prefix": {"filename": "expx."}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["filename"])

This query matches the documents whose filename starts with "expx.", i.e. the example files indexed earlier.

Fuzzy Query:

Fuzzy queries are used to find documents with terms similar to a specified term, allowing for some degree of error or variation. In the example below, a fuzzy query is used to search for documents with terms similar to "producd" (a misspelling of "produced") in the description field:

search_query = {"query": {"fuzzy": {"description": "producd"}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["description"])

This query matches the documents containing "produced", because "producd" is within the allowed edit distance of that term; how much variation is tolerated is controlled by the fuzziness parameter.

Key Points

  • Opensearch supports a range of text-based query types, including match, match_phrase, wildcard, prefix, and fuzzy queries.

  • Each query type has specific use cases and parameters that can be customized for tailored search results.

  • Efficient utilization of text-based queries in Opensearch can significantly enhance data retrieval and analysis capabilities.