
Introduction to Databases for HEP



Introduction

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • What is a database management system and why is it a good idea to use one?

  • What are the differences between relational and non-relational databases?

Objectives
  • Understand the concepts of a database management system

  • Learn about common database management systems used in HEP and Nuclear Physics

What is a database?

A database is a collection of data stored in a computer system.

The data is typically organized to model relevant characteristics of the information contained in the database.

Technically speaking, a database can be something as simple as a text file with data stored in a structured format like comma-separated values. For example, you may want to track the files being produced by an experiment in a text file like this:

Filename,          size,  date,       number of events, energy
disk01/file1.root, 1.2GB, 2023-10-01, 12000,            13TeV
disk03/file2.root, 1.3GB, 2023-10-02, 14000,            13TeV
...

This is usually known as metadata: data about data. Having metadata catalogs is a common practice in HEP and Nuclear Physics.

In practice, this is not a reliable or efficient way to store (meta)data. For example, if you want to know how many files were produced on a given date, you have to read the entire file and count the lines that match the date you are interested in. What if you have millions of files? That would be very slow! Similarly, if you want to know the total size of the files produced on a given date, you have to read the entire file and sum the sizes of the matching files. Things get much worse when updates are needed, as the information can easily get corrupted. And what if multiple people need to constantly enter information? What a nightmare to keep everything in sync!

Data organization

When developing software, how we organize data is a critical decision that affects the performance and scalability of the application.

In practice, databases are managed by software called database management systems (DBMS). A DBMS is a software application that interacts with the user, other applications, and the database itself to capture and analyze data. A general-purpose DBMS is specialized to handle databases in a scalable and reliable way, provides mechanisms for concurrent data access, and ensures that the data is not corrupted.

From now on, in this module we will refer to a “database” as an alias for a DBMS.

Relational Database

A relational database is a type of DBMS that organizes data into tables with rows and columns. Following our example above, it would look like this:

+----+------------+----------+-----------+------------+-------------------+--------+
| id | filename   | disk     | size      | date       | number of events  | energy |
+----+------------+----------+-----------+------------+-------------------+--------+
| 1  | file1.root | disk01   | 1.2GB     | 2023-10-01 | 12000             | 13TeV  |
| 2  | file2.root | disk03   | 1.3GB     | 2023-10-02 | 14000             | 13TeV  |
| 3  | file3.root | disk01   | 1.1GB     | 2023-10-02 | 13000             | 13TeV  |
...

In a relational database, data manipulation is performed via Structured Query Language (SQL). You can use SQL to create, read, update, and delete records.

In computer programming, these four functions are referred to by the acronym CRUD.
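As a quick illustration, here is how the four CRUD operations map to SQL statements, using a hypothetical files table with columns like those in the example above (the exact syntax is covered in the next episodes):

-- Create: add a new record
INSERT INTO files (filename, disk, size, date) VALUES ('file4.root', 'disk02', '1.4GB', '2023-10-03');
-- Read: retrieve records
SELECT filename, size FROM files WHERE date = '2023-10-02';
-- Update: modify an existing record
UPDATE files SET size = '1.5GB' WHERE filename = 'file4.root';
-- Delete: remove a record
DELETE FROM files WHERE filename = 'file4.root';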

In SQL, an attribute usually refers to a column or field within a table. For example, in the table above, some of the attributes are filename, size, number of events, and energy.

The database can have multiple tables, and the tables can be related to each other. For example, you can have a table with the information about the files and another table with the information about the location where the files are stored:

+----+----------+-----------+--------------------+
| id | disk     | capacity  | path               |
+----+----------+-----------+--------------------+
| 1  | disk01   | 100TB     | /fs01/exp/data01   |
| 2  | disk02   | 120TB     | /fs01/exp/data02   |
| 3  | disk03   | 150TB     | /fs01/exp/data03   |
...

We will look in detail how to create and manage tables in a relational database in the next episodes.

Examples of relational databases include MySQL, PostgreSQL, Oracle Database, and SQLite. In this training, we will focus on MySQL and a brief introduction to SQLite.

The performance of a relational database is usually very good for most use cases, and it is the most common type of database used in HEP and Nuclear Physics.

NoSQL Databases

There are use cases where a relational database is not the best choice, for example when the data is unstructured or semi-structured.

Non-relational databases, also known as NoSQL databases, do not use the tabular schema of rows and columns found in relational databases. Instead, they use a storage model optimized for the specific requirements of the type of data being stored. In other words, they don't have a fixed set of attributes that each record must have, and the schema is more flexible.
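For illustration, two records in a document-oriented NoSQL database might look like this (a purely hypothetical example; note that the second document has fields the first does not):

{ "filename": "file1.root", "size": "1.2GB", "energy": "13TeV" }
{ "filename": "file2.root", "size": "1.3GB", "energy": "13TeV", "trigger": "HLT_mu20", "comment": "reprocessed" }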

Relational vs NoSQL databases

NoSQL databases have become very popular due to the flexibility they offer. However, many-to-one and many-to-many relationships are more easily represented in a relational database. Which one to choose depends on the specific requirements and must be considered carefully.

In the last part of this training module, we will discuss NoSQL databases in more detail and take a quick look at OpenSearch, a distributed search engine that can be used as a NoSQL database.

Key Points

  • A database management system (DBMS) specializes in managing databases in a scalable and reliable way.

  • A relational database organizes data into tables with rows and columns.

  • A non-relational database does not use the tabular schema of columns found in relational databases, having a more flexible schema.


MySQL Basics

Overview

Teaching: x min
Exercises: x min
Questions
  • What are the core SQL commands you need to know to interact with a MySQL database?

  • How do you create, update, and delete records in a MySQL database?

Objectives
  • Understand the basic SQL commands used in MySQL.

  • Learn how to create a database and tables in MySQL.

  • Practice inserting, updating, and deleting records in a MySQL table.

SQL Commands

Here are some of the core SQL commands that you'll use in MySQL: CREATE DATABASE, CREATE TABLE, INSERT INTO, SELECT, UPDATE, and DELETE. We will go through each of them in this episode.

Setting up for SQL commands

In the terminal, run the following command:

docker exec -it metadata bash -c "mysql -uroot -pmypassword"

Then you will see the MySQL command prompt, mysql>. All the SQL commands below have to be typed at this prompt.

Create a database

We will first create a database named metadata in our MySQL server.

CREATE DATABASE metadata;

Case sensitivity in MySQL

In MySQL, SQL commands are case-insensitive. In other words, CREATE DATABASE metadata is the same as create database metadata.

However, it is a common practice to write SQL commands in uppercase to distinguish them from table and column names, which are case-sensitive.

You can list all the databases by using the command

SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metadata           |
| mysql              |
| performance_schema |
| sys                |
+--------------------+

It shows that the database "metadata" has been created.

What are those other databases?

By default, MySQL comes with several databases that serve specific purposes. We will not go into details of each database, but here is a brief overview:

  • mysql: This database contains user account information and privileges
  • information_schema: Contains metadata about all the other databases in the MySQL server
  • performance_schema: Contains performance metrics for the MySQL server
  • sys: Used for tuning and diagnosis use cases

You can read more about these databases in the MySQL documentation. For now it is not necessary to understand them in detail.

To work with a specific database, you can use the USE command. For instance, to select the “metadata” database:

USE metadata;
Database changed

Creating a table

In SQL, the CREATE TABLE command is used to define a new table within a database. When creating a table, you specify the structure of the table, including the names of its columns, the data types for each column, and any constraints that ensure data integrity.

Here's a breakdown of the components of the CREATE TABLE <table_name> (<column_name> <data_type> <constraints>) command: the table name, the name of each column, the data type of each column (such as INT, VARCHAR, or TEXT), and optional constraints on the column values (such as NOT NULL, UNIQUE, or PRIMARY KEY).

By combining these elements, you define the table’s structure and ensure data consistency and uniqueness. This structured approach to table creation is fundamental to relational databases and is a key part of database design. Keep in mind a database can have multiple tables.

Now, let's proceed with creating our "dataset" table in the "metadata" database.

CREATE TABLE dataset (
    id INT AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(255) NOT NULL UNIQUE,
    run_number INT NOT NULL,
    total_event INT NOT NULL,
    collision_type TEXT,
    data_type TEXT,
    collision_energy INT NOT NULL
);

Notice we have defined id as the primary key with the AUTO_INCREMENT attribute. This means that the id column will automatically generate a unique value for each new record added to the table.

The PRIMARY KEY means that the id column will be the unique identifier for each record in the table. Additionally, a primary key enhances the performance of cross-referencing between tables (we will look at this in more detail in the next episode).

You can see the table and corresponding columns by using the command

DESCRIBE dataset;
+------------------+--------------+------+-----+---------+----------------+
| Field            | Type         | Null | Key | Default | Extra          |
+------------------+--------------+------+-----+---------+----------------+
| id               | int          | NO   | PRI | NULL    | auto_increment |
| filename         | varchar(255) | NO   | UNI | NULL    |                |
| run_number       | int          | NO   |     | NULL    |                |
| total_event      | int          | NO   |     | NULL    |                |
| collision_type   | text         | YES  |     | NULL    |                |
| data_type        | text         | YES  |     | NULL    |                |
| collision_energy | int          | NO   |     | NULL    |                |
+------------------+--------------+------+-----+---------+----------------+

INSERT record into table

You can use the INSERT INTO command to add records to a table. This command has the structure INSERT INTO <table_name> (<column_names>) VALUES (<column_values>). Here's an example of inserting data into the "dataset" table:

INSERT INTO dataset (filename, run_number, total_event, collision_type, data_type, collision_energy)
VALUES ("expx.myfile1.root", 100, 1112, "pp", "data", 11275);
INSERT INTO dataset (filename, run_number, total_event, collision_type, data_type, collision_energy)
VALUES ("expx.myfile2.root", 55, 999, "pPb", "mc", 1127);

Search record in the table

The SELECT command allows you to retrieve records from a table. To retrieve all records from the “dataset” table, you can use:

SELECT * FROM dataset;
+----+-------------------+------------+-------------+----------------+-----------+------------------+
| id | filename          | run_number | total_event | collision_type | data_type | collision_energy |
+----+-------------------+------------+-------------+----------------+-----------+------------------+
|  1 | expx.myfile1.root |        100 |        1112 | pp             | data      |            11275 |
|  2 | expx.myfile2.root |         55 |         999 | pPb            | mc        |             1127 |
+----+-------------------+------------+-------------+----------------+-----------+------------------+

You can select specific columns by listing them after the SELECT statement:

SELECT filename FROM dataset;
+-------------------+
| filename          |
+-------------------+
| expx.myfile1.root |
| expx.myfile2.root |
+-------------------+

2 rows in set (0.00 sec)

Search with some condition

To filter data based on certain conditions, you can use the WHERE clause. This allows you to filter rows based on conditions that you specify. For example, to select filenames where the “collision_type” is ‘pp’:

SELECT filename FROM dataset WHERE collision_type='pp';

In addition you can use logical operators such as AND and OR to combine multiple conditions in the WHERE statement.

SELECT filename FROM dataset WHERE run_number > 50 AND collision_type='pp';
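For example, to select files recorded in either proton-proton or proton-lead collisions (an illustrative variation on the queries above):

SELECT filename FROM dataset WHERE collision_type='pp' OR collision_type='pPb';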

SELECT on different condition

Get the filenames where total_event > 1000 and data_type is "data".

Solution

SELECT filename FROM dataset WHERE total_event > 1000 AND data_type='data';
+-------------------+
| filename          |
+-------------------+
| expx.myfile1.root |
+-------------------+
1 row in set (0.00 sec)

UPDATE

The UPDATE command is used to make changes to existing records.

For example, if you want to update the “collision_type” and “collision_energy” for a specific record, you can use:

UPDATE dataset
SET collision_type = 'PbPb', collision_energy = 300
WHERE filename = 'expx.myfile1.root';

Update on a condition

Update the total_event of file "expx.myfile2.root" to 800.

Solution

UPDATE dataset
SET total_event = 800
WHERE filename = 'expx.myfile2.root';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

DELETE

The DELETE command is used to remove a row from a table.

For example, to delete a record with a specific filename you can use:

DELETE FROM dataset
WHERE filename = 'expx.myfile2.root';

Be careful with UPDATE and DELETE without WHERE!

Very important: if you omit the WHERE clause in an UPDATE or DELETE statement, you will update or delete ALL records in the table!

For example, the following command

DELETE FROM dataset;

will delete all records in the dataset table.

This can have unintended consequences, so be cautious when using these commands.
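One way to protect yourself is to wrap the change in a transaction, so you can inspect its effect before making it permanent. A minimal sketch (assuming an InnoDB table, which is the MySQL default):

START TRANSACTION;
DELETE FROM dataset WHERE filename = 'expx.myfile2.root';
-- inspect the result, e.g. with SELECT * FROM dataset;
ROLLBACK;   -- undo the change, or run COMMIT; instead to make it permanent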

Key Points

  • SQL commands are the building blocks for interacting with a MySQL database.

  • Creating a database and tables is the first step in structuring your data.

  • Inserting, updating, and deleting records allows you to manage data effectively.


Coffee Break

Overview

Teaching: 0 min
Exercises: 15 min
Questions
Objectives


Key Points


SQLAlchemy and MySQL: Exercises

Overview

Teaching: 5 min
Exercises: 6 min
Questions
  • How to perform CRUD operations using SQLAlchemy?

  • How to query and filter records in SQLAlchemy?

Objectives
  • Practice inserting records into a MySQL database using SQLAlchemy.

  • Perform queries and filtering on the dataset table.

  • Update and delete records in the dataset table.

Why Python with SQL?

SQL is a language perfectly suited to specifying database operations in a declarative, record-centred way, but this very design makes it unfamiliar to programmers used to writing object-oriented, imperative, or functional code. Worse, good performance in SQL can depend on the specifics of the database engine we're interacting with: MySQL, PostgreSQL and so on all have their own ideal performance approaches and extensions. Finally, it's easy to write inadvertently insecure SQL queries by forgetting to sanitise and quote inputs. As a consequence, it's much nicer, and safer, for us to use an interface that allows us to write object-oriented, functional implementations of the data manipulation we want, and have the interface "convert" this into efficient SQL queries behind the scenes. As a bonus, the abstraction layer can easily be switched to point at a different backend database with a simple configuration change, without us needing to tweak our (in this case, Python) code itself.

SQLAlchemy

SQLAlchemy is a powerful library that provides a high-level interface for interacting with databases, making database operations more Pythonic and abstracting SQL commands.

Let's create a new database for this chapter.

So as not to collide with the database created in the previous chapter, let's create a different database for this one, called metadata2. In another terminal, run the following command:

docker exec -it metadata bash -c "mysql -uroot -pmypassword"

Then you will see the MySQL command prompt, mysql>. Use the following command to create a database named metadata2:

CREATE DATABASE metadata2;

Installation

Make sure you have Python on your system. Let's create a virtual environment and install SQLAlchemy. First, create a directory to work in:

mkdir myhsfwork && cd myhsfwork

Create a virtual environment:

python -m venv venv

Activate the virtual environment:

source venv/bin/activate

Now install SQLAlchemy and the other dependencies:

pip install sqlalchemy
pip install cryptography
pip install pymysql
pip install jupyter

Bring up Jupyter notebook

For ease of testing, we will use a Jupyter notebook to run the subsequent commands. In your virtual environment, run the following command:

jupyter-notebook

Now create a new Python notebook and use it for the subsequent commands.

Setting Up the Database Connection:

SQLAlchemy facilitates database connections in Python by using an Engine, which acts as the interface between the application and the database. The Engine manages database communication, executing SQL commands, and transaction handling. It requires a connection URL to specify the database type, credentials, and other connection details, which allows SQLAlchemy to translate its functions to whatever database configuration we want to interact with. Sessions, managed by sessionmaker, handle interactions between the application and the database, allowing for transactions, queries, and data manipulation in a structured manner.

Let’s import necessary things.

from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy import Column, Integer, String, Text

Now let's create a URL for our database connection. We need the following URL components: the dialect (mysql), the driver (pymysql), the username, the password, the hostname, the port, and the database name.

So, the URL structure is: dialect+driver://username:password@hostname:port/databaseName or, if you use Apptainer with a socket file: dialect+driver://username:password@localhost/databaseName?unix_socket=filePath

And we create an engine using this db_url.

# Define the MySQL database connection URL
db_url = "mysql+pymysql://root:mypassword@localhost:3306/metadata2"
# if using Apptainer uncomment the next line to define the URL using the socket file:
# db_url = "mysql+pymysql://root:mypassword@localhost/metadata2?unix_socket=/var/run/mysqld/mysql.sock"

# Create an SQLAlchemy engine
engine = create_engine(db_url)

Session for each connection

sessionmaker is a factory function that creates a session factory in SQLAlchemy. Sessions are used to interact with a database in SQLAlchemy, providing a way to persist, retrieve, and manipulate data. sessionmaker generates a configurable factory for creating sessions, allowing you to customize settings such as autocommit behavior, autoflush, and more. Sessions obtained from the session factory represent a single "unit of work" with the database, encapsulating a series of operations that should be treated atomically. Let's open a session that will be used to perform database operations from Python, using the Engine that we created.

Session = sessionmaker(bind=engine)
session = Session()

Define a base for declarative class

Declarative Base is a factory function from SQLAlchemy that generates a base class for declarative data models. This base class allows you to define database tables as Python classes, making it easier to define models by mapping Python classes to database tables. The classes created with declarative_base() typically inherit from this base class, and they include table schema definitions as class attributes.

Base = declarative_base()

Define and Create a Table

Now we will define the table named dataset2.

Define the dataset table

Here we have a column named id, defined as an Integer type, which serves as the primary key for the table. It auto-increments, meaning its value automatically increases for each new row added to the table. We set the filename column as unique so that there is no duplication of filenames in the table. If the nullable option is set to False, the column must have a value.

class Dataset(Base):
    __tablename__ = "dataset2"

    id = Column(Integer, primary_key=True, autoincrement=True)
    filename = Column(String(255), unique=True, nullable=False)
    run_number = Column(Integer, nullable=False)
    total_event = Column(Integer, nullable=False)
    collision_type = Column(Text)
    data_type = Column(Text)
    collision_energy = Column(Integer, nullable=False)

Create the table

The following code, Base.metadata.create_all(engine), is an SQLAlchemy command that instructs the engine to create database tables based on the defined models (such as Dataset) mapped to SQLAlchemy classes (derived from Base) within the application. This command generates the corresponding SQL statements to create the tables in the specified database (referenced by engine) based on the model definitions provided in the code.

Base.metadata.create_all(engine)

Insert record

Let's insert two records into the table.

dataset1 = Dataset(
    filename="expx.myfile1.root",
    run_number=100,
    total_event=1112,
    collision_type="pp",
    data_type="data",
    collision_energy=11275,
)
dataset2 = Dataset(
    filename="expx.myfile2.root",
    run_number=55,
    total_event=999,
    collision_type="pPb",
    data_type="mc",
    collision_energy=1127,
)

session.add(dataset1)
session.add(dataset2)
session.commit()

session.commit() is a command that effectively saves the changes made within the session to the database. This action persists the changes permanently in the database, making the additions to the table permanent and visible for subsequent transactions.
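If a commit fails, for example because a filename already exists and violates the unique constraint, you can discard the pending changes with session.rollback() so the session stays usable. A minimal sketch, deliberately trying to re-insert a filename that is already in the table:

duplicate = Dataset(
    filename="expx.myfile1.root",  # already exists, so the commit below will fail
    run_number=1,
    total_event=1,
    collision_type="pp",
    data_type="data",
    collision_energy=1,
)
try:
    session.add(duplicate)
    session.commit()
except Exception:
    session.rollback()  # discard the failed insert so the session can be reused
    print("Insert failed, rolled back")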

Search the database.

session.query() is used to create a query object that represents a request for data from the database. In this case, session.query(Dataset.filename) selects the filename column from the Dataset table. The .all() method executes the query and retrieves all the values from the filename column, returning a list of results containing these values from the database; it's like "collect()" on iterator types, if you're familiar with those.

# Query the filename column from the dataset table
results = session.query(Dataset.filename).all()

# Print the results
for result in results:
    print(result.filename)

Search on different column

Retrieve and display all collision_type values.

Solution

results = session.query(Dataset.collision_type).all()
for result in results:
    print(result.collision_type)
pp
pPb

Search the database with a condition.

In SQLAlchemy, the filter() method is used within a query() to add conditions or criteria to the query. It narrows down the selection by applying specific constraints based on the given criteria.

results1 = session.query(Dataset.filename).filter(Dataset.collision_type == "pp").all()
# Print the results for the first query
print("Filenames where collision_type is 'pp':")
for result in results1:
    print(result.filename)

results2 = (
    session.query(Dataset.filename)
    .filter(Dataset.run_number > 50, Dataset.collision_type == "pp")
    .all()
)
# Print the results for the second query
print("\nFilenames where run_number > 50 and collision_type is 'pp':")
for result in results2:
    print(result.filename)

Search using OR statement

Retrieve and display filenames with data_type "mc" or collision_energy greater than 1000.

Solution

results = session.query(Dataset.filename).filter((Dataset.data_type == "mc") | (Dataset.collision_energy > 1000)).all()
for result in results:
    print(result.filename)
expx.myfile1.root
expx.myfile2.root

Update the database

To update a record in a table, you begin by querying for the specific record or records you want to update using query(). The filter() method is used to specify the conditions for the selection. Once you have the record(s) to update, modify the attributes as needed. Finally, calling session.commit() saves the changes made to the database. This ensures that the modifications are persisted permanently.

# Update a record in the table
record_to_update = (
    session.query(Dataset).filter(Dataset.filename == "expx.myfile1.root").first()
)
if record_to_update:
    record_to_update.collision_type = "PbPb"
    record_to_update.collision_energy = 300
    session.commit()

Update

Update run_number to 1000 for the record with data_type "mc".

Solution

record_to_update = session.query(Dataset).filter(Dataset.data_type == "mc").first()
if record_to_update:
    record_to_update.run_number = 1000
    session.commit()


Delete a record

Deleting works basically the same way: we first get the record to delete using query() and filter(), then delete the record and commit to make the change permanent.

# Delete a record from the table
record_to_delete = (
    session.query(Dataset).filter(Dataset.filename == "expx.myfile2.root").first()
)
if record_to_delete:
    session.delete(record_to_delete)
    session.commit()

Close the session

It’s essential to close the session after you’ve finished working with it to release the resources associated with it, such as database connections and transactional resources. By closing the session, you ensure that these resources are properly released, preventing potential issues like resource leakage and maintaining good database connection management practices.

session.close()
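As an aside, with SQLAlchemy 1.4 or newer you can also use the session as a context manager, so it is closed automatically when the block exits. A minimal sketch:

with Session() as session:
    # the session is closed automatically at the end of the with block
    results = session.query(Dataset.filename).all()
    for result in results:
        print(result.filename)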

Key Points

  • CRUD operations in SQLAlchemy: Create, Read, Update, Delete.

  • Querying and filtering records based on specific conditions.


A bit more on queries

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • How to perform SQL Join?

  • How to make use of different SQL joins?

Objectives
  • To retrieve data from multiple tables simultaneously when the data is related.

SQL JOIN

In SQL, JOIN operations are used to combine rows from two or more tables based on a common column. Different types of JOINs can be used depending on your needs.

In this tutorial, we’ll explore various JOIN types using the dataset and experiments tables.

Note: If you haven’t created these tables in your database yet, please refer to the basic MySQL queries in the introduction tutorial to set them up.

Consider the following two tables (dataset and experiments) for our tutorial:

Example:

mysql> SELECT * FROM dataset;
+----+--------------------+------------+-------------+----------------+-----------+------------------+
| id | filename           | run_number | total_event | collision_type | data_type | collision_energy |
+----+--------------------+------------+-------------+----------------+-----------+------------------+
|  1 | lhcb.myfile.root   |        101 |        1234 | pp             | data      |            13400 |
|  2 | atlas.myfile.root  |        202 |        5678 | pp             | mc        |            13000 |
|  3 | cms.myfile.root    |        303 |        9101 | pp             | data      |            13600 |
|  4 | alice.myfile.root  |        404 |        1123 | Pb-Pb          | mc        |            55000 |
|  5 | belle2.myfile.root |        505 |        3141 | e⁺e⁻           | data      |            10500 |
|  6 | babar.myfile.root  |        606 |        7890 | e⁺e⁻           | data      |            10000 |
+----+--------------------+------------+-------------+----------------+-----------+------------------+
mysql> SELECT * FROM experiments;
+----+-----------------+------+
| id | experiment_name | year |
+----+-----------------+------+
|  1 | LHCb            | 2009 |
|  2 | ATLAS           | 2008 |
|  3 | CMS             | 2008 |
|  4 | ALICE           | 2010 |
|  5 | BELLE2          | 2018 |
+----+-----------------+------+

INNER JOIN

An INNER JOIN returns records that have matching values in both tables.

Example: We want to get the filename and experiment_name for each dataset if there’s a corresponding experiment.

We’ll join on the id column to find the matching records.

SELECT d.filename, e.experiment_name
FROM dataset d
INNER JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+
| filename           | experiment_name |
+--------------------+-----------------+
| lhcb.myfile.root   | LHCb            |
| atlas.myfile.root  | ATLAS           |
| cms.myfile.root    | CMS             |
| alice.myfile.root  | ALICE           |
| belle2.myfile.root | BELLE2          |
+--------------------+-----------------+

One more example:

SELECT d.filename, e.experiment_name, d.total_event, d.collision_type, d.data_type, d.collision_energy
FROM dataset d
INNER JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+-------------+----------------+-----------+------------------+
| filename           | experiment_name | total_event | collision_type | data_type | collision_energy |
+--------------------+-----------------+-------------+----------------+-----------+------------------+
| lhcb.myfile.root   | LHCb            |        1234 | pp             | data      |            13400 |
| atlas.myfile.root  | ATLAS           |        5678 | pp             | mc        |            13000 |
| cms.myfile.root    | CMS             |        9101 | pp             | data      |            13600 |
| alice.myfile.root  | ALICE           |        1123 | Pb-Pb          | mc        |            55000 |
| belle2.myfile.root | BELLE2          |        3141 | e⁺e⁻           | data      |            10500 |
+--------------------+-----------------+-------------+----------------+-----------+------------------+

Note: INNER JOIN returns only the rows that have matching values in both tables.

LEFT JOIN

A LEFT JOIN returns all records from the left table, and the matched records from the right table. The result is NULL on the right side when there is no match.

Example:

SELECT d.filename, e.experiment_name
FROM dataset d
LEFT JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+
| filename           | experiment_name |
+--------------------+-----------------+
| alice.myfile.root  | ALICE           |
| atlas.myfile.root  | ATLAS           |
| babar.myfile.root  | NULL            |
| belle2.myfile.root | BELLE2          |
| cms.myfile.root    | CMS             |
| lhcb.myfile.root   | LHCb            |
+--------------------+-----------------+

RIGHT JOIN

A RIGHT JOIN returns all records from the right table, and the matched records from the left table. The result is NULL from the left side when there is no match.

Example:

SELECT e.experiment_name, d.filename
FROM experiments e
RIGHT JOIN dataset d
ON d.id = e.id;
+-----------------+--------------------+
| experiment_name | filename           |
+-----------------+--------------------+
| ALICE           | alice.myfile.root  |
| ATLAS           | atlas.myfile.root  |
| NULL            | babar.myfile.root  |
| BELLE2          | belle2.myfile.root |
| CMS             | cms.myfile.root    |
| LHCb            | lhcb.myfile.root   |
+-----------------+--------------------+

FULL OUTER JOIN

A FULL OUTER JOIN returns all records when there is a match in either left or right table. This type of join is not directly supported in MySQL, but you can achieve it using UNION.

Example:

SELECT d.filename, e.experiment_name
FROM dataset d
LEFT JOIN experiments e
ON d.id = e.id
UNION
SELECT d.filename, e.experiment_name
FROM experiments e
LEFT JOIN dataset d
ON d.id = e.id;
+--------------------+-----------------+
| filename           | experiment_name |
+--------------------+-----------------+
| alice.myfile.root  | ALICE           |
| atlas.myfile.root  | ATLAS           |
| babar.myfile.root  | NULL            |
| belle2.myfile.root | BELLE2          |
| cms.myfile.root    | CMS             |
| lhcb.myfile.root   | LHCb            |
+--------------------+-----------------+

Task: Joining Tables with Updated Schema

Scenario: You have updated the experiments table by adding a new run_number column, which matches some of the run_number values in the dataset table. Use this new column to retrieve combined data from both tables: write an SQL query that retrieves the filename, total event count, and experiment name for each dataset. Ensure that the query matches the datasets to the corresponding experiments using the run_number column.

Hint: Use an INNER JOIN to combine the dataset and experiments tables based on the run_number column

ALTER TABLE experiments ADD COLUMN run_number INT;
UPDATE experiments SET run_number = 101 WHERE experiment_name = 'LHCb';
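One possible query, assuming you have also filled in run_number for the other experiments (only LHCb is set in the snippet above):

SELECT d.filename, d.total_event, e.experiment_name
FROM dataset d
INNER JOIN experiments e
ON d.run_number = e.run_number;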

CROSS JOIN

A CROSS JOIN generates the Cartesian product of two tables, meaning every row from the first table is paired with every row from the second table. This can be useful for creating all possible combinations of rows, especially when crafting complex queries.

In MySQL, JOIN, CROSS JOIN, and INNER JOIN can often be used interchangeably. However, in standard SQL, they serve different purposes: INNER JOIN requires an ON clause to specify how tables should be matched, whereas CROSS JOIN creates a Cartesian product without any matching conditions.

To see how a CROSS JOIN works, let’s create an additional table named additional_info in the same database.

mysql> SELECT * FROM additional_info;
+----+-----------+------------+
| id | info_name | run_number |
+----+-----------+------------+
|  1 | Info A    |        101 |
|  2 | Info B    |        202 |
|  3 | Info C    |        303 |
|  4 | Info D    |        404 |
|  5 | Info E    |        505 |
+----+-----------+------------+
mysql> SELECT e.experiment_name, i.run_number FROM experiments e CROSS JOIN additional_info i;
+-----------------+------------+
| experiment_name | run_number |
+-----------------+------------+
| BELLE2          |        101 |
| ALICE           |        101 |
| CMS             |        101 |
| ATLAS           |        101 |
| LHCb            |        101 |
| BELLE2          |        202 |
| ALICE           |        202 |
| CMS             |        202 |
| ATLAS           |        202 |
| LHCb            |        202 |
| BELLE2          |        303 |
| ALICE           |        303 |
| CMS             |        303 |
| ATLAS           |        303 |
| LHCb            |        303 |
| BELLE2          |        404 |
| ALICE           |        404 |
| CMS             |        404 |
| ATLAS           |        404 |
| LHCb            |        404 |
| BELLE2          |        505 |
| ALICE           |        505 |
| CMS             |        505 |
| ATLAS           |        505 |
| LHCb            |        505 |
+-----------------+------------+

Try Yourself!

The two queries below have the same output. Can you explain why?

SELECT *
FROM dataset d
LEFT JOIN (experiments e, additional_info a)
ON (e.run_number = d.run_number AND a.run_number = d.run_number);
SELECT *
FROM dataset d
LEFT JOIN (experiments e CROSS JOIN additional_info a)
ON (e.run_number = d.run_number AND a.run_number = d.run_number);

Enjoy Your SQL Journey!

Key Points

  • Performing different joins depending on the related column.


Conditions Database Example Using SQLAlchemy

Overview

Teaching: x min
Exercises: x min
Questions
Objectives


This lesson demonstrates how to create a simple Conditions Database using SQLAlchemy in Python. We will define three tables: GlobalTag, PayloadType, and PayloadIOV, and establish relationships between them. We will then add example data and query the database to retrieve specific entries.

Imports

First, we import the necessary modules from SQLAlchemy.

from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, sessionmaker, relationship

Define ORM Models

We define our ORM models: GlobalTag, PayloadType, and PayloadIOV, along with the necessary relationships; the class definitions appear below, under "Define Tables". We also import func and and_ from sqlalchemy.sql, which are needed for the query at the end of this episode.

from sqlalchemy.sql import func, and_

Database Connection

Next, we define the MySQL database connection URL and create an SQLAlchemy engine, session, and declarative base.

# Define the MySQL database connection URL
db_url = "mysql+pymysql://root:pass@localhost:3306/cdb"

# Create an SQLAlchemy engine
engine = create_engine(db_url)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()

Define Tables

We define all the tables in the database.

class GlobalTag(Base):
    __tablename__ = "GlobalTag"

    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(255), unique=True, nullable=False)

    # Relationship to PayloadType
    payload_types = relationship("PayloadType", back_populates="global_tag")


class PayloadType(Base):
    __tablename__ = "PayloadType"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(255), unique=True, nullable=False)
    global_tag_id = Column(Integer, ForeignKey("GlobalTag.id"))

    # Relationship to GlobalTag and PayloadIOV
    global_tag = relationship("GlobalTag", back_populates="payload_types")
    payload_iovs = relationship("PayloadIOV", back_populates="payload_type")


class PayloadIOV(Base):
    __tablename__ = "PayloadIOV"
    id = Column(Integer, primary_key=True, autoincrement=True)
    payload_url = Column(String(255), nullable=False)
    iov = Column(Integer, nullable=False)
    payload_type_id = Column(Integer, ForeignKey("PayloadType.id"))

    # Relationship to PayloadType
    payload_type = relationship("PayloadType", back_populates="payload_iovs")

Create Tables

We drop any existing tables and then create all the tables in the database, so the example can be rerun cleanly.

# Drop any existing tables so the example starts from a clean state
Base.metadata.drop_all(engine)
# Create all tables in the database
Base.metadata.create_all(engine)

Adding Example Data

We add some example data to the database for GlobalTag, PayloadType, and PayloadIOV.

# Adding example data
global_tag = GlobalTag(name="DetectorConfiguration")
session.add(global_tag)

daq_payload_type = PayloadType(name="DAQSettings", global_tag=global_tag)
dcs_payload_type = PayloadType(name="DCSSettings", global_tag=global_tag)

session.add(daq_payload_type)
session.add(dcs_payload_type)

daq_payload_iovs = [
    PayloadIOV(
        payload_url="http://example.com/daq1", iov=1, payload_type=daq_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/daq2", iov=2, payload_type=daq_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/daq3", iov=3, payload_type=daq_payload_type
    ),
]

dcs_payload_iovs = [
    PayloadIOV(
        payload_url="http://example.com/dcs1", iov=1, payload_type=dcs_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/dcs2", iov=2, payload_type=dcs_payload_type
    ),
    PayloadIOV(
        payload_url="http://example.com/dcs3", iov=3, payload_type=dcs_payload_type
    ),
]

session.add_all(daq_payload_iovs)
session.add_all(dcs_payload_iovs)
session.commit()

Query the Database

Finally, we query the database to get the latest PayloadIOV entries for each PayloadType for a specific GlobalTag and IOV.

# Query to get the last PayloadIOV entries for each PayloadType for a specific GlobalTag and IOV
requested_iov = 2
requested_gt = "DetectorConfiguration"

# Subquery to find the maximum IOV for each PayloadType
subquery = (
    session.query(PayloadIOV.payload_type_id, func.max(PayloadIOV.iov).label("max_iov"))
    .join(PayloadType, PayloadType.id == PayloadIOV.payload_type_id)
    .join(GlobalTag, GlobalTag.id == PayloadType.global_tag_id)
    .filter(GlobalTag.name == requested_gt, PayloadIOV.iov <= requested_iov)
    .group_by(PayloadIOV.payload_type_id)
    .subquery()
)

# Main query to get the latest PayloadIOV for each PayloadType
query = (
    session.query(
        GlobalTag.name.label("global_tag_name"),
        PayloadType.name.label("payload_type_name"),
        PayloadIOV.payload_url,
        subquery.c.max_iov,
    )
    .join(PayloadType, GlobalTag.id == PayloadType.global_tag_id)
    .join(subquery, subquery.c.payload_type_id == PayloadType.id)
    .join(
        PayloadIOV,
        and_(
            PayloadType.id == PayloadIOV.payload_type_id,
            subquery.c.max_iov == PayloadIOV.iov,
        ),
    )
    .filter(GlobalTag.name == requested_gt)
    .order_by(PayloadType.name)
).all()

for global_tag_name, payload_type_name, payload_url, max_iov in query:
    print(
        f"GlobalTag: {global_tag_name}, PayloadType: {payload_type_name}, PayloadIOV URL: {payload_url}, IOV: {max_iov}"
    )
GlobalTag: DetectorConfiguration, PayloadType: DAQSettings, PayloadIOV URL: http://example.com/daq2, IOV: 2
GlobalTag: DetectorConfiguration, PayloadType: DCSSettings, PayloadIOV URL: http://example.com/dcs2, IOV: 2

Key Points


Coffee Break

Overview

Teaching: 0 min
Exercises: 15 min
Questions
Objectives


Key Points


SQLite

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • What is SQLite?

  • How do I create an SQLite database?

Objectives
  • Creating SQLite database

  • Manipulating the database

Introduction to SQLite

Similar to MySQL, SQLite is a database management system. However, it is simpler, more flexible, and more user-friendly than the former. It implements a self-contained, serverless, zero-configuration, transactional SQL database engine, which makes SQLite a preferred choice for lightweight embedded databases. SQLite is available in both C and Python; here we will proceed with the Python implementation, which is user-friendly and ships with Python (2.6+) as the built-in sqlite3 module.

Particle Data Group database

Create a working directory hsf_sqlite_training:

mkdir hsf_sqlite_training
cd hsf_sqlite_training

Let's create a database that contains information from the Particle Data Group particle properties table. First, we need to download the particle properties table from 2023. You can do it manually and place it in your directory hsf_sqlite_training under the name particle_table.txt, or, if you use Python 3.7+, you can download the file automatically with the Python module requests.

Additional: using requests

To check the Python version on your machine do the following:

python3 --version

If the Python version is 3.7 or higher, you can install requests with python -m pip install requests. Then you can create a script download_particle_table.py using your favorite code editor:

import requests
particle_table = "https://pdg.lbl.gov/2023/mcdata/mass_width_2023.txt" #url of the file we want
response = requests.get(particle_table) #getting the response from the url.

if response.status_code == 200:
   with open("particle_table.txt", "wb") as file: #writing the response into the txt file `particle_table.txt` locally
       file.write(response.content)
       print("Particle table is downloaded!")
else:
   print("Failed to download the particle table.") #it can be that the server is down or you have a typo in the url

Save the download_particle_table.py script and execute with python3 download_particle_table.py. You should see the downloaded file particle_table.txt in your working directory.

Open the particle_table.txt in your favorite text editor and study the data structure inside. You should see:

* MASSES, WIDTHS, AND MC ID NUMBERS FROM 2023 EDITION OF RPP
*
* The following values were generated on 31-May-2023 by the Berkeley Particle
* Data Group from the Review of Particle Physics database and are intended
* for use in Monte Carlo programs.
*
* For questions regarding distribution or content of this file, contact
* the Particle Data Group at pdg@lbl.gov.
*
* To process the images in this file:
* 1) ignore documentation lines that begin with an asterisk
* 2) in a FORTRAN program, process data lines with
*    FORMAT (BN, 4I8, 2(1X,E18.0, 1X,E8.0, 1X,E8.0), 1X,A21)
* 3)    column
*       1 -  8 \ Monte Carlo particle numbers as described in the "Review of
*       9 - 16 | Particle Physics". Charge states appear, as appropriate,
*      17 - 24 | from left-to-right in the order -, 0, +, ++.
*      25 - 32 /
*           33   blank
*      34 - 51   central value of the mass (double precision)
*           52   blank
*      53 - 60   positive error
*           61   blank
*      62 - 69   negative error
*           70   blank
*      71 - 88   central value of the width (double precision)
*           89   blank
*      90 - 97   positive error
*           98   blank
*      99 -106   negative error
*          107   blank
*     108 -128   particle name left-justified in the field and
*                charge states right-justified in the field.
*                This field is for ease of visual examination of the file and
*                should not be taken as a standardized presentation of
*                particle names.
*
* Particle ID(s)                  Mass  (GeV)       Errors (GeV)       Width (GeV)       Errors (GeV)      Name          Charges
      21                          0.E+00            +0.0E+00 -0.0E+00  0.E+00            +0.0E+00 -0.0E+00 g                   0
      22                          0.E+00            +0.0E+00 -0.0E+00  0.E+00            +0.0E+00 -0.0E+00 gamma               0
      24                          8.0377E+01        +1.2E-02 -1.2E-02  2.08E+00          +4.0E-02 -4.0E-02 W                   +
      23                          9.11876E+01       +2.1E-03 -2.1E-03  2.4955E+00        +2.3E-03 -2.3E-03 Z                   0
      25                          1.2525E+02        +1.7E-01 -1.7E-01  3.2E-03           +2.4E-03 -1.7E-03 H                   0
...

Now that we have obtained the particle table, let’s build the SQlite database!

Create the sqlite3 database

Firstly, we need to connect to the sqlite3 database. In a file called create_database.py, type in:

import sqlite3

connection = sqlite3.connect(
    ":memory:"
)  # create or connect to database. Returns `Connection` object.

"""
Here one can perform operations on the data base, such as inserting, updating, deleting and selecting.
"""

connection.close()  # close database connection

The Connection object represents the database that, in this case, is stored in RAM using the special name :memory:. If you want to save the database locally, replace :memory: with <name>.db. If the file <name>.db already exists, connect will simply connect to that file; if it does not exist, connect will create a new file called <name>.db. After all the database operations are finished, one needs to close the connection using the close() method of the Connection object.

Fill in the sqlite3 database

Secondly, we need to fill the database with the inputs from the particle_table.txt we downloaded before. To fill it, we use SQL commands. The SQL commands can be passed to the database using a Cursor object, which can be created once the Connection object exists. The Cursor object's execute method takes in SQL commands, which allows one to work with databases directly in SQL. As a first step, we create the database table based on the data structure of particle_table.txt and fill it in. In create_database.py:

import sqlite3
import pandas as pd

connection = sqlite3.connect(
    ":memory:"
)  # create or connect to database. Returns `Connection` object.
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE particles(id INTEGER, mass FLOAT, masserrlow FLOAT, masserrup FLOAT, width FLOAT, widtherrlow FLOAT, widtherr_up FLOAT,  name TEXT, charge INTEGER)"
)
cursor.execute(
    "INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [21, 0.0e00, +0.0e00, -0.0e00, 0.0e00, +0.0e00, -0.0e00, "g", 0],
)  # gluon is filled in
connection.commit()  # commit to database
readout = pd.read_sql("SELECT * FROM particles", connection)  # read from database
print(readout)  # print the database
connection.close()  # close database connection

In the execute call we created a table called particles that contains a particle id (INTEGER), mass (FLOAT), upper and lower mass errors (FLOAT), width (FLOAT), upper and lower width errors (FLOAT), particle name (TEXT), and charge (INTEGER). Then we fill in the information on the gluon into our particles table. After that, we commit the changes to the database with the Connection object's commit method. To read the database, we use the pandas read_sql function. Now execute create_database.py with python3 create_database.py. You should see:

   id  mass  masserrlow  masserrup  width  widtherrlow  widtherr_up name charge
0  21   0.0         0.0        0.0    0.0          0.0          0.0    g      0

However, we would like to write the entire list of PDG particles into our database, not just the gluon! For this, we will use executemany instead of execute. But before that, we need to prepare the input data in particle_table.txt. First of all, the first 38 lines are taken up by header information that we do not need in our database, so we will skip these rows when copying. Secondly, the sizes of the columns differ in particle_table.txt, but luckily the header specifies the exact sizes of the table columns:

* 3)    column
*       1 -  8 \ Monte Carlo particle numbers as described in the "Review of
*       9 - 16 | Particle Physics". Charge states appear, as appropriate,
*      17 - 24 | from left-to-right in the order -, 0, +, ++.
*      25 - 32 /
*           33   blank
*      34 - 51   central value of the mass (double precision)
*           52   blank
*      53 - 60   positive error
*           61   blank
*      62 - 69   negative error
*           70   blank
*      71 - 88   central value of the width (double precision)
*           89   blank
*      90 - 97   positive error
*           98   blank
*      99 -106   negative error
*          107   blank
*     108 -128   particle name left-justified in the field and
*                charge states right-justified in the field.
*                This field is for ease of visual examination of the file and
*                should not be taken as a standardized presentation of
*                particle names.

You can check in your favorite text editor that the column definitions are correct. Note that particle_table.txt contains more information than we want to save in our database. For example, we are not writing down the charge states that are listed in columns 9 - 32, so we would like to skip these columns. Moreover, columns 108 - 128 contain both the name and the charge, but we would like to split these into two columns in our database. We can then prepare the data as follows:

import pandas as pd

colspecs = [
    (0, 8),
    (33, 51),
    (52, 60),
    (61, 69),
    (70, 88),
    (89, 97),
    (98, 106),
    (107, 128),
]
column_names = [
    "id",
    "mass",
    "masserrup",
    "masserrdown",
    "width",
    "widtherrup",
    "widtherrdown",
    "namecharge",
]

# Read the file with the specified column widths
data = pd.read_fwf(
    "particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)
data = data.values.tolist()

executemany expects an iterable input, so we transform the pandas DataFrame into nested lists. Now we can commit the entire particle table to the database.

cursor.executemany("INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)", data)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection)  # read from database
print(readout)  # print the database
connection.close()

Save the create_database.py and execute with python3 create_database.py, you should see the following output:

       id       mass  masserrlow  masserrup         width   widtherrlow  \
0      21    0.00000     0.00000    0.00000  0.000000e+00  0.000000e+00
1      22    0.00000     0.00000    0.00000  0.000000e+00  0.000000e+00
2      24   80.37700     0.01200   -0.01200  2.080000e+00  4.000000e-02
3      23   91.18760     0.00210   -0.00210  2.495500e+00  2.300000e-03
4      25  125.25000     0.17000   -0.17000  3.200000e-03  2.400000e-03
..    ...        ...         ...        ...           ...           ...
224  5114    5.83474     0.00030   -0.00030  1.040000e-02  8.000000e-04
225  5224    5.83032     0.00027   -0.00027  9.400000e-03  5.000000e-04
226  5132    5.79700     0.00060   -0.00060  4.190000e-13  1.100000e-14
227  5232    5.79190     0.00050   -0.00050  4.450000e-13  9.000000e-15
228  5332    6.04520     0.00120   -0.00120  4.000000e-13  5.000000e-14

      widtherr_up       name charge
0    0.000000e+00          g      0
1    0.000000e+00      gamma      0
2   -4.000000e-02          W      +
3   -2.300000e-03          Z      0
4   -1.700000e-03          H      0
..            ...        ...    ...
224 -8.000000e-04  Sigma(b)*      -
225 -5.000000e-04  Sigma(b)*      +
226 -1.100000e-14      Xi(b)      -
227 -9.000000e-15      Xi(b)      0
228 -4.000000e-14   Omega(b)      -

[229 rows x 9 columns]

Creating an SQL database directly from a pandas DataFrame

The pandas module has some useful methods for creating SQL databases. The example above can be replaced with:

import sqlite3
import pandas as pd

colspecs = [
    (0, 8),
    (33, 51),
    (52, 60),
    (61, 69),
    (70, 88),
    (89, 97),
    (98, 106),
    (107, 128),
]
column_names = [
    "id",
    "mass",
    "masserrup",
    "masserrdown",
    "width",
    "widtherrup",
    "widtherrdown",
    "namecharge",
]

data = pd.read_fwf(
    "particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)

connection = sqlite3.connect(":memory:")
data.to_sql("particles", connection, index=False)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection)  # read from database
print(readout)  # print the database
connection.close()

Manipulating the database with SQLite

Before learning how to manipulate the database, let's first save the database created in the previous steps not in RAM but in a file called particles.db. Replace the database name :memory: with particles.db and rerun create_database.py. You should see particles.db in your directory.

Exercise

Open the particles.db file and create a table that contains all neutral particles.

Solution

cursor.execute("CREATE TABLE neutral_particles AS SELECT * from particles WHERE charge = '0'")

Exercise

Open the particles.db file and select only neutral leptons.

Solution

readout = pd.read_sql("SELECT * FROM neutral_particles WHERE name LIKE 'nu%'", connection)
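For reference, a minimal self-contained script that opens particles.db and runs both exercise queries (assuming you created the file as described above) could look like this:

import sqlite3
import pandas as pd

connection = sqlite3.connect("particles.db")
cursor = connection.cursor()

# Create a table containing all neutral particles (skipped if it already exists)
cursor.execute(
    "CREATE TABLE IF NOT EXISTS neutral_particles AS SELECT * FROM particles WHERE charge = '0'"
)
connection.commit()

# Select only neutral leptons (names starting with 'nu')
readout = pd.read_sql(
    "SELECT * FROM neutral_particles WHERE name LIKE 'nu%'", connection
)
print(readout)
connection.close()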

Key Points

  • For lightweight applications, use SQLite.

  • Benefit from integration between sqlite3 and pandas.


Intro to NoSQL and Opensearch Queries

Overview

Teaching: x min
Exercises: 6 min
Questions
  • What is NoSQL database and Opensearch?

  • How to perform indexing in Opensearch?

  • How to query and filter records in opensearch?

Objectives
  • Understand the basic structure of Opensearch queries.

  • Learn how to create and manage indices in Opensearch.

  • Practice using different types of queries such as term queries, range queries, and compound queries.

  • Gain familiarity with updating and deleting documents in Opensearch.

NoSQL Databases

NoSQL databases diverge from the traditional table-based structure of RDBMSs and are designed to handle unstructured or semi-structured data. They offer flexibility in data modeling and storage, supporting various data formats. The main types of NoSQL database are listed below:

  • Key-Value Store: stores data as key-value pairs; simple and efficient for basic storage and retrieval operations. Examples: Redis, DynamoDB, Riak
  • Document-Oriented: stores data in flexible JSON-like documents, allowing nested structures and complex data modeling. Examples: MongoDB, Couchbase, CouchDB, OpenSearch, Elasticsearch
  • Column-Family Store: organizes data into columns rather than rows, suitable for analytical queries and data warehousing. Examples: Apache Cassandra, HBase, ScyllaDB
  • Graph Database: models data as nodes and edges, ideal for complex relationships and network analysis. Examples: Neo4j, ArangoDB, OrientDB
  • Wide-Column Store: similar to column-family stores but optimized for wide rows and scalable columnar data storage. Examples: Apache HBase, Apache Kudu, Google Bigtable

OpenSearch Databases

OpenSearch is a document-oriented NoSQL database: it stores data as JSON documents. It is also a distributed search and analytics engine designed for scalability, real-time data processing, and full-text search capabilities. It is often used for log analytics, monitoring, and exploring large volumes of structured and unstructured data.

In the following chapters, we will build a metadata search engine/database. We will exploit the functionality of OpenSearch to create a database where we can store files with their corresponding metadata, and look for the files that match metadata queries.

Opensearch Queries

Let's explore fundamental Opensearch queries and concepts. Opensearch provides powerful search capabilities; here are some core queries that you'll use.

Setting up

Make sure you have Python installed on your system. We will work inside a virtual environment. First, create a working directory:

mkdir myopenhsfwork && cd myopenhsfwork

Create a virtual environment:

python -m venv venv

Activate the virtual environment:

source venv/bin/activate

Install Jupyter and the OpenSearch Python client (opensearch-py):

pip install jupyter
pip install opensearch-py

Then bring up Jupyter. In your virtual environment, run the following command:

jupyter-notebook

Now create a new notebook and run the subsequent commands in its cells.

OpenSearch connection

We will use the OpenSearch class from opensearchpy to establish the connection and initialize the Opensearch client. We need to specify OPENSEARCH_HOST and OPENSEARCH_PORT, which were set during setup (localhost and 9200, respectively). We write OPENSEARCH_USERNAME and OPENSEARCH_PASSWORD (the same password you specified during setup) directly in the code for this tutorial only; don't store credentials in code. Two further options control transport security: use_ssl tells the OpenSearch client whether to use SSL/TLS (Secure Sockets Layer / Transport Layer Security), and verify_certs controls whether the client verifies the SSL certificate presented by the server. For this tutorial we keep use_ssl enabled but disable certificate verification; for a production instance, certificate verification should be enabled.

from opensearchpy import OpenSearch

OPENSEARCH_HOST = "localhost"
OPENSEARCH_PORT = 9200
OPENSEARCH_USERNAME = "admin"
OPENSEARCH_PASSWORD = "<custom-admin-password>"
# Initialize an Opensearch client
es = OpenSearch(
    hosts=[{"host": OPENSEARCH_HOST, "port": OPENSEARCH_PORT}],
    http_auth=(OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD),
    use_ssl=True,
    verify_certs=False,
)
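
As an optional sanity check, you can ask the cluster for its basic information; if the host, port, or credentials are wrong, this call raises an error:

# Confirm the client can reach the cluster; prints cluster name and version info
print(es.info())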

Create an index

An index is a logical namespace that holds a collection of documents. It defines the schema or structure of the documents it contains, including the fields and their data types. A mapping is the definition of how fields and data types are structured within the documents stored in an index: it is the schema or blueprint for the documents, specifying the characteristics of each field such as data type, indexing options, analysis settings, and more. If no mapping is provided, Opensearch infers one automatically. We will define a mapping for the metadata attributes. For strings there are two data type options: the "keyword" type is used for exact matching and filtering, and the "text" type is used for full-text search and analysis.

index_name = "metadata"
# Define mappings for the index
mapping = {
    "properties": {
        "filename": {"type": "text"},
        "run_number": {"type": "integer"},
        "total_event": {"type": "integer"},
        "collision_type": {"type": "keyword"},
        "data_type": {"type": "keyword"},
        "collision_energy": {"type": "integer"},
        "description": {"type": "text", "analyzer": "standard"},
    }
}
# define setting of index
index_body = {"settings": {"index": {"number_of_shards": 1}}, "mappings": mapping}
# create the index
es.indices.create(index=index_name, body=index_body)

# check if the index exists
exists = es.indices.exists(index=index_name)
if exists:
    print("Successfully created index %s" % index_name)
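
You can also read the mapping back from the cluster to confirm that each field was registered with the intended type:

# Retrieve the stored mapping and print the field definitions
current_mapping = es.indices.get_mapping(index=index_name)
print(current_mapping[index_name]["mappings"]["properties"])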

Index documents

We will index four documents into the "metadata" index. Each document represents a dataset with various fields such as filename, run number, total events, collision type, data type, collision energy, and description.

The _op_type parameter in the bulk indexing operation specifies the operation type for each document. In this case we use "create", which creates a new document only if one with the same ID does not already exist; if it does, the "create" operation fails. The other common operation type is "index", which overwrites the document if it already exists.

document1 = {
    "filename": "expx.myfile1.root",
    "run_number": 100,
    "total_event": 1112,
    "collision_type": "pp",
    "data_type": "data",
    "collision_energy": 250,
    "description": "This file is produced with L1 and L2 trigger.",
}
document2 = {
    "filename": "expx.myfile2.root",
    "run_number": 55,
    "total_event": 999,
    "collision_type": "pPb",
    "data_type": "mc",
    "collision_energy": 100,
    "description": "This file is produced without beam background.",
}
document3 = {
    "filename": "expx.myfile3.root",
    "run_number": 120,
    "total_event": 200,
    "collision_type": "PbPb",
    "data_type": "data",
    "collision_energy": 150,
    "description": "This file is produced without cherenkov detector",
}
document4 = {
    "filename": "expx.myfile4.root",
    "run_number": 360,
    "total_event": 1050,
    "collision_type": "pPb",
    "data_type": "mc",
    "collision_energy": 50,
    "description": "This file is produced with beam background",
}
documents = [document1, document2, document3, document4]
print("Total number of documents to be indexed: %s" % str(len(documents)))

There are two ways to index documents: one at a time, or in bulk.

Synchronous indexing

for document in documents:
    _id = document["filename"]
    res = es.index(index=index_name, id=_id, body=document, op_type="create")
    print(res)
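
Because we used op_type="create", running the cell above a second time fails, since documents with those IDs already exist. A minimal sketch of handling that case, assuming opensearchpy's ConflictError exception:

from opensearchpy.exceptions import ConflictError

for document in documents:
    _id = document["filename"]
    try:
        es.index(index=index_name, id=_id, body=document, op_type="create")
    except ConflictError:
        # "create" refuses to overwrite a document that already has this ID
        print("Document %s already exists, skipping" % _id)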

Bulk Indexing

The bulk function from the OpenSearch Python client is used to perform bulk indexing operations, which is more efficient than indexing documents one by one.

from opensearchpy import helpers

actions = []
duplicates = []
for document in documents:
    _id = document["filename"]

    # Check if document exists already
    if es.exists(index=index_name, id=_id):
        duplicates.append(document)
    else:
        actions.append(
            {
                "_index": index_name,
                "_id": _id,
                "_source": document,
                "_op_type": "create",
            }
        )

res = helpers.bulk(es, actions)
print("Total number of successfully indexed documents: %s" % (str(res[0])))
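
Newly indexed documents become searchable only after the index refreshes (which normally happens about once per second). To query right away and confirm how many documents the index holds, you can trigger a refresh and count the documents:

# Force a refresh so the new documents are searchable immediately, then count them
es.indices.refresh(index=index_name)
print(es.count(index=index_name)["count"])  # 4 if all four documents were indexed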

Search for documents

Term Level Query

A term-level query is a type of query used to search for documents that contain an exact term or value in a specific field. It can be applied to keyword and integer data types. This lets you search for documents by a single field (i.e. by a single metadata attribute in our case).

Term Query

Query structure looks like:

{
    "query": {"term": {<field>: <value>}}
}
search_query = {"query": {"term": {"collision_type": "pp"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])
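
The search response also carries the total number of matches, which can be handy to check before looping over the hits:

# The "hits" section contains both the matching documents and the total count
print(search_results["hits"]["total"]["value"])  # number of documents matching the term query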

Search for the filename of documents with data_type mc.

Retrieve and display filename

Solution

search_query = {"query": {"term": {"data_type": "mc"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
        print(hit["_source"]["filename"])
expx.myfile2.root
expx.myfile4.root

Range Query

It is also a term-level query, where we can apply a range of values to a field. The query structure looks like:

{
  "query": {
    "range": {
      <field>: {
        "gte": <lower_value>,
        "lte": <upper_value>
      }
    }
  }
}

Here we have a choice of operators: gte for greater than or equal to, gt for greater than, lte for less than or equal to, and lt for less than. Let's get the documents with run_number between 60 and 150, both inclusive.

search_query = {"query": {"range": {"run_number": {"gte": 60, "lte": 150}}}}

search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of all documents whose collision_energy ranges from 100 to 200 (both exclusive).

Retrieve and display filename with range query

Solution

search_query = {
   "query": {
       "range": {
            "collision_energy": {
                "gt": 100,
                "lt": 200
            }
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
       print(hit["_source"]["filename"])
expx.myfile3.root

Prefix Query

Another term-level query is the prefix query. As the name suggests, it searches for terms with a specific prefix. The format of this query is:

{
  "query": {
    "prefix": {
      <field>: {
        "value": <prefix>
        }
    }
  }
}

Let's get the documents whose collision_type has the prefix "p".

search_query = {"query": {"prefix": {"collision_type": {"value": "p"}}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Compound Query

There are different subcategories of compound queries, but here we will focus on the boolean (bool) query. These queries allow you to combine multiple conditions to filter and retrieve documents that meet specific criteria.

Must Query

The must query combines multiple conditions and finds documents that match all specified criteria; it is equivalent to the AND operator. Let's get the documents with collision_type pp and data_type data:

search_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"collision_type": "pp"}},
                {"term": {"data_type": "data"}},
            ]
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of documents with data_type data and collision_energy 150.

Retrieve and display filename

Solution

search_query = {
   "query": {
       "bool": {
           "must": [
               { "term": {"data_type": "data"} },
               { "term": { "collision_energy": 150 } }
           ]
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
       print(hit["_source"]["filename"])
expx.myfile3.root

Should Query

The should query searches for documents that match any of the specified conditions; it is equivalent to the OR operator. Let's get the documents whose collision_type is pp or whose collision_energy is 1127:

search_query = {
    "query": {
        "bool": {
            "should": [
                {"term": {"collision_type": "pp"}},
                {"term": {"collision_energy": 1127}},
            ]
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of documents with run_number 55 or collision_energy 150.

Retrieve and display filename

Solution

search_query = {
   "query": {
       "bool": {
           "should": [
               { "term": {"run_number": 55} },
               { "term": { "collision_energy": 150 } }
           ]
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
        print(hit["_source"]["filename"])
expx.myfile2.root
expx.myfile3.root

Must Not Query

The must_not query excludes documents that match the specified condition; it is equivalent to the NOT operator. Let's get the documents that do not have collision_energy 250:

search_query = {"query": {"bool": {"must_not": [{"term": {"collision_energy": 250}}]}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for the filename of all documents whose run_number is not 55.

Retrieve and display filename

Solution

search_query = {
   "query": {
       "bool": {
           "must_not": [
               { "term": {"run_number": 55} }
           ]
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
        print(hit["_source"]["filename"])
expx.myfile1.root
expx.myfile3.root
expx.myfile4.root

Search for the filename of all documents that must have total_event greater than 200 and run_number greater than 50, should have collision_type PbPb, and must not have collision_energy 150.

Retrieve and display filename combining must, should and must_not queries.

Solution

search_query = {
   "query": {
       "bool": {
           "must": [
               { "range": { "total_event": { "gt": 200 } } },
               { "range": { "run_number": { "gt": 50 } } }
           ],
           "should": {
               "term": { "collision_type": "PbPb" }
           },
           "must_not": {
               "term": { "collision_energy": 150 }
           }
       }
   }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["filename"])
expx.myfile1.root
expx.myfile2.root
expx.myfile4.root

Update a document by filename

Let's update a field of a document. We can update a specific document by its document ID (_id). The following updates the "data_type" field of the document with ID "expx.myfile1.root" to "deriv".

_id = "expx.myfile1.root"
data = {"data_type": "deriv"}
es.update(index=index_name, id=_id, body={"doc": data})
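
To confirm the change, you can fetch the document back by its ID and inspect the field:

# Retrieve the updated document and check the modified field
doc = es.get(index=index_name, id=_id)
print(doc["_source"]["data_type"])  # now prints "deriv"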

Delete a document by filename (-> _id)

Let's delete a document by its document ID (which is the filename in our case):

_id = "expx.myfile1.root"
es.delete(index=index_name, id=_id)
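
You can verify the deletion by checking whether a document with that ID still exists:

# exists() returns False once the document has been removed
print(es.exists(index=index_name, id=_id))  # False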

Key Points

  • Opensearch queries can be used to search, update, and delete documents in an Opensearch index.

  • Indices in Opensearch define the structure and mapping of documents.

  • Term queries match exact terms or values in a specific field.

  • Range queries match documents within a specified range of values.

  • Compound queries combine multiple conditions using boolean logic.


Opensearch Text Based Queries

Overview

Teaching: x min
Exercises: 2 min
Questions
  • How to perform text-based search in Opensearch?

  • What are the ways to do text-based search in Opensearch?

Objectives
  • Understand the fundamental query types in Opensearch for text-based searches.

  • Learn how to construct and execute various text-based queries using Opensearch.

Text Based Queries

Let's first understand why Opensearch has advantages over MySQL (SQL) for full-text search.

MySQL/SQL Limitations:

  • Free-text search typically relies on LIKE '%...%' pattern matching, which cannot use ordinary indexes and scans large portions of the table as data grows.

  • Full-text features such as fuzzy matching, relevance tuning, and text analysis are limited compared to a dedicated search engine.

OpenSearch (NoSQL) Advantages:

  • Text fields are analyzed and stored in an inverted index, so full-text searches remain fast even on large volumes of data.

  • Queries natively support relevance scoring as well as phrase, wildcard, prefix, and fuzzy matching, and the engine scales horizontally across nodes.

Opensearch is a powerful search and analytics engine that excels in handling text-based queries efficiently. Understanding how to construct and utilize text-based queries in Opensearch is crucial for effective data retrieval and analysis.

This section will delve into the concepts and techniques involved in Opensearch text-based queries.


Match Phrase Query

Search for documents containing an exact phrase in the description field. Structure of query is:

{
    "query": {
        "match_phrase": {
            "<field>": "<phrase>"
        }
    }
}

Let's search for document(s) with the exact phrase "without beam background" in the description field.

search_query = {"query": {"match_phrase": {"description": "without beam background"}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for documents with the exact phrase "without cherenkov detector".

Retrieve documents with match phrase query.

Solution

search_query = {
    "query": {
       "match_phrase": {
            "description": "without cherenkov detector"
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])
{'filename': 'expx.myfile3.root', 'run_number': 120, 'total_event': 200, 'collision_type': 'PbPb', 'data_type': 'data', 'collision_energy': 150, 'description': 'This file is produced without cherenkov detector'}

Match query

The match query is a basic query type in Opensearch used to search for documents containing specific words in a specified field, such as the description field. The structure of the query is:

{
    "query": {
        "match": {
            "<field>": "<text>"
        }
    }
}

Let's search for documents containing the words "without" or "beam" in the description field. By default, the match query returns documents containing either of the words.

search_query = {"query": {"match": {"description": "without beam"}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

You should see three documents, with filenames expx.myfile2.root, expx.myfile3.root and expx.myfile4.root, as these documents contain either the word "without" or the word "beam". You can also add the operator "and" to the query so that all the words must be present in the field:

{
    "query": {
        "match": {
            "<field>": {
                "query": "<text>",
                "operator": "and"
            }
        }
    }
}

For example, to get the documents containing both the words "beam" and "cherenkov", you would do the following (note that in our sample documents no description contains both words, so this query returns no hits):

search_query = {
    "query": {"match": {"description": {"query": "beam cherenkov", "operator": "and"}}}
}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"])

Search for documents with the words "cherenkov" or "trigger".

Retrieve documents with a match query.

Solution

search_query = {
    "query": {
       "match": {
            "description": "cherenkov trigger"
        }
    }
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"])
{'filename': 'expx.myfile3.root', 'run_number': 120, 'total_event': 200, 'collision_type': 'PbPb', 'data_type': 'data', 'collision_energy': 150, 'description': 'This file is produced without cherenkov detector'}
{'filename': 'expx.myfile1.root', 'run_number': 100, 'total_event': 1112, 'collision_type': 'pp', 'data_type': 'data', 'collision_energy': 250, 'description': 'This file is produced with L1 and L2 trigger.'}

query_string

Wildcard

Wildcard queries are used to search for documents based on patterns or partial matches within a field. In the example below, a query_string query with a wildcard is used to search for documents where the description field contains a term starting with "L", i.e. L1 or L2.

search_query = {
    "query": {"query_string": {"default_field": "description", "query": "L*"}}
}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["description"])

You should see a single document with filename expx.myfile1.root.

Prefix Query:

Prefix queries are used to search for documents where a specified field starts with a specific prefix. For example, the prefix query below searches for documents where the filename field starts with “expx.”:

search_query = {"query": {"prefix": {"filename": "expx."}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["filename"])

This query matches the documents whose filename starts with "expx.", i.e. the example files indexed earlier.

Fuzzy Query:

Fuzzy queries are used to find documents with terms similar to a specified term, allowing for some degree of error or variation. In the example below, a fuzzy query is used to search for documents with terms similar to "producd" (a misspelling of "produced") in the description field:

search_query = {"query": {"fuzzy": {"description": "producd"}}}

search_results = es.search(index=index_name, body=search_query)

for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["description"])

This query matches the documents containing "produced", because "producd" is within the allowed edit distance of that term; how much variation is tolerated is controlled by the fuzziness parameter.

Key Points

  • Opensearch supports a range of text-based query types, including match, match_phrase, wildcard, prefix, and fuzzy queries.

  • Each query type has specific use cases and parameters that can be customized for tailored search results.

  • Efficient utilization of text-based queries in Opensearch can significantly enhance data retrieval and analysis capabilities.