Introduction
Overview
Teaching: 60 min
Exercises: 30 min
Questions
What is a database management system and why is it a good idea to use one?
What are the differences between relational and non-relational databases?
Objectives
Understand the concepts of a database management system
Learn about common database management systems used in HEP and Nuclear Physics
What is a database?
A database is a collection of data stored in a computer system. For example:
- A database that contains information on members of a collaboration, the institutions they belong to, and the publications they have authored
- Or a database that contains information on the files produced by an experiment: their path, size, the number of events they contain
The data is typically organized to model relevant characteristics of the information contained in the database.
Technically speaking, a database can be something as simple as a text file with data stored in a structured format like comma-separated values. For example, you may want to track the files being produced by an experiment in a text file like this:
Filename, size, date, number of events, energy
disk01/file1.root, 1.2GB, 2023-10-01, 12000, 13TeV
disk03/file2.root, 1.3GB, 2023-10-02, 14000, 13TeV
...
This is usually known as metadata: data about data. Having metadata catalogs is a common practice in HEP and Nuclear Physics.
In practice, this is not a reliable or efficient way to store (meta)data. For example, if you want to know how many files were produced on a given date, you would have to read the entire file and count the lines that match the date you are interested in. What if you have millions of files? That would be very slow! Or if you want to know the total size of the files produced on a given date, you would have to read the entire file and sum the sizes of the files that match that date. Things get much worse when updates are needed, as the information can very easily get corrupted. And what if multiple people need to constantly enter information? What a nightmare to keep everything in sync!
Data organization
When developing software, how we organize data is a critical decision that affects the performance and scalability of the application.
In practice, databases are managed by software called database management systems (DBMS). A DBMS is a computer software application that interacts with the user, other applications, and the database itself to capture and analyze data. A general-purpose DBMS is specialized in handling databases in a scalable and reliable way, provides mechanisms for safe concurrent access to the data, and ensures that the data is not corrupted.
From now on, in this module we will refer to a “database” as an alias for a DBMS.
Relational Database
A relational database is a type of DBMS that organizes data into tables with rows and columns. Following our example above, it would look like this:
+----+------------+----------+-----------+------------+-------------------+--------+
| id | filename | disk | size | date | number of events | energy |
+----+------------+----------+-----------+------------+-------------------+--------+
| 1 | file1.root | disk01 | 1.2GB | 2023-10-01 | 12000 | 13TeV |
| 2 | file2.root | disk03 | 1.3GB | 2023-10-02 | 14000 | 13TeV |
| 3 | file3.root | disk01 | 1.1GB | 2023-10-02 | 13000 | 13TeV |
...
In a relational database, data manipulation is performed via Structured Query Language (SQL). You can use SQL to
- Create an entry
- Read entries
- Update entries
- Delete entries
In computer programming, these functions are referred to by the acronym CRUD.
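As a quick preview of what these operations look like in SQL (the commands are covered in detail in the next episodes), here is a sketch using a hypothetical files table:
INSERT INTO files (filename, size) VALUES ('file1.root', 1.2);   -- Create
SELECT filename, size FROM files;                                 -- Read
UPDATE files SET size = 1.3 WHERE filename = 'file1.root';        -- Update
DELETE FROM files WHERE filename = 'file1.root';                  -- Delete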
In SQL, an attribute usually refers to a column or field within a table. For example, in the table above, some of the attributes are filename, size, number of events, and energy.
The database can have multiple tables, and the tables can be related to each other. For example, you can have a table with the information about the files and another table with the information about the location where the files are stored:
+----+----------+-----------+--------------------+
| id | disk | capacity | path |
+----+----------+-----------+--------------------+
| 1 | disk01 | 100TB | /fs01/exp/data01 |
| 2 | disk02 | 120TB | /fs01/exp/data02 |
| 3 | disk03 | 150TB | /fs01/exp/data03 |
...
We will look in detail how to create and manage tables in a relational database in the next episodes.
Examples of relational databases include MySQL, PostgreSQL, Oracle Database, and SQLite. In this training, we will focus on MySQL, with a brief introduction to SQLite.
The performance of a relational database is usually very good for most use cases, and it is the most common type of database used in HEP and Nuclear Physics.
NoSQL Databases
There are use cases where a relational database is not the best choice, for example when the data is unstructured or semi-structured:
- A database of files that require specific fields depending on the type of file: a "flat" ROOT file created by an analyst can have different metadata than a raw data file or an AOD file.
- A catalog of parts for a detector, where each part has different properties
Non-relational databases, also known as NoSQL, are databases that do not use the tabular schema of columns found in relational databases. Instead, they use a storage model optimized for the specific requirements of the type of data being stored. In other words, they don't have a fixed set of attributes that each record must have, and the schema is more flexible.
Relational vs NoSQL databases
NoSQL databases have become very popular due to the flexibility they offer. However, many-to-one and many-to-many relationships are more easily represented in a relational database. Which one to choose depends on the specific requirements and must be considered carefully.
In the last part of this training module, we will discuss NoSQL databases in more detail, and we will take a quick look at OpenSearch, a distributed search engine used as a NoSQL database.
Key Points
A database management system (DBMS) is specialized in managing databases in a scalable and reliable way.
A relational database organizes data into tables with rows and columns.
A non-relational database does not use the tabular schema of columns found in relational databases, having a more flexible schema.
MySQL Basics
Overview
Teaching: x min
Exercises: x min
Questions
What are the core SQL commands you need to know to interact with a MySQL database?
How do you create, update, and delete records in a MySQL database?
Objectives
Understand the basic SQL commands used in MySQL.
Learn how to create a database and tables in MySQL.
Practice inserting, updating, and deleting records in a MySQL table.
SQL Commands
Here are some of the core SQL commands that you’ll use in MySQL:
- CREATE DATABASE: Create a new database.
- CREATE TABLE: Create a new table.
- INSERT INTO: Insert new records into a table.
- SELECT: Retrieve data from a database.
- UPDATE: Modify existing data in a table.
- DELETE: Remove data from a table.
Setting up for SQL commands
In the terminal, run the following command:
docker exec -it metadata bash -c "mysql -uroot -pmypassword"
Then you will see the MySQL command prompt, mysql>. All the SQL commands below have to be typed at this prompt.
Create a database.
We will first create a database named metadata in our MySQL server.
CREATE DATABASE metadata;
Case sensitivity in MySQL
In MySQL, SQL commands are case-insensitive. In other words, CREATE DATABASE metadata is the same as create database metadata. However, it is common practice to write SQL commands in uppercase to distinguish them from table and column names, which are case-sensitive.
You can list all the databases by using the command
SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| metadata |
| mysql |
| performance_schema |
| sys |
+--------------------+
It shows that the database “metadata” is created.
What are those other databases?
By default, MySQL comes with several databases that serve specific purposes. We will not go into details of each database, but here is a brief overview:
- mysql: Contains user account information and privileges.
- information_schema: Contains metadata about all the other databases in the MySQL server.
- performance_schema: Contains performance metrics for the MySQL server.
- sys: Used for tuning and diagnosis use cases.
You can read more about these databases in the MySQL documentation. For now it is not necessary to understand them in detail.
To work with a specific database, you can use the USE command. For instance, to select the “metadata” database:
USE metadata;
Database changed
Creating a table
In SQL, the CREATE TABLE command is used to define a new table within a database. When creating a table, you specify the structure of the table, including the names of its columns, the data types for each column, and any constraints that ensure data integrity.
Here’s a breakdown of the components within the command CREATE TABLE <table_name> (<colunmn_name> <data_type> <constraints>)
command:
- <table_name>: The name of the table you're creating. It should be meaningful and reflect the type of data the table will store.
- <column_name>: The name of each column you define in the table.
- <data_type>: This defines the kind of data that a column can hold. Choosing the right data type is crucial because it determines how the data will be stored and processed. Some commonly used data types are:
  - INT (Integer): Used for whole numbers.
  - VARCHAR(n) (Variable Character): Used for storing variable-length character strings. The (n) represents the maximum length of the string, ensuring that the stored data does not exceed a specified limit.
  - TEXT: Used for storing longer text or character data. It's suitable for holding textual content that can vary in length. Unlike VARCHAR, which has a specified maximum length, TEXT allows for storing larger and more flexible text content.
- <constraints>: Constraints you can apply to columns. Common constraints include:
  - NOT NULL: Ensures that a value must be provided for the column in every row.
  - UNIQUE: Guarantees that each value in the column is unique across all rows in the table.
  - PRIMARY KEY: Designates a column as the primary key, providing a unique identifier for each row.
By combining these elements, you define the table’s structure and ensure data consistency and uniqueness. This structured approach to table creation is fundamental to relational databases and is a key part of database design. Keep in mind a database can have multiple tables.
Now, let’s proceed with creating our “dataset” table in “metadata” database.
CREATE TABLE dataset (
id INT AUTO_INCREMENT PRIMARY KEY,
filename VARCHAR(255) NOT NULL UNIQUE,
run_number INT NOT NULL,
total_event INT NOT NULL,
collision_type TEXT,
data_type TEXT,
collision_energy INT NOT NULL
);
Notice we have defined id as the primary key with the AUTO_INCREMENT attribute. This means that the id column will automatically generate a unique value for each new record added to the table. PRIMARY KEY means that the id column will be the unique identifier for each record in the table. Additionally, a primary key enhances the performance of cross-referencing between tables (we will look at this in more detail in the next episode).
You can see the table and corresponding columns by using the command
DESCRIBE dataset;
+------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| filename | varchar(255) | NO | UNI | NULL | |
| run_number | int | NO | | NULL | |
| total_event | int | NO | | NULL | |
| collision_type | text | YES | | NULL | |
| data_type | text | YES | | NULL | |
| collision_energy | int | NO | | NULL | |
+------------------+--------------+------+-----+---------+----------------+
INSERT record into table
You can use the INSERT INTO command to add records to a table. This command has the structure INSERT INTO <table_name> (<column_names>) VALUES (<column_values>).
Here’s an example of inserting data into the “dataset” table:
INSERT INTO dataset (filename, run_number, total_event, collision_type, data_type, collision_energy)
VALUES ("expx.myfile1.root", 100, 1112, "pp", "data", 11275);
INSERT INTO dataset (filename, run_number, total_event, collision_type, data_type, collision_energy)
VALUES ("expx.myfile2.root", 55, 999, "pPb", "mc", 1127);
Search record in the table
The SELECT command allows you to retrieve records from a table. To retrieve all records from the “dataset” table, you can use:
SELECT * FROM dataset;
mysql> SELECT * FROM dataset;
+----+-------------------+------------+-------------+----------------+-----------+------------------+
| id | filename | run_number | total_event | collision_type | data_type | collision_energy |
+----+-------------------+------------+-------------+----------------+-----------+------------------+
| 1 | expx.myfile1.root | 100 | 1112 | pp | data | 11275 |
| 2 | expx.myfile2.root | 55 | 999 | pPb | mc | 1127 |
+----+-------------------+------------+-------------+----------------+-----------+------------------+
You can select specific columns by listing them after the SELECT statement:
SELECT filename FROM dataset;
+-------------------+
| filename |
+-------------------+
| expx.myfile1.root |
| expx.myfile2.root |
+-------------------+
2 rows in set (0.00 sec)
Search with some condition
To filter data based on certain conditions, you can use the WHERE clause. This allows you to filter rows based on conditions that you specify. For example, to select filenames where the “collision_type” is ‘pp’:
SELECT filename FROM dataset WHERE collision_type='pp';
In addition, you can use logical operators such as AND and OR to combine multiple conditions in the WHERE clause.
SELECT filename FROM dataset WHERE run_number > 50 AND collision_type='pp';
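For instance, a sketch of an OR condition on the same dataset table, matching rows where either condition holds:
SELECT filename FROM dataset WHERE collision_type='pp' OR data_type='mc';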
SELECT on different condition
Get the filenames for which total_event > 1000 and data_type is "data".
Solution
SELECT filename FROM dataset WHERE total_event > 1000 AND data_type='data';
+-------------------+
| filename          |
+-------------------+
| expx.myfile1.root |
+-------------------+
1 row in set (0.00 sec)
UPDATE
The UPDATE command is used to make changes to existing records.
For example, if you want to update the “collision_type” and “collision_energy” for a specific record, you can use:
UPDATE dataset
SET collision_type = 'PbPb', collision_energy = 300
WHERE filename = 'expx.myfile1.root';
Update on a condition
Update the total_event of file "expx.myfile2.root" to 800.
Solution
UPDATE dataset SET total_event = 800 WHERE filename = 'expx.myfile2.root';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
DELETE
The DELETE command is used to remove a row from a table.
For example, to delete a record with a specific filename you can use:
DELETE FROM dataset
WHERE filename = 'expx.myfile2.root';
Be careful with UPDATE and DELETE without WHERE!
Very important: if you omit the WHERE clause in an UPDATE or DELETE statement, you will update or delete ALL records in the table! For example, the following command
DELETE FROM dataset;
will delete all records in the dataset table. This can have unintended consequences, so be cautious when using these commands.
Key Points
SQL commands are the building blocks for interacting with a MySQL database.
Creating a database and tables is the first step in structuring your data.
Inserting, updating, and deleting records allows you to manage data effectively.
03 Break
Overview
Teaching: 0 min
Exercises: 15 min
Questions
Get up, stretch out, and dance!
Objectives
Refresh your mind!
Key Points
SQLAlchemy and MySQL: Exercises
Overview
Teaching: 5 min
Exercises: 6 min
Questions
How to perform CRUD operations using SQLAlchemy?
How to query and filter records in SQLAlchemy?
Objectives
Practice inserting records into a MySQL database using SQLAlchemy.
Perform queries and filtering on the dataset table.
Update and delete records in the dataset table.
Why Python with SQL?
SQL is a perfectly designed language to specify database operations in a declarative, record-centred way, but this very design makes it unfamiliar to programmers used to writing object-oriented, imperative or functional code. Worse, good performance in SQL can depend on the specifics of the database engine we're interacting with - MySQL, PostgreSQL and so on all have their own ideal performance approaches and extensions. Finally, it's easy to write inadvertently insecure SQL queries by forgetting to sanitise and quote inputs. As a consequence, it's much nicer - and safer - for us to use an interface allowing us to write object-oriented, functional implementations of the data manipulation we want, and have the interface "convert" this into efficient SQL queries behind the scenes. As a bonus, the abstraction layer can easily be switched to point at a different backend database with a simple config change, without us needing to tweak our (in this case, Python) code itself.
SQLAlchemy
SQLAlchemy is a powerful library that provides a high-level interface for interacting with databases, making database operations more Pythonic and abstracting SQL commands.
Let's create a new database for this chapter.
So as not to collide with the database created in the previous chapter, let's create a different database for this one called metadata2.
In another terminal, run the following command
docker exec -it metadata bash -c "mysql -uroot -pmypassword"
Then you will see the MySQL command prompt, mysql>. Use the following command to create a database named metadata2.
CREATE DATABASE metadata2;
Installation
Make sure you have Python on your system. Let's create a virtual environment and install SQLAlchemy. First, let's create a directory to work in:
mkdir myhsfwork && cd myhsfwork
Creating a virtual environment.
python -m venv venv
Activate the venv
source venv/bin/activate
Now install SQLAlchemy and other dependencies.
pip install sqlalchemy
pip install cryptography
pip install pymysql
pip install jupyter
Bring up Jupyter notebook
For ease of testing, we will use a Jupyter notebook to run the following commands. In your virtual environment, run the following command:
jupyter-notebook
Now, create a new Python notebook and use it for the subsequent commands.
Setting Up the Database Connection:
SQLAlchemy facilitates database connections in Python by using an Engine, which acts as the interface between the application and the database. The Engine manages database communication, executing SQL commands, and transaction handling. It requires a connection URL to specify the database type, credentials, and other connection details, which allows SQLAlchemy to translate its functions to whatever database configuration we want to interact with. Sessions, managed by sessionmaker, handle interactions between the application and the database, allowing for transactions, queries, and data manipulation in a structured manner.
Let’s import necessary things.
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy import Column, Integer, String, Text
Now let’s create a URL for our database connection. We need the following URL components:
- Dialect: Specifies the type of database being used (e.g., MySQL, PostgreSQL, SQLite). We use mysql.
- Driver: Identifies the library or driver used to interact with the database (e.g., PyMySQL for MySQL). We will use pymysql
- Username and Password: Credentials for accessing the database.
- Hostname and Port: Address and port number where the database server is located.
- Database Name: Name of the specific database to connect to.
So, the URL structure is : Dialect+driver://username:password@hostname:port/databaseName
or, if you use Apptainer and a socket file is: Dialect+driver://username:password@localhost/databaseName?unix_socket=filePath
And we create an engine using this db_url.
# Define the MySQL database connection URL
db_url = "mysql+pymysql://root:mypassword@localhost:3306/metadata2"
# if using Apptainer uncomment the next line to define the URL using the socket file:
# db_url = "mysql+pymysql://root:mypassword@localhost/metadata2?unix_socket=/var/run/mysqld/mysql.sock"
# Create an SQLAlchemy engine
engine = create_engine(db_url)
Session for each connection
sessionmaker is a factory function that creates a session factory in SQLAlchemy. Sessions are used to interact with a database in SQLAlchemy, providing a way to persist, retrieve, and manipulate data. sessionmaker generates a configurable factory for creating sessions, allowing you to customize settings such as autocommit behavior, autoflush, and more. Sessions obtained from the session factory represent a single “unit of work” with the database, encapsulating a series of operations that should be treated atomically. Let’s open a session that will be used to do DB operation from python to sql using the Engine that we created.
Session = sessionmaker(bind=engine)
session = Session()
Define a base for declarative class
Declarative Base is a factory function from SQLAlchemy that generates a base class for declarative data models. This base class allows you to define database tables as Python classes, making it easier to define models by mapping Python classes to database tables. The classes created with declarative_base() typically inherit from this base class, and they include table schema definitions as class attributes.
Base = declarative_base()
Define and Create a Table
Now we will define the table named dataset2.
Define the dataset table
Here we have a column named id defined as an Integer type, which serves as the primary key for the table. It auto-increments, meaning its value automatically increases for each new row added to the table.
We set the filename column as unique so that there is no duplication of filenames in the table.
If the option nullable is set to False, the column must have a value.
class Dataset(Base):
__tablename__ = "dataset2"
id = Column(Integer, primary_key=True, autoincrement=True)
filename = Column(String(255), unique=True, nullable=False)
run_number = Column(Integer, nullable=False)
total_event = Column(Integer, nullable=False)
collision_type = Column(Text)
data_type = Column(Text)
collision_energy = Column(Integer, nullable=False)
Create the table
The following code Base.metadata.create_all(engine)
is an SQLAlchemy command that instructs the engine to create database tables based on the defined models (such as Dataset) mapped to SQLAlchemy classes (derived from Base) within the application. This command generates the corresponding SQL statements to create tables in the specified database (referenced by engine) based on the model definitions provided in the code.
Base.metadata.create_all(engine)
Insert record
Let's insert two records into the table.
dataset1 = Dataset(
filename="expx.myfile1.root",
run_number=100,
total_event=1112,
collision_type="pp",
data_type="data",
collision_energy=11275,
)
dataset2 = Dataset(
filename="expx.myfile2.root",
run_number=55,
total_event=999,
collision_type="pPb",
data_type="mc",
collision_energy=1127,
)
session.add(dataset1)
session.add(dataset2)
session.commit()
session.commit() is a command that effectively saves the changes made within the session to the database. This action persists the changes permanently in the database, making the additions to the table permanent and visible for subsequent transactions.
Search the database.
session.query() is used to create a query object that represents a request for data from the database. In this case, session.query(Dataset.filename) selects the filename column from the Dataset table. The .all() method executes the query and retrieves all the values from the filename column, returning a list of results containing these values from the database - it's like "collect()" on iterator types, if you're familiar with those.
# Query the filename column from the dataset table
results = session.query(Dataset.filename).all()
# Print the results
for result in results:
print(result.filename)
Search on different column
Retrieve and display all collision_type values.
Solution
results = session.query(Dataset.collision_type).all()
for result in results:
    print(result.collision_type)
pp
pPb
Search the database with condition.
In SQLAlchemy, the filter() method is used within a query() to add conditions or criteria to the query. It narrows down the selection by applying specific constraints based on the given criteria.
results1 = session.query(Dataset.filename).filter(Dataset.collision_type == "pp").all()
# Print the results for the first query
print("Filenames where collision_type is 'pp':")
for result in results1:
print(result.filename)
results2 = (
session.query(Dataset.filename)
.filter(Dataset.run_number > 50, Dataset.collision_type == "pp")
.all()
)
# Print the results for the second query
print("\nFilenames where run_number > 50 and collision_type is 'pp':")
for result in results2:
print(result.filename)
Search using OR statement
Retrieve and display filenames with "mc" data_type or collision_energy > 1000.
Solution
results = (
    session.query(Dataset.filename)
    .filter((Dataset.data_type == "mc") | (Dataset.collision_energy > 1000))
    .all()
)
for result in results:
    print(result.filename)
expx.myfile1.root
expx.myfile2.root
Update the database
To update a record in a table, you begin by querying for the specific record or records you want to update using query(). The filter() method is used to specify the conditions for the selection. Once you have the record(s) to update, modify the attributes as needed. Finally, calling session.commit() saves the changes made to the database. This ensures that the modifications are persisted permanently.
# Update a record in the table
record_to_update = (
session.query(Dataset).filter(Dataset.filename == "expx.myfile1.root").first()
)
if record_to_update:
record_to_update.collision_type = "PbPb"
record_to_update.collision_energy = 300
session.commit()
Update
Update run_number to 1000 for the record with data_type mc.
Solution
record_to_update = (
    session.query(Dataset).filter(Dataset.data_type == "mc").first()
)
if record_to_update:
    record_to_update.run_number = 1000
    session.commit()
Delete a record
Basically the same: we first need to get the record to delete using query and filter, then delete the record and commit to persist the changes.
# Delete a record from the table
record_to_delete = (
session.query(Dataset).filter(Dataset.filename == "expx.myfile2.root").first()
)
if record_to_delete:
session.delete(record_to_delete)
session.commit()
Close the session
It’s essential to close the session after you’ve finished working with it to release the resources associated with it, such as database connections and transactional resources. By closing the session, you ensure that these resources are properly released, preventing potential issues like resource leakage and maintaining good database connection management practices.
session.close()
Key Points
CRUD operations in SQLAlchemy: Create, Read, Update, Delete.
Querying and filtering records based on specific conditions.
A bit more on queries
Overview
Teaching: 60 min
Exercises: 30 min
Questions
How to perform SQL Join?
How to make use of different SQL joins?
Objectives
Retrieve data from multiple tables simultaneously when the data is related.
SQL JOIN
In SQL, JOIN operations are used to combine rows from two or more tables based on a common column. Different types of JOINs can be used depending on your needs.
In this tutorial, we’ll explore various JOIN types using the dataset and experiments tables.
Note: If you haven’t created these tables in your database yet, please refer to the basic MySQL queries in the introduction tutorial to set them up.
Consider the following two tables (dataset and experiments) for our tutorial:
Example:
mysql> SELECT * FROM dataset;
+----+--------------------+------------+-------------+----------------+-----------+------------------+
| id | filename | run_number | total_event | collision_type | data_type | collision_energy |
+----+--------------------+------------+-------------+----------------+-----------+------------------+
| 1 | lhcb.myfile.root | 101 | 1234 | pp | data | 13400 |
| 2 | atlas.myfile.root | 202 | 5678 | pp | mc | 13000 |
| 3 | cms.myfile.root | 303 | 9101 | pp | data | 13600 |
| 4 | alice.myfile.root | 404 | 1123 | Pb-Pb | mc | 55000 |
| 5 | belle2.myfile.root | 505 | 3141 | e⁺e⁻ | data | 10500 |
| 6 | babar.myfile.root | 606 | 7890 | e⁺e⁻ | data | 10000 |
+----+--------------------+------------+-------------+----------------+-----------+------------------+
mysql> SELECT * FROM experiments;
+----+-----------------+------+
| id | experiment_name | year |
+----+-----------------+------+
| 1 | LHCb | 2009 |
| 2 | ATLAS | 2008 |
| 3 | CMS | 2008 |
| 4 | ALICE | 2010 |
| 5 | BELLE2 | 2018 |
+----+-----------------+------+
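If the experiments table does not exist in your database yet, one possible way to create and fill it so that it matches the output above (the schema is an assumption; the values are read off that output) is:
CREATE TABLE experiments (
    id INT AUTO_INCREMENT PRIMARY KEY,
    experiment_name VARCHAR(255) NOT NULL,
    year INT
);
INSERT INTO experiments (experiment_name, year) VALUES
    ('LHCb', 2009),
    ('ATLAS', 2008),
    ('CMS', 2008),
    ('ALICE', 2010),
    ('BELLE2', 2018);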
INNER JOIN
An INNER JOIN returns records that have matching values in both tables.
Example: We want to get the filename and experiment_name for each dataset if there’s a corresponding experiment.
We’ll join on the id column to find the matching records.
SELECT d.filename, e.experiment_name
FROM dataset d
INNER JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+
| filename | experiment_name |
+--------------------+-----------------+
| lhcb.myfile.root | LHCb |
| atlas.myfile.root | ATLAS |
| cms.myfile.root | CMS |
| alice.myfile.root | ALICE |
| belle2.myfile.root | BELLE2 |
+--------------------+-----------------+
One more example:
SELECT d.filename, e.experiment_name, d.total_event, d.collision_type, d.data_type, d.collision_energy
FROM dataset d
INNER JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+-------------+----------------+-----------+------------------+
| filename | experiment_name | total_event | collision_type | data_type | collision_energy |
+--------------------+-----------------+-------------+----------------+-----------+------------------+
| lhcb.myfile.root | LHCb | 1234 | pp | data | 13400 |
| atlas.myfile.root | ATLAS | 5678 | pp | mc | 13000 |
| cms.myfile.root | CMS | 9101 | pp | data | 13600 |
| alice.myfile.root | ALICE | 1123 | Pb-Pb | mc | 55000 |
| belle2.myfile.root | BELLE2 | 3141 | e⁺e⁻ | data | 10500 |
+--------------------+-----------------+-------------+----------------+-----------+------------------+
Note: INNER JOIN returns only the rows that have matching values in both tables.
- Features:
- Returns rows where there is a match in both tables.
- Only includes rows that have corresponding matches in the joined table.
LEFT JOIN
A LEFT JOIN returns all records from the left table, and the matched records from the right table. The result is NULL from the right side if there is no match.
Example:
SELECT d.filename, e.experiment_name
FROM dataset d
LEFT JOIN experiments e
ON d.id = e.id;
+--------------------+-----------------+
| filename | experiment_name |
+--------------------+-----------------+
| alice.myfile.root | ALICE |
| atlas.myfile.root | ATLAS |
| babar.myfile.root | NULL |
| belle2.myfile.root | BELLE2 |
| cms.myfile.root | CMS |
| lhcb.myfile.root | LHCb |
+--------------------+-----------------+
- Features:
- Returns all rows from the left table (dataset), including unmatched rows.
- Shows NULL for columns from the right table (experiments) if there is no match.
RIGHT JOIN
A RIGHT JOIN returns all records from the right table, and the matched records from the left table. The result is NULL from the left side when there is no match.
Example:
SELECT e.experiment_name, d.filename
FROM experiments e
RIGHT JOIN dataset d
ON d.id = e.id;
+-----------------+--------------------+
| experiment_name | filename |
+-----------------+--------------------+
| ALICE | alice.myfile.root |
| ATLAS | atlas.myfile.root |
| NULL | babar.myfile.root |
| BELLE2 | belle2.myfile.root |
| CMS | cms.myfile.root |
| LHCb | lhcb.myfile.root |
+-----------------+--------------------+
- Features:
- Returns all rows from the right table (experiments), including unmatched rows.
- Shows NULL for columns from the left table (dataset) if there is no match.
FULL OUTER JOIN
A FULL OUTER JOIN returns all records when there is a match in either left or right table. This type of join is not directly supported in MySQL, but you can achieve it using UNION.
Example:
SELECT d.filename, e.experiment_name
FROM dataset d
LEFT JOIN experiments e
ON d.id = e.id
UNION
SELECT d.filename, e.experiment_name
FROM experiments e
LEFT JOIN dataset d
ON d.id = e.id;
+--------------------+-----------------+
| filename | experiment_name |
+--------------------+-----------------+
| alice.myfile.root | ALICE |
| atlas.myfile.root | ATLAS |
| babar.myfile.root | NULL |
| belle2.myfile.root | BELLE2 |
| cms.myfile.root | CMS |
| lhcb.myfile.root | LHCb |
+--------------------+-----------------+
- Features:
- Combines results from both LEFT JOIN and RIGHT JOIN.
- Includes all rows from both tables, showing NULL where there is no match.
Task: Joining Tables with Updated Schema
Scenario: You have updated the experiments table by adding a new run_number column, which matches some of the run_number values in the dataset table. You need to use this new column to retrieve combined data from both tables. Write an SQL query to retrieve the filename, total event count, and experiment name for each dataset, matching the datasets to the corresponding experiments using the run_number column.
Hint: Use an INNER JOIN to combine the dataset and experiments tables based on the run_number column
ALTER TABLE experiments ADD COLUMN run_number INT;
UPDATE experiments SET run_number = 101 WHERE experiment_name = 'LHCb';
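A possible solution (a sketch, assuming run_number has been added to experiments and filled in for each matching run, as in the statements above):
SELECT d.filename, d.total_event, e.experiment_name
FROM dataset d
INNER JOIN experiments e
ON d.run_number = e.run_number;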
Cross Join
A CROSS JOIN generates the Cartesian product of two tables, meaning every row from the first table is paired with every row from the second table. This can be useful for creating all possible combinations of rows, especially when crafting complex queries.
In MySQL, JOIN, CROSS JOIN, and INNER JOIN can often be used interchangeably. However, in standard SQL, they serve different purposes: INNER JOIN requires an ON clause to specify how tables should be matched, whereas CROSS JOIN creates a Cartesian product without any matching conditions.
To see how a CROSS JOIN works, let’s create an additional table named additional_info in the same database.
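One possible way to create and fill additional_info so that it matches the output below (the schema is an assumption; the values are taken from that output) is:
CREATE TABLE additional_info (
    id INT AUTO_INCREMENT PRIMARY KEY,
    info_name VARCHAR(255),
    run_number INT
);
INSERT INTO additional_info (info_name, run_number) VALUES
    ('Info A', 101),
    ('Info B', 202),
    ('Info C', 303),
    ('Info D', 404),
    ('Info E', 505);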
mysql> SELECT * FROM additional_info;
+----+-----------+------------+
| id | info_name | run_number |
+----+-----------+------------+
| 1 | Info A | 101 |
| 2 | Info B | 202 |
| 3 | Info C | 303 |
| 4 | Info D | 404 |
| 5 | Info E | 505 |
+----+-----------+------------+
mysql> SELECT e.experiment_name, i.run_number FROM experiments e CROSS JOIN additional_info i;
+-----------------+------------+
| experiment_name | run_number |
+-----------------+------------+
| BELLE2 | 101 |
| ALICE | 101 |
| CMS | 101 |
| ATLAS | 101 |
| LHCb | 101 |
| BELLE2 | 202 |
| ALICE | 202 |
| CMS | 202 |
| ATLAS | 202 |
| LHCb | 202 |
| BELLE2 | 303 |
| ALICE | 303 |
| CMS | 303 |
| ATLAS | 303 |
| LHCb | 303 |
| BELLE2 | 404 |
| ALICE | 404 |
| CMS | 404 |
| ATLAS | 404 |
| LHCb | 404 |
| BELLE2 | 505 |
| ALICE | 505 |
| CMS | 505 |
| ATLAS | 505 |
| LHCb | 505 |
+-----------------+------------+
Try Yourself!
The two queries below have the same output. Can you explain why?
SELECT *
FROM dataset d
LEFT JOIN (experiments e, additional_info a)
ON (e.run_number = d.run_number AND a.run_number = d.run_number);
SELECT *
FROM dataset d
LEFT JOIN (experiments e CROSS JOIN additional_info a)
ON (e.run_number = d.run_number AND a.run_number = d.run_number);
Enjoy Your SQL Journey!
Key Points
Performing different joins depending on the related column.
Conditions Database Example Using SQLAlchemy
Overview
Teaching: x min
Exercises: x min
Questions
Objectives
Conditions Database Example Using SQLAlchemy
This lesson demonstrates how to create a simple Conditions Database using SQLAlchemy in Python.
We will define three tables: GlobalTag
, PayloadType
, and PayloadIOV
, and establish relationships
between them. We will then add example data and query the database to retrieve specific entries.
Imports
First, we import the necessary modules from SQLAlchemy.
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, sessionmaker, relationship
Define ORM Models
We define our ORM models: GlobalTag
, PayloadType
, and PayloadIOV
, along with the necessary relationships.
from sqlalchemy.sql import func, and_
Database Connection
Next, we define the MySQL database connection URL and create an SQLAlchemy engine, session, and declarative base.
# Define the MySQL database connection URL
db_url = "mysql+pymysql://root:pass@localhost:3306/cdb"
# Create an SQLAlchemy engine
engine = create_engine(db_url)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()
Define Tables
We define all the tables in the database.
class GlobalTag(Base):
__tablename__ = "GlobalTag"
id = Column(Integer, primary_key=True, autoincrement=True)
name = Column(String(255), unique=True, nullable=False)
# Relationship to PayloadType
payload_types = relationship("PayloadType", back_populates="global_tag")
class PayloadType(Base):
__tablename__ = "PayloadType"
id = Column(Integer, primary_key=True, autoincrement=True)
name = Column(String(255), unique=True, nullable=False)
global_tag_id = Column(Integer, ForeignKey("GlobalTag.id"))
# Relationship to GlobalTag and PayloadIOV
global_tag = relationship("GlobalTag", back_populates="payload_types")
payload_iovs = relationship("PayloadIOV", back_populates="payload_type")
class PayloadIOV(Base):
__tablename__ = "PayloadIOV"
id = Column(Integer, primary_key=True, autoincrement=True)
payload_url = Column(String(255), nullable=False)
iov = Column(Integer, nullable=False)
payload_type_id = Column(Integer, ForeignKey("PayloadType.id"))
# Relationship to PayloadType
payload_type = relationship("PayloadType", back_populates="payload_iovs")
Create Tables
We create all the tables in the database.
# Create all tables in the database
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
Adding Example Data
We add some example data to the database for GlobalTag
, PayloadType
, and PayloadIOV
.
# Adding example data
global_tag = GlobalTag(name="DetectorConfiguration")
session.add(global_tag)
daq_payload_type = PayloadType(name="DAQSettings", global_tag=global_tag)
dcs_payload_type = PayloadType(name="DCSSettings", global_tag=global_tag)
session.add(daq_payload_type)
session.add(dcs_payload_type)
daq_payload_iovs = [
PayloadIOV(
payload_url="http://example.com/daq1", iov=1, payload_type=daq_payload_type
),
PayloadIOV(
payload_url="http://example.com/daq2", iov=2, payload_type=daq_payload_type
),
PayloadIOV(
payload_url="http://example.com/daq3", iov=3, payload_type=daq_payload_type
),
]
dcs_payload_iovs = [
PayloadIOV(
payload_url="http://example.com/dcs1", iov=1, payload_type=dcs_payload_type
),
PayloadIOV(
payload_url="http://example.com/dcs2", iov=2, payload_type=dcs_payload_type
),
PayloadIOV(
payload_url="http://example.com/dcs3", iov=3, payload_type=dcs_payload_type
),
]
session.add_all(daq_payload_iovs)
session.add_all(dcs_payload_iovs)
session.commit()
Query the Database
Finally, we query the database to get the latest PayloadIOV
entries for each PayloadType
for a specific GlobalTag
and IOV.
# Query to get the last PayloadIOV entries for each PayloadType for a specific GlobalTag and IOV
requested_iov = 2
requested_gt = "DetectorConfiguration"
# Subquery to find the maximum IOV for each PayloadType
subquery = (
session.query(PayloadIOV.payload_type_id, func.max(PayloadIOV.iov).label("max_iov"))
.join(PayloadType, PayloadType.id == PayloadIOV.payload_type_id)
.join(GlobalTag, GlobalTag.id == PayloadType.global_tag_id)
.filter(GlobalTag.name == requested_gt, PayloadIOV.iov <= requested_iov)
.group_by(PayloadIOV.payload_type_id)
.subquery()
)
# Main query to get the latest PayloadIOV for each PayloadType
query = (
session.query(
GlobalTag.name.label("global_tag_name"),
PayloadType.name.label("payload_type_name"),
PayloadIOV.payload_url,
subquery.c.max_iov,
)
.join(PayloadType, GlobalTag.id == PayloadType.global_tag_id)
.join(subquery, subquery.c.payload_type_id == PayloadType.id)
.join(
PayloadIOV,
and_(
PayloadType.id == PayloadIOV.payload_type_id,
subquery.c.max_iov == PayloadIOV.iov,
),
)
.filter(GlobalTag.name == requested_gt)
.order_by(PayloadType.name)
).all()
for global_tag_name, payload_type_name, payload_url, max_iov in query:
print(
f"GlobalTag: {global_tag_name}, PayloadType: {payload_type_name}, PayloadIOV URL: {payload_url}, IOV: {max_iov}"
)
GlobalTag: DetectorConfiguration, PayloadType: DAQSettings, PayloadIOV URL: http://example.com/daq2, IOV: 2
GlobalTag: DetectorConfiguration, PayloadType: DCSSettings, PayloadIOV URL: http://example.com/dcs2, IOV: 2
Key Points
07 Break
Overview
Teaching: 0 min
Exercises: 15 min
Questions
Get up, stretch out, and dance!
Objectives
Refresh your mind!
Key Points
SQLite
Overview
Teaching: 60 min
Exercises: 30 min
Questions
What is SQLite?
How do I create an SQLite database?
Objectives
Creating SQLite database
Manipulating the database
Introduction to SQLite
Similar to MySQL, SQLite is a database management system. However, it is simpler, more flexible, and more user-friendly than the former. It implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is a preferred choice for lightweight embedded databases. It is available in both C and Python; here, we will proceed with the Python implementation, which is user-friendly and ships with the Python standard library as the sqlite3 module.
Particle Data Group database
Create a working directory hsf_sqlite_training
:
mkdir hsf_sqlite_training
cd hsf_sqlite_training
Let’s create a database, that contains information from the Particle Data Group particle properties table.
First, we need to download the particle properties table from 2023.
You can do it manually and place it in your directory hsf_sqlite_training
under the name particle_table.txt
or, if you use Python 3.7+, you can download the file automatically with the Python module requests.
Additional: using requests
To check the Python version on your machine, do the following:
python3 --version
If the Python version is higher than 3.7, you can install requests with python -m pip install requests. Then you can create a script download_particle_table.py using your favorite code editor.
using your favorite code editor.import requests particle_table = "https://pdg.lbl.gov/2023/mcdata/mass_width_2023.txt" #url of the file we want response = requests.get(particle_table) #getting the response from the url. if response.status_code == 200: with open("particle_table.txt", "wb") as file: #writing the response into the txt file `particle_table.txt` locally file.write(response.content) print("Particle table is downloaded!") else: print("Failed to download the particle table.") #it can be that the server is down or you have a typo in the url
Save the download_particle_table.py script and execute it with python3 download_particle_table.py. You should see the downloaded file particle_table.txt in your working directory.
Open the particle_table.txt
in your favorite text editor and study the data structure inside.
You should see:
* MASSES, WIDTHS, AND MC ID NUMBERS FROM 2023 EDITION OF RPP
*
* The following values were generated on 31-May-2023 by the Berkeley Particle
* Data Group from the Review of Particle Physics database and are intended
* for use in Monte Carlo programs.
*
* For questions regarding distribution or content of this file, contact
* the Particle Data Group at pdg@lbl.gov.
*
* To process the images in this file:
* 1) ignore documentation lines that begin with an asterisk
* 2) in a FORTRAN program, process data lines with
* FORMAT (BN, 4I8, 2(1X,E18.0, 1X,E8.0, 1X,E8.0), 1X,A21)
* 3) column
* 1 - 8 \ Monte Carlo particle numbers as described in the "Review of
* 9 - 16 | Particle Physics". Charge states appear, as appropriate,
* 17 - 24 | from left-to-right in the order -, 0, +, ++.
* 25 - 32 /
* 33 blank
* 34 - 51 central value of the mass (double precision)
* 52 blank
* 53 - 60 positive error
* 61 blank
* 62 - 69 negative error
* 70 blank
* 71 - 88 central value of the width (double precision)
* 89 blank
* 90 - 97 positive error
* 98 blank
* 99 -106 negative error
* 107 blank
* 108 -128 particle name left-justified in the field and
* charge states right-justified in the field.
* This field is for ease of visual examination of the file and
* should not be taken as a standardized presentation of
* particle names.
*
* Particle ID(s) Mass (GeV) Errors (GeV) Width (GeV) Errors (GeV) Name Charges
21 0.E+00 +0.0E+00 -0.0E+00 0.E+00 +0.0E+00 -0.0E+00 g 0
22 0.E+00 +0.0E+00 -0.0E+00 0.E+00 +0.0E+00 -0.0E+00 gamma 0
24 8.0377E+01 +1.2E-02 -1.2E-02 2.08E+00 +4.0E-02 -4.0E-02 W +
23 9.11876E+01 +2.1E-03 -2.1E-03 2.4955E+00 +2.3E-03 -2.3E-03 Z 0
25 1.2525E+02 +1.7E-01 -1.7E-01 3.2E-03 +2.4E-03 -1.7E-03 H 0
...
Now that we have obtained the particle table, let's build the SQLite database!
Create the sqlite3 database
Firstly, we need to connect to the sqlite3 database. In a file called create_database.py, type in:
import sqlite3
connection = sqlite3.connect(
":memory:"
) # create or connect to database. Returns `Connection` object.
"""
Here one can perform operations on the data base, such as inserting, updating, deleting and selecting.
"""
connection.close() # close database connection
The Connection object represents the database that, in this case, is stored in RAM using the special name :memory:.
If you want to save the database locally, replace :memory: with <name>.db.
If the file <name>.db already exists, connect will simply connect to that file; if it does not exist, it will be created.
Fill in the sqlite3 database
Secondly, we need to fill in the database with inputs from the particle_table.txt
we have downloaded before.
To fill it we use SQL commands.
The SQL commands can be passed in using a Cursor object, which can be created once the Connection object exists.
The Cursor
object method execute
takes in the SQL commands, which allows one to work with databases directly in SQL.
As the first step, we create the database table based on the data structure of particle_table.txt and fill it in.
In create_database.py
:
import sqlite3
import pandas as pd
connection = sqlite3.connect(
":memory:"
) # create or connect to database. Returns `Connection` object.
cursor = connection.cursor()
cursor.execute(
"CREATE TABLE particles(id INTEGER, mass FLOAT, masserrlow FLOAT, masserrup FLOAT, width FLOAT, widtherrlow FLOAT, widtherr_up FLOAT, name TEXT, charge INTEGER)"
)
cursor.execute(
"INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)",
[21, 0.0e00, +0.0e00, -0.0e00, 0.0e00, +0.0e00, -0.0e00, "g", 0],
) # gluon is filled in
connection.commit() # commit to database
readout = pd.read_sql("SELECT * FROM particles", connection) # read from database
print(readout) # print the database
connection.close() # close database connection
In the execute call we have created a table called particles that contains a particle id (integer), mass (float), upper and lower mass errors (float), width (float), upper and lower width errors (float), particle name (TEXT) and charge (INTEGER).
Then, we fill in the information on gluon into our particles
table.
After that we commit the changes to the database with the Connection
object method commit
.
To read the database, we use pandas read_sql
function.
Now execute create_database.py
with python3 create_database.py
.
You should see:
id mass masserrlow masserrup width widtherrlow widtherr_up name charge
0 21 0.0 0.0 0.0 0.0 0.0 0.0 g 0
However, we would like to write down the entire list of the PDG particles into our database, not just a gluon!
For this, we will use executemany
, instead of execute
.
But before that, we need to prepare our input data in the particle_table.txt
.
First of all, the first 38 lines contain header information that we do not need in our database, so we will skip these rows when copying.
Secondly, the sizes of columns are different in the particle_table.txt
, but luckily the header specifies the exact sizes of the table columns:
* 3) column
* 1 - 8 \ Monte Carlo particle numbers as described in the "Review of
* 9 - 16 | Particle Physics". Charge states appear, as appropriate,
* 17 - 24 | from left-to-right in the order -, 0, +, ++.
* 25 - 32 /
* 33 blank
* 34 - 51 central value of the mass (double precision)
* 52 blank
* 53 - 60 positive error
* 61 blank
* 62 - 69 negative error
* 70 blank
* 71 - 88 central value of the width (double precision)
* 89 blank
* 90 - 97 positive error
* 98 blank
* 99 -106 negative error
* 107 blank
* 108 -128 particle name left-justified in the field and
* charge states right-justified in the field.
* This field is for ease of visual examination of the file and
* should not be taken as a standardized presentation of
* particle names.
You can check in your favorite text editor that the column definitions are correct.
Now particle_table.txt
contains more precise information than what we want to save in our database.
For example, we are not writing down the charge states that are written in columns 9 - 32.
We would like to skip these columns.
Moreover, columns 108 - 128 contain both Name and Charge, but we would like to split them into two columns in our database.
We then can prepare data as follows:
import pandas as pd
colspecs = [
(0, 8),
(33, 51),
(52, 60),
(61, 69),
(70, 88),
(89, 97),
(98, 106),
(107, 128),
]
column_names = [
"id",
"mass",
"masserrup",
"masserrdown",
"width",
"widtherrup",
"widtherrdown",
"namecharge",
]
# Read the file with the specified column widths
data = pd.read_fwf(
"particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)
data = data.values.tolist()
executemany
expects an iterable input.
Therefore we transform the pandas dataset to nested lists.
Now we can commit the entire particle table to the database.
cursor.executemany("INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)", data)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection) # read from database
print(readout) # print the database
connection.close()
Save the create_database.py script and execute it with python3 create_database.py. You should see the following output:
id mass masserrlow masserrup width widtherrlow \
0 21 0.00000 0.00000 0.00000 0.000000e+00 0.000000e+00
1 22 0.00000 0.00000 0.00000 0.000000e+00 0.000000e+00
2 24 80.37700 0.01200 -0.01200 2.080000e+00 4.000000e-02
3 23 91.18760 0.00210 -0.00210 2.495500e+00 2.300000e-03
4 25 125.25000 0.17000 -0.17000 3.200000e-03 2.400000e-03
.. ... ... ... ... ... ...
224 5114 5.83474 0.00030 -0.00030 1.040000e-02 8.000000e-04
225 5224 5.83032 0.00027 -0.00027 9.400000e-03 5.000000e-04
226 5132 5.79700 0.00060 -0.00060 4.190000e-13 1.100000e-14
227 5232 5.79190 0.00050 -0.00050 4.450000e-13 9.000000e-15
228 5332 6.04520 0.00120 -0.00120 4.000000e-13 5.000000e-14
widtherr_up name charge
0 0.000000e+00 g 0
1 0.000000e+00 gamma 0
2 -4.000000e-02 W +
3 -2.300000e-03 Z 0
4 -1.700000e-03 H 0
.. ... ... ...
224 -8.000000e-04 Sigma(b)* -
225 -5.000000e-04 Sigma(b)* +
226 -1.100000e-14 Xi(b) -
227 -9.000000e-15 Xi(b) 0
228 -4.000000e-14 Omega(b) -
[229 rows x 9 columns]
Creating an SQL database directly from a pandas dataframe
The pandas module has some useful methods for creating SQL databases. The example above can be replaced with:
import sqlite3
import pandas as pd
colspecs = [
(0, 8),
(33, 51),
(52, 60),
(61, 69),
(70, 88),
(89, 97),
(98, 106),
(107, 128),
]
column_names = [
"id",
"mass",
"masserrup",
"masserrdown",
"width",
"widtherrup",
"widtherrdown",
"namecharge",
]
data = pd.read_fwf(
"particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)
connection = sqlite3.connect(":memory:")
data.to_sql("particles", connection, index=False)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection) # read from database
print(readout) # print the database
connection.close()
Manipulating the database with SQLite
Before learning how to manipulate the database, let’s first save the database created in the previous steps, not in RAM, but in the file called particles.db
.
Replace the name of the database :memory:
with particles.db
and rerun create_database.py
.
You should see particles.db
in your directory.
Exercise
Open the
particles.db
file and create a table that contains all neutral particles.
Solution
cursor.execute("CREATE TABLE neutral_particles AS SELECT * from particles WHERE charge = '0'")
Exercise
Open the
particles.db
file and select only neutral leptons.
Solution
readout = pd.read_sql("SELECT * FROM neutral_particles WHERE name LIKE 'nu%'", connection)
Key Points
For lightweight applications, use SQLite.
Benefit from integration between sqlite3 and pandas.
Intro to NoSQL and Opensearch Queries
Overview
Teaching: x min
Exercises: 6 min
Questions
What is a NoSQL database and what is Opensearch?
How to perform indexing in Opensearch?
How to query and filter records in opensearch?
Objectives
Understand the basic structure of Opensearch queries.
Learn how to create and manage indices in Opensearch.
Practice using different types of queries such as term queries, range queries, and compound queries.
Gain familiarity with updating and deleting documents in Opensearch.
NoSQL Databases
NoSQL databases diverge from the traditional table-based structure of RDBMS and are designed to handle unstructured or semi-structured data. They offer flexibility in data modeling and storage, supporting various data formats. The main types of NoSQL databases are:
| NoSQL Database Type | Description | Examples |
|---|---|---|
| Key-Value Store | Stores data as key-value pairs. Simple and efficient for basic storage and retrieval operations. | Redis, DynamoDB, Riak |
| Document-Oriented | Stores data in flexible JSON-like documents, allowing nested structures and complex data modeling. | MongoDB, Couchbase, CouchDB, OpenSearch, Elasticsearch |
| Column-Family Store | Organizes data into columns rather than rows, suitable for analytical queries and data warehousing. | Apache Cassandra, HBase, ScyllaDB |
| Graph Database | Models data as nodes and edges, ideal for complex relationships and network analysis. | Neo4j, ArangoDB, OrientDB |
| Wide-Column Store | Similar to column-family stores but optimized for wide rows and scalable columnar data storage. | Apache HBase, Apache Kudu, Google Bigtable |
Opensearch Databases
Opensearch is a kind of NoSQL database that is document-oriented. It stores data as JSON documents. It is also a distributed search and analytics engine designed for scalability, real-time data processing, and full-text search capabilities. It is often used for log analytics, monitoring, and exploring large volumes of structured and unstructured data.
In the following chapters, we will build a metadata search engine/database. We will exploit the functionality of OpenSearch to create a database where we can store files with their corresponding metadata, and look for the files that match metadata queries.
Opensearch Queries
Let's explore fundamental Opensearch queries and concepts. Opensearch provides powerful search capabilities. Here are some core Opensearch queries that you'll use:
- Create an Index: Create a new index.
- Create a Mapping: Define the data structure for your documents.
- Index Documents: Insert new documents into an index.
- Search Documents: Retrieve data from an index.
- Update Documents: Modify existing data in documents.
- Delete Documents: Remove documents from an index.
Setting up
Make sure you have Python on your system. Let's create a virtual environment. First, let's create a directory to work in:
mkdir myopenhsfwork && cd myopenhsfwork
Creating a virtual environment.
python -m venv venv
Activate the venv
source venv/bin/activate
Install Jupyter and the OpenSearch Python client (opensearch-py):
pip install jupyter
pip install opensearch-py
Then bring up the Jupyter notebook. In your virtual environment, run the following command:
jupyter-notebook
Now create a new Python notebook and start running the subsequent commands.
OpenSearch connection
We will use the OpenSearch class from opensearchpy to establish the connection and initialize the OpenSearch client. We need to specify OPENSEARCH_HOST and OPENSEARCH_PORT, which we chose during setup, i.e. localhost and 9200 respectively. We write OPENSEARCH_USERNAME and OPENSEARCH_PASSWORD (the same ones you specified during setup) directly in the code here for the tutorial only; don't store credentials in code. Two other options control transport security: use_ssl tells the OpenSearch client whether to use SSL/TLS (Secure Sockets Layer / Transport Layer Security), and verify_certs controls whether the client verifies the SSL certificate presented by the server. For this tutorial verify_certs is set to False; for a production instance, set it to True.
from opensearchpy import OpenSearch
OPENSEARCH_HOST = "localhost"
OPENSEARCH_PORT = 9200
OPENSEARCH_USERNAME = "admin"
OPENSEARCH_PASSWORD = "<custom-admin-password>"
# Initialize an Opensearch client
es = OpenSearch(
hosts=[{"host": OPENSEARCH_HOST, "port": OPENSEARCH_PORT}],
http_auth=(OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD),
use_ssl=True,
verify_certs=False,
)
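Before moving on, it's worth checking that the client can actually reach the cluster. As a quick, optional sanity check (assuming the OpenSearch instance from the setup step is running on localhost:9200), you can ask the server for its basic information:
# Quick sanity check: raises an exception if the cluster is unreachable
info = es.info()
print("Connected to cluster '%s' running version %s" % (info["cluster_name"], info["version"]["number"]))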
Create an index
An index is a logical namespace that holds a collection of documents. It defines the schema or structure of the documents it contains, including the fields and their data types. A mapping is the definition of how fields and data types are structured within the documents stored in an index: it is the schema or blueprint for documents, specifying the characteristics of each field such as data type, indexing options, analysis settings, and more. If no mapping is provided, OpenSearch infers one automatically. We will define a mapping for the metadata attributes. For strings there are two data type options: the "keyword" type is used for exact matching and filtering, and the "text" type is used for full-text search and analysis.
index_name = "metadata"
# Define mappings for the index
mapping = {
"properties": {
"filename": {"type": "text"},
"run_number": {"type": "integer"},
"total_event": {"type": "integer"},
"collision_type": {"type": "keyword"},
"data_type": {"type": "keyword"},
"collision_energy": {"type": "integer"},
"description": {"type": "text", "analyzer": "standard"},
}
}
# define setting of index
index_body = {"settings": {"index": {"number_of_shards": 1}}, "mappings": mapping}
# create the index
es.indices.create(index=index_name, body=index_body)
# check if the index exists
exists = es.indices.exists(index=index_name)
if exists:
print("Successfully created index %s" % index_name)
Index documents
We will index four documents into the "metadata" index. Each document represents a file with fields like filename, run number, total events, collision type, data type, collision energy, and description.
The _op_type parameter in the bulk indexing operation specifies the operation type for each document. Here we use "create", which creates a new document only if it doesn't already exist; if a document with the same ID already exists, the "create" operation fails. The other common operation type is "index", which overwrites the document if it already exists.
document1 = {
"filename": "expx.myfile1.root",
"run_number": 100,
"total_event": 1112,
"collision_type": "pp",
"data_type": "data",
"collision_energy": 250,
"description": "This file is produced with L1 and L2 trigger.",
}
document2 = {
"filename": "expx.myfile2.root",
"run_number": 55,
"total_event": 999,
"collision_type": "pPb",
"data_type": "mc",
"collision_energy": 100,
"description": "This file is produced without beam background.",
}
document3 = {
"filename": "expx.myfile3.root",
"run_number": 120,
"total_event": 200,
"collision_type": "PbPb",
"data_type": "data",
"collision_energy": 150,
"description": "This file is produced without cherenkov detector",
}
document4 = {
"filename": "expx.myfile4.root",
"run_number": 360,
"total_event": 1050,
"collision_type": "pPb",
"data_type": "mc",
"collision_energy": 50,
"description": "This file is produced with beam background",
}
documents = [document1, document2, document3, document4]
print("Total number of documents to be indexed indexed: %s" % str(len(documents)))
There are two ways to index documents: one by one, or in bulk.
Synchronous indexing
for document in documents:
_id = document["filename"]
res = es.index(index=index_name, id=_id, body=document, op_type="create")
print(res)
Bulk Indexing
The bulk function from the OpenSearch Python client is used to perform bulk indexing operations, which is more efficient than indexing documents one by one.
actions = []
duplicates = []
for document in documents:
_id = document["filename"]
# Check if document exists already
if es.exists(index=index_name, id=_id):
duplicates.append(document)
else:
actions.append(
{
"_index": index_name,
"_id": _id,
"_source": document,
"_op_type": "create",
}
)
from opensearchpy import helpers
res = helpers.bulk(es, actions)
print("Total Number of successfully indexed documents: %s" % (str(res[0])))
Search for documents
Term Level Query
A term-level query is a type of query used to search for documents that contain an exact term or value in a specific field. It can be applied to keyword and integer data types. This lets you search for documents by a single field (i.e. by a single metadata attribute in our case).
Term Query
Query structure looks like:
{
"query": {"term": {<field>: <value>}}
}
search_query = {"query": {"term": {"collision_type": "pp"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Search for the filename of documents with data_type mc.
Retrieve and display the filename
Solution
search_query = {"query": {"term": {"data_type": "mc"}}} search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
expx.myfile2.root
expx.myfile4.root
Range Query
It is also a term level query where we can apply a range of values to a field/metadata. Query structure looks like:
{
"query": {
"range": {
<field>: {
"gte": <lower_value>,
"lte": <upper_value>
}
}
}
}
Here we have a choice of operators: gte for greater than or equal to, gt for greater than, lte for less than or equal to, and lt for less than.
Let's get the documents with run_number between 60 and 150, both inclusive.
search_query = {"query": {"range": {"run_number": {"gte": 60, "lte": 150}}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Search for the filename of all documents whose collision_energy ranges from 100 to 200 (both exclusive).
Retrieve and display filename with range query
Solution
search_query = { "query": { "range": { "collision_energy": { "gt": 100, "lt": 200 } } } } search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
expx.myfile3.root
Prefix Query
Another term-level query is the prefix query. As the name suggests, it searches for terms with a specific prefix. The format of this query is:
{
"query": {
"prefix": {
<field>: {
"value": <prefix>
}
}
}
}
Let's get the documents whose collision_type has the prefix "p".
search_query = {"query": {"prefix": {"collision_type": {"value": "p"}}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Compound Query
There are different subcategories of compound queries, but here we will focus on the boolean query. Boolean queries allow you to combine multiple conditions to filter and retrieve documents that meet specific criteria.
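Before looking at each clause type, it helps to see the general shape of a boolean query: the must, should, and must_not clauses are all optional and each takes one or more sub-queries. The sketch below is only an illustrative skeleton, reusing fields from the documents indexed earlier; the sections that follow show each clause in action.
# Illustrative skeleton of a bool query (all three clause types are optional)
search_query = {
    "query": {
        "bool": {
            "must": [{"term": {"data_type": "data"}}],          # all clauses must match (AND)
            "should": [{"term": {"collision_type": "pp"}}],     # alone: at least one should match (OR); with must: mainly boosts scores
            "must_not": [{"term": {"collision_energy": 250}}],  # matching documents are excluded (NOT)
        }
    }
}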
Must Query
The must query combines multiple conditions and finds documents that match all specified criteria. This is equivalent to the AND operator.
Let's get the filename of documents with collision_type pp and data_type data.
search_query = {
"query": {
"bool": {
"must": [
{"term": {"collision_type": "pp"}},
{"term": {"data_type": "data"}},
]
}
}
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Search for the filename of documents with data_type data and collision_energy 150.
Retrieve and display the filename
Solution
search_query = { "query": { "bool": { "must": [ { "term": {"data_type": "data"} }, { "term": { "collision_energy": 150 } } ] } } } search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
expx.myfile3.root
Should Query
The should query searches for documents that match any of the specified conditions. This is equivalent to the OR operator.
Let's get the filename of documents with collision_type pp or PbPb.
search_query = {
"query": {
"bool": {
"should": [
{"term": {"collision_type": "pp"}},
{"term": {"collision_energy": 1127}},
]
}
}
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Search for the filename of documents with run_number 55 or collision_energy 150.
Retrieve and display the filename
Solution
search_query = { "query": { "bool": { "should": [ { "term": {"run_number": 55} }, { "term": { "collision_energy": 150 } } ] } } } search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
expx.myfile2.root
expx.myfile3.root
Must Not Query
The must_not query excludes documents that match the specified condition. This is equivalent to the NOT operator. Let's get the documents that do not have collision_energy 250.
search_query = {"query": {"bool": {"must_not": [{"term": {"collision_energy": 250}}]}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Search for the filename of all documents whose run_number is not 55.
Retrieve and display the filename
Solution
search_query = { "query": { "bool": { "must_not": [ { "term": {"run_number": 55} } ] } } } search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
expx.myfile1.root
expx.myfile3.root
expx.myfile4.root
Search for the filename of all documents that must have total_event greater than 200 and run_number greater than 50, should have collision_type PbPb, and must not have collision_energy 150.
Retrieve and display the filename, combining must, should and must_not queries.
Solution
search_query = { "query": { "bool": { "must": [ { "range": { "total_event": { "gt": 200 } } }, { "range": { "run_number": { "gt": 50 } } } ], "should": { "term": { "collision_type": "PbPb" } }, "must_not": { "term": { "collsiion_energy": 150 } } } } } search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
expx.myfile1.root
expx.myfile2.root
expx.myfile4.root
Update a document by filename
Let's update a field of a document. We can update a specific document by its document ID (_id). The following updates the "data_type" field of the document with ID "expx.myfile1.root" to "deriv".
_id = "expx.myfile1.root"
data = {"data_type": "deriv"}
es.update(index=index_name, id=_id, body={"doc": data})
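To confirm the change, you can (optionally) fetch the document back by its ID with the same client:
# Fetch the updated document and check the modified field
updated = es.get(index=index_name, id=_id)
print(updated["_source"]["data_type"])  # expected: 'deriv'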
Delete a document by filename (-> _id)
Let's delete a document by its document ID (which is the filename in our case):
_id = "expx.myfile1.root"
es.delete(index=index_name, id=_id)
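As an optional check, es.exists should now report that the document is gone:
# The document should no longer exist after deletion
print(es.exists(index=index_name, id=_id))  # expected: False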
Key Points
OpenSearch queries can be used to search, update, and delete documents in an OpenSearch index.
Indices in OpenSearch define the structure and mapping of documents.
Term queries match exact terms or values in a specific field.
Range queries match documents within a specified range of values.
Compound queries combine multiple conditions using boolean logic.
OpenSearch Text Based Queries
Overview
Teaching: x min
Exercises: 2 minQuestions
How do you perform text-based search in OpenSearch?
What are the ways to do text-based search in OpenSearch?
Objectives
Understand the fundamental query types in OpenSearch for text-based searches.
Learn how to construct and execute various text-based queries using OpenSearch.
Text Based Queries
Let's first understand why OpenSearch has advantages for full-text search compared to MySQL (SQL).
MySQL/SQL Limitations:
- Relational Structure: MySQL is optimized for structured, relational data, not large-scale text search.
- Full-Text Search: MySQL offers FULLTEXT indexes but is slower for full-text search, as it lacks advanced text analysis and efficient indexing for unstructured data.
- Row-Based Indexing: It indexes rows, requiring more resources to scan large text fields.
OpenSearch (NoSQL) Advantages:
- Inverted Index: OpenSearch uses an inverted index, making text search faster by indexing individual terms rather than rows (a small sketch of the idea follows after this list).
- Scalability: OpenSearch is built for horizontal scaling, distributing data and queries across nodes.
- Text Processing: It has built-in analyzers (tokenization, stemming), making it ideal for fast, accurate full-text search.
- Real-Time: OpenSearch excels at handling dynamic, real-time searches across large datasets.
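To illustrate the inverted-index idea from the first bullet above, here is a tiny, purely pedagogical sketch in plain Python (this is not OpenSearch code): instead of scanning every row, the engine keeps a map from each term to the documents containing it, so looking up a word becomes a single dictionary access.
# Build a toy inverted index over two of the description fields used in this lesson
docs = {
    "expx.myfile1.root": "This file is produced with L1 and L2 trigger.",
    "expx.myfile2.root": "This file is produced without beam background.",
}
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.lower().rstrip(".").split():
        inverted_index.setdefault(term, set()).add(doc_id)

# Finding documents containing "beam" is now a direct lookup, not a scan
print(inverted_index.get("beam", set()))  # {'expx.myfile2.root'}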
OpenSearch is a powerful search and analytics engine that excels in handling text-based queries efficiently. Understanding how to construct and use text-based queries in OpenSearch is crucial for effective data retrieval and analysis.
This section will delve into the concepts and techniques involved in OpenSearch text-based queries.
Match Queries
Match Phrase Query
Search for documents containing an exact phrase in the description field. Structure of query is:
{
"query": {
"match_phrase": {
"<field>": "<phrase>"
}
}
}
Let's search for document(s) with the exact phrase "without beam background" in the description field.
search_query = {"query": {"match_phrase": {"description": "without beam background"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Search for documents with the exact phrase "without cherenkov detector".
Retrieve documents with match phrase query.
Solution
search_query = { "query": { "match_phrase": { "description": "without cherenkov detector" } } } search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
{'filename': 'expx.myfile3.root', 'run_number': 120, 'total_event': 200, 'collision_type': 'PbPb', 'data_type': 'data', 'collision_energy': 150, 'description': 'This file is produced without cherenkov detector'}
Match query
The match query is a basic query type in OpenSearch used to search for documents containing specific words or phrases in a specified field, such as the description field. The query structure is:
{
    "query": {
        "match": {
            "<field>": "<text>"
        }
    }
}
Let's search for documents containing the words "without" or "beam" in the description field. Here it looks for documents containing either of the words.
search_query = {"query": {"match": {"description": "without beam"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
You should see three documents, with filenames expx.myfile2.root, expx.myfile3.root and expx.myfile4.root, as these documents contain either of the words "without" or "beam".
You can also add the operator "and" to the query so that all the words must be present in the field:
{
    "query": {
        "match": {
            "<field>": {
                "query": "<text>",
                "operator": "and"
            }
        }
    }
}
For example, to get the documents containing both the words "beam" and "cherenkov", you would do the following (in our example data no description contains both words, so this query returns no hits):
search_query = {
"query": {"match": {"description": {"query": "beam cherenkov", "operator": "and"}}}
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"])
Search for documents with the words "cherenkov" or "trigger".
Retrieve documents with a match query.
Solution
search_query = { "query": { "match": { "description": "cherenkov trigger" } } } search_results = es.search(index=index_name, body=search_query) for hit in search_results["hits"]["hits"]: print(hit["_source"]["filename"])
{'filename': 'expx.myfile3.root', 'run_number': 120, 'total_event': 200, 'collision_type': 'PbPb', 'data_type': 'data', 'collision_energy': 150, 'description': 'This file is produced without cherenkov detector'}
{'filename': 'expx.myfile1.root', 'run_number': 100, 'total_event': 1112, 'collision_type': 'pp', 'data_type': 'data', 'collision_energy': 250, 'description': 'This file is produced with L1 and L2 trigger.'}
Wildcard Query (query_string)
Wildcard queries are used to search for documents based on patterns or partial matches within a field. In the example below, a query_string query with a wildcard is used to search for documents where the description field contains a term starting with "L", i.e. the L1/L2 trigger levels.
search_query = {
"query": {"query_string": {"default_field": "description", "query": "L*"}}
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"]["description"])
You should see single document with filename expx.myfile1.root.
Prefix Query:
Prefix queries are used to search for documents where a specified field starts with a specific prefix. For example, the prefix query below searches for documents where the filename field starts with “expx.”:
search_query = {"query": {"prefix": {"filename": "expx."}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["filename"])
This query will match documents whose filename starts with "expx.", such as "expx.myfile1.root", "expx.myfile2.root", and so on.
Fuzzy Query:
Fuzzy queries are used to find documents with terms similar to a specified term, allowing for some degree of error or variation. In the example below, a fuzzy query searches the description field with the deliberately misspelled term "prodused", which still matches the indexed term "produced":
search_query = {"query": {"fuzzy": {"description": "prodused"}}}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
print(hit["_source"]["description"])
This query will match documents whose description contains a term within the allowed edit distance of the search term, here "produced", as controlled by the fuzziness parameter.
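The allowed edit distance can also be set explicitly with the fuzziness parameter (for example "AUTO", 1, or 2). As a hedged sketch, assuming the four example documents are still indexed, the following should match the document whose description mentions "trigger":
# Allow at most one character edit between the search term and indexed terms
search_query = {
    "query": {"fuzzy": {"description": {"value": "triger", "fuzziness": 1}}}
}
search_results = es.search(index=index_name, body=search_query)
for hit in search_results["hits"]["hits"]:
    print(hit["_source"]["filename"])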
Key Points
OpenSearch supports a range of text-based query types, including match, match_phrase, wildcard, prefix, and fuzzy queries.
Each query type has specific use cases and parameters that can be customized for tailored search results.
Efficient use of text-based queries in OpenSearch can significantly enhance data retrieval and analysis capabilities.