SQLite
Overview
Teaching: 60 min
Exercises: 30 minQuestions
What is SQLite?
How do I create an SQLite database?
Objectives
Creating SQLite database
Manipulating the database
Introduction to SQLite
Similar to MySQL, SQLite is a data manager system. However, it is simpler, more flexible, and user-friendly than the former. It implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is a preferred choice for the lightweight embedded databases. The SQLite is available in both C and Python. Here, we will proceed with the Python implementation. It is user-friendly and comes with native Python 2.6+.
Particle Data Group database
Create a working directory hsf_sqlite_training
:
mkdir hsf_sqlite_training
cd hsf_sqlite_trainins
Let’s create a database, that contains information from the Particle Data Group particle properties table.
First, we need to download the particle properties table from 2023.
You can do it manually and place it in your directory hsf_sqlite_training
under the name particle_table.txt
or if you use Python 3.7+ you can download the file automatically with a python module requests
.
Additional: using
requests
To check the Python version on your machine do the following:
python3 --version
If the python version is higher than 3.7, you can install
requests
withpython -m pip install requests
. Then you can create a scriptdownload_particle_table.py
using your favorite code editor.import requests particle_table = "https://pdg.lbl.gov/2023/mcdata/mass_width_2023.txt" #url of the file we want response = requests.get(particle_table) #getting the response from the url. if response.status_code == 200: with open("particle_table.txt", "wb") as file: #writing the response into the txt file `particle_table.txt` locally file.write(response.content) print("Particle table is downloaded!") else: print("Failed to download the particle table.") #it can be that the server is down or you have a typo in the url
Save the
download_particle_table.py
script and execute withpython3 download_particle_table.py
. You should see the downloaded fileparticle_table.txt
in your working directory.
Open the particle_table.txt
in your favorite text editor and study the data structure inside.
You should see:
* MASSES, WIDTHS, AND MC ID NUMBERS FROM 2023 EDITION OF RPP
*
* The following values were generated on 31-May-2023 by the Berkeley Particle
* Data Group from the Review of Particle Physics database and are intended
* for use in Monte Carlo programs.
*
* For questions regarding distribution or content of this file, contact
* the Particle Data Group at pdg@lbl.gov.
*
* To process the images in this file:
* 1) ignore documentation lines that begin with an asterisk
* 2) in a FORTRAN program, process data lines with
* FORMAT (BN, 4I8, 2(1X,E18.0, 1X,E8.0, 1X,E8.0), 1X,A21)
* 3) column
* 1 - 8 \ Monte Carlo particle numbers as described in the "Review of
* 9 - 16 | Particle Physics". Charge states appear, as appropriate,
* 17 - 24 | from left-to-right in the order -, 0, +, ++.
* 25 - 32 /
* 33 blank
* 34 - 51 central value of the mass (double precision)
* 52 blank
* 53 - 60 positive error
* 61 blank
* 62 - 69 negative error
* 70 blank
* 71 - 88 central value of the width (double precision)
* 89 blank
* 90 - 97 positive error
* 98 blank
* 99 -106 negative error
* 107 blank
* 108 -128 particle name left-justified in the field and
* charge states right-justified in the field.
* This field is for ease of visual examination of the file and
* should not be taken as a standardized presentation of
* particle names.
*
* Particle ID(s) Mass (GeV) Errors (GeV) Width (GeV) Errors (GeV) Name Charges
21 0.E+00 +0.0E+00 -0.0E+00 0.E+00 +0.0E+00 -0.0E+00 g 0
22 0.E+00 +0.0E+00 -0.0E+00 0.E+00 +0.0E+00 -0.0E+00 gamma 0
24 8.0377E+01 +1.2E-02 -1.2E-02 2.08E+00 +4.0E-02 -4.0E-02 W +
23 9.11876E+01 +2.1E-03 -2.1E-03 2.4955E+00 +2.3E-03 -2.3E-03 Z 0
25 1.2525E+02 +1.7E-01 -1.7E-01 3.2E-03 +2.4E-03 -1.7E-03 H 0
...
Now that we have obtained the particle table, let’s build the SQlite database!
Create the sqlite3 database
Firstly, we need to connect the sqlite3 database. In the file called create_database.py
type in:
import sqlite3
connection = sqlite3.connect(
":memory:"
) # create or connect to database. Returns `Connection` object.
"""
Here one can perform operations on the data base, such as inserting, updating, deleting and selecting.
"""
connection.close() # close database connection
The Connection
object represents the database that, in this case, is stored in RAM using a special name :memory:
.
If you want to save database locally replace :memory:
with <name>.db
.
If the file <name>.db
already exists, connect
will simply connect to the file, but if
Fill in the sqlite3 database
Secondly, we need to fill in the database with inputs from the particle_table.txt
we have downloaded before.
To fill it we use SQL commands.
The SQL commands can be passed to using a Cursor
object that can be created once the Connection
object is created.
The Cursor
object method execute
takes in the SQL commands, which allows one to work with databases directly in SQL.
As the first step, we create the database table on the data structure of particle_table.txt
and fill it in.
In create_database.py
:
import sqlite3
import pandas as pd
connection = sqlite3.connect(
":memory:"
) # create or connect to database. Returns `Connection` object.
cursor = connection.cursor()
cursor.execute(
"CREATE TABLE particles(id INTEGER, mass FLOAT, masserrlow FLOAT, masserrup FLOAT, width FLOAT, widtherrlow FLOAT, widtherr_up FLOAT, name TEXT, charge INTEGER)"
)
cursor.execute(
"INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)",
[21, 0.0e00, +0.0e00, -0.0e00, 0.0e00, +0.0e00, -0.0e00, "g", 0],
) # gluon is filled in
connection.commit() # commit to database
readout = pd.read_sql("SELECT * FROM particles", connection) # read from database
print(readout) # print the database
connection.close() # close database connection
In the execute we have create a table called particles
that contains a particle id (integer), mass (float), upper and lower mass errors (float), width (float), upper and lower width errors (float), particle name (TEXT) and charge (INTEGER).
Then, we fill in the information on gluon into our particles
table.
After that we commit the changes to the database with the Connection
object method commit
.
To read the database, we use pandas read_sql
function.
Now execute create_database.py
with python3 create_database.py
.
You should see:
id mass masserrlow masserrup width widtherrlow widtherr_up name charge
0 21 0.0 0.0 0.0 0.0 0.0 0.0 g 0
However, we would like to write down the entire list of the PDG particles into our database, not just a gluon!
For this, we will use executemany
, instead of execute
.
But before that, we need to prepare our input data in the particle_table.txt
.
First of all the first 38 lines are taken with the header information that we do not need in our database, so we will skip these rows when copying.
Secondly, the sizes of columns are different in the particle_table.txt
, but luckily the header specifies the exact sizes of the table columns:
* 3) column
* 1 - 8 \ Monte Carlo particle numbers as described in the "Review of
* 9 - 16 | Particle Physics". Charge states appear, as appropriate,
* 17 - 24 | from left-to-right in the order -, 0, +, ++.
* 25 - 32 /
* 33 blank
* 34 - 51 central value of the mass (double precision)
* 52 blank
* 53 - 60 positive error
* 61 blank
* 62 - 69 negative error
* 70 blank
* 71 - 88 central value of the width (double precision)
* 89 blank
* 90 - 97 positive error
* 98 blank
* 99 -106 negative error
* 107 blank
* 108 -128 particle name left-justified in the field and
* charge states right-justified in the field.
* This field is for ease of visual examination of the file and
* should not be taken as a standardized presentation of
* particle names.
You can check in your favorite text editor that the column definitions are correct.
Now particle_table.txt
contains more precise information than what we want to save in our database.
For example, we are not writing down the state charges that are written in the column 9 - 32.
We would like to skip these columns.
Moreover, columns 108 - 208 contain both Name and Charge, but we would like to split it into two in our database.
We then can prepare data as follows:
import pandas as pd
colspecs = [
(0, 8),
(33, 51),
(52, 60),
(61, 69),
(70, 88),
(89, 97),
(98, 106),
(107, 128),
]
column_names = [
"id",
"mass",
"masserrup",
"masserrdown",
"width",
"widtherrup",
"widtherrdown",
"namecharge",
]
# Read the file with the specified column widths
data = pd.read_fwf(
"particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)
data = data.values.tolist()
executemany
expects an iterable input.
Therefore we transform the pandas dataset to nested lists.
Now we can commit the entire particle table to the database.
cursor.executemany("INSERT INTO particles VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?)", data)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection) # read from database
print(readout) # print the database
connection.close()
Save the create_database.py
and execute with python3 create_database.py
, you should see the following output:
id mass masserrlow masserrup width widtherrlow \
0 21 0.00000 0.00000 0.00000 0.000000e+00 0.000000e+00
1 22 0.00000 0.00000 0.00000 0.000000e+00 0.000000e+00
2 24 80.37700 0.01200 -0.01200 2.080000e+00 4.000000e-02
3 23 91.18760 0.00210 -0.00210 2.495500e+00 2.300000e-03
4 25 125.25000 0.17000 -0.17000 3.200000e-03 2.400000e-03
.. ... ... ... ... ... ...
224 5114 5.83474 0.00030 -0.00030 1.040000e-02 8.000000e-04
225 5224 5.83032 0.00027 -0.00027 9.400000e-03 5.000000e-04
226 5132 5.79700 0.00060 -0.00060 4.190000e-13 1.100000e-14
227 5232 5.79190 0.00050 -0.00050 4.450000e-13 9.000000e-15
228 5332 6.04520 0.00120 -0.00120 4.000000e-13 5.000000e-14
widtherr_up name charge
0 0.000000e+00 g 0
1 0.000000e+00 gamma 0
2 -4.000000e-02 W +
3 -2.300000e-03 Z 0
4 -1.700000e-03 H 0
.. ... ... ...
224 -8.000000e-04 Sigma(b)* -
225 -5.000000e-04 Sigma(b)* +
226 -1.100000e-14 Xi(b) -
227 -9.000000e-15 Xi(b) 0
228 -4.000000e-14 Omega(b) -
[229 rows x 9 columns]
Creating sql database directly from pandas dataframe
Pandas module has some useful methods for creating sql databases. The example above can be replaced with
import sqlite3
import pandas as pd
colspecs = [
(0, 8),
(33, 51),
(52, 60),
(61, 69),
(70, 88),
(89, 97),
(98, 106),
(107, 128),
]
column_names = [
"id",
"mass",
"masserrup",
"masserrdown",
"width",
"widtherrup",
"widtherrdown",
"namecharge",
]
data = pd.read_fwf(
"particle_table.txt", colspecs=colspecs, skiprows=38, names=column_names
)
data[["name", "charge"]] = data["namecharge"].str.extract(r"(.+?)\s+(\S+)$")
data = data.drop("namecharge", axis=1)
connection = sqlite3.connect(":memory:")
data.to_sql("particles", connection, index=False)
connection.commit()
readout = pd.read_sql("SELECT * FROM particles", connection) # read from database
print(readout) # print the database
connection.close()
Manipulating the database with SQLite
Before learning how to manipulate the database, let’s first save the database created in the previous steps, not in RAM, but in the file called particles.db
.
Replace the name of the database :memory:
with particles.db
and rerun create_database.py
.
You should see particles.db
in your directory.
Exercise
Open the
particles.db
file and create a table that contains all neutral particles.
Solution
cursor.execute("CREATE TABLE neutral_particles AS SELECT * from particles WHERE charge = '0'")
Exercise
Open the
particles.db
file and select only neutral leptons.
Solution
readout = pd.read_sql("SELECT * FROM neutral_particles WHERE name LIKE 'nu%'", connection)
Key Points
For lightweight applications, use SQLite.
Benefit from integration between sqlite3 and pandas.