Storing and querying data with SQLite

Storing and querying data with SQLite#

Learning objectives

By the end of this lab, you will be able to:

Create an SQLite database and populate it with pandas DataFrames using to_sql()
Query a database using SQL via read_sql()
Use SQL SELECT, WHERE, and JOIN clauses to extract data
Combine SQL queries with pandas operations to analyse results

This lab puts into practice the concepts introduced in the Configuring and populating a relational database and Interacting with databases using Pandas chapters. You will take the Caravaggio dataset from Working with data in pandas and store it in an SQLite database, then use SQL queries to extract information from it.

The dataset#

This lab uses the same dataset from Working with data in pandas. If you no longer have the files, you can download them again by clicking on each name:

artworks.csv: a catalogue of 15 paintings by Caravaggio with their title, year, genre, dimensions (height and width in cm), and the museum where they are held
museums.csv: a list of 11 Italian museums and churches with their city, type, and founding year
collections.csv: a table linking each artwork to its room within the museum and its conservation condition

Download the three files and place them in the same folder where you will write your Python code.

Part 1: Building the database#

In Working with data in pandas, you worked with three CSV files about Caravaggio’s artworks in Italian museums. All the data lived in CSV files, and you used pandas to load, query, and combine it. Now you will store the same data in an SQLite database. The advantage of a database is that data is stored persistently in a structured format, and you can use SQL, a dedicated query language, to extract exactly the information you need.

Exercise 1.1: Create the database and populate it#

Load the three CSV files (artworks.csv, museums.csv, collections.csv) into pandas DataFrames, using the same dtype specifications as in the previous lab. Then, create an SQLite database called caravaggio.db and store each DataFrame as a table using to_sql(). Use if_exists="replace" so the code can be run multiple times without errors, and index=False to avoid writing the pandas index as a column.

Exercise 1.2: Verify the database content#

Use read_sql() to read the entire Artwork table from the database. The SQL query SELECT * FROM Artwork retrieves all columns and all rows from the table. Display the resulting DataFrame and verify it matches the original CSV data.

Show code cell content

Hide code cell content

from pandas import read_sql

with connect("notebook/caravaggio.db") as con:
    df_from_db = read_sql("SELECT * FROM Artwork", con)

df_from_db

	id	title	year	genre	height_cm	width_cm	museum_id
0	1	Ragazzo con canestra di frutta	1593	portrait	70	67	1
1	2	Bacchino malato	1593	mythological	67	53	1
2	3	Buona ventura	1594	genre	115	150	2
3	4	Riposo durante la fuga in Egitto	1596	religious	134	167	3
4	5	Canestra di frutta	1599	still life	47	65	4
5	6	Bacco	1597	mythological	95	85	5
6	7	Testa di Medusa	1597	mythological	60	55	5
7	8	Giuditta e Oloferne	1599	religious	145	195	6
8	9	Narciso	1599	mythological	110	92	6
9	10	Vocazione di san Matteo	1600	religious	322	340	7
10	11	Conversione di san Paolo	1601	religious	230	175	8
11	12	Madonna dei Pellegrini	1605	religious	260	150	9
12	13	Flagellazione di Cristo	1607	religious	286	213	10
13	14	Sette opere di misericordia	1607	religious	390	260	11
14	15	Davide con la testa di Golia	1610	religious	125	101	1

Part 2: Querying the database#

With the data stored in the database, you can use SQL to extract exactly what you need. Each query is a string passed to read_sql(), and the result is returned as a pandas DataFrame. In this part, you will practice the most common SQL clauses: SELECT to choose which columns to retrieve, WHERE to filter rows by condition, and ORDER BY to sort results.

Exercise 2.1: Select specific columns#

Write an SQL query that retrieves only the title and year columns from the Artwork table. Display the result.

Show code cell content

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    df_titles = read_sql("SELECT title, year FROM Artwork", con)

df_titles

	title	year
0	Ragazzo con canestra di frutta	1593
1	Bacchino malato	1593
2	Buona ventura	1594
3	Riposo durante la fuga in Egitto	1596
4	Canestra di frutta	1599
5	Bacco	1597
6	Testa di Medusa	1597
7	Giuditta e Oloferne	1599
8	Narciso	1599
9	Vocazione di san Matteo	1600
10	Conversione di san Paolo	1601
11	Madonna dei Pellegrini	1605
12	Flagellazione di Cristo	1607
13	Sette opere di misericordia	1607
14	Davide con la testa di Golia	1610

Exercise 2.2: Filter rows with WHERE#

Write an SQL query that retrieves the title and year of all artworks painted after 1600. In SQL, conditions are expressed with the WHERE clause, using operators like =, >, <, >=, <=. Display the result.

Show code cell content

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    df_after_1600 = read_sql("SELECT title, year FROM Artwork WHERE year > 1600", con)

df_after_1600

	title	year
0	Conversione di san Paolo	1601
1	Madonna dei Pellegrini	1605
2	Flagellazione di Cristo	1607
3	Sette opere di misericordia	1607
4	Davide con la testa di Golia	1610

Exercise 2.3: Combine conditions and sort results#

Write an SQL query that retrieves the title, year, and genre of artworks that satisfy either of these conditions:

genre is “mythological”
genre is “religious” and year is after 1605

Order the results by year, from oldest to newest, using the ORDER BY clause.

This is the same filter you wrote with df_artworks.query() in Working with data in pandas, Exercise 2.1, but now expressed in SQL with sorting added.

Show code cell content

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    query = """
        SELECT title, year, genre
        FROM Artwork
        WHERE genre = 'mythological'
           OR (genre = 'religious' AND year > 1605)
        ORDER BY year
    """
    df_filtered = read_sql(query, con)

df_filtered

	title	year	genre
0	Bacchino malato	1593	mythological
1	Bacco	1597	mythological
2	Testa di Medusa	1597	mythological
3	Narciso	1599	mythological
4	Flagellazione di Cristo	1607	religious
5	Sette opere di misericordia	1607	religious
6	Davide con la testa di Golia	1610	religious

Part 3: Working with multiple tables#

In Working with data in pandas, you used merge() to combine DataFrames from different CSV files. SQL provides a similar operation called JOIN, which combines rows from two tables based on a shared column. The syntax is:

SELECT columns
FROM TableA
JOIN TableB ON TableA.column = TableB.column

This produces a result where each row from TableA is matched with the corresponding row from TableB wherever the specified columns have the same value, just like merge() in pandas.

Exercise 3.1: Join artworks with museums#

Write an SQL query that joins the Artwork table with the Museum table, matching Artwork.museum_id with Museum.museum_id. Retrieve the artwork title, year, museum name, and city. Display the result.

Show code cell content

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    query = """
        SELECT Artwork.title, Artwork.year, Museum.name, Museum.city
        FROM Artwork
        JOIN Museum ON Artwork.museum_id = Museum.museum_id
    """
    df_joined = read_sql(query, con)

df_joined

	title	year	name	city
0	Ragazzo con canestra di frutta	1593	Galleria Borghese	Rome
1	Bacchino malato	1593	Galleria Borghese	Rome
2	Buona ventura	1594	Musei Capitolini	Rome
3	Riposo durante la fuga in Egitto	1596	Galleria Doria Pamphilj	Rome
4	Canestra di frutta	1599	Pinacoteca Ambrosiana	Milan
5	Bacco	1597	Galleria degli Uffizi	Florence
6	Testa di Medusa	1597	Galleria degli Uffizi	Florence
7	Giuditta e Oloferne	1599	Palazzo Barberini	Rome
8	Narciso	1599	Palazzo Barberini	Rome
9	Vocazione di san Matteo	1600	San Luigi dei Francesi	Rome
10	Conversione di san Paolo	1601	Santa Maria del Popolo	Rome
11	Madonna dei Pellegrini	1605	Sant'Agostino	Rome
12	Flagellazione di Cristo	1607	Museo di Capodimonte	Naples
13	Sette opere di misericordia	1607	Pio Monte della Misericordia	Naples
14	Davide con la testa di Golia	1610	Galleria Borghese	Rome

Exercise 3.2: Join three tables and filter#

Write an SQL query that joins all three tables (Artwork, Collection, and Museum) to retrieve the title, museum name, room, and condition of each artwork. Then, use pandas query() on the resulting DataFrame to find all artworks in “excellent” condition. Print the title and museum name of each one.

Part 4: Final challenge#

Exercise 4.1: City report from the database#

Using SQL queries and pandas operations, produce a summary report as a new DataFrame with one row per city and the following columns:

city: the city name
num_museums: the number of distinct museums in that city that hold Caravaggio artworks
num_artworks: the total number of artworks held in that city
earliest: the year of the earliest artwork in that city
latest: the year of the latest artwork in that city

Save the result to a CSV file called city_report.csv (without the row index) and display the DataFrame.

Hint

Start by writing an SQL query with JOIN to combine Artwork and Museum. Then iterate over the resulting DataFrame with iterrows() to aggregate information by city into a dictionary. Use a set to track distinct museum identifiers per city.

Solution

from sqlite3 import connect
from pandas import read_sql, DataFrame, Series

with connect("caravaggio.db") as con:
    query = """
        SELECT Artwork.title, Artwork.year, Museum.museum_id, Museum.name, Museum.city
        FROM Artwork
        JOIN Museum ON Artwork.museum_id = Museum.museum_id
    """
    df_full = read_sql(query, con)

report = dict()
for idx, row in df_full.iterrows():
    city = row["city"]

    if city not in report:
        report[city] = {
            "museums": set(),
            "num_artworks": 0,
            "earliest": row["year"],
            "latest": row["year"]
        }

    report[city]["museums"].add(row["museum_id"])
    report[city]["num_artworks"] = report[city]["num_artworks"] + 1

    if row["year"] < report[city]["earliest"]:
        report[city]["earliest"] = row["year"]

    if row["year"] > report[city]["latest"]:
        report[city]["latest"] = row["year"]

rows = list()
for city in report:
    row = Series({
        "city": city,
        "num_museums": len(report[city]["museums"]),
        "num_artworks": report[city]["num_artworks"],
        "earliest": report[city]["earliest"],
        "latest": report[city]["latest"]
    })
    rows.append(row)

df_report = DataFrame(rows)
df_report.to_csv("city_report.csv", index=False)
df_report

Summary#

In this lab, you practised:

Creating a database: using connect() to create an SQLite database file and to_sql() to populate it with pandas DataFrames
Reading from a database: using read_sql() to execute SQL queries and receive results as DataFrames
SQL syntax: SELECT to choose columns, WHERE to filter rows with conditions, JOIN to combine tables on shared columns, ORDER BY to sort results
Combining tools: using SQL to extract data from the database and pandas to process the results further

Additional resources#

Next lab

In the next lab, Loading data and querying with Blazegraph, you will learn to represent the same dataset as RDF triples and query it using SPARQL on a Blazegraph triplestore, building on the concepts introduced in the Configuring and populating a graph database and Interacting with databases using Pandas chapters.

Storing and querying data with SQLite

Contents

Storing and querying data with SQLite#

The dataset#

Part 1: Building the database#

Exercise 1.1: Create the database and populate it#

Exercise 1.2: Verify the database content#

Part 2: Querying the database#

Exercise 2.1: Select specific columns#

Exercise 2.2: Filter rows with WHERE#

Exercise 2.3: Combine conditions and sort results#

Part 3: Working with multiple tables#

Exercise 3.1: Join artworks with museums#

Exercise 3.2: Join three tables and filter#

Part 4: Final challenge#

Exercise 4.1: City report from the database#

Summary#

Additional resources#