Storing and querying data with SQLite#

Learning objectives

By the end of this lab, you will be able to:

  • Create an SQLite database and populate it with pandas DataFrames using to_sql()

  • Query a database using SQL via read_sql()

  • Use SQL SELECT, WHERE, and JOIN clauses to extract data

  • Combine SQL queries with pandas operations to analyse results

This lab puts into practice the concepts introduced in the Configuring and populating a relational database and Interacting with databases using Pandas chapters. You will take the Caravaggio dataset from Working with data in pandas and store it in an SQLite database, then use SQL queries to extract information from it.


The dataset#

This lab uses the same dataset from Working with data in pandas. If you no longer have the files, you can download them again by clicking on each name:

  • artworks.csv: a catalogue of 15 paintings by Caravaggio with their title, year, genre, dimensions (height and width in cm), and the museum where they are held

  • museums.csv: a list of 11 Italian museums and churches with their city, type, and founding year

  • collections.csv: a table linking each artwork to its room within the museum and its conservation condition

Download the three files and place them in the same folder where you will write your Python code.


Part 1: Building the database#

In Working with data in pandas, you worked with three CSV files about Caravaggio’s artworks in Italian museums. All the data lived in CSV files, and you used pandas to load, query, and combine it. Now you will store the same data in an SQLite database. The advantage of a database is that data is stored persistently in a structured format, and you can use SQL, a dedicated query language, to extract exactly the information you need.

Exercise 1.1: Create the database and populate it#

Load the three CSV files (artworks.csv, museums.csv, collections.csv) into pandas DataFrames, using the same dtype specifications as in the previous lab. Then, create an SQLite database called caravaggio.db and store each DataFrame as a table using to_sql(). Use if_exists="replace" so the code can be run multiple times without errors, and index=False to avoid writing the pandas index as a column.

Hide code cell content

from sqlite3 import connect
from pandas import read_csv

df_artworks = read_csv("notebook/artworks.csv",
                        keep_default_na=False,
                        dtype={
                            "id": "int",
                            "title": "string",
                            "year": "int",
                            "genre": "string",
                            "height_cm": "int",
                            "width_cm": "int",
                            "museum_id": "int"
                        })

df_museums = read_csv("notebook/museums.csv",
                      keep_default_na=False,
                      dtype={
                          "museum_id": "int",
                          "name": "string",
                          "city": "string",
                          "type": "string",
                          "founded": "int"
                      })

df_collections = read_csv("notebook/collections.csv",
                          keep_default_na=False,
                          dtype={
                              "artwork_id": "int",
                              "room": "string",
                              "condition": "string"
                          })

with connect("notebook/caravaggio.db") as con:
    df_artworks.to_sql("Artwork", con, if_exists="replace", index=False)
    df_museums.to_sql("Museum", con, if_exists="replace", index=False)
    df_collections.to_sql("Collection", con, if_exists="replace", index=False)

Exercise 1.2: Verify the database content#

Use read_sql() to read the entire Artwork table from the database. The SQL query SELECT * FROM Artwork retrieves all columns and all rows from the table. Display the resulting DataFrame and verify it matches the original CSV data.

Hide code cell content

from pandas import read_sql

with connect("notebook/caravaggio.db") as con:
    df_from_db = read_sql("SELECT * FROM Artwork", con)

df_from_db
id title year genre height_cm width_cm museum_id
0 1 Ragazzo con canestra di frutta 1593 portrait 70 67 1
1 2 Bacchino malato 1593 mythological 67 53 1
2 3 Buona ventura 1594 genre 115 150 2
3 4 Riposo durante la fuga in Egitto 1596 religious 134 167 3
4 5 Canestra di frutta 1599 still life 47 65 4
5 6 Bacco 1597 mythological 95 85 5
6 7 Testa di Medusa 1597 mythological 60 55 5
7 8 Giuditta e Oloferne 1599 religious 145 195 6
8 9 Narciso 1599 mythological 110 92 6
9 10 Vocazione di san Matteo 1600 religious 322 340 7
10 11 Conversione di san Paolo 1601 religious 230 175 8
11 12 Madonna dei Pellegrini 1605 religious 260 150 9
12 13 Flagellazione di Cristo 1607 religious 286 213 10
13 14 Sette opere di misericordia 1607 religious 390 260 11
14 15 Davide con la testa di Golia 1610 religious 125 101 1

Part 2: Querying the database#

With the data stored in the database, you can use SQL to extract exactly what you need. Each query is a string passed to read_sql(), and the result is returned as a pandas DataFrame. In this part, you will practice the most common SQL clauses: SELECT to choose which columns to retrieve, WHERE to filter rows by condition, and ORDER BY to sort results.

Exercise 2.1: Select specific columns#

Write an SQL query that retrieves only the title and year columns from the Artwork table. Display the result.

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    df_titles = read_sql("SELECT title, year FROM Artwork", con)

df_titles
title year
0 Ragazzo con canestra di frutta 1593
1 Bacchino malato 1593
2 Buona ventura 1594
3 Riposo durante la fuga in Egitto 1596
4 Canestra di frutta 1599
5 Bacco 1597
6 Testa di Medusa 1597
7 Giuditta e Oloferne 1599
8 Narciso 1599
9 Vocazione di san Matteo 1600
10 Conversione di san Paolo 1601
11 Madonna dei Pellegrini 1605
12 Flagellazione di Cristo 1607
13 Sette opere di misericordia 1607
14 Davide con la testa di Golia 1610

Exercise 2.2: Filter rows with WHERE#

Write an SQL query that retrieves the title and year of all artworks painted after 1600. In SQL, conditions are expressed with the WHERE clause, using operators like =, >, <, >=, <=. Display the result.

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    df_after_1600 = read_sql("SELECT title, year FROM Artwork WHERE year > 1600", con)

df_after_1600
title year
0 Conversione di san Paolo 1601
1 Madonna dei Pellegrini 1605
2 Flagellazione di Cristo 1607
3 Sette opere di misericordia 1607
4 Davide con la testa di Golia 1610

Exercise 2.3: Combine conditions and sort results#

Write an SQL query that retrieves the title, year, and genre of artworks that satisfy either of these conditions:

  • genre is “mythological”

  • genre is “religious” and year is after 1605

Order the results by year, from oldest to newest, using the ORDER BY clause.

This is the same filter you wrote with df_artworks.query() in Working with data in pandas, Exercise 2.1, but now expressed in SQL with sorting added.

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    query = """
        SELECT title, year, genre
        FROM Artwork
        WHERE genre = 'mythological'
           OR (genre = 'religious' AND year > 1605)
        ORDER BY year
    """
    df_filtered = read_sql(query, con)

df_filtered
title year genre
0 Bacchino malato 1593 mythological
1 Bacco 1597 mythological
2 Testa di Medusa 1597 mythological
3 Narciso 1599 mythological
4 Flagellazione di Cristo 1607 religious
5 Sette opere di misericordia 1607 religious
6 Davide con la testa di Golia 1610 religious

Part 3: Working with multiple tables#

In Working with data in pandas, you used merge() to combine DataFrames from different CSV files. SQL provides a similar operation called JOIN, which combines rows from two tables based on a shared column. The syntax is:

SELECT columns
FROM TableA
JOIN TableB ON TableA.column = TableB.column

This produces a result where each row from TableA is matched with the corresponding row from TableB wherever the specified columns have the same value, just like merge() in pandas.

Exercise 3.1: Join artworks with museums#

Write an SQL query that joins the Artwork table with the Museum table, matching Artwork.museum_id with Museum.museum_id. Retrieve the artwork title, year, museum name, and city. Display the result.

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    query = """
        SELECT Artwork.title, Artwork.year, Museum.name, Museum.city
        FROM Artwork
        JOIN Museum ON Artwork.museum_id = Museum.museum_id
    """
    df_joined = read_sql(query, con)

df_joined
title year name city
0 Ragazzo con canestra di frutta 1593 Galleria Borghese Rome
1 Bacchino malato 1593 Galleria Borghese Rome
2 Buona ventura 1594 Musei Capitolini Rome
3 Riposo durante la fuga in Egitto 1596 Galleria Doria Pamphilj Rome
4 Canestra di frutta 1599 Pinacoteca Ambrosiana Milan
5 Bacco 1597 Galleria degli Uffizi Florence
6 Testa di Medusa 1597 Galleria degli Uffizi Florence
7 Giuditta e Oloferne 1599 Palazzo Barberini Rome
8 Narciso 1599 Palazzo Barberini Rome
9 Vocazione di san Matteo 1600 San Luigi dei Francesi Rome
10 Conversione di san Paolo 1601 Santa Maria del Popolo Rome
11 Madonna dei Pellegrini 1605 Sant'Agostino Rome
12 Flagellazione di Cristo 1607 Museo di Capodimonte Naples
13 Sette opere di misericordia 1607 Pio Monte della Misericordia Naples
14 Davide con la testa di Golia 1610 Galleria Borghese Rome

Exercise 3.2: Join three tables and filter#

Write an SQL query that joins all three tables (Artwork, Collection, and Museum) to retrieve the title, museum name, room, and condition of each artwork. Then, use pandas query() on the resulting DataFrame to find all artworks in “excellent” condition. Print the title and museum name of each one.

Hide code cell content

with connect("notebook/caravaggio.db") as con:
    query = """
        SELECT Artwork.title, Museum.name, Collection.room, Collection.condition
        FROM Artwork
        JOIN Collection ON Artwork.id = Collection.artwork_id
        JOIN Museum ON Artwork.museum_id = Museum.museum_id
    """
    df_full = read_sql(query, con)

df_excellent = df_full.query("condition == 'excellent'")
for idx, row in df_excellent.iterrows():
    print(row["title"], "-", row["name"])
Ragazzo con canestra di frutta - Galleria Borghese
Buona ventura - Musei Capitolini
Canestra di frutta - Pinacoteca Ambrosiana
Testa di Medusa - Galleria degli Uffizi
Vocazione di san Matteo - San Luigi dei Francesi
Conversione di san Paolo - Santa Maria del Popolo

Part 4: Final challenge#

Exercise 4.1: City report from the database#

Using SQL queries and pandas operations, produce a summary report as a new DataFrame with one row per city and the following columns:

  • city: the city name

  • num_museums: the number of distinct museums in that city that hold Caravaggio artworks

  • num_artworks: the total number of artworks held in that city

  • earliest: the year of the earliest artwork in that city

  • latest: the year of the latest artwork in that city

Save the result to a CSV file called city_report.csv (without the row index) and display the DataFrame.


Summary#

In this lab, you practised:

  • Creating a database: using connect() to create an SQLite database file and to_sql() to populate it with pandas DataFrames

  • Reading from a database: using read_sql() to execute SQL queries and receive results as DataFrames

  • SQL syntax: SELECT to choose columns, WHERE to filter rows with conditions, JOIN to combine tables on shared columns, ORDER BY to sort results

  • Combining tools: using SQL to extract data from the database and pandas to process the results further


Additional resources#

Next lab

In the next lab, Loading data and querying with Blazegraph, you will learn to represent the same dataset as RDF triples and query it using SPARQL on a Blazegraph triplestore, building on the concepts introduced in the Configuring and populating a graph database and Interacting with databases using Pandas chapters.