Storing and querying data with SQLite#
Learning objectives
By the end of this lab, you will be able to:
Create an SQLite database and populate it with pandas DataFrames using
to_sql()Query a database using SQL via
read_sql()Use SQL
SELECT,WHERE, andJOINclauses to extract dataCombine SQL queries with pandas operations to analyse results
This lab puts into practice the concepts introduced in the Configuring and populating a relational database and Interacting with databases using Pandas chapters. You will take the Caravaggio dataset from Working with data in pandas and store it in an SQLite database, then use SQL queries to extract information from it.
The dataset#
This lab uses the same dataset from Working with data in pandas. If you no longer have the files, you can download them again by clicking on each name:
artworks.csv: a catalogue of 15 paintings by Caravaggio with their title, year, genre, dimensions (height and width in cm), and the museum where they are heldmuseums.csv: a list of 11 Italian museums and churches with their city, type, and founding yearcollections.csv: a table linking each artwork to its room within the museum and its conservation condition
Download the three files and place them in the same folder where you will write your Python code.
Part 1: Building the database#
In Working with data in pandas, you worked with three CSV files about Caravaggio’s artworks in Italian museums. All the data lived in CSV files, and you used pandas to load, query, and combine it. Now you will store the same data in an SQLite database. The advantage of a database is that data is stored persistently in a structured format, and you can use SQL, a dedicated query language, to extract exactly the information you need.
Exercise 1.1: Create the database and populate it#
Load the three CSV files (artworks.csv, museums.csv, collections.csv) into pandas DataFrames, using the same dtype specifications as in the previous lab. Then, create an SQLite database called caravaggio.db and store each DataFrame as a table using to_sql(). Use if_exists="replace" so the code can be run multiple times without errors, and index=False to avoid writing the pandas index as a column.
Exercise 1.2: Verify the database content#
Use read_sql() to read the entire Artwork table from the database. The SQL query SELECT * FROM Artwork retrieves all columns and all rows from the table. Display the resulting DataFrame and verify it matches the original CSV data.
Part 2: Querying the database#
With the data stored in the database, you can use SQL to extract exactly what you need. Each query is a string passed to read_sql(), and the result is returned as a pandas DataFrame. In this part, you will practice the most common SQL clauses: SELECT to choose which columns to retrieve, WHERE to filter rows by condition, and ORDER BY to sort results.
Exercise 2.1: Select specific columns#
Write an SQL query that retrieves only the title and year columns from the Artwork table. Display the result.
Exercise 2.2: Filter rows with WHERE#
Write an SQL query that retrieves the title and year of all artworks painted after 1600. In SQL, conditions are expressed with the WHERE clause, using operators like =, >, <, >=, <=. Display the result.
Exercise 2.3: Combine conditions and sort results#
Write an SQL query that retrieves the title, year, and genre of artworks that satisfy either of these conditions:
genre is “mythological”
genre is “religious” and year is after 1605
Order the results by year, from oldest to newest, using the ORDER BY clause.
This is the same filter you wrote with df_artworks.query() in Working with data in pandas, Exercise 2.1, but now expressed in SQL with sorting added.
Part 3: Working with multiple tables#
In Working with data in pandas, you used merge() to combine DataFrames from different CSV files. SQL provides a similar operation called JOIN, which combines rows from two tables based on a shared column. The syntax is:
SELECT columns
FROM TableA
JOIN TableB ON TableA.column = TableB.column
This produces a result where each row from TableA is matched with the corresponding row from TableB wherever the specified columns have the same value, just like merge() in pandas.
Exercise 3.1: Join artworks with museums#
Write an SQL query that joins the Artwork table with the Museum table, matching Artwork.museum_id with Museum.museum_id. Retrieve the artwork title, year, museum name, and city. Display the result.
Exercise 3.2: Join three tables and filter#
Write an SQL query that joins all three tables (Artwork, Collection, and Museum) to retrieve the title, museum name, room, and condition of each artwork. Then, use pandas query() on the resulting DataFrame to find all artworks in “excellent” condition. Print the title and museum name of each one.
Part 4: Final challenge#
Exercise 4.1: City report from the database#
Using SQL queries and pandas operations, produce a summary report as a new DataFrame with one row per city and the following columns:
city: the city namenum_museums: the number of distinct museums in that city that hold Caravaggio artworksnum_artworks: the total number of artworks held in that cityearliest: the year of the earliest artwork in that citylatest: the year of the latest artwork in that city
Save the result to a CSV file called city_report.csv (without the row index) and display the DataFrame.
Hint
Start by writing an SQL query with JOIN to combine Artwork and Museum. Then iterate over the resulting DataFrame with iterrows() to aggregate information by city into a dictionary. Use a set to track distinct museum identifiers per city.
Solution
from sqlite3 import connect
from pandas import read_sql, DataFrame, Series
with connect("caravaggio.db") as con:
query = """
SELECT Artwork.title, Artwork.year, Museum.museum_id, Museum.name, Museum.city
FROM Artwork
JOIN Museum ON Artwork.museum_id = Museum.museum_id
"""
df_full = read_sql(query, con)
report = dict()
for idx, row in df_full.iterrows():
city = row["city"]
if city not in report:
report[city] = {
"museums": set(),
"num_artworks": 0,
"earliest": row["year"],
"latest": row["year"]
}
report[city]["museums"].add(row["museum_id"])
report[city]["num_artworks"] = report[city]["num_artworks"] + 1
if row["year"] < report[city]["earliest"]:
report[city]["earliest"] = row["year"]
if row["year"] > report[city]["latest"]:
report[city]["latest"] = row["year"]
rows = list()
for city in report:
row = Series({
"city": city,
"num_museums": len(report[city]["museums"]),
"num_artworks": report[city]["num_artworks"],
"earliest": report[city]["earliest"],
"latest": report[city]["latest"]
})
rows.append(row)
df_report = DataFrame(rows)
df_report.to_csv("city_report.csv", index=False)
df_report
Summary#
In this lab, you practised:
Creating a database: using
connect()to create an SQLite database file andto_sql()to populate it with pandas DataFramesReading from a database: using
read_sql()to execute SQL queries and receive results as DataFramesSQL syntax:
SELECTto choose columns,WHEREto filter rows with conditions,JOINto combine tables on shared columns,ORDER BYto sort resultsCombining tools: using SQL to extract data from the database and pandas to process the results further
Additional resources#
Next lab
In the next lab, Loading data and querying with Blazegraph, you will learn to represent the same dataset as RDF triples and query it using SPARQL on a Blazegraph triplestore, building on the concepts introduced in the Configuring and populating a graph database and Interacting with databases using Pandas chapters.