Working with data in pandas#

Learning objectives

By the end of this lab, you will be able to:

  • Load CSV files into pandas DataFrames with correct data types

  • Query and filter DataFrames using query() and iterrows()

  • Join multiple DataFrames using merge()

  • Save results to CSV files using to_csv()

This lab puts into practice the concepts introduced in the Introduction to Pandas chapter. You will work with a dataset of artworks by Caravaggio held in Italian museums and churches, exploring how to load, query, and combine tabular data using pandas.


The dataset: Caravaggio’s artworks in Italy#

The dataset for this lab consists of three CSV files stored in the notebook/ directory:

  • artworks.csv: a catalogue of 15 paintings by Caravaggio with their title, year, genre, dimensions (height and width in cm), and the museum where they are held

  • museums.csv: a list of 11 Italian museums and churches with their city, type, and founding year

  • collections.csv: a table linking each artwork to its room within the museum and its conservation condition

These three tables are related: each row in collections.csv connects an artwork (by its id) to a specific room, while each artwork references a museum (by its museum_id). This is a common pattern in data management, where information is split across multiple tables to avoid repetition.


Part 1: Loading and exploring the catalogue#

In this first part, you will load the artworks dataset and explore its content using basic pandas operations.

Exercise 1.1: Load the artworks catalogue#

Load the file notebook/artworks.csv into a pandas DataFrame using read_csv(). Make sure to specify keep_default_na=False and provide a dtype dictionary so that each column is read with the correct data type: "string" for text columns and "int" for id, year, height_cm, width_cm, and museum_id. Display the resulting DataFrame.

Hide code cell content

from pandas import read_csv

df_artworks = read_csv("notebook/artworks.csv",
                        keep_default_na=False,
                        dtype={
                            "id": "int",
                            "title": "string",
                            "year": "int",
                            "genre": "string",
                            "height_cm": "int",
                            "width_cm": "int",
                            "museum_id": "int"
                        })
df_artworks
id title year genre height_cm width_cm museum_id
0 1 Ragazzo con canestra di frutta 1593 portrait 70 67 1
1 2 Bacchino malato 1593 mythological 67 53 1
2 3 Buona ventura 1594 genre 115 150 2
3 4 Riposo durante la fuga in Egitto 1596 religious 134 167 3
4 5 Canestra di frutta 1599 still life 47 65 4
5 6 Bacco 1597 mythological 95 85 5
6 7 Testa di Medusa 1597 mythological 60 55 5
7 8 Giuditta e Oloferne 1599 religious 145 195 6
8 9 Narciso 1599 mythological 110 92 6
9 10 Vocazione di san Matteo 1600 religious 322 340 7
10 11 Conversione di san Paolo 1601 religious 230 175 8
11 12 Madonna dei Pellegrini 1605 religious 260 150 9
12 13 Flagellazione di Cristo 1607 religious 286 213 10
13 14 Sette opere di misericordia 1607 religious 390 260 11
14 15 Davide con la testa di Golia 1610 religious 125 101 1

Exercise 1.2: Count artworks by genre#

Using iterrows(), count how many artworks belong to each genre. Build a dictionary where the keys are genre names and the values are the counts, then print it.

Hide code cell content

genre_counts = dict()
for idx, row in df_artworks.iterrows():
    genre = row["genre"]
    if genre in genre_counts:
        genre_counts[genre] = genre_counts[genre] + 1
    else:
        genre_counts[genre] = 1

print(genre_counts)
{'portrait': 1, 'mythological': 4, 'genre': 1, 'religious': 8, 'still life': 1}

Part 2: Querying the catalogue#

Now that you are familiar with the dataset, you will practice more advanced queries combining multiple conditions.

Exercise 2.1: Multi-condition query#

Find all artworks that satisfy either of these conditions:

  • Genre is “mythological”

  • Genre is “religious” and year is after 1605

Print the title of each matching artwork.

Hide code cell content

df_result = df_artworks.query(
    "(genre == 'mythological') or (genre == 'religious' and year > 1605)"
)
for idx, row in df_result.iterrows():
    print(row["title"])
Bacchino malato
Bacco
Testa di Medusa
Narciso
Flagellazione di Cristo
Sette opere di misericordia
Davide con la testa di Golia

Part 3: Working with multiple tables#

In real datasets, information is often distributed across multiple tables. In this part, you will load additional CSV files and combine them with the artworks catalogue using merge().

Exercise 3.1: Load and join artworks with collections#

Load notebook/museums.csv and notebook/collections.csv into two DataFrames, specifying keep_default_na=False and appropriate dtype dictionaries. Then, use merge() to join the artworks DataFrame with the collections DataFrame. The join should match the id column in artworks with the artwork_id column in collections. Display the resulting DataFrame and observe which columns it contains.

Hide code cell content

from pandas import merge

df_museums = read_csv("notebook/museums.csv",
                      keep_default_na=False,
                      dtype={
                          "museum_id": "int",
                          "name": "string",
                          "city": "string",
                          "type": "string",
                          "founded": "int"
                      })

df_collections = read_csv("notebook/collections.csv",
                          keep_default_na=False,
                          dtype={
                              "artwork_id": "int",
                              "room": "string",
                              "condition": "string"
                          })

df_art_collections = merge(df_artworks, df_collections, left_on="id", right_on="artwork_id")
df_art_collections
id title year genre height_cm width_cm museum_id artwork_id room condition
0 1 Ragazzo con canestra di frutta 1593 portrait 70 67 1 1 Sala VIII excellent
1 2 Bacchino malato 1593 mythological 67 53 1 2 Sala VIII good
2 3 Buona ventura 1594 genre 115 150 2 3 Sala Pinacoteca excellent
3 4 Riposo durante la fuga in Egitto 1596 religious 134 167 3 4 Sala di Caravaggio good
4 5 Canestra di frutta 1599 still life 47 65 4 5 Sala Federiciana excellent
5 6 Bacco 1597 mythological 95 85 5 6 Sala di Caravaggio good
6 7 Testa di Medusa 1597 mythological 60 55 5 7 Sala di Caravaggio excellent
7 8 Giuditta e Oloferne 1599 religious 145 195 6 8 Sala 20 good
8 9 Narciso 1599 mythological 110 92 6 9 Sala 20 fair
9 10 Vocazione di san Matteo 1600 religious 322 340 7 10 Cappella Contarelli excellent
10 11 Conversione di san Paolo 1601 religious 230 175 8 11 Cappella Cerasi excellent
11 12 Madonna dei Pellegrini 1605 religious 260 150 9 12 Navata sinistra good
12 13 Flagellazione di Cristo 1607 religious 286 213 10 13 Sala 12 fair
13 14 Sette opere di misericordia 1607 religious 390 260 11 14 Cappella principale good
14 15 Davide con la testa di Golia 1610 religious 125 101 1 15 Sala XIV fair

Exercise 3.2: Chain a second merge and save results#

Starting from the result of the previous exercise, perform a second merge() to add the museum names. Join on the museum_id column (present in both the artworks data and the museums data). Then, find all artworks in “excellent” condition and print their title and museum name. Finally, save the resulting DataFrame of excellent-condition artworks to a new CSV file called notebook/excellent_artworks.csv, using to_csv() with index=False to avoid writing the row index.

Hide code cell content

df_full = merge(df_art_collections, df_museums, on="museum_id")
df_excellent = df_full.query("condition == 'excellent'")
for idx, row in df_excellent.iterrows():
    print(row["title"], "-", row["name"])

df_excellent.to_csv("notebook/excellent_artworks.csv", index=False)
Ragazzo con canestra di frutta - Galleria Borghese
Buona ventura - Musei Capitolini
Canestra di frutta - Pinacoteca Ambrosiana
Testa di Medusa - Galleria degli Uffizi
Vocazione di san Matteo - San Luigi dei Francesi
Conversione di san Paolo - Santa Maria del Popolo

Part 4: Final challenge#

Exercise 4.1: Artworks per museum#

Using all three CSV files, produce a dictionary where each key is a museum name and each value is the number of artworks that museum holds. Print the result.


Summary#

In this lab, you practised:

  • Loading data: using read_csv() with keep_default_na=False and dtype to control how pandas interprets each column

  • Exploring data: iterating over rows with iterrows() and filtering with query()

  • Combining tables: using merge() to join DataFrames on shared columns

  • Saving results: writing DataFrames to CSV files with to_csv()


Additional resources#

Next lab

In the next lab, Modelling data with Python classes, you will learn to model data using Python classes, building on the concepts introduced in the Implementation of data models via Python classes chapter.