Introduction to Pandas

Introduction to Pandas#

In this chapter, we will see the basics of one of the most useful and used tool in Data Science projects, i.e. Python Pandas.

What is Pandas#

Pandas is a powerful tool developed for data analysis and data manipulation which is used in several Data Science projects due to its flexibility. It is accompanied by a great user guide, tutorials, a cookbook, an API documentation, other free books (e.g. Python Data Science Handbook), and several articles on the topic (e.g. those available in Programming Historian).

The official website makes available a “Getting started” guide to show how to install and use it. Anyway, you can install Pandas in your machine using the pip command as follows:

pip install pandas

Among the various things, Pandas introduces two new classes of objects that are used to handle any kind of data in tabular form. They are the class Series and the class DataFrame, described in the following subsections.

What is a Series#

A series is a one-dimensional array (i.e. it acts as a list) of objects of any data type (integers, strings, floating point numbers, Python objects, etc.). Each item in the series is indexed by a specific label (it can be an integer, a string, etc.), that can be used to access such an item. If no index is specified, the class Series will create such an index automatically using non-negative numbers (i.e. starting counting elements from 0).

index (the label)	element (the value)
0	Ron
1	Hermione
2	Harry
3	Tom
4	James
5	Lily
6	Severus
7	Sirius

A new series in Pandas can be created as follows:

from pandas import Series

my_series = Series(["Ron", "Hermione", "Harry", "Tom", "James", "Lily", "Severus", "Sirius"])
print(my_series)

       Ron
  Hermione
     Harry
       Tom
     James
      Lily
   Severus
    Sirius
dtype: object

As you can see, by printing the series on screen you get several information. The first column defines the indexes used to label each element of the series, which are listed in the second column. Finally, there is an indication of which kind of object are included in the series. In particular, in Pandas, the object data type (i.e. dtype) is used to define series that are made of string or mixed type objects.

There are, of course, other data types that can be used in series (and, sometimes, automatically inferred by Pandas). If you already know that all the values of a certain series belong to the same data type, e.g. string as in the example above, you can force Pandas to interpret them in such a way by specifying the data type with the input named parameter dtype in the constructor. For instance, to list all the values of the previous series as strings, the parameter dtype set to the value string must be specified. In addition, it is also possible to assign a name (i.e. a string) to the series, using the input named parameter name. The use of both dtype and name are shown as follows:

my_series = Series(["Ron", "Hermione", "Harry", "Tom", "James", "Lily", "Severus", "Sirius"], 
                   dtype="string", name="given name")
print(my_series)

       Ron
  Hermione
     Harry
       Tom
     James
      Lily
   Severus
    Sirius
Name: given name, dtype: string

It is possible to use the slicing mechanism similar to what implemented in Python list (i.e. <series>[<start>:<end>] to create subseries on the fly – it works even when the index labels are not integers! In addition, one can use either the method get or, alternatively, the instruction <series>[<index>] (it is similar to that available in Python dictionaries), taking in input an index label, to retrieve the value at the input index, as show in the following excerpt:

sub_series = my_series[1:6]
print("The new subseries is:")
print(sub_series)

print("\nThe element at index 5 of the new subseries is:")
print(sub_series[5])

The new subseries is:
1    Hermione
2       Harry
3         Tom
4       James
5        Lily
Name: given name, dtype: string

The element at index 5 of the new subseries is:
Lily

While it may seem as a list, a series in Pandas has several additional properties and methods that enable one to access and modify the item in the series in very different ways. Thus, it is more powerful than a simple Python list, and it is the basic structure adopted in the class described in the following section.

What is a DataFrame#

In Pandas, a DataFrame is a table. You can imagine it as a set of named series containing the same amount of elements, where each series defines a column of the table, and all the series share the same index labels (referring to the rows of the table).

index (label for rows)	column name 1 (a series)	column name 2 (another series)
0	Ron	Wisley
1	Hermione	Granger
2	Harry	Potter
3	Tom	Riddle
4	James	Potter
5	Lily	Potter
6	Severus	Snape
7	Sirius	Black

A new data frame in Pandas can be created as follows:

from pandas import DataFrame

my_dataframe = DataFrame({
    "given name" : my_series,
    "family name" : Series(
        ["Wisley", "Granger", "Potter", "Riddle", "Potter", "Potter", "Snape", "Black"], dtype="string")
})

print(my_dataframe)

  given name family name
      Ron      Wisley
 Hermione     Granger
    Harry      Potter
      Tom      Riddle
    James      Potter
     Lily      Potter
  Severus       Snape
   Sirius       Black

In this case (but, please, remeber that it is not the only way to create a new data frame), we use as input a dictionary where each key defines a column name and the series associated to such a key contains the values of each cell in that column. Then we can access the various columns in the data frame using the same approach seen for series, i.e. <dataframe>[<column name>], as shown in the following excerpt:

family_name_column = my_dataframe["family name"]
print(family_name_column)

   Wisley
  Granger
   Potter
   Riddle
   Potter
   Potter
    Snape
    Black
Name: family name, dtype: string

Selecting a column returns a series defining that column that share the column name and the indexes as specified in the data frame. Similarly, using the instruction <dataframe>.loc[<index label>], that takes in input an index label, returns the series defining the row at that input index, as shown in the following example:

third_row = my_dataframe.loc[2]
print(third_row)

given name      Harry
family name    Potter
Name: 2, dtype: string

As shown in the code above, it returns a series which has the column names of the original data frame as index labels of the series, the name as the index label of the data frame row selected, and the data type of the row derived from the various data types of the data frame columns.

Finally, as seen with the series, also data frame can be sliced (by rows) using the indentical approach introduced in the series. For instance, the following code shows how to create a new data frame taking a selection of the rows:

print("The new subdataframe is:")
sub_dataframe = my_dataframe[1:6]
print(sub_dataframe)

print("\nThe row at index 2 of the new subdataframe is:")
print(sub_dataframe.loc[2])

The new subdataframe is:
  given name family name
1   Hermione     Granger
2      Harry      Potter
3        Tom      Riddle
4      James      Potter
5       Lily      Potter

The row at index 2 of the new subdataframe is:
given name      Harry
family name    Potter
Name: 2, dtype: string

Also a data frame in Pandas has several additional properties and methods that enable one to access and modify the cells in the data frame.

How to load data into Pandas#

Pandas makes available several functions to load data stored in different formats. Indeed, there is a particular function that we can use to load data from CSV representations of tabular data (as those introduced in Chapter 15-what-is-a-datum.md), i.e. the function read_csv.

The method read_csv takes in input a file path and returns a DataFrame representing such tabular data, as shown as follows:

from pandas import read_csv

df_publications = read_csv("notebook/01-publications.csv")
print(df_publications)

                             doi  \
              10.1002/cfg.304   
10.1016/s1367-5931(02)00332-0   
   10.1002/9780470291092.ch20   

                                               title  publication year  \
Development of Computational Tools for the Inf...              2003   
In vitro selection as a powerful tool for the ...              2002   
Mechanisms of Toughening in Ceramic Matrix Com...              1981   

  publication venue             type  issue  volume  
       1531-6912  journal article    4.0     4.0  
       1367-5931  journal article    3.0     6.0  
   9780470291092     book chapter    NaN     NaN  

As you can see in the code above, the print function does not provide an appropriate visualisation of the data frame in Jupyter Lab, mainly because the content of its columns is more extensive than the example before. In Jupyter, it is possible to have a good preview of a data frame by simply name the variable in a runnable code, as follows:

df_publications

	doi	title	publication year	publication venue	type	issue	volume
0	10.1002/cfg.304	Development of Computational Tools for the Inf...	2003	1531-6912	journal article	4.0	4.0
1	10.1016/s1367-5931(02)00332-0	In vitro selection as a powerful tool for the ...	2002	1367-5931	journal article	3.0	6.0
2	10.1002/9780470291092.ch20	Mechanisms of Toughening in Ceramic Matrix Com...	1981	9780470291092	book chapter	NaN	NaN

By looking at the data frame, it appears clear that there is something odd in the data types shown, in particular in the columns issue and volume. In these columns there are two main issues. The first one concerns the data types associated to the column. Indeed, it seems that Pandas has interpreted automatically these columns as floating numbers, while they should be made of strings! This can be conformed by printing the data type (attribute dtype) associated to one of these columns, for instance:

print(df_publications["issue"].dtype)

float64

The other issue concerns that strange value in the last row, i.e. NaN. This special object is used when there is a missing data in a cell. However, once may expect that, since in the example we are dealing with string, we should simply use an empty string to represent such a missing data instead that such special object.

In order to avoid this behaviour, it may be neccessary to use the method read_csv specifying an additional input named parameter, i.e. dtype, which enables the specification of a dictionary where the keys are column names, while the values are the strings representing the data type for each column. Instead, to force Pandas to use an empty string as default for string-based column in case of missing values, it is enough to tell the function read_csv not to use the default NaN for missing value by setting the input named parameter keep_default_na to False.

The following code shows how to specify such parameters, showing on screen how the new data frame has been modified:

df_publications = read_csv("notebook/01-publications.csv", 
                           keep_default_na=False,
                           dtype={
                               "doi": "string",
                               "title": "string",
                               "publication year": "int",
                               "publication venue": "string",
                               "type": "string",
                               "issue": "string",
                               "volume": "string"
                           })
df_publications

	doi	title	publication year	publication venue	type	issue	volume
0	10.1002/cfg.304	Development of Computational Tools for the Inf...	2003	1531-6912	journal article	4	4
1	10.1016/s1367-5931(02)00332-0	In vitro selection as a powerful tool for the ...	2002	1367-5931	journal article	3	6
2	10.1002/9780470291092.ch20	Mechanisms of Toughening in Ceramic Matrix Com...	1981	9780470291092	book chapter

Iterating over a DataFrame and a Series#

Being a table, there are two possible strategies for iterating over a DataFrame object: row iteration and column iteration. These two can be achieved by means of two distinct methods of the class DataFrame: the method iterrows and the method items.

The method iterrows is used to retrieve a list-like structure where each item has two elements: the index label and a series representing the row related to that index.

for idx, row in df_publications.iterrows():
    print("\nThe index of the current row is", idx)
    print("The content of the row is as follows:")
    print(row)

The index of the current row is 0
The content of the row is as follows:
doi                                                    10.1002/cfg.304
title                Development of Computational Tools for the Inf...
publication year                                                  2003
publication venue                                            1531-6912
type                                                   journal article
issue                                                                4
volume                                                               4
Name: 0, dtype: object

The index of the current row is 1
The content of the row is as follows:
doi                                      10.1016/s1367-5931(02)00332-0
title                In vitro selection as a powerful tool for the ...
publication year                                                  2002
publication venue                                            1367-5931
type                                                   journal article
issue                                                                3
volume                                                               6
Name: 1, dtype: object

The index of the current row is 2
The content of the row is as follows:
doi                                         10.1002/9780470291092.ch20
title                Mechanisms of Toughening in Ceramic Matrix Com...
publication year                                                  1981
publication venue                                        9780470291092
type                                                      book chapter
issue                                                                 
volume                                                                
Name: 2, dtype: object

If one wants to iterate also over a series representing a row, keeping track of the index labels of the series representing the row (i.e. the column names), one can use the method items of the class Series:

for row_idx, row in df_publications.iterrows():
    print("\nRow index", row_idx)
    for item_idx, item in row.items():
        print(item_idx, "-->", item)

Row index 0
doi --> 10.1002/cfg.304
title --> Development of Computational Tools for the Inference of Protein Interaction Specificity Rules and Functional Annotation Using Structural Information
publication year --> 2003
publication venue --> 1531-6912
type --> journal article
issue --> 4
volume --> 4

Row index 1
doi --> 10.1016/s1367-5931(02)00332-0
title --> In vitro selection as a powerful tool for the applied evolution of proteins and peptides
publication year --> 2002
publication venue --> 1367-5931
type --> journal article
issue --> 3
volume --> 6

Row index 2
doi --> 10.1002/9780470291092.ch20
title --> Mechanisms of Toughening in Ceramic Matrix Composites
publication year --> 1981
publication venue --> 9780470291092
type --> book chapter
issue --> 
volume --> 

The method items of the class DataFrame is used to retrieve a list-like structure where each item has two elements: the column name and a series representing the related column, as shown in the following excerpt:

for column_name, column in df_publications.items():
    print("\nThe name of the current column is", column_name)
    print("The content of the column is as follows:")
    print(column)

The name of the current column is doi
The content of the column is as follows:
0                  10.1002/cfg.304
1    10.1016/s1367-5931(02)00332-0
2       10.1002/9780470291092.ch20
Name: doi, dtype: string

The name of the current column is title
The content of the column is as follows:
0    Development of Computational Tools for the Inf...
1    In vitro selection as a powerful tool for the ...
2    Mechanisms of Toughening in Ceramic Matrix Com...
Name: title, dtype: string

The name of the current column is publication year
The content of the column is as follows:
0    2003
1    2002
2    1981
Name: publication year, dtype: int64

The name of the current column is publication venue
The content of the column is as follows:
0        1531-6912
1        1367-5931
2    9780470291092
Name: publication venue, dtype: string

The name of the current column is type
The content of the column is as follows:
0    journal article
1    journal article
2       book chapter
Name: type, dtype: string

The name of the current column is issue
The content of the column is as follows:
0    4
1    3
2     
Name: issue, dtype: string

The name of the current column is volume
The content of the column is as follows:
0    4
1    6
2     
Name: volume, dtype: string

## How to store data with Pandas

Pandas makes available a specific method (i.e. to_csv) to its main classes, i.e. Series and DataFrame, to enable one to store them in the filesystem. In the DataFrame class, the method to_csv takes in input the file path where to store the CSV file representing the data frame as shown as follows:

df_publications.to_csv("notebook/03-publications.csv")

However, the method to_csv called as shown above will store also an additional column at the beginning, i.e. that related with the index labels for each row. In order to avoid to preserve the index, it is possible to set the input named parameter index to False, as shown in the following excerpt:

df_publications.to_csv("notebook/03-publications_no_index.csv", index=False)

Main operations with DataFrame#

Pandas makes available several operations for indexing, selecting, merging, joining, concatenating, and comparing data in a data frame. In the following section, we introduce two of them, but several additional operations are available in the documentation linked above.

Querying#

Pandas has several ways enabling querying a data frame and returning a selections of its rows. Among the various methods, one extremely useful is the method query, that takes in input a string representing an expression for querying the data frame and returns a new data frame compliant with the query.

The expression can be a combination of boolean expressions and comparisons, that enable to filter rows according to the values of its cells. For instance, to get all the rows that are journal articles, one can run the following query:

df_publications.query("type == 'journal article'")

	doi	title	publication year	publication venue	type	issue	volume
0	10.1002/cfg.304	Development of Computational Tools for the Inf...	2003	1531-6912	journal article	4	4
1	10.1016/s1367-5931(02)00332-0	In vitro selection as a powerful tool for the ...	2002	1367-5931	journal article	3	6

In case we want to refer to columns with spaces, we must use the tick character (i.e. `) to enclose the name of the column. For instance, to get all the rows that have a publication date lesser than 2003, we can run the following query:

df_publications.query("`publication year` < 2003")

	doi	title	publication year	publication venue	type	issue	volume
1	10.1016/s1367-5931(02)00332-0	In vitro selection as a powerful tool for the ...	2002	1367-5931	journal article	3	6
2	10.1002/9780470291092.ch20	Mechanisms of Toughening in Ceramic Matrix Com...	1981	9780470291092	book chapter

It is also possible to combine queries by using the boolean operators and and or. For instance, to get all the journal articles published before 2003, we can run the following query:

df_publications.query("type == 'journal article' and `publication year` < 2003")

	doi	title	publication year	publication venue	type	issue	volume
1	10.1016/s1367-5931(02)00332-0	In vitro selection as a powerful tool for the ...	2002	1367-5931	journal article	3	6

Joining#

Joining two data frames into a new one according to some common value is a crucial operation to enable to answer more complex query, such as getting all the articles published in the journal named Current Opinion in Chemical Biology. Indeed, the data frame about publications we have considered so fare does not have any information about the name of the venues, nor their types. However, considering the data we used in the first tutorial, we know that such information is actually included in another CSV file entirely dedicated to venues, that we can load in pandas as follows:

df_venues = read_csv("notebook/01-venues.csv", 
                     keep_default_na=False,
                     dtype={
                         "id": "string",
                         "name": "string",
                         "type": "string"
                     })
df_venues  # draw the table in the notebook

	id	name	type
0	1531-6912	Comparative and Functional Genomics	journal
1	1367-5931	Current Opinion in Chemical Biology	journal
2	9780470291092	Proceedings of the 5th Annual Conference on Co...	book
3	1027-3662	Journal of Theoretical Medicine	journal

Thus, in order to run such a query, first we should ask to this new data frame which is the id associated to the journal named Current Openion in Chemical Biology, and then to ask the other data frame with publications to retrieve all the rows that have such an identifier as publication venue.

However, in Pandas this can be done in just one query if we join before the two data frames in a new one containing a combination of the two tables. Indeed, as you can obsever, the data frame of publications and that of venues share some values in common. Indeed, as mentioned above, the values specified in the publication venue column in the publications data frame recall those specified in the id column of the venue data frame. Thus, in principle, it is possible to join these two tables by considering that common values.

Pandas provides the function merge to perform such an operation. Among the various input parameters such a function can take in input, those we use in this example to join these two data frames are the data frames them self and the name of the columns in the first (named left) data frame and the second (named right) data frame to use for joining, specified by using the input named parameters left_on and right_on respectively, as shown in the following excerpt:

from pandas import merge

df_joined = merge(df_publications, df_venues, left_on="publication venue", right_on="id")
df_joined  # draw the table in the notebook

	doi	title	publication year	publication venue	type_x	issue	volume	id	name	type_y
0	10.1002/cfg.304	Development of Computational Tools for the Inf...	2003	1531-6912	journal article	4	4	1531-6912	Comparative and Functional Genomics	journal
1	10.1016/s1367-5931(02)00332-0	In vitro selection as a powerful tool for the ...	2002	1367-5931	journal article	3	6	1367-5931	Current Opinion in Chemical Biology	journal
2	10.1002/9780470291092.ch20	Mechanisms of Toughening in Ceramic Matrix Com...	1981	9780470291092	book chapter			9780470291092	Proceedings of the 5th Annual Conference on Co...	book

As you can see from the data frame above, all the rows of the publications data frame (the left data frame of the join) have been extended using the values specified in the venues data frame (the right data frame of the join) mapping the values in the columns publication year (in left) and id (in right). In addition, Pandas modifies the name of the columns that have the same name in both data frames of the join – indeed the columns type became type_x (refferring to left) and type_y (referring to right).

Having this new data frame, the original query we wanted to run becomes pretty easy to define:

df_joined.query("type_y == 'journal' and name == 'Current Opinion in Chemical Biology'")

	doi	title	publication year	publication venue	type_x	issue	volume	id	name	type_y
1	10.1016/s1367-5931(02)00332-0	In vitro selection as a powerful tool for the ...	2002	1367-5931	journal article	3	6	1367-5931	Current Opinion in Chemical Biology	journal