Descriptive statistics and graphs about data using Pandas

In this chapter, we show how to use Pandas to calculate basic statistics of a dataset and to present the data visually using automatically generated graphs.

What data are about

When you receive a new dataset (such as the one included in this chapter), the first thing you have to do is to analyse it to understand what its data are about, how they have been organised, what the type of each column is, and whether there are any null objects included in it (e.g. empty cells). In a previous tutorial, we used an input parameter of the function read_csv (i.e. keep_default_na set to False) to rewrite empty cell values as empty strings (i.e. ""). However, by doing so, we may miss some relevant information about the dataset that we should know from the beginning. Let us see this in practice with an example, using the DataFrame method info, which gives us a summary of the data frame:

from pandas import read_csv

publications = read_csv("notebook/04-publications.csv", keep_default_na=False)
publications.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   doi                995 non-null    object
 1   title              995 non-null    object
 2   publication year   995 non-null    int64 
 3   publication venue  995 non-null    object
 4   type               995 non-null    object
 5   issue              995 non-null    object
 6   volume             995 non-null    object
dtypes: int64(1), object(6)
memory usage: 54.5+ KB

As you can see from the text printed on screen, this seems a perfect dataset: 995 rows, entries organised in seven columns, and every cell containing a non-null value. However, is this really the case? The problem here is that, by not using the default mechanism for handling empty cells, the system does not recognise them as empty, but rather as containing something (e.g. the empty string ""), which indeed does not contain any character but is, still, a value associated with a cell.

Thus, as a suggestion, when you approach a dataset with Pandas for the very first time, let the system use its own default ways to handle situations such as empty cells, and observe what this may mean. The following code shows the same description reported above, leaving the system to handle empty cells as it prefers:

publications = read_csv("notebook/04-publications.csv")
publications.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   doi                995 non-null    object
 1   title              989 non-null    object
 2   publication year   995 non-null    int64 
 3   publication venue  981 non-null    object
 4   type               995 non-null    object
 5   issue              505 non-null    object
 6   volume             970 non-null    object
dtypes: int64(1), object(6)
memory usage: 54.5+ KB

As you can see, the description is now slightly different. Indeed, the only columns that always have a non-null value specified are doi, publication year and type, while the other columns have, here and there, some cells left unspecified - which is reasonable, if you think about it. For instance, a book chapter does not have any issue or volume and, thus, the related cells must be empty in such a record.
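The per-column null counts shown by info can also be computed directly, by combining the methods isna and sum. A minimal sketch, using a small made-up data frame (the column names mirror the dataset, the values are invented):

```python
from pandas import DataFrame

# A small made-up data frame mimicking the publications dataset,
# where None marks an empty cell (Pandas stores it as NaN)
pubs = DataFrame({
    "doi": ["10.1/a", "10.1/b", "10.1/c"],
    "title": ["First", None, "Third"],
    "issue": ["1", None, None],
})

# isna() marks each empty cell with True; sum() counts them per column
null_counts = pubs.isna().sum()
print(null_counts)
```

Subtracting these counts from the total number of rows gives exactly the non-null counts reported by info.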

Once you have got an idea of what kind of columns are included in the data frame, we can move on to asking for more information about the values of each column. To retrieve this information, we use the method describe. If you want to obtain an overall view, contained in a new data frame describing the one on which the method is called, the suggestion is to call this method with the optional named parameter include set to "all", as shown in the following excerpt:

publications.describe(include="all")
                    doi                      title  publication year        publication venue             type issue volume
count               995                        989        995.000000                      981              995   505    970
unique              995                        988               NaN                      392                5    75    279
top     10.1002/cfg.304  Transformation toughening               NaN  Materials Science Forum  journal article     1   2006
freq                  1                          2               NaN                      169              970    84     92
mean                NaN                        NaN       1995.788945                      NaN              NaN   NaN    NaN
std                 NaN                        NaN         18.224005                      NaN              NaN   NaN    NaN
min                 NaN                        NaN       1886.000000                      NaN              NaN   NaN    NaN
25%                 NaN                        NaN       1993.000000                      NaN              NaN   NaN    NaN
50%                 NaN                        NaN       2003.000000                      NaN              NaN   NaN    NaN
75%                 NaN                        NaN       2006.000000                      NaN              NaN   NaN    NaN
max                 NaN                        NaN       2012.000000                      NaN              NaN   NaN    NaN

The data frame above provides a plethora of different statistics about each single column of the original data; some of them apply to certain columns while others do not. For instance, all the statistics about numeric manipulations (mean, std, min, etc.) do not apply to strings, and thus a NaN is returned in these cases.

Some of these statistics are very useful, and allow you to understand something about the data without looking at all of them. For instance:

  • there are two publications sharing the same title, that is “Transformation toughening”;

  • some of the publications (6) do not have a title associated with them;

  • while all the publications have a type specified, overall these types come from 5 distinct values only (they seem to highlight descriptive categories);

  • the oldest publication was published in 1886 while the newest in 2012.

Looking at this information, one could be curious about how two publications with the same title can exist. To get more information, we can run a query catching all the publications that have the title “Transformation toughening” in common:

publications.query('title == "Transformation toughening"')
                    doi                      title  publication year             publication venue             type issue volume
731  10.1007/bf00809059  Transformation toughening              1982  Journal of Materials Science  journal article     1     17
732  10.1007/bf00809057  Transformation toughening              1982  Journal of Materials Science  journal article     1     17

Indeed, these two publications may seem to be the same one, since they share all the other data except the DOI. Thus, one possibility would be to use the DOI resolver (https://doi.org) to see to which kind of entities these DOIs (i.e. https://doi.org/10.1007/bf00809059 and https://doi.org/10.1007/bf00809057) actually refer. Once you have seen them, are they the same entity? Hint: look at the subtitle and the number of pages…
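Duplicated titles like these can also be spotted without knowing any title in advance, using the Series method duplicated. A sketch on a small made-up data frame:

```python
from pandas import DataFrame

# A made-up data frame where one title appears twice
pubs = DataFrame({
    "doi": ["10.1/a", "10.1/b", "10.1/c"],
    "title": ["Transformation toughening",
              "Transformation toughening",
              "Another title"],
})

# keep=False marks every row whose title occurs more than once,
# so both copies of a duplicated title are returned
duplicated_titles = pubs[pubs["title"].duplicated(keep=False)]
print(duplicated_titles)
```

With the default keep="first", only the second and later occurrences would be marked, which is useful when you want to drop duplicates rather than inspect them.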

In addition to the statistics provided in the previous data frame, other statistics can be calculated using the data in the columns of the data frame. For instance, another important statistic for numeric values is the median:

print("-- Median value of the publication years in the data")
print(publications["publication year"].median())
-- Median value of the publication years in the data
2003.0

The Series method median calculates this important statistic, which enables one to identify the central value of a series of numbers while partially limiting the effect of outliers in the series. Indeed, the mean calculated in the data frame above is lower than the median, since it is affected by some very low publication dates (e.g. 1886), which do not affect the median at all.
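The effect can be reproduced on a tiny made-up series: a single very old year drags the mean down, while the median stays on the central value:

```python
from pandas import Series

# Four recent years plus one outlier
years = Series([1886, 2003, 2005, 2006, 2007])

print(years.mean())    # 1981.4 - pulled down by the outlier 1886
print(years.median())  # 2005.0 - the central value, unaffected
```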

Another useful operation one can run on a series is provided by the method unique. This method identifies all the distinct values in a series that, in principle, can contain repeated items. For instance, it can be used to identify the categories in the column type:

print("-- Categories describing types of publications")
publications["type"].unique()
-- Categories describing types of publications
array(['journal article', 'book', 'proceedings article', 'book chapter',
       'monograph'], dtype=object)
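If only the number of distinct values is needed (the unique row in the describe table above), the companion Series method nunique returns it directly. A sketch on a made-up series:

```python
from pandas import Series

# A made-up series of publication types, with one repeated value
types = Series(["journal article", "book", "journal article", "monograph"])

print(types.unique())   # the distinct values, in order of first appearance
print(types.nunique())  # how many distinct values there are
```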

Drawing data

In addition to generating data frames with descriptive statistics, Pandas also makes available methods to draw data in simple graphs, such as line charts, bar charts, etc. For instance, taking as an example the types of publications described in the previous code, we could be interested in understanding how many publications of each type are included in the data. To do so, we first have to retrieve this number for each type using the Series method value_counts, as follows:

type_count = publications["type"].value_counts()
type_count
type
journal article        970
book                    11
book chapter            11
proceedings article      2
monograph                1
Name: count, dtype: int64

The method value_counts, applied to a series of strings, returns the number of times each string appears in the series, where the unique strings of the series become the index of the new series. These kinds of series can be plotted easily as a bar chart, where the index labels are the categories shown on the x-axis, while the number of times each string appears in the original series is the value highlighted on the y-axis. This diagram is plotted using the method plot, specifying the optional named parameter kind set to "bar", as follows:

type_count.plot(kind="bar")
<Axes: xlabel='type'>
[Figure: bar chart of the number of publications per type]
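The plot method accepts further optional parameters to polish the chart, and returns a Matplotlib Axes object that can be adjusted afterwards. A minimal sketch (the title and label texts, as well as the counts, are made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without opening a window
from pandas import Series

# Made-up counts standing in for the value_counts result
type_count = Series({"journal article": 970, "book": 11, "book chapter": 11})

# title is passed to plot; the y-axis label is set on the returned Axes
ax = type_count.plot(kind="bar", title="Publications per type")
ax.set_ylabel("number of publications")
```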

We can use a similar approach to understand, for instance, what the top ten venues are, considering all the publications in the dataset. In this case, we use the method value_counts again and then select the first 10 rows, as follows:

best_venues = publications["publication venue"].value_counts()[:10]
best_venues
publication venue
Materials Science Forum                                           169
Key Engineering Materials                                          98
International Journal of Mathematics and Mathematical Sciences     34
Mediators of Inflammation                                          15
Applied Physics Letters                                            14
Journal of Applied Mathematics and Stochastic Analysis             14
Journal of the American Ceramic Society                            14
Physical Review Letters                                            10
Mathematical Problems in Engineering                               10
Discrete Dynamics in Nature and Society                             9
Name: count, dtype: int64

Again, as before, we plot it as a horizontal bar chart using the same command shown in the previous example, but setting the parameter kind to "barh":

best_venues.plot(kind="barh")
<Axes: ylabel='publication venue'>
[Figure: horizontal bar chart of the top ten publication venues]

We can use a different plot when we deal with time series, i.e. series of data that are organised temporally (for instance, year by year). For example, we could be interested in understanding how many publications were published in each year. To retrieve this information, we can use the method value_counts again, but this time applied to the column publication year, as follows:

publications_per_year = publications["publication year"].value_counts()
publications_per_year
publication year
2006    183
2007     92
2005     65
2008     56
2003     51
       ... 
1916      1
1911      1
1909      1
1954      1
1919      1
Name: count, Length: 85, dtype: int64

As you can see, the series contains the number of publications year by year, sorted in descending order, from the year with the most publications to the one with the fewest. In order to draw all these data in the correct temporal order, we first need to sort them in ascending order by the index labels (i.e. the years of publication). To do so, we can use the Series method sort_index to generate a new series ordered as mentioned:

publications_per_year_sorted = publications_per_year.sort_index()
publications_per_year_sorted
publication year
1886     1
1893     1
1894     2
1895     1
1902     1
        ..
2007    92
2008    56
2009    24
2011     7
2012     1
Name: count, Length: 85, dtype: int64
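The difference between the two orderings can be checked on a small made-up series: sort_index reorders by the index labels (the years), not by the counts:

```python
from pandas import Series

# Made-up per-year counts, initially ordered by count,
# as value_counts would return them
counts = Series({2006: 183, 2003: 51, 1886: 1})

sorted_counts = counts.sort_index()  # reorder by year, ascending
print(sorted_counts.index.tolist())  # [1886, 2003, 2006]
```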

Then, finally, we can plot this new series using a simple line chart (the default for the plot method), as follows:

publications_per_year_sorted.plot()
<Axes: xlabel='publication year'>
[Figure: line chart of the number of publications per year]