Data formats and methods for storing data in Python
In this chapter, we introduce two basic formats for dealing with data in Python: CSV and JSON.
Comma-separated values (CSV)
The first and simplest format you can use to store and load data in Python is Comma-Separated Values (CSV). In practice, each CSV file defines a table with a fixed number of columns, where each row represents a (subject) entity and each cell defines the (object) value associated with that entity via the predicate defined by the column label, if specified. While it is not mandatory to specify column labels, doing so makes a CSV file more understandable to humans and machines. These labels can be specified using the first row of a CSV file, defining a header for the represented table. An example of a table represented with a CSV is shown as follows.
|  | column1 | column2 | … | columnn |
|---|---|---|---|---|
| entity1 | value1 | value2 | … | valuen |
| entity2 | value1 | value2 | … | valuen |
| … | … | … | … | … |
| entitym | value1 | value2 | … | valuen |
The cells in a row are separated by a comma (,). In case a comma is part of the value of a cell, it is possible to escape it by putting the cell value between quotes ("). Finally, in case a cell value is defined using quotes and one or more quotes are included in that value, these must be escaped by doubling them (""). The following table and the related CSV source show how to define cell values in CSV when these situations happen.
| column name | another name, with a comma |
|---|---|
| a value | a value, with a comma |
| a quoted "value" | a quoted "value", with a comma |
CSV source:
column name,"another name, with a comma"
a value,"a value, with a comma"
a quoted "value","a quoted ""value"", with a comma"
Python has a dedicated library to handle this format, called csv. In order to understand how to use it, we can start by analysing two very simple CSV files, one containing publications and some of their basic metadata, and another with information about the venues where such publications have been published. To open them as source files, you can right-click on them in the left panel and select Open With -> Editor.
Opening a CSV file
In order to understand how these files are represented in Python, let us try to load one of them into a Python object using the function reader included in the module csv mentioned above. To do that, it is necessary to obtain a file object for a particular file using the built-in function open within a with statement, as shown in the following excerpt:
from csv import reader
with open("notebook/01-publications.csv", "r", encoding="utf-8") as f:
publications = reader(f)
The function open takes several parameters as input and returns a file object, i.e. a Python object used to interact with a file stored in the file system. It is highly suggested to specify at least the three used above, that are:
- the first positional parameter, which must contain the path of the file one wants to open;
- the second positional parameter, which is the mode used to open the file ("r" stands for read, "w" for write, etc.);
- the named parameter encoding, specifying the encoding used to open the file ("utf-8" should be used if you do not want to have issues).
In addition, all file objects one creates in Python to process files stored in the file system must also be closed once all the operations on the file are concluded. The keyword with, used together with the function open, handles the opening and closing of the file object automatically. In practice, once all the operations within the with block are executed, the related file object is closed. Finally, the file object opened using the function open is assigned to the variable that follows the keyword as, i.e. f in the example above. It is worth mentioning that the example just shown introduces how to open, in reading mode, any file in the file system, not only CSV files.
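The automatic closing can be observed directly. The following sketch uses a temporary file (created via the tempfile module, so as not to depend on the tutorial's own files) and the closed attribute of file objects to show that the file is open inside the with block and closed once the block ends.

```python
from tempfile import NamedTemporaryFile
from os import remove

# create a small temporary file to read back
tmp = NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("hello")
tmp.close()

with open(tmp.name, "r", encoding="utf-8") as f:
    inside_block = f.closed
    print(inside_block)  # False: the file is open inside the with block

print(f.closed)  # True: the file has been closed automatically

remove(tmp.name)  # removing the temporary file
```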
CSV reader
Once our file object has been defined, we can read its content interpreting it as a CSV document using the constructor reader included in the package csv, which is imported by means of the usual command:
from csv import reader
The constructor reader takes in input a file object and returns an object of type (i.e. class) _csv.reader, that enables one to iterate over the CSV document row by row. To check the actual type, you can use the built-in function type passing the object as input, and then printing it on screen using either the function print or, when we want to print a complex structure, the function pprint from the package pprint (a.k.a. pretty print), as shown running the following code:
from pprint import pprint
pprint(type(publications))
<class '_csv.reader'>
An object of the class _csv.reader behaves like a list and can be iterated over using a for each loop, as shown as follows:
with open("notebook/01-publications.csv", "r", encoding="utf-8") as f:
    publications = reader(f)
    for row in publications:
        pprint(row)
['doi',
'title',
'publication year',
'publication venue',
'type',
'issue',
'volume']
['10.1002/cfg.304',
'Development of Computational Tools for the Inference of Protein Interaction '
'Specificity Rules and Functional Annotation Using Structural Information',
'2003',
'1531-6912',
'journal article',
'4',
'4']
['10.1016/s1367-5931(02)00332-0',
'In vitro selection as a powerful tool for the applied evolution of proteins '
'and peptides',
'2002',
'1367-5931',
'journal article',
'3',
'6']
['10.1002/9780470291092.ch20',
'Mechanisms of Toughening in Ceramic Matrix Composites',
'1981',
'9780470291092',
'book chapter',
'',
'']
In case you want to skip the header of the table, if present, and start looking at the values directly, you can use the built-in function next, which takes as input any iterator-based object, such as our CSV reader, and advances it by one line:
with open("notebook/01-publications.csv", "r", encoding="utf-8") as f:
    publications = reader(f)
    next(publications)  # it skips the first row of the CSV table
    for row in publications:
        pprint(row)  # it prints all the rows except the header
['10.1002/cfg.304',
'Development of Computational Tools for the Inference of Protein Interaction '
'Specificity Rules and Functional Annotation Using Structural Information',
'2003',
'1531-6912',
'journal article',
'4',
'4']
['10.1016/s1367-5931(02)00332-0',
'In vitro selection as a powerful tool for the applied evolution of proteins '
'and peptides',
'2002',
'1367-5931',
'journal article',
'3',
'6']
['10.1002/9780470291092.ch20',
'Mechanisms of Toughening in Ceramic Matrix Composites',
'1981',
'9780470291092',
'book chapter',
'',
'']
It is worth mentioning that, once you have iterated over it, the CSV reader is consumed and does not allow you to iterate over the same object twice. For instance, see the following execution, where the same reader is iterated twice:
with open("notebook/01-publications.csv", "r", encoding="utf-8") as f:
    publications = reader(f)

    print("-- First iteration")
    for row in publications:
        pprint(row)  # all the rows will be printed, one by one

    print("\n-- Second iteration")
    for row in publications:
        pprint(row)  # no row will be printed
-- First iteration
['doi',
'title',
'publication year',
'publication venue',
'type',
'issue',
'volume']
['10.1002/cfg.304',
'Development of Computational Tools for the Inference of Protein Interaction '
'Specificity Rules and Functional Annotation Using Structural Information',
'2003',
'1531-6912',
'journal article',
'4',
'4']
['10.1016/s1367-5931(02)00332-0',
'In vitro selection as a powerful tool for the applied evolution of proteins '
'and peptides',
'2002',
'1367-5931',
'journal article',
'3',
'6']
['10.1002/9780470291092.ch20',
'Mechanisms of Toughening in Ceramic Matrix Composites',
'1981',
'9780470291092',
'book chapter',
'',
'']
-- Second iteration
Casting a CSV reader into a list
If you want to iterate over the same rows more than once, one possibility is to convert your reader into a list object using the list constructor:
with open("notebook/01-publications.csv", "r", encoding="utf-8") as f:
    publications = reader(f)
    publications_list = list(publications)
From now on, even though the file object is closed after executing all the instructions within the with block, you can always access (and iterate over) the rows defined in the original CSV document, since you have stored them within a Python list, as shown in the following excerpt:
print("-- First execution")
for row in publications_list:
pprint(row)
print("\n-- Second execution")
for row in publications_list:
pprint(row)
-- First execution
['doi',
'title',
'publication year',
'publication venue',
'type',
'issue',
'volume']
['10.1002/cfg.304',
'Development of Computational Tools for the Inference of Protein Interaction '
'Specificity Rules and Functional Annotation Using Structural Information',
'2003',
'1531-6912',
'journal article',
'4',
'4']
['10.1016/s1367-5931(02)00332-0',
'In vitro selection as a powerful tool for the applied evolution of proteins '
'and peptides',
'2002',
'1367-5931',
'journal article',
'3',
'6']
['10.1002/9780470291092.ch20',
'Mechanisms of Toughening in Ceramic Matrix Composites',
'1981',
'9780470291092',
'book chapter',
'',
'']
-- Second execution
['doi',
'title',
'publication year',
'publication venue',
'type',
'issue',
'volume']
['10.1002/cfg.304',
'Development of Computational Tools for the Inference of Protein Interaction '
'Specificity Rules and Functional Annotation Using Structural Information',
'2003',
'1531-6912',
'journal article',
'4',
'4']
['10.1016/s1367-5931(02)00332-0',
'In vitro selection as a powerful tool for the applied evolution of proteins '
'and peptides',
'2002',
'1367-5931',
'journal article',
'3',
'6']
['10.1002/9780470291092.ch20',
'Mechanisms of Toughening in Ceramic Matrix Composites',
'1981',
'9780470291092',
'book chapter',
'',
'']
CSV table as a list of lists
As you can see from the output of the examples above, each row of the CSV table is represented, in Python, as a list of strings. As such, the overall table, after converting it into a list using the related constructor, is defined as a list of lists of strings, following the pattern below (using as example the table introduced at the beginning of this tutorial):
my_list = [
[ "column name", "another name, with a comma" ], # row 1
[ "a value", "a value, with a comma" ], # row 2
[ "a quoted \"value\"", "a quoted \"value\", with a comma" ] # row 3
]
pprint(my_list)
[['column name', 'another name, with a comma'],
['a value', 'a value, with a comma'],
['a quoted "value"', 'a quoted "value", with a comma']]
As you can see, since strings in Python can be created by enclosing their characters between double quotes (i.e. "), the only character to escape in the string is the double quote character itself, preceded by a backslash (i.e. \"). Alternatively, you could use the single quote character (i.e. ') for creating strings, avoiding the need to escape double quote characters, if any:
my_list = [
[ 'column name', 'another name, with a comma' ], # row 1
[ 'a value', 'a value, with a comma' ], # row 2
[ 'a quoted "value"', 'a quoted "value", with a comma' ] # row 3
]
pprint(my_list)
[['column name', 'another name, with a comma'],
['a value', 'a value, with a comma'],
['a quoted "value"', 'a quoted "value", with a comma']]
Since a table is a list of lists, it can be accessed and modified using the common methods available for the class list, as shown in the following excerpt:
# retrieving the second row in the table
second_row = publications_list[1]  # remember that item indexes start from 0
print("-- Second row")
pprint(second_row)
# retrieving the third item in the second row
third_item_second_row = second_row[2]
print("\n-- Third item in second row")
pprint(third_item_second_row)
# appending a new row at the end of the list
publications_list.append([
"10.1080/10273660500441324",
"Development of a Species-Specific Model of Cerebral Hemodynamics",
"2005",
"1027-3662",
"journal article",
"3",
"6"
])
print("\n-- Updated list")
pprint(publications_list)
-- Second row
['10.1002/cfg.304',
'Development of Computational Tools for the Inference of Protein Interaction '
'Specificity Rules and Functional Annotation Using Structural Information',
'2003',
'1531-6912',
'journal article',
'4',
'4']
-- Third item in second row
'2003'
-- Updated list
[['doi',
'title',
'publication year',
'publication venue',
'type',
'issue',
'volume'],
['10.1002/cfg.304',
'Development of Computational Tools for the Inference of Protein Interaction '
'Specificity Rules and Functional Annotation Using Structural Information',
'2003',
'1531-6912',
'journal article',
'4',
'4'],
['10.1016/s1367-5931(02)00332-0',
'In vitro selection as a powerful tool for the applied evolution of proteins '
'and peptides',
'2002',
'1367-5931',
'journal article',
'3',
'6'],
['10.1002/9780470291092.ch20',
'Mechanisms of Toughening in Ceramic Matrix Composites',
'1981',
'9780470291092',
'book chapter',
'',
''],
['10.1080/10273660500441324',
'Development of a Species-Specific Model of Cerebral Hemodynamics',
'2005',
'1027-3662',
'journal article',
'3',
'6']]
CSV writer
Once a table defined through a list of lists has been created or modified in Python, it can be necessary to store it into a CSV file. To do so, we can use the constructor writer included in the package csv, which must be imported. As we did for loading the content of a CSV file in Python, we use again the open function within a with statement, but the file path in the first parameter now indicates the file where to store the table, and we specify "w" as the mode to interact with the file, so as to create a new file object for writing, as shown as follows:
from csv import writer
# newline="" prevents the csv writer from adding extra blank lines on some platforms
with open("notebook/01-publications-modified.csv", "w", encoding="utf-8", newline="") as f:
    publications_modified = writer(f)
    publications_modified.writerows(publications_list)  # it writes all the rows in the list of lists
As shown in the code above, the constructor writer takes as input, again, a file object and returns an object of class _csv.writer. This class includes some methods to write new rows into the file pointed to by the file object. In particular, the method writerows can be used to write the table, defined as a list of lists (of strings), into the file.
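Besides writerows, the class also offers the method writerow, which writes a single row. The following sketch compares the two on an in-memory buffer (io.StringIO), using a small made-up table with two columns, so that the produced CSV can be printed directly:

```python
from csv import writer
from io import StringIO

# an in-memory buffer that behaves like a file object opened for writing
buffer = StringIO()
csv_writer = writer(buffer)

csv_writer.writerow(["doi", "title"])  # one row at a time (here, the header)
csv_writer.writerows([
    ["10.0000/example-1", "A title"],
    ["10.0000/example-2", "A title, with a comma"],
])  # many rows at once

# the value with a comma has been quoted automatically
print(buffer.getvalue())
```

Note that the writer applies the quoting rules seen at the beginning of this tutorial automatically: the cell containing a comma is emitted between quotes.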
DictReader and DictWriter
In the previous sections, we have seen how to load and store in Python a CSV table defined as a list of lists. There is, though, another approach that can be used to load and store CSV files in Python, which represents CSV tables as lists of dictionaries. In this case, each key in a dictionary is the label of one of the columns of the table, which must be specified. The class DictReader (which must be imported as usual) is used to load a CSV table in this form, as shown in the following excerpt:
from csv import DictReader
with open("notebook/01-publications-modified.csv", "r", encoding="utf-8") as f:
    publications_modified = DictReader(f)  # it is a reader operating as a list of dictionaries
    publications_modified_dict = list(publications_modified)  # casting the reader as a list

pprint(publications_modified_dict)
[{'doi': '10.1002/cfg.304',
'issue': '4',
'publication venue': '1531-6912',
'publication year': '2003',
'title': 'Development of Computational Tools for the Inference of Protein '
'Interaction Specificity Rules and Functional Annotation Using '
'Structural Information',
'type': 'journal article',
'volume': '4'},
{'doi': '10.1016/s1367-5931(02)00332-0',
'issue': '3',
'publication venue': '1367-5931',
'publication year': '2002',
'title': 'In vitro selection as a powerful tool for the applied evolution of '
'proteins and peptides',
'type': 'journal article',
'volume': '6'},
{'doi': '10.1002/9780470291092.ch20',
'issue': '',
'publication venue': '9780470291092',
'publication year': '1981',
'title': 'Mechanisms of Toughening in Ceramic Matrix Composites',
'type': 'book chapter',
'volume': ''},
{'doi': '10.1080/10273660500441324',
'issue': '3',
'publication venue': '1027-3662',
'publication year': '2005',
'title': 'Development of a Species-Specific Model of Cerebral Hemodynamics',
'type': 'journal article',
'volume': '6'}]
As you can see from the output of the execution of the code above, the list defined by casting the DictReader object, created by passing as input the file object as before, contains dictionaries, where each dictionary represents a row. The values of the cells of each row can be accessed by using the related key which is, as anticipated, one of the labels of the columns. It is worth mentioning that, in this case, the first row in the CSV table is always interpreted as the header of the table, and the list of dictionaries will contain only the values specified from the second row onwards. The following code shows some examples of how to interact with such a structure for accessing and modifying the table:
# retrieving the second row in the table
second_row = publications_modified_dict[1]  # remember that item indexes start from 0
print("-- Second row")
pprint(second_row)
# retrieving the value associated with the column 'title' in the second row
title_value_second_row = second_row["title"]
print("\n-- Value assigned to 'title' in second row")
print(title_value_second_row)
# appending a new row at the end of the list
publications_modified_dict.append({
"doi": "10.1080/10273660412331292260",
"title": "Amplified Molecular Binding of Prion Protein Homologues in Self-Progressive Injury of Neuronal Membranes and Trafficking Systems",
"publication year": "2003",
"publication venue": "1027-3662",
"type": "journal article",
"issue": "3-4",
"volume": "5"
})
print("\n-- Updated list of dictionaries")
pprint(publications_modified_dict)
-- Second row
{'doi': '10.1016/s1367-5931(02)00332-0',
'issue': '3',
'publication venue': '1367-5931',
'publication year': '2002',
'title': 'In vitro selection as a powerful tool for the applied evolution of '
'proteins and peptides',
'type': 'journal article',
'volume': '6'}
-- Value assigned to 'title' in second row
In vitro selection as a powerful tool for the applied evolution of proteins and peptides
-- Updated list of dictionaries
[{'doi': '10.1002/cfg.304',
'issue': '4',
'publication venue': '1531-6912',
'publication year': '2003',
'title': 'Development of Computational Tools for the Inference of Protein '
'Interaction Specificity Rules and Functional Annotation Using '
'Structural Information',
'type': 'journal article',
'volume': '4'},
{'doi': '10.1016/s1367-5931(02)00332-0',
'issue': '3',
'publication venue': '1367-5931',
'publication year': '2002',
'title': 'In vitro selection as a powerful tool for the applied evolution of '
'proteins and peptides',
'type': 'journal article',
'volume': '6'},
{'doi': '10.1002/9780470291092.ch20',
'issue': '',
'publication venue': '9780470291092',
'publication year': '1981',
'title': 'Mechanisms of Toughening in Ceramic Matrix Composites',
'type': 'book chapter',
'volume': ''},
{'doi': '10.1080/10273660500441324',
'issue': '3',
'publication venue': '1027-3662',
'publication year': '2005',
'title': 'Development of a Species-Specific Model of Cerebral Hemodynamics',
'type': 'journal article',
'volume': '6'},
{'doi': '10.1080/10273660412331292260',
'issue': '3-4',
'publication venue': '1027-3662',
'publication year': '2003',
'title': 'Amplified Molecular Binding of Prion Protein Homologues in '
'Self-Progressive Injury of Neuronal Membranes and Trafficking '
'Systems',
'type': 'journal article',
'volume': '5'}]
As before, once a table defined through a list of dictionaries has been created or modified, you can store it into a CSV file using the class DictWriter included in the package csv (to be imported, as usual). As we did before, we use again the open function within a with statement, but the file path in the first parameter indicates the file where to store the table, and we specify "w" as the mode to interact with the file, so as to create a new file object for writing, as shown as follows:
from csv import DictWriter
# newline="" prevents the csv writer from adding extra blank lines on some platforms
with open("notebook/01-publications-modified-dict.csv", "w", encoding="utf-8", newline="") as f:
    header = [  # the fields defining the columns must be explicitly specified in the desired order
        "doi", "title", "publication year", "publication venue", "type", "issue", "volume"]
    publications_modified = DictWriter(f, header)
    publications_modified.writeheader()  # the header must be explicitly created in the output file
    publications_modified.writerows(publications_modified_dict)  # it writes all the rows, as usual
However, the class DictWriter works in a slightly different way from _csv.writer. The main differences are:
1. the dictionaries representing the rows do not specify a precise order for the columns to be stored in the CSV file and, as such, it must be explicitly defined;
2. there is no header explicitly specified as a row of the table and, as such, it must be provided to the constructor of the class DictWriter and then written as the first thing in the file.
To address 1), we simply create a new list (the variable header in the code above) with all the column labels in the order they must appear in the final file. Instead, to address 2), it is sufficient to pass such a new list as the second parameter of the DictWriter constructor, after the file object where to store the table; then, it is necessary to write the header of the table by calling the method writeheader() before writing the rows with data into the file using the method writerows.
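Both differences can be seen at a glance on an in-memory buffer (io.StringIO), using a made-up two-column table: the field names passed to DictWriter fix the column order, writeheader emits them as the first row, and writerow accepts the keys of a dictionary in any order.

```python
from csv import DictWriter
from io import StringIO

# an in-memory buffer that behaves like a file object opened for writing
buffer = StringIO()

# the second parameter fixes the order of the columns in the output
dict_writer = DictWriter(buffer, ["doi", "title"])
dict_writer.writeheader()  # writes "doi,title" as the first row

# the keys of the dictionary can appear in any order
dict_writer.writerow({"title": "A title", "doi": "10.0000/example-1"})

print(buffer.getvalue())
```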
CSV dialects
In the previous sections, we showed how to use the classes and functions of the package csv in Python to handle CSV documents. It is worth mentioning, though, that CSV has several dialects that introduce small changes in the structure of a CSV document. For instance, a well-known dialect is named Tab-Separated Values (TSV). Here, the idea is to use the tab character to separate the cells of a row instead of the comma. As such, the comma does not have any specific meaning and can be safely used in cell values without escaping it with quote characters.
For instance, the very first example of CSV table introduced at the beginning of this tutorial can be represented in TSV as follows:
column name another name, with a comma
a value a value, with a comma
a quoted "value" a quoted "value", with a comma
Of course, the csv package also allows one to parse these additional kinds of formats. Indeed, all the constructors of reader and writer objects (i.e. reader, writer, DictReader and DictWriter) accept the optional named parameter dialect, which permits specifying the dialect to consider for either loading or storing the CSV-like table. For instance, the following code stores the table considered in the previous excerpt of code as a TSV file:
# newline="" prevents the csv writer from adding extra blank lines on some platforms
with open("notebook/01-publications-modified-dict.tsv", "w", encoding="utf-8", newline="") as f:
    header = [  # the fields defining the columns must be explicitly specified in the desired order
        "doi", "title", "publication year", "publication venue", "type", "issue", "volume"]
    publications_modified = DictWriter(f, header, dialect="excel-tab")  # adding the specific dialect
    publications_modified.writeheader()  # the header must be explicitly created in the output file
    publications_modified.writerows(publications_modified_dict)  # it writes all the rows, as usual
In the code above, we have specified a different output file in the with statement (i.e. the extension is now .tsv), and we have explicitly asked our DictWriter to use the tab-separated dialect introduced by Excel (i.e. dialect="excel-tab") to handle the table as a TSV file.
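The same dialect parameter also works when reading. The following sketch parses a tab-separated table with the "excel-tab" dialect; as a hedge, it uses an in-memory string (via io.StringIO) rather than the file just written, so that it stays self-contained:

```python
from csv import reader
from io import StringIO

# the TSV version of the first example of this tutorial, with explicit tabs
tsv_source = ("column name\tanother name, with a comma\n"
              "a value\ta value, with a comma\n")

# the same dialect used for writing is passed to the reader
rows = list(reader(StringIO(tsv_source), dialect="excel-tab"))
print(rows)
```

Note that the commas in the cell values needed no quoting at all, since the separator is now the tab character.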
JavaScript Object Notation (JSON)
Another format that is well known on the Web, since it is used in several different data interchange scenarios, is JavaScript Object Notation (JSON). It is a simple textual format for describing objects, which follows the key-value approach to specify data, where the key is always a term written within quotes, while the value can assume any of the following types:
- numbers (integers and floats), specified directly without any markup (e.g. 3 or 3.14);
- strings, specified between double quotes (e.g. "a string");
- booleans, i.e. the values true and false;
- objects, i.e. collections of key-value pairs specified within curly brackets, where each pair is separated by a comma (e.g. { "given name": "Silvio", "family name": "Peroni" });
- the null value, i.e. null, which mimics the None value in Python;
- arrays, i.e. lists of values (numbers, strings, booleans, objects, other arrays, etc.) listed between square brackets, where each item is separated by a comma (e.g. [ "a string", "another string", 4, 4.5, true ]).
Thus, unlike CSV documents, in which all the values are actually interpreted as strings, in a JSON document the values can have different types, as shown above. In addition, a JSON document does not necessarily contain one single object (using curly brackets), but can be defined as an array of objects, and each object can contain (as some of its values) other objects, organising a tree-like structure. An example of such a structure is shown in the exemplar JSON file provided in this tutorial, where all the publications and venues specified in the CSV files introduced at the very beginning of the tutorial have been reorganised according to the JSON syntax. As before, to open it as a source file, you can right-click on it in the left panel and select Open With -> Editor.
Loading a JSON document in Python
We need to use specific functions of the Python package json to load a JSON document in Python. In particular, we use the function load (which must be imported from the json package as usual) to parse a JSON document into Python objects.
from json import load
with open("notebook/01-publications-venues.json", "r", encoding="utf-8") as f:
    json_doc = load(f)
Differently from the handling of CSV documents, the load function (which still takes as input the file object of the file to load) does not return a reader, but rather directly provides the representation of the JSON document using the appropriate Python data structures.
It is a list of dictionaries!
Considering the exemplar JSON file we have used in the code above, we can print on screen the type of the object referred to by the json_doc variable to see what kind of class is used to represent such a document, as shown in the following excerpt:
print(type(json_doc))
<class 'list'>
As you can see, a list is used to map the JSON array, which is indeed the most natural choice. In particular, the kinds of JSON values mentioned above are converted into Python as follows:
- numbers (e.g. 3 or 3.14) and strings (e.g. "a string") in JSON are represented with the corresponding kinds of values in Python (i.e. 3, 3.14 and "a string");
- the true and false boolean values in JSON are represented in Python using True and False respectively;
- each JSON object is represented by a Python dictionary, having strings as keys and the appropriate kind of value assigned as values;
- JSON arrays, as already mentioned, are represented with Python lists.
Thus, you can act upon the JSON object loaded in Python as you do with the classes used to represent the various JSON values. For instance, in the following code, we show a specific item of the JSON array and add another object, which includes a new publication, to the list:
# retrieving the second item in the JSON array
second_item = json_doc[1]  # remember that item indexes start from 0
print("-- Second item")
pprint(second_item)
# retrieving the value associated with the key 'title' in the second item
title_value_second_item = second_item["title"]
print("\n-- Value assigned to 'title' in second item")
print(title_value_second_item)
# appending a new JSON object at the end of the list
json_doc.append({
"doi": "10.1080/10273660412331292260",
"title": "Amplified Molecular Binding of Prion Protein Homologues in Self-Progressive Injury of Neuronal Membranes and Trafficking Systems",
"publication year": 2003,
"publication venue": {
"id": [ "1027-3662" ],
"name": "Journal of Theoretical Medicine",
"type": "journal"
},
"type": "journal article",
"issue": "3-4",
"volume": "5"
})
print("\n-- Updated JSON array (a.k.a. list of dictionaries)")
pprint(json_doc)
-- Second item
{'doi': '10.1016/s1367-5931(02)00332-0',
'issue': '3',
'publication venue': {'id': ['1367-5931'],
'name': 'Current Opinion in Chemical Biology',
'type': 'journal'},
'publication year': 2002,
'title': 'In vitro selection as a powerful tool for the applied evolution of '
'proteins and peptides',
'type': 'journal article',
'volume': '6'}
-- Value assigned to 'title' in second item
In vitro selection as a powerful tool for the applied evolution of proteins and peptides
-- Updated JSON array (a.k.a. list of dictionaries)
[{'doi': '10.1002/cfg.304',
'issue': '4',
'publication venue': {'id': ['1531-6912'],
'name': 'Comparative and Functional Genomics',
'type': 'journal'},
'publication year': 2003,
'title': 'Development of Computational Tools for the Inference of Protein '
'Interaction Specificity Rules and Functional Annotation Using '
'Structural Information',
'type': 'journal article',
'volume': '4'},
{'doi': '10.1016/s1367-5931(02)00332-0',
'issue': '3',
'publication venue': {'id': ['1367-5931'],
'name': 'Current Opinion in Chemical Biology',
'type': 'journal'},
'publication year': 2002,
'title': 'In vitro selection as a powerful tool for the applied evolution of '
'proteins and peptides',
'type': 'journal article',
'volume': '6'},
{'doi': '10.1002/9780470291092.ch20',
'publication venue': {'id': ['9780470291092'],
'name': 'Proceedings of the 5th Annual Conference on '
'Composites and Advanced Ceramic Materials: '
'Ceramic Engineering and Science Proceedings',
'type': 'book'},
'publication year': 1981,
'title': 'Mechanisms of Toughening in Ceramic Matrix Composites',
'type': 'book chapter'},
{'doi': '10.1080/10273660412331292260',
'issue': '3-4',
'publication venue': {'id': ['1027-3662'],
'name': 'Journal of Theoretical Medicine',
'type': 'journal'},
'publication year': 2003,
'title': 'Amplified Molecular Binding of Prion Protein Homologues in '
'Self-Progressive Injury of Neuronal Membranes and Trafficking '
'Systems',
'type': 'journal article',
'volume': '5'}]
Storing a JSON document into a file
We use the function dump of the json package (to be imported as usual) to store a dictionary or an array of values into a JSON file, as shown in the following excerpt:
from json import dump
with open("notebook/01-publications-venues-modified.json", "w", encoding="utf-8") as f:
    dump(json_doc, f, ensure_ascii=False, indent=4)
The dump function takes as input two mandatory positional parameters, i.e. the Python representation of a JSON document as the first parameter and the file object referring to the file where to store the JSON document as the second. In addition, it can take several other optional named parameters, two of which are strongly suggested and have been used in the code above.
The parameter ensure_ascii (assigned to True by default) is responsible for keeping every string value compliant with the ASCII character encoding, which results in escaping all non-ASCII characters (ASCII defines only 128 characters). This is very undesirable when the JSON object we want to store contains natural language text since, for instance, all the characters with accents (e.g. "è") will be encoded in a different way (in the example, "\u00e8", that is the Unicode code point of the character "è"). That is why the code above sets the ensure_ascii input parameter to False: to avoid such escaping, preserving the original characters as they are (i.e. encoded in UTF-8).
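This behaviour can be observed with the function dumps of the json package, the variant of dump that returns a string instead of writing to a file, applied to the accented character mentioned above:

```python
from json import dumps

# with ensure_ascii=True (the default), "è" is escaped as \u00e8
print(dumps({"char": "è"}, ensure_ascii=True))   # {"char": "\u00e8"}

# with ensure_ascii=False, the character is preserved as it is
print(dumps({"char": "è"}, ensure_ascii=False))  # {"char": "è"}
```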
Instead, the parameter indent is used to specify how many white spaces to add for indenting the various key-value pairs in the JSON document. The choice to specify the indent is only for human consumption, since a machine does not care about how these pairs are visually organised in the document. Indeed, the JSON document
{ "given name": "Silvio", "family name": "Peroni }
and the JSON document
{
"given name": "Silvio",
"family name": "Peroni"
}
are actually storing the same data, but the second is usually easier to read for humans. That is why the code above sets the input parameter indent to 4, meaning that four spaces must be used to indent the various JSON pairs.
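The effect of indent can also be checked directly in Python, again using dumps (the string-returning variant of dump) on the small object shown above:

```python
from json import dumps

person = {"given name": "Silvio", "family name": "Peroni"}

# without indent, the whole document is emitted on a single line
compact = dumps(person)
print(compact)

# with indent=4, each pair goes on its own line, indented by four spaces
indented = dumps(person, indent=4)
print(indented)
```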