Data persistence tools in the Python Standard Library

This is the first in a new sub-series of Python Module of the Week posts with a feature-based introduction to modules in the standard library, organized by what your goal might be. Each article may include cross-references to several modules from different parts of the library, and show how they relate to one another. Feedback is welcome, as usual.

Data Persistence and Exchange

Python provides several modules for storing data. There are basically two aspects to persistence: converting the in-memory object back and forth into a format for saving it, and working with the storage of the converted data.

Serializing Objects

Python includes two modules capable of converting objects into a transmittable or storable format (serializing): pickle and json. It is most common to use pickle, since there is a fast C implementation and it is integrated with some of the other standard library modules that actually store the serialized data, such as shelve. Authors of web-based applications may want to examine json, however, since it integrates better with some of the existing web service storage applications.

Storing Serialized Objects

Once the in-memory object is converted to a storable format, the next step is to decide how to store the data. A simple flat-file with serialized objects written one after the other works for data that does not need to be indexed in any way. But Python includes a collection of modules for storing key-value pairs in a simple database using one of the DBM format variants.

The simplest interface to take advantage of the DBM format is provided by shelve. Simply open the shelve file, and access it through a dictionary-like API. Objects saved to the shelve are automatically pickled and saved without any extra work on your part.

One drawback of shelve is that with the default interface you can’t guarantee which DBM format will be used. That won’t matter if your application doesn’t need to share the database files between hosts with different libraries, but if that is needed you can use one of the classes in the module to ensure a specific format is selected (Specific Shelf Types).

If you’re going to be passing a lot of data around via JSON anyway, using json and anydbm can provide another persistence mechanism. Since the DBM database keys and values must be strings, however, the objects won’t be automatically re-created when you access the value in the database.

Relational Databases

The excellent sqlite3 in-process relational database is available with most Python distributions. It stores its database in memory or in a local file, and all access is from within the same process, so there is no network lag. The compact nature of sqlite3 makes it especially well suited for embedding in desktop applications or development versions of web apps.

All access to the database is through the Python DBI 2.0 API, by default, as no object relational mapper (ORM) is included. The most popular general purpose ORM is SQLAlchemy, but others such as Django’s native ORM layer also support SQLite. SQLAlchemy is easy to install and set up, but if your objects aren’t very complicated and you are worried about overhead, you may want to use the DBI interface directly.

Data Exchange Through Standard Formats

Although not usually considered a true persistence format csv, or comma-separated-value, files can be an effective way to migrate data between applications. Most spreadsheet programs and databases support both export and import using CSV, so dumping data to a CSV file is frequently the simplest way to move data out of your application and into an analysis tool.