Python in Science: How long until a Nobel Prize?

As I write this, the Nobel Prizes for 2007 are being
announced. During the week of announcements, each day includes news
of another award being bestowed for outstanding contributions in
physics, chemistry, physiology or medicine, literature, peace, and
economics. As a technophile, I have always found the science awards
the most interesting. This year, prior to the awards, new releases
of several scientific packages on PyPI caught my eye and I was
struck by the coincidence. I started to wonder: How long before a
Nobel Prize is awarded to a scientist who uses Python for their work
in some significant way?

It should come as no surprise that Python can be found in many
scientific environments. The language is powerful enough to do the
necessary work, and simple enough to allow the focus to remain on the
science, instead of the programming. It is portable, making sharing
code between researchers easy. Its ability to interface with other
systems, through C libraries or network protocols, also makes Python
well suited for building on existing legacy tools. The list of tools
highlighted here is by no means exhaustive, but is intended to
introduce the wide variety of packages available for scientific work
in Python, ranging from general mathematics tools to narrowly focused,
specialty libraries.

The home for scientific programming in Python is the SciPy home page. SciPy aims to collect references to all of
the scientific libraries and serve as a central hub for sharing code
in the scientific programming community. There is even a SciPy
conference! The SciPy site is an excellent source of information about
scientific programming in general, and using Python specifically. It’s
a good starting point for a survey of how Python is used in
scientific research.

General Purpose Toolkits

Most scientific work includes some data collection and management,
along with number crunching to analyze that data. There are several
general purpose libraries for working with datasets from any
scientific field.

The NumPy library is designed as “the
fundamental package needed for scientific computing with Python”. It
includes powerful data management and manipulation features, including
multi-dimensional arrays. Once your data is collected, it can be
processed by NumPy using linear algebra, Fourier transforms, random
number generation, and tools for integrating Fortran code. NumPy is
the foundation of several libraries also hosted on, or otherwise
associated with, the SciPy site.
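
As a quick illustration, a few lines are enough to exercise several
of these features (a minimal sketch using a made-up two-by-two
dataset):

import numpy

# Build a small two-dimensional dataset and analyze it.
data = numpy.array([[1.0, 2.0], [3.0, 4.0]])

print data.mean()                  # average of all values: 2.5
print numpy.fft.fft(data[0])       # discrete Fourier transform of the first row
print numpy.linalg.inv(data)       # matrix inverse, via the linear algebra routines
print numpy.random.random((2, 2))  # uniformly distributed random samples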

The main SciPy library uses the array manipulation features of NumPy
to implement more advanced mathematical, scientific, and engineering
functions. It provides routines for numerical integration and
optimization, signal processing, and statistics.
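
For instance, numerical integration and root finding each take a
single call (a sketch; the functions analyzed here are arbitrary
examples):

import math
from scipy import integrate, optimize

# Numerically integrate sin(x) over [0, pi]; quad() returns the
# result and an estimate of the error.
area, error = integrate.quad(math.sin, 0, math.pi)
print area   # approximately 2.0

# Find a root of cos(x), starting the search near x=1.
root = optimize.newton(math.cos, 1.0)
print root   # approximately pi/2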

PyTables is a data management package
specifically designed to handle large datasets. It manages data file
access with the HDF5 library, and uses NumPy for in-memory
datasets. Since it is designed for use with large amounts of data, it
is well suited for applications which produce or collect a lot of
data, even outside the scientific arena.
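
A minimal sketch of the idea, assuming the PyTables API of the time,
describes a table of instrument readings and appends rows to it (the
Reading description and file name are invented for illustration):

import tables

class Reading(tables.IsDescription):
    # Each row of the table holds a timestamp and a measured value.
    time = tables.Float64Col()
    value = tables.Float64Col()

h5 = tables.openFile('experiment.h5', mode='w')
table = h5.createTable('/', 'readings', Reading)
row = table.row
for t in range(10):
    row['time'] = float(t)
    row['value'] = float(t) * 2.5
    row.append()
table.flush()
h5.close()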

The ScientificPython
package is another broad ranging collection of modules for scientific
computing. It includes an input/output library; geometric,
mathematical, and statistical functions; as well as several general
purpose modules to assist in programming tasks such as threading and
parallel programming.

Simulations

Besides working with observed data, scientists often use simulations
to understand the rules governing the operation of a system. If you
can construct a simulation that accurately predicts the outcome of a
set of inputs, then you have a higher level of confidence that you
understand the way the parts of a system interact. There are
basically two types of simulation, discrete and continuous, and there
are Python packages for working with both.

SimPy is a simulation language
based on standard Python. You can use SimPy to simulate activities
like traffic flow patterns, queues at retail stores, and other
discrete events. SimPy represents independent active components of
the simulation as “processes” that can interact with each other
(queuing, passing data or resources, etc.). Limited capacity is
represented through “resources”, and requests for resources are
maintained for you.
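
A toy example gives the flavor of the API (a sketch based on the
classic SimPy interface; the Customer process and its timings are
invented):

from SimPy.Simulation import Process, initialize, activate, simulate, hold, now

class Customer(Process):
    # An independent, active component of the simulation.
    def visit(self):
        print '%s arrives at %s' % (self.name, now())
        yield hold, self, 5.0   # spend five time units in the store
        print '%s leaves at %s' % (self.name, now())

initialize()
customer = Customer(name='customer-1')
activate(customer, customer.visit())
simulate(until=100)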

PyDSTool
is a suite of computational tools for modeling dynamic systems and
physical processes being developed at Cornell. It supports discrete or
continuous simulation, and a wide range of mathematical operations and
constraints. One especially interesting feature is the use of
automatically generated and compiled C code for the “generators” that
produce input data used by the rest of the tool.

Mathematics

Once your data is collected, it needs to be analyzed. That may involve
statistical calculations, or you might be trying to uncover an
underlying relationship or formula. In either case, there is a Python
library with the tools you need.

SymPy – not to be confused with
SimPy – is a full featured computer algebra system, for symbolic
mathematics. It supports algebraic formula expansion and reduction, as
well as calculus operations such as differentials, derivatives, series
and limits, and integration.
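
A short session shows the style (a minimal sketch of a few symbolic
operations):

from sympy import Symbol, diff, integrate, limit, sin

x = Symbol('x')

print diff(sin(x) * x, x)      # derivative: x*cos(x) + sin(x)
print integrate(x ** 2, x)     # antiderivative: x**3/3
print limit(sin(x) / x, x, 0)  # limit as x approaches 0: 1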

If you need even more powerful mathematics, or just want to take
advantage of previous work done with the ubiquitous MATLAB, you can
use pymat to drive the MATLAB
engine directly from your Python program. It will send matrices of
data back and forth between your Python program and MATLAB, allowing
the two programs to act together on the data.

Visualization

Once your data is collected and you have completed your calculations,
the next step is generally to produce graphical views of the data to
make it easier to interpret the results and spot trends. The 2-D
plotting and visualization library matplotlib is used in many
different application areas. It produces publication quality figures
in a variety of formats, and you can control it from scripts, GUIs,
through the web, or interactively through the python or ipython
shells. Figure
1 shows a sample plot created with matplotlib by Jeff Whitaker of
NOAA’s Earth System Research Lab in Boulder, CO (and author of the
basemap toolkit for matplotlib).

Figure 1. A matplotlib example from Jeff Whitaker.
(http://doughellmann.com/blog/wp-content/uploads/2007/11/Figure1.png)
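
Producing a basic figure takes only a few calls (a sketch using the
pylab interface; the sine curve is just an example dataset):

import numpy
from pylab import plot, xlabel, ylabel, title, savefig

x = numpy.linspace(0, 10, 100)
plot(x, numpy.sin(x))
xlabel('x')
ylabel('sin(x)')
title('A simple example plot')
savefig('sine.png')   # write the figure to a file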

For 3-D visualization, you will want to check out Mayavi, from
Enthought. Mayavi is a general purpose visualization engine, and it
can be used with scalar or vector data to create and manipulate three
dimensional representations of your dataset for visualization.

Astronomy

In addition to the many general purpose libraries suitable for
scientific work, there are quite a few application specific libraries
available in different fields from the macroscopic to the
sub-microscopic.

The AstroPy
project from the Astronomy Department at the University of Washington
in Seattle promotes the use of Python for astronomy research. Their
home page lists several packages for accessing legacy systems through
Python wrappers, as well as pure Python libraries for working with
astronomical data. For example, AstroLib is a
collection of 4 components which provide features such as manipulating
the ASCII tables commonly used to exchange data between scientists,
synthetic photometry (for analyzing intensity or apparent magnitude
measurements), and coordinate conversion and manipulation. The target
user is a “typical astronomer” preparing for observation runs or
working with observed or catalog data.

PyNOVAS is a library for
calculating the positions of the sun, moon, planets, and other
celestial objects. It is based on a C library, called NOVAS, and is a
good example of wrapping existing libraries in Python to make them
available to a wider range of scientists.

The Space Telescope Science Institute manages the operation of the
Hubble Space Telescope. In addition to using Python for many of their
internal tools, they have released a library for working with
astronomical data and images, called stsci_python.

Climate

If astronomy isn’t your thing, maybe you want to look at scientific
applications a little closer to home. Climate research is a hot topic
these days, both in the news and in Python development.

The tools in the Climate Data Analysis Tools (CDAT) system from
Lawrence Livermore National Laboratory are specifically designed for
working with climate data. There are separate components for reading
and writing data, performing climate-specific calculations, and
general statistical analysis.

PyClimate from Universidad del País
Vasco in Spain is focused on analyzing and modeling climate data,
combining data from different sources in different formats and
measurements to look for variability, especially human induced change.

Bjørn Ådlandsvik’s seawater module
implements functions for computing properties of the ocean, using
standard formulas defined by UNESCO reports, while the fluid package is a more general set of
procedures for studying fluid interactions.

Biology / Health

If you are more interested in animate creatures than inanimate
objects, you will be pleased to know that there is a thriving
community of biology researchers using Python for their work.

Biopython hosts a set of tools for
“computational molecular biology” for bioinformatics. Contributors are
distributed around the world, mostly at research universities. The
library they have produced includes tools for parsing bioinformatics
files from a wide range of sources, offline and online. It also
contains classes for representing and manipulating DNA sequences.
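
For example, the sequence classes make common manipulations
one-liners (a minimal sketch; the sequence is arbitrary):

from Bio.Seq import Seq

seq = Seq('AGTACACTGGT')
print seq.complement()          # TCATGTGACCA
print seq.reverse_complement()  # ACCAGTGTACT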

EpiGrass is used for network epidemiology simulations. The results
can be fed to the GRASS geographic information system and plotted on
maps to track or predict the spread of disease.

Molecular Modeling

If we continue this trend toward studying smaller objects, we soon
reach the microscopic and sub-microscopic scales and find Python hard
at work there, too.

The Scripps Research Institute has
released several tools for visualizing and analyzing molecular
structures. Their Python Molecular Viewer draws an interactive three
dimensional representation of a molecule. It is also scriptable, using
built-in or user defined commands dynamically loaded from plug-ins.

The Molecular Modeling Toolkit
by Konrad Hinsen is another simulation application, this time
specifically intended for simulating molecules and their interactions.

Chimera, from the University of
California, San Francisco, is an alternate interactive visualization
and analysis tool for molecular structures. It produces high quality
images and animations, and can be driven by a command interface or
interactively.

Conclusion

Although I have barely scratched the surface, I hope this list of
packages illustrates the wide array of application areas where Python
is being used for scientific research. From the macroscopic to
microscopic, simulation to computation, it fills gaps left by other
tools and serves as the foundation technology for entirely new
tools. Whether it is reading and writing standard (or ad hoc) data
files, controlling equipment, or performing calculations directly,
Python has an important place in science. It can take several decades
before the impact of fundamental research is evaluated and recognized
as worthy of a Nobel Prize, and Python is still young enough that the
research being done using it needs to mature before it would be
considered. But that day is coming, and it is entirely possible that a
Nobel Prize will be awarded to a scientist who uses Python within my
lifetime.

Update on the GIL

Thanks to everyone who sent a message or link after last month’s
column! The responses were generally positive. One of the corrections
came from Adam Olsen, who reported that he is working on a branch
of the C interpreter which removes the GIL. I missed the
discussion on the python-dev list, so I wasn’t aware of his
project. According to Adam, the code is in pre-alpha status, and still
needs work in areas such as deadlock management and weakrefs. He does
have some performance numbers, gathered using the pystone benchmarks
running on a dual core system. As a baseline, an unmodified version of
Python 3000 yields 28000 pystones/second. His GIL-free version
produces 18800 pystones/second for one thread, and 36700
pystones/second for two threads.

As always, if there is something you would like for me to cover in
this column, send a note with the details to doug dot hellmann at
pythonmagazine dot com and let me know, or add the link to your
del.icio.us account with the tag pymagdifferent.

Originally published in Python Magazine Volume 1 Number 11, November 2007

Command line programs are classes, too!

Originally published in Python Magazine Volume 1 Issue 11, November 2007

Most OOP discussions focus on GUI or domain-specific development
areas, completely ignoring the workhorse of computing: command
line programs. This article examines CommandLineApp, a base class
for creating command line programs as objects, with option and
argument validation, help text generation, and more.

Although many of the hot new development topics are centered on web
technologies like AJAX, regular command line programs are still an
important part of most systems. Many system administration tasks
still depend on command line programs, for example. Often, a problem
is simple enough that there is no reason to build a graphical or web
user interface when a straightforward command line interface will do
the job. Command line programs are less glamorous than programs with
fancy graphics, but they are still the workhorses of modern
computing.

The Python standard library includes two modules for working with
command line options. The getopt module presents an API that has
been in use for decades on some platforms and is commonly available in
many programming languages, from C to bash. The optparse module is
more modern than getopt, and offers features such as type
validation, callbacks, and automatic help generation. Both modules
elect to use a procedural-style interface, though, and as a result
neither has direct support for treating your command line application
as a first class object. There is no facility for sharing common
options between related programs using getopt. And, while it is
possible to reuse optparse.OptionParser instances in different
programs, it is not as natural as inheritance.
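
For comparison, here is roughly how the same kind of option could be
declared with optparse (a sketch; the --dialect option anticipates
the example program discussed below):

from optparse import OptionParser

parser = OptionParser()
parser.add_option('-d', '--dialect', default='excel',
                  help='Specify the output dialect name.')
# Parse a canned argument list instead of sys.argv.
options, args = parser.parse_args(['--dialect', 'excel-tab', 'file1.csv'])
print options.dialect   # 'excel-tab'
print args              # ['file1.csv']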

CommandLineApp is a base class for command line programs. It
handles the repetitive aspects of interacting with the user on the
command line such as parsing options and arguments, generating help
messages, error handling, and printing status messages. To create your
application, just make a subclass of CommandLineApp and
concentrate on your own code. All of the information about switches,
arguments, and help text necessary for your program to run is derived
through introspection. Common options and behavior can be shared by
applications through inheritance.

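Before looking at a real program, here is about the smallest useful
application (a sketch following the conventions described in the rest
of this article):

#!/usr/bin/env python
import CommandLineApp

class hello(CommandLineApp.CommandLineApp):
    """Print a friendly greeting.
    """

    name = 'World'
    def optionHandler_name(self, name):
        """Specify who to greet. Defaults to "World".
        """
        self.name = name
        return

    def main(self):
        print 'Hello, %s!' % self.name
        return 0

if __name__ == '__main__':
    hello().run()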

csvcat Requirements

Recently, I needed to combine data from a few different sources,
including a database and a spreadsheet, to summarize the results. I
wanted to import the merged data into a spreadsheet where I could
perform the analysis. All of the sources were able to save data to
comma-separated-value (CSV) files; the challenge was merging the files
together. Using the csv module in the Python standard library, and
CommandLineApp, I wrote a small program to read multiple CSV files
and concatenate them into a single output file. The program, csvcat,
is a good illustration of how to create applications with
CommandLineApp.

The requirements for csvcat were fairly simple. It needed to read
one or more CSV files and combine them, without repeating the column
headers that appeared in each input source. In some cases, the input
data included columns I did not want, so I needed to be able to select
the columns to include in the output. No sort feature was needed,
since I was going to import it into a spreadsheet when I was done and
I could sort the data after importing it. To make the program more
generally useful, I also included the ability to select the output
format using a csv module feature called “dialects”.

Analyzing the Help

Listing 1 shows the help output for the final version of csvcat,
produced by running csvcat --help. Listing 2 shows the source for
the program. All of the information in the help output is derived
from the csvcat class through introspection. The help text
follows a fairly standard layout. It begins with a description of the
application, followed by increasingly more detailed descriptions of
the syntax, arguments, and options. Application-specific help such as
examples and argument ranges appears at the end.

Listing 1

Concatenate comma separated value files.


SYNTAX:

  csvcat [<options>] filename [filename...]

    -c col[,col...], --columns=col[,col...]
    -d name, --dialect=name
    --debug
    -h
    --help
    --quiet
    --skip-headers
    -v
    --verbose=level


ARGUMENTS:

    The names of comma separated value files, such as might be
    exported from a spreadsheet or database program.


OPTIONS:

    -c col[,col...], --columns=col[,col...]
        Limit the output to the specified columns. Columns are
        identified by number, starting with 0.

    -d name, --dialect=name
        Specify the output dialect name. Defaults to "excel".

    --debug
        Set debug mode to see tracebacks.

    -h
        Displays abbreviated help message.

    --help
        Displays verbose help message.

    --quiet
        Turn on quiet mode.

    --skip-headers
        Treat the first line of each file as a header, and only
        include one copy in the output.

    -v
        Increment the verbose level. Higher levels are more verbose.
        The default is 1.

    --verbose=level
        Set the verbose level.

EXAMPLES:


To concatenate 2 files, including all columns and headers:

  $ csvcat file1.csv file2.csv

To concatenate 2 files, skipping the headers in the second file:

  $ csvcat --skip-headers file1.csv file2.csv

To concatenate 2 files, including only the first and third columns:

  $ csvcat --col 0,2 file1.csv file2.csv


OUTPUT DIALECTS:

    excel-tab
    excel

Listing 2

#!/usr/bin/env python
"""Concatenate csv files.
"""

import csv
import sys
import CommandLineApp

class csvcat(CommandLineApp.CommandLineApp):
    """Concatenate comma separated value files.
    """

    EXAMPLES_DESCRIPTION = '''
To concatenate 2 files, including all columns and headers:

  $ csvcat file1.csv file2.csv

To concatenate 2 files, skipping the headers in the second file:

  $ csvcat --skip-headers file1.csv file2.csv

To concatenate 2 files, including only the first and third columns:

  $ csvcat --col 0,2 file1.csv file2.csv
'''

    def showVerboseHelp(self):
        CommandLineApp.CommandLineApp.showVerboseHelp(self)
        print
        print 'OUTPUT DIALECTS:'
        print
        for name in csv.list_dialects():
            print '\t%s' % name
        print
        return

    skip_headers = False
    def optionHandler_skip_headers(self):
        """Treat the first line of each file as a header,
        and only include one copy in the output.
        """
        self.skip_headers = True
        return

    dialect = "excel"
    def optionHandler_dialect(self, name):
        """Specify the output dialect name.
        Defaults to "excel".
        """
        self.dialect = name
        return
    optionHandler_d = optionHandler_dialect

    columns = []
    def optionHandler_columns(self, *col):
        """Limit the output to the specified columns.
        Columns are identified by number, starting with 0.
        """
        self.columns.extend([int(c) for c in col])
        return
    optionHandler_c = optionHandler_columns

    def getPrintableColumns(self, row):
        """Return only the part of the row which should be printed.
        """
        if not self.columns:
            return row

        # Extract the column values, in the order specified.
        response = ()
        for c in self.columns:
            response += (row[c],)
        return response

    def getWriter(self):
        return csv.writer(sys.stdout, dialect=self.dialect)

    def main(self, *filename):
        """
        The names of comma separated value files, such as might be
        exported from a spreadsheet or database program.
        """
        headers_written = False

        writer = self.getWriter()

        # process the files in order
        for name in filename:
            f = open(name, 'rt')
            try:
                reader = csv.reader(f)

                if self.skip_headers:
                    if not headers_written:
                        # This row must include the headers for the output
                        headers = reader.next()
                        writer.writerow(self.getPrintableColumns(headers))
                        headers_written = True
                    else:
                        # We have seen headers before, and are skipping,
                        # so do not write the first row of this file.
                        ignore = reader.next()

                # Process the rest of the file
                for row in reader:
                    writer.writerow(self.getPrintableColumns(row))
            finally:
                f.close()
        return

if __name__ == '__main__':
    csvcat().run()

The program description is taken from the docstring of the csvcat
class. Before it is printed, the text is split into paragraphs and
reformatted using textwrap, to ensure that it is no wider than 80
columns of text.
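
The effect is roughly what this snippet produces (a sketch
approximating the _formatHelpText() method shown later in Listing 3;
the 80-column width comes from the description above):

import inspect
import textwrap
from Listing2 import csvcat

# getdoc() cleans up the indentation of the docstring, and fill()
# rewraps each paragraph to fit within 80 columns.
description = inspect.getdoc(csvcat)
for paragraph in description.split('\n\n'):
    print textwrap.fill(paragraph, width=80)
    print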

The program description is followed by a syntax summary for the
program. The options listed in the syntax section correspond to
methods with names that begin with optionHandler_. For example,
optionHandler_skip_headers() indicates that csvcat should
accept a --skip-headers option on the command line.

The names of any non-optional arguments to the program appear in the
syntax summary. In this case, csvcat needs the names of the files
containing the input data. At least one file name is necessary, and
multiple names can be given, as indicated by the fact that the
filename argument to main() (line 78) uses the variable
argument notation: *filename. A longer description of the
arguments, taken from the docstring of the main() method (lines
79-82), follows the syntax summary. As with the general program
summary, the description of the arguments is reformatted with
textwrap to fit the screen.

Options and Their Arguments

Following the argument description is a detailed explanation of all of
the options to the program. CommandLineApp examines each option
handler method to build the option description, including the name of
the option, alternative names for the same option, and the name and
description of any arguments the option accepts. There are three
variations of option handlers, based on the arguments used by the
option.

The simplest kind of option does not take an argument at all, and is
used as a “switch” to turn a feature on or off. The method
optionHandler_skip_headers (lines 38-43) is an example of such a
switch. The method takes no argument, so CommandLineApp
recognizes that the option being defined does not take an argument
either. To create the option name, the prefix is stripped from the
method name, and the underscore is converted to a dash (-);
optionHandler_skip_headers becomes --skip-headers.

Other options accept a single argument. For example, the
--dialect option requires the name of the CSV output dialect. The
method optionHandler_dialect (lines 46-51) takes one argument,
called name. The suggested syntax for the option, as seen in
Listing 1, is --dialect=name. The name of the method’s argument
is used as the name of the argument to the option in the help text.

The -d option has the same meaning as --dialect, because
optionHandler_d is an alias for optionHandler_dialect (line
52). CommandLineApp recognizes aliases, and combines the forms in
the documentation so the alternative forms -d name and
--dialect=name are described together.

It is often useful for an option to take multiple arguments, as with
--columns. The user could repeat the option on the command line,
but it is more compact to allow them to list multiple values in one
argument list. When CommandLineApp sees an option handler method
that takes a variable argument list, it treats the corresponding
option as accepting a list of arguments. When the option appears on
the command line, the string argument is split on any commas and the
resulting list of strings is passed to the option handler method.

For example, optionHandler_columns (lines 55-60) takes a variable
length argument named col. The option --columns can be
followed by several column numbers, separated by commas. The option
handler is called with the list of values pre-parsed. In the syntax
description, the argument is shown repeating:
--columns=col[,col...].

For all cases, the docstring from the option handler method serves as
the help text for the option. The text of the docstring is
reformatted using textwrap so both the code and help output are
easy to read without extra effort on the part of the developer.

Application-specific Detailed Help

The general syntax and option description information is produced in
the same way for all CommandLineApp programs. There are times
when an application needs to include additional information in the
help output, though, and there are two ways to add such information.

The first way is by providing examples of how to use the program on
the command line. Although it is optional, including examples of how
to apply different combinations of arguments to your program to
achieve various results enhances the usefulness of the help as a
reference manual. When the EXAMPLES_DESCRIPTION class attribute
is set, it is used as the source for the examples. Unlike the other
documentation strings, the EXAMPLES_DESCRIPTION is printed
directly without being reformatted. This preserves the indentation
and other formatting of the examples, so the user sees an accurate
representation of the program’s inputs and outputs.

Occasionally, a program may need to include information in its help
output which cannot be statically defined in a docstring or derived
by CommandLineApp. At the very end of its help, csvcat includes
a list of available CSV dialects which can be used with the
--dialect option. Since the list of dialects must be constructed
at runtime based on what dialects have been registered with the
csv module, csvcat overrides showVerboseHelp() to print
the list itself (lines 27-35).

Using csvcat

The inputs to csvcat are any number of CSV files, and the output
is CSV data printed to standard output. To test csvcat during
development, I created two small files with test data. Each file
contains three columns of data: a number, a string, and a date.

$ cat testdata1.csv
"Title 1","Title 2","Title 3"
1,"a",08/18/07
2,"b",08/19/07
3,"c",08/20/07

The second file does not include quotes around any of the string
fields. I chose to include this variation because csvcat does not
quote its output, so using unquoted test data simulates re-processing
the output of csvcat.

$ cat testdata2.csv
Title 1,Title 2,Title 3
40,D,08/21/07
50,E,08/22/07
60,F,08/23/07

The simplest use of csvcat is to print the contents of an input
file to standard output. Notice that the output does not include
quotes around the string fields.

$ csvcat testdata1.csv
Title 1,Title 2,Title 3
1,a,08/18/07
2,b,08/19/07
3,c,08/20/07

It is also possible to select which columns should be included in the
output using the --columns option. Columns are identified by their
number, beginning with 0. Column numbers can be listed in any
order, so it is possible to reorder the columns of the input data, if
needed.

$ csvcat --columns 2,0 testdata1.csv
Title 3,Title 1
08/18/07,1
08/19/07,2
08/20/07,3

Switching to tab-separated columns instead of comma-separated is
easily accomplished by using the --dialect option. There are only
two dialects available by default, but the csv module API
supports registering additional dialects.

$ csvcat --dialect excel-tab testdata1.csv
Title 1 Title 2 Title 3
1       a       08/18/07
2       b       08/19/07
3       c       08/20/07

For my project, there were input files with several columns, but only
two of them needed to be included in the output. Each file had a
single row of column headers. I only wanted one set of headers in the
output, so the headers from subsequent files needed to be skipped.
And the output had to be in a format I could import into a
spreadsheet, for which the default “excel” dialect worked fine. The
data was merged with a command like this:

$ csvcat --skip-headers --columns 2,0 testdata1.csv testdata2.csv
Title 3,Title 1
08/18/07,1
08/19/07,2
08/20/07,3
08/21/07,40
08/22/07,50
08/23/07,60

Running a CommandLineApp Program

Most of the work for csvcat is being done in the main()
method. To invoke the application, however, the caller does not
invoke main() directly. The program should be started by calling
run(), so the options are validated and exceptions from main()
are handled. The run() method is one of several methods that are
not intended to be overridden by derived classes, since they
implement the core features of a command line program. The source
for CommandLineApp appears in Listing 3.

Listing 3

#!/usr/bin/env python
# CommandLineApp.py
"""Base class for building command line applications.
"""

import getopt
import inspect
import os
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
import sys
import textwrap


class CommandLineApp(object):
    """Base class for building command line applications.

    Define a docstring for the class to explain what the program does.

    Include descriptions of the command arguments in the docstring for
    main().

    When the EXAMPLES_DESCRIPTION class attribute is not empty, it
    will be printed last in the help message when the user asks for
    help.
    """

    EXAMPLES_DESCRIPTION = ''

    # If true, always ends run() with sys.exit()
    force_exit = True

    # The name of this application
    _app_name = os.path.basename(sys.argv[0])

    _app_version = None

    def __init__(self, commandLineOptions=sys.argv[1:]):
        "Initialize CommandLineApp."
        self.command_line_options = commandLineOptions
        self.supported_options = self.scanForOptions()
        return

    def main(self, *args):
        """Main body of your application.

        This is the main portion of the app, and is run after all of
        the arguments are processed.  Override this method to implement
        the primary processing section of your application.
        """
        pass

    def handleInterrupt(self):
        """Called when the program is interrupted via Control-C
        or SIGINT.  Returns exit code.
        """
        sys.stderr.write('Canceled by user.\n')
        return 1

    def handleMainException(self, err):
        """Invoked when there is an error in the main() method.
        """
        if self.debugging:
            import traceback
            traceback.print_exc()
        else:
            self.errorMessage(str(err))
        return 1

    ## HELP

    def showHelp(self, errorMessage=None):
        "Display help message when error occurs."
        print
        if self._app_version:
            print '%s version %s' % (self._app_name, self._app_version)
        else:
            print self._app_name
        print

        # If they made a syntax mistake, just
        # show them how to use the program.  Otherwise,
        # show the full help message.
        if errorMessage:
            print ''
            print 'ERROR: ', errorMessage
            print ''
            print ''
            print '%s\n' % self._app_name
            print ''

        txt = self.getSimpleSyntaxHelpString()
        print txt
        print 'For more details, use --help.'
        print
        return

    def showVerboseHelp(self):
        "Display the full help text for the command."
        txt = self.getVerboseSyntaxHelpString()
        print txt
        return

    ## STATUS MESSAGES

    def statusMessage(self, msg='', verbose_level=1, error=False, newline=True):
        """Print a status message to output.

        Arguments

            msg=''            -- The status message string to be printed.

            verbose_level=1   -- The verbose level to use.  The message
                              will only be printed if the current verbose
                              level is >= this number.

            error=False       -- If true, the message is considered an error and
                              printed as such.

            newline=True      -- If true, print a newline after the message.

        """
        if self.verbose_level >= verbose_level:
            if error:
                output = sys.stderr
            else:
                output = sys.stdout
            output.write(str(msg))
            if newline:
                output.write('\n')
            # some log mechanisms don't have a flush method
            if hasattr(output, 'flush'):
                output.flush()
        return

    def errorMessage(self, msg=''):
        'Print a message as an error.'
        self.statusMessage('ERROR: %s\n' % msg, verbose_level=0, error=True)
        return

    ## DEFAULT OPTIONS

    debugging = False
    def optionHandler_debug(self):
        "Set debug mode to see tracebacks."
        self.debugging = True
        return

    _run_main = True
    def optionHandler_h(self):
        "Displays abbreviated help message."
        self.showHelp()
        self._run_main = False
        return

    def optionHandler_help(self):
        "Displays verbose help message."
        self.showVerboseHelp()
        self._run_main = False
        return

    def optionHandler_quiet(self):
        'Turn on quiet mode.'
        self.verbose_level = 0
        return

    verbose_level = 1
    def optionHandler_v(self):
        """Increment the verbose level.
        Higher levels are more verbose.
        The default is 1.
        """
        self.verbose_level = self.verbose_level + 1
        self.statusMessage('New verbose level is %d' % self.verbose_level,
                           3)
        return

    def optionHandler_verbose(self, level=1):
        """Set the verbose level.
        """
        self.verbose_level = int(level)
        self.statusMessage('New verbose level is %d' % self.verbose_level,
                           3)
        return

    ## INTERNALS (Subclasses should not need to override these methods)

    def run(self):
        """Entry point.

        Process options and execute callback functions as needed.
        This method should not need to be overridden, if the main()
        method is defined.
        """
        # Process the options supported and given
        options = {}
        for info in self.supported_options:
            options[ info.switch ] = info
        parsed_options, remaining_args = self.callGetopt(self.command_line_options,
                                                         self.supported_options)
        exit_code = 0
        try:
            for switch, option_value in parsed_options:
                opt_def = options[switch]
                opt_def.invoke(self, option_value)

            # Perform the primary action for this application,
            # unless one of the options has disabled it.
            if self._run_main:
                main_args = tuple(remaining_args)

                # We could just call main() and catch a TypeError,
                # but that would not let us differentiate between
                # application errors and a case where the user
                # has not passed us enough arguments.  So, we check
                # the argument count ourselves.
                num_args_ok = False
                argspec = inspect.getargspec(self.main)
                expected_arg_count = len(argspec[0]) - 1

                if argspec[1] is not None:
                    num_args_ok = True
                    if len(argspec[0]) > 1:
                        num_args_ok = (len(main_args) >= expected_arg_count)
                elif len(main_args) == expected_arg_count:
                    num_args_ok = True

                if num_args_ok:
                    exit_code = self.main(*main_args)
                else:
                    self.showHelp('Incorrect arguments.')
                    exit_code = 1

        except KeyboardInterrupt:
            exit_code = self.handleInterrupt()

        except SystemExit, msg:
            exit_code = msg.args[0]

        except Exception, err:
            exit_code = self.handleMainException(err)
            if self.debugging:
                raise

        if self.force_exit:
            sys.exit(exit_code)
        return exit_code

    def scanForOptions(self):
        "Scan through the inheritence hierarchy to find option handlers."
        options = []

        methods = inspect.getmembers(self.__class__, inspect.ismethod)
        for method_name, method in methods:
            if method_name.startswith(OptionDef.OPTION_HANDLER_PREFIX):
                options.append(OptionDef(method_name, method))

        return options

    def callGetopt(self, commandLineOptions, supportedOptions):
        "Parse the command line options."
        short_options = []
        long_options = []
        for o in supportedOptions:
            if len(o.option_name) == 1:
                short_options.append(o.option_name)
                if o.arg_name:
                    short_options.append(':')
            elif o.arg_name:
                long_options.append('%s=' % o.switch_base)
            else:
                long_options.append(o.switch_base)

        short_option_string = ''.join(short_options)

        try:
            parsed_options, remaining_args = getopt.getopt(
                commandLineOptions,
                short_option_string,
                long_options)
        except getopt.error, message:
            self.showHelp(message)
            if self.force_exit:
                sys.exit(1)
            raise
        return (parsed_options, remaining_args)

    def _groupOptionAliases(self):
        """Return a sequence of tuples containing
        (option_names, option_defs)
        """
        # Figure out which options are aliases
        option_aliases = {}
        for option in self.supported_options:
            method = getattr(self, option.method_name)
            existing_aliases = option_aliases.setdefault(method, [])
            existing_aliases.append(option)

        # Sort the groups in order
        grouped_options = []
        for options in option_aliases.values():
            names = [ o.option_name for o in options ]
            grouped_options.append( (names, options) )
        grouped_options.sort()
        return grouped_options

    def _getOptionIdentifierText(self, options):
        """Return the option identifier text.

        For example:

          -h
          -v, --verbose
          -f bar, --foo bar
        """
        option_texts = []
        for option in options:
            option_texts.append(option.getSwitchText())
        return ', '.join(option_texts)

    def getArgumentsSyntaxString(self):
        """Look at the arguments to main to see what the program accepts,
        and build a syntax string explaining how to pass those arguments.
        """
        syntax_parts = []
        argspec = inspect.getargspec(self.main)
        args = argspec[0]
        if len(args) > 1:
            for arg in args[1:]:
                syntax_parts.append(arg)
        if argspec[1]:
            syntax_parts.append(argspec[1])
            syntax_parts.append('[' + argspec[1] + '...]')
        syntax = ' '.join(syntax_parts)
        return syntax

    def getSimpleSyntaxHelpString(self):
        """Return syntax statement.

        Return a simplified form of help including only the
        syntax of the command.
        """
        buffer = StringIO()

        # Show the name of the command and basic syntax.
        buffer.write('%s [<options>] %s\n\n' %
                         (self._app_name, self.getArgumentsSyntaxString())
                     )

        grouped_options = self._groupOptionAliases()

        # Assemble the text for the options
        for names, options in grouped_options:
            buffer.write('    %s\n' % self._getOptionIdentifierText(options))

        return buffer.getvalue()

    def _formatHelpText(self, text, prefix):
        if not text:
            return ''
        buffer = StringIO()
        text = textwrap.dedent(text)
        for para in text.split('\n\n'):
            formatted_para = textwrap.fill(para,
                                           initial_indent=prefix,
                                           subsequent_indent=prefix,
                                           )
            buffer.write(formatted_para)
            buffer.write('\n\n')
        return buffer.getvalue()

    def getVerboseSyntaxHelpString(self):
        """Return the full description of the options and arguments.

        Show a full description of the options and arguments to the
        command in something like UNIX man page format. This includes

          - a description of each option and argument, taken from the
                __doc__ string for the optionHandler method for
                the option

          - a description of what additional arguments will be processed,
                taken from the arguments to main()

        """
        buffer = StringIO()

        class_help_text = self._formatHelpText(inspect.getdoc(self.__class__),
                                               '')
        buffer.write(class_help_text)

        buffer.write('\nSYNTAX:\n\n  ')
        buffer.write(self.getSimpleSyntaxHelpString())

        main_help_text = self._formatHelpText(inspect.getdoc(self.main), '    ')
        if main_help_text:
            buffer.write('\n\nARGUMENTS:\n\n')
            buffer.write(main_help_text)

        buffer.write('\nOPTIONS:\n\n')

        grouped_options = self._groupOptionAliases()

        # Describe all options, grouping aliases together
        for names, options in grouped_options:
            buffer.write('    %s\n' % self._getOptionIdentifierText(options))

            help = self._formatHelpText(options[0].help, '        ')
            buffer.write(help)

        if self.EXAMPLES_DESCRIPTION:
            buffer.write('EXAMPLES:\n\n')
            buffer.write(self.EXAMPLES_DESCRIPTION)
        return buffer.getvalue()


class OptionDef(object):
    """Definition for a command line option.

    Attributes:

      method_name - The name of the option handler method.
      option_name - The name of the option.
      switch      - Switch to be used on the command line.
      arg_name    - The name of the argument to the option handler.
      is_variable - Is the argument expected to be a sequence?
      default     - The default value of the option handler argument.
      help        - Help text for the option.
      is_long     - Is the option a long value (--) or short (-)?
    """

    # Option handler method names start with this value
    OPTION_HANDLER_PREFIX = 'optionHandler_'

    # For *args arguments to option handlers, how to split the argument values
    SPLIT_PARAM_CHAR = ','

    def __init__(self, methodName, method):
        self.method_name = methodName
        self.option_name = methodName[len(self.OPTION_HANDLER_PREFIX):]
        self.is_long = len(self.option_name) > 1

        self.switch_base = self.option_name.replace('_', '-')
        if len(self.switch_base) == 1:
            self.switch = '-' + self.switch_base
        else:
            self.switch = '--' + self.switch_base

        argspec = inspect.getargspec(method)

        self.is_variable = False
        args = argspec[0]
        if len(args) > 1:
            self.arg_name = args[-1]
        elif argspec[1]:
            self.arg_name = argspec[1]
            self.is_variable = True
        else:
            self.arg_name = None

        if argspec[3]:
            self.default = argspec[3][0]
        else:
            self.default = None

        self.help = inspect.getdoc(method)
        return

    def getSwitchText(self):
        """Return the description of the option switch.

        For example: --switch=arg or -s arg or --switch=arg[,arg]
        """
        parts = [ self.switch ]
        if self.arg_name:
            if self.is_long:
                parts.append('=')
            else:
                parts.append(' ')
            parts.append(self.arg_name)
            if self.is_variable:
                parts.append('[%s%s...]' % (self.SPLIT_PARAM_CHAR, self.arg_name))
        return ''.join(parts)


    def invoke(self, app, arg):
        """Invoke the option handler.
        """
        method = getattr(app, self.method_name)
        if self.arg_name:
            if self.is_variable:
                opt_args = arg.split(self.SPLIT_PARAM_CHAR)
                method(*opt_args)
            else:
                method(arg)
        else:
            method()
        return

if __name__ == '__main__':
    CommandLineApp().run()

The available and supported options are examined when the instance is
initialized (lines 40-44). By default, the contents of sys.argv
are used as the options and arguments passed in from the command line
to the program. It is easy to pass a different list of options when
writing automated tests for your program, by passing a list of strings
to __init__() as commandLineOptions. The options supported by
the program are determined by scanning the class for option handler
methods. No options are actually evaluated until run() is called.
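
For example, a test might look something like this (a sketch based on
the csvcat class from Listing 2 and the test data shown earlier):

from Listing2 import csvcat

# Pass a canned argument list instead of relying on sys.argv.
app = csvcat(['--dialect', 'excel-tab', 'testdata1.csv'])
app.force_exit = False    # make run() return the exit code
exit_code = app.run()
assert not exit_code      # main() returns None or 0 on success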

When the program is run, the first thing it does is use getopt to
validate the options it has been given (line 201). In
callGetopt(), the arguments needed by getopt are constructed
based on the option handlers discovered for the class (lines 262-288).
Options are processed in the order they are passed on the command line
(lines 205-207), and the option handler method for each option
encountered is called. When an option handler requires an argument
that is not provided on the command line, getopt detects the
error. When an argument is provided, the option handler is responsible
for determining whether the value is the correct type or otherwise
valid. When the argument is not valid, the option handler can raise an
exception with an error message to be printed for the user.

After all of the options are handled, the remaining arguments to the
program are checked to be sure there are enough to satisfy the
requirements, based on the argspec of the main() function. The
number of arguments is checked explicitly to avoid having to handle a
TypeError if the user does not pass the right number of arguments
on the command line. If CommandLineApp depended on catching a
TypeError when it passed too few arguments to main(), it could
not tell the difference between a coding error and a user error. If a
mistake inside main() caused a TypeError to occur, it might
look like the user had passed an incorrect number of arguments to the
program.

Error Handling

When an exception is raised during option processing or inside
main(), the exception is caught by one of the except clauses
on lines 236-245 and given to an error handling method. Subclasses
can change the error handling behavior by overriding these methods.

KeyboardInterrupt exceptions are handled by calling
handleInterrupt(). The default behavior is to print a message
that the program has been interrupted and cause the program to exit
with an error code. A subclass could override the method to clean up
an in-progress task, background thread, or other operation which
otherwise might not be automatically stopped when the
KeyboardInterrupt is received.

When a lower level library tries to exit the program, SystemExit
may be raised. CommandLineApp traps the SystemExit exception
and exits normally, using the exit status taken from the exception.
If the force_exit attribute of the application is false, run()
returns instead of exiting (lines 247-249). Trapping attempts to exit
makes it easier to integrate CommandLineApp programs with
unittest or other testing frameworks. The test can instantiate the
application, set force_exit to a false value, then run it. If any
errors occur, a status code is returned but the test process does not
exit.

Trapping attempts to exit makes it easier to integrate
CommandLineApp programs with unittest or other testing frameworks.

All other types of exceptions are handled by calling
handleMainException() and passing the exception as an argument.
The default implementation of handleMainException() (lines 62-70)
prints a simple error message based on the exception, unless debugging
mode is turned on. Debugging mode prints the entire traceback for the
exception.

$ csvcat file_does_not_exist.csv
ERROR: [Errno 2] No such file or directory:
'file_does_not_exist.csv'

Option Definitions

The standard library module inspect provides functions for
performing introspection operations on classes and objects at
runtime. The API supports basic querying and type checking so it is
possible, for example, to get a list of the methods of a class,
including all inherited methods.
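
A short example shows the kind of query scanForOptions() relies on (a
sketch with made-up classes):

import inspect

class Base(object):
    def inherited_method(self):
        pass

class Derived(Base):
    def local_method(self):
        pass

# getmembers() returns both locally defined and inherited methods.
for name, method in inspect.getmembers(Derived, inspect.ismethod):
    print name
# Prints:
# inherited_method
# local_method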

CommandLineApp.scanForOptions() uses inspect to scan an
application class for option handler methods (lines 251-260). All of
the methods of the class are retrieved with inspect.getmembers(),
and those whose name starts with optionHandler_ are added to the
list of supported options. Since most command line options use dashes
instead of underscores, but method names cannot contain dashes, the
underscores in the option handler method names are converted to
dashes when creating the option name.

The __init__() method of the OptionDef class (lines 440-469)
does all of the work of determining the command line switch name and
what type of arguments the switch takes. The option handler method is
examined with inspect.getargspec(), and the result is used to
initialize the OptionDef.

An “argspec” for a function is a tuple made up of four values: a list
of the names of all regular arguments to the function, including
self if the function is a method; the name of the argument to
receive the variable argument values, if any; the name of the
argument to receive the keyword arguments, if any; and a list of the
default values for the arguments, in the order they appear in the
argument list.

The argspecs for the option handlers in csvcat illustrate the
variations of interest to OptionDef. First,
optionHandler_skip_headers:

>>> import Listing2
>>> import inspect
>>> print inspect.getargspec(
... Listing2.csvcat.optionHandler_skip_headers)
(['self'], None, None, None)

Since the only positional argument to the method is self, and
there is no variable argument name given, the option handler is
treated as a simple command line switch without any arguments.

The optionHandler_dialect, on the other hand, does include an
additional argument:

>>> print inspect.getargspec(
... Listing2.csvcat.optionHandler_dialect)
(['self', 'name'], None, None, None)

The name argument is listed in the argspec as a single regular
argument. The result, when a program is run, is that while the options
are being processed by CommandLineApp and OptionDef, the value
for name is passed directly to the option handler method (line
497).

The optionHandler_columns method illustrates variable argument
handling:

>>> print inspect.getargspec(
... Listing2.csvcat.optionHandler_columns)
(['self'], 'col', None, None)

The col argument from optionHandler_columns is named in the
argspec as the variable argument identifier. Since
optionHandler_columns accepts variable arguments, the
OptionDef splits the argument value into a list of strings, and
the list is passed to the option handler method (lines 494-495) using
the variable argument syntax.

The other variable argument configuration, accepting arbitrary
keyword arguments, does not make sense for an option handler. The user of the
command line program has no standard way to specify named arguments to
options, so they are not supported by OptionDef.

Status Messages

In addition to command line option and argument parsing, and error
handling, CommandLineApp provides a “status message” interface for
giving varying levels of feedback to the user. Status messages are
printed by calling self.statusMessage() (line 108). Each message
must indicate the verbose level setting at which the message should be
printed. If the current verbose level is at or higher than the desired
level, the message is printed. Otherwise, it is ignored. The -v,
--verbose, and --quiet flags let the user control the
verbose_level setting for the application, and are defined in the
CommandLineApp so that all subclasses inherit them.

Listing 4

#!/usr/bin/env python
# Illustrate verbose level controls.

import CommandLineApp

class verbose_app(CommandLineApp.CommandLineApp):
    "Demonstrate verbose level controls."

    def main(self):
        for i in range(1, 10):
            self.statusMessage('Level %d' % i, i)
        return 0

if __name__ == '__main__':
    verbose_app().run()

Listing 4 contains another sample application which uses
statusMessage() to illustrate how the verbose level setting is
applied. The default verbose level is 1, so when the program is run
without any additional arguments only a single message is printed:

$ python Listing4.py
Level 1
$

The --quiet option silences all status messages by setting the
verbose level to 0:

$ python Listing4.py --quiet
$

Using the -v option increases the verbose setting, one level at a
time. The option can be repeated on the command line:

$ python Listing4.py -v
Level 1
Level 2
$ python Listing4.py -vv
New verbose level is 3
Level 1
Level 2
Level 3
$

And the --verbose option sets the verbose level directly to the
desired value:

$ python Listing4.py --verbose 4
New verbose level is 4
Level 1
Level 2
Level 3
Level 4
$

Error messages can be printed to the standard error stream using the
errorMessage() method (lines 138-141). The message is prefixed
with the word “ERROR”, and error messages are always printed, no
matter what verbose level is set. Most programs will not need to use
errorMessage() directly, because raising an exception is
sufficient to have an error message displayed for the user.

CommandLineApp and Inheritance

When creating a suite of related programs, it is usually desirable for
all of the programs to use the same options and, in many cases, share
other common behavior. For example, when working with a database the
connection and transaction must be managed reliably. Rather than
re-implementing the same database handling code in each program, by
using CommandLineApp, you can create an intermediate base class
for your programs and share a single implementation. Listing 5
includes a skeleton base class called SQLiteAppBase for working
with an sqlite3 database in this way.

Listing 5

#!/usr/bin/env python
# Base class for sqlite programs.

import sqlite3
import CommandLineApp

class SQLiteAppBase(CommandLineApp.CommandLineApp):
    """Base class for accessing sqlite databases.
    """

    dbname = 'sqlite.db'
    def optionHandler_db(self, name):
        """Specify the database filename.
        Defaults to 'sqlite.db'.
        """
        self.dbname = name
        return

    def main(self):
        # Subclasses can override this to control the arguments
        # used by the program.
        self.db_connection = sqlite3.connect(self.dbname)
        try:
            self.cursor = self.db_connection.cursor()
            exit_code = self.takeAction()
        except:
            # throw away changes
            self.db_connection.rollback()
            raise
        else:
            # save changes
            self.db_connection.commit()
        return exit_code

    def takeAction(self):
        """Override this in the actual application.
        Return the exit code for the application
        if no exception is raised.
        """
        raise NotImplementedError('Not implemented!')

if __name__ == '__main__':
    SQLiteAppBase().run()

SQLiteAppBase defines a single option handler for the --db
option to let the user choose the database file (line 12). The default
database is a file in the current directory called “sqlite.db”. The
main() method establishes a connection to the database (line 22),
opens a cursor for working with the connection (line 24), then calls
takeAction() to do the work (line 25). When takeAction()
raises an exception, all database changes it may have made are
discarded and the transaction is rolled back (line 28). When there is
no error, the transaction is committed and the changes are saved (line
32).

Listing 6

#!/usr/bin/env python
# Initialize the database

import time
from Listing5 import SQLiteAppBase

class initdb(SQLiteAppBase):
    """Initialize a database.
    """

    def takeAction(self):
        self.statusMessage('Initializing database %s' % self.dbname)
        # Create the table
        self.cursor.execute("CREATE TABLE log (date text, message text)")
        # Log the actions taken
        self.cursor.execute(
            "INSERT INTO log (date, message) VALUES (?, ?)",
            (time.ctime(), 'Created database'))
        self.cursor.execute(
            "INSERT INTO log (date, message) VALUES (?, ?)",
            (time.ctime(), 'Created log table'))
        return 0

if __name__ == '__main__':
    initdb().run()

A subclass of SQLiteAppBase can override takeAction() to do
some actual work using the database connection and cursor created in
main(). Listing 6 contains one such program, called initdb.
In initdb, the takeAction() method creates a “log” table (line
14) using the database cursor established in the base class. It then
inserts two rows into the new table, using the same cursor. There is
no need for initdb to commit the transaction, since the base
class will do that after takeAction() returns without raising an
exception.

$ python Listing6.py
Initializing database sqlite.db

Listing 7

#!/usr/bin/env python
# Show the contents of the log

from Listing5 import SQLiteAppBase

class showlog(SQLiteAppBase):
    """Show the contents of the log.
    """

    substring = None
    def optionHandler_message(self, substring):
        """Look for messages with the substring.
        """
        self.substring = substring
        return

    def takeAction(self):
        if self.substring:
            pattern = '%' + self.substring + '%'
            c = self.cursor.execute(
                "SELECT * FROM log WHERE message LIKE ?;",
                (pattern,))
        else:
            c = self.cursor.execute("SELECT * FROM log;")

        for row in c:
            print '%-30s %s' % row
        return 0

if __name__ == '__main__':
    showlog().run()

The showlog program in Listing 7 also uses SQLiteAppBase. It
reads records from the log table and prints them out to the screen.
When no options are given, it uses the cursor opened by the base
class to find all of the records in the “log” table (line 24), and
print them:

$ python Listing7.py
Sat Aug 25 19:09:41 2007       Created database
Sat Aug 25 19:09:41 2007       Created log table

The --message option to showlog can be used to filter the
output to include only records whose message column matches the
pattern given. When a message substring is specified, the select
statement is altered to include only messages containing the substring
(lines 19-20). In this example, only log messages with the word
“table” in the message are printed:

$ python Listing7.py --message table
Sat Aug 25 19:09:41 2007       Created log table

The updatelog program in Listing 8 inserts new records into the
database. Each time updatelog is called, the message passed on the
command line is saved as an instance attribute by main() (line
15) so it can be used later when a new row is inserted into the
log table (line 20) by takeAction().

Listing 8

#!/usr/bin/env python
# Add a message to the log

import time
from Listing5 import SQLiteAppBase

class updatelog(SQLiteAppBase):
    """Add to the contents of the log.
    """

    def main(self, message):
        """Provide the new message to add to the log.
        """
        # Save the message for use in takeAction()
        self.message = message
        return SQLiteAppBase.main(self)

    def takeAction(self):
        self.cursor.execute(
            "INSERT INTO log (date, message) VALUES (?, ?)",
            (time.ctime(), self.message))
        return 0

if __name__ == '__main__':
    updatelog().run()

$ python Listing8.py "another new message"
$ python Listing7.py
Sat Aug 25 19:09:41 2007       Created database
Sat Aug 25 19:09:41 2007       Created log table
Sat Aug 25 19:10:29 2007       another new message

As with initdb, because the base class commits changes to the
database after takeAction() returns, updatelog does not need
to manage the database connection in any way. Since all of the
example programs use the database connection and cursor created by
their base class, they could be updated to use a PostgreSQL or MySQL
database by modifying the base class, without having to make those
changes to each program separately.
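
For example, a PostgreSQL version of the base class might look
something like the following sketch. It assumes the psycopg2
package is installed; the option name and default connection string
are hypothetical:

#!/usr/bin/env python
# Sketch: the SQLiteAppBase pattern with a PostgreSQL back end.

import psycopg2  # assumed to be installed
import CommandLineApp

class PostgresAppBase(CommandLineApp.CommandLineApp):
    """Base class for accessing PostgreSQL databases.
    """

    dsn = 'dbname=logdb'
    def optionHandler_dsn(self, dsn):
        """Specify the connection string.
        Defaults to 'dbname=logdb'.
        """
        self.dsn = dsn
        return

    def main(self):
        self.db_connection = psycopg2.connect(self.dsn)
        try:
            self.cursor = self.db_connection.cursor()
            exit_code = self.takeAction()
        except:
            # throw away changes
            self.db_connection.rollback()
            raise
        else:
            # save changes
            self.db_connection.commit()
        return exit_code

    def takeAction(self):
        raise NotImplementedError('Not implemented!')

Because the example programs work through the cursor prepared by the
base class, they would need few, if any, changes.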

Future Work

I have been using CommandLineApp in my own work for several years
now, and continue to find ways to enhance it. The two primary features
I would still like to add are the ability to print the help for a
command in formats other than plain text, and automatic type
conversion for arguments.

It is difficult to prepare attractive printed documentation from the
plain text help output produced by the current version of
CommandLineApp. Parsing the text output directly is not
necessarily straightforward, since the embedded help may contain
characters or patterns that would confuse a simple parser. A better
solution is to use the option data gathered by introspection to
generate output in a format such as DocBook, which could then be
converted to PDF or HTML using other tool sets specifically designed
for that purpose. There is a prototype of a program to create DocBook
output from an application class, but it is not yet robust enough to
be released.

CommandLineApp is based on the older option parsing module,
getopt, rather than the newer optparse. This means it does not
support some of the newer features available in optparse, such as
type conversion for arguments. Type conversion could be added to
CommandLineApp by inferring the types from default values for
arguments. The OptionDef already discovers default values, but
they are not used. The OptionDef.invoke() method needs to be
updated to look at the default for an option before calling the
option handler. If the default is a type object, it can be used to
convert the incoming argument. If the default is a regular object,
the type of the object can be determined using type(). Then, once
the type is known, the argument can be converted.
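
A minimal sketch of that inference rule might look like this
(convert_argument() is hypothetical, not part of
CommandLineApp):

def convert_argument(default, raw_value):
    """Convert a command line string based on an option's default."""
    if isinstance(default, type):
        # The default is a type object, such as int;
        # call it directly to convert the string.
        return default(raw_value)
    # The default is a regular object; use its type
    # as the conversion function.
    return type(default)(raw_value)

# convert_argument(int, '42')  => 42
# convert_argument(3.0, '2.5') => 2.5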

Conclusion

I hope this article encourages you to think about your command line
programs in a different light, and to treat them as first class
objects. Using inheritance to share code is so common in other areas
of development that it is hardly given a second thought in most cases.
As has been shown with the SQLiteAppBase programs, the same
technique can be just as powerful when applied to building command
line programs, saving development time and testing effort as a result.
CommandLineApp has been used as the foundation for dozens of types
of programs, and could be just what you need the next time you have to
write a new command line program.

Caching RSS Feeds With feedcache

Originally published in Python Magazine Volume 1 Issue 11 , November,
2007

The past several years have seen a steady increase in the use of
RSS and Atom feeds for data sharing. Blogs, podcasts, social
networking sites, search engines, and news services are just a few
examples of data sources delivered via such feeds. Working with
internet services requires care, because inefficiencies in one
client implementation may cause performance problems with the
service that can be felt by all of the consumers accessing the
same server. In this article, I describe the development of the
feedcache package, and give examples of how you can use it to
optimize the use of data feeds in your application.

I frequently find myself wanting to listen to one or two episodes from
a podcast, but not wanting to subscribe to the entire series. In order
to scratch this itch, I built a web based tool, hosted at
http://www.castsampler.com/, to let me pick and choose individual
episodes from a variety of podcast feeds, then construct a single feed
with the results. Now I subscribe to the single feed with my podcast
client, and easily populate it with new episodes when I encounter any
that sound interesting. The feedcache package was developed as part
of this tool to manage accessing and updating the feeds efficiently,
and has been released separately under the BSD license.

Example Feed Data

The two most widely implemented formats for syndicating web
data are RSS (in one of a few versions) and Atom. Both formats have a
similar structure. Each feed begins with basic information about the
data source (title, link, description, etc.). The introductory
information is followed by a series of “items”, each of which
represents a resource like a blog post, news article, or podcast.
Each item, in turn, has a title, description, and other information
like when it was written. It may also refer to one or more
attachments, or enclosures.

Listing 1 shows a sample RSS 2.0 feed and Listing 2 shows a sample
Atom feed. Each sample listing contains one item with a single podcast
enclosure. Both formats are XML, and contain essentially the same
data. They use slightly different tag names though, and podcast
enclosures are handled differently between the two formats, which can
make working with different feed formats more work in some
environments. Fortunately, Python developers do not need to worry
about the differences in the feed formats, thanks to the Universal
Feed Parser.

Listing 1

<?xml version="1.0" encoding="utf-8"?>

<rss version="2.0">
  <channel>
    <title>Sample RSS 2.0 Feed</title>
    <link>http://www.example.com/rss.xml</link>
    <description>Sample feed using RSS 2.0 format.</description>
    <language>en-us</language>
    <item>
      <title>item title goes here</title>
      <link>http://www.example.com/items/1/</link>
      <description>description goes here</description>
      <author>authoremail@example.com (author goes here)</author>
      <pubDate>Sat, 4 Aug 2007 15:00:36 -0000</pubDate>
      <guid>http://www.example.com/items/1/</guid>
      <enclosure url="http://www.example.com/items/1/enclosure" length="100" type="audio/mpeg">
      </enclosure>
    </item>
  </channel>
</rss>

Listing 2

<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">
  <title>Sample Atom Feed</title>
  <link href="http://www.example.com/" rel="alternate"></link>
  <link href="http://www.example.com/atom.xml" rel="self"></link>
  <id>http://www.example.com/atom.xml</id>
  <updated>2007-08-04T15:00:36Z</updated>
  <entry>
    <title>title goes here</title>
    <link href="http://www.example.com/items/1/" rel="alternate"></link>
    <updated>2007-08-04T15:00:36Z</updated>
    <author>
      <name>author goes here</name>
      <email>authoremail@example.com</email>
    </author>
    <id>http://www.example.com/items/1/</id>
    <summary type="html">description goes here</summary>
    <link length="100" href="http://www.example.com/items/1/enclosure" type="audio/mpeg" rel="enclosure">
    </link>
  </entry>
</feed>

Universal Feed Parser

Mark Pilgrim’s Universal Feed Parser is an open source module that
manages most aspects of downloading and parsing RSS and Atom feeds.
Once the feed has been downloaded and parsed, the parser returns an
object with all of the parsed data easily accessible through a single
API, regardless of the original feed format.

Listing 3 shows a simple example program for accessing feeds with
feedparser. On line 9, a URL from the command line arguments is
passed to feedparser.parse() to be downloaded and parsed. The
results are returned as a FeedParserDict. The properties of the
FeedParserDict can be accessed via the dictionary API or using
attribute names as illustrated on line
10.

Listing 3

#!/usr/bin/env python
"""Print contents of feeds specified on the command line.
"""

import feedparser
import sys

for url in sys.argv[1:]:
    data = feedparser.parse(url)
    for entry in data.entries:
        print '%s: %s' % (data.feed.title, entry.title)

When the sample program in Listing 3 is run with the URL for the feed
of feedcache project releases, it shows the titles for the
releases available right now:

$ python Listing3.py http://feeds.feedburner.com/FeedcacheReleases
feedcache Releases: feedcache 0.1
feedcache Releases: feedcache 0.2
feedcache Releases: feedcache 0.3
feedcache Releases: feedcache 0.4
feedcache Releases: feedcache 0.5

Every time the program runs, it fetches the entire feed, whether the
contents have changed or not. That inefficiency might not matter a
great deal for a client program that is run infrequently, but the
costs add up on the server when many clients access the same feed,
especially if they check it on a regular basis. The problem is
especially bad for the server if the feed contents are produced
dynamically, since each client incurs a certain amount of CPU, I/O,
and bandwidth load to produce the XML representation of the
feed. Some sites are
understandably strict about how often a client can retrieve feeds, to
cut down on heavy bandwidth and CPU consumers. Slashdot, for example,
returns a special feed with a warning to any client that accesses
their RSS feed too frequently over a short span of time.

A Different Type of Podcast Aggregator

A typical aggregator design would include a monitor to regularly
download the feeds and store the fresh information about the feed and
its contents in a database. The requirements for CastSampler are a
little different, though.

CastSampler remembers the feeds to which a user has subscribed, but
unlike other feed aggregators, it only downloads the episode metadata
while the user is choosing episodes to add to their download
feed. Since the user does not automatically receive every episode of
every feed, the aggregator does not need to constantly monitor all of
the feeds. Instead, it shows a list of episodes for a selected feed,
and lets the user choose which episodes to download. Then it needs to
remember those selected episodes later so it can produce the combined
feed for the user’s podcast client.

If every item from every feed were stored in the database, most of the
data in the database would be for items that were never selected for
download. There would need to be a way to remove old data from the
database when it expired or was no longer valid, adding to the
maintenance work for the site. Instead, CastSampler only uses the
database to store information about episodes selected by the user.
The rest of the data about the feed is stored outside of the database
in a form that makes it easier to discard old data when the feed is
updated. This division eliminates a lot of the data management effort
behind running the site.

Feedcache Requirements

An important goal for this project was to make CastSampler a polite
consumer of feeds, and ensure that it did not overload servers while a
user was selecting podcast episodes interactively. By caching the feed
data for a short period of time, CastSampler could avoid accessing
feeds every time it needed to show the feed data. A persistent cache,
written to disk, would let the data be reused even if the application
was restarted, such as might happen during development. Using a cache
would also improve responsiveness, since reading data from the local
disk would be faster than fetching the feed from the remote server.
To further reduce server load, feedcache is designed to take
advantage of conditional GET features of HTTP, to avoid downloading
the full feed whenever possible.

Another goal was to have a small API for the cache. It should take
care of everything for the caller, so there would not need to be many
functions to interact with it. To retrieve the contents of a feed, the
caller should only have to provide the URL for that feed. All other
information needed to track the freshness of the data in the cache
would be managed internally.
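
Taken together, the intended usage amounts to a couple of lines (a
sketch; the Cache class itself is developed below):

storage = {}       # any dictionary-like object will do
c = Cache(storage)
parsed_feed = c.fetch('http://www.example.com/atom.xml')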

It was also important for the cache to be able to store data in
multiple ways, to make it more flexible for other programmers who
might want to use it. Although CastSampler was going to store the
cache on disk, other applications with more computing resources or
tighter performance requirements might prefer to hold the cache in
memory. Using disk storage should not be hard coded into the cache
management logic.

These requirements led to a design which split the responsibility for
managing the cached data between two objects. The Cache object
tracks information about a feed so it can download the latest version
as efficiently as possible, only when needed. Persistent storage of
the data in the cache is handled by a separate back end storage
object. Dividing the responsibilities in this way maximizes the
flexibility of the Cache, since it can concentrate on tracking
whether the feed is up to date without worrying about storage
management. It also let Cache users take advantage of multiple
storage implementations.

The Cache Class

Once the basic requirements and a skeleton design were worked out, the
next step was to start writing tests so the implementation of
Cache could begin. Working with a few simple tests would clarify
how a Cache user would want to access feeds. The first test was to
verify that the Cache would fetch feed data.

import unittest, cache
class CacheTest(unittest.TestCase):
    def testFetch(self):
        c = cache.Cache({})
        parsed_feed = c.fetch('http://feeds.feedburner.com/FeedcacheReleases')
        self.failUnless(parsed_feed.entries)

Since the design separated storage and feed management
responsibilities, it was natural to pass the storage handler to the
Cache when it is initialized. The dictionary API is used for the
storage because there are several storage options available that
support it. The shelve module in the Python standard library stores
data persistently using an object that conforms to the dictionary API,
as does the shove library from L.C. Rees. Either library would work
well for the final application. For initial testing, using a simple
dictionary to hold the data in memory was convenient, since that meant
the tests would not need any external resources.

After constructing the Cache, the next step in the test is to
retrieve a feed. I considered using the __getitem__() hook,
but since Cache would not support any of the other dictionary
methods, I rejected it in favor of an explicit method,
fetch(). The caller passes a feed URL to fetch(), which
returns a FeedParserDict instance. Listing 4 shows the first version
of the Cache class that works for the test as it is written. No
actual caching is being done, yet. The Cache instance simply uses
the feedparser module to retrieve and parse the feed.

Listing 4

#!/usr/bin/env python
"""The first version of Cache
"""

import unittest
import feedparser

class Cache:
    def __init__(self, storage):
        self.storage = storage
        return

    def fetch(self, url):
        return feedparser.parse(url)

class CacheTest(unittest.TestCase):

    def testFetch(self):
        c = Cache({})
        parsed_feed = c.fetch('http://feeds.feedburner.com/FeedcacheReleases')
        self.failUnless(parsed_feed.entries)
        return

if __name__ == '__main__':
    unittest.main()

Throttling Downloads

Now that Cache could successfully download feed data, the first
optimization to make was to hold on to the data and track its age.
Then for every call to fetch(), Cache could first check to see
if fresh data was already available locally before going out to the
server to download the feed again.

Listing 5 shows the version of Cache with a download throttle, in
the form of a timeToLiveSeconds parameter. Items already in the
cache will be reused until they are older than
timeToLiveSeconds. The default value for timeToLiveSeconds
means that any given feed will not be checked more often than every
five minutes.

Listing 5

#!/usr/bin/env python
"""The first version of Cache
"""

import time
import unittest
import feedparser

class Cache:
    def __init__(self, storage, timeToLiveSeconds=300):
        self.storage = storage
        self.time_to_live = timeToLiveSeconds
        return

    def fetch(self, url):
        now = time.time()
        cached_time, cached_content = self.storage.get(url, (None, None))

        # Does the storage contain a version of the data
        # which is older than the time-to-live?
        if cached_time is not None:
            age = now - cached_time
            if age <= self.time_to_live:
                return cached_content

        parsed_data = feedparser.parse(url)
        self.storage[url] = (now, parsed_data)
        return parsed_data

class CacheTest(unittest.TestCase):

    def testFetch(self):
        c = Cache({})
        parsed_feed = c.fetch('http://feeds.feedburner.com/FeedcacheReleases')
        self.failUnless(parsed_feed.entries)
        return

    def testReuseContentsWithinTimeToLiveWindow(self):
        url = 'http://feeds.feedburner.com/FeedcacheReleases'
        c = Cache({ url:(time.time(), 'prepopulated cache')})
        cache_contents = c.fetch(url)
        self.failUnlessEqual(cache_contents, 'prepopulated cache')
        return

if __name__ == '__main__':
    unittest.main()

The new implementation of fetch() stores the current time along
with the feed data when the storage is updated. When fetch() is
called again with the same URL, the time in the cache is checked
against the current time to determine if the value in the cache is
fresh enough. The test on line 38 verifies this behavior by
pre-populating the Cache’s storage with data, and checking to
see that the existing cache contents are returned instead of the
contents of the feed.

Conditional HTTP GET

Conditional HTTP GET allows a client to tell a server something about
the version of a feed the client already has. The server can decide if
the contents of the feed have changed and, if they have not, send a
short status code in the HTTP response instead of a complete copy of
the feed data. Conditional GET is primarily a way to conserve
bandwidth, but if the feed has not changed and the server’s version
checking algorithm is efficient then the server may use fewer CPU
resources to prepare the response, as well.

When a server implements conditional GET, it uses extra headers with
each response to notify the client. There are two headers involved,
and the server can use either or both together, in case the client
only supports one. Cache supports both headers.

Although timestamps are an imprecise way to detect change, since the
time on different servers in a pool might vary slightly, they are
simple to work with. The Last-Modified header contains a timestamp
value that indicates when the feed contents last changed. The client
sends the timestamp back to the server in the next request as
If-Modified-Since. The server then compares the dates to determine
if the feed has been modified since the last request from the client.

A more precise way to determine if the feed has changed is to use an
Entity Tag in the ETag header. An ETag is a hashed
representation of the feed state, or of a value the server can use to
quickly determine if the feed has been updated. The data and algorithm
for computing the hash is left up to the server, but it should be less
expensive than returning the feed contents or there won’t be any
performance gains. When the client sees an ETag header, it can
send the associated value back to the server with the next request in
the If-None-Match request header. When the server sees
If-None-Match, it computes the current hash and compares it to the
value sent by the client. If they match, the feed has not changed.

When using either ETag or modification timestamps, if the server
determines that the feed has not been updated since the previous
request, it returns a response code of 304, or “Not Modified”, and
includes nothing in the body of the response. When it sees the 304
status in the response from the server, the client should reuse the
version of the feed it already has.
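
For illustration, the exchange might look like this (the header
values here are hypothetical):

GET /atom.xml HTTP/1.1
Host: www.example.com
If-Modified-Since: Sat, 04 Aug 2007 15:00:36 GMT
If-None-Match: "b1946ac92492d2347c6235b4d2611184"

HTTP/1.1 304 Not Modified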

Creating a Test Server

In order to write correct tests to exercise conditional GET in
feedcache, more control over the server would be important. The
feedburner URL used in the earlier tests might be down, or return
different data if a feed was updated. It would be necessary for the
server to respond reliably with data the test code knew in advance,
and to be sure it would not stop responding if it was queried too
often by the tests. The tests also needed to control which of the
headers (ETag or If-Modified-Since) was used to determine if the
feed had changed, so both methods could be tested independently. The
solution was to write a small test HTTP server that could be managed
by the unit tests and configured as needed. Creating the test server
was easy, using a few standard library modules.

The test server code, along with a base class for unit tests that use
it, can be found in Listing 6. The TestHTTPServer (line 91) is
derived from BaseHTTPServer.HTTPServer. The serve_forever()
method (line 112) has been overridden with an implementation that
checks a flag after each request to see if the server should keep
running. The test harness sets the flag to stop the test server after
each test. The serve_forever() loop also counts the requests
successfully processed, so the tests can determine how many times the
Cache fetches a feed.

Listing 6

#!/usr/bin/env python
"""Simple HTTP server for testing the feed cache.
"""

import BaseHTTPServer
import email.utils
import logging
import md5
import threading
import time
import unittest
import urllib


def make_etag(data):
    """Given a string containing data to be returned to the client,
    compute an ETag value for the data.
    """
    _md5 = md5.new()
    _md5.update(data)
    return _md5.hexdigest()


class TestHTTPHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    "HTTP request handler which serves the same feed data every time."

    FEED_DATA = """<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">
  <title>CacheTest test data</title>
  <link href="http://localhost/feedcache/" rel="alternate"></link>
  <link href="http://localhost/feedcache/atom/" rel="self"></link>
  <id>http://localhost/feedcache/</id>
  <updated>2006-10-14T11:00:36Z</updated>
  <entry>
    <title>single test entry</title>
    <link href="http://www.example.com/" rel="alternate"></link>
    <updated>2006-10-14T11:00:36Z</updated>
    <author>
      <name>author goes here</name>
      <email>authoremail@example.com</email>
    </author>
    <id>http://www.example.com/</id>
    <summary type="html">description goes here</summary>
    <link length="100" href="http://www.example.com/enclosure" type="text/html" rel="enclosure">
    </link>
  </entry>
</feed>"""

    # The data does not change, so save the ETag and modified times
    # as class attributes.
    ETAG = make_etag(FEED_DATA)
    MODIFIED_TIME = email.utils.formatdate(usegmt=True)

    def do_GET(self):
        "Handle GET requests."

        if self.path == '/shutdown':
            # Shortcut to handle stopping the server
            self.server.stop()
            self.send_response(200)

        else:
            incoming_etag = self.headers.get('If-None-Match', None)
            incoming_modified = self.headers.get('If-Modified-Since', None)

            send_data = True

            # Does the client have the same version of the data we have?
            if self.server.apply_modified_headers:
                if incoming_etag == self.ETAG:
                    self.send_response(304)
                    send_data = False

                elif incoming_modified == self.MODIFIED_TIME:
                    self.send_response(304)
                    send_data = False

            # Now optionally send the data, if the client needs it
            if send_data:
                self.send_response(200)
                self.send_header('Content-Type', 'application/atom+xml')
                self.send_header('ETag', self.ETAG)
                self.send_header('Last-Modified', self.MODIFIED_TIME)
                self.end_headers()

                self.wfile.write(self.FEED_DATA)
        return


class TestHTTPServer(BaseHTTPServer.HTTPServer):
    """HTTP Server which counts the number of requests made
    and can stop based on client instructions.
    """

    def __init__(self, applyModifiedHeaders=True):
        self.apply_modified_headers = applyModifiedHeaders
        self.keep_serving = True
        self.request_count = 0
        BaseHTTPServer.HTTPServer.__init__(self, ('', 9999), TestHTTPHandler)
        return

    def getNumRequests(self):
        "Return the number of requests which have been made on the server."
        return self.request_count

    def stop(self):
        "Stop serving requests, after the next request."
        self.keep_serving = False
        return

    def serve_forever(self):
        "Main loop for server"
        while self.keep_serving:
            self.handle_request()
            self.request_count += 1
        return


class HTTPTestBase(unittest.TestCase):
    "Base class for tests that use a TestHTTPServer"

    TEST_URL = 'http://localhost:9999/'

    CACHE_TTL = 0

    def setUp(self):
        self.server = self.getServer()
        self.server_thread = threading.Thread(target=self.server.serve_forever)
        self.server_thread.setDaemon(True) # so the tests don't hang if cleanup fails
        self.server_thread.start()
        return

    def getServer(self):
        "Return a web server for the test."
        return TestHTTPServer()

    def tearDown(self):
        # Stop the server thread
        ignore = urllib.urlretrieve('http://localhost:9999/shutdown')
        time.sleep(1)
        self.server.server_close()
        self.server_thread.join()
        return


class HTTPTest(HTTPTestBase):

    def testResponse(self):
        # Verify that the server thread responds
        # without error.
        filename, response = urllib.urlretrieve(self.TEST_URL)
        return

if __name__ == '__main__':
    unittest.main()

The test server processes incoming HTTP requests with
TestHTTPHandler (line 24), derived from
BaseHTTPServer.BaseHTTPRequestHandler. TestHTTPHandler
implements do_GET() (line 55) to respond to HTTP GET requests.
Feed data for the tests is hard coded in the FEED_DATA class
attribute (line 27). The URL path /shutdown is used to tell the
server to stop responding to requests. All other paths are treated as
requests for the feed data. The requests are processed by checking the
If-None-Match and If-Modified-Since headers, and responding
either with a 304 status or with the static feed data.

HTTPTestBase is a convenience base class to be used by other
tests. It manages a TestHTTPServer instance in a separate thread,
so the tests can all run in a single process. Listing 7 shows what the
existing tests look like, rewritten to use the HTTPTestBase as a
base class. The only differences are the base class for the tests and
the use of self.TEST_URL, which points to the local test server
instead of the feedburner URL from Listing 5.

Listing 7

#!/usr/bin/env python
"""The first version of Cache
"""

import time
import unittest
import feedparser
from Listing5 import Cache
from Listing6 import HTTPTestBase

class CacheTest(HTTPTestBase):

    def testFetch(self):
        c = Cache({})
        parsed_feed = c.fetch(self.TEST_URL)
        self.failUnless(parsed_feed.entries)
        return

    def testReuseContentsWithinTimeToLiveWindow(self):
        c = Cache({ self.TEST_URL:(time.time(), 'prepopulated cache')})
        cache_contents = c.fetch(self.TEST_URL)
        self.failUnlessEqual(cache_contents, 'prepopulated cache')
        return

if __name__ == '__main__':
    unittest.main()

Implementing Conditional HTTP GET

With these testing tools in place, the next step was to enhance the
Cache class to monitor and use the conditional HTTP GET
parameters. Listing 8 shows the final version of Cache with these
features. The fetch() method has been enhanced to send the
ETag and modified time from the cached version of the feed to the
server, when they are available.

Listing 8

#!/usr/bin/env python
"""Cache class with conditional HTTP GET support.
"""

import feedparser
import time
import unittest
import UserDict

import Listing6 # For the test base class

class Cache:

    def __init__(self, storage, timeToLiveSeconds=300, userAgent='feedcache'):
        self.storage = storage
        self.time_to_live = timeToLiveSeconds
        self.user_agent = userAgent
        return

    def fetch(self, url):
        modified = None
        etag = None
        now = time.time()

        cached_time, cached_content = self.storage.get(url, (None, None))

        # Does the storage contain a version of the data
        # which is older than the time-to-live?
        if cached_time is not None:
            if self.time_to_live:
                age = now - cached_time
                if age <= self.time_to_live:
                    return cached_content

            # The cache is out of date, but we have
            # something.  Try to use the etag and modified
            # values from the cached content.
            etag = cached_content.get('etag')
            modified = cached_content.get('modified')

        # We know we need to fetch, so go ahead and do it.
        parsed_result = feedparser.parse(url,
                                         agent=self.user_agent,
                                         modified=modified,
                                         etag=etag,
                                         )

        status = parsed_result.get('status', None)
        if status == 304:
            # No new data, based on the etag or modified values.
            # We need to update the modified time in the
            # storage, though, so we know that what we have
            # stored is up to date.
            self.storage[url] = (now, cached_content)

            # Return the data from the cache, since
            # the parsed data will be empty.
            parsed_result = cached_content
        elif status == 200:
            # There is new content, so store it unless there was an error.
            error = parsed_result.get('bozo_exception')
            if not error:
                self.storage[url] = (now, parsed_result)

        return parsed_result


class SingleWriteMemoryStorage(UserDict.UserDict):
    """Cache storage which only allows the cache value
    for a URL to be updated one time.
    """

    def __setitem__(self, url, data):
        if url in self.keys():
            modified, existing = self[url]
            # Allow the modified time to change,
            # but not the feed content.
            if data[1] != existing:
                raise AssertionError('Trying to update cache for %s to %s' 
                                         % (url, data))
        UserDict.UserDict.__setitem__(self, url, data)
        return


class CacheConditionalGETTest(Listing6.HTTPTestBase):

    def setUp(self):
        Listing6.HTTPTestBase.setUp(self)
        self.cache = Cache(storage=SingleWriteMemoryStorage(),
                           timeToLiveSeconds=0, # so we do not reuse the local copy
                           )
        return

    def testFetchOnceForEtag(self):
        # Fetch data which has a valid ETag value, and verify
        # that while we hit the server twice the response
        # codes cause us to use the same data.

        # First fetch populates the cache
        response1 = self.cache.fetch(self.TEST_URL)
        self.failUnlessEqual(response1.feed.title, 'CacheTest test data')

        # Remove the modified setting from the cache so we know
        # the next time we check the etag will be used
        # to check for updates.  Since we are using an in-memory
        # cache, modifying response1 updates the cache storage
        # directly.
        response1['modified'] = None

        # Wait so the cache data times out
        time.sleep(1)

        # This should result in a 304 status, and no data from
        # the server.  That means the cache won't try to
        # update the storage, so our SingleWriteMemoryStorage
        # should not raise and we should have the same
        # response object.
        response2 = self.cache.fetch(self.TEST_URL)
        self.failUnless(response1 is response2)

        # Should have hit the server twice
        self.failUnlessEqual(self.server.getNumRequests(), 2)
        return

    def testFetchOnceForModifiedTime(self):
        # Fetch data which has a valid Last-Modified value, and verify
        # that while we hit the server twice the response
        # codes cause us to use the same data.

        # First fetch populates the cache
        response1 = self.cache.fetch(self.TEST_URL)
        self.failUnlessEqual(response1.feed.title, 'CacheTest test data')

        # Remove the etag setting from the cache so we know
        # the next time we check the modified time will be used
        # to check for updates.  Since we are using an in-memory
        # cache, modifying response1 updates the cache storage
        # directly.
        response1['etag'] = None

        # Wait so the cache data times out
        time.sleep(1)

        # This should result in a 304 status, and no data from
        # the server.  That means the cache won't try to
        # update the storage, so our SingleWriteMemoryStorage
        # should not raise and we should have the same
        # response object.
        response2 = self.cache.fetch(self.TEST_URL)
        self.failUnless(response1 is response2)

        # Should have hit the server twice
        self.failUnlessEqual(self.server.getNumRequests(), 2)
        return

if __name__ == '__main__':
    unittest.main()

The FeedParserDict object returned from feedparser.parse()
conveniently includes the ETag and modified timestamp, if the
server sent them. On lines 38-39, once the cached feed is determined
to be out of date, the ETag and modified values are retrieved so
they can be passed in to feedparser.parse() on line 42.

Since the updated client sends ETag and If-Modified-Since
headers, the server may now respond with a status code indicating that
the cached copy of the data is still valid. It is no longer sufficient
to simply store the response from the server before returning it. The
status code must be checked, as on line 49, and if the status is
304 then the timestamp of the cached copy is updated. If the
timestamp was not updated, then as soon as the cached copy of the feed
exceeded the time-to-live, the Cache would request a new copy of
the feed from the server every time the feed was accessed. Updating
the timestamp ensures that the download throttling remains enforced.

Separate tests for each conditional GET header are implemented in
CacheConditionalGETTest. To verify that the Cache handles the
304 status code properly and does not try to update the contents
of the storage on a second fetch, these tests use a special storage
class. The SingleWriteMemoryStorage raises an AssertionError
if a value is modified after it is set the first time. An
AssertionError is used, because that is how unittest.TestCase
signals a test failure, and modifying the contents of the storage is a
failure for these tests.

Each test method of CacheConditionalGETTest verifies handling for
one of the conditional GET headers at a time. Since the test server
always sets both headers, each test clears one value from the cache
before making the second request. The remaining header value is sent
to the server as part of the second request, and the server responds
with the 304 status code.

Persistent Storage With shelve

All of the examples and tests so far have used in-memory storage
options. For CastSampler, though, the cache of feed data needed to be
stored on disk. As mentioned earlier, the shelve module in the
standard library provides a simple persistent storage mechanism. It
also conforms to the dictionary API used by the Cache class.

Using shelve by itself works in a simple, single-threaded case, but
it is not clear from its documentation whether shelve supports
write access from multiple concurrent threads. To ensure the shelf is
not corrupted, a thread lock should be used. CacheStorageLock is a
simple wrapper around shelve that uses a lock to prevent more than
one thread from accessing the shelf simultaneously. Listing 9 contains
the code for the CacheStorageLock and a test that illustrates
using it to combine a Cache and shelve.

Listing 9

#!/usr/bin/env python
"""Using Cache with shelve.
"""

from __future__ import with_statement

import os
import shelve
import tempfile
import threading
import unittest

from Listing6 import HTTPTestBase
from Listing8 import Cache

class CacheStorageLock:

    def __init__(self, shelf):
        self.lock = threading.Lock()
        self.shelf = shelf
        return

    def __getitem__(self, key):
        with self.lock:
            return self.shelf[key]

    def get(self, key, default=None):
        with self.lock:
            try:
                return self.shelf[key]
            except KeyError:
                return default

    def __setitem__(self, key, value):
        with self.lock:
            self.shelf[key] = value


class CacheShelveTest(HTTPTestBase):

    def setUp(self):
        HTTPTestBase.setUp(self)
        handle, self.shelve_filename = tempfile.mkstemp('.shelve')
        os.close(handle) # we just want the file name, so close the open handle
        os.unlink(self.shelve_filename) # remove empty file so shelve is not confused
        return

    def tearDown(self):
        try:
            os.unlink(self.shelve_filename)
        except AttributeError:
            pass
        HTTPTestBase.tearDown(self)
        return

    def test(self):
        storage = shelve.open(self.shelve_filename)
        locking_storage = CacheStorageLock(storage)
        try:
            fc = Cache(locking_storage)

            # First fetch the data through the cache
            parsed_data = fc.fetch(self.TEST_URL)
            self.failUnlessEqual(parsed_data.feed.title, 'CacheTest test data')

            # Now retrieve the same data directly from the shelf
            modified, shelved_data = storage[self.TEST_URL]

            # The data should be the same
            self.failUnlessEqual(parsed_data, shelved_data)
        finally:
            storage.close()
        return


if __name__ == '__main__':
    unittest.main()

The test setUp() method uses tempfile to create a temporary
filename for the cache. The temporary file has to be deleted in
setUp() because if the file exists but is empty, shelve
cannot determine which database module to use to open it. The
test() method fetches the data from the server, then
compares the returned data with the data in the shelf to verify that
they are the same.

CacheStorageLock uses a threading.Lock instance to control
access to the shelf. It only manages access for the methods known to
be used by Cache. The lock is acquired and released using the
with statement, which is only enabled by default beginning with
Python 2.6. Since this code was written for Python 2.5, the module
starts with a from __future__ import statement to enable the
syntax for with.
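
For comparison, the __getitem__() body above is roughly
equivalent to this traditional try/finally form:

def __getitem__(self, key):
    self.lock.acquire()
    try:
        return self.shelf[key]
    finally:
        self.lock.release()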

Other Persistence Options

At any one time, shelve only allows one process to open a shelf
file to write to it. In applications with multiple processes that need
to modify the cache, alternative storage options are
desirable. Cache treats its storage object as a dictionary, so any
class that conforms to the dictionary API can be used for back end
storage. The shove module, by L. C. Rees, uses the dictionary API
and offers support for a variety of back end storage options. The
supported options include relational databases, BSD-style databases,
Amazon’s S3 storage service, and others.
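
For illustration, creating stores for a few of those back ends looks
something like this sketch (the exact URL formats are described in
shove’s docstrings, so treat these as approximations):

import shove

file_store = shove.Shove('file:///tmp/feedcache')       # filesystem
db_store = shove.Shove('sqlite:///tmp/feedcache.db')    # relational database
s3_store = shove.Shove('s3://key:secret@bucket')        # Amazon S3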

The filesystem store option was particularly interesting for
CastSampler. With shove’s file store, each key is mapped to a
filename. The data associated with the key is pickled and stored in
the file. By using separate files, it is possible to have separate
threads and processes updating the cache simultaneously. Although the
shove file implementation does not handle file locking, for my
purposes it was unlikely that two threads would try to update the same
feed at the same time.

Listing 10 includes a test that illustrates using shove file
storage with feedcache. The primary difference in the APIs for
shove and shelve is the syntax for specifying the storage
destination. Shove uses a URL syntax to indicate which back end should
be used. The format for each back end is described in the docstrings.

Listing 10

#!/usr/bin/env python
"""Tests with shove filesystem storage.
"""

import os
import shove
import tempfile
import threading
import unittest

from Listing6 import HTTPTestBase
from Listing8 import Cache

class CacheShoveTest(HTTPTestBase):

    def setUp(self):
        HTTPTestBase.setUp(self)
        self.shove_dirname = tempfile.mkdtemp('shove')
        return

    def tearDown(self):
        try:
            os.system('rm -rf %s' % self.shove_dirname)
        except AttributeError:
            pass
        HTTPTestBase.tearDown(self)
        return

    def test(self):
        # First fetch the data through the cache
        storage = shove.Shove('file://' + self.shove_dirname)
        try:
            fc = Cache(storage)
            parsed_data = fc.fetch(self.TEST_URL)
            self.failUnlessEqual(parsed_data.feed.title, 'CacheTest test data')
        finally:
            storage.close()

        # Now retrieve the same data directly from the shelf
        storage = shove.Shove('file://' + self.shove_dirname)
        try:
            modified, shelved_data = storage[self.TEST_URL]
        finally:
            storage.close()

        # The data should be the same
        self.failUnlessEqual(parsed_data, shelved_data)
        return


if __name__ == '__main__':
    unittest.main()

Using feedcache With Multiple Threads

Up to this point, all of the examples have been running in a single
thread driven by the unittest framework. Now that integrating
shove and feedcache has been shown to work, it is possible to
take a closer look at using multiple threads to fetch feeds, and build
a more complex example application. Spreading the work of fetching
data into multiple processing threads is more complicated, but yields
better performance under most circumstances because while one thread
is blocked waiting for data from the network, another thread can take
over and process a different URL.

Listing 11 shows a sample application which accepts URLs as arguments
on the command line and prints the titles of all of the entries in the
feeds. The results may be mixed together, depending on how the
processing control switches between active threads. This example
program is more like a traditional feed aggregator, since it processes
every entry of every feed.

Listing 11

#!/usr/bin/env python
"""Example use of feedcache.Cache combined with threads.
"""

import Queue
import sys
import shove
import threading

from Listing8 import Cache

MAX_THREADS=5
OUTPUT_DIR='/tmp/feedcache_example'


def main(urls=[]):

    if not urls:
        print 'Specify the URLs to a few RSS or Atom feeds on the command line.'
        return

    # Add the URLs to a queue
    url_queue = Queue.Queue()
    for url in urls:
        url_queue.put(url)

    # Add poison pills to the url queue to cause
    # the worker threads to break out of their loops
    for i in range(MAX_THREADS):
        url_queue.put(None)

    # Track the entries in the feeds being fetched
    entry_queue = Queue.Queue()

    print 'Saving feed data to', OUTPUT_DIR
    storage = shove.Shove('file://' + OUTPUT_DIR)
    try:

        # Start a few worker threads
        worker_threads = []
        for i in range(MAX_THREADS):
            t = threading.Thread(target=fetch_urls,
                                 args=(storage, url_queue, entry_queue,))
            worker_threads.append(t)
            t.setDaemon(True)
            t.start()

        # Start a thread to print the results
        printer_thread = threading.Thread(target=print_entries, args=(entry_queue,))
        printer_thread.setDaemon(True)
        printer_thread.start()

        # Wait for all of the URLs to be processed
        url_queue.join()

        # Wait for the worker threads to finish
        for t in worker_threads:
            t.join()

        # Poison the print thread and wait for it to exit
        entry_queue.put((None,None))
        entry_queue.join()
        printer_thread.join()

    finally:
        storage.close()
    return


def fetch_urls(storage, input_queue, output_queue):
    """Thread target for fetching feed data.
    """
    c = Cache(storage)

    while True:
        next_url = input_queue.get()
        if next_url is None: # None causes thread to exit
            input_queue.task_done()
            break

        feed_data = c.fetch(next_url)
        for entry in feed_data.entries:
            output_queue.put( (feed_data.feed, entry) )
        input_queue.task_done()
    return


def print_entries(input_queue):
    """Thread target for printing the contents of the feeds.
    """
    while True:
        feed, entry = input_queue.get()
        if feed is None: # None causes thread to exit
            input_queue.task_done()
            break

        print '%s: %s' % (feed.title, entry.title)
        input_queue.task_done()
    return


if __name__ == '__main__':
    main(sys.argv[1:])

The design uses queues to pass data between two different types of
threads to work on the feeds. Multiple threads use feedcache to
fetch feed data. Each of these threads has its own Cache, but they
all share a common shove store. A single thread waits for the feed
entries to be added to its queue, and then prints each feed title and
entry title.

The main() function sets up two different queues for passing data
in and out of the worker threads. The url_queue (lines 23-25)
contains the URLs for feeds, taken from the command line arguments.
The entry_queue (line 33) is used to pass feed content from the
threads that fetch the feeds to the thread that prints the results. A
shove filesystem store (line 36) is used to cache the feeds. Once
all of the worker threads are started (lines 40-51), the rest of the
main program simply waits for each stage of the work to be completed
by the threads.

The last entries added to the url_queue are None values, which
trigger the worker threads to exit. When the url_queue has been
drained (line 54), the worker threads can be cleaned up. After the
worker threads have finished, (None, None) is added to the
entry_queue to trigger the printing thread to exit when all of the
entries have been printed.

The fetch_urls() function (lines 70-85) runs in the worker
threads. It takes one feed URL at a time from the input queue,
retrieves the feed contents from a cache, then adds the feed entries
to the output queue. When the item taken out of the queue is None
instead of a URL string, it is interpreted as a signal that the thread
should break out of its processing loop. Each thread running
fetch_urls() creates a local Cache instance using a common
storage back end. Sharing the storage ensures that all of the feed
data is written to the same place, while creating a local Cache
instance ensures threads can fetch data in parallel.

The consumer of the queue of entries is print_entries() (lines
88-99). It takes one entry at a time from the queue and prints the
feed and entry titles. Only one thread runs print_entries(), but a
separate thread is used so that output can be produced as soon as
possible, instead of waiting for all of the fetch_urls() threads
to complete before printing the feed contents.

Running the program produces output similar to the example in Listing
3:

$ python Listing11.py http://feeds.feedburner.com/FeedcacheReleases
Saving feed data to /tmp/feedcache_example
feedcache Releases: feedcache 0.1
feedcache Releases: feedcache 0.2
feedcache Releases: feedcache 0.3
feedcache Releases: feedcache 0.4
feedcache Releases: feedcache 0.5

The difference is that it takes much less time to run the program in
Listing 11 when multiple feeds are passed on the command line, and
when some of the data has already been cached.

Future Work

The current version of feedcache meets most of the requirements
for CastSampler, but there is still room to improve it as a
general purpose tool. It would be nice if it offered finer control
over the length of time data stays in the cache, for example. And,
although shove is a completely separate project, feedcache
would be more reliable if shove’s file storage used file
locking, to prevent corruption when two threads or processes write to
the same part of the cache at the same time.

Determining how long to hold the data in a cache can be a tricky
problem. With web content such as RSS and Atom feeds, the web server
may offer hints by including explicit expiration dates or caching
instructions. HTTP headers such as Expires and Cache-Control
can include details beyond the Last-Modified and ETag values
already being handled by the Cache. If the server uses additional
cache headers, feedparser saves the associated values in the
FeedParserDict. To support the caching hints, feedcache would
need to be enhanced to understand the rules for the Cache-Control
header, and to save the expiration time as well as the time-to-live
for each feed.
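
A sketch of how the refresh deadline might be derived from those
headers (expiration_time() is hypothetical, and assumes the
response headers are available in the parsed result):

import email.utils
import time

def expiration_time(parsed_feed, default_ttl=300):
    """Return the time after which the cached feed should be refreshed."""
    headers = parsed_feed.get('headers', {})
    # A Cache-Control max-age value takes precedence over Expires.
    for part in headers.get('cache-control', '').split(','):
        part = part.strip()
        if part.startswith('max-age='):
            try:
                return time.time() + int(part[len('max-age='):])
            except ValueError:
                pass
    # Fall back to an explicit Expires date.
    expires = email.utils.parsedate_tz(headers.get('expires', ''))
    if expires:
        return email.utils.mktime_tz(expires)
    # No hints from the server; use the default time-to-live.
    return time.time() + default_ttl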

Supporting a separate time-to-live value for each feed would let
feedcache use a different refresh throttle for different
sites. Data from relatively infrequently updated feeds, such as
Slashdot, would stay in the cache longer than data from more
frequently updated feeds, such as a Twitter feed. Applications that
use feedcache in a more traditional way would be able to adjust
the update throttle for each feed separately to balance the freshness
of the data in the cache and the load placed on the server.

Conclusions

Original sources of RSS and Atom feeds are being created all the time
as new and existing applications expose data for syndication. With the
development of mash-up tools such as Yahoo! Pipes and Google’s Mashup
Editor, these feeds can be combined, filtered, and expanded in new and
interesting ways, creating even more sources of data. I hope this
article illustrates how building your own applications to read and
manipulate syndication feeds in Python with tools like feedparser and
feedcache is easy, even while including features that make your
program cooperate with servers to manage load.

I would like to offer a special thanks to Mrs. PyMOTW for her help
editing this article.

Python Magazine wish-list

Brian and I have been compiling a list of topics we would like to have
covered in the magazine. Since we’re just starting, the field is really
wide-open for anything, but sometimes it is easier to solicit articles
about specific topics instead of just saying, “Write for us!”

A few of my personal wishes:

We have had a couple of PyGTK articles submitted already, but nothing
for any of the other toolkits. Whenever I see the question “Which GUI
toolkit should I use?” there are always a lot of responses for wxWindows
and quite a few for Qt. We haven’t had any submissions for articles on
either yet, so if you use them and want to talk about it, yours might be
the first.

I’m aware of several ORM-related books in the works right now, but
that’s another area where a short article (4000 words) on a focused
aspect would be useful. Not all queries are equal (even if the result
sets are), so how about a discussion of SQL optimization with your
favorite ORM? Or how about adapting an ORM to an existing database? And
my favorite topic: How the heck am I supposed to upgrade my schema when
I make changes?

I need to write a trac plugin, but haven’t had the time to figure out
where to start. Will you write an article to show me how?

The List:

We’ll be updating this list and will eventually post it online
somewhere, but until we decide on the best way to do that, here is the
“full” wish-list we have put together for now, in no specific order. Do
not interpret the absence of a topic as lack of interest; we just
haven’t added it to the list, yet!

If you are interested in writing about these or other topics, contact
us through the web site
and let us know.

  • High Performance Computing (HPC)
  • Parallel Python (pp) module
  • PyMOL
  • VTK
  • SciPy
  • Browser
  • Django
  • Writing a django app
  • TurboGears
  • CherryPy
  • Zope
  • Writing a Zope product
  • Plone
  • Writing a plugin for trac
  • Web Services
  • XMLRPC
    • simplexmlrpcserver
    • xmlrpclib
  • SOAP
  • Flickr (Beej’s API?)
  • Google Calendar/GData (w/ ElementTree)
  • Amazon
  • Yahoo
  • System Administration
  • SNMP
  • LDAP
    • python-ldap
    • Luma (extending?)
  • User/Group management
  • GUI Frameworks
  • wxPython
  • PyQT
  • PyGTK

Python Magazine: First issue free!

The premier issue of Python Magazine is available for download
right now, completely free.

I haven’t mentioned it previously on this blog, but I’m the Technical
Editor for the magazine. That means I review and test the submitted
code, and write a monthly column. The column runs under the title “And
Now for Something Completely Different” and will focus on technical
topics (this month I talk about the GIL and 2 packages for managing
processes).

Other regular columnists include Brian Jones (Editor in Chief), Mark
Mruss (“Welcome to Python”, targeted at newer users or introductory
topics), and Steve Holden (“Random Hits”, the end-note editorial).

In addition to the regular columns, there are 4 feature articles this
month:

  1. John Berninger covers Extending Python using C, without using a
    binding generator. He’s Old School.
  2. Kevin Ryan introduces form processing in WSGI with some clever
    data-driven techniques using lambda.
  3. Sayamindu Dasgupta writes a PyGTK widget using Cairo primitives to
    draw the widget view.
  4. And I discuss a fun hack I came up with to pull iCalendar data out
    of an IMAP server.

I’m really excited about how this issue has turned out (Arbi
Arzoumani did a great job with the design and layout), and hope you
like it, too. Head over to http://pythonmagazine.com/c/issue/2007/10 and
download the PDF version. If you do like it, subscribe! If you think you
could do better, submit an idea for an article and write for us!

Besides soliciting articles from you, I’ll always be on the look-out
for good ideas to cover in my own column. If there is something you
want me to cover, email me directly (doug dot hellmann at
pythonmagazine dot com) or tag a link to a site or blog post
with pymagdifferent on del.icio.us.

Working with IMAP and iCalendar

How can you access group calendar information if your
Exchange-like mail and calendaring server does not provide
iCalendar feeds, and you do not, or cannot, use Outlook? Use
Python to extract the calendar data and generate your own feed, of
course! This article discusses a surprisingly simple program to
perform what seems like a complex series of operations: scanning
IMAP folders, extracting iCalendar attachments, and merging the
contained events together into a single calendar.

Background

I recently needed to access shared schedule information stored on an
Exchange-like mail and calendaring server. Luckily, I was able to
combine an existing third party open source library with the tools in
the Python standard library to create a command line program to
convert the calendar data into a format I could use with my desktop
client directly. The final product is called mailbox2ics. It ended up
being far shorter than I had anticipated when I started thinking about
how to accomplish my goal. The entire program is just under 140 lines
long, including command line switch handling, some error processing,
and debug statements. The output file produced can be consumed by any
scheduling client that supports the iCalendar standard.

Using Exchange, or a compatible replacement, for email and scheduling
makes sense for many businesses and organizations. The client program,
Microsoft Outlook, is usually familiar to non-technical staff members,
and therefore new hires can hit the ground running instead of being
stymied trying to figure out how to accomplish their basic, everyday
communication tasks. However, my laptop runs Mac OS X and I do not
have Outlook. Purchasing a copy of Outlook at my own expense, in
addition to inflicting further software bloat on my already crowded
computer, seemed like an unnecessarily burdensome hassle just to be
able to access schedule information.

Changing the server software was not an option. A majority of the
users already had Outlook and were accustomed to using it for their
scheduling, and I did not want to have to support a different server
platform. That left me with one option: invent a way to pull the data
out of the existing server, so I could convert it to a format that I
could use with my usual tools: Apple’s iCal and Mail.

With iCal (and many other standards-compliant calendar tools) it is
possible to subscribe to calendar data feeds. Unfortunately, the
server we were using did not have the ability to export the schedule
data in a standard format using a single file or URL. However, the
server did provide access to the calendar data via IMAP using shared
public folders. I decided to use Python to write a program to extract
the data from the server and convert it into a usable feed. The feed
would be passed to iCal, which would merge the group schedule with the
rest of my calendar information so I could see the group events
alongside my other meetings, deadlines, and reminders about when the
recycling is picked up on our street.

IMAP Basics

The calendar data was only accessible to me as attachments on email
messages accessed via an IMAP server. The messages were grouped into
several folders, with each folder representing a separate public
calendar used for a different purpose (meeting room schedules, event
planning, holiday and vacation schedules, etc.). I had read-only
access to all of the email messages in the public calendar
folders. Each email message typically had one attachment describing a
single event. To produce the merged calendar, I needed to scan several
folders, read each message in the folder, find and parse the calendar
data in the attachments, and identify the calendar events. Once I
identified the events to include in the output, I needed to add them
to an output file in a format iCal understands.

Python’s standard library includes the imaplib module for working
with IMAP servers. The IMAP4 and IMAP4_SSL classes provide a high
level interface to all of the features I needed: connecting to the
server securely, accessing mailboxes, finding messages, and
downloading them. To experiment with retrieving data from the IMAP
server, I started by establishing a secure connection to the server on
the standard port for IMAP-over-SSL, and logging in using my regular
account. This would not be a desirable way to run the final program on
a regular basis, but it works fine for development and testing.

mail_server = imaplib.IMAP4_SSL(hostname)
mail_server.login(username, password)

It is also possible to use IMAP over a non-standard port. In that
case, the caller can pass port as an additional option to
imaplib.IMAP4_SSL(). To work with an IMAP server without SSL
encryption, you can use the IMAP4 class, but using SSL is
definitely preferred.

mail_server = imaplib.IMAP4_SSL(hostname, port)
mail_server.login(username, password)

The connection to the IMAP server is “stateful”. The client remembers
which methods have been called on it, and changes its internal state
to reflect those calls. The internal state is used to detect logical
errors in the sequence of method calls without requiring a round-trip
to the server.
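
For example, searching before a mailbox has been selected is rejected
by the client itself. This small demonstration assumes a reachable
server at imap.example.com and valid credentials.

import imaplib

mail_server = imaplib.IMAP4_SSL('imap.example.com')
mail_server.login('username', 'password')
try:
    # No mailbox has been selected, so SEARCH is not legal yet.
    mail_server.search(None, 'ALL')
except imaplib.IMAP4.error, err:
    print 'Caught locally, without a round-trip:', err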

On an IMAP server, messages are organized into “mailboxes”. Each
mailbox has a name and, since mailboxes might be nested, the full name
of the mailbox is the path to that mailbox. Mailbox paths work just
like paths to directories or folders in a filesystem. The paths are
single strings, with levels usually separated by a forward slash
(/) or period (.). The actual separator value used depends on
the configuration of your IMAP server; one of my servers uses a slash,
while the other uses a period. If you do not already know how your
server is set up, you will need to experiment to determine the correct
values for folder names.
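
One way to experiment is to ask the server for its mailbox list; each
line of the response from list() includes the hierarchy delimiter the
server uses.

typ, mailbox_list = mail_server.list()
for line in mailbox_list:
    # Each entry looks something like:
    # (\HasNoChildren) "." "INBOX.Sent"
    print line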

Once I had my client connected to the server, the next step was to
call select() to set the mailbox context to be used when searching
for and downloading messages.

mail_server.select('Public Folders/EventCalendar')
# or
mail_server.select('Public Folders.EventCalendar')

After a mailbox is selected, it is possible to retrieve messages from
the mailbox using search(). The IMAP method search() supports
filtering to identify only the messages you need. You can search for
messages based on the content of the message headers, with the rules
evaluated in the server instead of your client, thus reducing the
amount of information the server has to transmit to the client. Refer
to RFC 3501 (“Internet Message Access Protocol”) for details about the
types of queries which can be performed and the syntax for passing the
query arguments.
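
For instance, to consider only messages received after a given date,
the query can be expressed in the RFC 3501 search syntax and evaluated
entirely on the server:

# IMAP search dates use the DD-Mon-YYYY format.
(typ, [message_ids]) = mail_server.search(None, '(SINCE "01-Oct-2007")')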

In order to implement mailbox2ics, I needed to look at all of the
messages in every mailbox the user named on the command line, so I
simply used the filter “ALL” with each mailbox. The return value
from search() includes a response code and a string with the
message numbers separated by spaces. A separate call is required to
retrieve more details about an individual message, such as the headers
or body.

(typ, [message_ids]) = mail_server.search(None, 'ALL')
message_ids = message_ids.split()

Individual messages are retrieved via fetch(). If only part of the
message is desired (size, envelope, body), that part can be fetched to
limit bandwidth. I could not predict which subset of the message body
might include the attachments I wanted, so it was simplest for me to
download the entire message. Calling fetch() with the '(RFC822)'
part specifier returns a string containing the MIME-encoded version
of the message with all headers intact.

typ, message_parts = mail_server.fetch(
    message_ids[0], '(RFC822)')
message_body = message_parts[0][1]
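
When the full body is not needed, a partial fetch keeps the transfer
small. For example, asking for only the size of the message (the
exact shape of the response varies with the fetch items requested):

typ, response = mail_server.fetch(message_ids[0], '(RFC822.SIZE)')
print response[0]  # e.g. '1 (RFC822.SIZE 2543)'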

Once the message body had been downloaded, the next step was to parse
it to find the attachments with calendar data. Beginning with version
2.2.3, the Python standard library has included the email package
for working with standards-compliant email messages. There is a
straightforward factory for converting message text to Message
objects. To parse the text representation of an email and create a
Message instance from it, use email.message_from_string().

msg = email.message_from_string(message_body)

Message objects are almost always made up of multiple parts. The parts
of the message are organized in a tree structure, with message
attachments supporting nested attachments. Subparts or attachments can
even include entire email messages, such as when you forward a message
which already contains an attachment to someone else. To iterate over
all of the parts of the Message tree recursively, use the walk()
method.

for part in msg.walk():
    print part.get_content_type()

Having access to the email package saved an enormous amount of time on
this project. Parsing multi-part email messages reliably is tricky,
even with (or perhaps because of) the many standards involved. With
the email package, in just a few lines of Python, you can parse and
traverse all of the parts of even the most complex standard-compliant
multi-part email message, giving you access to the type and content of
each part.

Accessing Calendar Data

The “Internet Calendaring and Scheduling Core Object Specification”,
or iCalendar, is defined in RFC 2445. iCalendar is a data format
for sharing scheduling and other date-oriented information. One
typical way to receive an iCalendar event notification, such as an
invitation to a meeting, is via an email attachment. Most standard
calendaring tools, such as iCal and Outlook, generate these email
messages when you initially “invite” another participant to a meeting,
or update an existing meeting description. The iCalendar standard says
the file should have filename extension ICS and mime-type
text/calendar. The input data for mailbox2ics came from email
attachments of this type.

The iCalendar format is text-based. A simple example of an ICS file
with a single event is provided in Listing 1. Calendar events have
properties to indicate who was invited to an event, who originated it,
where and when it will be held, and all of the other expected bits of
information important for a scheduled event. Each property of the
event is encoded on its own line, with long values wrapped onto
multiple lines in a well-defined way to allow the original content to
be reconstructed by a client receiving the iCalendar representation of
the data. Some properties also can be repeated, to handle cases such
as meetings with multiple invitees.

Listing 1

BEGIN:VCALENDAR
CALSCALE:GREGORIAN
PRODID:-//Big Calendar Corp//Server Version X.Y.Z//EN
VERSION:2.0
METHOD:PUBLISH
BEGIN:VEVENT
UID:20379258.1177945519186.JavaMail.root@imap.example.com
LAST-MODIFIED:20070519T000650Z
DTSTAMP:20070519T000650Z
DTSTART;VALUE=DATE:20070508
DTEND;VALUE=DATE:20070509
PRIORITY:5
TRANSP:OPAQUE
SEQUENCE:0
SUMMARY:Day off
LOCATION:
CLASS:PUBLIC
END:VEVENT
END:VCALENDAR

In addition to having a variety of single or multi-value properties,
calendar elements can be nested, much like email messages with
attachments. An ICS file is made up of a VCALENDAR component,
which usually includes one or more VEVENT components. A
VCALENDAR might also include VTODO components (for tasks on a
to-do list). A VEVENT may contain a VALARM, which specifies
the time and means by which the user should be reminded of the event.
The complete description of the iCalendar format, including valid
component types and property names, and the types of values which are
legal for each property, is available in the RFC.
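
As a small hand-written illustration (not taken from the real input
data), a VALARM nested inside a VEVENT looks like this:

BEGIN:VEVENT
SUMMARY:Project review
DTSTART:20071015T140000Z
DTEND:20071015T150000Z
BEGIN:VALARM
ACTION:DISPLAY
DESCRIPTION:Reminder
TRIGGER:-PT15M
END:VALARM
END:VEVENT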

This sounds complex, but luckily, I did not have to worry about
parsing the ICS data at all. Instead of doing the work myself, I took
advantage of an open source Python library for working with iCalendar
data released by Max M. (maxm@mxm.dk). His iCalendar library
(available from codespeak.net) makes parsing ICS data sources very
simple. The API for the library was designed based on the email
package discussed previously, so working with Calendar instances and
email.Message instances is similar. Use the class method
Calendar.from_string() to parse the text representation of the
calendar data to create a Calendar instance populated with all of the
properties and subcomponents described in the input data.

from icalendar import Calendar, Event
cal_data = Calendar.from_string(open('sample.ics', 'rb').read())

Once you have instantiated the Calendar object, there are two
different ways to iterate through its components: via the walk()
method or subcomponents attribute. Using walk() will traverse
the entire tree and let you process each component in the tree
individually. Accessing the subcomponents list directly lets you
work with a larger portion of the calendar data tree at one time.
Properties of an individual component, such as the summary or start
date, are accessed via the __getitem__() API, just as with a
standard Python dictionary. The property names are not case sensitive.
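
As a quick illustration of the first approach, walk() visits every
component in the tree, however deeply nested:

for component in cal_data.walk():
    # Prints component type names such as VCALENDAR and VEVENT.
    print component.name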

For example, to print the “SUMMARY” field values from all top level
events in a calendar, you would first iterate over the subcomponents,
then check the name attribute to determine the component type. If
the type is VEVENT, then the summary can be accessed and printed.

for event in cal_data.subcomponents:
    if event.name == 'VEVENT':
        print 'EVENT:', event['SUMMARY']

While most of the ICS attachments in my input data would be made up of
one VCALENDAR component with one VEVENT subcomponent, I did
not want to require this limitation. The calendars are writable by
anyone in the organization, so while it was unlikely that anyone would
have added a VTODO or VJOURNAL to public data, I could not
count on it. Checking for VEVENT as I scanned each component let
me ignore components with types that I did not want to include in the
output.

Writing ICS data to a file is as simple as reading it, and only takes
a few lines of code. The Calendar class handles the difficult tasks of
encoding and formatting the data as needed to produce a fully
formatted ICS representation, so I only needed to write the formatted
text to a file.

ics_output = open('output.ics', 'wb')
try:
    ics_output.write(str(cal_data))
finally:
    ics_output.close()

Finding Max M’s iCalendar library saved me a lot of time and effort,
and demonstrates clearly the value of Python and open source in
general. The API is concise and, since it is patterned after another
library I was already using, the idioms were familiar. I had not
embarked on this project eager to write parsers for the input data, so
I was glad to have libraries available to do that part of the work for
me.

Putting It All Together

At this point, I had enough pieces to build a program to do what I
needed. I could read the email messages from the server via IMAP,
parse each message, and then search through its attachments to find
the ICS attachments. Once I had the attachments, I could parse them
and produce another ICS file to be imported into my calendar client.
All that remained was to tie the pieces together and give it a user
interface. The source for the resulting program, mailbox2ics.py,
is provided in Listing 2.

Listing 2

#!/usr/bin/env python
# mailbox2ics.py

"""Convert the contents of an imap mailbox to an ICS file.

This program scans an IMAP mailbox, reads in any messages with ICS
files attached, and merges them into a single ICS file as output.
"""

# Import system modules
import imaplib
import email
import getpass
import optparse
import sys

# Import Local modules
from icalendar import Calendar, Event

# Module

def main():
    # Set up our options
    option_parser = optparse.OptionParser(
        usage='usage: %prog [options] hostname username mailbox [mailbox...]'
        )
    option_parser.add_option('-p', '--password', dest='password',
                             default='',
                             help='Password for username',
                             )
    option_parser.add_option('--port', dest='port',
                             help='Port for IMAP server',
                             type="int",
                             )
    option_parser.add_option('-v', '--verbose', 
                             dest="verbose", 
                             action="store_true", 
                             default=True,
                             help='Show progress',
                             )
    option_parser.add_option('-q', '--quiet', 
                             dest="verbose", 
                             action="store_false", 
                             help='Do not show progress',
                             )
    option_parser.add_option('-o', '--output', dest="output",
                             help="Output file",
                             default=None,
                             )

    (options, args) = option_parser.parse_args()
    if len(args) < 3:
        option_parser.print_help()
        print >>sys.stderr, '\nERROR: Please specify a hostname, username, and mailbox.'
        return 1
    hostname = args[0]
    username = args[1]
    mailboxes = args[2:]

    # Make sure we have the credentials to login to the IMAP server.
    password = options.password or getpass.getpass(stream=sys.stderr)

    # Initialize a calendar to hold the merged data
    merged_calendar = Calendar()
    merged_calendar.add('prodid', '-//mailbox2ics//doughellmann.com//')
    merged_calendar.add('calscale', 'GREGORIAN')

    if options.verbose:
        print >>sys.stderr, 'Logging in to "%s" as %s' % (hostname, username)

    # Connect to the mail server
    if options.port is not None:
        mail_server = imaplib.IMAP4_SSL(hostname, options.port)
    else:
        mail_server = imaplib.IMAP4_SSL(hostname)
    (typ, [login_response]) = mail_server.login(username, password)
    try:
        # Process the mailboxes
        for mailbox in mailboxes:
            if options.verbose: print >>sys.stderr, 'Scanning %s ...' % mailbox
            (typ, [num_messages]) = mail_server.select(mailbox)
            if typ == 'NO':
                raise RuntimeError('Could not find mailbox %s: %s' % 
                                   (mailbox, num_messages))
            num_messages = int(num_messages)
            if not num_messages:
                if options.verbose: print >>sys.stderr, '  empty'
                continue

            # Find all messages
            (typ, [message_ids]) = mail_server.search(None, 'ALL')
            for num in message_ids.split():

                # Get a Message object
                typ, message_parts = mail_server.fetch(num, '(RFC822)')
                msg = email.message_from_string(message_parts[0][1])

                # Look for calendar attachments
                for part in msg.walk():
                    if part.get_content_type() == 'text/calendar':
                        # Parse the calendar attachment
                        ics_text = part.get_payload(decode=1)
                        importing = Calendar.from_string(ics_text)

                        # Add events from the calendar to our merge calendar
                        for event in importing.subcomponents:
                            if event.name != 'VEVENT':
                                continue
                            if options.verbose: 
                                print >>sys.stderr, 'Found: %s' % event['SUMMARY']
                            merged_calendar.add_component(event)
    finally:
        # Disconnect from the IMAP server
        if mail_server.state != 'AUTH':
            mail_server.close()
        mail_server.logout()

    # Dump the merged calendar to our output destination
    if options.output:
        output = open(options.output, 'wt')
        try:
            output.write(str(merged_calendar))
        finally:
            output.close()
    else:
        print str(merged_calendar)
    return 0

if __name__ == '__main__':
    try:
        exit_code = main()
    except Exception, err:
        print >>sys.stderr, 'ERROR: %s' % str(err)
        exit_code = 1
    sys.exit(exit_code)

Since I wanted to set up the export job to run on a regular basis via
cron, I chose a command line interface. The main() function for
mailbox2ics.py starts out with the usual sort of
configuration for command line option processing via the optparse
module. Listing 3 shows the help output produced when the program is
run with the -h option.

Listing 3

Usage: mailbox2ics.py [options] hostname username mailbox [mailbox...]

Options:
  -h, --help            show this help message and exit
  -p PASSWORD, --password=PASSWORD
                        Password for username
  --port=PORT           Port for IMAP server
  -v, --verbose         Show progress
  -q, --quiet           Do not show progress
  -o OUTPUT, --output=OUTPUT
                        Output file

The --password option can be used to specify the IMAP account
password on the command line, but if you choose to use it consider the
security implications of embedding a password in the command line for
a cron task or shell script. No matter how you specify the password, I
recommend creating a separate mailbox2ics account on the IMAP server
and limiting the rights it has so no data can be created or deleted
and only public folders can be accessed. If --password is not
specified on the command line, the user is prompted for a password
when they run the program. While less useful with cron, providing the
password interactively can be a solution if you are unable, or not
allowed, to create a separate restricted account on the IMAP server.
The account name used to connect to the server is required on the
command line.

There is also a separate option for writing the ICS output data to a
file. The default is to print the sequence of events to standard
output in ICS format. Though it is easy enough to redirect standard
output to a file, the -o option can be useful if you are using the
-v option to enable verbose progress tracking and debugging.

The program uses a separate Calendar instance, merged_calendar, to
hold all of the ICS information to be included in the output. All of
the VEVENT components from the input are copied to merged_calendar
in memory, and the entire calendar is written to the output location
at the end of the program. After initialization,
merged_calendar is configured with some basic properties. PRODID
is required and specifies the name of the product which produced the
ICS file. CALSCALE defines the date system, or scale, used for the
calendar.

After setting up merged_calendar, mailbox2ics connects to the IMAP
server. It tests whether the user has specified a network port using
--port and only passes a port number to imaplib if the user
includes the option. The optparse library converts the option value to
an integer based on the option configuration, so options.port is
either an integer or None.

The names of all mailboxes to be scanned are passed as arguments to
mailbox2ics on the command line after the rest of the option
switches. Each mailbox name is processed one at a time, in the for
loop starting on line 79. After calling select() to change the
IMAP context, the message ids of all of the messages in the mailbox
are retrieved via a call to search(). The full content of each
message in the mailbox is fetched in turn, and parsed with
email.message_from_string(). Once the message has been parsed, the
msg variable refers to an instance of email.Message.

Each message may have multiple parts containing different MIME
encodings of the same data, as well as any additional message
information or attachments included in the email which generated the
event. For event notification messages, there is typically at least
one human-readable representation of the event and frequently both
HTML and plain text are included. Of course, the message also includes
the actual ICS file, as well. For my purposes, only the ICS
attachments were important, but there is no way to predict where they
will appear in the sequence of attachments on the email message. To
find the ICS attachments, mailbox2ics walks through all of the parts
of the message recursively looking for attachments with mime-type
text/calendar (as specified in the iCalendar standard) and
ignoring everything else. Attachment names are ignored, since
mime-type is a more reliable way to identify the calendar data
accurately.

for part in msg.walk():
    if part.get_content_type() == 'text/calendar':
        # Parse the calendar attachment
        ics_text = part.get_payload(decode=1)
        importing = Calendar.from_string(ics_text)

When it finds an ICS attachment, mailbox2ics parses the text of the
attachment to create a new Calendar instance, then copies the
VEVENT components from the parsed Calendar to merged_calendar.
The events do not need to be sorted into any particular order when
they are added to merged_calendar, since the client reading the
ICS file will filter and reorder them as necessary to display them
on screen. It was important to take the entire event, including any
subcomponents, to ensure that all alarms are included. Instead of
traversing the entire calendar and accessing each component
individually, I simply iterated over the subcomponents of the
top-level VCALENDAR node. Most of the ICS files only included one
VEVENT anyway, but I did not want to miss anything important if
that ever turned out not to be the case.

for event in importing.subcomponents:
    if event.name != 'VEVENT':
        continue
    merged_calendar.add_component(event)

Once all of the mailboxes, messages, and calendars are processed, the
merged_calendar refers to a Calendar instance containing all of
the events discovered. The last step in the process is for
mailbox2ics to create the output. The event data is
formatted using str(merged_calendar), just as in the example
above, and written to the output destination selected by the user
(standard output or file).

Example

Listing 4 includes sample output from running mailbox2ics to merge two
calendars for a couple of telecommuting workers, Alice and Bob. Both
Alice and Bob have placed their calendars online at imap.example.com.
In the output of mailbox2ics, you can see that Alice has 2 events in
her calendar indicating the days when she will be in the office. Bob
has one event for the day he has a meeting scheduled with Alice.

Listing 4

$ mailbox2ics.py -o group_schedule.ics imap.example.com mailbox2ics  "Calendars.Alice" "Calendars.Bob"
Password: 
Logging in to "imap.example.com" as mailbox2ics
Scanning Calendars.Alice ...
Found: In the office to work with Bob on project proposal
Found: In the office
Scanning Calendars.Bob ...
Found: In the office to work with Alice on project proposal

The output file created by mailbox2ics containing the merged calendar
data from Alice and Bob’s calendars is shown in Listing 5. You can see
that it includes all 3 events as VEVENT components nested inside a
single VCALENDAR. There were no alarms or other types of
components in the input data.

Listing 5

BEGIN:VCALENDAR
CALSCALE:GREGORIAN
PRODID:-//mailbox2ics//doughellmann.com//
BEGIN:VEVENT
CLASS:PUBLIC
DTEND;VALUE=DATE:20070704
DTSTAMP:20070705T180246Z
DTSTART;VALUE=DATE:20070703
LAST-MODIFIED:20070705T180246Z
LOCATION:
PRIORITY:5
SEQUENCE:0
SUMMARY:In the office to work with Bob on project proposal
TRANSP:TRANSPARENT
UID:9628812.1182888943029.JavaMail.root@imap.example.com
END:VEVENT
BEGIN:VEVENT
CLASS:PUBLIC
DTEND;VALUE=DATE:20070627
DTSTAMP:20070625T154856Z
DTSTART;VALUE=DATE:20070626
LAST-MODIFIED:20070625T154856Z
LOCATION:Atlanta
PRIORITY:5
SEQUENCE:0
SUMMARY:In the office
TRANSP:TRANSPARENT
UID:11588018.1182542267385.JavaMail.root@imap.example.com
END:VEVENT
BEGIN:VEVENT
CLASS:PUBLIC
DTEND;VALUE=DATE:20070704
DTSTAMP:20070705T180246Z
DTSTART;VALUE=DATE:20070703
LAST-MODIFIED:20070705T180246Z
LOCATION:
PRIORITY:5
SEQUENCE:0
SUMMARY:In the office to work with Alice on project proposal
TRANSP:TRANSPARENT
UID:9628812.1182888943029.JavaMail.root@imap.example.com
END:VEVENT
END:VCALENDAR

Mailbox2ics In Production

To solve my original problem of merging the events into a sharable
calendar to which I could subscribe in iCal, I scheduled mailbox2ics
to run regularly via cron. With some experimentation, I found that
running it every 10 minutes caught most of the updates quickly enough
for my needs. The program runs locally on a web server which has
access to the IMAP server. For better security, it connects to the
IMAP server as a user with restricted permissions. The ICS output
file produced is written to a directory accessible to the web server
software. This lets me serve the ICS file as static content on the web
server to multiple subscribers. Access to the file through the web is
protected by a password, to prevent unauthorized access.
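
The cron entry for a setup like this is short. The paths and mailbox
name below are hypothetical:

# Refresh the shared calendar every 10 minutes.
*/10 * * * * /usr/local/bin/mailbox2ics.py -q -o /var/www/calendars/group.ics imap.example.com mailbox2ics "Public Folders.EventCalendar"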

Thoughts About Future Enhancements

Mailbox2ics does everything I need it to do, for now. There are a few
obvious areas where it could be enhanced to make it more generally
useful to other users with different needs, though. Input and output
filtering for events could be added. Incremental update support would
help it scale to manage larger calendars. Handling non-event data in
the calendar could also prove useful. And using a configuration file
to hold the IMAP password would be more secure than passing it on the
command line.

At the time of this writing, mailbox2ics does not offer any way to
filter the input or output data other than by controlling which
mailboxes are scanned. Adding finer-grained filtering support could
be useful. The input data could be filtered at two different points,
based on IMAP rules or the content of the calendar entries themselves.

IMAP filter rules (based on sender, recipient, subject line, message
contents, or other headers) would use the capabilities of
IMAP4.search() and the IMAP server without much effort on my part.
All that would be needed are a few command line options to pass the
filtering rules, or code to read a configuration file. The only
difference in the processing by mailbox2ics would be to convert the
input rules to the syntax understood by the IMAP server and pass them
to search().

Filtering based on VEVENT properties would require a little more
work. The event data must be downloaded and checked locally, since the
IMAP server will not look inside the attachments to check the
contents. Filtering using date ranges for the event start or stop date
could be very useful, and not hard to implement. The Calendar class
already converts dates to datetime instances. The datetime
package makes it easy to test dates against rules such as “events in
the next 7 days” or “events since Jan 1, 2007”.
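
A filter like that might look something like this sketch (the
function name is my own, and the start value is assumed to already be
a datetime):

import datetime

def starts_within_week(event_start):
    # True for events beginning in the next 7 days.
    now = datetime.datetime.now()
    return now <= event_start <= now + datetime.timedelta(days=7)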

Another simple addition would be pattern matching against other
property values such as the event summary, organizer, location, or
attendees. The patterns could be regular expressions, or a simpler
syntax such as globbing. The event properties, when present in the
input, are readily available through the __getitem__() API of the
Calendar instance and it would be simple to compare them against the
pattern(s).
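
For example, a glob-style summary filter could be built on the
standard library's fnmatch module; summary_matches is a hypothetical
helper, not part of the iCalendar library:

import fnmatch

def summary_matches(event, pattern):
    # Compare the event summary against a glob pattern, e.g. '*office*'.
    summary = str(event.get('SUMMARY', ''))
    return fnmatch.fnmatch(summary, pattern)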

If a large amount of data is involved, either spread across several
calendars or because there are a lot of events, it might also be
useful to be able to update an existing cached file, rather than
building the whole ICS file from scratch each time. Looking only at
unread messages in the folder, for example, would let mailbox2ics skip
downloading old events that are no longer relevant or already appear
in the local ICS file. It could then initialize merged_calendar by
reading from the local file before updating it with new events and
re-writing the file. Caching some of the results in this way would
place less load on the IMAP server, so the export could easily be run
more frequently than once every 10 minutes.
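
The change to the message selection would be small, since IMAP
already defines an UNSEEN search criterion:

# Only messages not yet marked as read.
(typ, [message_ids]) = mail_server.search(None, 'UNSEEN')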

In addition to filtering to reduce the information included in the
output, it might also prove useful to add extra information by
including component types other than VEVENT. For example,
including VTODO would allow users to include a group action list
in the group calendar. Most scheduling clients support filtering the
to-do items and alarms out of calendars to which you subscribe, so if
the values are included in a feed, individual users can always ignore
the ones they choose.

As mentioned earlier, using the –password option to provide the
password to the IMAP server is convenient, but not secure. For
example, on some systems it is possible to see the arguments to
programs using ps. This allows any user on the system to watch for
mailbox2ics to run and observe the password used. A more secure way to
provide the password is through a configuration file. The file can
have filesystem permissions set so that only the owner can access
it. It could also, potentially, be encrypted, though that might be
overkill for this type of program. It should not be necessary to run
mailbox2ics on a server where there is a high risk that the password
file might be exposed.
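
A configuration file reader for this could be only a few lines using
the standard library's ConfigParser module; the file name and section
layout here are hypothetical:

import os
import ConfigParser

def read_password(path='~/.mailbox2ics.ini'):
    # Expects a file containing a section such as:
    #   [imap]
    #   password = secret
    parser = ConfigParser.SafeConfigParser()
    parser.read([os.path.expanduser(path)])
    return parser.get('imap', 'password')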

Conclusion

Mailbox2ics was a fun project that took me just a few hours over a
weekend to implement and test. This project illustrates two reasons
why I enjoy developing with Python. First, difficult tasks are made
easier through the power of the “batteries included” nature of
Python’s standard distribution. And second, coupling Python with the
wide array of other open source libraries available lets you get the
job done, even when the Python standard library lacks the exact tool
you need. Using the ICS file produced by mailbox2ics, I am now able to
access the calendar data I need using my familiar tools, even though
iCalendar is not supported directly by the group’s calendar server.

Originally published in Python Magazine Volume 1, Issue 10, October 2007

Multi-processing techniques in Python

Originally published in Python Magazine Volume 1, Issue 10, October
2007

Has your multi-threaded application grown GILs? Take a look at these
packages for easy-to-use process management and inter-process
communication tools.

There is no predefined theme for this column, so I plan to cover a
different, likely unrelated, subject every month. The topics will
range anywhere from open source packages in the Python Package Index
(formerly The Cheese Shop, now PyPI) to new developments from around
the Python community, and anything that looks interesting in
between. If there is something you would like for me to cover, send a
note with the details to doug dot hellmann at
pythonmagazine dot com and let me know, or add the link to
your del.icio.us account with the tag “pymagdifferent”.

I will make one stipulation for my own sake: any open source libraries
must be registered with PyPI and configured so that I can install them
with distutils. Creating a login at PyPI and registering your
project is easy, and only takes a few minutes. Go on, you know you
want to.

Scaling Python: Threads vs. Processes

In the ongoing discussion of performance and scaling issues with
Python, one persistent theme is the Global Interpreter Lock
(GIL). While the GIL has the advantage of simplifying the
implementation of CPython internals and extension modules, it prevents
users from achieving true multi-threaded parallelism by limiting the
interpreter to executing byte-codes in one thread at a time on a
single processor. Threads which block on I/O or use extension modules
written in another language can release the GIL to allow other threads
to take over control, of course. But if my application is written
entirely in Python, only a limited number of statements will be
executed before one thread is suspended and another is started.

Eliminating the GIL has been on the wish lists of many Python
developers for a long time – I have been working with Python since
1998 and it was a hotly debated topic even then. Around that time,
Greg Stein produced a set of patches for Python 1.5 that eliminated
the GIL entirely, replacing it with a whole set of individual locks
for the mutable data structures (dictionaries, lists, etc.) that had
been protected by the GIL. The result was an interpreter that ran at
roughly half the normal speed, a side-effect of acquiring and
releasing the individual locks used to replace the GIL.

The GIL issue is unique to the C implementation of the
interpreter. The Java implementation of Python, Jython, supports true
threading by taking advantage of the underlying JVM. The IronPython
port, running on Microsoft’s CLR, also has better threading. On the
other hand, those platforms are always playing catch-up with new
language or library features, so if you’re hot to use the latest and
greatest, like I am, the C reference-implementation is still your best
option.

Dropping the GIL from the C implementation remains a low priority for
a variety of reasons. The scope of the changes involved is beyond the
level of anything the current developers are interested in
tackling. Recently, Guido has said he would entertain patches
contributed by the Python community to remove the GIL, as long as
performance of single-threaded applications was not adversely
affected. As far as I know, no one has announced any plans to do so.

Even though there is a FAQ entry on the subject as part of the
standard documentation set for Python, from time to time a request
pops up on comp.lang.python or one of the Python-related mailing lists
to rewrite the interpreter so the lock can be removed. Each time it
happens, the answer is clear: use processes instead of threads.

That response does have some merit. Extension modules become more
complicated without the safety of the GIL. Processes typically have
fewer inherent deadlocking issues than threads. They can be
distributed between the CPUs on a host, and even more importantly, an
application that uses multiple processes is not limited by the size of
a single server, as a multi-threaded application would be.

Since the GIL will still be present in Python 3.0, it seems unlikely
that it will be removed any time soon. This may
disappoint some people, but it is not the end of the world. There are,
after all, strategies for working with multiple processes to scale
large applications. I’m not talking about the well worn, established
techniques from the last millennium that use a different collection of
tools on every platform, nor the time-consuming and error-prone
practices that lead to solving the same problem time and
again. Techniques using low-level, operating system-specific,
libraries for process management are as passé as using compiled
languages for CGI programming. I don’t have time for this low-level
stuff any more, and neither do you. Let’s look at some modern
alternatives.

The subprocess module

Version 2.4 of Python introduced the subprocess module and finally
unified the disparate process management interfaces available in other
standard library packages to provide cross-platform support for
creating new processes. While subprocess solved some of my process
creation problems, it still primarily relies on pipes for inter-process
communication. Pipes are workable, but fairly low-level as far as
communication channels go, and using them for two-way message passing
while avoiding I/O deadlocks can be tricky (don’t forget to flush()).
Passing data through pipes is definitely not as transparent to the
application developer as sharing objects natively between threads.
And pipes don’t help when the processes need to scale beyond a single
server.
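
As a small example of the pipe-based style, using communicate()
rather than a manual read/write loop sidesteps most deadlock
problems; this sketch assumes a Unix-like system with tr available:

import subprocess

proc = subprocess.Popen(['tr', 'a-z', 'A-Z'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
# communicate() writes the input, closes stdin, and reads all output,
# avoiding the deadlocks a manual read/write loop can cause.
output, _ = proc.communicate('hello from the parent\n')
print output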

Parallel Python

Vitalii Vanovschi’s Parallel Python package (pp) is a more complete
distributed processing package that takes a centralized approach.
Jobs are managed from a “job server”, and pushed out to individual
processing “nodes”.

Those worker nodes are separate processes, and can be running on the
same server or other servers accessible over the network. And when I
say that pp pushes jobs out to the processing nodes, I mean just that
– the code and data are both distributed from the central server to
the remote worker node when the job starts. I don’t even have to
install my application code on each machine that will run the jobs.

Here’s an example, taken right from the Parallel Python Quick Start
guide:

import pp
job_server = pp.Server()
# Start tasks
f1 = job_server.submit(func1, args1, depfuncs1,
    modules1)
f2 = job_server.submit(func1, args2, depfuncs1,
    modules1)
f3 = job_server.submit(func2, args3, depfuncs2,
    modules2)
# Retrieve the results
r1 = f1()
r2 = f2()
r3 = f3()

When the pp job server starts, it detects the number of CPUs in the
system and starts one worker process per CPU automatically, allowing
me to take full
advantage of the computing resources available. Jobs are started
asynchronously, and run in parallel on an available node. The callable
object returned when the job is submitted blocks until the response is
ready, so response sets can be computed asynchronously, then merged
synchronously. Load distribution is transparent, making pp excellent
for clustered environments.
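
The Quick Start fragment assumes func1, func2, and their arguments
are defined elsewhere. Here is a self-contained variant (my own
sketch, not from the pp documentation) showing the same
submit-then-collect pattern:

import pp

def square(x):
    return x * x

job_server = pp.Server()  # detects the number of CPUs automatically
jobs = [job_server.submit(square, (i,)) for i in range(10)]
# Calling a job blocks until its result is ready.
print [job() for job in jobs]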

One drawback to using pp is that I have to do a little more work up
front to identify the functions and modules on which each job depends,
so all of the code can be sent to the processing node. That’s easy (or
at least straightforward) when all of the jobs are identical, or use a
consistent set of libraries. If I don’t know everything about the job
in advance, though, I’m stuck. It would be nice if pp could
automatically detect dependencies at runtime. Maybe it will, in a
future version.

The processing Package

Parallel Python is impressive, but it is not the only option for
managing parallel jobs. The processing package from Richard Oudkerk
aims to solve the issues of creating and communicating with multiple
processes in a portable, Pythonic way. Whereas Parallel Python is
designed around a “push” style distribution model, the processing
package is set up to make it easy to create producer/consumer style
systems where worker processes pull jobs from a queue.

The package hides most of the details of selecting an appropriate
communication technique for the platform by choosing reasonable
default behaviors at runtime. The API does include a way to explicitly
select the communication mechanism, in case I need that level of
control to meet specific performance or compatibility requirements.
As a result, I end up with the best of both worlds: usable default
settings that I can tweak later to improve performance.

To make life even easier, the processing.Process class was purposely
designed to match the threading.Thread class API. Since the processing
package is almost a drop-in replacement for the standard library’s
threading module, many of my existing multi-threaded applications can
be converted to use processes simply by changing a few import
statements. That’s the sort of upgrade path I like.

Listing 1 contains a simple example, based on the examples found in
the processing documentation, which passes a string value between
processes as an argument to the Process instance and shows the
similarity between processing and threading. How much easier could it
be?

Listing 1

#!/usr/bin/env python
# Simple processing example

import os
from processing import Process, currentProcess

def f(name):
    print 'Hello,', name, currentProcess()

if __name__ == '__main__':
    print 'Parent process:', currentProcess()
    p = Process(target=f, args=[os.environ.get('USER', 'Unknown user')])
    p.start()
    p.join()

In a few cases, I’ll have more work to do to convert existing code
that was sharing objects which cannot easily be passed from one
process to another (file or database handles, etc.). Occasionally, a
performance-sensitive application needs more control over the
communication channel. In these situations, I might still have to get
my hands dirty with the lower-level APIs in the processing.connection
module. When that time comes, they are all exposed and ready to be
used directly.

Sharing State and Passing Data

For basic state handling, the processing package lets me share data
between processes by using shared objects, similar to the way I might
with threads. There are two types of “managers” for passing objects
between processes. The LocalManager uses shared memory, but the types
of objects that can be shared are limited by a low-level interface
which constrains the data types and sizes. LocalManager is
interesting, but it’s not what has me excited. The SyncManager is the
real story.

SyncManager implements tools for synchronizing inter-process
communication in the style of threaded programming. Locks, semaphores,
condition variables, and events are all there. Special implementations
of Queue, dict, and list that can be used between processes safely are
included as well (Listing 2). Since I’m already comfortable with these
APIs, there is almost no learning curve for converting to the versions
provided by the processing module.

Listing 2

#!/usr/bin/env python
# Pass an object through a queue to another process.

from processing import Process, Queue, currentProcess

class Example:
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return '%s (%s)' % (self.name, currentProcess())


def f(q):
    print 'In child:', q.get()


if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=[q])
    p.start()
    o = Example('tester')
    print 'In parent:', o
    q.put(o)
    p.join()

For basic state sharing with SyncManager, using a Namespace is about
as simple as I could hope. A namespace can hold arbitrary attributes,
and any attribute attached to a namespace instance is available in all
client processes which have a proxy for that namespace. That’s
extremely useful for sharing status information, especially since I
don’t have to decide up front what information to share or how big the
values can be. Any process can change existing values or add new
values to the namespace, as illustrated in Listing 3. Changes to the
contents of the namespace are reflected in the other processes the
next time the values are accessed.

Listing 3

#!/usr/bin/env python
# Using a shared namespace.

import processing

def f(ns):
    print ns
    ns.old_coords = (ns.x, ns.y)
    ns.x += 10
    ns.y += 10

if __name__ == '__main__':
    # Initialize the namespace
    manager = processing.Manager()
    ns = manager.Namespace()
    ns.x = 10
    ns.y = 20

    # Use the namespace in another process
    p = processing.Process(target=f, args=(ns,))
    p.start()
    p.join()

    # Show the resulting changes in this process
    print ns

Remote Servers

Configuring a SyncManager to listen on a network socket gives me even
more interesting options. I can start processes on separate hosts, and
they can share data using all of the same high-level mechanisms
described above. Once they are connected, there is no difference in
the way the client programs use the shared resources remotely or
locally.

The objects are passed between client and server using pickles, which
introduces a security hole: because unpacking a pickle may cause code
to be executed, it is risky to trust pickles from an unknown
source. To mitigate this risk, all communication in the processing
package can be secured with digest authentication using the hmac
module from the standard library. Callers can pass authentication keys
to the manager explicitly, but default values are generated if no key
is given. Once the connection is established, the authentication and
digest calculation are handled transparently for me.

Conclusion

The GIL is a fact of life for Python programmers, and we need to
consider it along with all of the other factors that go into planning
large scale programs. Both the processing package and Parallel Python
tackle the issues of multi-processing in Python head on, from
different directions. Where the processing package tries to fit itself
into existing threading designs, pp uses a more explicit distributed
job model. Each approach has benefits and drawbacks, and neither is
suitable for every situation. Both, however, save you a lot of time
over the alternative of writing everything yourself with low-level
libraries. What an age to be alive!