How I Review a PyCon Talk Proposal

As the submission period for PyCon 2012 comes to a close and we
transition to reviewing the proposals from potential speakers, I
thought I would talk about the criteria I use for evaluating the
submissions. I am only one member of the Program Committee, and others
may use different criteria, so please take that into account while
reading this article.

I approach conference talk proposals in much the same way that I
approached submissions for Python Magazine. Just as the magazine had a
limited amount of space for articles each month, there are a limited
number of speaking slots at PyCon. The Program Committee’s job is to
choose the talks that maximize the benefit to the entire conference
audience, within the space and time constraints given. While reviewing
a proposal, I keep in mind that, as one person among 1,500, my opinion
represents only a small slice of the conference audience. While
my own enthusiasm and interest in a subject do matter, I try to
consider the rest of the audience first. The main question I keep in
my mind is, will the audience seeing this talk receive enough value to
make it worth bumping another talk? A lot of factors go into that
decision, and balancing them all can be a challenge.

The Abstract

The first step to reviewing the proposal is to read it. That seems
fairly obvious, but occasionally a title comes along that is so
compelling that I have to remind myself to keep reading before voting
+1 and moving on to the next talk. It isn’t enough for a proposal to
cover an interesting topic. It has to indicate that the talk will be
interesting, too. While I am reading, I look for several factors.

Is it clear, complete, and compelling?

First, is the abstract clear? The speaker should describe the topic
they plan to talk about in terms I can understand, even if I don’t
know anything about that subject area. The audience for PyCon is
diverse in experience and background, and so are the speakers. A
clearly written abstract, without a lot of domain-specific jargon,
tells me the speaker will be able to communicate with the audience.
If I can’t understand the proposal, I assume the audience won’t
understand the talk either.

Next, is the abstract complete? An incomplete proposal is the
largest negative factor I consider. If a proposal is incomplete, I
can’t really say what the speaker will talk about, or even if they
know the subject matter for their talk. If a proposal does not have a
detailed summary or outline, I ask the submitter to provide more
detail.

Finally, I consider whether the abstract is compelling. Without
regard to the actual subject, is the abstract written in a way to
attract an audience? Is it full of boring clichés, vaguely worded
assertions about the superiority of one tool over another, cute
metaphors, or hand waving? I look for an abstract that shows the
speaker is excited about the topic, and that they will be conveying
that same excitement to the audience.

The Topic

For some people, the subject matter of a talk is the most important,
or only, aspect taken into consideration when voting. I have seen
presentations on topics I thought would be boring, but which were
delivered with such enthusiasm that I enjoyed them more than talks I
thought would be interesting from the outset. In my mind, the topic is
an important factor, but not necessarily the most important, to be
considered.

How relevant, niche, immediately useful, and novel is the topic?

I look first at whether the topic is relevant to the conference
attendees. For PyCon, that largely means users of the Python
programming language. Attendees will have a range of experience
levels, interests, and backgrounds, though, so not every talk needs to
include examples of Python code showing how to solve a specific
problem. I have seen successful talks covering user interface
standards compliance, community building, Java-based testing tools,
and a host of other non-coding topics.

Although we want a broad set of topics, we do need to be careful to
avoid talks that are too narrowly focused on a niche. PyCon has
grown large enough that we run five tracks of talks
simultaneously. That means each talk is usually competing with four
others, and when the start and stop times don’t align it can be more
than that. Each talk needs to attract enough of that audience to
justify being included in the program instead of another of the many
rejected talks. There are no formal metrics for that criterion, because
we cannot know in advance how well attended each talk will be. Talks
that are unlikely to attract a significant audience at a large
conference such as PyCon may find a better fit at a regional or
subject-matter based conference. Examples of niche talks are those on
topics that require deep technical knowledge of a particular scientific
field, unpopular hardware platform, or industry. This is also why I tend to down-vote
talks on brand new libraries or tools, usually with an admonition to
resubmit the proposal when the project has matured and the community
around it has grown. Now that PyCon includes a poster session, I often
recommend that new projects which show a lot of promise convert their
talk proposal into a poster proposal.

As a counterpoint to considering whether a topic is too niche, I also
try to take into account whether the audience will take away something
immediately useful. The members of the Python community blend open
source and closed source solutions. Presentations on a closed source
application or service are less interesting, unless the techniques
presented can be applied elsewhere. Talks on open source projects are
more likely to involve reusable code, just by their nature. However,
new and incomplete projects don’t meet that standard.

The final criterion I consider for the proposal’s topic is whether it
is novel. Although PyCon has a significant amount of turnover in the
audience from year to year, I don’t want the program to be made up of
the same topics over and over. The same goes for subjects that have
already been covered in blog posts, articles, and books. I am unlikely to up-vote
an introductory talk for a project with excellent documentation unless
the talk presents material in a new and interesting way.

The Presenter

Foremost in my mind when thinking about the presenter is the question
of whether or not they will be successful at delivering their proposed
talk. That question covers a lot of ground, so I break it up into
several aspects and take each in turn.

How skilled, knowledgeable, and experienced is the speaker?

First, how skilled is the speaker? The PyCon audience deserves
speakers who take their task seriously by preparing and practicing.
Evaluating the presentation skills of an unknown speaker can be
challenging. I have not had the opportunity to attend a lot of other
conferences or user group meetings in person, so unless they have
spoken at PyCon in the recent past I am unlikely to know much about
their style or skill. Restricting myself to events I have attended
means I may tend to favor a small group of “usual suspects.” To avoid
that bias, I look for cases where they have spoken at user group
meetings, regional conferences, or other events. Having access to
videos of past presentations helps, and I try to review at least a few
minutes of video for a speaker, when it is available.

I also want to see an indication that the speaker understands the
subject well enough to speak at length about it. A good speaker does
not need to be a domain expert, but some knowledge and experience is
required. It is not necessary for every presentation on a tool to be
delivered by the tool’s developer. Users and other community members
with good public speaking skills can present a tool effectively.

Finally, does the speaker have enough experience speaking to know
their limits? There are two main limits: venue and time. PyCon is
typically held in large hotel ballrooms and that type of venue
presents a couple of challenges. The average audience for any given
talk is 1/4 to 1/5 of the 1,500 conference attendees. Due to the
structure of the venue, interactive sessions tend to flow less well
than the more standard talk-followed-by-questions format. Presentations
that rely on showing code battle with the deep room layout: fonts have
to be large enough to be read from the back of the room, which means
big blocks of code don’t fit on a single slide. All of these issues can
be overcome, but a presenter has to
know that they can’t just slap up their text editor together with a
terminal window using a low-contrast color scheme and expect the
audience to follow along during a live demo. Previous speaking
experience at smaller events shows me that a speaker is likely to
understand these issues.

The time issue is more difficult to balance. I want to see a talk with
enough depth that it shows something interesting, but reaching that
depth may require a time-consuming introduction. Setting the audience
level of the talk is one approach for avoiding this problem, because
the audience for an Intermediate or Experienced talk will not expect
as much of an introduction as someone in a Novice talk. On the other
hand, assuming too much knowledge may mean even an Experienced attendee
does not understand the main point of a presentation. Speakers can
reclaim time by removing redundant examples, tightening their focus,
and trimming some of the commentary, so I usually try to point out
material I think can be cut to give the speaker more time.

It is less common for an outline to seem too short to fill the space
given, but it does happen. In this case, I look for information the
presenter may need to add, suggest they work in extra examples, or
simply tell them it looks too short. If a talk can’t fill a 30 minute
slot, but is still important or interesting, I may propose they
convert their proposal into a poster, instead.

Voting Versus Steering

One point frequently lost in the heat of the reviewing process is that
the Program Committee is not just supposed to judge the
proposals. Our responsibility is to make the PyCon program
better. That mission includes guiding the speakers to improve their
talks by refining the original proposals.

I have already mentioned some of the common suggestions I make
during the course of a review. After considering all of the other
factors, I pause and pretend that the proposal I am reviewing is for
the only talk I will be able to see at the conference. Then I think of
ways the presentation could be improved, and engage with the speaker
to discuss them. I try to provide guidance based on past experience to
help newer speakers understand how their talk might be improved to fit
into PyCon better.

I also tend to ask a lot of questions, especially if I or other
reviewers have a negative impression of the proposal. Asking for
clarification or more detail is an important part of the process.
Sometimes the interaction with the speaker helps me decide whether to
champion a talk, or merely vote +0 or -0.

tl;dr

As I said at the outset, other reviewers may use different criteria
when evaluating a proposal. We want a big enough Program Committee
that we have a range of experience and input into the selection
process. But these are the criteria I use, so if I end up reviewing
your proposal, you know how I went about it.

  1. Is the abstract clear?
  2. Is the abstract complete?
  3. Is the abstract compelling?
  4. Is the topic relevant?
  5. Is the topic too niche?
  6. Is the material immediately useful?
  7. Is the presentation novel?
  8. Is the speaker skilled?
  9. Is the speaker knowledgeable?
  10. Is the speaker experienced?
  11. How can the proposal/talk be improved?

See also

Creating a Spelling Checker for reStructuredText Documents

I write a lot using reStructuredText files as the source format,
largely because of the ease of automating the tools used to convert
reST to other formats. The number of files involved has grown to the
point that some of the post-writing tasks were becoming tedious, and I
was skipping steps like running the spelling checker. I finally
decided to do something about that by creating a spelling checker
plugin for Sphinx, released as *sphinxcontrib-spelling*.

I have written about *why I chose reST* before. All of the articles
on this site, including the Python Module of the Week series,
started out as .rst files. I also use Sphinx to produce
several developer manuals at my day job. I like reST and Sphinx
because they both can be extended to meet new needs easily. One area
that has been lacking, though, is support for a spelling checker.

Checking the spelling of the contents of an individual
reStructuredText file from within a text editor like Aquamacs is
straightforward, but I have on the order of 200 separate files making
up parts of this site alone, not to mention *my book*. Manually
checking each file, one at a time, is a tedious job, and not one I
perform very often. After finding a few typos recently, I decided I
needed to take care of the problem by using automation to eliminate
the drudgery and make it easier to run the spelling checker regularly.

The files are already configured to be processed by Sphinx when they
are converted to HTML and PDF format, so that seemed like a natural
way to handle the spelling checker, too. To add a step to the build to
check the spelling of every file, I would need two new tools: an
extension to Sphinx to drive the spelling checker and the spelling
checker itself. I did not find any existing Sphinx extensions that
checked spelling, so I decided to write my own. The first step was to
evaluate spelling checkers.

Choosing a Spelling Checker

I recently read Peter Norvig’s article How to Write a Spelling
Corrector, which shows how to create a spelling checker from scratch
in Python. As with most nontrivial applications, though, the algorithm
for testing the words is only part of the story when looking at a
spelling checker. An equally important aspect is the dictionary of
words known to be spelled correctly. Without a good dictionary, the
algorithm would report too many false negatives. Not wanting to build
my own dictionary, I decided to investigate existing spelling checkers
and concentrate on writing the interface layer to connect them to
Sphinx.

There are several open source spelling checkers with Python bindings.
I evaluated aspell-python and PyEnchant (bindings for
enchant, the spelling checker from the AbiWord project). Both tools
required some manual setup to get the engine working. The
aspell-python API was simple to use, but I decided to use PyEnchant
instead. It has an active development group and is more extensible
(with APIs to define alternate dictionaries, tokenizers, and filters).

Installing PyEnchant

I started out by trying to install enchant and PyEnchant from source
under OS X with Python 2.7, but eventually gave up after having to
download several dependencies just to get configure to run for
enchant. I stuck with PyEnchant as a solution because installing
aspell was not really any easier (the installation experience for both
tools could be improved). The simplest solution for OS X and Windows
is to use the platform-specific binary installers for PyEnchant (not
the .egg), since they include all of the dependencies. That means it
cannot be installed into a virtualenv, but I was willing to live with
that for the sake of having any solution at all.

Linux platforms can probably install enchant via RPM or other system
package, so it is less of a challenge to get PyEnchant working there,
and it may even work with pip.

Using PyEnchant

There are several good examples in the PyEnchant tutorial, and I
will not repeat them here. I will cover some of the concepts, though,
as part of explaining the implementation of the new extension.

The PyEnchant API is organized around a “dictionary,” which can be
loaded at runtime based on a language name. Enchant does some work to
try to determine the correct language automatically based on the
environment settings, but I found it more reliable to set the language
explicitly. After the dictionary is loaded, its check() method can
be used to test whether a word is correct or not. For incorrect words,
the suggest() method returns a list of possible alternatives,
sorted by the likelihood they are the intended word.
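
For example, a quick round trip with the dictionary API looks
something like this (a minimal sketch, assuming the en_US dictionary
is installed):

import enchant

# Load the dictionary for a specific language instead of relying on
# the environment to pick one.
dictionary = enchant.Dict('en_US')

print(dictionary.check('spelling'))      # True
print(dictionary.check('mispelling'))    # False
print(dictionary.suggest('mispelling'))  # ['misspelling', 'dispelling', ...]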

The check() method works well for individual words, but cannot
process paragraphs. PyEnchant provides an API for checking larger
blocks of text, but I chose to use a lower level API instead. In
addition to the dictionary, PyEnchant includes a “tokenizer” API for
splitting text into candidate words to be checked. Using the tokenizer
API means that the new plugin can run some additional tests on words
not found in the dictionary. For example, I plan to provide an option
to ignore “misspelled” words that appear to be the name of an
importable Python module.
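
Combining the tokenizer with the dictionary gives the core of the
checking loop. A minimal sketch (EmailFilter is one of the filters
that ships with PyEnchant; the sample text is made up):

import enchant
from enchant.tokenize import get_tokenizer, EmailFilter

dictionary = enchant.Dict('en_US')
# Filters wrap the tokenizer and drop or rewrite candidate words.
tokenizer = get_tokenizer('en_US', [EmailFilter])

text = 'The quick brown fox jumpd over the lazy dog.'
unknown = [word for word, pos in tokenizer(text)
           if not dictionary.check(word)]
# unknown should contain just 'jumpd'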

Integrating with Sphinx

The Sphinx Extension API includes several ways to add new features
to Sphinx, including markup roles, language domains, processing
events, and directives. I chose to create a new “builder” class,
because that would give me complete control over the way the document
is processed. The builder API works with a parsed document to create
output, usually in a format like HTML or PDF. In this case, the
SpellingBuilder does not generate any output files. It prints the
list of misspelled words to standard output, and includes the headings
showing where the words appear in the document.
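
Stripped of the spelling-specific logic, the shape of such a builder
is just a handful of methods from the standard Builder interface. This
is a sketch only; the real write_doc() implementation appears below.

from sphinx.builders import Builder

class SpellingBuilder(Builder):
    name = 'spelling'

    def get_outdated_docs(self):
        # Re-check every document on each run.
        return self.env.found_docs

    def get_target_uri(self, docname, typ=None):
        # No output documents are produced, so there is nothing to link to.
        return ''

    def prepare_writing(self, docnames):
        # Create the SpellingChecker and open the output file here.
        pass

    def write_doc(self, docname, doctree):
        # Walk the text nodes and report unknown words (shown below).
        pass

    def finish(self):
        pass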

The first step in creating the new extension is to define a
setup() function to be invoked when the module is loaded. The
function receives an instance of the Sphinx application as its
argument, ready to be configured. In
sphinxcontrib.spelling.setup(), the new builder and several
configuration options are added to the application. Although the
Sphinx configuration file can contain any Python code, only the
explicitly registered configuration settings affect the way the
environment is saved.

def setup(app):
    app.info('Initializing Spelling Checker')
    app.add_builder(SpellingBuilder)
    # Report guesses about correct spelling
    app.add_config_value('spelling_show_suggestions', False, 'env')
    # Set the language for the text
    app.add_config_value('spelling_lang', 'en_US', 'env')
    # Set a user-provided list of words known to be spelled properly
    app.add_config_value('spelling_word_list_filename', 'spelling_wordlist.txt', 'env')
    # Assume anything that looks like a PyPI package name is spelled properly
    app.add_config_value('spelling_ignore_pypi_package_names', False, 'env')
    # Assume words that look like wiki page names are spelled properly
    app.add_config_value('spelling_ignore_wiki_words', True, 'env')
    # Assume words that are all caps, or all caps with trailing s, are spelled properly
    app.add_config_value('spelling_ignore_acronyms', True, 'env')
    # Assume words that are part of __builtins__ are spelled properly
    app.add_config_value('spelling_ignore_python_builtins', True, 'env')
    # Assume words that look like the names of importable modules are spelled properly
    app.add_config_value('spelling_ignore_importable_modules', True, 'env')
    # Add any user-defined filter classes
    app.add_config_value('spelling_filters', [], 'env')
    # Register the 'spelling' directive for setting parameters within a document
    rst.directives.register_directive('spelling', SpellingDirective)
    return

The builder class is derived from sphinx.builders.Builder. The
important method is write_doc(), which processes the parsed
documents and saves the messages with unknown words to the output
file.

def write_doc(self, docname, doctree):
    self.checker.push_filters(self.env.spelling_document_filters[docname])

    for node in doctree.traverse(docutils.nodes.Text):
        if node.tagname == '#text' and  node.parent.tagname in TEXT_NODES:

            # Figure out the line number for this node by climbing the
            # tree until we find a node that has a line number.
            lineno = None
            parent = node
            seen = set()
            while lineno is None:
                #self.info('looking for line number on %r' % node)
                seen.add(parent)
                parent = parent.parent  # climb one level up the tree
                if parent is None or parent in seen:
                    break
                lineno = parent.line
            filename = self.env.doc2path(docname, base=None)

            # Check the text of the node.
            for word, suggestions in self.checker.check(node.astext()):
                msg_parts = []
                if lineno:
                    msg_parts.append(darkgreen('(line %3d)' % lineno))
                msg_parts.append(red(word))
                msg_parts.append(self.format_suggestions(suggestions))
                msg = ' '.join(msg_parts)
                self.info(msg)
                self.output.write(u"%s:%s: (%s) %s\n" % (
                        self.env.doc2path(docname, None),
                        lineno, word,
                        self.format_suggestions(suggestions),
                        ))

                # We found at least one bad spelling, so set the status
                # code for the app to a value that indicates an error.
                self.app.statuscode = 1

    self.checker.pop_filters()
    return

The builder traverses all of the text nodes, skipping over formatting
nodes and container nodes that contain no text. Each node is converted
to plain text using its astext() method, and the text is given to
the SpellingChecker to be parsed and checked.

class SpellingChecker(object):
    """Checks the spelling of blocks of text.

    Uses options defined in the sphinx configuration file to control
    the checking and filtering behavior.
    """

    def __init__(self, lang, suggest, word_list_filename, filters=[]):
        self.dictionary = enchant.DictWithPWL(lang, word_list_filename)
        self.tokenizer = get_tokenizer(lang, filters)
        self.original_tokenizer = self.tokenizer
        self.suggest = suggest

    def push_filters(self, new_filters):
        """Add a filter to the tokenizer chain.
        """
        t = self.tokenizer
        for f in new_filters:
            t = f(t)
        self.tokenizer = t

    def pop_filters(self):
        """Remove the filters pushed during the last call to push_filters().
        """
        self.tokenizer = self.original_tokenizer

    def check(self, text):
        """Generator function that yields bad words and suggested alternate spellings.
        """
        for word, pos in self.tokenizer(text):
            correct = self.dictionary.check(word)
            if correct:
                continue
            yield word, self.dictionary.suggest(word) if self.suggest else []
        return

Finding Words in the Input Text

The blocks of text from the nodes are parsed using a language-specific
tokenizer provided by PyEnchant. The text is split into words, and
then each word is passed through a series of filters. The API defined
by enchant.tokenize.Filter supports two behaviors. Based on the
return value from _skip(), the word might be ignored entirely and
never returned by the tokenizer. Alternatively, the _split()
method can return a modified version of the text.

In addition to the filters for email addresses and “wiki words”
provided by PyEnchant, sphinxcontrib-spelling includes several
others. The AcronymFilter tells the tokenizer to skip words that
use all uppercase letters.

class AcronymFilter(Filter):
    """If a word looks like an acronym (all upper case letters),
    ignore it.
    """
    def _skip(self, word):
        return (word == word.upper() # all caps
                or
                # pluralized acronym ("URLs")
                (word[-1].lower() == 's'
                 and
                 word[:-1] == word[:-1].upper()
                 )
                )

The ContractionFilter expands common English contractions
that might appear in less formal blog posts.

class list_tokenize(tokenize):
    def __init__(self, words):
        tokenize.__init__(self, '')
        self._words = words
    def next(self):
        if not self._words:
            raise StopIteration()
        word = self._words.pop(0)
        return (word, 0)

class ContractionFilter(Filter):
    """Strip common contractions from words.
    """
    splits = {
        "won't":['will', 'not'],
        "isn't":['is', 'not'],
        "can't":['can', 'not'],
        "i'm":['I', 'am'],
        }
    def _split(self, word):
        # Fixed responses
        if word.lower() in self.splits:
            return list_tokenize(self.splits[word.lower()])

        # Possessive
        if word.lower().endswith("'s"):
            return unit_tokenize(word[:-2])

        # * not
        if word.lower().endswith("n't"):
            return unit_tokenize(word[:-3])

        return unit_tokenize(word)

Because I write about Python a lot, I tend to use the names of
projects that appear on the Python Package Index
(PyPI). PyPIFilterFactory fetches a list of the packages from
the index and then sets up a filter to ignore all of them.

class IgnoreWordsFilter(Filter):
    """Given a set of words, ignore them all.
    """
    def __init__(self, tokenizer, word_set):
        self.word_set = set(word_set)
        Filter.__init__(self, tokenizer)
    def _skip(self, word):
        return word in self.word_set

class IgnoreWordsFilterFactory(object):
    def __init__(self, words):
        self.words = words
    def __call__(self, tokenizer):
        return IgnoreWordsFilter(tokenizer, self.words)

class PyPIFilterFactory(IgnoreWordsFilterFactory):
    """Build an IgnoreWordsFilter for all of the names of packages on PyPI.
    """
    def __init__(self):
        client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
        IgnoreWordsFilterFactory.__init__(self, client.list_packages())

PythonBuiltinsFilter ignores functions built into the Python
interpreter.

class PythonBuiltinsFilter(Filter):
    """Ignore names of built-in Python symbols.
    """
    def _skip(self, word):
        return word in __builtins__

Finally, ImportableModuleFilter ignores words that match the
names of modules found on the import path. It uses imp to search
for the module
without actually importing it.

class ImportableModuleFilter(Filter):
    """Ignore names of modules that we could import.
    """
    def __init__(self, tokenizer):
        Filter.__init__(self, tokenizer)
        self.found_modules = set()
        self.sought_modules = set()
    def _skip(self, word):
        if word not in self.sought_modules:
            self.sought_modules.add(word)
            try:
                imp.find_module(word)
            except UnicodeEncodeError:
                return False
            except ImportError:
                return False
            else:
                self.found_modules.add(word)
                return True
        return word in self.found_modules

The SpellingBuilder creates the filter stack based on user
settings, so the filters can be turned on or off.

filters = [ ContractionFilter,
            EmailFilter,
            ]
if self.config.spelling_ignore_wiki_words:
    filters.append(WikiWordFilter)
if self.config.spelling_ignore_acronyms:
    filters.append(AcronymFilter)
if self.config.spelling_ignore_pypi_package_names:
    self.info('Adding package names from PyPI to local spelling dictionary...')
    filters.append(PyPIFilterFactory())
if self.config.spelling_ignore_python_builtins:
    filters.append(PythonBuiltinsFilter)
if self.config.spelling_ignore_importable_modules:
    filters.append(ImportableModuleFilter)
filters.extend(self.config.spelling_filters)
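
Since spelling_filters is a list of additional filter classes, a
project can plug its own rules into the chain from conf.py. A
hypothetical example (the JargonFilter name and its word list are
made up for illustration):

# conf.py
from enchant.tokenize import Filter

class JargonFilter(Filter):
    # Ignore project-specific jargon that is not in the dictionary.
    jargon = set(['sphinxcontrib', 'virtualenvwrapper'])

    def _skip(self, word):
        return word in self.jargon

spelling_filters = [JargonFilter]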

Using the Spelling Checker

PyEnchant and sphinxcontrib-spelling should be installed on the
import path for the same version of Python that Sphinx is using (refer
to the *project home page* for more details). Then the extension
needs to be explicitly enabled for a Sphinx project in order for the
builder to be recognized. To enable the extension, add it to the list
of extensions in conf.py.

extensions = [ 'sphinxcontrib.spelling' ]

The other options can be set in conf.py, as well. For example, to
turn on the filter to ignore the names of packages from PyPI, set
spelling_ignore_pypi_package_names to True.

spelling_ignore_pypi_package_names = True

Because the spelling checker is integrated with Sphinx using a new
builder class, it is not run when the HTML or LaTeX builders
run. Instead, it needs to run as a separate phase of the build by
passing the -b spelling option to sphinx-build. The output shows each
document name as it is processed, and if there are any errors the line
number and misspelled word are shown. When
spelling_show_suggestions is True, proposed corrections are
included in the output.

$ sphinx-build -b spelling -d build/doctrees source build/spelling
...
writing output... [ 31%] articles/how-tos/sphinxcontrib-spelling/index
(line 255) mispelling ["misspelling", "dispelling", "mi spelling",
"spelling", "compelling", "impelling", "rappelling"]
...

See Also

PyEnchant
Python interface to enchant.
*sphinxcontrib-spelling*
Project home page for the spelling checker.
sphinxcontrib
BitBucket repository for sphinxcontrib-spelling and several other
Sphinx extensions.
Sphinx Extension API
Describes methods for extending Sphinx.
*Defining Custom Roles in Sphinx*
Describes another way to extend Sphinx by modifying the
reStructuredText syntax.

Defining Custom Roles in Sphinx

Creating custom processing instructions for Sphinx is easy and will
make documenting your project less of a chore.

Apparently 42 is a magic number.

While working on issue 42 for virtualenvwrapper, I needed to create
a link from the history file to the newly resolved issue. I finally
decided that pasting the links in manually was getting old, and I
should do something to make it easier. Sphinx and docutils have
built-in markup for linking to RFCs and the Python developers use a
custom role for linking to their bug tracker issues. I decided to
create an extension so I could link to the issue trackers for my
BitBucket projects just as easily.

Extension Options

Sphinx is built on docutils, a set of tools for parsing and working
with reStructuredText markup. The rst parser in docutils is designed
to be extended in two main ways:

  1. Directives let you work with large blocks of text and intercept
     the parsing as well as formatting steps.

  2. Roles are intended for inline markup, within a paragraph.

Directives are used for handling things like blocks of code, including
source pulled in from external locations, or other large-scale processing.
Since each directive defines its own paragraphs, they operate at the
wrong scale for handling in-line markup. I needed to define a new
role.

Defining a Role Processor

The docutils parser works by converting the input text to an internal
tree representation made up of different types of nodes. The tree is
traversed by a writer to create output in the desired format. To add
a directive or role, you need to provide the hooks to be called to
handle the markup when it is encountered in the input file. A role
processor is defined with a function that takes arguments describing
the marked-up text and returns the nodes to be included in the parse
tree.
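
As a minimal illustration of that shape (not part of the extension
described here; the hello name is made up), a processor that simply
wraps its content in a bold node could be written as:

from docutils import nodes

def hello_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    # Return the nodes to add to the tree and an empty list of messages.
    node = nodes.strong(rawtext, 'Hello, ' + text)
    return [node], []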

Roles all have a common syntax, based on the interpreted text
feature of reStructuredText. For example, the rfc role for
linking to an RFC document looks like:

:rfc:`1822`

and produces links like **RFC 1822**, complete with the upper
case RFC.

In my case, I wanted to define new roles for linking to tickets in the
issue tracker for a project (bbissue) and Mercurial changesets
(bbchangeset). The first step was to define the role processing
function.

def bbissue_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    """Link to a BitBucket issue.

    Returns 2 part tuple containing list of nodes to insert into the
    document and a list of system messages.  Both are allowed to be
    empty.

    :param name: The role name used in the document.
    :param rawtext: The entire markup snippet, with role.
    :param text: The text marked with the role.
    :param lineno: The line number where rawtext appears in the input.
    :param inliner: The inliner instance that called us.
    :param options: Directive options for customization.
    :param content: The directive content for customization.
    """
    try:
        issue_num = int(text)
        if issue_num <= 0:
            raise ValueError
    except ValueError:
        msg = inliner.reporter.error(
            'BitBucket issue number must be a number greater than or equal to 1; '
            '"%s" is invalid.' % text, line=lineno)
        prb = inliner.problematic(rawtext, rawtext, msg)
        return [prb], [msg]
    app = inliner.document.settings.env.app
    node = make_link_node(rawtext, app, 'issue', str(issue_num), options)
    return [node], []

The parser invokes the role processor when it sees interpreted text
using the role in the input. It passes both the raw, unparsed text as
well as the contents of the interpreted text (the parts between the
backquotes). It also passes an “inliner,” the part of the parser that
saw the markup and invoked the processor. The inliner gives us a
handle back to docutils and Sphinx so we can access the runtime
environment to get configuration settings or save data for use later.

The return value from the processor is a tuple containing two lists.
The first list contains any new nodes to be added to the parse tree,
and the second list contains error or warning messages to show the
user. Processors are defined to return errors instead of raising
exceptions because the error messages can be inserted into the output
instead of halting all processing.

The bbissue role processor validates the input text by converting
it to an integer issue id. If that isn’t possible, it builds an error
message and returns a problematic node to be added to the output
file. It also returns the message text so the message is printed on
the console. If validation passes, a new node is constructed with
make_link_node(), and only that success node is included in the
return value.

To create the inline node with the hyperlink to a ticket,
make_link_node() looks in Sphinx’s configuration for a
bitbucket_project_url string. Then it builds a reference node
using the URL and other values derived from the values given by the
parser.

def make_link_node(rawtext, app, type, slug, options):
    """Create a link to a BitBucket resource.

    :param rawtext: Text being replaced with link node.
    :param app: Sphinx application context
    :param type: Link type (issue, changeset, etc.)
    :param slug: ID of the thing to link to
    :param options: Options dictionary passed to role func.
    """
    #
    try:
        base = app.config.bitbucket_project_url
        if not base:
            raise AttributeError
    except AttributeError, err:
        raise ValueError('bitbucket_project_url configuration value is not set (%s)' % str(err))
    #
    slash = '/' if base[-1] != '/' else ''
    ref = base + slash + type + '/' + slug + '/'
    set_classes(options)
    node = nodes.reference(rawtext, type + ' ' + utils.unescape(slug), refuri=ref,
                           **options)
    return node

Registering the Role Processor

With the role processor function defined, the next step is to tell
Sphinx to load the extension and to register the new role. Instead of
using setuptools entry points for defining plugins, Sphinx asks you to
list them explicitly in the configuration file. This makes it easy to
install several extensions to be used by several projects, and only
enable the ones you want for any given documentation set.

Extensions are listed in the conf.py configuration file for your
Sphinx project, in the extensions variable. I added my module to
the sphinxcontrib project namespace package, so the module has the
name sphinxcontrib.bitbucket.

# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = ['sphinx.ext.ifconfig',
              'sphinx.ext.autodoc',
              'sphinxcontrib.bitbucket',
              ]

Sphinx uses the name given to import the module or package containing
the extension, and then calls a function named setup() to
initialize the extension. During the initialization phase you can
register new roles and directives, as well as configuration values.

def setup(app):
    """Install the plugin.

    :param app: Sphinx application context.
    """
    app.add_role('bbissue', bbissue_role)
    app.add_role('bbchangeset', bbchangeset_role)
    app.add_config_value('bitbucket_project_url', None, 'env')
    return

For this extension I did not want to make any assumptions about the
BitBucket user or project name, so a bitbucket_project_url value
must be added to conf.py.

bitbucket_project_url = 'http://bitbucket.org/dhellmann/virtualenvwrapper/'

Accessing Sphinx Configuration from Your Role

Sphinx handles configuration a little differently from docutils, so I
had to dig for a while to find an explanation of how to access the
configuration value from within the role processor. The inliner
argument includes a reference to the current document being processed,
including the docutils settings. Those settings contain an
environment context object, which can be modified by the processors
(to track things like items to include in a table of contents or
index, for example). Sphinx adds its separate application context to
the environment, and the application context includes the
configuration settings. If your role function’s argument is
inliner, then the full path to access a config value called
my_setting is:

inliner.document.settings.env.app.config.my_setting

Results

The new bbissue role looks the same as the rfc role, with the
ticket id as the body of the interpreted text.

For example:

:bbissue:`42`

becomes: issue 42

See also

sphinxcontrib.bitbucket home
Home page for sphinxcontrib.bitbucket, with links to the issue
tracker and announcements of new releases.
sphinxcontrib.bitbucket source
The complete source code for the Sphinx extension described above,
including both bbissue and bbchangeset roles.
Tutorial: Writing a simple extension
Part of the Sphinx documentation set, this tutorial explains how
to create a basic directive processor.
Creating reStructuredText Interpreted Text Roles
David Goodger’s original documentation for creating new roles for
docutils.
Docutils Hacker’s Guide
An introduction to Docutils’ internals by Lea Wiemann.

Evaluating Tools for Developing with SOAP in Python

Originally published in Python Magazine, Volume 3, Issue 9, September
2009.

Greg Jednaszewski was my co-author for this article.

In order to better meet the needs of partners, Racemi needed to
build a private web service to facilitate tighter integration
between our applications and theirs. After researching the state of
SOAP development in Python, we were able to find a set of tools
that met our needs quite well. In this article, we will describe
the criteria we used to evaluate the available tools and the
process we followed to decide which library was right for us.

Racemi’s product, DynaCenter, is a server provisioning and data center
management software suite focusing on large private installations
where automation is key for our end-users. Because we are a small
company, our business model is organized around partnering with larger
companies in the same industry and acting as an OEM. Those partners
typically provide their own user interface, and drive DynaCenter’s
capture and provision services through our API.

Many of our partners’ automation and workflow management systems are
designed to call scripts or external programs, so the first version of
our API was implemented as a series of command line programs.
However, we are increasingly seeing a desire for more seamless
integration through web service APIs. Since most of our partners are
Java shops, in their minds the term web service is synonymous with
SOAP (Simple Object Access Protocol), an HTTP and XML-based protocol
for communicating between applications. Since Python’s standard
library does not include support for SOAP, we knew we would need to
research third-party libraries to find one suitable for creating a web
service interface to DynaCenter. Our first step was to develop a set
of minimum requirements.

Basic Requirements

DynaCenter is designed with several discrete layers that communicate
with each other as needed. The command line programs that comprise the
existing OEM API communicate with internal services running in daemons
on a central control server or on the managed systems. This layered
approach separates the exposed interface from the implementation
details, allowing us to change the implementation but maintain a
consistent API for use by partners. All of the real work for capturing
and provisioning server images is implemented inside the DynaCenter
core engine, which is invoked by the existing command line programs.
The first requirement we established was that the new web service
layer had to be thin so we could reuse as much existing code as
possible, and avoid re-implementing any of the core engine
specifically for the web service.

This project was unique in that many of the features of full-stack web
frameworks would not be useful in meeting our short-term requirements.
We have our own ORM for accessing DynaCenter’s database, so any
potential solution needed to be able to operate without a fully-
configured ORM component. In addition, we were not building a human
interface, so full-featured templating languages and integration with
Javascript toolkits were largely irrelevant to the project. On the
other hand, while we recognized that SOAP was a short-term requirement
from some of our partners, we did anticipate wanting to support other
protocols like JSON in the future without having to write a new
service completely from scratch.

We also knew that creating a polished product would require
comprehensive documentation. The WSDL (Web Service Definition
Language) file for the SOAP API, which is a formal machine-readable
declaration of what calls and data types an API supports, would be
helpful, but only as a reference. We planned to document the entire
API in a reference manual as well as with sample Java and Python code
bundled in a software development kit (SDK). We could write that
documentation manually, but integration with the documentation tools
was considered a bonus feature.

Finally, we needed support for complex data structures. Our data model
uses a fairly sophisticated representation of image meta-data,
including networking and storage requirements. DynaCenter also
maintains data about the peripherals in a server so that we can
reconfigure the contents of images as they are deployed to run under
new hardware configurations. This information is used as parameters
and return values throughout the API, so we needed to ensure that the
tool we chose would support data types beyond the simple built-ins
like strings and integers.

Meet the Candidates

Through our research, we were able to identify three viable candidate
solutions for building SOAP-based web services in Python.

The Zolera SOAP Infrastructure (ZSI) is part of the pywebsvcs
project. It provides complete server and client libraries for working
with SOAP. To use it, a developer writes the WSDL file (by hand or
using a WSDL editor), and then generates Python source for the client
and stubs for the server. The data structures defined in the WSDL file
are converted into Python classes that can be used in both client and
server code.

soaplib is a lightweight library from Optio Software. It also
supports only SOAP, but in contrast to ZSI it works by generating the
WSDL for your service based on your Python source code. soaplib is not
a full-stack solution, so it needs to be coupled with another
framework such as TurboGears or Pylons to create a service.

TGWebServices (TGWS) is a TurboGears-specific library written by
Kevin Dangoor and maintained by Christophe de Vienne. It provides a
special controller base class to act as the root of the service. It is
similar to soaplib in that it generates the WSDL for a service from
the source at runtime. In fact, we found a reference to the idea of
merging soaplib and TGWebServices, but that work seems to have stalled
out. One difference between the libraries is that TGWS also supports
JSON and “raw” XML messages for the same back-end code.

Now that we had the basic requirements identified and a few candidates
to test, we were able to create a list of evaluation criteria to help
us make our decision.

Installing

A primary concern was whether or not a tool could be installed and
made to work at all using any tutorial or guide from the
documentation. We used a clean virtualenv for each application and
used Python 2.6.2 for all tests. Initial evaluations were made under
Mac OS X 10.5 and eventually prototype servers were set up under
CentOS 4 so the rest of Racemi’s libraries could be used and the
service could work with real data.

The latest official release of ZSI (2.0-rc3) installed using
easy_install, including all dependencies and C extensions. A newer
alpha release (2.1-a1) also installed correctly from a source archive
we downloaded manually. The sample code provided with the source
archive had us up and running a test server in a short time.

We were less successful using easy_install with TGWS because we
did not start out with TurboGears installed and the dependencies were
not configured to bring it in automatically. After modifying the
dependencies in the package by hand, we were able to install it and
configure a test server following the documentation. Once we overcame
that problem, we found that the official distribution of TGWS is only
compatible with TurboGears 1.0. By asking on the support mailing list,
we found patches to make it compatible with TurboGears 1.1 and were
then able to bring up a test server. Since TurboGears 2.x has moved
away from CherryPy, and TGWS uses features of CherryPy, we did not try
to use TurboGears 2.x.

We never did get soaplib to install. It depends on lxml, and
installation on both of our test platforms failed with compilation
and link errors. At this point, soaplib was moved off of the list of
primary candidates. We kept it open as an option in case the other
tools did not pan out, but not being able to install it hurt our
ability to evaluate it completely.

Feature Completeness

Since we anticipated other web-related work, we also considered the
completeness of the stack. Although ZSI provides a full SOAP server,
it does not easily support other protocols. Since our only hard
requirement for protocols in the first version of the service was
SOAP, this limitation did not rule ZSI out immediately.

Because TGWS sits on top of TurboGears, we knew that if we eventually
wanted to create a UI for the service we could use the same stack. It
also supports JSON out of the box, so third-party JavaScript
developers could create their own UI as well.

Interoperability

Another concern was whether the tool would be inter-operable with a
wide variety of clients. We were especially interested in the Java
applications we expected our partners to be writing. Since we are
primarily a Python shop, we also wanted to be able to test the SOAP
API using Python libraries. In order to verify that both sets of
clients would work without issue, we constructed prototype servers
using each tool and tested them using SOAP clients in Python and Java
(using the Axis libraries).

Both ZSI and TGWS passed the compatibility tests we ran using both
client libraries. The only interoperability issue we came across was
with the SOAP faults generated by TGWS, which did not pass through the
strict XML parser used by the Java Axis libraries. We were able to
overcome this with a few modifications to TGWS (which we have
published for possible inclusion in a future version of TGWS).

Freshness

Our investigations showed that there had not been much recent
development of SOAP libraries in Python, even from the top contenders
we were evaluating. It wasn’t clear whether this was because the
existing tools were stable and declared complete, or if the Python
community has largely moved on to other protocols like JSON. To get a
sense of the “freshness” of each project, we looked for the last
commit to the source repository and also examined mailing list
archives for recent activity. We were especially interested in
responses from developers to requests for support.

The recent activity on the ZSI forums on Sourceforge seemed mostly to
be requests for help. The alpha release we used for one of the tests
was posted to the project site in November of 2007. There had been
more recent activity in the source tree, but we did not want to use an
unreleased package if we could avoid it.

The situation with TGWS was confusing at first because we found
several old sites. By following the chain of links from the oldest to
the newest, we found the most recent code in a BitBucket repository
being maintained by Christophe de Vienne. As mentioned earlier, the
project mailing list was responsive to questions about making TGWS
work with TurboGears 1.1, and pointed us towards a separate set of
patches that were not yet incorporated in the official release.

Documentation

As new users, we wanted to find good documentation for any tool we
selected. Having the source is useful for understanding how you’re
doing something wrong, but learning what to do in the first place
calls for separate instructions. All of the candidates provided enough
documentation for us to create a simple prototype server without too
much trouble.

Just as we expect to have documentation for third-party tools we use,
we need to provide API references and tutorials for the users of our
web service. We use Sphinx for all customer-facing documentation at
Racemi, since it allows us to manage the documentation source along
with our application code, and to build HTML and PDF versions of all
of our manuals. TGWS includes a Sphinx extension that adds directives
for generating documentation for web service controllers, so we could
integrate it with our existing build process easily. ZSI has no native
documentation features. We did consider building something to parse
the WSDL file and generate API docs from that, but the existing Sphinx
integration TGWS provided was a big bonus in our eyes.

Deployment Complexity

We evaluated the options for deploying all of the tools, including how
much the deployment could be automated and how flexible they were. We
decided to run our service behind an Apache proxy so we could encrypt
the traffic with SSL. All of the tools support the standard options
for doing this (mod_proxy, mod_python, and in some cases mod_wsgi),
so there was no clear winner for this criterion.

In addition to simple production deployment, we also needed an option
for running a server in “development” mode without requiring root
access or modifications to a bunch of system services. We found that
both ZSI and TGWS have good development server configurations, and
could be run directly out of a project source tree (in fact, that is
how the prototype servers were tested).

Packaging Complexity

As a packaged OEM product, DynaCenter is a small piece of a larger
software suite being deployed on servers outside of our control. It
needs to play well with others and be easy to install in the field.
Most installations are performed by trained integrators, but they are
not Python programmers and we don’t necessarily want to make them deal
with a lot of our implementation details. We definitely do not want
them downloading dependencies from the Internet, so we package our own
copy of Python and the libraries we use so that installation is
simpler and avoids version conflicts.

ZSI’s only external dependencies are PyXML and zope.interface. We
were already packaging PyXML for other reasons, and zope.interface
was easy to add. TGWS depends on TurboGears, which is a collection of
many separate packages. This made re-distribution less convenient,
since we had to grab the sources for each component separately.
Fortunately, the complete list is documented clearly in the
installation script for TurboGears and we were able to distill it down
to the few essential pieces we would actually be using. Those packages
were then integrated with our existing processes so they could be
included in the Python package we build.

Licensing

Although Racemi does contribute to open source tools when possible,
DynaCenter is not itself open source. We therefore had to eliminate
from consideration any tool that required the use of a GNU Public
License. ZSI uses a BSD-like license, which matched our requirements.
The zope.interface package is licensed under the Zope Public
License, which is also BSD-like. TGWS and most of the TurboGears
components are licensed under a BSD or MIT license. The only component
that even mentioned GNU was SQLObject, which uses the LGPL. That would
have been acceptable, but since we have our own ORM and do not need
SQLObject, we decided to skip including it in our package entirely to
avoid any question.

Elegance

SOAP toolkits tend to fall in one of two camps: Those that generate
source from a WSDL file and those that generate a WSDL document from
source. We didn’t particularly care which solution we ended up with,
as long as we didn’t have to write both the WSDL and the source code.
We also wanted to avoid writing vast amounts of boilerplate code, if
possible. As you will see from the examples below, the tools that
generated the WSDL from Python source turned out to be much more
elegant in the long run.

We also considered the helpfulness of the error messages as part of
evaluating the elegance and usability of the tools. With TGWS, most of
what we were writing was Python. Many of the initial errors we saw
were from the interpreter, and so the error types and descriptions
were familiar. Once those were eliminated, the errors we saw generated
by TGWS code were usually direct and clear, although they did not
always point at the parts of our source code where the problem could
be fixed.

In contrast, we found ZSI’s errors to be very obscure. It seemed many
were caused by a failure of the library to trap problems in the
underlying code, such as indexing into a None value as a tuple.
Even the errors that were generated explicitly by the ZSI code left us
scratching our heads on occasion. We continued evaluating both tools,
but by this time we were leaning towards TGWS and growing more
frustrated with ZSI.

Testing

Automated testing is especially important for a complex product like
DynaCenter, so being able to write tests for the new web service and
integrate them with our existing test suite was an important feature.
ZSI does not preclude writing automated tests, but does not come with
any obvious framework or features for supporting them, so we would
need to roll our own. TGWS takes advantage of TurboGears’ integration
with WebTest to let the developer write unit and integration tests in
Python without even needing to start a test daemon.

Performance

Once we established the ease of creating and testing services with
TGWS, we had basically made our choice for that library. However,
there was one last criterion to check: performance. Using the prototype
servers we had set up for experimenting with the tools, we took some
basic timing measurements by writing a SOAP client in Python to invoke
a service that returned a large data set (500 copies of a complex type
with several properties of different types). We measured the time it
took for the client to ask for the data and then parse it into usable
objects.

The data structure definition was the same for both services, and we
found no significant difference in the performance of the two SOAP
implementations. Interestingly, as the amount of data increased, the
JSON performance reached a 10x improvement over SOAP. Our hypothesis
for the performance difference is that there was less data to parse,
the parser was more efficient, and the objects being created in the
client were simpler because JSON does not try to instantiate user-
defined classes.
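
A rough sketch of that kind of timing harness is below; the client
call named in the comment is hypothetical rather than our actual API.

import time

def time_call(fetch, iterations=10):
    # Return the average number of seconds per request-and-parse cycle.
    start = time.time()
    for _ in range(iterations):
        fetch()
    return (time.time() - start) / iterations

# Usage, with a hypothetical client object:
#   print time_call(lambda: soap_client.getLargeResultSet())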

Prototyping with ZSI

We were somewhat familiar with ZSI because we had used it in the past
for building a client for interacting with the VMware Virtual Center
web service, so we started with ZSI as our first prototype. For both
prototypes, we implemented a simple echo service that returns as
output whatever it gets as input from the client. Listing 1 contains
the hand-crafted WSDL inputs for the ZSI version of this service.

Listing 1

<?xml version="1.0" encoding="UTF-8"?>
<definitions
  xmlns="http://schemas.xmlsoap.org/wsdl/"
  xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
  xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
  xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:tns="urn:ZSI"
  targetNamespace="urn:ZSI" >

  <types>
    <xsd:schema elementFormDefault="qualified"
        targetNamespace="urn:ZSI">
      <xsd:element name="Echo">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="value" type="xsd:anyType"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>
  </types>

  <message name="EchoRequest">
    <part name="parameters" element="tns:Echo" />
  </message>
  <message name="EchoResponse">
    <part name="parameters" element="tns:Echo"/>
  </message>

  <portType name="EchoServer">
    <operation name="Echo">
      <input message="tns:EchoRequest"/>
      <output message="tns:EchoResponse"/>
    </operation>
  </portType>

  <binding name="EchoServer" type="tns:EchoServer">
    <soap:binding style="document"
                  transport="http://schemas.xmlsoap.org/soap/http"/>
    <operation name="Echo">
      <soap:operation soapAction="Echo"/>
      <input>
        <soap:body use="literal"/>
      </input>
      <output>
        <soap:body use="literal"/>
      </output>
    </operation>
  </binding>

  <service name="EchoServer">
    <port name="EchoServer" binding="tns:EchoServer">
      <soap:address location="http://localhost:7000"/>
    </port>
  </service>

</definitions>

To generate the client and server code from the WSDL, feed it into the
wsdl2py program (included with ZSI). To add support for complex
types, add the -b option, but it isn’t required for this simple
example. wsdl2py will, in response, produce three files:

Listing 2

EchoServer_client.py is the code needed to build a client for the
SimpleEcho web service.

##################################################
# file: EchoServer_client.py
#
# client stubs generated by
# "ZSI.generate.wsdl2python.WriteServiceModule"
#
##################################################

from EchoServer_types import *
import urlparse, types
from ZSI.TCcompound import ComplexType, Struct
from ZSI import client
from ZSI.schema import GED, GTD
import ZSI
from ZSI.generate.pyclass import pyclass_type

# Locator
class EchoServerLocator:
    EchoServer_address = "http://localhost:7000"
    def getEchoServerAddress(self):
        return EchoServerLocator.EchoServer_address
    def getEchoServer(self, url=None, **kw):
        return EchoServerSOAP(
            url or EchoServerLocator.EchoServer_address,
            **kw)

# Methods
class EchoServerSOAP:
    def __init__(self, url, **kw):
        kw.setdefault("readerclass", None)
        kw.setdefault("writerclass", None)
        # no resource properties
        self.binding = client.Binding(url=url, **kw)
        # no ws-addressing

    # op: Echo
    def Echo(self, request, **kw):
        if isinstance(request, EchoRequest) is False:
            raise TypeError, "%s incorrect request type" % \
                (request.__class__)
        # no input wsaction
        self.binding.Send(None, None, request, soapaction="Echo", **kw)
        # no output wsaction
        response = self.binding.Receive(EchoResponse.typecode)
        return response

EchoRequest = GED("urn:ZSI", "Echo").pyclass

EchoResponse = GED("urn:ZSI", "Echo").pyclass

Listing 3

EchoServer_server.py contains code needed to build the
SimpleEcho web service server.

##################################################
# file: EchoServer_server.py
#
# skeleton generated by
#  "ZSI.generate.wsdl2dispatch.ServiceModuleWriter"
#
##################################################

from ZSI.schema import GED, GTD
from ZSI.TCcompound import ComplexType, Struct
from EchoServer_types import *
from ZSI.ServiceContainer import ServiceSOAPBinding

# Messages
EchoRequest = GED("urn:ZSI", "Echo").pyclass

EchoResponse = GED("urn:ZSI", "Echo").pyclass


# Service Skeletons
class EchoServer(ServiceSOAPBinding):
    soapAction = {}
    root = {}

    def __init__(self, post='', **kw):
        ServiceSOAPBinding.__init__(self, post)

    def soap_Echo(self, ps, **kw):
        request = ps.Parse(EchoRequest.typecode)
        return request,EchoResponse()

    soapAction['Echo'] = 'soap_Echo'
    root[(EchoRequest.typecode.nspname,EchoRequest.typecode.pname)] = \
        'soap_Echo'

Listing 4

EchoServer_types.py has type definitions used by both the client
and server code.

##################################################
# file: EchoServer_types.py
#
# schema types generated by
#  "ZSI.generate.wsdl2python.WriteServiceModule"
#
##################################################

import ZSI
import ZSI.TCcompound
from ZSI.schema import (LocalElementDeclaration, ElementDeclaration,
                        TypeDefinition, GTD, GED)
from ZSI.generate.pyclass import pyclass_type

##############################
# targetNamespace
# urn:ZSI
##############################

class ns0:
    targetNamespace = "urn:ZSI"

    class Echo_Dec(ZSI.TCcompound.ComplexType, ElementDeclaration):
        literal = "Echo"
        schema = "urn:ZSI"
        def __init__(self, **kw):
            ns = ns0.Echo_Dec.schema
            TClist = [ZSI.TC.AnyType(pname=(ns,"value"),
                      aname="_value", minOccurs=1, maxOccurs=1,
                      nillable=False, typed=False,
                      encoded=kw.get("encoded"))]
            kw["pname"] = ("urn:ZSI","Echo")
            kw["aname"] = "_Echo"
            self.attribute_typecode_dict = {}
            ZSI.TCcompound.ComplexType.__init__(self,None,TClist,
                                                inorder=0,**kw)
            class Holder:
                __metaclass__ = pyclass_type
                typecode = self
                def __init__(self):
                    # pyclass
                    self._value = None
                    return
            Holder.__name__ = "Echo_Holder"
            self.pyclass = Holder

# end class ns0 (tns: urn:ZSI)

Once generated, these files are not meant to be edited, because they
will be regenerated as part of a build process whenever the WSDL input
changes. The code in the files grows as more types and calls are added
to the service definition.

The implementation of the server goes in a separate file that imports
the generated code. In the example, the actual service is the
soap_Echo() method of the EchoService class in Listing 5. The
@soapmethod decorator defines the input (an EchoRequest) and the
output (an EchoResponse) for the call. The implementation of
soap_Echo() just fills in the response value with the request value,
and returns both the request and the response. From there, ZSI takes
care of building the SOAP response and sending it back to the client.

Listing 5

import os
import sys
from EchoServer_client import *
from ZSI.twisted.wsgi import (SOAPApplication,
                              soapmethod,
                              SOAPHandlerChainFactory)

class EchoService(SOAPApplication):
    factory = SOAPHandlerChainFactory
    wsdl_content = dict(name='Echo',
                        targetNamespace='urn:echo',
                        imports=(),
                        portType='',
                        )

    def __call__(self, env, start_response):
        self.env = env
        return SOAPApplication.__call__(self, env, start_response)

    @soapmethod(EchoRequest.typecode,
                EchoResponse.typecode,
                operation='Echo',
                soapaction='Echo')
    def soap_Echo(self, request, response, **kw):
        # Just return what was sent
        response.Value = request.Value
        return request, response

def main():
    from wsgiref.simple_server import make_server
    from ZSI.twisted.wsgi import WSGIApplication

    application         = WSGIApplication()
    httpd               = make_server('', 7000, application)
    application['echo'] = EchoService()
    print "listening..."
    httpd.serve_forever()

if __name__ == '__main__':
    main()

Listing 6 includes a sample of how to use the ZSI client libraries to
access the server. All that needs to be done is
to create a handle to the EchoServer web service, build an
EchoRequest, send it off to the web service, and read the
response.

Listing 6

from EchoServer_client import *
import sys, time

loc  = EchoServerLocator()
port = loc.getEchoServer(url='http://localhost:7000/echo')

print "Echo: ",
msg = EchoRequest()
msg.Value = "Is there an echo in here?"
rsp = port.Echo(msg)
print rsp.Value

Prototyping with TGWebServices

To get started with TGWebServices, first create a TurboGears project
by running tg-admin quickstart, which prompts you to name the new
project and Python package and then produces a directory structure
full of skeleton code. The directory names are based on the project
and package names chosen when running tg-admin. The top-level
directory contains sample configuration files, a script for starting
the server, and a subdirectory with all the Python code for the web
service.

tg-admin will generate several Python files, but the important
file for defining the web service is controllers.py. Listing 7
shows the controllers.py file for our prototype echo server. The
@wsexpose decorator exposes the web service call and defines the
return type as a string, while @wsvalidate defines the data types for
each parameter. As with the ZSI example, the actual implementation of
the echo call just returns what is passed in.

Listing 7

from turbogears import controllers, expose, flash
from tgwebservices.controllers import WebServicesRoot, wsexpose, wsvalidate

class EchoService(WebServicesRoot):
    """EchoService web service definition"""

    @wsexpose(str)
    @wsvalidate(value=str)
    def echo(self, value):
        "Echo the input back to the caller."
        return value

class Root(controllers.RootController):
    """The root controller of the application."""

    echo = EchoService('http://localhost:7000/echo/')

The auto-generated WSDL for the web service is accessible via
http://<server>/echo/soap/api.wsdl. Listing 8 shows an example of
the WSDL generated by TGWS for the prototype EchoService. It includes
definitions of all types used in the API, the request and response
message wrappers for each call, the port type and binding for each
operation, and a service definition pointing to the server generating
the WSDL document. Each operation includes the docstring from the
method implementing it.

Listing 8

<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" name="EchoService" xmlns:types="http://localhost:7000/echo/soap/types" xmlns:soapenc="http://www.w3.org/2001/09/soap-encoding" targetNamespace="http://localhost:7000/echo/soap/" xmlns:tns="http://localhost:7000/echo/soap/">
   <wsdl:types>
     <xsd:schema elementFormDefault="qualified" targetNamespace="http://localhost:7000/echo/soap/types">
          <xsd:element name="echo">
            <xsd:complexType>
              <xsd:sequence>
                <xsd:element name="value" type="xsd:string"/>
              </xsd:sequence>
            </xsd:complexType>
          </xsd:element>
          <xsd:element name="echoResponse">
            <xsd:complexType>
              <xsd:sequence>
                <xsd:element name="result" type="xsd:string"/>
              </xsd:sequence>
            </xsd:complexType>
          </xsd:element>
      </xsd:schema>
   </wsdl:types>
   <wsdl:message name="echoRequest" xmlns="http://localhost:7000/echo/soap/types">
       <wsdl:part name="parameters" element="types:echo"/>
   </wsdl:message>
   <wsdl:message name="echoResponse" xmlns="http://localhost:7000/echo/soap/types">
      <wsdl:part name="parameters" element="types:echoResponse"/>
   </wsdl:message>
   <wsdl:portType name="EchoService_PortType">
      <wsdl:operation name="echo">
        <wsdl:documentation>Echo the input back to the caller.</wsdl:documentation>
         <wsdl:input message="tns:echoRequest"/>
         <wsdl:output message="tns:echoResponse"/>
      </wsdl:operation>
   </wsdl:portType>
   <wsdl:binding name="EchoService_Binding" type="tns:EchoService_PortType">
      <soap:binding style="document" transport="http://schemas.xmlsoap.org/soap/http"/>
      <wsdl:operation name="echo">
         <soap:operation soapAction="echo"/>
         <wsdl:input>
            <soap:body use="literal"/>
         </wsdl:input>
         <wsdl:output>
            <soap:body use="literal"/>
         </wsdl:output>
      </wsdl:operation>
   </wsdl:binding>
   <wsdl:service name="EchoService">
      <wsdl:documentation>WSDL File for EchoService</wsdl:documentation>
      <wsdl:port binding="tns:EchoService_Binding" name="EchoService_PortType">
         <soap:address location="http://localhost:7000/echo/soap/"/>
      </wsdl:port>
   </wsdl:service>
</wsdl:definitions>

The tgwsdoc extension to Sphinx, distributed with TGWS, adds
several auto-documentation directives to make it easy to keep your
documentation in sync with your code. By using autotgwstype,
autotgwscontroller, and autotgwsfunction, you can insert
definitions of the complex types, controllers, or individual API calls
in with the rest of your hand-written documentation. This was
especially useful for us because we already had a lot of text
explaining our existing command line interface. We were able to reuse
a lot of the material and document all three interfaces (command line,
SOAP, and JSON) with a single tool.

Implementation Considerations

Once we had chosen TGWS as our framework, we set about working on the
first implementation of our real service. This helped us uncover a few
small problems with our original “pure” design, and some details we
had not considered while prototyping.

For example, we wanted to make sure that our web service was not only
interoperable with Java clients, but also that the API made sense to a
Java developer. One tool they might use, the Java Axis client, works
by feeding the WSDL file into a code generator to produce source code
for client classes. After we tried working with
the generated Java code, we adjusted our web service API to make it
more usable. For instance, Java doesn’t allow you to specify defaults
for method arguments, which caused problems with a couple of web
service calls that had a handful of required arguments along with many
optional keyword arguments. On the Java side, the caller would have to
pass in all 23 parameters to the call, most of them null placeholders
for the optional parameters. To address that, we moved all the
optional parameters to a separate “options” object that could be
populated and passed in for advanced operations.
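
A sketch of the kind of API change that implies is shown here; the
class and method names are hypothetical, not our actual interface.

class CaptureOptions(object):
    """Optional settings gathered into a single complex type."""
    def __init__(self):
        self.timeout = None
        self.compress = False
        self.description = ''

class CaptureService(object):
    def captureServer(self, serverId, options=None):
        # Required arguments stay as plain parameters; everything
        # optional travels inside the options object, so a Java caller
        # builds one CaptureOptions instead of passing a long list of
        # null placeholders.
        options = options or CaptureOptions()
        return serverId, options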

There were other minor annoyances, such as the way a camelCase
naming convention resulted in nicer-looking Java code than the
under_scored naming convention typically used by Python
programmers. We ended up going with camelCase names for attributes
and methods of classes used in the public side of the web service.
After making these tweaks, it is not difficult to design an API with
TGWS that makes sense to both Java and Python client developers.

Testing in Java was another challenge for us to work out. We have a
large suite of Python tests driven by nose, and we ultimately were
able to automate the client-side Java testing using junit. We then
integrated the two suites by writing a single Python test to run all
of the junit tests in a separate process and parse the results
from the output.
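
A simplified sketch of that kind of bridge test follows; the class
path and test class name are hypothetical, and it only checks the exit
status rather than parsing the JUnit output.

import subprocess

def test_java_client_suite():
    # Run the JUnit suite in a separate process and fail the Python
    # test if any Java test fails.
    proc = subprocess.Popen(
        ['java', '-cp', 'build/test-classes:lib/junit.jar',
         'org.junit.runner.JUnitCore', 'com.example.EchoClientTest'],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = proc.communicate()[0]
    assert proc.returncode == 0, output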

In addition to developer tests, Racemi has a dedicated group of test
engineers who perform QA and acceptance tests before each new version
of DynaCenter is released. The QA team needed a client library to use
for testing the new web service. None of them are Java programmers, so
the Dev team took on the task of basic Java integration testing. But
for full-on regression testing and automation, QA needed something
lightweight and easy to get up and running with quickly. Suds fit
this bill quite nicely. It is a client-only SOAP interface for Python
that reads the WSDL file at runtime and provides client bindings for
the web service API. Armed with our WSDL and the Suds documentation,
our QA team was able to start building a client test harness almost
immediately.
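
A Suds session against the prototype echo service is only a few lines;
a minimal sketch:

from suds.client import Client

# Read the WSDL published by the running service and call it directly.
client = Client('http://localhost:7000/echo/soap/api.wsdl')
print client.service.echo('Is there an echo in here?')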

Conclusions

At the beginning of our evaluation process, we knew there were a lot
of ways to compare the available tools. At first, we weren’t sure if
the code-from-WSDL model used by ZSI or the WSDL-from-code model used
by TGWebServices and soaplib would be easier to use. After creating
the simple echo service prototype with both tools, we found that
writing Python and generating the WSDL worked much better for us.
Because WSDL is an XML format primarily concerned with types, we found
it excessively verbose compared to the Python needed to back it up.
It felt much more natural to express our API with Python code and then
generate the description of it. Starting with the code also led to
fewer situations where translating to WSDL produced errors, unlike
when we tried to manage the WSDL by hand.

Because WSDL is an XML format primarily concerned with types, we found
it excessively verbose compared to the Python needed to back it up.

As mentioned earlier, we ended up needing to patch TGWebServices to
make it work correctly with TurboGears 1.1. Those patches were
available on the Internet as separate downloads, but we decided to
“fork” the original Mercurial repository and create a new version
that included them directly. We have also added a few other
enhancements, such as the option of specifying which formats (JSON
and/or XML) to use when documenting sample types, and better SOAP
error message handling. We are working with Christophe de Vienne to
move those changes upstream.

TGWebServices stood out as the clear winner for our needs.

Aside from the ease-of-use benefits and technical merits of
TGWebServices, there were several bonus features that made it
appealing. The integration with Sphinx for generating documentation
meant that not only would we not have to write the reference guide as
a completely separate task, but it would never grow stale as code
(especially data structures) changed during the evolution of the
API. Getting the JSON for “free” was another big win for us because it
made testing easier and did not lock us in to a SOAP solution for all
of our partners. Couple that with the benefit of having the TurboGears
framework already in place for a possible web UI down the road, and
TGWebServices stood out as the clear winner for our needs.

Python Exception Handling Techniques

Error reporting and processing through exceptions is one of
Python’s key features. Care must be taken when handling exceptions
to ensure proper application cleanup while maintaining useful
error reporting.

Error reporting and processing through exceptions is one of Python’s
key features. Unlike C, where the common way to report errors is
through function return values that then have to be checked on every
invocation, in Python a programmer can raise an exception at any point
in a program. When the exception is raised, program execution is
interrupted as the interpreter searches back up the stack to find a
context with an exception handler. This search algorithm allows error
handling to be organized cleanly in a central or high-level place
within the program structure. Libraries may not need to do any
exception handling at all, and simple scripts can frequently get away
with wrapping a portion of the main program in an exception handler to
print a nicely formatted error. Proper exception handling in more
complicated situations can be a little tricky, though, especially in
cases where the program has to clean up after itself as the exception
propagates back up the stack.

Throwing and Catching

The statements used to deal with exceptions are raise and
except. Both are language keywords. The most common form of
throwing an exception with raise uses an instance of an exception
class.

#!/usr/bin/env python

def throws():
    raise RuntimeError('this is the error message')

def main():
    throws()

if __name__ == '__main__':
    main()

The arguments needed by the exception class vary, but usually include
a message string to explain the problem encountered.
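
A custom exception class can also carry extra context along with the
message string; a small sketch with hypothetical names:

class UploadError(RuntimeError):
    def __init__(self, message, filename):
        RuntimeError.__init__(self, message)
        self.filename = filename

# Raising it works like any other exception:
raise UploadError('could not store file', '/tmp/example.dat')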

If the exception is left unhandled, the default behavior is for the
interpreter to print a full traceback and the error message included
in the exception.

$ python throwing.py
Traceback (most recent call last):
  File "throwing.py", line 10, in <module>
    main()
  File "throwing.py", line 7, in main
    throws()
  File "throwing.py", line 4, in throws
    raise RuntimeError('this is the error message')
RuntimeError: this is the error message

For some scripts this behavior is sufficient, but it is nicer to catch
the exception and print a more user-friendly version of the error.

#!/usr/bin/env python

import sys

def throws():
    raise RuntimeError('this is the error message')

def main():
    try:
        throws()
        return 0
    except Exception, err:
        sys.stderr.write('ERROR: %s\n' % str(err))
        return 1

if __name__ == '__main__':
    sys.exit(main())

In the example above, all exceptions derived from Exception are
caught, and just the error message is printed to stderr. The program
follows the Unix convention of returning an exit code indicating
whether there was an error or not.

$ python catching.py
ERROR: this is the error message

Logging Exceptions

For daemons or other background processes, printing directly to stderr
may not be an option. The file descriptor might have been closed, or
it may be redirected somewhere that errors are hard to find. A better
option is to use the logging module to log the error, including the
full traceback.

#!/usr/bin/env python

import logging
import sys

def throws():
    raise RuntimeError('this is the error message')

def main():
    logging.basicConfig(level=logging.WARNING)
    log = logging.getLogger('example')
    try:
        throws()
        return 0
    except Exception, err:
        log.exception('Error from throws():')
        return 1

if __name__ == '__main__':
    sys.exit(main())

In this example, the logger is configured to use the default
behavior of sending its output to stderr, but that can easily be
adjusted. Saving tracebacks to a log file can make it easier to debug
problems that are otherwise hard to reproduce outside of a production
environment.

$ python logging_errors.py
ERROR:example:Error from throws():
Traceback (most recent call last):
  File "logging_errors.py", line 13, in main
    throws()
  File "logging_errors.py", line 7, in throws
    raise RuntimeError('this is the error message')
RuntimeError: this is the error message
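
To send the same information to a file instead, point basicConfig()
at a filename; a small sketch (the filename is hypothetical):

import logging

logging.basicConfig(filename='example.log', level=logging.WARNING)
log = logging.getLogger('example')

try:
    raise RuntimeError('this is the error message')
except Exception:
    # The traceback ends up in example.log instead of on stderr.
    log.exception('Error from throws():')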

Cleaning Up and Re-raising

In many programs, simply reporting the error isn’t enough. If an
error occurs part way through a lengthy process, you may need to undo
some of the work already completed. For example, changes to a
database may need to be rolled back or temporary files may need to be
deleted. There are two ways to handle cleanup operations: using a
finally stanza coupled to the exception handler, or using an explicit
exception handler that re-raises the exception after the cleanup is
done.

For cleanup operations that should always be performed, the simplest
implementation is to use try:finally. The finally stanza is
guaranteed to be run, even if the code inside the try block raises
an exception.

#!/usr/bin/env python

import sys

def throws():
    print 'Starting throws()'
    raise RuntimeError('this is the error message')

def main():
    try:
        try:
            throws()
            return 0
        except Exception, err:
            print 'Caught an exception'
            return 1
    finally:
        print 'In finally block for cleanup'

if __name__ == '__main__':
    sys.exit(main())

This old-style example wraps a try:except block with a
try:finally block to ensure that the cleanup code is called no
matter what happens inside the main program.

$ python try_finally_oldstyle.py
Starting throws()
Caught an exception
In finally block for cleanup

While you may continue to see that style in older code, since Python
2.5 it has been possible to combine try:except and try:finally
blocks into a single level. Because the newer style uses fewer levels
of indentation and the resulting code is easier to read, it is being
adopted quickly.

#!/usr/bin/env python

import sys

def throws():
    print 'Starting throws()'
    raise RuntimeError('this is the error message')

def main():
    try:
        throws()
        return 0
    except Exception, err:
        print 'Caught an exception'
        return 1
    finally:
        print 'In finally block for cleanup'

if __name__ == '__main__':
    sys.exit(main())

The resulting output is the same:

$ python try_finally.py
Starting throws()
Caught an exception
In finally block for cleanup

Re-raising Exceptions

Sometimes the cleanup action you need to take for an error is
different from the action for an operation that succeeds. For
example, with a database you may need to roll back the transaction if
there is an error but commit otherwise. In such cases, you will have
to catch the exception and handle it. It may be necessary to catch
the exception in an intermediate layer of your application to undo
part of the processing, then re-raise it so the error continues to
propagate.

#!/usr/bin/env python
"""Illustrate database transaction management using sqlite3.
"""

import logging
import os
import sqlite3
import sys

DB_NAME = 'mydb.sqlite'
logging.basicConfig(level=logging.INFO)
log = logging.getLogger('db_example')

def throws():
    raise RuntimeError('this is the error message')

def create_tables(cursor):
    log.info('Creating tables')
    cursor.execute("create table module (name text, description text)")

def insert_data(cursor):
    for module, description in [('logging', 'error reporting and auditing'),
                                ('os', 'Operating system services'),
                                ('sqlite3', 'SQLite database access'),
                                ('sys', 'Runtime services'),
                                ]:
        log.info('Inserting %s (%s)', module, description)
        cursor.execute("insert into module values (?, ?)", (module, description))
    return

def do_database_work(do_create):
    db = sqlite3.connect(DB_NAME)        
    try:
        cursor = db.cursor()
        if do_create:
            create_tables(cursor)
        insert_data(cursor)
        throws()
    except:
        db.rollback()
        log.error('Rolling back transaction')
        raise
    else:
        log.info('Committing transaction')
        db.commit()
    return

def main():
    do_create = not os.path.exists(DB_NAME)
    try:
        do_database_work(do_create)
    except Exception, err:
        log.exception('Error while doing database work')
        return 1
    else:
        return 0

if __name__ == '__main__':
    sys.exit(main())

This example uses a separate exception handler in
do_database_work() to undo the changes made in the database, then
a global exception handler to report the error message.

$ python sqlite_error.py
INFO:db_example:Creating tables
INFO:db_example:Inserting logging (error reporting and auditing)
INFO:db_example:Inserting os (Operating system services)
INFO:db_example:Inserting sqlite3 (SQLite database access)
INFO:db_example:Inserting sys (Runtime services)
ERROR:db_example:Rolling back transaction
ERROR:db_example:Error while doing database work
Traceback (most recent call last):
  File "sqlite_error.py", line 51, in main
    do_database_work(do_create)
  File "sqlite_error.py", line 38, in do_database_work
    throws()
  File "sqlite_error.py", line 15, in throws
    raise RuntimeError('this is the error message')
RuntimeError: this is the error message

Preserving Tracebacks

Frequently the cleanup operation itself introduces another opportunity
for an error condition in your program. This is especially the case
when a system runs out of resources (memory, disk space, etc.).
Exceptions raised from within an exception handler can mask the
original error if they aren’t handled locally.

#!/usr/bin/env python

import sys
import traceback

def throws():
    raise RuntimeError('error from throws')
    
def nested():
    try:
        throws()
    except:
        cleanup()
        raise

def cleanup():
    raise RuntimeError('error from cleanup')

def main():
    try:
        nested()
        return 0
    except Exception, err:
        traceback.print_exc()
        return 1

if __name__ == '__main__':
    sys.exit(main())

When cleanup() raises an exception while the original error is
being processed, the exception handling machinery is reset to deal
with the new error.

$ python masking_exceptions.py
Traceback (most recent call last):
  File "masking_exceptions.py", line 21, in main
    nested()
  File "masking_exceptions.py", line 13, in nested
    cleanup()
  File "masking_exceptions.py", line 17, in cleanup
    raise RuntimeError('error from cleanup')
RuntimeError: error from cleanup

Even catching the second exception does not guarantee that the
original error message will be preserved.

#!/usr/bin/env python

import sys
import traceback

def throws():
    raise RuntimeError('error from throws')
    
def nested():
    try:
        throws()
    except:
        try:
            cleanup()
        except:
            pass # ignore errors in cleanup
        raise # we want to re-raise the original error

def cleanup():
    raise RuntimeError('error from cleanup')

def main():
    try:
        nested()
        return 0
    except Exception, err:
        traceback.print_exc()
        return 1

if __name__ == '__main__':
    sys.exit(main())

Here, even though we have wrapped the cleanup() call in an
exception handler that ignores the exception, the error in
cleanup() hides the original error because only one exception
context is maintained.

$ python masking_exceptions_catch.py
Traceback (most recent call last):
  File "masking_exceptions_catch.py", line 24, in main
    nested()
  File "masking_exceptions_catch.py", line 14, in nested
    cleanup()
  File "masking_exceptions_catch.py", line 20, in cleanup
    raise RuntimeError('error from cleanup')
RuntimeError: error from cleanup

A naive solution is to catch the original exception and retain it in a
variable, then re-raise it explicitly.

#!/usr/bin/env python

import sys
import traceback

def throws():
    raise RuntimeError('error from throws')
    
def nested():
    try:
        throws()
    except Exception, original_error:
        try:
            cleanup()
        except:
            pass # ignore errors in cleanup
        raise original_error

def cleanup():
    raise RuntimeError('error from cleanup')

def main():
    try:
        nested()
        return 0
    except Exception, err:
        traceback.print_exc()
        return 1

if __name__ == '__main__':
    sys.exit(main())

As you can see, this does not preserve the full traceback. The stack
trace printed does not include the throws() function at all, even
though that is the original source of the error.

$ python masking_exceptions_reraise.py
Traceback (most recent call last):
  File "masking_exceptions_reraise.py", line 24, in main
    nested()
  File "masking_exceptions_reraise.py", line 17, in nested
    raise original_error
RuntimeError: error from throws

A better solution is to re-raise the original exception first, and
handle the cleanup in a try:finally block.

#!/usr/bin/env python

import sys
import traceback

def throws():
    raise RuntimeError('error from throws')
    
def nested():
    try:
        throws()
    except Exception, original_error:
        try:
            raise
        finally:
            try:
                cleanup()
            except:
                pass # ignore errors in cleanup

def cleanup():
    raise RuntimeError('error from cleanup')

def main():
    try:
        nested()
        return 0
    except Exception, err:
        traceback.print_exc()
        return 1

if __name__ == '__main__':
    sys.exit(main())

This construction prevents the original exception from being
overwritten by the one raised during cleanup, and preserves the full
stack in the traceback.

$ python masking_exceptions_finally.py
Traceback (most recent call last):
  File "masking_exceptions_finally.py", line 26, in main
    nested()
  File "masking_exceptions_finally.py", line 11, in nested
    throws()
  File "masking_exceptions_finally.py", line 7, in throws
    raise RuntimeError('error from throws')
RuntimeError: error from throws

The extra indentation levels aren’t pretty, but they give the output
we want. The error reported is for the original exception, including
the full stack trace.

See also

Errors and Exceptions
The Python tutorial’s section on handling errors and exceptions in your code.
PyMOTW: exceptions
Python Module of the Week article about the exceptions module.
exceptions module
Standard library documentation about the exceptions module.
PyMOTW: logging
Python Module of the Week article about the logging module.
logging module
Standard library documentation about the logging module.

Writing Technical Documentation with Sphinx, Paver, and Cog

I’ve been working on the Python Module of the Week series since March
of 2007. During
the course of the project, my article style and tool chain have
both evolved. I now have a fairly smooth production process in
place, so the mechanics of producing a new post don’t get in the
way of the actual research and writing. Most of the tools are open
source, so I thought I would describe the process I go through and
how the tools work together.

Editing Text: TextMate

I work on a MacBook Pro, and use TextMate
for editing the articles and source for PyMOTW. TextMate is the one
tool I use regularly that is not open source. When I’m doing heavy
editing of hundreds of files for my day job I use Aquamacs Emacs, but TextMate is better suited for prose
editing and is easier to extend with quick actions. I discovered
TextMate while looking for a native editor to use for Python Magazine, and after being able to write my
own “bundle” to manage magazine articles (including defining a mode
for the markup language we use) I was hooked.

Some of the features that I like about TextMate for prose editing are
as-you-type spell-checking (I know some people hate this feature, but
I find it useful), text statistics (word count, etc.), easy block
selection (I can highlight a paragraph or several sentences and move
them using cursor keys), a moderately good reStructuredText mode
(emacs’ is better, but TextMate’s is good enough), paren and quote
matching as you type, and very simple extensibility for repetitive
tasks. I also like TextMate’s project management features, since they
make it easy to open several related files at the same time.

Version Control: svn

I started out using a private svn repository for all of my projects,
including PyMOTW. I’m in the middle of evaluating hosted DVCS
options for PyMOTW,
but still haven’t had enough time to give them all the research I
think is necessary before making the move. The Python core developers
are considering a similar move (PEP 374) so it will be interesting
to monitor that discussion.
No doubt we have different requirements (for example, they are hosting
their own repository), but the experiences with the various DVCS tools
will be useful input to my own decision.

Markup Language: reStructuredText

When I began posting, I wrote each article by hand using HTML. One of
the first tasks that I automated was the step of passing the source
code through pygments to produce a syntax colorized version. This
worked well enough for me at the time, but restricted me to producing
only HTML output. Eventually John Benediktsson contacted me with a
version of many of the posts converted from HTML to reStructuredText.

When reStructuredText was first put forward in the ’90s, I was
heavily into Zope development. As such, I was using StructuredText for documenting my
code, and in the Zope-based wiki that we ran at ZapMedia. I even
wrote my own app to extract
comments and docstrings to generate library documentation for a couple
of libraries I had released as open source. I really liked
StructuredText and, at first, I didn’t like reStructuredText.
Frankly, it looked ugly compared to what I was used to. It quickly
gained acceptance in the general community though, and I knew it would
give me options for producing other output formats for the PyMOTW
posts, so when John sent me the markup files I took another look.

While re-acquainting myself with reST, I realized two things. First,
although there is a bit more punctuation involved in the markup than
with the original StructuredText, the markup language was designed
with consistency in mind so it isn’t as difficult to learn as my first
impressions had led me to believe. Second, it turned out the part I
thought was “ugly” was actually the part that made reST more powerful
than StructuredText: it has a standard syntax for extension
directives that users can define for their own documents.

Markup to Output: Sphinx

Before I made a final decision on switching from hand-coded HTML to
reST, I needed a tool to convert to HTML (I still had to post the
results on the blog, after all, and Blogger doesn’t support reST). I
first tried David Goodger’s docutils package. The scripts it includes
felt a little too much like “pieces” of a tool rather than a complete
solution, though, and I didn’t really want to assemble my own wrappers
if I didn’t have to – I wanted to write text for this project, not
code my own tools. Around this time, Georg Brandl had made
significant progress on Sphinx, which
turned out to be a more complete turn-key system for converting a pile
of reST files to HTML or PDF. After a few hours of experimentation, I
had a sample project set up and was generating HTML from my documents
using the standard templates.

I decided that reStructuredText looked like the way to go.

HTML Templates: Jinja

My next step was to work out exactly how to produce all of the outputs
I needed from reST inputs. Each post for the PyMOTW series ends up
going to several different places:

  • the PyMOTW source distribution (HTML)
  • my Blogger blog (HTML)
  • the PyMOTW project site (HTML)
  • O’Reilly.com (HTML)
  • the PyMOTW “book” (PDF)

Each of the four HTML outputs uses slightly different formatting,
requiring separate templates (PDF is a whole different problem,
covered below). The source distribution and project site are both
full HTML versions of all of the documents, but use different
templates. I decided to use the default Sphinx templates for the
packaged version; I may change that later, but it works for the time
being, and it’s one less custom template to deal with. I wanted the
online version to match the appearance of the rest of my site, so I
needed to create a template for it. The two blogs use a third
template (O’Reilly’s site ignores a lot of the markup due to their
Movable Type configuration, but the articles come out looking good
enough so I can use the same template I use for my own blog without
worrying about a separate custom template).

Sphinx uses Jinja templates to produce
HTML output. The syntax for Jinja is very similar to Django’s
template language. As it happens, I use Django for the dynamic
portion of my web site that I host myself. I lucked out, and my
site’s base template was simple enough to use with Sphinx without
making any changes. Yay for compatibility!
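
Jinja’s template inheritance is the mechanism behind that kind of
reuse; a tiny standalone sketch of the idea, with the templates
defined inline instead of on disk:

from jinja2 import DictLoader, Environment

env = Environment(loader=DictLoader({
    'base.html': '<html><body>{% block body %}{% endblock %}</body></html>',
    'page.html': ('{% extends "base.html" %}'
                  '{% block body %}{{ content }}{% endblock %}'),
}))
print env.get_template('page.html').render(content='Hello, PyMOTW')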

Cleaning up HTML with BeautifulSoup

The blog posts need to be relatively clean HTML that I can upload to
Blogger and O’Reilly, so they could not include any html or
body tags or require any markup or styles not supported by either
blogging engine. The template I came up with is a stripped down
version that doesn’t include the CSS and markup for sidebars, header,
or footer. The result was almost exactly what I wanted, but had two
problems.

The easiest problem to handle was the permalinks generated by Sphinx.
After each heading on the page, Sphinx inserts an anchor tag with a ¶
character and applies CSS styles that hide/show the tag when the user
hovers over it. That’s a nice feature for the main site and packaged
content, but it didn’t work for the blogs. I have no control over
the CSS used at O’Reilly, so the tags were always visible. I didn’t
really care if they were included on the Blogger pages, so the
simplest thing to do was stick with one “blogging” template and remove
the permalinks.

The second, more annoying, problem was that Blogger wanted to insert
extra whitespace into the post. There is a configuration option on
Blogger to treat line breaks in the post as “paragraph breaks” (I
think they actually insert br tags). This is very convenient for
normal posts with mostly straight text, since I can simply write each
paragraph on one long line, wrapped visually by my editor, and break
the paragraphs where I want them. The result is I can almost post
directly from plain text input. Unfortunately, the option is applied
to every post in the blog (even old posts), so changing it was not a
realistic option – I wasn’t about to go back and re-edit every single
post I had previously written.

The second, more annoying, problem was that Blogger wanted to
insert extra whitespace into the post.

Sphinx didn’t have an option to skip generating the permalinks, and
there was no way to express that intent in the template, so I fell
back to writing a little script to strip them out after the fact. I
used BeautifulSoup
to find the tags I wanted removed, delete them from the parse tree,
then assemble the HTML text as a string again. I added code to the
same script to handle the whitespace issue by removing all newlines
from the input unless they were inside pre tags, which Blogger
handled correctly. The result was a single blob of partial HTML
without newlines or permalinks that I could post directly to either
blog without editing it by hand. Score a point for automation.

def clean_blog_html(body):
    # Clean up the HTML
    import re
    import sys
    from BeautifulSoup import BeautifulSoup
    from cStringIO import StringIO

    # The post body is passed to stdin.
    soup = BeautifulSoup(body)

    # Remove the permalinks to each header since the blog does not have
    # the styles to hide them.
    links = soup.findAll('a', attrs={'class':"headerlink"})
    [l.extract() for l in links]

    # Get BeautifulSoup's version of the string
    s = soup.__str__(prettyPrint=False)

    # Remove extra newlines.  This depends on the fact that
    # code blocks are passed through pygments, which wraps each part of the line
    # in a span tag.
    pattern = re.compile(r'([^s][^p][^a][^n]>)\n$', re.DOTALL|re.IGNORECASE)
    s = ''.join(pattern.sub(r'\1', l) for l in StringIO(s))

    return s

Code Syntax Highlighting: pygments

I wanted my posts to look as good as possible, and an important factor
in the appearance would be the presentation of the source code. I
adopted pygments in the early hand-coded
HTML days, because it was easy to integrate into TextMate with a
simple script.

pygmentize -f html -O cssclass=syntax $@

Binding the command to a key combination meant that with a few quick
keypresses I had HTML ready to insert into the body of a post.
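
The same transformation is available from Python through the pygments
API, which is roughly what the command above does; a small sketch:

from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

code = 'print "hello, world"'
# cssclass matches the -O cssclass=syntax option used on the command line.
print highlight(code, PythonLexer(), HtmlFormatter(cssclass='syntax'))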

When I moved to Sphinx, using pygments became even easier because
Sphinx automatically passes included source code through pygments as
it generates its output. Syntax highlighting works for HTML and PDF,
so I didn’t need any custom processing.

Automation: Paver

Automation is important for my sense of well being. I hate dealing
with mundane repetitive tasks, so once an article was written I didn’t
want to have to touch it to prepare it for publication of any of the
final destinations. As I have written before,
I started out using make to run various shell commands. I have
since converted the entire process to Paver.

Automation is important for my sense of well being.

The stock Sphinx integration that comes with Paver
didn’t quite meet my needs, but by examining the source I was able to
create my own replacement tasks in an afternoon. The main problem was
the tight coupling between the code to run Sphinx and the code to find
the options to pass to it. For normal projects with a single
documentation output format (Paver assumes HTML with a single config
file), this isn’t a problem. PyMOTW’s requirements are different,
with the four output formats discussed above.

In order to produce different output with Sphinx, you need different
configuration files. Since the base name for the file must always be
conf.py, that means the files have to be stored in separate
directories. One of the options passed to Sphinx on the command line
tells it the directory to look in for its configuration file. Even
though Paver doesn’t fork() before calling Sphinx, it still uses
the command line options to pass instructions.

Creating separate Sphinx configuration files was easy. The problem
was defining options in Paver to tell Sphinx about each configuration
directory for the different output. Paver options are grouped into
bundles, which are essentially a namespace. When a Paver task looks
for an option, it scans through the bundles, possibly cascading to the
global namespace, until it finds the option by name. The search can
be limited to specific bundles, so that the same option name can be
used to configure different tasks.

The html task from paver.doctools sets the options search order to
look for values first in the sphinx section, then globally. Once
it has retrieved the path values, via _get_paths(), it invokes
Sphinx.

def _get_paths():
    """look up the options that determine where all of the files are."""
    opts = options
    docroot = path(opts.get('docroot', 'docs'))
    if not docroot.exists():
        raise BuildFailure("Sphinx documentation root (%s) does not exist."
                           % docroot)
    builddir = docroot / opts.get("builddir", ".build")
    builddir.mkdir()
    srcdir = docroot / opts.get("sourcedir", "")
    if not srcdir.exists():
        raise BuildFailure("Sphinx source file dir (%s) does not exist"
                            % srcdir)
    htmldir = builddir / "html"
    htmldir.mkdir()
    doctrees = builddir / "doctrees"
    doctrees.mkdir()
    return Bunch(locals())

@task
def html():
    """Build HTML documentation using Sphinx. This uses the following
    options in a "sphinx" section of the options.

    docroot
      the root under which Sphinx will be working. Default: docs
    builddir
      directory under the docroot where the resulting files are put.
      default: build
    sourcedir
      directory under the docroot for the source files
      default: (empty string)
    """
    options.order('sphinx', add_rest=True)
    paths = _get_paths()
    sphinxopts = ['', '-b', 'html', '-d', paths.doctrees,
        paths.srcdir, paths.htmldir]
    dry("sphinx-build %s" % (" ".join(sphinxopts),), sphinx.main, sphinxopts)

This didn’t work for me because I needed to pass a separate
configuration directory (not handled by the default _get_paths())
and different build and output directories. The simplest solution
turned out to be re-implementing the Paver-Sphinx integration to make
it more flexible. I created my own _get_paths() and made it look
for the extra option values and use the directory structure I needed.

def _get_paths():
    """look up the options that determine where all of the files are."""
    opts = options

    docroot = path(opts.get('docroot', 'docs'))
    if not docroot.exists():
        raise BuildFailure("Sphinx documentation root (%s) does not exist."
                           % docroot)

    builddir = docroot / opts.get("builddir", ".build")
    builddir.mkdir()

    srcdir = docroot / opts.get("sourcedir", "")
    if not srcdir.exists():
        raise BuildFailure("Sphinx source file dir (%s) does not exist"
                            % srcdir)

    # Where is the sphinx conf.py file?
    confdir = path(opts.get('confdir', srcdir))

    # Where should output files be generated?
    outdir = opts.get('outdir', '')
    if outdir:
        outdir = path(outdir)
    else:
        outdir = builddir / opts.get('builder', 'html')
    outdir.mkdir()

    # Where are doctrees cached?
    doctrees = opts.get('doctrees', '')
    if not doctrees:
        doctrees = builddir / "doctrees"
    else:
        doctrees = path(doctrees)
    doctrees.mkdir()

    return Bunch(locals())

Then I defined a new function, run_sphinx(), to set up the options
search path, look for the option values, and invoke Sphinx. I set
add_rest to False to disable searching globally for an option to
avoid namespace pollution from option collisions, since I knew I was
going to have options with the same names but different values for
each output format. I also look for a “builder”, to support PDF
generation.

def run_sphinx(*option_sets):
    """Helper function to run sphinx with common options.

    Pass the names of namespaces to be used in the search path
    for options.
    """
    if 'sphinx' not in option_sets:
        option_sets += ('sphinx',)
    kwds = dict(add_rest=False)
    options.order(*option_sets, **kwds)
    paths = _get_paths()
    sphinxopts = ['',
                  '-b', options.get('builder', 'html'),
                  '-d', paths.doctrees,
                  '-c', paths.confdir,
                  paths.srcdir, paths.outdir]
    dry("sphinx-build %s" % (" ".join(sphinxopts),), sphinx.main, sphinxopts)
    return

With a working run_sphinx() function I could define several
Sphinx-based tasks, each taking options with the same names but from
different parts of the namespace. The tasks simply call
run_sphinx() with the desired namespace search path. For example,
to generate the HTML to include in the sdist package, the html
task looks in the html bunch:

@task
@needs(['cog'])
def html():
    """Build HTML documentation using Sphinx. This uses the following
    options in a "sphinx" section of the options.

    docroot
      the root under which Sphinx will be working.
      default: docs
    builddir
      directory under the docroot where the resulting files are put.
      default: build
    sourcedir
      directory under the docroot for the source files
      default: (empty string)
    doctrees
      the location of the cached doctrees
      default: $builddir/doctrees
    confdir
      the location of the sphinx conf.py
      default: $sourcedir
    outdir
      the location of the generated output files
      default: $builddir/$builder
    builder
      the name of the sphinx builder to use
      default: html
    """
    set_templates(options.html.templates)
    run_sphinx('html')
    return

while generating the HTML output for the website uses a different set
of options from the website bunch:

@task
@needs(['webtemplatebase', 'cog'])
def webhtml():
    """Generate HTML files for website.
    """
    set_templates(options.website.templates)
    run_sphinx('website')
    return

All of the option search paths also include the sphinx bunch, so
values that do not change (such as the source directory) do not need
to be repeated. The relevant portion of the options from the PyMOTW
pavement.py file looks like this:

options(
    # ...

    sphinx = Bunch(
        sourcedir=PROJECT,
        docroot = '.',
        builder = 'html',
        doctrees='sphinx/doctrees',
        confdir = 'sphinx',
    ),

    html = Bunch(
        builddir='docs',
        outdir='docs',
        templates='pkg',
    ),

    website=Bunch(
        templates = 'web',
        #outdir = 'web',
        builddir = 'web',
    ),

    pdf=Bunch(
        templates='pkg',
        #outdir='pdf_output',
        builddir='web',
        builder='latex',
    ),

    blog=Bunch(
        sourcedir=path(PROJECT)/MODULE,
        builddir='blog_posts',
        outdir='blog_posts',
        confdir='sphinx/blog',
        doctrees='blog_posts/doctrees',
    ),

    # ...
)

To find the sourcedir for the html task, _get_paths() first
looks in the html bunch, then the sphinx bunch.
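
The lookup order can be sketched with a pavement.py-style snippet
using the same option names (values shortened):

from paver.easy import Bunch, options

options(
    sphinx=Bunch(sourcedir='PyMOTW', docroot='.'),
    html=Bunch(builddir='docs', outdir='docs'),
)

options.order('html', 'sphinx', add_rest=False)
print options.get('outdir', '')     # found in the html bunch
print options.get('sourcedir', '')  # falls back to the sphinx bunch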

Capturing Program Output: cog

As an editor at Python Magazine, and reviewer for several books, I’ve
discovered that one of the most frequent sources of errors with
technical writing occurs in the production process where the output of
running sample code is captured to be included in the final text.
This is usually done manually by running the program and copying and
pasting its output from the console. It’s not uncommon for a bug to
be found, or a library to change, requiring a change in the source
code provided with the article. That change, in turn, means the
output of commands may be different. Sometimes the change is minor,
but at other times the output is different in some significant way.
Since I’ve seen the problem come up so many times, I spent time
thinking about and looking for a solution to avoid it in my own work.

During my research, a few people suggested that I switch to using
doctests for my examples, but I felt there were several problems with
that approach. First, the doctest format isn’t very friendly for
users who want to copy and paste examples into their own scripts. The
reader has to select each line individually, and can’t simply grab the
entire block of code. Distributing the examples as separate scripts
makes this easier, since they can simply copy the entire file and
modify it as they want. Using individual .py files also makes it
possible for some of the more complicated examples to run clients and
servers at the same time from different scripts (as with
SimpleXMLRPCServer, for
example). But most importantly, using doctests does not solve the
fundamental problem. Doctests tell me when the output has changed,
but I still have to manually run the scripts to generate that output
and paste it into my document in the first place. What I really
wanted to be able to do was run the script and insert the output,
whatever it was, without manually copying and pasting text from the
console.

I finally found what I was looking for in cog, from Ned Batchelder. Ned
describes cog as a “code generation tool”, and most of the examples he
provides on his site are in that vein. But cog is a more general
purpose tool than that. It gives you a way to include arbitrary
Python instructions in your source document, have them executed, and
then have the source document change to reflect the output.
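
For readers who have not seen cog before, a minimal, hypothetical block
using cog's default markers looks something like this; the generator code
is written as reST comments, and the lines between the closing marker and
the end marker are rewritten each time cog runs.

.. [[[cog
.. for name in ('first', 'second'):
..     cog.outl('- item %s' % name)
.. ]]]
- item first
- item second
.. [[[end]]]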

For each code sample, I wanted to include the Python source followed
by the output it produces when run on the console. There is a reST
directive to include the source file, so that part is easy:

.. include:: anydbm_whichdb.py
    :literal:
    :start-after: #end_pymotw_header

The include directive tells Sphinx that the file
“anydbm_whichdb.py” should be treated as a literal text block (instead
of more reST) and to only include the parts following the last line of
the standard header I use in all my source code. Syntax highlighting
comes for free when the literal block is converted to the output
format.

Grabbing the command output was a little trickier. Normally with cog,
one would embed the actual source to be run in the document. In my
case, I had the text in an external file. Most of the source is
Python, and I could just import it, but I would have to go to special
lengths to capture any output and pass it to cog.out(), the cog
function for including text in the processed document. I didn’t want
my example code littered with calls to cog.out() instead of
print, so I needed to capture sys.stdout and sys.stderr. A bigger
question was whether I wanted to have all of the sample files imported
into the namespace of the build process. Considering both issues, it
made sense to run the script in a separate process and capture the
output.

There is a bit of setup work needed to run the scripts this way, so I
decided to put it all into a function instead of including the
boilerplate code in every cog block. The reST source for running
anydbm_whichdb.py looks like:

.. {{{cog
.. cog.out(run_script(cog.inFile, 'anydbm_whichdb.py'))
.. }}}
.. {{{end}}}

The .. at the start of each line causes the reStructuredText
parser to treat the line as a comment, so it is not included in the
output. After passing the reST file through cog, it is rewritten to
contain:

.. {{{cog
.. cog.out(run_script(cog.inFile, 'anydbm_whichdb.py'))
.. }}}

::

    $ python anydbm_whichdb.py
    dbhash

.. {{{end}}}

The run_script() function runs the Python script it is given, adds
a prefix to make reST treat the following lines as literal text, then
indents the script output. The script is run via Paver’s sh()
function, which wraps the subprocess module and supports the dry-run
feature of Paver. Because the cog instructions are comments, the only
part that shows up in the output is the literal text block with the
command output.

def run_script(input_file, script_name,
                interpreter='python',
                include_prefix=True,
                ignore_error=False,
                trailing_newlines=True,
                ):
    """Run a script in the context of the input_file's directory,
    return the text output formatted to be included as an rst
    literal text block.

    Arguments:

     input_file
       The name of the file being processed by cog.  Usually passed as cog.inFile.

     script_name
       The name of the Python script living in the same directory as input_file to be run.
       If not using an interpreter, this can be a complete command line.  If using an
       alternate interpreter, it can be some other type of file.

     include_prefix=True
       Boolean controlling whether the :: prefix is included.

     ignore_error=False
       Boolean controlling whether errors are ignored.  If not ignored, the error
       is printed to stdout and then the command is run *again* with errors ignored
       so that the output ends up in the cogged file.

     trailing_newlines=True
       Boolean controlling whether trailing newlines are added to the output.
       If False, the output is passed to rstrip() and then one newline is added.
       If True, newlines are added to the output until it ends with two newlines.
    """
    rundir = path(input_file).dirname()
    if interpreter:
        cmd = '%(interpreter)s %(script_name)s' % vars()
    else:
        cmd = script_name
    real_cmd = 'cd %(rundir)s; %(cmd)s 2>&1' % vars()
    try:
        output_text = sh(real_cmd, capture=True, ignore_error=ignore_error)
    except Exception, err:
        print '*' * 50
        print 'ERROR run_script(%s) => %s' % (real_cmd, err)
        print '*' * 50
        output_text = sh(real_cmd, capture=True, ignore_error=True)
    if include_prefix:
        response = '\n::\n\n'
    else:
        response = ''
    response += '\t$ %(cmd)s\n\t' % vars()
    response += '\n\t'.join(output_text.splitlines())
    if trailing_newlines:
        while not response.endswith('\n\n'):
            response += '\n'
    else:
        response = response.rstrip()
        response += '\n'
    return response

I defined run_script() in my pavement.py file, and added it to the
__builtins__ namespace to avoid having to import it each time I
wanted to use it from a source document.
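
The registration itself isn't shown above; a minimal sketch of one way to
do it under Python 2 (which is what this code targets) follows. The
reliable approach is the __builtin__ module rather than the __builtins__
name, which may be either a module or a dict depending on context.

# Sketch: expose run_script() (defined above) to every cog block
# without requiring an explicit import in each .rst file.
import __builtin__
__builtin__.run_script = run_script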

A somewhat more complicated example shows another powerful feature of
cog. Because it can run any arbitrary Python code, it is possible to
establish the preconditions for a script before running it. For
example, anydbm_new.py assumes that its output database does not
already exist. I can ensure that condition by removing it before
running the script.

.. {{{cog
.. workdir = path(cog.inFile).dirname()
.. sh("cd %s; rm -f /tmp/example.db" % workdir)
.. cog.out(run_script(cog.inFile, 'anydbm_new.py'))
.. }}}
.. {{{end}}}

Since cog is integrated into Paver, all I had to do to enable it was
define the options and import the module. I chose to change the begin
and end tags used by cog because the default patterns ([[[cog and
]]]) appeared in the output of some of the scripts (printing
nested lists, for example).

cog=Bunch(
    beginspec='{{{cog',
    endspec='}}}',
    endoutput='{{{end}}}',
),

To process all of the input files through cog before generating the
output, I added ‘cog’ to the @needs list for any task running
sphinx. Then it was simply a matter of running paver html or paver
webhtml
to generate the output.

Paver includes an uncog task to remove the cog output from your
source files before committing to a source code repository, but I
decided to include the cogged values in committed versions so I would
be alerted if the output ever changed.
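
For projects that prefer to keep generated output out of version control,
stripping it before a commit should be a single task invocation:

$ paver uncog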

Generating PDF: TeX Live

Generating HTML using Sphinx and Jinja templates is fairly
straightforward; PDF output wasn’t quite so easy to set up. Sphinx
actually produces LaTeX, another text-based format, as output, along
with a Makefile to run third-party LaTeX tools to create the PDF. I
started out experimenting on a Linux system (normally I use a Mac, but
this box claimed to have the required tools installed). Due to the
age of the system, however, the tools weren’t compatible with the
LaTeX produced by Sphinx. After some searching, and asking on the
sphinx-dev mailing list, I installed a copy of TeX Live, a newer TeX distro. A few tweaks to
my $PATH later and I was in business building PDFs right on my
Mac.

Generating HTML using Sphinx and Jinja templates is fairly
straightforward; PDF output wasn’t quite so easy to set up.

My pdf task runs Sphinx with the “latex” builder, then runs
make using the generated Makefile.

@task
@needs(['cog'])
def pdf():
    """Generate the PDF book.
    """
    set_templates(options.pdf.templates)
    run_sphinx('pdf')
    latex_dir = path(options.pdf.builddir) / 'latex'
    sh('cd %s; make' % latex_dir)
    return

I still need to experiment with some of the LaTeX options, including
templates for pages in different sizes, logos, and styles. For now
I’m happy with the default look.

Releasing

Once I had the “build” fully automated, it was time to address the
distribution process. For each version, I need to:

  • upload HTML, PDF, and tar.gz files to my server
  • update PyPI
  • post to my blog
  • post to the O’Reilly blog

The HTML and PDF files are copied to my server using rsync, invoked
from Paver. I upload the tar.gz file containing the source distribution
manually, using a web browser and the admin interface for
django-codehosting. That will be
automated, eventually. Once the tar.gz is available, PyPI can be
updated via the built-in paver register task. That just leaves the
two blog posts.
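
The rsync invocation itself isn't shown in this article. A rough sketch of
how such a task could be wired into pavement.py follows; the remote host
and path are placeholders, and the real installwebsite task may be
organized differently.

# Hypothetical sketch only: copy the generated site to a web server.
# The destination below is a placeholder, not the real server path.
@task
@needs(['webhtml', 'pdf'])
def installwebsite():
    """Rebuild and copy website files to the remote server.
    """
    sh('rsync -av %s/html/ www.example.com:/var/www/pymotw/' %
       options.website.builddir)
    return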

For my own blog, I use MarsEdit to post and edit entries. I
find the UI easy to use, and I like the ability to work on drafts of
posts offline. It is much nicer than the web interface for Blogger,
and has the benefit of being AppleScript-able. I have plans to
automate all of the steps right up to actually posting the new blog
entry, but for now I copy the generated blog entry into a new post
window by hand.

O’Reilly’s blogging policy does not allow desktop clients (too much of
a support issue for the tech staff), so I need to use their Movable
Type web UI to post. As with MarsEdit, I simply copy the output and
paste it into the field in the browser window, then add tags.

Tying it All Together

A quick overview of my current process is:

  1. Pick a module, research it, and write examples in reST and Python.
    Include the Python source and use cog directives to bring in the
    script output.
  2. Use the command “paver html” to produce HTML output to verify the
    results look good and I haven’t messed up any markup.
  3. Commit the changes to svn. When I’m done with the module, copy the
    “trunk” to a release branch for packaging.
  4. Use “paver sdist” to create the tar.gz file containing the Python
    source and HTML documentation.
  5. Upload the tar.gz file to my site.
  6. Run “paver installwebsite” to regenerate the hosted version of the
    HTML and the PDF, then copy both to my web server.
  7. Run “paver register” to update PyPI with the latest release
    information.
  8. Run “paver blog” to generate the HTML to be posted to the blogs.
    The task opens a new TextMate window containing the HTML so it is
    ready to be copied.
  9. Paste the blog post contents into MarsEdit, add tags, and send it
    to Blogger.
  10. Paste the blog post contents into the MT UI for O’Reilly, add
    tags, verify that it renders properly, then publish.

Try It Yourself

All of the source for PyMOTW (including the pavement.py file with
configuration options, task definitions, and Sphinx integration) is
available from the PyMOTW web site. Sphinx, Paver, cog, and
BeautifulSoup are all open source projects. I’ve only tested the
PyMOTW “build” on Mac OS X, but it should work on Linux without any
major alterations. If you’re on Windows, let me know if you get it
working.

Originally published on my blog, 2 February 2009

Converting from Make to Paver

As I briefly mentioned in an earlier post, I recently moved the
PyMOTW build from make to Kevin Dangoor’s Python-based build tool
Paver. I had been wanting to try Paver out for a while, especially
since seeing Kevin’s presentation at PyWorks in November. As a long-time
Unix/Linux user, I didn’t have any particular problems with make, but
Paver looked intriguing. PyMOTW is one of the few projects I have with a
significant build process (beyond simply creating the source tarball),
so it seemed like a good candidate for experimentation.

Concepts

The basic concepts with Paver and make are the same. Make “targets”
correspond roughly to Paver “tasks”. Paver places less emphasis on file
modification time-stamps, though, so tasks are all essentially “.PHONY”
targets. As with make, Paver keeps track of which dependencies are
executed so they are not repeated while building any one target.

Tasks are implemented as simple Python functions. Paver starts by
loading pavement.py, and tasks can be defined inline there or you can
import code from elsewhere if needed. According to Kevin, once the main
engine settles down enough to reach a 1.0 release, he doesn’t anticipate
a lot of active development on the core. Recipes for extending Paver can
be added easily through external modules which would be distributed
separately.
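
As a concrete (and entirely hypothetical) illustration of how little is
needed, a bare-bones pavement.py might look like the following. Recent
Paver releases expose the helpers through paver.easy; the 0.x series used
a different import path.

# Minimal illustrative pavement.py, not the PyMOTW one.
from paver.easy import *   # task, needs, options, sh, path, Bunch, ...

options(
    build=Bunch(
        outdir='build',
    ),
)

@task
def clean():
    """Remove the build output directory."""
    outdir = path(options.build.outdir)
    if outdir.exists():
        outdir.rmtree()

@task
@needs(['clean'])
def build():
    """Rebuild everything from scratch."""
    outdir = path(options.build.outdir)
    outdir.mkdir()
    sh('echo building into %s' % outdir)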

Building a Source Distribution

The most important target from the old PyMOTW Makefile was “package”.
It ran sphinx-build to create the HTML version of the documentation then
produced a versioned source distribution with distutils. The whole thing
was bundled up and dropped on my computer desktop, ready to be uploaded
to my web site.

package: clean html
    rm -f setup.py
    $(MAKE) setup.py
    rm -f MANIFEST MANIFEST.in
    $(MAKE) MANIFEST.in
    python setup.py sdist --force-manifest
    mv dist/*.gz ~/Desktop/

Paver sits on top of distutils, so one of the pre-defined tasks it has
built-in is “sdist” (similar to python setup.py sdist, for producing
source distributions of Python apps or libraries). In my case, I
extended the task definition to perform some pre-requisite tasks and
move the tarball out of the build directory onto the desktop of my
computer, to make it easier to upload to my web site.

Let’s look at the task definition:

@task
@needs(['generate_setup', 'minilib', 'html_clean', 'setuptools.command.sdist'])
def sdist():
    """Create a source distribution.
    """
    # Copy the output file to the desktop
    dist_files = path('dist').glob('*.tar.gz')
    dest_dir = path(options.sdist.outdir).expanduser()
    for f in dist_files:
        f.move(dest_dir)
    return

The @task decorator registers the function as a task, tying it in to
the list of options available from the command line. The docstring for
the function is included in the help output (paver help or
paver --help-commands).

$ paver help
---> help
Paver 0.8.1

Usage: paver [global options] [option.name=value] task [task options] [task...]

Run 'paver help [section]' to see the following sections of info:

options    global command line options
setup      available distutils/setuptools tasks
tasks      all tasks that have been imported by your pavement

'paver help taskname' will display details for a task.

Tasks defined in your pavement:
  blog            - Generate the blog post version of the HTML for the current module
  html            - Build HTML documentation using Sphinx
  html_clean      - Remove sphinx output directories before building the HTML
  installwebsite  - Rebuild and copy website files to the remote server
  pdf             - Generate the PDF book
  sdist           - Create a source distribution
  webhtml         - Generate HTML files for website
  website         - Create local copy of website files
  webtemplatebase - Import the latest version of the web page template from the source

To run the task, pass the name as argument to paver on the command
line:

$ paver sdist

Prerequisites

The @needs decorator specifies the prerequisites for a task, listed in
order and identified by name. Paver prerequisites correspond to make
dependencies, and are all run before the task, as you would expect.

In the Makefile, before building a source distribution I always ran
the clean and html targets, too. That meant I had a fresh copy of
the HTML version of PyMOTW, generated by sphinx. The next step was to
build setup.py using a simple input template file processed by sed (so I
didn’t have to remember to edit the version and download URL every
time).

Paver provides a task to generate a setup.py (generate_setup),
so I no longer need to mess around with templates on my own. The
minilib task writes a ZIP archive with enough of Paver to support
installation through the usual python setup.py install or
easy_install PyMOTW incantations.

Notice that setuptools.command.sdist is the fully qualified name to
the task being redefined locally. That means in this case the standard
work for sdist() (producing the source distribution) is run prior to
invoking my override function.

I’ve defined an html_clean() task in pavement.py to take the
place of the old make targets clean and html:

@task
def html_clean():
    """Remove sphinx output directories before building the HTML.
    """
    remake_directories(options.sphinx.doctrees, options.html.outdir)
    call_task('html')
    return

remake_directories() is a simple Python function I’ve written to
remove the directories passed to it and then recreate them, empty. It is
the equivalent of rm -r followed by mkdir. It’s not strictly
necessary, but I’m paranoid about old versions of renamed files ending
up in my generated output, so I always start with an empty directory.
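
The function isn't reproduced in this article, but based on that
description it amounts to something like this sketch (using the path class
that Paver provides):

# Sketch of remake_directories(): remove each directory if it exists,
# then recreate it empty.
def remake_directories(*dirnames):
    """Remove the directories and recreate them, empty."""
    for name in dirnames:
        d = path(name)
        if d.exists():
            d.rmtree()
        d.makedirs()
    return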

Working with Files

Paver’s standard library includes Jason Orendorff’s path library
for working with directories and files. Using the library, paths are
objects with methods (instead of just strings). Methods include
finding the directory name for a file, getting the contents of a
directory, removing a file, etc. – the sorts of things you want to do
with files and directories. One especially nice feature is the /
operator, which works like os.path.join(). It is simple to
construct a path using components in separate variables, joining them
with /.
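
A few illustrative operations (the file names here are made up):

# Illustrative use of the path class; names are examples only.
p = path('docs') / 'PyMOTW' / 'anydbm' / 'index.html'
print p.dirname()     # docs/PyMOTW/anydbm
print p.ext           # .html
print p.exists()      # False until the HTML has been built

# Walk a tree looking for reST inputs:
for rst_file in path('PyMOTW').walkfiles('*.rst'):
    print rst_file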

The sdist() function is responsible for copying the packaged source
distribution to my desktop. It starts by using path’s globbing support
to build a list of the .tar.gz files created by
setuptools.command.sdist(). (There should only be one file, but predicting
the name is more difficult than just using globbing.) The destination
directory is configured through the options Bunch (a dictionary-like
object with attribute look-up for the keys). Since the value of the
option might include ~, I expand it before using it as the
destination directory for the file move operation.

Options

Make is usually configured via the shell environment and variables
within the Makefile itself. Paver uses a collection of Bunch objects
organized into a hierarchical namespace. The options can be set to
static literal values, computed at runtime using normal Python
expressions, or overridden from the command line.

Each task can define its own section of the namespace with options.
Some underlying recipes (especially the distutils and sphinx
integration) depend on a specific structure, documented with the
relevant task documentation. For example, these are the settings I use
when running sphinx:

sphinx = Bunch(
        sourcedir='PyMOTW',
        docroot = '.',
        builder = 'html',
        doctrees='sphinx/doctrees',
        confdir = 'sphinx',
    ),

Tasks access the options using dot-notation starting with the root of
the namespace. For example, options.sphinx.doctrees.
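
And, following the option.name=value form shown in the paver help output
earlier, a value can be overridden for a single run from the command line.
For example, to redirect the PDF build into a scratch directory (a made-up
path):

$ paver pdf.builddir=/tmp/pdf-test pdf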

Running Shell Commands

Even with the power of Python as a programming language, sometimes it
is necessary to shell-out to run an external program. Paver makes that
very easy. sh() wraps Python’s standard library module subprocess to
make it easier to work with for the sorts of use cases commonly found in
a build system. Simply pass a string containing the shell command you
want run, and optionally include the capture argument to have it
return the output text (useful for commands like svnversion).
sh() takes care of running the command, or in dry-run mode printing
the command it would have run.

For example, the last step of building the PyMOTW PDF requires running
a target included in the Makefile generated by Sphinx.

latex_dir = path(options.pdf.builddir) / 'latex'
sh('cd %s; make' % latex_dir)
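
When the output is needed back in Python rather than on the console, pass
capture=True; the svnversion case mentioned above would look roughly like
this:

# Capture the command output as a string instead of echoing it.
version = sh('svnversion .', capture=True).strip()
print 'packaging revision', version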

Sphinx and Cog Integration

Paver also includes built-in support for Sphinx. The standard
integration with Sphinx supports producing HTML output. You can
configure many of the Sphinx options you would normally put in a conf.py
file directly through Paver’s pavement.py. I had to override the way
Sphinx is run by default, because I want to produce 3 different versions
of HTML output (using different templates) and the PDF, but simpler
projects won’t have to do much more than set up the location of the
input files.

In addition to Sphinx, Paver integrates Ned Batchelder’s Cog, a
templating/macro definition tool that lets you generate part of your
documentation on the fly from arbitrary Python code. I’ve done some work
to have cog run the PyMOTW examples and insert the output into the .rst
file before passing it to Sphinx to be converted to HTML or PDF. The
process is complicated enough to warrant its own post, though, so that
will have to wait for another day.

Conclusions

Paver is a useful alternative to make, especially for Python-based
packages. The default integration with distutils makes it very easy to
get started. Build environments requiring a lot of external shell calls
may find Makefiles easier to deal with. In my case, I was able to fold
a couple of small Python scripts into the pavement.py file, so I
eliminated a few separate tools.

It’s hard to say whether a pavement file is “simpler” than a Makefile.
Task definitions do not tend to be shorter than make targets, but the
verbosity is an artifact of Python (function definitions and decorators,
etc.) rather than anything inherent in the way Paver is designed.

A typical Paver configuration file is likely to be more portable than
a Makefile, so that may be something to take into account. With file
operations easily accessible in a portable library, it should be easy to
set up your pavement.py to work on any OS.

For the complete pavement.py file used by PyMOTW, grab the latest
release from the web site.

Automated Testing with unittest and Proctor

Automated testing is an important part of Agile development
methodologies, and the practice is seeing increasing adoption even in
environments where other Agile tools are not used. This article
discusses testing techniques for you to use with the open source tool
Proctor. By using Proctor, you will not only manage your automated
test suite more effectively, but you will also obtain better results
in the process.


Command line programs are classes, too!

Most OOP discussions focus on GUI or domain-specific development
areas, completely ignoring the workhorse of computing: command line
programs. This article examines CommandLineApp, a base class for
creating command line programs as objects, with option and argument
validation, help text generation, and more.


Caching RSS Feeds With feedcache

The past several years have seen a steady increase in the use of RSS
and Atom feeds for data sharing. Blogs, podcasts, social networking
sites, search engines, and news services are just a few examples of data
sources delivered via such feeds. Working with internet services
requires care, because inefficiencies in one client implementation may
cause performance problems with the service that can be felt by all of
the consumers accessing the same server. In this article, I describe the
development of the feedcache package, and give examples of how you can
use it to optimize the use of data feeds in your application.
