Switching blogging platforms. Again.

Over the past week or so I have converted all of my blog content to
reStructuredText and replaced the WordPress instance I was using
with static files. It took me a little while to find the right
combination of tools, and I finally settled on using Tinkerer and
Sphinx with Python 3.

When I started blogging in 2006, I chose blogger.com to host my
site. I didn’t want to host my own server and manage the software, and
Blogger was run by Google so I expected it to be around and usable
forever. I also had a separate website where I posted some of the
longer articles, such as my Python Magazine columns and feature articles, book reviews, and other items
that used a lot of source code for examples. I managed that content
with Sphinx, and posted references to the articles on the blog so
they would go out through the RSS feed.

I eventually grew dissatisfied with the web-based editor provided by
Blogger, and switched to MarsEdit. When I saw the light and started
writing posts in reStructuredText instead of HTML, I created
rst2marsedit so I could keep using the desktop client to preview and
publish my posts. And then when I was working as the Communications
Director for the PSF I created rst2blogger to make that work easier
for myself and the rest of the team, many of whom didn’t have
MarsEdit.

During all that time, Blogger happily served my content and provided
backing for the RSS feeds that are the heartbeat of a blog. At some
point, though, I noticed that the blog wasn’t looking very good on
mobile devices. I don’t even remember the specific details, but the
options for managing and editing the theme at the time made me decide
that continuing with Blogger and supporting mobile content with nicely
formatted source code embedded in blog posts was going to be a
hassle.

I looked around at static content generating tools at that point, but
didn’t find anything I really liked. I had the idea that I really
needed something that supported scheduled posts, which would let me
write content on the weekend and publish it on Monday morning. I had
used this feature of Blogger for a long time with the Python Module of
the Week posts, and had come to rely on it as part of my
workflow. None of the static site generators supported that, of
course, because they all just write HTML files. The best answer I
found was to write cron jobs to deploy new content at scheduled times.
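For example, a single crontab entry would have covered my case, pushing a pre-built post live on Monday morning (the schedule and paths here are hypothetical):

0 7 * * 1   rsync -a ~/blog/output/ example.com:/var/www/blog/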

Moving to WordPress

About 2 years ago, I changed jobs to DreamHost, and one of our
primary services is hosting WordPress-based websites. It made sense to
me to give WordPress a try, and it seemed to meet all of my needs – a
fairly nice default theme that supported desktops and mobile devices,
scheduled posts, and per-tag RSS feeds for sending to aggregation
sites. I imported my Blogger content into WordPress, updated the
domain settings to point to the new server, and kept blogging. After
the initial work to set up the theme, nothing was really that
different. I still used rst2marsedit to post, so my day-to-day
interface was exactly the same. After a few minor customizations for
source code listings, the default worked well.

I downloaded the WordPress apps for my phone and tablet, in case I
wanted to blog at a conference. That turned out to be pointless,
though, because when I do write, I tend to write long-form posts, and
I wasn’t comfortable writing that much on either mobile device. The apps
didn’t cause any problems, but they weren’t particularly useful for
me.

Trouble Begins

A month or two after I set up the new site, I received an automated
notice that my VPS had been rebooted because it was using too many
resources. I’m used to working with unmanaged cloud servers, but
hadn’t had this experience with a traditional managed hosting service
before. Basically, because the WordPress service was using up too much
CPU, it either crashed the VPS or the VPS was restarted to terminate
the process. I increased the size of the server a couple of times
before things stabilized.

I think the problem had to do with some search engine spiders hitting
the site all at the same time, but I’m not certain. It is entirely
likely that if I spent the time to figure out what was causing the
problem, I could have added caching or tuned some configuration
settings to make the site behave better. But I really didn’t want to
have to figure all of that out. Some people enjoy tweaking and tuning
and fiddling with services constantly, but that’s not for me. I want
to write, not run blogging software.

For a while I ran a larger VPS to handle the spikes in traffic and
just lived with the situation. I had other things on my mind, other
projects, and it was working well enough. But then some update or
other broke my custom style sheet, so all of the content on the site
looked terrible – it’s very difficult to make sense of unfamiliar
Python code when the indentation has been stripped out. Figuring out
what caused that, how to fix it, and how to prevent a recurrence was going
to be a lot of hassle – the same thing that pushed me off of Blogger
in the first place.

Other Options

So a few weeks ago I started looking around at static site building
tools again. I had a few basic requirements. I wanted something
written in Python, in case I needed to extend it. I wanted to write
content in reStructuredText, since I am comfortable with it and can
extend it if needed. And I need tag-specific RSS feeds, for
aggregation on Planet Python and Planet OpenStack. I stopped
worrying about scheduled posts. Although traffic patterns for my site
trend down on the weekend and up during the week, I don’t plan to let
that control my publishing schedule any more. For aspects like
supporting mobile devices, I planned to find or customize a theme.

When I asked for suggestions on Twitter, the most popular response was
Pelican, so that’s where I started. The documentation is clear and
extensive. I was able to set up a new blog instance on my laptop
fairly quickly. There is a tool for converting a WordPress blog
archive file to reStructuredText files to import into the new
blog. The results still needed a fair amount of cleanup, but after
having already converted the content between two other blogging
systems, that wasn’t a surprise.

There were a few hiccups, though. According to the Pelican docs,
activating syntax highlighting requires using a code-block tag,
and it wasn’t clear whether regular literal blocks would work. Since
the converter created plain literal blocks (marked with ::), that
would mean a lot of hand-editing to add syntax highlighting back. I
had trouble with some of the tags in the articles for which I had reST
source files, both blog entries and magazine columns. I had a few
cssclass directives, for example, and there were a few others. I
could update all of them, which would be annoying but not difficult,
but I wasn’t ready to stop looking and make that commitment yet.

Julien Danjou recommended Hyde, so I looked at that next. I wasn’t
able to find good documentation, and based on the output of the quick
start it looked like I would be writing in a combination of markdown
and HTML templates. That didn’t seem like the right direction for me,
especially since I already had a lot of reST content that I wanted to
import.

Fiddling with Tinkerer

I have extensive experience with Sphinx, so what I really wanted was
a tool that would let me use that knowledge and the tools I have
already built and add the pieces that are missing to make a blog. I
knew there was an extension in the sphinx-contrib repository for
creating an RSS feed, but I needed separate feeds for different tags
or categories. Jeff Forcier sent me a link to Tinkerer, and after
reading the docs I knew I would be able to make it do what I wanted.

The Pelican exporter had left all of the articles in one directory,
giving each a unique file name. Tinkerer wanted the input files
organized in a directory structure by year/month/day, so I needed to
move the files around. I wrote a simple bash script to iterate over
all of the files, extract the publication date from the metadata
Pelican had written inside, and then copy the file into the right
directory structure for Tinkerer.
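That script was specific to my layout, but a rough Python equivalent of the same idea looks like this (Pelican records the publication date in a :date: metadata field; the directory names are assumptions):

import os
import re
import shutil

SRC = 'pelican_output'   # hypothetical directory of exported .rst files
DST = 'blog'             # hypothetical Tinkerer source directory

# Pelican writes the publication date into the file's metadata.
date_re = re.compile(r'^:date:\s+(\d{4})-(\d{2})-(\d{2})', re.MULTILINE)

for name in os.listdir(SRC):
    if not name.endswith('.rst'):
        continue
    filename = os.path.join(SRC, name)
    with open(filename) as f:
        match = date_re.search(f.read())
    if match is None:
        print('no date found in %s' % name)
        continue
    year, month, day = match.groups()
    outdir = os.path.join(DST, year, month, day)
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    shutil.copy(filename, outdir)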

Tinkerer’s tinker command line tool also manages the list of items
in the master table of contents, used to control the order of articles
shown on the site and in the RSS feed. Because I was importing my
articles from scratch, I had an empty file, but I was able to create
the initial list with find and sed. At that point I was able
to run tinker --build with the default theme, and work on cleaning
up the results.
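I no longer have the exact find and sed command, but it was along these lines (the indented, extension-less path format Tinkerer expects in the master file is an assumption here):

$ find 2006 2007 2008 -name '*.rst' | sed -e 's/\.rst$//' -e 's/^/   /' >> master.rst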

Importing the Content

I edited the source files by hand and with Unix tools like sed to
correct formatting errors from the content exported from
WordPress. For some of the longer articles, I was glad to be able to
replace an exported file with the original source file from the old
version of the site, but a lot of the changes were the same (removing
certain directives and metadata added by the exporter) so plain text
files and a few standard Unix tools once again proved their utility.

After I cleaned up all of the files enough that the build worked, I
made another pass to adjust the formatting to be more consistent and
to remove artifacts left by the importer that didn’t break the build
(many many raw HTML blocks). Then, I installed
sphinxcontrib-spelling and fixed all of the errors it
reported. Finally, I replaced the contents of the entries related to
Python Module of the Week with references to the appropriate pages on
PyMOTW.com, so I only have one copy to keep up to date. It took a
few mornings, but I finally had a clean set of around 500 source
files.
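Enabling the checker is ordinary Sphinx extension configuration; a minimal sketch:

# conf.py
extensions = ['sphinxcontrib.spelling']

# The report comes from running the "spelling" builder, e.g.:
#   sphinx-build -b spelling . _build/spelling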

The Blogger and WordPress configurations I had been using included the
year and month in the URL to a post, but not the day. Tinkerer, and
especially the RSS feed generator plugin for Tinkerer, want the URLs
to include the day of the month as well. I made a list of all of the
HTML files and passed it through sed to generate the Apache
redirect rules to put in a .htaccess file at the root of the site.
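The sed script itself isn’t very interesting, but the idea is easy to show as a Python sketch (assuming new paths like 2013/02/05/some-post.html, with the old URLs simply omitting the day):

import sys

for line in sys.stdin:
    new_path = line.strip()              # e.g. 2013/02/05/some-post.html
    parts = new_path.split('/')
    if len(parts) != 4:
        continue
    year, month, day, slug = parts
    old_path = '/'.join((year, month, slug))
    print('Redirect permanent /%s /%s' % (old_path, new_path))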

At this point I had all of the existing content building, so the next
step was to make the output look the way I wanted by updating the
theme.

Creating a Theme

I experimented with a few of the standard themes, but decided that
none quite fit what I wanted. Everyone has a theme based on Bootstrap,
so I didn’t want to use one of those. I did want something that would
scale to different sizes of screens, though, so I started with
the boilerplate theme from Tinkerer, but modified it heavily using the
Pure CSS tools from Yahoo to provide the layout I wanted.

Next, I spent some time looking for color schemes on colrd.com,
until I found some that I liked. I learned about font-awesome (used
in Bootstrap) and the WebSymbolsRegular font included with
Tinkerer’s themes – both useful for including icons in responsive
designs. I’m no CSS expert, but with this combination of tools I was
able to create a theme that I liked, that looked good with the content
I have and expect to add, and that worked on mobile devices. Not
everything was perfect, though.

Negatives

There were a few aspects of Tinkerer’s default behaviors that I didn’t
like. First, extensions aren’t really installed so much as “vendored”
into your site’s code tree. That means I won’t have problems if a new
version of an extension I use is released with a change that isn’t
backwards-compatible, but it also means if there are bug fixes I will
have to handle the update myself without tools like pip.

By default Tinkerer disables the per-heading permalink feature of
Sphinx, and I wanted that left on. Enabling it introduced ugly links
to my RSS feed (probably why it is turned off in the first place), so
I had to make some changes to the RSS feed generation code. I made
some other changes to allow me to limit the number of entries in the
feeds, to save build time and to save consumers of the content from
downloading the entire site over and over. I will be contributing
those changes upstream, soon.

The documentation for the extensions in the tinkerer-contrib
repository is … sparse. Most of the extensions have a README file,
but if I did not understand Sphinx I’m not sure I would have known
what to do with most of them. I was able to make the tag-specific RSS
feed extension work, so that took care of one of my important
requirements.

Python 3.3

I still use Python 2 by default for a lot of things, but I am trying
to make more of an effort to consider Python 3 as well. I have several
libraries that work with both, now, and we are continuing the slow
process of porting libraries used in OpenStack as well. I decided to
go ahead and set up this blog to run with Python 3, so after I had
everything working correctly with Python 2, I built a new virtualenv
using Python 3.3 and installed all of the dependencies. My site built
cleanly with Python 3 the first time I tried. The only problems were
with the old Google sitemap generator script I have been using for
years with PyMOTW.com. I will run that script with Python 2 until I
have time to convert it.

Deploying

I used to use a Mercurial, and then Git, repository with a hook script
to update the old version of my static content. For PyMOTW.com I
simply rsync the content, although I do check the built version of
that site into version control (to make it easier for me to discover
changes in the output format). To keep things simple for now I am
running rsync from the same Makefile I use to control the rest of
the commands to build the site.
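The Makefile target is nothing fancy; a sketch with hypothetical target names and paths (the indented command line must use a tab):

deploy: html
	rsync -av --delete output/html/ server:/var/www/blog/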

Conclusions

There is no clear winner among the available tools for every
situation. I chose something based on Sphinx because I am comfortable
with how that rendering system already works. For someone without that
experience, Pelican may be a better option. For people who don’t like
reStructuredText, hyde or mynt may be more appealing – especially for
a new site, without existing content to be imported. But I’m happy
with Tinkerer, and I’m confident that I can smooth out any rough spots
if I need to.

rst2blogger 1.0

rst2blogger is a command line program for converting
reStructuredText documents to HTML suitable for posting to
blogger.com. It takes as input a single filename and an optional blog
title. The input file is parsed with docutils to create HTML, and
the HTML is uploaded as a draft to the specified blog. If the blogger
account only has one blog, the name does not need to be specified.
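A typical invocation looks something like this (the draft then shows up in Blogger’s web editor for final review):

$ rst2blogger firstpost.rst "My Blog Name"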

See the project documentation for installation and setup
instructions.

sphinxcontrib.spelling 1.1

What is sphinxcontrib.spelling?

sphinxcontrib.spelling is a spelling checker for Sphinx. It uses
PyEnchant to produce a report showing misspelled words.

What’s New in 1.1?

This point update includes new filters to ignore words commonly
encountered in software documentation and other writing about computer
programs. These include Python language built-ins, importable modules,
words that match the names of packages on the Python Package Index,
CamelCase words, and acronyms. There is also a new spelling
directive for creating a local word list within a document.
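For example, a document that legitimately uses project names the dictionary doesn’t know can list them in place (a short, hypothetical example):

.. spelling::

   PyMOTW
   virtualenv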

sphinxcontrib.spelling 1.0

What is sphinxcontrib.spelling?

sphinxcontrib.spelling is a spelling checker for Sphinx. It uses
PyEnchant to produce a report showing misspelled words.

What’s New in 1.0?

This release is completely rewritten from the earlier 0.2 version. The
output includes more details about the location of unknown words in
the source files being processed, and the output is saved for
reference and review. It also includes more extensive
documentation.

new project: sphinxcontrib-paverutils

Kevin Dangoor’s Paver includes basic integration for Sphinx, the
excellent document production toolkit from Georg Brandl. As I have
written before, however, the default integration didn’t quite meet my
needs for producing different forms of output from the same inputs.

Georg has opened the sphinxcontrib repository on BitBucket for
developers who want to collaborate on providing unofficial extensions to
Sphinx, so I decided to go ahead and package up the alternate
integration I use and release it in case someone else finds it helpful.
The result is sphinxcontrib.paverutils.

Writing Technical Documentation with Sphinx, Paver, and Cog

I’ve been working on the Python Module of the Week series since March of
2007. During
the course of the project, my article style and tool chain have
both evolved. I now have a fairly smooth production process in
place, so the mechanics of producing a new post don’t get in the
way of the actual research and writing. Most of the tools are open
source, so I thought I would describe the process I go through and
how the tools work together.

Editing Text: TextMate

I work on a MacBook Pro, and use TextMate
for editing the articles and source for PyMOTW. TextMate is the one
tool I use regularly that is not open source. When I’m doing heavy
editing of hundreds of files for my day job I use Aquamacs Emacs, but TextMate is better suited for prose
editing and is easier to extend with quick actions. I discovered
TextMate while looking for a native editor to use for Python Magazine, and after being able to write my
own “bundle” to manage magazine articles (including defining a mode
for the markup language we use) I was hooked.

Some of the features that I like about TextMate for prose editing are
as-you-type spell-checking (I know some people hate this feature, but
I find it useful), text statistics (word count, etc.), easy block
selection (I can highlight a paragraph or several sentences and move
them using cursor keys), a moderately good reStructuredText mode
(emacs’ is better, but TextMate’s is good enough), paren and quote
matching as you type, and very simple extensibility for repetitive
tasks. I also like TextMate’s project management features, since they
make it easy to open several related files at the same time.

Version Control: svn

I started out using a private svn repository for all of my projects,
including PyMOTW. I’m in the middle of evaluating hosted DVCS
options for PyMOTW,
but still haven’t had enough time to give them all the research I
think is necessary before making the move. The Python core developers
are considering a similar move (PEP 374) so it will be interesting
to monitor that discussion.
No doubt we have different requirements (for example, they are hosting
their own repository), but the experiences with the various DVCS tools
will be useful input to my own decision.

Markup Language: reStructuredText

When I began posting, I wrote each article by hand using HTML. One of
the first tasks that I automated was the step of passing the source
code through pygments to produce a syntax colorized version. This
worked well enough for me at the time, but restricted me to producing
only HTML output. Eventually John Benediktsson contacted me with a
version of many of the posts converted from HTML to reStructuredText.

When reStructuredText was first put forward in the ’90s, I was
heavily into Zope development. As such, I was using StructuredText for documenting my
code, and in the Zope-based wiki that we ran at ZapMedia. I even
wrote my own app to extract
comments and docstrings to generate library documentation for a couple
of libraries I had released as open source. I really liked
StructuredText and, at first, I didn’t like reStructuredText.
Frankly, it looked ugly compared to what I was used to. It quickly
gained acceptance in the general community though, and I knew it would
give me options for producing other output formats for the PyMOTW
posts, so when John sent me the markup files I took another look.

While re-acquainting myself with reST, I realized two things. First,
although there is a bit more punctuation involved in the markup than
with the original StructuredText, the markup language was designed
with consistency in mind so it isn’t as difficult to learn as my first
impressions had led me to believe. Second, it turned out the part I
thought was “ugly” was actually the part that made reST more
powerful
than StructuredText: It has a standard syntax for extension
directives that users can define for their own documents.

Markup to Output: Sphinx

Before I made a final decision on switching from hand-coded HTML to
reST, I needed a tool to convert to HTML (I still had to post the
results on the blog, after all, and Blogger doesn’t support reST). I
first tried David Goodger’s docutils package. The scripts it includes
felt a little too much like “pieces” of a tool rather than a complete
solution, though, and I didn’t really want to assemble my own wrappers
if I didn’t have to – I wanted to write text for this project, not
code my own tools. Around this time, Georg Brandl had made
significant progress on Sphinx, which
turned out to be a more complete turn-key system for converting a pile
of reST files to HTML or PDF. After a few hours of experimentation, I
had a sample project set up and was generating HTML from my documents
using the standard templates.

I decided that reStructuredText looked like the way to go.

HTML Templates: Jinja

My next step was to work out exactly how to produce all of the outputs
I needed from reST inputs. Each post for the PyMOTW series ends up
going to several different places:

  • the PyMOTW source distribution (HTML)
  • my Blogger blog (HTML)
  • the PyMOTW project site (HTML)
  • O’Reilly.com (HTML)
  • the PyMOTW “book” (PDF)

Each of the four HTML outputs uses slightly different formatting,
requiring separate templates (PDF is a whole different problem,
covered below). The source distribution and project site are both
full HTML versions of all of the documents, but use different
templates. I decided to use the default Sphinx templates for the
packaged version; I may change that later, but it works for the time
being, and it’s one less custom template to deal with. I wanted the
online version to match the appearance of the rest of my site, so I
needed to create a template for it. The two blogs use a third
template (O’Reilly’s site ignores a lot of the markup due to their
Movable Type configuration, but the articles come out looking good
enough so I can use the same template I use for my own blog without
worrying about a separate custom template).

Sphinx uses Jinja templates to produce
HTML output. The syntax for Jinja is very similar to Django’s
template language. As it happens, I use Django for the dynamic
portion of my web site that I host myself. I lucked out, and my
site’s base template was simple enough to use with Sphinx without
making any changes. Yay for compatibility!

Cleaning up HTML with BeautifulSoup

The blog posts need to be relatively clean HTML that I can upload to
Blogger and O’Reilly, so they could not include any html or
body tags or require any markup or styles not supported by either
blogging engine. The template I came up with is a stripped down
version that doesn’t include the CSS and markup for sidebars, header,
or footer. The result was almost exactly what I wanted, but had two
problems.

The easiest problem to handle was the permalinks generated by Sphinx.
After each heading on the page, Sphinx inserts an anchor tag with a ¶
character and applies CSS styles that hide/show the tag when the user
hovers over it. That’s a nice feature for the main site and packaged
content, but it didn’t work for the blogs. I have no control over
the CSS used at O’Reilly, so the tags were always visible. I didn’t
really care if they were included on the Blogger pages, so the
simplest thing to do was stick with one “blogging” template and remove
the permalinks.

The second, more annoying, problem, was that Blogger wanted to insert
extra whitespace into the post. There is a configuration option on
Blogger to treat line breaks in the post as “paragraph breaks” (I
think they actually insert br tags). This is very convenient for
normal posts with mostly straight text, since I can simply write each
paragraph on one long line, wrapped visually by my editor, and break
the paragraphs where I want them. The result is I can almost post
directly from plain text input. Unfortunately, the option is applied
to every post in the blog (even old posts), so changing it was not a
realistic option – I wasn’t about to go back and re-edit every single
post I had previously written.

Sphinx didn’t have an option to skip generating the permalinks, and
there was no way to express that intent in the template, so I fell
back to writing a little script to strip them out after the fact. I
used BeautifulSoup
to find the tags I wanted removed, delete them from the parse tree,
then assemble the HTML text as a string again. I added code to the
same script to handle the whitespace issue by removing all newlines
from the input unless they were inside pre tags, which Blogger
handled correctly. The result was a single blob of partial HTML
without newlines or permalinks that I could post directly to either
blog without editing it by hand. Score a point for automation.

def clean_blog_html(body):
    # Clean up the HTML
    import re
    import sys
    from BeautifulSoup import BeautifulSoup
    from cStringIO import StringIO

    # The post body is passed to stdin.
    soup = BeautifulSoup(body)

    # Remove the permalinks to each header since the blog does not have
    # the styles to hide them.
    links = soup.findAll('a', attrs={'class':"headerlink"})
    [l.extract() for l in links]

    # Get BeautifulSoup's version of the string
    s = soup.__str__(prettyPrint=False)

    # Remove extra newlines.  This depends on the fact that
    # code blocks are passed through pygments, which wraps each part of the line
    # in a span tag.
    pattern = re.compile(r'([^s][^p][^a][^n]>)\n$', re.DOTALL|re.IGNORECASE)
    s = ''.join(pattern.sub(r'\1', l) for l in StringIO(s))

    return s
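The surrounding script isn’t shown here, but based on the comments it reads the post body from stdin, along the lines of:

import sys
print(clean_blog_html(sys.stdin.read()))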

Code Syntax Highlighting: pygments

I wanted my posts to look as good as possible, and an important factor
in the appearance would be the presentation of the source code. I
adopted pygments in the early hand-coded
HTML days, because it was easy to integrate into TextMate with a
simple script.

pygmentize -f html -O cssclass=syntax $@

Binding the command to a key combination meant with a few quick
keypresses I had HTML ready to insert into the body of a post.

When I moved to Sphinx, using pygments became even easier because
Sphinx automatically passes included source code through pygments as
it generates its output. Syntax highlighting works for HTML and PDF,
so I didn’t need any custom processing.

Automation: Paver

Automation is important for my sense of well-being. I hate dealing
with mundane repetitive tasks, so once an article was written I didn’t
want to have to touch it to prepare it for publication of any of the
final destinations. As I have written before,
I started out using make to run various shell commands. I have
since converted the entire process to Paver.

The stock Sphinx integration that comes with Paver didn’t quite meet
my needs, but by examining the source I was able to
create my own replacement tasks in an afternoon. The main problem was
the tight coupling between the code to run Sphinx and the code to find
the options to pass to it. For normal projects with a single
documentation output format (Paver assumes HTML with a single config
file), this isn’t a problem. PyMOTW’s requirements are different,
with the four output formats discussed above.

In order to produce different output with Sphinx, you need different
configuration files. Since the base name for the file must always be
conf.py, that means the files have to be stored in separate
directories. One of the options passed to Sphinx on the command line
tells it the directory to look in for its configuration file. Even
though Paver doesn’t fork() before calling Sphinx, it still uses
the command line options to pass instructions.

Creating separate Sphinx configuration files was easy. The problem
was defining options in Paver to tell Sphinx about each configuration
directory for the different output. Paver options are grouped into
bundles, which are essentially a namespace. When a Paver task looks
for an option, it scans through the bundles, possibly cascading to the
global namespace, until it finds the option by name. The search can
be limited to specific bundles, so that the same option name can be
used to configure different tasks.
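A contrived example of that lookup order (not from my pavement.py, just an illustration of the cascade):

from paver.easy import options, Bunch

options(
    sphinx=Bunch(docroot='.', builder='html'),
    website=Bunch(builder='dirhtml'),   # hypothetical override
)

# Search "website" first, then "sphinx"; add_rest=False prevents
# falling back to the global namespace.
options.order('website', 'sphinx', add_rest=False)

options.get('builder', 'html')   # -> 'dirhtml', from the website bunch
options.get('docroot', 'docs')   # -> '.', found by cascading to sphinx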

The html task from paver.doctools sets the options search order to
look for values first in the sphinx section, then globally. Once
it has retrieved the path values, via _get_paths(), it invokes
Sphinx.

def _get_paths():
    """look up the options that determine where all of the files are."""
    opts = options
    docroot = path(opts.get('docroot', 'docs'))
    if not docroot.exists():
        raise BuildFailure("Sphinx documentation root (%s) does not exist."
                           % docroot)
    builddir = docroot / opts.get("builddir", ".build")
    builddir.mkdir()
    srcdir = docroot / opts.get("sourcedir", "")
    if not srcdir.exists():
        raise BuildFailure("Sphinx source file dir (%s) does not exist"
                            % srcdir)
    htmldir = builddir / "html"
    htmldir.mkdir()
    doctrees = builddir / "doctrees"
    doctrees.mkdir()
    return Bunch(locals())

@task
def html():
    """Build HTML documentation using Sphinx. This uses the following
    options in a "sphinx" section of the options.

    docroot
      the root under which Sphinx will be working. Default: docs
    builddir
      directory under the docroot where the resulting files are put.
      default: build
    sourcedir
      directory under the docroot for the source files
      default: (empty string)
    """
    options.order('sphinx', add_rest=True)
    paths = _get_paths()
    sphinxopts = ['', '-b', 'html', '-d', paths.doctrees,
        paths.srcdir, paths.htmldir]
    dry("sphinx-build %s" % (" ".join(sphinxopts),), sphinx.main, sphinxopts)

This didn’t work for me because I needed to pass a separate
configuration directory (not handled by the default _get_paths())
and different build and output directories. The simplest solution
turned out to be re-implementing the Paver-Sphinx integration to make
it more flexible. I created my own _get_paths() and made it look
for the extra option values and use the directory structure I needed.

def _get_paths():
    """look up the options that determine where all of the files are."""
    opts = options

    docroot = path(opts.get('docroot', 'docs'))
    if not docroot.exists():
        raise BuildFailure("Sphinx documentation root (%s) does not exist."
                           % docroot)

    builddir = docroot / opts.get("builddir", ".build")
    builddir.mkdir()

    srcdir = docroot / opts.get("sourcedir", "")
    if not srcdir.exists():
        raise BuildFailure("Sphinx source file dir (%s) does not exist"
                            % srcdir)

    # Where is the sphinx conf.py file?
    confdir = path(opts.get('confdir', srcdir))

    # Where should output files be generated?
    outdir = opts.get('outdir', '')
    if outdir:
        outdir = path(outdir)
    else:
        outdir = builddir / opts.get('builder', 'html')
    outdir.mkdir()

    # Where are doctrees cached?
    doctrees = opts.get('doctrees', '')
    if not doctrees:
        doctrees = builddir / "doctrees"
    else:
        doctrees = path(doctrees)
    doctrees.mkdir()

    return Bunch(locals())

Then I defined a new function, run_sphinx(), to set up the options
search path, look for the option values, and invoke Sphinx. I set
add_rest to False to disable searching globally for an option to
avoid namespace pollution from option collisions, since I knew I was
going to have options with the same names but different values for
each output format. I also look for a “builder”, to support PDF
generation.

def run_sphinx(*option_sets):
    """Helper function to run sphinx with common options.

    Pass the names of namespaces to be used in the search path
    for options.
    """
    if 'sphinx' not in option_sets:
        option_sets += ('sphinx',)
    kwds = dict(add_rest=False)
    options.order(*option_sets, **kwds)
    paths = _get_paths()
    sphinxopts = ['',
                  '-b', options.get('builder', 'html'),
                  '-d', paths.doctrees,
                  '-c', paths.confdir,
                  paths.srcdir, paths.outdir]
    dry("sphinx-build %s" % (" ".join(sphinxopts),), sphinx.main, sphinxopts)
    return

With a working run_sphinx() function I could define several
Sphinx-based tasks, each taking options with the same names but from
different parts of the namespace. The tasks simply call
run_sphinx() with the desired namespace search path. For example,
to generate the HTML to include in the sdist package, the html
task looks in the html bunch:

@task
@needs(['cog'])
def html():
    """Build HTML documentation using Sphinx. This uses the following
    options in a "sphinx" section of the options.

    docroot
      the root under which Sphinx will be working.
      default: docs
    builddir
      directory under the docroot where the resulting files are put.
      default: build
    sourcedir
      directory under the docroot for the source files
      default: (empty string)
    doctrees
      the location of the cached doctrees
      default: $builddir/doctrees
    confdir
      the location of the sphinx conf.py
      default: $sourcedir
    outdir
      the location of the generated output files
      default: $builddir/$builder
    builder
      the name of the sphinx builder to use
      default: html
    """
    set_templates(options.html.templates)
    run_sphinx('html')
    return

while generating the HTML output for the website uses a different set
of options from the website bunch:

@task
@needs(['webtemplatebase', 'cog'])
def webhtml():
    """Generate HTML files for website.
    """
    set_templates(options.website.templates)
    run_sphinx('website')
    return

All of the option search paths also include the sphinx bunch, so
values that do not change (such as the source directory) do not need
to be repeated. The relevant portion of the options from the PyMOTW
pavement.py file looks like this:

options(
    # ...

    sphinx = Bunch(
        sourcedir=PROJECT,
        docroot = '.',
        builder = 'html',
        doctrees='sphinx/doctrees',
        confdir = 'sphinx',
    ),

    html = Bunch(
        builddir='docs',
        outdir='docs',
        templates='pkg',
    ),

    website=Bunch(
        templates = 'web',
        #outdir = 'web',
        builddir = 'web',
    ),

    pdf=Bunch(
        templates='pkg',
        #outdir='pdf_output',
        builddir='web',
        builder='latex',
    ),

    blog=Bunch(
        sourcedir=path(PROJECT)/MODULE,
        builddir='blog_posts',
        outdir='blog_posts',
        confdir='sphinx/blog',
        doctrees='blog_posts/doctrees',
    ),

    # ...
)

To find the sourcedir for the html task, _get_paths() first
looks in the html bunch, then the sphinx bunch.

Capturing Program Output: cog

As an editor at Python Magazine, and reviewer for several books, I’ve
discovered that one of the most frequent sources of errors in
technical writing occurs in the production process where the output of
running sample code is captured to be included in the final text.
This is usually done manually by running the program and copying and
pasting its output from the console. It’s not uncommon for a bug to
be found, or a library to change, requiring a change in the source
code provided with the article. That change, in turn, means the
output of commands may be different. Sometimes the change is minor,
but at other times the output is different in some significant way.
Since I’ve seen the problem come up so many times, I spent time
thinking about and looking for a solution to avoid it in my own work.

During my research, a few people suggested that I switch to using
doctests for my examples, but I felt there were several problems with
that approach. First, the doctest format isn’t very friendly for
users who want to copy and paste examples into their own scripts. The
reader has to select each line individually, and can’t simply grab the
entire block of code. Distributing the examples as separate scripts
makes this easier, since they can simply copy the entire file and
modify it as they want. Using individual .py files also makes it
possible for some of the more complicated examples to run clients and
servers at the same time from different scripts (as with
SimpleXMLRPCServer, for
example). But most importantly, using doctests does not solve the
fundamental problem. Doctests tell me when the output has changed,
but I still have to manually run the scripts to generate that output
and paste it into my document in the first place. What I really
wanted to be able to do was run the script and insert the output,
whatever it was, without manually copying and pasting text from the
console.

I finally found what I was looking for in cog, from Ned Batchelder. Ned
describes cog as a “code generation tool”, and most of the examples he
provides on his site are in that vein. But cog is a more general
purpose tool than that. It gives you a way to include arbitrary
Python instructions in your source document, have them executed, and
then have the source document change to reflect the output.

For each code sample, I wanted to include the Python source followed
by the output it produces when run on the console. There is a reST
directive to include the source file, so that part is easy:

.. include:: anydbm_whichdb.py
    :literal:
    :start-after: #end_pymotw_header

The include directive tells Sphinx that the file
“anydbm_whichdb.py” should be treated as a literal text block (instead
of more reST) and to only include the parts following the last line of
the standard header I use in all my source code. Syntax highlighting
comes for free when the literal block is converted to the output
format.

Grabbing the command output was a little trickier. Normally with cog,
one would embed the actual source to be run in the document. In my
case, I had the text in an external file. Most of the source is
Python, and I could just import it, but I would have to go to special
lengths to capture any output and pass it to cog.out(), the cog
function for including text in the processed document. I didn’t want
my example code littered with calls to cog.out() instead of
print, so I needed to capture sys.stdout and sys.stderr. A bigger
question was whether I wanted to have all of the sample files imported
into the namespace of the build process. Considering both issues, it
made sense to run the script in a separate process and capture the
output.

There is a bit of setup work needed to run the scripts this way, so I
decided to put it all into a function instead of including the
boilerplate code in every cog block. The reST source for running
anydbm_whichdb.py looks like:

.. {{{cog
.. cog.out(run_script(cog.inFile, 'anydbm_whichdb.py'))
.. }}}
.. {{{end}}}

The .. at the start of each line causes the reStructuredText
parser to treat the line as a comment, so it is not included in the
output. After passing the reST file through cog, it is rewritten to
contain:

.. {{{cog
.. cog.out(run_script(cog.inFile, 'anydbm_whichdb.py'))
.. }}}

::

    $ python anydbm_whichdb.py
    dbhash

.. {{{end}}}

The run_script() function runs the Python script it is given, adds
a prefix to make reST treat the following lines as literal text, then
indents the script output. The script is run via Paver’s sh()
function, which wraps the subprocess module and supports the dry-run
feature of Paver. Because the cog instructions are comments, the only
part that shows up in the output is the literal text block with the
command output.

def run_script(input_file, script_name,
                interpreter='python',
                include_prefix=True,
                ignore_error=False,
                trailing_newlines=True,
                ):
    """Run a script in the context of the input_file's directory,
    return the text output formatted to be included as an rst
    literal text block.

    Arguments:

     input_file
       The name of the file being processed by cog.  Usually passed as cog.inFile.

     script_name
       The name of the Python script living in the same directory as input_file to be run.
       If not using an interpreter, this can be a complete command line.  If using an
       alternate interpreter, it can be some other type of file.

     include_prefix=True
       Boolean controlling whether the :: prefix is included.

     ignore_error=False
       Boolean controlling whether errors are ignored.  If not ignored, the error
       is printed to stdout and then the command is run *again* with errors ignored
       so that the output ends up in the cogged file.

     trailing_newlines=True
       Boolean controlling whether trailing newlines are added to the output.
       If False, the output is passed to rstrip() then one newline is added.  If
       True, newlines are added to the output until it ends in 2 newlines.
    """
    rundir = path(input_file).dirname()
    if interpreter:
        cmd = '%(interpreter)s %(script_name)s' % vars()
    else:
        cmd = script_name
    real_cmd = 'cd %(rundir)s; %(cmd)s 2>&1' % vars()
    try:
        output_text = sh(real_cmd, capture=True, ignore_error=ignore_error)
    except Exception, err:
        print '*' * 50
        print 'ERROR run_script(%s) => %s' % (real_cmd, err)
        print '*' * 50
        output_text = sh(real_cmd, capture=True, ignore_error=True)
    if include_prefix:
        response = '\n::\n\n'
    else:
        response = ''
    response += '\t$ %(cmd)s\n\t' % vars()
    response += '\n\t'.join(output_text.splitlines())
    if trailing_newlines:
        while not response.endswith('\n\n'):
            response += '\n'
    else:
        response = response.rstrip()
        response += '\n'
    return response

I defined run_script() in my pavement.py file, and added it to the
__builtins__ namespace to avoid having to import it each time I
wanted to use it from a source document.
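In Python 2 that injection is a one-liner; a sketch of the technique (not necessarily the exact line from my pavement.py):

import __builtin__
__builtin__.run_script = run_script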

A somewhat more complicated example shows another powerful feature of
cog. Because it can run any arbitrary Python code, it is possible to
establish the preconditions for a script before running it. For
example, anydbm_new.py assumes that its output database does not
already exist. I can ensure that condition by removing it before
running the script.

.. {{{cog
.. workdir = path(cog.inFile).dirname()
.. sh("cd %s; rm -f /tmp/example.db" % workdir)
.. cog.out(run_script(cog.inFile, 'anydbm_new.py'))
.. }}}
.. {{{end}}}

Since cog is integrated into Paver, all I had to do to enable it was
define the options and import the module. I chose to change the begin
and end tags used by cog because the default patterns ([[[cog and
]]]) appeared in the output of some of the scripts (printing
nested lists, for example).

cog=Bunch(
    beginspec='{{{cog',
    endspec='}}}',
    endoutput='{{{end}}}',
),

To process all of the input files through cog before generating the
output, I added ‘cog’ to the @needs list for any task running
sphinx. Then it was simply a matter of running paver html or paver
webhtml
to generate the output.

Paver includes an uncog task to remove the cog output from your
source files before committing to a source code repository, but I
decided to include the cogged values in committed versions so I would
be alerted if the output ever changed.

Generating PDF: TexLive

Generating HTML using Sphinx and Jinja templates is fairly
straightforward; PDF output wasn’t quite so easy to set up. Sphinx
actually produces LaTeX, another text-based format, as output, along
with a Makefile to run third-party LaTeX tools to create the PDF. I
started out experimenting on a Linux system (normally I use a Mac, but
this box claimed to have the required tools installed). Due to the
age of the system, however, the tools weren’t compatible with the
LaTeX produced by Sphinx. After some searching, and asking on the
sphinx-dev mailing list, I installed a copy of TeX Live, a newer TeX distro. A few tweaks to
my $PATH later and I was in business building PDFs right on my
Mac.

My pdf task runs Sphinx with the “latex” builder, then runs
make using the generated Makefile.

@task
@needs(['cog'])
def pdf():
    """Generate the PDF book.
    """
    set_templates(options.pdf.templates)
    run_sphinx('pdf')
    latex_dir = path(options.pdf.builddir) / 'latex'
    sh('cd %s; make' % latex_dir)
    return

I still need to experiment with some of the LaTeX options, including
templates for pages in different sizes, logos, and styles. For now
I’m happy with the default look.

Releasing

Once I had the “build” fully automated, it was time to address the
distribution process. For each version, I need to:

  • upload HTML, PDF, and tar.gz files to my server
  • update PyPI
  • post to my blog
  • post to the O’Reilly blog

The HTML and PDF files are copied to my server using rsync, invoked
from Paver. I use a web browser and the admin interface for
django-codehosting to upload the
tar.gz file containing the source distribution manually. That will be
automated, eventually. Once the tar.gz is available, PyPI can be
updated via the built-in task paver register. That just leaves the
two blog posts.

For my own blog, I use MarsEdit to post and edit entries. I
find the UI easy to use, and I like the ability to work on drafts of
posts offline. It is much nicer than the web interface for Blogger,
and has the benefit of being AppleScript-able. I have plans to
automate all of the steps right up to actually posting the new blog
entry, but for now I copy the generated blog entry into a new post
window by hand.

O’Reilly’s blogging policy does not allow desktop clients (too much of
a support issue for the tech staff), so I need to use their Movable
Type web UI to post. As with MarsEdit, I simply copy the output and
paste it into the field in the browser window, then add tags.

Tying it All Together

A quick overview of my current process is:

  1. Pick a module, research it, and write examples in reST and Python.
    Include the Python source and use cog directives to bring in the
    script output.
  2. Use the command “paver html” to produce HTML output to verify the
    results look good and I haven’t messed up any markup.
  3. Commit the changes to svn. When I’m done with the module, copy the
    “trunk” to a release branch for packaging.
  4. Use “paver sdist” to create the tar.gz file containing the Python
    source and HTML documentation.
  5. Upload the tar.gz file to my site.
  6. Run “paver installwebsite” to regenerate the hosted version of the
    HTML and the PDF, then copy both to my web server.
  7. Run “paver register” to update PyPI with the latest release
    information.
  8. Run “paver blog” to generate the HTML to be posted to the blogs.
    The task opens a new TextMate window containing the HTML so it is
    ready to be copied.
  9. Paste the blog post contents into MarsEdit, add tags, and send it
    to Blogger.
  10. Paste the blog post contents into the MT UI for O’Reilly, add
    tags, verify that it renders properly, then publish.

Try It Yourself

All of the source for PyMOTW (including the pavement.py file with
configuration options, task definitions, and Sphinx integration) is
available from the PyMOTW web site. Sphinx, Paver, cog, and
BeautifulSoup are all open source projects. I’ve only tested the
PyMOTW “build” on Mac OS X, but it should work on Linux without any
major alterations. If you’re on Windows, let me know if you get it
working.

Originally published on my blog, 2 February 2009