preparing Sphinx output for Blogger

Blogger doesn’t let me change the “convert line breaks” setting for an individual post. Since not all of my posts are PyMOTW articles, I’ve been trying to keep that flag turned on because it makes posts like this one easier to write. The results for the PyMOTW articles have been a little ugly, but I think I finally have that straightened out.

I prepare the PyMOTW articles using reST and convert them to HTML with Sphinx. I have a custom template that spits out only the body of the HTML (with no html or body tags). Code passes through pygments automatically as part of the Sphinx processing. The results include newlines after most of the tags, though. Blogger was converting those newlines to br tags, even when the tags themselves were otherwise invisible (like table tags).

I needed a cleanup script anyway because Sphinx (or docutils, I’m not 100% certain) inserts permalink anchors for each header. The stylesheet I use for the PyMOTW site causes them to be hidden unless the user mouses over the link, but I didn’t want them at all in the blog posts. A previous attempt at a cleanup script with BeautifulSoup stripped the permalinks but also removed the whitespace from within pre tags. A recent update to BeautifulSoup fixed that problem, so I gave it another try today.
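To make the permalink problem concrete, here is a minimal standard-library sketch of stripping those anchors. It is not what the script in this post does (that uses BeautifulSoup); the regular expression, the `strip_permalinks` name, and the sample markup are all assumptions for illustration, and the approach only holds if each permalink is a single `a` element with `class="headerlink"` and no nested tags.

```python
import re

# Assumption: every permalink is one <a> element carrying class="headerlink"
# and containing no nested tags. Real HTML parsing is more robust than this.
HEADERLINK = re.compile(
    r'<a\b[^>]*\bclass="headerlink"[^>]*>.*?</a>',
    re.DOTALL,
)


def strip_permalinks(html):
    """Return html with the headerlink anchors removed."""
    return HEADERLINK.sub('', html)


# Made-up sample of a header with a permalink anchor.
sample = (
    '<h1>readline'
    '<a class="headerlink" href="#readline" title="Permalink">\u00b6</a>'
    '</h1>'
)
cleaned = strip_permalinks(sample)
```

Ordinary links are left alone because the pattern only matches an opening tag that contains the `headerlink` class.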

Unfortunately, I couldn’t find any combination of arguments to tell BeautifulSoup not to insert newlines between tags. The prettyPrint option was either ignored, or I don’t understand how it is intended to be used. So I used BeautifulSoup to remove the permalinks but fell back on regular expressions for the newline handling.

I want to remove all newline characters immediately after closing tags, unless the tags are part of code or other pre-formatted output. Lines that do not end with tags are probably part of pre blocks, and whitespace is obviously important there. I realized that since pygments consistently uses span tags, as long as I left the newlines after span tags alone I should be safe.
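The rule can be tried out in isolation. The sketch below applies the pattern the cleanup script relies on, with the newline and backreference escapes written out, to a made-up sample. It is written for Python 3 so it runs on a current interpreter; the `strip_newlines_after_tags` name and the sample HTML are hypothetical.

```python
import re

# A ">" followed by a newline at the end of the line, where the four
# characters before the ">" are not s, p, a, n (tested position by position).
PATTERN = re.compile(r'([^s][^p][^a][^n]>)\n$', re.IGNORECASE)


def strip_newlines_after_tags(html):
    """Drop the trailing newline from lines that end in a non-span tag."""
    return ''.join(
        PATTERN.sub(r'\1', line)
        for line in html.splitlines(True)  # True keeps the line endings
    )


# Made-up sample: table markup, a pygments-style line, and plain text.
sample = (
    '<table>\n'
    '<tr><td>cell</td></tr>\n'
    '</table>\n'
    '<span class="k">import</span> <span class="nn">re</span>\n'
    'plain text inside a pre block\n'
)
cleaned = strip_newlines_after_tags(sample)
```

The table lines are joined together, while the line ending in a span tag and the line ending in plain text keep their newlines. Because each character class tests a single position on its own, the pattern also leaves the newline after a few other tags, `<pre>` for example.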

This is the script I came up with to take the HTML output of Sphinx and prepare it for posting through Blogger:

#!/usr/bin/env python
# encoding: utf-8
#
# Copyright (c) 2008 Doug Hellmann All rights reserved.
#
"""Clean a sphinx-generated HTML blob to make a blog post.
"""

import re
import sys
from BeautifulSoup import BeautifulSoup
from cStringIO import StringIO

# The post body is passed to stdin.
body = sys.stdin.read()
soup = BeautifulSoup(body)

# Remove the permalinks to each header since the blog does not have
# the styles to hide them.
links = soup.findAll('a', attrs={'class':"headerlink"})
for link in links:
    link.extract()

# Get BeautifulSoup's version of the string
s = soup.__str__(prettyPrint=False)

# Remove extra newlines.  This depends on the fact that
# code blocks are passed through pygments, which wraps each part of the line
# in a span tag.
pattern = re.compile(r'([^s][^p][^a][^n]>)\n$', re.DOTALL|re.IGNORECASE)
s = ''.join(pattern.sub(r'\1', l) for l in StringIO(s))
print s

Today’s PyMOTW post on readline is the first example of the results.

Updated 1 Dec to change import line based on reader comment.