Creating a Spelling Checker for reStructuredText Documents

I write a lot using reStructuredText files as the source format, largely because of the ease of automating the tools used to convert reST to other formats. The number of files involved has grown to the point that some of the post-writing tasks were becoming tedious, and I was skipping steps like running the spelling checker. I finally decided to do something about that by creating a spelling checker plugin for Sphinx, released as sphinxcontrib-spelling.

I have written about why I chose reST before. All of the articles on this site, including the Python Module of the Week, series, started out as .rst files. I also use Sphinx to produce several developer manuals at my day job. I like reST and Sphinx because they both can be extended to meet new needs easily. One area that has been lacking, though, is support for a spelling checker.

Checking the spelling of the contents of an individual reStructuredText file from within a text editor like Aquamacs is straightforward, but I have on the order of 200 separate files making up parts of this site alone, not to mention my book. Manually checking each file, one at a time, is a tedious job, and not one I perform very often. After finding a few typos recently, I decided I needed to take care of the problem by using automation to eliminate the drudgery and make it easier to run the spelling checker regularly.

The files are already configured to be processed by Sphinx when they are converted to HTML and PDF format, so that seemed like a natural way to handle the spelling checker, too. To add a step to the build to check the spelling of every file, I would need two new tools: an extension to Sphinx to drive the spelling checker and the spelling checker itself. I did not find any existing Sphinx extensions that checked spelling, so I decided to write my own. The first step was to evaluate spelling checkers.

I did not find any existing Sphinx extensions that checked spelling, so I decided to write my own.

Choosing a Spelling Checker

I recently read Peter Norvig’s article How to Write a Spelling Corrector, which shows how to create a spelling checker from scratch in Python. As with most nontrivial applications, though, the algorithm for testing the words is only part of the story when looking at a spelling checker. An equally important aspect is the dictionary of words known to be spelled correctly. Without a good dictionary, the algorithm would report too many false negatives. Not wanting to build my own dictionary, I decided to investigate existing spelling checkers and concentrate on writing the interface layer to connect them to Sphinx.

There are several open source spelling checkers with Python bindings. I evaluated the aspell-python and PyEnchant (bindings for enchant, the spelling checker from the AbiWord project). Both tools required some manual setup to get the engine working. The aspell-python API was simple to use, but I decided to use PyEnchant instead. It has an active development group and is more extensible (with APIs to define alternate dictionaries, tokenizers, and filters).

Installing PyEnchant

I started out by trying to install enchant and PyEnchant from source under OS X with Python 2.7, but eventually gave up after having to download several dependencies just to get configure to run for enchant. I stuck with PyEnchant as a solution because installing aspell was not really any easier (the installation experience for both tools could be improved). The simplest solution for OS X and Windows is to use the platform-specific binary installers for PyEnchant (not the .egg), since they include all of the dependencies. That means it cannot be installed into a virtualenv, but I was willing to live with that for the sake of having any solution at all.

Linux platforms can probably install enchant via RPM or other system package, so it is less of a challenge to get PyEnchant working there, and it may even work with pip.

Using PyEnchant

There are several good examples in the PyEnchant tutorial, and I will not repeat them here. I will cover some of the concepts, though, as part of explaining the implementation of the new extension.

The PyEnchant API is organized around a “dictionary,” which can be loaded at runtime based on a language name. Enchant does some work to try to determine the correct language automatically based on the environment settings, but I found it more reliable to set the language explicitly. After the dictionary is loaded, its check() method can be used to test whether a word is correct or not. For incorrect words, the suggest() method returns a list of possible alternatives, sorted by the likelihood they are the intended word.

The check() method works well for individual words, but cannot process paragraphs. PyEnchant provides an API for checking larger blocks of text, but I chose to use a lower level API instead. In addition to the dictionary, PyEnchant includes a “tokenizer” API for splitting text into candidate words to be checked. Using the tokenizer API means that the new plugin can run some additional tests on words not found in the dictionary. For example, I plan to provide an option to ignore “misspelled” words that appear to be the name of an importable Python module.

Integrating with Sphinx

The Sphinx Extension API includes several ways to add new features to Sphinx, including markup roles, language domains, processing events, and directives. I chose to create a new “builder” class, because that would give me complete control over the way the document is processed. The builder API works with a parsed document to create output, usually in a format like HTML or PDF. In this case, the SpellingBuilder does not generate any output files. It prints the list of misspelled words to standard output, and includes the headings showing where the words appear in the document.

The first step in creating the new extension is to define a setup() function to be invoked when the module is loaded. The function receives as argument an instance of the Sphinx application, ready to be configured. In sphinxcontrib.spelling.setup(), the new builder and several configuration options are added to the application. Although the Sphinx configuration file can contain any Python code, only the explicitly registered configuration settings affect the way the environment is saved.

def setup(app):
    app.info('Initializing Spelling Checker')
    app.add_builder(SpellingBuilder)
    # Report guesses about correct spelling
    app.add_config_value('spelling_show_suggestions', False, 'env')
    # Set the language for the text
    app.add_config_value('spelling_lang', 'en_US', 'env')
    # Set a user-provided list of words known to be spelled properly
    app.add_config_value('spelling_word_list_filename', 'spelling_wordlist.txt', 'env')
    # Assume anything that looks like a PyPI package name is spelled properly
    app.add_config_value('spelling_ignore_pypi_package_names', False, 'env')
    # Assume words that look like wiki page names are spelled properly
    app.add_config_value('spelling_ignore_wiki_words', True, 'env')
    # Assume words that are all caps, or all caps with trailing s, are spelled properly
    app.add_config_value('spelling_ignore_acronyms', True, 'env')
    # Assume words that are part of __builtins__ are spelled properly
    app.add_config_value('spelling_ignore_python_builtins', True, 'env')
    # Assume words that look like the names of importable modules are spelled properly
    app.add_config_value('spelling_ignore_importable_modules', True, 'env')
    # Add any user-defined filter classes
    app.add_config_value('spelling_filters', [], 'env')
    # Register the 'spelling' directive for setting parameters within a document
    rst.directives.register_directive('spelling', SpellingDirective)
    return

The builder class is derived from sphinx.builders.Builder. The important method is write_doc(), which processes the parsed documents and saves the messages with unknown words to the output file.

def write_doc(self, docname, doctree):
    self.checker.push_filters(self.env.spelling_document_filters[docname])

    for node in doctree.traverse(docutils.nodes.Text):
        if node.tagname == '#text' and  node.parent.tagname in TEXT_NODES:

            # Figure out the line number for this node by climbing the
            # tree until we find a node that has a line number.
            lineno = None
            parent = node
            seen = set()
            while lineno is None:
                #self.info('looking for line number on %r' % node)
                seen.add(parent)
                parent = node.parent
                if parent is None or parent in seen:
                    break
                lineno = parent.line
            filename = self.env.doc2path(docname, base=None)

            # Check the text of the node.
            for word, suggestions in self.checker.check(node.astext()):
                msg_parts = []
                if lineno:
                    msg_parts.append(darkgreen('(line %3d)' % lineno))
                msg_parts.append(red(word))
                msg_parts.append(self.format_suggestions(suggestions))
                msg = ' '.join(msg_parts)
                self.info(msg)
                self.output.write(u"%s:%s: (%s) %sn" % (
                        self.env.doc2path(docname, None),
                        lineno, word,
                        self.format_suggestions(suggestions),
                        ))

                # We found at least one bad spelling, so set the status
                # code for the app to a value that indicates an error.
                self.app.statuscode = 1

    self.checker.pop_filters()
    return

The builder traverses all of the text nodes, skipping over formatting nodes and container nodes that contain no text. Each node is converted to plain text using its astext() method, and the text is given to the SpellingChecker to be parsed and checked.

class SpellingChecker(object):
    """Checks the spelling of blocks of text.

    Uses options defined in the sphinx configuration file to control
    the checking and filtering behavior.
    """

    def __init__(self, lang, suggest, word_list_filename, filters=[]):
        self.dictionary = enchant.DictWithPWL(lang, word_list_filename)
        self.tokenizer = get_tokenizer(lang, filters)
        self.original_tokenizer = self.tokenizer
        self.suggest = suggest

    def push_filters(self, new_filters):
        """Add a filter to the tokenizer chain.
        """
        t = self.tokenizer
        for f in new_filters:
            t = f(t)
        self.tokenizer = t

    def pop_filters(self):
        """Remove the filters pushed during the last call to push_filters().
        """
        self.tokenizer = self.original_tokenizer

    def check(self, text):
        """Generator function that yields bad words and suggested alternate spellings.
        """
        for word, pos in self.tokenizer(text):
            correct = self.dictionary.check(word)
            if correct:
                continue
            yield word, self.dictionary.suggest(word) if self.suggest else []
        return

Finding Words in the Input Text

The blocks of text from the nodes are parsed using a language-specific tokenizer provided by PyEnchant. The text is split into words, and then each word is passed through a series of filters. The API defined by enchant.tokenize.Filter supports two behaviors. Based on the return value from _skip(), the word might be ignored entirely and never returned by the tokenizer. Alternatively, the _split() method can return a modified version of the text.

In addition to the filters for email addresses and “wiki words” provided by PyEnchant, sphinxcontrib-spelling includes several others. The AcronymFilter tells the tokenizer to skip words that use all uppercase letters.

class AcronymFilter(Filter):
    """If a word looks like an acronym (all upper case letters),
    ignore it.
    """
    def _skip(self, word):
        return (word == word.upper() # all caps
                or
                # pluralized acronym ("URLs")
                (word[-1].lower() == 's'
                 and
                 word[:-1] == word[:-1].upper()
                 )
                )

The ContractionFilter expands common English contractions that might appear in less formal blog posts.

class list_tokenize(tokenize):
    def __init__(self, words):
        tokenize.__init__(self, '')
        self._words = words
    def next(self):
        if not self._words:
            raise StopIteration()
        word = self._words.pop()
        return (word, )

class ContractionFilter(Filter):
    """Strip common contractions from words.
    """
    splits = {
        "won't":['will', 'not'],
        "isn't":['is', 'not'],
        "can't":['can', 'not'],
        "i'm":['I', 'am'],
        }
    def _split(self, word):
        # Fixed responses
        if word.lower() in self.splits:
            return list_tokenize(self.splits[word.lower()])

        # Possessive
        if word.lower().endswith("'s"):
            return unit_tokenize(word[:-2])

        # * not
        if word.lower().endswith("n't"):
            return unit_tokenize(word[:-3])

        return unit_tokenize(word)

Because I write about Python a lot, I tend to use the names of projects that appear on the Python Package Index (PyPI). PyPiFilterFactory fetches a list of the packages from the index and then sets up a filter to ignore all of them.

class IgnoreWordsFilter(Filter):
    """Given a set of words, ignore them all.
    """
    def __init__(self, tokenizer, word_set):
        self.word_set = set(word_set)
        Filter.__init__(self, tokenizer)
    def _skip(self, word):
        return word in self.word_set

class IgnoreWordsFilterFactory(object):
    def __init__(self, words):
        self.words = words
    def __call__(self, tokenizer):
        return IgnoreWordsFilter(tokenizer, self.words)

class PyPIFilterFactory(IgnoreWordsFilterFactory):
    """Build an IgnoreWordsFilter for all of the names of packages on PyPI.
    """
    def __init__(self):
        client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
        IgnoreWordsFilterFactory.__init__(self, client.list_packages())

PythonBuiltinsFilter ignores functions built into the Python interpreter.

class PythonBuiltinsFilter(Filter):
    """Ignore names of built-in Python symbols.
    """
    def _skip(self, word):
        return word in __builtins__

Finally, ImportableModuleFilter ignores words that match the names of modules found on the import path. It uses imp to search for the module without actually importing it.

class ImportableModuleFilter(Filter):
    """Ignore names of modules that we could import.
    """
    def __init__(self, tokenizer):
        Filter.__init__(self, tokenizer)
        self.found_modules = set()
        self.sought_modules = set()
    def _skip(self, word):
        if word not in self.sought_modules:
            self.sought_modules.add(word)
            try:
                imp.find_module(word)
            except UnicodeEncodeError:
                return False
            except ImportError:
                return False
            else:
                self.found_modules.add(word)
                return True
        return word in self.found_modules

The SpellingBuilder creates the filter stack based on user settings, so the filters can be turned on or off.

filters = [ ContractionFilter,
            EmailFilter,
            ]
if self.config.spelling_ignore_wiki_words:
    filters.append(WikiWordFilter)
if self.config.spelling_ignore_acronyms:
    filters.append(AcronymFilter)
if self.config.spelling_ignore_pypi_package_names:
    self.info('Adding package names from PyPI to local spelling dictionary...')
    filters.append(PyPIFilterFactory())
if self.config.spelling_ignore_python_builtins:
    filters.append(PythonBuiltinsFilter)
if self.config.spelling_ignore_importable_modules:
    filters.append(ImportableModuleFilter)
filters.extend(self.config.spelling_filters)

Using the Spelling Checker

PyEnchant and sphinxcontrib-spelling should be installed on the import path for the same version of Python that Sphinx is using (refer to the project home page for more details). Then the extension needs to be explicitly enabled for a Sphinx project in order for the builder to be recognized. To enable the extension, add it to the list of extension in conf.py.

extensions = [ 'sphinxcontrib.spelling' ]

The other options can be set in conf.py, as well. For example, to turn on the filter to ignore the names of packages from PyPI, set spelling_add_pypy_package_names to True.

spelling_add_pypi_package_names = True

Because the spelling checker is integrated with Sphinx using a new builder class, it is not run when the HTML or LaTeX builders run. Instead, it needs to run as a separate phase of the build by passing the -b option to sphinx-build. The output shows each document name as it is processed, and if there are any errors the line number and misspelled word is shown. When spelling_show_suggestions is True, proposed corrections are included in the output.

$ sphinx-build -b spelling -d build/doctrees source build/spelling
...
writing output... [ 31%] articles/how-tos/sphinxcontrib-spelling/index
(line 255) mispelling ["misspelling", "dispelling", "mi spelling",
"spelling", "compelling", "impelling", "rappelling"]
...

See Also