Creating a Spelling Checker for reStructuredText Documents
I write a lot using reStructuredText files as the source format, largely because of the ease of automating the tools used to convert reST to other formats. The number of files involved has grown to the point that some of the post-writing tasks were becoming tedious, and I was skipping steps like running the spelling checker. I finally decided to do something about that by creating a spelling checker plugin for Sphinx, released as sphinxcontrib-spelling.
I have written about why I chose reST before. All of the articles on this site, including the Python Module of the Week series, started out as .rst files. I also use Sphinx to produce several developer manuals at my day job. I like reST and Sphinx because they both can be extended easily to meet new needs. One area that has been lacking, though, is support for a spelling checker.
Checking the spelling of the contents of an individual reStructuredText file from within a text editor like Aquamacs is straightforward, but I have on the order of 200 separate files making up parts of this site alone, not to mention my book. Manually checking each file, one at a time, is a tedious job, and not one I perform very often. After finding a few typos recently, I decided I needed to take care of the problem by using automation to eliminate the drudgery and make it easier to run the spelling checker regularly.
The files are already configured to be processed by Sphinx when they are converted to HTML and PDF format, so that seemed like a natural way to handle the spelling checker, too. To add a step to the build to check the spelling of every file, I would need two new tools: an extension to Sphinx to drive the spelling checker and the spelling checker itself. I did not find any existing Sphinx extensions that checked spelling, so I decided to write my own. The first step was to evaluate spelling checkers.
Choosing a Spelling Checker
I recently read Peter Norvig’s article How to Write a Spelling Corrector, which shows how to create a spelling checker from scratch in Python. As with most nontrivial applications, though, the algorithm for testing the words is only part of the story when looking at a spelling checker. An equally important aspect is the dictionary of words known to be spelled correctly. Without a good dictionary, the algorithm would report too many false negatives. Not wanting to build my own dictionary, I decided to investigate existing spelling checkers and concentrate on writing the interface layer to connect them to Sphinx.
There are several open source spelling checkers with Python bindings. I evaluated aspell-python and PyEnchant (bindings for enchant, the spelling checker from the AbiWord project). Both tools required some manual setup to get the engine working. The aspell-python API was simple to use, but I decided to use PyEnchant instead. It has an active development group and is more extensible (with APIs to define alternate dictionaries, tokenizers, and filters).
Installing PyEnchant
I started out by trying to install enchant and PyEnchant from source under OS X with Python 2.7, but eventually gave up after having to download several dependencies just to get configure
to run for enchant. I stuck with PyEnchant as a solution because installing aspell
was not really any easier (the installation experience for both tools could be improved). The simplest solution for OS X and Windows is to use the platform-specific binary installers for PyEnchant (not the .egg), since they include all of the dependencies. That means it cannot be installed into a virtualenv, but I was willing to live with that for the sake of having any solution at all.
Linux platforms can probably install enchant via RPM or other system package, so it is less of a challenge to get PyEnchant working there, and it may even work with pip.
Using PyEnchant
There are several good examples in the PyEnchant tutorial, and I will not repeat them here. I will cover some of the concepts, though, as part of explaining the implementation of the new extension.
The PyEnchant API is organized around a “dictionary,” which can be loaded at runtime based on a language name. Enchant does some work to try to determine the correct language automatically based on the environment settings, but I found it more reliable to set the language explicitly. After the dictionary is loaded, its check()
method can be used to test whether a word is correct or not. For incorrect words, the suggest()
method returns a list of possible alternatives, sorted by the likelihood they are the intended word.
The check()
method works well for individual words, but cannot process paragraphs. PyEnchant provides an API for checking larger blocks of text, but I chose to use a lower level API instead. In addition to the dictionary, PyEnchant includes a “tokenizer” API for splitting text into candidate words to be checked. Using the tokenizer API means that the new plugin can run some additional tests on words not found in the dictionary. For example, I plan to provide an option to ignore “misspelled” words that appear to be the name of an importable Python module.
Integrating with Sphinx
The Sphinx Extension API includes several ways to add new features to Sphinx, including markup roles, language domains, processing events, and directives. I chose to create a new “builder” class, because that would give me complete control over the way the document is processed. The builder API works with a parsed document to create output, usually in a format like HTML or PDF. In this case, the SpellingBuilder
does not generate any output files. It prints the list of misspelled words to standard output, and includes the headings showing where the words appear in the document.
The first step in creating the new extension is to define a setup() function to be invoked when the module is loaded. The function receives an instance of the Sphinx application as its argument, ready to be configured. In sphinxcontrib.spelling.setup(), the new builder and several configuration options are added to the application. Although the Sphinx configuration file can contain any Python code, only the explicitly registered configuration settings affect the way the environment is saved.
def setup(app):
    app.info('Initializing Spelling Checker')
    app.add_builder(SpellingBuilder)
    # Report guesses about correct spelling
    app.add_config_value('spelling_show_suggestions', False, 'env')
    # Set the language for the text
    app.add_config_value('spelling_lang', 'en_US', 'env')
    # Set a user-provided list of words known to be spelled properly
    app.add_config_value('spelling_word_list_filename', 'spelling_wordlist.txt', 'env')
    # Assume anything that looks like a PyPI package name is spelled properly
    app.add_config_value('spelling_ignore_pypi_package_names', False, 'env')
    # Assume words that look like wiki page names are spelled properly
    app.add_config_value('spelling_ignore_wiki_words', True, 'env')
    # Assume words that are all caps, or all caps with trailing s, are spelled properly
    app.add_config_value('spelling_ignore_acronyms', True, 'env')
    # Assume words that are part of __builtins__ are spelled properly
    app.add_config_value('spelling_ignore_python_builtins', True, 'env')
    # Assume words that look like the names of importable modules are spelled properly
    app.add_config_value('spelling_ignore_importable_modules', True, 'env')
    # Add any user-defined filter classes
    app.add_config_value('spelling_filters', [], 'env')
    # Register the 'spelling' directive for setting parameters within a document
    rst.directives.register_directive('spelling', SpellingDirective)
    return
The builder class is derived from sphinx.builders.Builder
. The important method is write_doc()
, which processes the parsed documents and saves the messages with unknown words to the output file.
def write_doc(self, docname, doctree):
    self.checker.push_filters(self.env.spelling_document_filters[docname])
    for node in doctree.traverse(docutils.nodes.Text):
        if node.tagname == '#text' and node.parent.tagname in TEXT_NODES:
            # Figure out the line number for this node by climbing the
            # tree until we find a node that has a line number.
            lineno = None
            parent = node
            seen = set()
            while lineno is None:
                #self.info('looking for line number on %r' % node)
                seen.add(parent)
                parent = parent.parent
                if parent is None or parent in seen:
                    break
                lineno = parent.line
            filename = self.env.doc2path(docname, base=None)
            # Check the text of the node.
            for word, suggestions in self.checker.check(node.astext()):
                msg_parts = []
                if lineno:
                    msg_parts.append(darkgreen('(line %3d)' % lineno))
                msg_parts.append(red(word))
                msg_parts.append(self.format_suggestions(suggestions))
                msg = ' '.join(msg_parts)
                self.info(msg)
                self.output.write(u"%s:%s: (%s) %s\n" % (
                    self.env.doc2path(docname, None),
                    lineno, word,
                    self.format_suggestions(suggestions),
                    ))
                # We found at least one bad spelling, so set the status
                # code for the app to a value that indicates an error.
                self.app.statuscode = 1
    self.checker.pop_filters()
    return
The builder traverses all of the text nodes, skipping over formatting
nodes and container nodes that contain no text. Each node is converted
to plain text using its astext()
method, and the text is given to
the SpellingChecker
to be parsed and checked.
class SpellingChecker(object):
    """Checks the spelling of blocks of text.

    Uses options defined in the sphinx configuration file to control
    the checking and filtering behavior.
    """

    def __init__(self, lang, suggest, word_list_filename, filters=[]):
        self.dictionary = enchant.DictWithPWL(lang, word_list_filename)
        self.tokenizer = get_tokenizer(lang, filters)
        self.original_tokenizer = self.tokenizer
        self.suggest = suggest

    def push_filters(self, new_filters):
        """Add a filter to the tokenizer chain.
        """
        t = self.tokenizer
        for f in new_filters:
            t = f(t)
        self.tokenizer = t

    def pop_filters(self):
        """Remove the filters pushed during the last call to push_filters().
        """
        self.tokenizer = self.original_tokenizer

    def check(self, text):
        """Generator function that yields bad words and suggested alternate spellings.
        """
        for word, pos in self.tokenizer(text):
            correct = self.dictionary.check(word)
            if correct:
                continue
            yield word, self.dictionary.suggest(word) if self.suggest else []
        return
Finding Words in the Input Text
The blocks of text from the nodes are parsed using a language-specific
tokenizer provided by PyEnchant. The text is split into words, and
then each word is passed through a series of filters. The API defined
by enchant.tokenize.Filter
supports two behaviors. Based on the
return value from _skip()
, the word might be ignored entirely and
never returned by the tokenizer. Alternatively, the _split()
method
can return a modified version of the text.
In addition to the filters for email addresses and “wiki words”
provided by PyEnchant, sphinxcontrib-spelling
includes several
others. The AcronymFilter
tells the tokenizer to skip words that use
all uppercase letters.
class AcronymFilter(Filter):
    """If a word looks like an acronym (all upper case letters),
    ignore it.
    """
    def _skip(self, word):
        return (word == word.upper()  # all caps
                or
                # pluralized acronym ("URLs")
                (word[-1].lower() == 's'
                 and
                 word[:-1] == word[:-1].upper()
                 )
                )
The ContractionFilter
expands common English contractions that might
appear in less formal blog posts.
class list_tokenize(tokenize):

    def __init__(self, words):
        tokenize.__init__(self, '')
        self._words = words

    def next(self):
        if not self._words:
            raise StopIteration()
        word = self._words.pop(0)
        return (word, 0)


class ContractionFilter(Filter):
    """Strip common contractions from words.
    """
    splits = {
        "won't": ['will', 'not'],
        "isn't": ['is', 'not'],
        "can't": ['can', 'not'],
        "i'm": ['I', 'am'],
        }

    def _split(self, word):
        # Fixed responses
        if word.lower() in self.splits:
            return list_tokenize(self.splits[word.lower()])
        # Possessive
        if word.lower().endswith("'s"):
            return unit_tokenize(word[:-2])
        # * not
        if word.lower().endswith("n't"):
            return unit_tokenize(word[:-3])
        return unit_tokenize(word)
Because I write about Python a lot, I tend to use the names of
projects that appear on the Python Package Index
(PyPI). PyPiFilterFactory
fetches a list of the packages from the
index and then sets up a filter to ignore all of them.
class IgnoreWordsFilter(Filter):
    """Given a set of words, ignore them all.
    """
    def __init__(self, tokenizer, word_set):
        self.word_set = set(word_set)
        Filter.__init__(self, tokenizer)

    def _skip(self, word):
        return word in self.word_set


class IgnoreWordsFilterFactory(object):

    def __init__(self, words):
        self.words = words

    def __call__(self, tokenizer):
        return IgnoreWordsFilter(tokenizer, self.words)


class PyPIFilterFactory(IgnoreWordsFilterFactory):
    """Build an IgnoreWordsFilter for all of the names of packages on PyPI.
    """
    def __init__(self):
        client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
        IgnoreWordsFilterFactory.__init__(self, client.list_packages())
PythonBuiltinsFilter
ignores functions built into the Python interpreter.
class PythonBuiltinsFilter(Filter):
    """Ignore names of built-in Python symbols.
    """
    def _skip(self, word):
        return word in __builtins__
Finally, ImportableModuleFilter
ignores words that match the names
of modules found on the import path. It uses imp to
search for the module without actually importing it.
class ImportableModuleFilter(Filter):
    """Ignore names of modules that we could import.
    """
    def __init__(self, tokenizer):
        Filter.__init__(self, tokenizer)
        self.found_modules = set()
        self.sought_modules = set()

    def _skip(self, word):
        if word not in self.sought_modules:
            self.sought_modules.add(word)
            try:
                imp.find_module(word)
            except UnicodeEncodeError:
                return False
            except ImportError:
                return False
            else:
                self.found_modules.add(word)
                return True
        return word in self.found_modules
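The module probe behind _skip() can be tried independently. On modern Python, importlib.util.find_spec() is the replacement for the deprecated imp.find_module(), and it likewise locates a module without running its code:

```python
import importlib.util

# find_spec() returns a ModuleSpec when the module is on the import path,
# and None when it is not -- without importing the module.
print(importlib.util.find_spec('json') is not None)                 # True
print(importlib.util.find_spec('no_such_module_xyz') is not None)   # False
```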
The SpellingBuilder
creates the filter stack based on user settings,
so the filters can be turned on or off.
filters = [ ContractionFilter,
EmailFilter,
]
if self.config.spelling_ignore_wiki_words:
filters.append(WikiWordFilter)
if self.config.spelling_ignore_acronyms:
filters.append(AcronymFilter)
if self.config.spelling_ignore_pypi_package_names:
self.info('Adding package names from PyPI to local spelling dictionary...')
filters.append(PyPIFilterFactory())
if self.config.spelling_ignore_python_builtins:
filters.append(PythonBuiltinsFilter)
if self.config.spelling_ignore_importable_modules:
filters.append(ImportableModuleFilter)
filters.extend(self.config.spelling_filters)
Using the Spelling Checker
PyEnchant
and sphinxcontrib-spelling
should be installed on the import path
for the same version of Python that Sphinx is using (refer to the
project home page for more
details). Then the extension needs to be explicitly enabled for a
Sphinx project in order for the builder to be recognized. To enable
the extension, add it to the list of extensions in conf.py.
extensions = [ 'sphinxcontrib.spelling' ]
The other options can be set in conf.py as well. For example, to turn on the filter to ignore the names of packages from PyPI, set spelling_ignore_pypi_package_names to True.
spelling_ignore_pypi_package_names = True
Because the spelling checker is integrated with Sphinx using a new builder class, it is not run when the HTML or LaTeX builders run. Instead, it needs to run as a separate phase of the build by passing -b spelling to sphinx-build. The output shows each document name as it is processed, and if there are any errors, the line number and misspelled word are shown. When spelling_show_suggestions is True, proposed corrections are included in the output.
$ sphinx-build -b spelling -d build/doctrees source build/spelling
...
writing output... [ 31%] articles/how-tos/sphinxcontrib-spelling/index
(line 255) mispelling ["misspelling", "dispelling", "mi spelling",
"spelling", "compelling", "impelling", "rappelling"]
...
See Also
- PyEnchant – Python interface to enchant
- sphinxcontrib-spelling – Project home page for the spelling checker.
- sphinxcontrib – BitBucket repository for sphinxcontrib-spelling and several other Sphinx extensions.
- Sphinx Extension API – Describes methods for extending Sphinx.
- Defining Custom Roles in Sphinx – Describes another way to extend Sphinx by modifying the reStructuredText syntax.