sphinxcontrib-spelling 2.1.2 – Sphinx spelling extension

What is sphinxcontrib.spelling?

sphinxcontrib.spelling is a spelling checker for Sphinx. It uses
PyEnchant to produce a report showing misspelled words.

What’s New in This Release?

  • Fix an issue with six under Python 3.4.

Installing

Please see the documentation for details.

sphinxcontrib.spelling 2.1.1

What’s New in This Release?

This is a point release with some packaging fixes for Debian and a
small code change to improve string handling.

  • remove announce.rst; moved to blogging repository
  • Merged in eriol/sphinxcontrib-spelling (pull request #2)
  • Removed no more used CHANGES file
  • Updated path of test_wordlist.txt
  • Merged in bmispelon/sphinxcontrib-spelling/isupper (pull request #1)
  • Use str.isupper() instead of ad-hoc method
  • fix syntax for tags directive

sphinxcontrib.spelling 2.1

What’s New in This Release?

  • Fix unicode error in PythonBuiltinsFilter.
  • Make error output useful in emacs compiler mode
  • Only show the words being added to a local dictionary if debugging
    is enabled.

sphinxcontrib.spelling 2.0

What’s New in This Release?

  • Add Python 3.3 support.
  • Add PyPy support.
  • Use pbr for packaging.
  • Update tox config to work with forked version of PyEnchant until
    changes are accepted upstream.

sphinxcontrib.spelling 1.3

What’s New in 1.3?

This update changes the output format to include the document name
with each misspelled word. It also fixes a bug processing some edge
cases in the input parse tree.

sphinxcontrib.spelling 1.2

What’s New in 1.2?

This update checks the spelling of document titles and section headers
as well as the body of the document. It also fixes a packaging issue
that prevented the tests from working when run directly from the sdist
available on PyPI.

sphinxcontrib.spelling 1.1

What’s New in 1.1?

This point update includes new filters to ignore words commonly
encountered in software documentation and other writing about computer
programs. These include Python language built-ins, importable modules,
words that match the names of packages on the Python Package Index,
CamelCase words, and acronyms. There is also a new spelling
directive for creating a local word list within a document.

Creating a Spelling Checker for reStructuredText Documents

I write a lot using reStructuredText files as the source format,
largely because of the ease of automating the tools used to convert
reST to other formats. The number of files involved has grown to the
point that some of the post-writing tasks were becoming tedious, and I
was skipping steps like running the spelling checker. I finally
decided to do something about that by creating a spelling checker
plugin for Sphinx, released as *sphinxcontrib-spelling*.

I have written about *why I chose reST* before. All of the articles
on this site, including the Python Module of the Week series,
started out as .rst files. I also use Sphinx to produce
several developer manuals at my day job. I like reST and Sphinx
because they both can be extended to meet new needs easily. One area
that has been lacking, though, is support for a spelling checker.

Checking the spelling of the contents of an individual
reStructuredText file from within a text editor like Aquamacs is
straightforward, but I have on the order of 200 separate files making
up parts of this site alone, not to mention *my book*. Manually
checking each file, one at a time, is a tedious job, and not one I
perform very often. After finding a few typos recently, I decided I
needed to take care of the problem by using automation to eliminate
the drudgery and make it easier to run the spelling checker regularly.

The files are already configured to be processed by Sphinx when they
are converted to HTML and PDF format, so that seemed like a natural
way to handle the spelling checker, too. To add a step to the build to
check the spelling of every file, I would need two new tools: an
extension to Sphinx to drive the spelling checker and the spelling
checker itself. I did not find any existing Sphinx extensions that
checked spelling, so I decided to write my own. The first step was to
evaluate spelling checkers.

Choosing a Spelling Checker

I recently read Peter Norvig’s article How to Write a Spelling
Corrector, which shows how to create a spelling checker from scratch
in Python. As with most nontrivial applications, though, the algorithm
for testing the words is only part of the story when looking at a
spelling checker. An equally important aspect is the dictionary of
words known to be spelled correctly. Without a good dictionary, the
algorithm would flag too many correctly spelled words as errors. Not wanting to build
my own dictionary, I decided to investigate existing spelling checkers
and concentrate on writing the interface layer to connect them to
Sphinx.

There are several open source spelling checkers with Python bindings.
I evaluated aspell-python and PyEnchant (bindings for
enchant, the spelling checker from the AbiWord project). Both tools
required some manual setup to get the engine working. The
aspell-python API was simple to use, but I decided to use PyEnchant
instead. It has an active development group and is more extensible
(with APIs to define alternate dictionaries, tokenizers, and filters).

Installing PyEnchant

I started out by trying to install enchant and PyEnchant from source
under OS X with Python 2.7, but eventually gave up after having to
download several dependencies just to get configure to run for
enchant. I stuck with PyEnchant as a solution because installing
aspell was not really any easier (the installation experience for both
tools could be improved). The simplest solution for OS X and Windows
is to use the platform-specific binary installers for PyEnchant (not
the .egg), since they include all of the dependencies. That means it
cannot be installed into a virtualenv, but I was willing to live with
that for the sake of having any solution at all.

Linux platforms can probably install enchant via RPM or other system
package, so it is less of a challenge to get PyEnchant working there,
and it may even work with pip.

Using PyEnchant

There are several good examples in the PyEnchant tutorial, and I
will not repeat them here. I will cover some of the concepts, though,
as part of explaining the implementation of the new extension.

The PyEnchant API is organized around a “dictionary,” which can be
loaded at runtime based on a language name. Enchant does some work to
try to determine the correct language automatically based on the
environment settings, but I found it more reliable to set the language
explicitly. After the dictionary is loaded, its check() method can
be used to test whether a word is correct or not. For incorrect words,
the suggest() method returns a list of possible alternatives,
sorted by the likelihood they are the intended word.
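
To make the check()/suggest() pairing concrete without installing
enchant, here is a small stand-in written with the standard library.
It is a sketch, not PyEnchant: ToyDictionary, its word list, and the
use of difflib for ranking are all assumptions made for illustration.

```python
import difflib


class ToyDictionary:
    """Stand-in for a spelling dictionary (NOT the PyEnchant implementation)."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def check(self, word):
        # True when the word is known, False otherwise.
        return word.lower() in self._words

    def suggest(self, word, limit=5):
        # Rank known words by similarity to the unknown word, best match first.
        return difflib.get_close_matches(
            word.lower(), sorted(self._words), n=limit)


d = ToyDictionary(['spelling', 'checker', 'sphinx', 'misspelling'])
print(d.check('spelling'))    # True
print(d.check('mispelling'))  # False
print(d.suggest('mispelling'))
```

With the real library, the dictionary would be loaded by language name
instead of from a hand-written list.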

The check() method works well for individual words, but cannot
process paragraphs. PyEnchant provides an API for checking larger
blocks of text, but I chose to use a lower level API instead. In
addition to the dictionary, PyEnchant includes a “tokenizer” API for
splitting text into candidate words to be checked. Using the tokenizer
API means that the new plugin can run some additional tests on words
not found in the dictionary. For example, I plan to provide an option
to ignore “misspelled” words that appear to be the name of an
importable Python module.
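
The idea of tokenizing first and then running extra tests on unknown
words can be sketched with a regular expression. The names here
(tokenize, find_unknown, the sample word sets) are hypothetical, and
the regular expression is far simpler than a language-aware tokenizer.

```python
import re

# Letters, optionally joined by internal apostrophes ("can't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def tokenize(text):
    """Yield (word, offset) pairs for each candidate word in the text."""
    for match in WORD_RE.finditer(text):
        yield match.group(), match.start()


def find_unknown(text, known_words, extra_tests=()):
    """Yield words that fail the dictionary lookup and every extra test."""
    for word, pos in tokenize(text):
        if word.lower() in known_words:
            continue
        if any(test(word) for test in extra_tests):
            continue
        yield word, pos


known = {'the', 'module', 'is'}
module_names = {'json'}  # assumed list of names to forgive
unknown = list(find_unknown('The json module is usefull', known,
                            extra_tests=[lambda w: w in module_names]))
print(unknown)  # [('usefull', 19)]
```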

Integrating with Sphinx

The Sphinx Extension API includes several ways to add new features
to Sphinx, including markup roles, language domains, processing
events, and directives. I chose to create a new “builder” class,
because that would give me complete control over the way the document
is processed. The builder API works with a parsed document to create
output, usually in a format like HTML or PDF. In this case, the
SpellingBuilder does not generate any output files. It prints the
list of misspelled words to standard output, and includes the headings
showing where the words appear in the document.

The first step in creating the new extension is to define a
setup() function to be invoked when the module is loaded. The
function receives as argument an instance of the Sphinx
application, ready to be configured. In
sphinxcontrib.spelling.setup(), the new builder and several
configuration options are added to the application. Although the
Sphinx configuration file can contain any Python code, only the
explicitly registered configuration settings affect the way the
environment is saved.

def setup(app):
    app.info('Initializing Spelling Checker')
    app.add_builder(SpellingBuilder)
    # Report guesses about correct spelling
    app.add_config_value('spelling_show_suggestions', False, 'env')
    # Set the language for the text
    app.add_config_value('spelling_lang', 'en_US', 'env')
    # Set a user-provided list of words known to be spelled properly
    app.add_config_value('spelling_word_list_filename', 'spelling_wordlist.txt', 'env')
    # Assume anything that looks like a PyPI package name is spelled properly
    app.add_config_value('spelling_ignore_pypi_package_names', False, 'env')
    # Assume words that look like wiki page names are spelled properly
    app.add_config_value('spelling_ignore_wiki_words', True, 'env')
    # Assume words that are all caps, or all caps with trailing s, are spelled properly
    app.add_config_value('spelling_ignore_acronyms', True, 'env')
    # Assume words that are part of __builtins__ are spelled properly
    app.add_config_value('spelling_ignore_python_builtins', True, 'env')
    # Assume words that look like the names of importable modules are spelled properly
    app.add_config_value('spelling_ignore_importable_modules', True, 'env')
    # Add any user-defined filter classes
    app.add_config_value('spelling_filters', [], 'env')
    # Register the 'spelling' directive for setting parameters within a document
    rst.directives.register_directive('spelling', SpellingDirective)
    return

The builder class is derived from sphinx.builders.Builder. The
important method is write_doc(), which processes the parsed
documents and saves the messages with unknown words to the output
file.

def write_doc(self, docname, doctree):
    self.checker.push_filters(self.env.spelling_document_filters[docname])

    for node in doctree.traverse(docutils.nodes.Text):
        if node.tagname == '#text' and node.parent.tagname in TEXT_NODES:

            # Figure out the line number for this node by climbing the
            # tree until we find a node that has a line number.
            lineno = None
            parent = node
            seen = set()
            while lineno is None:
                #self.info('looking for line number on %r' % node)
                seen.add(parent)
                # Move up one level on each pass.
                parent = parent.parent
                if parent is None or parent in seen:
                    break
                lineno = parent.line
            filename = self.env.doc2path(docname, base=None)

            # Check the text of the node.
            for word, suggestions in self.checker.check(node.astext()):
                msg_parts = []
                if lineno:
                    msg_parts.append(darkgreen('(line %3d)' % lineno))
                msg_parts.append(red(word))
                msg_parts.append(self.format_suggestions(suggestions))
                msg = ' '.join(msg_parts)
                self.info(msg)
                self.output.write(u"%s:%s: (%s) %s\n" % (
                        self.env.doc2path(docname, None),
                        lineno, word,
                        self.format_suggestions(suggestions),
                        ))

                # We found at least one bad spelling, so set the status
                # code for the app to a value that indicates an error.
                self.app.statuscode = 1

    self.checker.pop_filters()
    return
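
The parent-climbing idea behind the line-number lookup can be tried in
isolation. This sketch uses a hypothetical Node class with only line
and parent attributes instead of real docutils nodes.

```python
class Node:
    """Minimal stand-in for a parse tree node (not a docutils class)."""

    def __init__(self, line=None, parent=None):
        self.line = line
        self.parent = parent


def find_line(node):
    """Return the line number of the nearest ancestor that has one."""
    current = node.parent
    while current is not None:
        if current.line is not None:
            return current.line
        current = current.parent
    return None


section = Node(line=12)
paragraph = Node(parent=section)  # no line number of its own
text = Node(parent=paragraph)
print(find_line(text))    # 12
print(find_line(Node()))  # None
```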

The builder traverses all of the text nodes, skipping over formatting
nodes and container nodes that contain no text. Each node is converted
to plain text using its astext() method, and the text is given to
the SpellingChecker to be parsed and checked.

class SpellingChecker(object):
    """Checks the spelling of blocks of text.

    Uses options defined in the sphinx configuration file to control
    the checking and filtering behavior.
    """

    def __init__(self, lang, suggest, word_list_filename, filters=[]):
        self.dictionary = enchant.DictWithPWL(lang, word_list_filename)
        self.tokenizer = get_tokenizer(lang, filters)
        self.original_tokenizer = self.tokenizer
        self.suggest = suggest

    def push_filters(self, new_filters):
        """Add a filter to the tokenizer chain.
        """
        t = self.tokenizer
        for f in new_filters:
            t = f(t)
        self.tokenizer = t

    def pop_filters(self):
        """Remove the filters pushed during the last call to push_filters().
        """
        self.tokenizer = self.original_tokenizer

    def check(self, text):
        """Generator function that yields bad words and suggested alternate spellings.
        """
        for word, pos in self.tokenizer(text):
            correct = self.dictionary.check(word)
            if correct:
                continue
            yield word, self.dictionary.suggest(word) if self.suggest else []
        return

Finding Words in the Input Text

The blocks of text from the nodes are parsed using a language-specific
tokenizer provided by PyEnchant. The text is split into words, and
then each word is passed through a series of filters. The API defined
by enchant.tokenize.Filter supports two behaviors. Based on the
return value from _skip(), the word might be ignored entirely and
never returned by the tokenizer. Alternatively, the _split()
method can return a modified version of the text.
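
The two behaviors can be sketched with a toy base class. This is not
enchant.tokenize.Filter: SketchFilter works on a plain list of
(word, position) pairs, and its _split() returns a list of strings
rather than a tokenizer. The subclasses are made up for the example.

```python
class SketchFilter:
    """Toy filter wrapping an iterable of (word, pos) pairs."""

    def __init__(self, tokens):
        self._tokens = tokens

    def _skip(self, word):
        # Return True to drop the word entirely.
        return False

    def _split(self, word):
        # Return the pieces that should be checked in place of the word.
        return [word]

    def __iter__(self):
        for word, pos in self._tokens:
            if self._skip(word):
                continue
            for piece in self._split(word):
                yield piece, pos


class SkipNumbers(SketchFilter):
    def _skip(self, word):
        return word.isdigit()


class SplitHyphens(SketchFilter):
    def _split(self, word):
        return word.split('-')


tokens = [('spell-check', 0), ('2011', 12), ('sphinx', 17)]
result = list(SplitHyphens(SkipNumbers(tokens)))
print(result)  # [('spell', 0), ('check', 0), ('sphinx', 17)]
```

Each filter wraps the one before it, so the order in which they are
stacked decides which filter sees a word first.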

In addition to the filters for email addresses and “wiki words”
provided by PyEnchant, sphinxcontrib-spelling includes several
others. The AcronymFilter tells the tokenizer to skip words that
use all uppercase letters.

class AcronymFilter(Filter):
    """If a word looks like an acronym (all upper case letters),
    ignore it.
    """
    def _skip(self, word):
        return (word == word.upper() # all caps
                or
                # pluralized acronym ("URLs")
                (word[-1].lower() == 's'
                 and
                 word[:-1] == word[:-1].upper()
                 )
                )
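
For experimenting outside of a tokenizer, the same test can be written
as a plain function. This is a sketch rather than the extension's
code: looks_like_acronym is a hypothetical name, and it uses
str.isupper() (the approach mentioned in the 2.1.1 release notes) in
place of the comparison with word.upper().

```python
def looks_like_acronym(word):
    """Return True for all-caps words, with or without a trailing 's'."""
    if word.isupper():
        return True
    # Pluralized acronym such as "URLs": drop the final "s", test the rest.
    return len(word) > 1 and word[-1] == 's' and word[:-1].isupper()


for candidate in ['HTTP', 'URLs', 'Sphinx', 'words']:
    print(candidate, looks_like_acronym(candidate))
```

One difference worth knowing: str.isupper() is False for strings with
no cased characters, such as "2011", while comparing a string with its
upper() form is True for them.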

The ContractionFilter expands common English contractions
that might appear in less formal blog posts.

class list_tokenize(tokenize):
    def __init__(self, words):
        tokenize.__init__(self, '')
        self._words = words
    def next(self):
        if not self._words:
            raise StopIteration()
        word = self._words.pop(0)
        return (word, 0)

class ContractionFilter(Filter):
    """Strip common contractions from words.
    """
    splits = {
        "won't":['will', 'not'],
        "isn't":['is', 'not'],
        "can't":['can', 'not'],
        "i'm":['I', 'am'],
        }
    def _split(self, word):
        # Fixed responses
        if word.lower() in self.splits:
            return list_tokenize(self.splits[word.lower()])

        # Possessive
        if word.lower().endswith("'s"):
            return unit_tokenize(word[:-2])

        # * not
        if word.lower().endswith("n't"):
            return unit_tokenize(word[:-3])

        return unit_tokenize(word)
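
The splitting rules are easier to see when they return plain lists.
The function below is a sketch that mirrors the three cases (fixed
responses, possessives, and "n't" endings); expand is a hypothetical
name and the function is not part of the extension.

```python
SPLITS = {
    "won't": ['will', 'not'],
    "isn't": ['is', 'not'],
    "can't": ['can', 'not'],
    "i'm": ['I', 'am'],
}


def expand(word):
    """Return the list of words to check in place of a contraction."""
    lowered = word.lower()
    if lowered in SPLITS:
        return list(SPLITS[lowered])
    if lowered.endswith("'s"):   # possessive: check the owner only
        return [word[:-2]]
    if lowered.endswith("n't"):  # "doesn't" -> "does"
        return [word[:-3]]
    return [word]


print(expand("Won't"))    # ['will', 'not']
print(expand("Doug's"))   # ['Doug']
print(expand("doesn't"))  # ['does']
```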

Because I write about Python a lot, I tend to use the names of
projects that appear on the Python Package Index
(PyPI). PyPIFilterFactory fetches a list of the packages from
the index and then sets up a filter to ignore all of them.

class IgnoreWordsFilter(Filter):
    """Given a set of words, ignore them all.
    """
    def __init__(self, tokenizer, word_set):
        self.word_set = set(word_set)
        Filter.__init__(self, tokenizer)
    def _skip(self, word):
        return word in self.word_set

class IgnoreWordsFilterFactory(object):
    def __init__(self, words):
        self.words = words
    def __call__(self, tokenizer):
        return IgnoreWordsFilter(tokenizer, self.words)

class PyPIFilterFactory(IgnoreWordsFilterFactory):
    """Build an IgnoreWordsFilter for all of the names of packages on PyPI.
    """
    def __init__(self):
        client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
        IgnoreWordsFilterFactory.__init__(self, client.list_packages())

PythonBuiltinsFilter ignores functions built into the Python
interpreter.

class PythonBuiltinsFilter(Filter):
    """Ignore names of built-in Python symbols.
    """
    def _skip(self, word):
        return word in __builtins__
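
In CPython, __builtins__ can be either the builtins module or its
dictionary, depending on where the code runs, so the membership test
above relies on being evaluated inside an imported module. A sketch
that avoids the distinction under Python 3 asks the builtins module
directly. The function name is hypothetical.

```python
import builtins


def is_builtin_name(word):
    """Return True when the word names something in the builtins module."""
    return hasattr(builtins, word)


for candidate in ['len', 'print', 'sphinx']:
    print(candidate, is_builtin_name(candidate))
```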

Finally, ImportableModuleFilter ignores words that match the
names of modules found on the import path. It uses imp to search
for the module without actually importing it.

class ImportableModuleFilter(Filter):
    """Ignore names of modules that we could import.
    """
    def __init__(self, tokenizer):
        Filter.__init__(self, tokenizer)
        self.found_modules = set()
        self.sought_modules = set()
    def _skip(self, word):
        if word not in self.sought_modules:
            self.sought_modules.add(word)
            try:
                imp.find_module(word)
            except UnicodeEncodeError:
                return False
            except ImportError:
                return False
            else:
                self.found_modules.add(word)
                return True
        return word in self.found_modules

The SpellingBuilder creates the filter stack based on user
settings, so the filters can be turned on or off.

filters = [ ContractionFilter,
            EmailFilter,
            ]
if self.config.spelling_ignore_wiki_words:
    filters.append(WikiWordFilter)
if self.config.spelling_ignore_acronyms:
    filters.append(AcronymFilter)
if self.config.spelling_ignore_pypi_package_names:
    self.info('Adding package names from PyPI to local spelling dictionary...')
    filters.append(PyPIFilterFactory())
if self.config.spelling_ignore_python_builtins:
    filters.append(PythonBuiltinsFilter)
if self.config.spelling_ignore_importable_modules:
    filters.append(ImportableModuleFilter)
filters.extend(self.config.spelling_filters)

Using the Spelling Checker

PyEnchant and sphinxcontrib-spelling should be installed on the
import path for the same version of Python that Sphinx is using (refer
to the *project home page* for more details). Then the extension
needs to be explicitly enabled for a Sphinx project in order for the
builder to be recognized. To enable the extension, add it to the list
of extensions in conf.py.

extensions = [ 'sphinxcontrib.spelling' ]

The other options can be set in conf.py, as well. For example, to
turn on the filter to ignore the names of packages from PyPI, set
spelling_ignore_pypi_package_names to True.

spelling_ignore_pypi_package_names = True
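
Putting the pieces together, a conf.py fragment might look like the
following. The option names are the ones registered in setup(); the
values are only examples, and the rest of the project configuration is
omitted.

```python
# conf.py (fragment): spelling-related settings only

extensions = ['sphinxcontrib.spelling']

# Language of the dictionary used for checking.
spelling_lang = 'en_US'

# Include proposed corrections in the report.
spelling_show_suggestions = True

# Project-specific words that should be treated as correct.
spelling_word_list_filename = 'spelling_wordlist.txt'

# Optional filters; the PyPI filter is off by default.
spelling_ignore_pypi_package_names = True
spelling_ignore_acronyms = True
```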

Because the spelling checker is integrated with Sphinx using a new
builder class, it is not run when the HTML or LaTeX builders
run. Instead, it needs to run as a separate phase of the build by
passing the -b option to sphinx-build. The output shows each
document name as it is processed, and if there are any errors the line
number and misspelled word are shown. When
spelling_show_suggestions is True, proposed corrections are
included in the output.

$ sphinx-build -b spelling -d build/doctrees source build/spelling
...
writing output... [ 31%] articles/how-tos/sphinxcontrib-spelling/index
(line 255) mispelling ["misspelling", "dispelling", "mi spelling",
"spelling", "compelling", "impelling", "rappelling"]
...
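
Since the builder sets the application status code when it finds a
misspelled word, the same command can be driven from a script. This is
a sketch only: the helper names and the default directory layout are
assumptions, and sphinx-build must be on the path for the second
function to work.

```python
import subprocess


def spelling_command(source_dir, build_dir):
    """Build the sphinx-build argument list for the spelling builder."""
    return [
        'sphinx-build', '-b', 'spelling',
        '-d', build_dir + '/doctrees',
        source_dir, build_dir + '/spelling',
    ]


def run_spelling_check(source_dir='source', build_dir='build'):
    """Run the spelling builder; True means the build exited with status 0."""
    return subprocess.call(spelling_command(source_dir, build_dir)) == 0


print(' '.join(spelling_command('source', 'build')))
```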

See Also

PyEnchant
Python interface to enchant.
*sphinxcontrib-spelling*
Project home page for the spelling checker.
sphinxcontrib
BitBucket repository for sphinxcontrib-spelling and several other
Sphinx extensions.
Sphinx Extension API
Describes methods for extending Sphinx.
*Defining Custom Roles in Sphinx*
Describes another way to extend Sphinx by modifying the
reStructuredText syntax.