Lessons learned from working on large scale, cross-project initiatives in OpenStack

I have been involved with OpenStack development since just before the Folsom summit in 2012. Over the course of that time, I have participated in innumerable discussions about three big features tied to OpenStack logging: translating log messages, adding request IDs to log messages, and adding unique message IDs to log messages. We have had varying degrees of success with the design, implementation, and ongoing maintenance of all three features. The reasons for success or failure in each case provide helpful insight into how to approach changes with large community and product scope, and they are worth considering before our next discussion at the summit/forum in Boston in 2017.

Using Unicode with Sphinx, reStructuredText, and PDF Output

I’m working on updating my book, and besides actually writing the content one of the things I have to do is generate new LaTeX files to deliver to the publisher. I’ve written about my toolchain elsewhere, so I won’t repeat all of that information here. The short version is that I use Paver to drive Sphinx to convert reStructuredText input files to HTML for the website, LaTeX for the compositor at Pearson, and PDF for reviewers. Since the updated version covers Python 3, and one of the key benefits of Python 3 is better Unicode support, I want to include some characters outside of the normal ASCII set in my examples.

When you ask Sphinx’s latex builder to generate LaTeX output the result is a directory with a *.tex file containing your content and some other files with all of the parts you need to convert that LaTeX to other formats, including a Makefile with instructions for building PDF and DVI. By default that Makefile uses pdflatex to convert the *.tex output files it writes to PDF. My article for the random module includes an example of shuffling a card deck. The Python 2 version used letters to represent the card suits, but for Python 3 I switched to using Unicode symbols like what would appear on the cards. Making that work for HTML was easy, but the PDF output proved trickier.
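
To give a flavor of the kind of example involved, here is a minimal sketch (not the book's actual code) using the Unicode suit symbols directly in Python 3 source:

import itertools
import random

SUITS = ['\u2660', '\u2665', '\u2666', '\u2663']  # spade, heart, diamond, club
RANKS = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']

# Build the 52-card deck, shuffle it, and show a few cards.
deck = [rank + suit for suit, rank in itertools.product(SUITS, RANKS)]
random.shuffle(deck)
print(deck[:5])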


Handling High Email Volume with sup

Over the last year, the openstack-dev mailing list has averaged
2500 messages every month. Staying on top of that much email can be
challenging, especially with some of the consumer-grade email clients
available today. I’ve recently upgraded my email setup to use sup, a
terminal-based mail client that is helping me process the mailing
list, and even keep up with gerrit at the same time.

The Story Gets Worse Before It Gets Better

Last summer I moved all of my email processing from Google’s GMail
service to an account under my own domain hosted by FastMail. I liked
GMail’s web interface, but I had grown tired of some incompatibilities
with backup tools and other services. FastMail’s standard IMAP servers
fixed those issues, and I have been happy with the service.

This fall, however, after upgrading my laptop to Yosemite I started
seeing issues with mailing list threads being improperly combined so
that completely unrelated conversations all appear to be part of the
same thread. I traced the problem, thanks to several other Yosemite
users I know, to issues with Mail.app, the MUA that ships with
OS X. At first the problem was isolated to just a few threads. Then it
expanded to a lot of my mailing lists, and finally it started
affecting my inbox. There doesn’t seem to be any way to prevent Mail
from getting confused, or to untangle the threads once it is.

I decided I needed a new mail client, and after reviewing the options
I decided I wasn’t going to find one that met all of my needs. The
problem first showed up with mailing lists, and while not all of the
poorly-threaded messages were from lists, all of the affected threads
did include list messages. I tend to treat mailing lists differently
than my regular mail anyway, automatically filing messages into
folders using mail filters and then batch reading them. So I decided
to set up a second email client just for reading mailing lists.

Looking At Options

One option I considered was going back to using GMail for the mailing
lists. I haven’t completely ruled this out, but I like having all of
my mail in one account so I don’t have to remember which one to use
when sending messages. I also don’t want messages sent directly to me
to end up in a never-never land by dropping them in an inbox I don’t
check frequently.

I used the FastMail web UI for a few months while I researched, and
it’s not terrible, but I felt I could be doing better. The key binding
support isn’t complete, so I do have to mix keyboard and mouse
commands. That left me looking for other desktop clients.

Several community members recommended mutt, but it looked more complex
than I needed. If I were moving all of my email to a new client, mutt
would be a better option; since I am only reading the mailing lists
this way, it felt like overkill.

A couple of people I talked to used sup, a terminal-based
app with the same “archive and forget” behavior as GMail. I like
that workflow, and the relative simplicity of the most basic setup, so
I spent some time this week configuring it to run on my laptop. After
a couple of false starts, mostly because I’m on OS X and the most
recent instructions are for Linux, I have it working and have been
using it for a few days. I’m already happier with the email workflow,
so I’m documenting the steps I went through to set it up in case
someone else wants to give it a try.

Set up offlineimap

The first step for most of the terminal-based clients is to install a
program to sync all of your email from the mail server to local
files. Desktop clients like Mail.app include this feature as a
built-in, but most terminal apps don’t. Offlineimap is a good
solution for syncing both directions, so that messages you read
locally are marked as read on the server as well.

Homebrew includes a package, so the first step is to install the code
through brew:

$ brew update
$ brew install offlineimap

The next step is to set up the configuration file. The documentation
on the project site gives a complete description of all of the
options, including setting up accounts and repositories. I’ll just
focus on a few of the unusual values I changed.

I already have the mail server configured to sort messages into
mailboxes based on the mailing list id, all under an OpenStack
parent folder. I also have some OpenStack mail archive folders there
for messages that didn’t go through a mailing list. I only want those
folders synced, so I set folderfilter to include those and my sent
mail box.

folderfilter = lambda foldername: foldername.startswith('INBOX.OpenStack') or foldername in ['INBOX.Sent Items']

I had trouble adding the resulting maildirs to sup because of the
spaces in the names, so I also added a name translation function to
nametrans to remove the spaces. While I was at it, I removed the
INBOX. prefix.

nametrans = lambda foldername: re.sub(r'^INBOX\.', '', foldername).replace(' ', '_')

The result is a copy of each of my mailboxes saved under ~/Mail in
maildir format. Using maildir instead of mbox means both sup and
offlineimap can modify different messages in the same mailbox without
conflict. It’s also faster to sync the messages than mbox is.

The next step was to run offlineimap to have it download all of my
current messages. This took a while, since I have about 6 months of
list traffic archived (another nice feature of FastMail is the
auto-purge setting on each mailbox, which means I don’t have to delete
old list messages myself).

Doing the initial sync manually will help uncover connection issues or
other configuration problems, but you don’t want to do it by hand
every time you run it. Instead, you want to set up a cron or launchctl
job to sync your mail regularly. The Homebrew package for offlineimap
includes instructions for setting up a launchctl job, so that’s what
I did.

$ ln -sfv /usr/local/opt/offline-imap/*.plist ~/Library/LaunchAgents
$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.offline-imap.plist

Set up msmtp

The IMAP protocol is useful for reading messages and accessing mail
boxes on a server, but to send messages you need an SMTP client. The
sup wiki explains how to configure msmtp, so that’s what I went with.

$ brew install msmtp

FastMail uses authentication for its SMTP servers, so only customers
can use them to send mail directly. msmtp was happy to read the
credentials from my keychain, so I didn’t have to include them in
clear text in the configuration file. The only tricky part of setting
that up is that the msmtp docs were not quite right about how to
create the keychain entry. The account entry in the keychain should
match the user entry from the msmtp configuration file.
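
Putting that together, the configuration file ends up looking
something like this (a minimal sketch; the host and user values are
placeholders):

defaults
auth           on
tls            on

account        fastmail
host           smtp.example.com
port           587
from           doug@doughellmann.com
user           my-username
# No password entry here: msmtp falls back to the keychain, looking
# for an item whose account name matches the "user" value above.

account default : fastmail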

After configuring, I used this command to test:

$ echo test | Mail -s "test on $(date)" my-address@here

I did find that the mail server would deliver some of the messages to
individual addresses, but not to the mailing list. I’m not sure if the
list’s spam filter was blocking them, or if it was happening at
FastMail. I found that setting maildomain to my domain fixed the
problem.

maildomain     doughellmann.com

Set up sup

sup is written in Ruby, so I had to install Ruby and a few other
related tools before it would work. I’m not that familiar with any of
the tools, so I followed instructions I found in the sup
documentation for installing on OS X. It’s possible there is a
simpler way to do this part.

First, install a recent version of ncurses with wide character
support:

$ brew tap homebrew/dupes
$ brew install ncurses
$ brew doctor
$ brew link ncurses --force

Next, following https://rvm.io/rvm/install, install rvm and use it
to install ruby:

$ curl -sSL https://get.rvm.io | bash -s stable --ruby
$ rvm install 2.2.0
$ rvm use 2.2.0

Now it’s possible to install the gems needed by sup. Installing them
all together failed, but installing a couple of the dependencies first
and then installing sup worked fine.

$ gem install xapian-ruby
$ gem install gpgme
$ gem install sup

At this point all of the software is installed, so the next step is to
configure sup to tell it where your email is and how to work with
it. The sup-config command offers a step-by-step setup guide. I
used that for most of the settings, but did not add the mail sources
(more on why in a bit).

$ sup-config

After the basic configuration was done, I edited
~/.sup/config.yaml so that sup would save my outgoing mail to the
Sent_Items mailbox managed by offlineimap and so it would update the
flags on messages (to indicate they had been read, for example) in the
source maildirs. I also updated the sendmail command for the account
to point to msmtp.

:sent_source: maildir:/Users/dhellmann/Mail/Sent_Items
:sync_back_to_maildir: true
:accounts:
  :default:
    :name: Doug Hellmann
    :email: doug@doughellmann.com
    :sendmail: "msmtp -t --read-envelope-from"
    :signature: "/Users/dhellmann/.signature"
    :gpgkey: ''

Next, I used sup-add to add the mail being downloaded by
offlineimap. First I created the source for the sent folder, using
a special option to tell sup not to add messages from that folder to
my inbox:

$ sup-add --archive maildir:/Users/dhellmann/Mail/Sent_Items

Then I added the other mailboxes without the --archive flag,
giving some of them a flag to automatically add labels to the messages
in that mailbox. For example, the mailbox with all of the messages
from gerrit was added with:

$ sup-add -l code-review maildir:/Users/dhellmann/Mail/OpenStack.Code_Review

After all of the mailboxes were added, I ran sup-sync-back-maildir
to reassure sup that it should sync the status of messages back to the
maildir copy, so offlineimap could then update their status on the
mail server.

$ sup-sync-back-maildir -u

And finally I imported all of the messages into sup’s index using
sup-sync. Because all of the messages were read and archived
already, I used the --read and --archive options to set them
that way in the index as well (otherwise I would have had something
like 25,000 unread threads in my inbox).

$ sup-sync --read --archive

Now running sup gives a view of the unprocessed messages. I can
navigate through them using vi keystrokes, kill uninteresting threads,
archive read messages, and reply as needed. See the sup home page for
screenshots and documentation about navigation commands.

Results

Even though all of the messages are separated into different folders
on the mail server, sup shows them all as part of one inbox. This lets
me pay attention to bug report emails and gerrit notifications again,
which I had been ignoring because of the volume. Since they’re
trickling in a few at a time, and interleaved between other messages,
I’m finding it easier to keep up with them. If that changes, I may
drop those folders from my offlineimap configuration and go back to
looking for updates online separately.

Watering Time: Practical Uses for Python’s Calendar Module

The UI for the timer attached to my sprinkler system makes it
difficult to understand exactly when the various sprinklers will
run. All of the data is there, but with only a few controls and a
small LCD screen, there aren’t a lot of presentation options. Using
Python’s calendar module I was able to write a simple program
to format the data to make it easier to identify cases where I might
be over, or under, watering.

The Problem

We don’t have a large yard, but we’ve tried to invest in making it an
attractive place to spend time. A big part of that, especially in the
southeastern US, is keeping plants properly watered through the hot
summer. A couple of years ago, I had an automated irrigation system
installed, with a programmable timer to control the watering schedule.

[Image: the irrigation system’s programmable timer — http://doughellmann.com/blog/wp-content/uploads/2014/05/controllers_00-proc-08.png]

The timer supports three “programs”, each of which can be scheduled to
run on different days of the week or month, at multiple times. Each
program can activate the sprinklers in several “zones” (areas of the
yard), running them for different amounts of time. This is, as far as
I can tell, a pretty standard internal model for one of these timers,
and once you get the hang of it, programming it is pretty
straightforward.

This spring we decided we needed to change the way we were watering a
particularly troublesome spot in the front of the yard, to run the
system for two short cycles instead of one long cycle (the theory
being that this would allow more water to soak in and be used by the
plants in that area). When I examined the current settings, I
discovered that I had also been watering one zone more than I
realized, because it was scheduled in multiple programs. It wasn’t at
all obvious, given the limitations of how the timer shows its
programming, and I only discovered it when I wrote down the entire
schedule to review it. As part of my audit before updating the
schedule, I decided I would write a program to show the schedule on a
calendar, so it would be easier to understand what was happening
without having to perform the calculations in my head.

Designing the Inputs

The first step was to design an input format to represent all of the
data I had in a way that was easy to collect. I chose a YAML format,
since I have lists and mappings of data and using YAML meant I
wouldn’t need to build a separate parser. The first section of the
input file lists the zones, mapping the number used to identify them
in the timer with the name I use for them in my notes.

zones:
  1: turf
  2: f shrubs
  3: b shrubs
  4: patio
  5: garden
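
Because the input is YAML, loading it only takes a couple of lines
with PyYAML (the filename here is hypothetical):

import yaml

with open('watering.yaml') as f:
    data = yaml.safe_load(f)

print data['zones'][1]  # 'turf'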

The remainder of the input file describes the schedule for each
program (named A, B, and C), including the times of day when the
program runs (multiples are allowed), the days of the week when the
program runs, and the zones to be watered and for how long.

For example, program A runs on Monday, Wednesday, and Friday at 4:00
AM, watering the front shrubs for 15 minutes and the back shrubs for
15 minutes.

programs:
  A:
    start:
      - '4:00'
    days: MWF
    zones:
      - zone: 2
        time: 15
      - zone: 3
        time: 15

Zones are identified in the program schedule by number, and although
they can be listed in any order they are always run in numerical
order.

There are two ways to express the rules for determining which days the
program is active. A program can either run on odd or even days of the
month, or any combination of explicitly selected days of the week. I
decided to use “odd” and “even” as literal values for those cases, and
to use one or two letter abbreviations for days (where Tuesday is T,
Thursday is Th, Saturday is Sa, and Sunday is Su).

Designing the Output

I decided to generate output using a monthly calendar format. I’m
likely to be the only user of the program, so I didn’t worry about
generating HTML and opted to use a simple text chart format.

$ wateringtime -c

                                                                                      May

+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+
| Mon                    | Tue                    | Wed                    | Thu                    | Fri                    | Sat                    | Sun                    |
+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+
|                        |                        |                        | (1)                    | (2)                    | (3)                    | (4)                    |
|                        |                        |                        |                        | 03:15-03:30 - f shrubs | 03:00-03:30 - turf     | 03:15-03:30 - f shrubs |
|                        |                        |                        |                        | 03:30-03:45 - b shrubs |                        | 03:30-03:45 - b shrubs |
|                        |                        |                        |                        | 03:45-03:55 - patio    |                        | 03:45-03:55 - patio    |
|                        |                        |                        |                        | 03:55-04:00 - garden   |                        | 03:55-04:00 - garden   |
|                        |                        |                        |                        | 04:00-04:15 - f shrubs |                        |                        |
|                        |                        |                        |                        | 04:15-04:30 - b shrubs |                        |                        |
+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+
| (5)                    | (6)                    | (7)                    | (8)                    | (9)                    | (10)                   | (11)                   |
| 03:00-03:30 - turf     | 03:15-03:30 - f shrubs | 04:00-04:15 - f shrubs | 03:15-03:30 - f shrubs | 04:00-04:15 - f shrubs | 03:00-03:30 - turf     |                        |
| 04:00-04:15 - f shrubs | 03:30-03:45 - b shrubs | 04:15-04:30 - b shrubs | 03:30-03:45 - b shrubs | 04:15-04:30 - b shrubs | 03:15-03:30 - f shrubs |                        |
| 04:15-04:30 - b shrubs | 03:45-03:55 - patio    |                        | 03:45-03:55 - patio    |                        | 03:30-03:45 - b shrubs |                        |
|                        | 03:55-04:00 - garden   |                        | 03:55-04:00 - garden   |                        | 03:45-03:55 - patio    |                        |
|                        |                        |                        |                        |                        | 03:55-04:00 - garden   |                        |
+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+
| (12)                   | (13)                   | (14)                   | (15)                   | (16)                   | (17)                   | (18)                   |
| 03:00-03:30 - turf     |                        | 03:15-03:30 - f shrubs |                        | 03:15-03:30 - f shrubs | 03:00-03:30 - turf     | 03:15-03:30 - f shrubs |
| 03:15-03:30 - f shrubs |                        | 03:30-03:45 - b shrubs |                        | 03:30-03:45 - b shrubs |                        | 03:30-03:45 - b shrubs |
| 03:30-03:45 - b shrubs |                        | 03:45-03:55 - patio    |                        | 03:45-03:55 - patio    |                        | 03:45-03:55 - patio    |
| 03:45-03:55 - patio    |                        | 03:55-04:00 - garden   |                        | 03:55-04:00 - garden   |                        | 03:55-04:00 - garden   |
| 03:55-04:00 - garden   |                        | 04:00-04:15 - f shrubs |                        | 04:00-04:15 - f shrubs |                        |                        |
| 04:00-04:15 - f shrubs |                        | 04:15-04:30 - b shrubs |                        | 04:15-04:30 - b shrubs |                        |                        |
| 04:15-04:30 - b shrubs |                        |                        |                        |                        |                        |                        |
+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+
| (19)                   | (20)                   | (21)                   | (22)                   | (23)                   | (24)                   | (25)                   |
| 03:00-03:30 - turf     | 03:15-03:30 - f shrubs | 04:00-04:15 - f shrubs | 03:15-03:30 - f shrubs | 04:00-04:15 - f shrubs | 03:00-03:30 - turf     |                        |
| 04:00-04:15 - f shrubs | 03:30-03:45 - b shrubs | 04:15-04:30 - b shrubs | 03:30-03:45 - b shrubs | 04:15-04:30 - b shrubs | 03:15-03:30 - f shrubs |                        |
| 04:15-04:30 - b shrubs | 03:45-03:55 - patio    |                        | 03:45-03:55 - patio    |                        | 03:30-03:45 - b shrubs |                        |
|                        | 03:55-04:00 - garden   |                        | 03:55-04:00 - garden   |                        | 03:45-03:55 - patio    |                        |
|                        |                        |                        |                        |                        | 03:55-04:00 - garden   |                        |
+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+
| (26)                   | (27)                   | (28)                   | (29)                   | (30)                   | (31)                   |                        |
| 03:00-03:30 - turf     |                        | 03:15-03:30 - f shrubs |                        | 03:15-03:30 - f shrubs | 03:00-03:30 - turf     |                        |
| 03:15-03:30 - f shrubs |                        | 03:30-03:45 - b shrubs |                        | 03:30-03:45 - b shrubs |                        |                        |
| 03:30-03:45 - b shrubs |                        | 03:45-03:55 - patio    |                        | 03:45-03:55 - patio    |                        |                        |
| 03:45-03:55 - patio    |                        | 03:55-04:00 - garden   |                        | 03:55-04:00 - garden   |                        |                        |
| 03:55-04:00 - garden   |                        | 04:00-04:15 - f shrubs |                        | 04:00-04:15 - f shrubs |                        |                        |
| 04:00-04:15 - f shrubs |                        | 04:15-04:30 - b shrubs |                        | 04:15-04:30 - b shrubs |                        |                        |
| 04:15-04:30 - b shrubs |                        |                        |                        |                        |                        |                        |
+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+

I used Python’s calendar module to find the days of the month as
weeks. I treat Monday as day 0, which leaves the weekend at the end of
each row of output. That’s different than a typical American calendar,
but it works out well because I may have weekends as a special case,
depending on how bad the drought is here in Georgia and whether we
have watering restrictions in place.
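
For example, using May 2014 as in the output above, the first week
comes back as a list of (day-of-month, day-of-week) tuples, with zeros
marking the days that belong to April:

import calendar

cal = calendar.Calendar(calendar.MONDAY)
weeks = cal.monthdays2calendar(2014, 5)
print weeks[0]
# [(0, 0), (0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]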

Building Tables with PrettyTable

In addition to the calendar output mode, I included a simple table
output mode to report the settings in a form that is easy to carry
outside and program into the controller. (I still have to do that step
by hand.)

$ wateringtime
+------+----------+
| Zone | Name     |
+------+----------+
|    1 | turf     |
|    2 | f shrubs |
|    3 | b shrubs |
|    4 | patio    |
|    5 | garden   |
+------+----------+
+---------+-------------+------+-------+
| Program | Start Times | Days | Zones |
+---------+-------------+------+-------+
|    A    |     4:00    | MWF  | 2(15) |
|         |             |      | 3(15) |
|    B    |     3:00    | MSa  | 1(30) |
|    C    |     3:15    | even | 2(15) |
|         |             |      | 3(15) |
|         |             |      | 4(10) |
|         |             |      | 5(5)  |
+---------+-------------+------+-------+

That second table contains all of the information I need to reprogram
the timer quickly. This simpler output mode is implemented in
simple.py in two functions.

show_zones() prints the table with the names and id numbers of
the watering zones, taken from the input YAML file.

def show_zones(data):
    t = prettytable.PrettyTable(
        field_names=('Zone', 'Name'),
        print_empty=False,
    )
    t.padding_width = 1
    t.align['Zone'] = 'r'
    t.align['Name'] = 'l'

    for z in sorted(data['zones'].items()):
        t.add_row(z)

    print t.get_string()

It starts by building a PrettyTable object, configured with
two columns. Then it adds one row at a time to the table, where the
data for each row is held in a tuple with two members. The
get_string() method of the table returns the formatted results,
complete with headings and decorations.

The program list is a little more complex, since some cells of the
table have multiple lines. PrettyTable handles that easily,
but I need to build the multi-line strings myself by combining the
zone data.

def show_programs(data):
    t = prettytable.PrettyTable(
        field_names=('Program', 'Start Times', 'Days', 'Zones'),
        print_empty=False,
    )
    t.padding_width = 1
    t.align['Zones'] = 'l'

    for p, pdata in sorted(data['programs'].items()):
        zones = '\n'.join('%(zone)s(%(time)s)' % z
                          for z in sorted(pdata['zones'],
                                          key=operator.itemgetter('zone')))
        t.add_row((p, '\n'.join(pdata['start']), pdata['days'],
                   zones))
    print t.get_string()

Adding Algorithms to Data

The simple output format works with the data in the YAML data
structure directly. The processing it does is very basic, since it is
primarily formatting the existing values. For the calendar view, I
knew I would need some more complex algorithms. I have several
different rules to apply to decide if a program should be included in
the output for a given day, for example, and I want to compute and
show the actual start and end time for each watering event, not just
the program start time. I decided to create a Program class
to help with some of those calculations.

class Program(object):

    def __init__(self, name, pdata):
        self.name = name
        self.data = pdata
        self.days = pdata['days']
        self._day_checker = self._make_day_checker(self.days)

    def _make_day_checker(self, s):
        """Parse a 'days' string

        A days string either contains 'odd', 'even', or 1-2 letter
        abbreviations for the days of the week.
        """
        if s == 'odd':
            return lambda dow, dom: bool(dom % 2)
        elif s == 'even':
            return lambda dow, dom: not bool(dom % 2)
        else:
            valid = [
                self._day_abbr[m]
                # Match the two-letter abbreviations first so that 'Th'
                # is not consumed as 'T' (Tuesday) plus a stray 'h'.
                for m in re.findall('(Tu|Th|Sa|Su|[MTWF])', s)
            ]
            return lambda dow, dom, valid=valid: dow in valid

    _day_abbr = {
        'M': calendar.MONDAY,
        'T': calendar.TUESDAY,
        'Tu': calendar.TUESDAY,
        'W': calendar.WEDNESDAY,
        'Th': calendar.THURSDAY,
        'F': calendar.FRIDAY,
        'Sa': calendar.SATURDAY,
        'Su': calendar.SUNDAY,
    }

    def occurs_on_day(self, dow, dom):
        """Tests whether the program runs on a given day.

        :param dow: Day of week
        :param dom: Day of month
        """
        return self._day_checker(dow, dom)

The first piece of data I addressed was the rules for which days a
program is active. I have three different modes, and I knew I didn’t
want to test the mode each time a date was checked because the mode
doesn’t change after the YAML file is parsed. I decided to make
_make_day_checker() a factory method that returns a callable to
perform the test. For the “odd” and “even” modes, it returns a
function that looks at the day of the month to see if it is odd or
even respectively. For the explicit day list, I use a regular
expression to parse the string into individual abbreviations, and then
convert those to numbers using a dictionary that maps between the
abbreviations and values from calendar. The public API
occurs_on_day() wraps the checker function.

Next I defined a property to sort the zones before returning them,
just in case I enter values out of order:

    @property
    def zones(self):
        """Returns the zones used in the program, sorted by zone id.
        """
        return sorted(self.data['zones'], key=operator.itemgetter('zone'))

Another property converts the string representation of the program
start times to datetime.time instances, which are easier to
manipulate and use for sorting:

    @property
    def start_times(self):
        return sorted(datetime.datetime.strptime(t, '%H:%M').time()
                      for t in self.data['start'])

A final property produces a series of run time values with the start
and end times as well as the zone id. It is used to build the schedule
part of a calendar cell, which shows the times and zones when the
sprinklers are running. I perform the calculations to find the start
and end times myself, because datetime.time objects do not
work with datetime.timedelta objects.

    @property
    def run_times(self):
        """Returns iterable of start, end, and zone name tuples.
        """
        for s in self.start_times:
            for z in self.zones:
                # FIXME: Convert to datetime and use timedelta?
                h, m = s.hour, s.minute
                m += z['time']
                h += m / 60
                m = m % 60
                e = datetime.time(h, m)
                yield s, e, z['zone']
                s = e
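
The FIXME comment points at the alternative: attach the time to an
arbitrary date, do the math with datetime.timedelta, and strip the
date back off. A sketch of that approach (a hypothetical helper, not
part of the actual program):

def add_minutes(t, minutes):
    # datetime.time does not support arithmetic, so combine it with a
    # throwaway date, add the timedelta, and keep only the time part.
    dt = datetime.datetime.combine(datetime.date.today(), t)
    return (dt + datetime.timedelta(minutes=minutes)).time()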

Building the Calendar

With Program in place, the next task was to figure out how to
construct the calendar grid. I knew that Python includes a calendar
module, and that I could have it give me a list of weeks containing
the days of the month. To build my table, then, I would just need to
iterate over the weeks and days, deciding what to put in each cell.

I started by setting up the data I would be working with and the table
object.

def show(args, data):
    programs = [Program(*p) for p in data['programs'].items()]
    programs.sort(key=lambda p: p.start_times[0])

    t = prettytable.PrettyTable(
        field_names=calendar.day_abbr,
        print_empty=False,
        hrules=prettytable.ALL,
    )
    t.align = 'l'

    cal = calendar.Calendar(calendar.MONDAY)
    month_data = cal.monthdays2calendar(args.year, args.month)

Each row of the calendar is based on a week, and each cell is a
day. There are two nested loops to iterate over the calendar days and
determine the cell and row contents. Some weeks contain days from
multiple months, the end of one month and the beginning of the
next. The output of monthdays2calendar() reports the day of the
month as 0 for days in a week that fall outside of the current
month in either direction, and I skip them in the output (filling the
cell with a blank string to preserve the table structure).

    for week in month_data:
        row = []
        for dom, dow in week:
            if not dom:
                # Zero days are from another month; leave the cell blank.
                row.append('')
                continue

For the remaining days, I loop over the programs that occur on that
day and place a watering event (with start time, end time, and zone)
on each line of the cell. The datetime.time values are
formatted to show only the hour and minutes, and the zone name is used
instead of the zone number so I don’t have to do that conversion in my
head as I read the calendar.

            # Show the day and all watering events on that day.
            lines = ['(%s)' % dom]
            for p in (p for p in programs
                      if p.occurs_on_day(dow, dom)):
                if args.verbose:
                    lines.append('')
                    lines.append('{name} ({days})'.format(name=p.name,
                                                          days=p.days))
                for s, e, z in p.run_times:
                    name = data['zones'][z]
                    lines.append(
                        '{s}-{e} - {name}'.format(
                            s=s.strftime('%H:%M'),
                            e=e.strftime('%H:%M'),
                            name=name,
                        )
                    )
            row.append('\n'.join(lines))
        t.add_row(row)

Before printing the table, I use its width to center the month name
over the top.

    formatted = t.get_string()
    # Center the name of the month over the output calendar.
    print '\n{:^{width}}\n'.format(
        calendar.month_name[args.month],
        width=len(formatted.splitlines()[0]),
    )
    print formatted

Conclusion

With an hour of work, I was able to create a simple script to let me
visualize the watering schedule more clearly. I found one case of
potential over-watering, where a zone was in a program it shouldn’t
have been, and I was able to modify the timer’s programming to fix
that and make the other adjustments I needed very easily.

The Performance Impact of Using dict() Instead of {} in CPython 2.7

I’ve been reviewing a lot of code lately for various open source and
internal projects written in Python. As part of those reviews, I
have noticed what I think is a trend toward using dict()
instead of {} to create dictionaries. I don’t know exactly why
this trend has emerged. Perhaps the authors perceive dict() as
more readable than {}. Whatever the reason, my intuition told
me calling the function version of the constructor for a dictionary
would impose a performance penalty. I studied what happens in both
cases to understand how significant that penalty is, and the
results confirmed my intuition.

tl;dr

With CPython 2.7, using dict() to create dictionaries takes up to
6 times longer and involves more memory allocation operations than the
literal syntax. Use {} to create dictionaries, especially if you
are pre-populating them, unless the literal syntax does not work for
your case.
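
For reference, a couple of situations where the callable form is the
natural choice (illustrative examples, not from the code I was
reviewing):

# Building a dictionary from a sequence of key/value pairs:
pairs = [('a', 1), ('b', 2)]
d1 = dict(pairs)

# Copying an existing dictionary and overriding keys in one expression:
defaults = {'debug': False, 'verbose': False}
d2 = dict(defaults, debug=True)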

Initial Hypothesis

I wanted to study the performance difference between the literal
syntax for creating a dictionary instance ({}) and using the name
of the class to create one (dict()). I knew that the Python
interpreter is based on opcodes and that there are codes dedicated to
creating a dictionary that would not be invoked when the dict()
form was used instead of the literal form. I suspected that the extra
overhead for looking up the name “dict” and then calling the
function would make the “function” form slower.

Measuring Performance

I began my analysis by applying timeit to see if the
performance difference was even measurable.

$ python2.7 -m timeit -n 1000000 -r 5 -v 'dict()'
raw times: 0.24 0.24 0.24 0.239 0.24
1000000 loops, best of 5: 0.239 usec per loop

$ python2.7 -m timeit -n 1000000 -r 5 -v '{}'
raw times: 0.0417 0.0413 0.0407 0.0411 0.042
1000000 loops, best of 5: 0.0407 usec per loop

Immediately I could see that not only was there a difference, it was
even more significant than I expected. Using dict() to create an
empty dictionary took 6 times longer than using the literal
syntax. Would the difference be the same if the dictionary had
members?

$ python2.7 -m timeit -n 1000000 -r 5 -v 'dict(a="A", b="B", c="C")'
raw times: 0.56 0.548 0.563 0.563 0.556
1000000 loops, best of 5: 0.548 usec per loop

$ python2.7 -m timeit -n 1000000 -r 5 -v '{"a": "A", "b": "B", "c": "C"}'
raw times: 0.178 0.178 0.18 0.177 0.179
1000000 loops, best of 5: 0.177 usec per loop

Passing a few members to the dictionary narrowed the gap, but the
function form still took roughly three times as long as the literal.

What is going on?

After establishing the performance difference, I asked myself what was
going on to cause such a significant slowdown. To answer that
question, I needed to look more deeply into what the interpreter was
doing as it processed each expression. I wanted to see which (and how
many) opcodes were being executed. I used dis to
disassemble the Python expressions to see which opcodes implement
each.

To use dis from the command line, I needed input files containing the
different expressions I was studying. I created func.py:

dict()

and literal.py:

{}

The output of dis is arranged in columns with the original source line
number, the instruction “address” within the code object, the opcode
name, and any arguments passed to the opcode.

$ python2.7 -m dis func.py
  1           0 LOAD_NAME                0 (dict)
              3 CALL_FUNCTION            0
              6 POP_TOP
              7 LOAD_CONST               0 (None)
             10 RETURN_VALUE

The function form uses two separate opcodes: LOAD_NAME to find the
object associated with the name “dict”, and CALL_FUNCTION to
invoke it. The last three opcodes are not involved in creating or
populating the dictionary, and appear in both versions of the code, so
I ignored them for my analysis.

The literal form uses a special opcode to create the dictionary:

$ python2.7 -m dis literal.py
  1           0 BUILD_MAP                0
              3 POP_TOP
              4 LOAD_CONST               0 (None)
              7 RETURN_VALUE

The BUILD_MAP opcode creates a new empty dictionary instance and
places it on the top of the interpreter’s stack.

After comparing the two sets of opcodes, I suspected that the
CALL_FUNCTION operation was the culprit, since calling functions
is relatively expensive in Python. However, these were trivial
examples and did not look like what I was seeing in code reviews. Most
of the actual code I had seen was populating the dictionary as it
created it, and I wanted to understand what difference that would
make.

Examining More Complex Examples

I created two new source files that set three key/value pairs in the
dictionary as it is created. I started with func-members.py, which
instantiated a dictionary with the same members using the dict()
function.

dict(a="A",
     b="B",
     c="C",
     )

The disassembled version of func-members.py started the same way
as the earlier example, looking for the dict() function:

$ python2.7 -m dis func-members.py
  1           0 LOAD_NAME                0 (dict)

Then it showed key/value pairs being pushed onto the stack using
LOAD_CONST to create named arguments for the function.

            3 LOAD_CONST               0 ('a')
            6 LOAD_CONST               1 ('A')
            9 LOAD_CONST               2 ('b')

2          12 LOAD_CONST               3 ('B')
           15 LOAD_CONST               4 ('c')

3          18 LOAD_CONST               5 ('C')

Finally dict() was called:

21 CALL_FUNCTION          768
24 POP_TOP
25 LOAD_CONST               6 (None)
28 RETURN_VALUE

Next I created literal-members.py:

{"a": "A",
 "b": "B",
 "c": "C",
 }

The disassembled version of this example showed a few differences from
the literal example without any values. First, the argument to
BUILD_MAP was 3 instead of 0, indicating that there were three
key/value pairs on the stack to go into the dictionary.

$ python2.7 -m dis literal-members.py
  1           0 BUILD_MAP                3

It also showed the values and then keys being pushed onto the stack
using LOAD_CONST.

3 LOAD_CONST               0 ('A')
6 LOAD_CONST               1 ('a')

Finally, a new opcode, STORE_MAP, appeared once after each
key/value pair was processed.

9 STORE_MAP

The same pattern repeated for each of the other two key/value pairs.

2          10 LOAD_CONST               2 ('B')
           13 LOAD_CONST               3 ('b')
           16 STORE_MAP

3          17 LOAD_CONST               4 ('C')
           20 LOAD_CONST               5 ('c')
           23 STORE_MAP
           24 POP_TOP
           25 LOAD_CONST               6 (None)
           28 RETURN_VALUE

After looking at the output more closely, I noticed that there were
actually fewer opcodes in the function form than the literal
form. There were no STORE_MAP opcodes, just the CALL_FUNCTION
after all of the items were on the stack. At this point I realized
that in order to really understand what was going on, I would have to
look at the interpreter implementation.

Interpreter Source

Up to now, I had been examining the behavior of the interpreter by
feeding it inputs and seeing what it did with them. To examine it at a
deeper level, I had to download the source following the instructions
in the Dev Guide.

$ hg clone http://hg.python.org/cpython
destination directory: cpython
requesting all changes
adding changesets
adding manifests
adding file changes
added 80406 changesets with 178013 changes to 9805 files (+2 heads)
updating to branch default
3750 files updated, 0 files merged, 0 files removed, 0 files unresolved

$ cd cpython

And, since I am looking at Python 2.7 rather than 3.3:

$ hg update 2.7
3768 files updated, 0 files merged, 810 files removed, 0 files unresolved

The interpreter evaluates opcodes in a loop defined in
PyEval_EvalFrameEx() in Python/ceval.c. Each opcode name
corresponds to an entry in the switch statement. For example, the
POP_TOP opcode that appears near the end of each disassembled
example is implemented as:

        case POP_TOP:
            v = POP();
            Py_DECREF(v);
            goto fast_next_opcode;

The top-most item is removed from the stack and its reference count is
decremented to allow it (eventually) to be garbage collected. After
orienting myself in the source, I was ready to trace through the
opcodes used in the examples above.

What Happens When You Call dict()?

The disassembly above shows that the opcodes used to call dict()
to create a dictionary are LOAD_NAME, LOAD_CONST, and
CALL_FUNCTION.

The LOAD_NAME opcode finds the object associated with the given
name (“dict” in this case) and puts it on top of the stack.

        case LOAD_NAME:
            w = GETITEM(names, oparg);
            if ((v = f->f_locals) == NULL) {
                PyErr_Format(PyExc_SystemError,
                             "no locals when loading %s",
                             PyObject_REPR(w));
                why = WHY_EXCEPTION;
                break;
            }
            if (PyDict_CheckExact(v)) {
                x = PyDict_GetItem(v, w);
                Py_XINCREF(x);
            }
            else {
                x = PyObject_GetItem(v, w);
                if (x == NULL && PyErr_Occurred()) {
                    if (!PyErr_ExceptionMatches(
                                    PyExc_KeyError))
                        break;
                    PyErr_Clear();
                }
            }
            if (x == NULL) {
                x = PyDict_GetItem(f->f_globals, w);
                if (x == NULL) {
                    x = PyDict_GetItem(f->f_builtins, w);
                    if (x == NULL) {
                        format_exc_check_arg(
                                    PyExc_NameError,
                                    NAME_ERROR_MSG, w);
                        break;
                    }
                }
                Py_INCREF(x);
            }
            PUSH(x);
            continue;

Three separate namespaces (represented as dictionaries) are
searched: first the local namespace from inside any function scope,
then the module global namespace, and finally the set of built-ins.
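
That search order is easy to demonstrate from Python itself (a
contrived top-level script, where the dict(a=1) call compiles to
LOAD_NAME):

dict = lambda **kwds: 'shadowed'  # a module-level name hides the builtin
print dict(a=1)  # prints 'shadowed'

del dict         # remove the shadow; the search falls through to builtins
print dict(a=1)  # prints {'a': 1}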

LOAD_CONST is the next opcode used. It pushes literal constant
values onto the interpreter’s stack:

        case LOAD_CONST:
            x = GETITEM(consts, oparg);
            Py_INCREF(x);
            PUSH(x);
            goto fast_next_opcode;

The oparg value indicates which constant to take out of the set of
constants found in the code object. The constant’s reference count is
increased and then it is pushed onto the top of the stack. This is an
inexpensive operation since no name look-up is needed.
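
Both tuples live on the compiled code object, which can be inspected
directly from the interpreter:

>>> code = compile('dict(a="A")', '<demo>', 'eval')
>>> code.co_names   # searched by LOAD_NAME
('dict',)
>>> code.co_consts  # indexed by LOAD_CONST
('a', 'A')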

The portion of the implementation of CALL_FUNCTION in the case
statement looks similarly simple:

        case CALL_FUNCTION:
        {
            PyObject **sp;
            PCALL(PCALL_ALL);
            sp = stack_pointer;
#ifdef WITH_TSC
            x = call_function(&sp, oparg, &intr0, &intr1);
#else
            x = call_function(&sp, oparg);
#endif
            stack_pointer = sp;
            PUSH(x);
            if (x != NULL)
                continue;
            break;
        }

The function is called and its return value is pushed onto the
stack. (The WITH_TSC conditional compilation instruction controls
whether the Pentium timestamp counter is used, and can be ignored for
this general analysis.)

The implementation of call_function() starts to expose some of
the complexity of calling Python functions.

static PyObject *
call_function(PyObject ***pp_stack, int oparg
#ifdef WITH_TSC
                , uint64* pintr0, uint64* pintr1
#endif
                )
{
    int na = oparg & 0xff;
    int nk = (oparg>>8) & 0xff;
    int n = na + 2 * nk;
    PyObject **pfunc = (*pp_stack) - n - 1;
    PyObject *func = *pfunc;
    PyObject *x, *w;

    /* Always dispatch PyCFunction first, because these are
       presumed to be the most frequent callable object.
    */
    if (PyCFunction_Check(func) && nk == 0) {
        int flags = PyCFunction_GET_FLAGS(func);
        PyThreadState *tstate = PyThreadState_GET();

        PCALL(PCALL_CFUNCTION);
        if (flags & (METH_NOARGS | METH_O)) {
            PyCFunction meth = PyCFunction_GET_FUNCTION(func);
            PyObject *self = PyCFunction_GET_SELF(func);
            if (flags & METH_NOARGS && na == 0) {
                C_TRACE(x, (*meth)(self,NULL));
            }
            else if (flags & METH_O && na == 1) {
                PyObject *arg = EXT_POP(*pp_stack);
                C_TRACE(x, (*meth)(self,arg));
                Py_DECREF(arg);
            }
            else {
                err_args(func, flags, na);
                x = NULL;
            }
        }
        else {
            PyObject *callargs;
            callargs = load_args(pp_stack, na);
            READ_TIMESTAMP(*pintr0);
            C_TRACE(x, PyCFunction_Call(func,callargs,NULL));
            READ_TIMESTAMP(*pintr1);
            Py_XDECREF(callargs);
        }
    } else {
        if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
            /* optimize access to bound methods */
            PyObject *self = PyMethod_GET_SELF(func);
            PCALL(PCALL_METHOD);
            PCALL(PCALL_BOUND_METHOD);
            Py_INCREF(self);
            func = PyMethod_GET_FUNCTION(func);
            Py_INCREF(func);
            Py_DECREF(*pfunc);
            *pfunc = self;
            na++;
            n++;
        } else
            Py_INCREF(func);
        READ_TIMESTAMP(*pintr0);
        if (PyFunction_Check(func))
            x = fast_function(func, pp_stack, n, na, nk);
        else
            x = do_call(func, pp_stack, na, nk);
        READ_TIMESTAMP(*pintr1);
        Py_DECREF(func);
    }

    /* Clear the stack of the function object.  Also removes
       the arguments in case they weren't consumed already
       (fast_function() and err_args() leave them on the stack).
     */
    while ((*pp_stack) > pfunc) {
        w = EXT_POP(*pp_stack);
        Py_DECREF(w);
        PCALL(PCALL_POP);
    }
    return x;
}

The number of arguments passed to the function is given in oparg. The
low byte is the number of positional arguments, and the next byte is
the number of keyword arguments (the na and nk values computed at
the top of call_function()). The value 768 in the example above
translates to 3 keyword arguments and 0 positional arguments.

21 CALL_FUNCTION          768
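
The decoding is easy to verify by hand:

oparg = 768               # 0x0300, as shown in the disassembly
na = oparg & 0xff         # positional argument count: 0
nk = (oparg >> 8) & 0xff  # keyword argument count: 3
print na, nk              # prints '0 3'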

There are separate cases for built-in functions implemented in C,
functions written in Python, and methods of objects. All of the cases
eventually use load_args() to pull the positional arguments off
of the stack as a tuple:

static PyObject *
load_args(PyObject ***pp_stack, int na)
{
    PyObject *args = PyTuple_New(na);
    PyObject *w;

    if (args == NULL)
        return NULL;
    while (--na >= 0) {
        w = EXT_POP(*pp_stack);
        PyTuple_SET_ITEM(args, na, w);
    }
    return args;
}

And then update_keyword_args() is used to pull keyword arguments
off of the stack as a dictionary:

static PyObject *
update_keyword_args(PyObject *orig_kwdict, int nk, PyObject ***pp_stack,
                    PyObject *func)
{
    PyObject *kwdict = NULL;
    if (orig_kwdict == NULL)
        kwdict = PyDict_New();
    else {
        kwdict = PyDict_Copy(orig_kwdict);
        Py_DECREF(orig_kwdict);
    }
    if (kwdict == NULL)
        return NULL;
    while (--nk >= 0) {
        int err;
        PyObject *value = EXT_POP(*pp_stack);
        PyObject *key = EXT_POP(*pp_stack);
        if (PyDict_GetItem(kwdict, key) != NULL) {
            PyErr_Format(PyExc_TypeError,
                         "%.200s%s got multiple values "
                         "for keyword argument '%.200s'",
                         PyEval_GetFuncName(func),
                         PyEval_GetFuncDesc(func),
                         PyString_AsString(key));
            Py_DECREF(key);
            Py_DECREF(value);
            Py_DECREF(kwdict);
            return NULL;
        }
        err = PyDict_SetItem(kwdict, key, value);
        Py_DECREF(key);
        Py_DECREF(value);
        if (err) {
            Py_DECREF(kwdict);
            return NULL;
        }
    }
    return kwdict;
}

That’s right.

To pass keyword arguments to dict() to instantiate a dictionary,
first another dictionary is created.
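
The same behavior is visible from pure Python, since keyword arguments
always arrive packed into a freshly created dictionary:

def f(**kwds):
    return type(kwds)

print f(a='A')  # prints <type 'dict'>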

After the arguments are prepared, they are passed to dict().
The implementation of the dictionary object is found in
Objects/dictobject.c, and the dict type is defined as:

PyTypeObject PyDict_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "dict",
    sizeof(PyDictObject),
    0,
    (destructor)dict_dealloc,                   /* tp_dealloc */
    (printfunc)dict_print,                      /* tp_print */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    (cmpfunc)dict_compare,                      /* tp_compare */
    (reprfunc)dict_repr,                        /* tp_repr */
    0,                                          /* tp_as_number */
    &dict_as_sequence,                          /* tp_as_sequence */
    &dict_as_mapping,                           /* tp_as_mapping */
    (hashfunc)PyObject_HashNotImplemented,      /* tp_hash */
    0,                                          /* tp_call */
    0,                                          /* tp_str */
    PyObject_GenericGetAttr,                    /* tp_getattro */
    0,                                          /* tp_setattro */
    0,                                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC |
        Py_TPFLAGS_BASETYPE | Py_TPFLAGS_DICT_SUBCLASS,         /* tp_flags */
    dictionary_doc,                             /* tp_doc */
    dict_traverse,                              /* tp_traverse */
    dict_tp_clear,                              /* tp_clear */
    dict_richcompare,                           /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    (getiterfunc)dict_iter,                     /* tp_iter */
    0,                                          /* tp_iternext */
    mapp_methods,                               /* tp_methods */
    0,                                          /* tp_members */
    0,                                          /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    0,                                          /* tp_dictoffset */
    dict_init,                                  /* tp_init */
    PyType_GenericAlloc,                        /* tp_alloc */
    dict_new,                                   /* tp_new */
    PyObject_GC_Del,                            /* tp_free */
};

Because dict is a class, dict() creates a new object and
then invokes the __init__() method. The initialization for
dict is handled by dict_init():

static int
dict_init(PyObject *self, PyObject *args, PyObject *kwds)
{
    return dict_update_common(self, args, kwds, "dict");
}

dict_init() in turn calls dict_update_common() to update the contents
of the dictionary with the arguments passed to the initializer.

static int
dict_update_common(PyObject *self, PyObject *args, PyObject *kwds, char *methname)
{
    PyObject *arg = NULL;
    int result = 0;

    if (!PyArg_UnpackTuple(args, methname, 0, 1, &arg))
        result = -1;

    else if (arg != NULL) {
        if (PyObject_HasAttrString(arg, "keys"))
            result = PyDict_Merge(self, arg, 1);
        else
            result = PyDict_MergeFromSeq2(self, arg, 1);
    }
    if (result == 0 && kwds != NULL)
        result = PyDict_Merge(self, kwds, 1);
    return result;
}

In this case, a set of keyword arguments is passed, so the very last
branch is triggered and PyDict_Merge() is used to copy the keyword
arguments into the dictionary. There are a couple of cases for
merging, but from what I can tell, because there are two dictionaries
involved, the first (concrete dictionary) case applies. The target
dictionary is resized to be big enough to hold the new values, and
then the items from the merging dictionary are copied in one at a
time.

int
PyDict_Merge(PyObject *a, PyObject *b, int override)
{
    register PyDictObject *mp, *other;
    register Py_ssize_t i;
    PyDictEntry *entry;

    /* We accept for the argument either a concrete dictionary object,
     * or an abstract "mapping" object.  For the former, we can do
     * things quite efficiently.  For the latter, we only require that
     * PyMapping_Keys() and PyObject_GetItem() be supported.
     */
    if (a == NULL || !PyDict_Check(a) || b == NULL) {
        PyErr_BadInternalCall();
        return -1;
    }
    mp = (PyDictObject*)a;
    if (PyDict_Check(b)) {
        other = (PyDictObject*)b;
        if (other == mp || other->ma_used == 0)
            /* a.update(a) or a.update({}); nothing to do */
            return 0;
        if (mp->ma_used == 0)
            /* Since the target dict is empty, PyDict_GetItem()
             * always returns NULL.  Setting override to 1
             * skips the unnecessary test.
             */
            override = 1;
        /* Do one big resize at the start, rather than
         * incrementally resizing as we insert new items.  Expect
         * that there will be no (or few) overlapping keys.
         */
        if ((mp->ma_fill + other->ma_used)*3 >= (mp->ma_mask+1)*2) {
           if (dictresize(mp, (mp->ma_used + other->ma_used)*2) != 0)
               return -1;
        }
        for (i = 0; i <= other->ma_mask; i++) {
            entry = &other->ma_table[i];
            if (entry->me_value != NULL &&
                (override ||
                 PyDict_GetItem(a, entry->me_key) == NULL)) {
                Py_INCREF(entry->me_key);
                Py_INCREF(entry->me_value);
                if (insertdict(mp, entry->me_key,
                               (long)entry->me_hash,
                               entry->me_value) != 0)
                    return -1;
            }
        }
    }
    else {
        /* Do it the generic, slower way */
        PyObject *keys = PyMapping_Keys(b);
        PyObject *iter;
        PyObject *key, *value;
        int status;

        if (keys == NULL)
            /* Docstring says this is equivalent to E.keys() so
             * if E doesn't have a .keys() method we want
             * AttributeError to percolate up.  Might as well
             * do the same for any other error.
             */
            return -1;

        iter = PyObject_GetIter(keys);
        Py_DECREF(keys);
        if (iter == NULL)
            return -1;

        for (key = PyIter_Next(iter); key; key = PyIter_Next(iter)) {
            if (!override && PyDict_GetItem(a, key) != NULL) {
                Py_DECREF(key);
                continue;
            }
            value = PyObject_GetItem(b, key);
            if (value == NULL) {
                Py_DECREF(iter);
                Py_DECREF(key);
                return -1;
            }
            status = PyDict_SetItem(a, key, value);
            Py_DECREF(key);
            Py_DECREF(value);
            if (status < 0) {
                Py_DECREF(iter);
                return -1;
            }
        }
        Py_DECREF(iter);
        if (PyErr_Occurred())
            /* Iterator completed, via error */
            return -1;
    }
    return 0;
}

Creating a Dictionary with {}

The opcodes used to implement the literal examples are BUILD_MAP,
LOAD_CONST, and STORE_MAP. I started with the first opcode,
BUILD_MAP, which creates the dictionary instance:

        case BUILD_MAP:
            x = _PyDict_NewPresized((Py_ssize_t)oparg);
            PUSH(x);
            if (x != NULL) continue;
            break;

The dictionary is created using _PyDict_NewPresized(), from
Objects/dictobject.c.

/* Create a new dictionary pre-sized to hold an estimated number of elements.
   Underestimates are okay because the dictionary will resize as necessary.
   Overestimates just mean the dictionary will be more sparse than usual.
*/

PyObject *
_PyDict_NewPresized(Py_ssize_t minused)
{
    PyObject *op = PyDict_New();

    if (minused>5 && op != NULL && dictresize((PyDictObject *)op, minused) == -1) {
        Py_DECREF(op);
        return NULL;
    }
    return op;
}

The argument for BUILD_MAP is the number of items that are going
to be added to the new dictionary as it is created. The disassembly
for literal-members.py showed that value as 3 earlier.

$ python2.7 -m dis literal-members.py
  1           0 BUILD_MAP                3

Specifying the initial number of items in the dictionary is an
optimization for managing memory, since it means the table size can be
set ahead of time and it does not need to be reallocated in some
cases.
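
To reproduce that disassembly without a separate file, the dis module
can be pointed at a compiled literal (a quick check, assuming a
CPython 2.7 interpreter):

import dis

# Compile a dictionary literal and show its opcodes.
code = compile("{'A': 'a', 'B': 'b', 'C': 'c'}", '<example>', 'eval')
dis.dis(code)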

Each key/value pair is added to the dictionary using three
opcodes. Two instances of LOAD_CONST push the value, then the key,
onto the stack. Then a STORE_MAP opcode adds the pair to the
dictionary.

2          10 LOAD_CONST               2 ('B')
           13 LOAD_CONST               3 ('b')
           16 STORE_MAP

As we saw earlier, LOAD_CONST is fairly straightforward and
economical. STORE_MAP looks for the key, value, and dictionary on
the stack and calls PyDict_SetItem() to add the new key/value
pair. The STACKADJ(-2) line removes the key and value from the
stack.

        case STORE_MAP:
            w = TOP();     /* key */
            u = SECOND();  /* value */
            v = THIRD();   /* dict */
            STACKADJ(-2);
            assert (PyDict_CheckExact(v));
            err = PyDict_SetItem(v, w, u);  /* v[w] = u */
            Py_DECREF(u);
            Py_DECREF(w);
            if (err == 0) continue;
            break;

PyDict_SetItem() is the same function invoked when a program
uses d[k] = v to associate the value v with a key k in a
dictionary.

int
PyDict_SetItem(register PyObject *op, PyObject *key, PyObject *value)
{
    register long hash;

    if (!PyDict_Check(op)) {
        PyErr_BadInternalCall();
        return -1;
    }
    assert(key);
    assert(value);
    if (PyString_CheckExact(key)) {
        hash = ((PyStringObject *)key)->ob_shash;
        if (hash == -1)
            hash = PyObject_Hash(key);
    }
    else {
        hash = PyObject_Hash(key);
        if (hash == -1)
            return -1;
    }
    return dict_set_item_by_hash_or_entry(op, key, hash, NULL, value);
}

The functions it calls handle resizing the internal data structure
used by the dictionary, if that’s necessary.
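
The resizing itself is easy to observe indirectly from Python. This
snippet is CPython-specific, since it relies on sys.getsizeof()
reflecting the size of the underlying table, but it prints a line each
time the dictionary grows:

import sys

d = {}
last = sys.getsizeof(d)
for i in range(50):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last:
        print 'resized at %d items: %d -> %d bytes' % (len(d), last, size)
        last = size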

Extreme Tests

I now had an answer explaining why the literal syntax was so much more
efficient than using the name to instantiate a dictionary. I had
noticed earlier, though, that the performance benefit shrank when I
added a few key/value pairs, so I wanted to see whether that trend
continued as the number of arguments grew.

I decided to jump right to the maximum point, and create dictionaries
with 255 members. Because the number of keyword arguments passed to a
function is represented in a byte, a function may be passed at most
255 literal keyword arguments (you can pass more arguments if you
populate a dictionary and use the syntax callable(**kwds), but 255
seemed like a reasonable number to test).
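
The limit is easy to confirm, because CPython rejects an over-long
call at compile time (a quick check, assuming CPython 2.7, where the
message is “more than 255 arguments”):

# Build a call with 256 keyword arguments and try to compile it.
src = 'dict(' + ', '.join('a%d=%d' % (i, i) for i in range(256)) + ')'
try:
    compile(src, '<test>', 'eval')
except SyntaxError as err:
    print err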

Not wanting to type out all of those arguments, I created one script to
generate a call to dict() with 255 arguments:

#!/usr/bin/env python

print 'dict(',
for i in range(255):
    print 'a%d=%d,' % (i, i),
print ')'

and another to create a literal dictionary with the same members:

#!/usr/bin/env python

print '{',
for i in range(255):
    print '"a%r": %r,' % (i, i),
print '}'

I then ran the scripts, and passed their output to timeit. I
used fewer iterations, since it took longer to run each one and
process all of the arguments.

$ python2.7 makebigfunc.py > func-big.py
$ python2.7 -m timeit -n 10000 -r 5 -v "$(cat func-big.py)"
raw times: 0.214 0.211 0.211 0.216 0.221
10000 loops, best of 5: 21.1 usec per loop

$ python2.7 makebigdict.py > literal-big.py
$ python2.7 -m timeit -n 10000 -r 5 -v "$(cat literal-big.py)"
raw times: 0.138 0.136 0.137 0.137 0.14
10000 loops, best of 5: 13.6 usec per loop

The difference has narrowed, but using dict() still takes 1.6
times as long as {}.

Conclusions

In summary, calling dict() requires these steps:

  1. Find the object associated with the name “dict” and push it
    onto the stack.
  2. Push the key/value pairs onto the stack as constant values.
  3. Get the key/value pairs off of the stack and create a dictionary
    to hold the keyword arguments to the function.
  4. Call the constructor for dict to make a new object.
  5. Initialize the new object by passing the keyword arguments to its
    initialization method.
  6. Resize the new dict and copy the key/value pairs into it
    from the keyword arguments.

Whereas using {} to create a dictionary uses only these steps:

  1. Create an empty but pre-allocated dictionary instance.
  2. Push the key/value pairs onto the stack as constant values.
  3. Store each key/value pair in the dictionary.

The times involved here are pretty small, but as a general principle I
try to avoid code constructions known to introduce performance hits.
On the other hand, there may be times when using dict() is
necessary, or easier. For example, in versions of Python earlier than
2.7, creating a dictionary from an iterable required using a generator
expression as an argument to dict():

d = dict((k, v) for k, v in some_iterable)

With 2.7 and later, though, dictionary comprehensions are built into
the language syntax:

{k: v for k, v in some_iterable}
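
For a concrete comparison, both forms produce the same dictionary
(some_iterable here is a made-up example):

some_iterable = [('a', 1), ('b', 2)]

print dict((k, v) for k, v in some_iterable)  # {'a': 1, 'b': 2}
print {k: v for k, v in some_iterable}        # {'a': 1, 'b': 2}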

So that eliminates one common case for calling dict(). Using
the literal dictionary syntax feels more “pythonic” to me, so I use it
whenever I can.

Determining the Name of a Process from Python

Finding the name of the program from which a Python module is
running can be trickier than it would seem at first, and
investigating the reasons led to some interesting experiments.

A couple of weeks ago at the OpenStack Folsom Summit, Mark McClain
pointed out an interesting code snippet he had discovered in the Nova
sources:

nova/utils.py: 339

script_dir = os.path.dirname(inspect.stack()[-1][1])

The code is part of the logic to find a configuration file that lives
in a directory relative to where the application startup script is
located. It looks at the call stack to find the main program, and
picks the filename out of the stack details.

The code seems to be taken from a response to a StackOverflow
question, and when I saw it I thought it looked like a case of
someone going to more trouble than was needed to get the
information. Mark had a similar reaction, and asked if I knew of a
simpler way to determine the program name.

Similar examples with inspect.stack() appear in four places in
the Nova source code (at least as of today). All of them are either
building filenames relative to the location of the original “main”
program, or are returning the name of that program to be used to build
a path to another file (such as a log file or other program). Those
are all good reasons to be careful about the location and name of the
main program, but none explain why the obvious solution isn’t good
enough. I assumed that if the OpenStack developers were looking at
stack frames there must have been a reason. I decided to examine the
original code and spend a little time deciphering what it is doing,
and especially to see if there were cases where it did not work as
desired (so I could justify a patch).

The Stack

The call to inspect.stack() retrieves the Python interpreter
stack frames for the current thread. The return value is a list with
information about the calling function in position 0 and the “top”
of the stack at the end of the list. Each item in the list is a tuple
containing:

  • the stack frame data structure
  • the filename for the code being run in that frame
  • the line number within the file
  • the co_name member of the code object from the frame,
    giving the function or method name being executed
  • the source lines for that bit of code, when available
  • an index into the list of source lines showing the actual source
    line for the frame

show_stack.py

import inspect


def show_stack():
    stack = inspect.stack()
    for s in stack:
        print 'filename:', s[1]
        print 'line    :', s[2]
        print 'co_name :', s[3]
        print


def inner():
    show_stack()


def outer():
    inner()


if __name__ == '__main__':
    outer()

The information is intended to be used for generating tracebacks or by
tools like pdb when debugging an application (although
pdb has its own implementation). To answer the question “Which
program am I running in?” the filename is the most interesting piece
of data.

$ python show_stack.py
filename: show_stack.py
line    : 6
co_name : show_stack

filename: show_stack.py
line    : 15
co_name : inner

filename: show_stack.py
line    : 19
co_name : outer

filename: show_stack.py
line    : 23
co_name : <module>

One obvious issue with these results is that the filename in the stack
frame is relative to the startup directory of the application. That
could lead to an incorrect path if the process has changed its working
directory between startup and checking the stack. But there is
another mode where looking at the top of the stack produces completely
invalid results.

$ python -m show_stack
filename: /Users/dhellmann/.../show_stack.py
line    : 6
co_name : show_stack

filename: /Users/dhellmann/.../show_stack.py
line    : 15
co_name : inner

filename: /Users/dhellmann/.../show_stack.py
line    : 19
co_name : outer

filename: /Users/dhellmann/.../show_stack.py
line    : 23
co_name : <module>

filename: /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py
line    : 72
co_name : _run_code

filename: /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py
line    : 162
co_name : _run_module_as_main

The -m option to the interpreter triggers the runpy module,
which takes the module name specified and executes it like a main
program. As the stack printout above illustrates, runpy is then at
the top of the stack, so the “main” part of our local module is
several levels down from the top. That means the simple one-liner is
not always going to produce the right results.
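
In principle, a caller could compensate by walking the stack from the
top down and skipping frames that belong to runpy. This is a fragile
sketch of that idea, not something Nova actually does:

import inspect
import os.path


def main_script_from_stack():
    # inspect.stack() puts the "top" of the stack at the end of
    # the list, so walk it in reverse.
    for record in reversed(inspect.stack()):
        filename = record[1]
        if os.path.basename(filename) != 'runpy.py':
            return filename
    return None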

Why the Obvious Solution Fails

Now that I knew there were ways to get the wrong results by looking at
the stack, the next question was whether there was another way to find
the program name that was simpler, more efficient, and especially more
correct. The simplest solution is to look at the command line
arguments passed through sys.argv.

argv.py

import sys

print sys.argv[0]

Normally, the first element in sys.argv is the script that was run
as the main program. The value always points to the same file,
although the method of invoking it may cause the value to fluctuate
between a relative and a full path.

$ python argv.py
argv.py

$ ./argv.py
./argv.py

$ python -m argv
/Users/dhellmann/.../argv.py

As this example demonstrates, when a script is run directly or passed
as an argument to the interpreter, sys.argv contains a relative
path to the script file. Using -m we see the full path, so
looking at the command line arguments is more robust for that
case. However, we cannot depend on -m being used, so we aren’t
guaranteed to get the extra details.
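
One common defensive measure is to normalize the value as early as
possible, before anything has a chance to change the working directory
(a small sketch, not taken from Nova):

import os.path
import sys

# Capture an absolute path at import time, before any chdir().
PROGRAM = os.path.abspath(sys.argv[0])
print PROGRAM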

Using import

The next alternative I considered was probing the main program module
myself. Every module has a special property, __file__, which holds
the path to the file from which the module was loaded. To access the
main program module from within Python, you import a specially named
module __main__. To test this method, I created a main program
that loads another module:

import_main_app.py

import import_main_module

import_main_module.main()

And the second module imports __main__ and prints the file it was
loaded from.

import_main.py

import __main__

print __main__.__file__

Looking at the __main__ module always pointed to the actual main
program module, but it did not always produce a full path. This makes
sense, because the filename for a module that goes into the stack
frame comes from the module itself.

$ python import_main.py
import_main.py

$ python -m import_main
/Users/dhellmann/.../import_main.py
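
Combining the two ideas gives a small helper that returns an absolute
path when one is available. This is only a sketch; note that the
interactive interpreter’s __main__ module has no __file__
attribute, so the lookup is guarded:

import os.path

import __main__


def program_name():
    try:
        return os.path.abspath(__main__.__file__)
    except AttributeError:
        # Interactive sessions have no main script file.
        return None

print program_name()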

Wandering Down the Garden Path

After I found such a simple way to reliably retrieve the program name,
I spent a while thinking about the motivation of the person who
decided that looking at stack frames was the best solution. I came up
with two hypotheses. First, it is entirely possible that they did not
know about importing __main__. It isn’t the sort of thing one
needs to do very often, and I don’t even remember where I learned
about doing it (or why, because I’m pretty sure I’ve never used the
feature in production code for any reason). That seems like the most
plausible reason, but the other idea I had was that for some reason it
was very important to have a relatively tamper-proof value –
something that could not be overwritten accidentally. This new idea
merited further investigation, so I worked back through the methods of
accessing the program name to determine which, if any, met the new
criteria.

I did not need to experiment with sys.argv to know it was
mutable. The arguments are saved in a normal list object, and
can be modified quite easily, as demonstrated here.

argv_modify.py

import sys

print 'Type  :', type(sys.argv)
print 'Before:', sys.argv

sys.argv[0] = 'wrong'

print 'After :', sys.argv

All normal list operations are supported, so replacing the program
name is a simple assignment statement. Because sys.argv is a list,
it is also susceptible to having values removed by pop(),
remove(), or a slice assignment gone awry.

$ python argv_modify.py
Type  : <type 'list'>
Before: ['argv_modify.py']
After : ['wrong']

The __file__ attribute of a module is a string, which is not
itself mutable, but the value can be replaced by assigning a new
string to the attribute.

import_modify.py

import __main__

print 'Before:', __main__.__file__

__main__.__file__ = 'wrong'

print 'After :', __main__.__file__

This is less likely to happen by accident, so it seems somewhat
safer. Nonetheless, changing it is easy.

$ python import_modify.py
Before: import_modify.py
After : wrong

That leaves the stack frame.

Down the Rabbit Hole

As described above, the return value of inspect.stack() is a
list of tuples. The list is computed each time the function is called,
so it is unlikely that one part of a program would accidentally
modify it. The key word there is accidentally, but even a malicious
program would have to go to a bit of effort to return fake stack data.

stack_modify1.py

import inspect


def faux_stack():
    full_stack = orig_stack()
    top = full_stack[-1]
    top = (top[0], 'wrong') + top[2:]
    full_stack[-1] = top
    return full_stack
orig_stack = inspect.stack
inspect.stack = faux_stack


stack_data = inspect.stack()
script_name = stack_data[-1][1]
print 'From stack:', script_name
print 'From frame:', stack_data[-1][0].f_code.co_filename

The filename actually appears in two places in the data returned by
inspect.stack(). The first location is in the tuple that is part
of the list returned as the stack itself. The second is in the code
object of the stack frame within that same tuple
(frame.f_code.co_filename).

$ python stack_modify1.py
From stack: wrong
From frame: stack_modify1.py

Replacing the filename in the tuple was relatively easy, and would be
sufficient for code that trusted the stack contents returned by
inspect.stack(). It turned out to be more challenging to change
the code object. For CPython, the code class is implemented
in C as part of the set of objects used internally by the interpreter.

Objects/codeobject.c

static PyMemberDef code_memberlist[] = {
    {"co_argcount",     T_INT,          OFF(co_argcount),       READONLY},
    {"co_nlocals",      T_INT,          OFF(co_nlocals),        READONLY},
    {"co_stacksize",T_INT,              OFF(co_stacksize),      READONLY},
    {"co_flags",        T_INT,          OFF(co_flags),          READONLY},
    {"co_code",         T_OBJECT,       OFF(co_code),           READONLY},
    {"co_consts",       T_OBJECT,       OFF(co_consts),         READONLY},
    {"co_names",        T_OBJECT,       OFF(co_names),          READONLY},
    {"co_varnames",     T_OBJECT,       OFF(co_varnames),       READONLY},
    {"co_freevars",     T_OBJECT,       OFF(co_freevars),       READONLY},
    {"co_cellvars",     T_OBJECT,       OFF(co_cellvars),       READONLY},
    {"co_filename",     T_OBJECT,       OFF(co_filename),       READONLY},
    {"co_name",         T_OBJECT,       OFF(co_name),           READONLY},
    {"co_firstlineno", T_INT,           OFF(co_firstlineno),    READONLY},
    {"co_lnotab",       T_OBJECT,       OFF(co_lnotab),         READONLY},
    {NULL}      /* Sentinel */
};

The data members of a code object are all defined as READONLY,
which means you cannot modify them from within Python code directly.

code_modify_fail.py

import inspect

stack_data = inspect.stack()
frame = stack_data[0][0]
code = frame.f_code

code.co_filename = 'wrong'

Attempting to change a read-only property causes a TypeError.

$ python code_modify_fail.py
Traceback (most recent call last):
  File "code_modify_fail.py", line 8, in <module>
    code.co_filename = 'wrong'
TypeError: readonly attribute

Instead of changing the code object itself, I would have to replace it
with another object. The reference to the code object is accessed
through the frame object, so in order to insert my code object into
the stack frame I would need to modify the frame. Frame objects are
also immutable, however, so that meant creating a fake frame to
replace the original value. Unfortunately, it is not practical to
instantiate code or frame objects from within
Python, so I ended up having to create classes to mimic the originals.

stack_modify2.py

import collections
import inspect

# Define a namedtuple with the attributes of a stack frame
frame_attrs = ['f_back',
               'f_code',
               'f_builtins',
               'f_globals',
               'f_lasti',
               'f_lineno',
               ]
frame = collections.namedtuple('frame', ' '.join(frame_attrs))

# Define a namedtuple with the attributes of a code object
code_attrs = ['co_argcount',
              'co_nlocals',
              'co_stacksize',
              'co_flags',
              'co_code',
              'co_consts',
              'co_names',
              'co_varnames',
              'co_freevars',
              'co_cellvars',
              'co_filename',
              'co_name',
              'co_firstlineno',
              'co_lnotab',
              ]
code = collections.namedtuple('code', ' '.join(code_attrs))


def _make_fake_frame(original, filename):
    """Return a new fake frame object with the wrong filename."""
    new_c_attrs = dict((a, getattr(original.f_code, a))
                          for a in code_attrs)
    new_c_attrs['co_filename'] = filename
    new_c = code(**new_c_attrs)

    new_f_attrs = dict((a, getattr(original, a))
                           for a in frame_attrs)
    new_f_attrs['f_code'] = new_c
    new_f = frame(**new_f_attrs)
    return new_f


def faux_stack():
    full_stack = orig_stack()
    top = full_stack[-1]

    new_frame = _make_fake_frame(top[0], 'wrong')

    # Replace the top of the stack
    top = (new_frame, 'wrong') + top[2:]
    full_stack[-1] = top

    return full_stack
orig_stack = inspect.stack
inspect.stack = faux_stack


def show_app_name():
    stack_data = inspect.stack()
    script_name = stack_data[-1][1]
    print 'From stack:', script_name
    print 'From frame:', stack_data[-1][0].f_code.co_filename


if __name__ == '__main__':
    show_app_name()

I stole the idea of using namedtuple as a convenient way to
have a class with named attributes but no real methods from
inspect, which uses it to define a Traceback class.

$ python stack_modify2.py
From stack: wrong
From frame: wrong

Replacing the frame and code objects worked well for accessing the
“code” object directly, but failed when I tried to use
inspect.getframeinfo(), because there is an explicit type check
near the beginning of getframeinfo() that raises a TypeError
(see the isframe() check below).

http://hg.python.org/cpython/file/35ef949e85d7/Lib/inspect.py#l987

def getframeinfo(frame, context=1):
    """Get information about a frame or traceback object.

    A tuple of five things is returned: the filename, the line number of
    the current line, the function name, a list of lines of context from
    the source code, and the index of the current line within that list.
    The optional second argument specifies the number of lines of context
    to return, which are centered around the current line."""
    if istraceback(frame):
        lineno = frame.tb_lineno
        frame = frame.tb_frame
    else:
        lineno = frame.f_lineno
    if not isframe(frame):
        raise TypeError('{!r} is not a frame or traceback object'.format(frame))

    filename = getsourcefile(frame) or getfile(frame)
    if context > 0:
        start = lineno - 1 - context//2
        try:
            lines, lnum = findsource(frame)
        except IOError:
            lines = index = None
        else:
            start = max(start, 1)
            start = max(0, min(start, len(lines) - context))
            lines = lines[start:start+context]
            index = lineno - 1 - start
    else:
        lines = index = None

    return Traceback(filename, lineno, frame.f_code.co_name, lines, index)

The solution was to replace getframeinfo() with a version that
skips the check. Unfortunately, getframeinfo() uses
getfile(), which performs a similar check, so that function
needed to be replaced, too.

stack_modify3.py

import inspect

from stack_modify2 import show_app_name


def getframeinfo(frame, context=1):
    """Get information about a frame or traceback object.

    A tuple of five things is returned: the filename, the line number of
    the current line, the function name, a list of lines of context from
    the source code, and the index of the current line within that list.
    The optional second argument specifies the number of lines of context
    to return, which are centered around the current line."""
    if inspect.istraceback(frame):
        lineno = frame.tb_lineno
        frame = frame.tb_frame
    else:
        lineno = frame.f_lineno
    # if not isframe(frame):
    #     raise TypeError('{!r} is not a frame or traceback object'.format(frame))

    filename = inspect.getsourcefile(frame) or inspect.getfile(frame)
    if context > 0:
        start = lineno - 1 - context//2
        try:
            lines, lnum = inspect.findsource(frame)
        except IOError:
            lines = index = None
        else:
            start = max(start, 1)
            start = max(0, min(start, len(lines) - context))
            lines = lines[start:start+context]
            index = lineno - 1 - start
    else:
        lines = index = None

    return inspect.Traceback(filename, lineno, frame.f_code.co_name, lines, index)
inspect.getframeinfo = getframeinfo


def getfile(object):
    """Work out which source or compiled file an object was defined in."""
    if hasattr(object, 'f_code'):
        return object.f_code.co_filename
    return orig_getfile(object)
orig_getfile = inspect.getfile
inspect.getfile = getfile


if __name__ == '__main__':
    show_app_name()
    s = inspect.stack()
    print inspect.getframeinfo(s[-1][0])

Now the caller can use inspect.getframeinfo() (really my
replacement function) and see the modified filename in the return
value.

$ python stack_modify3.py
From stack: wrong
From frame: wrong
Traceback(filename='wrong', lineno=51, function='<module>',
code_context=None, index=None)

After reviewing inspect.py one more time to see if I needed to
replace any other functions, I realized that a better solution was
possible. The implementation of inspect.stack() is very small,
since it calls inspect.getouterframes() to actually build the
list of frames. The seed frame passed to getouterframes() comes
from sys._getframe().

http://hg.python.org/cpython/file/35ef949e85d7/Lib/inspect.py#l1052

def stack(context=1):
    """Return a list of records for the stack above the caller's frame."""
    return getouterframes(sys._getframe(1), context)

The rest of the stack is derived from the first frame returned by
_getframe() using the f_back attribute to link from one
frame to the next.

http://hg.python.org/cpython/file/35ef949e85d7/Lib/inspect.py#l1025

def getouterframes(frame, context=1):
    """Get a list of records for a frame and all higher (calling) frames.

    Each record contains a frame object, filename, line number, function
    name, a list of lines of context, and index within the context."""
    framelist = []
    while frame:
        framelist.append((frame,) + getframeinfo(frame, context))
        frame = frame.f_back
    return framelist
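
The same f_back linkage is easy to see directly with
sys._getframe(), without going through inspect at all:

import sys

frame = sys._getframe()
while frame:
    print frame.f_code.co_filename, frame.f_lineno
    frame = frame.f_back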

If I modified getouterframes() instead of inspect.stack(),
then I could ensure that my fake frame information was inserted at the
beginning of the stack, and all of the rest of the inspect
functions would honor it.

stack_modify4.py

import collections
import inspect

# Define a namedtuple with the attributes of a stack frame
frame_attrs = ['f_back',
               'f_code',
               'f_builtins',
               'f_globals',
               'f_lasti',
               'f_lineno',
               ]
frame = collections.namedtuple('frame', ' '.join(frame_attrs))

# Define a namedtuple with the attributes of a code object
code_attrs = ['co_argcount',
              'co_nlocals',
              'co_stacksize',
              'co_flags',
              'co_code',
              'co_consts',
              'co_names',
              'co_varnames',
              'co_freevars',
              'co_cellvars',
              'co_filename',
              'co_name',
              'co_firstlineno',
              'co_lnotab',
              ]
code = collections.namedtuple('code', ' '.join(code_attrs))


def _make_fake_frame(original, filename):
    """Return a new fake frame object with the wrong filename."""
    new_c_attrs = dict((a, getattr(original.f_code, a))
                          for a in code_attrs)
    new_c_attrs['co_filename'] = filename
    new_c = code(**new_c_attrs)

    new_f_attrs = dict((a, getattr(original, a))
                           for a in frame_attrs)
    new_f_attrs['f_code'] = new_c
    new_f = frame(**new_f_attrs)
    return new_f


def getouterframes(frame, context=1):
    """Get a list of records for a frame and all higher (calling) frames.

    Each record contains a frame object, filename, line number, function
    name, a list of lines of context, and index within the context."""
    framelist = []
    while frame:
        if not frame.f_back:
            # Replace the top of the stack with a fake entry
            frame = _make_fake_frame(frame, 'wrong')
        framelist.append((frame,) + getframeinfo(frame, context))
        frame = frame.f_back
    return framelist
inspect.getouterframes = getouterframes


def getframeinfo(frame, context=1):
    """Get information about a frame or traceback object.

    A tuple of five things is returned: the filename, the line number of
    the current line, the function name, a list of lines of context from
    the source code, and the index of the current line within that list.
    The optional second argument specifies the number of lines of context
    to return, which are centered around the current line."""
    if inspect.istraceback(frame):
        lineno = frame.tb_lineno
        frame = frame.tb_frame
    else:
        lineno = frame.f_lineno
    # if not isframe(frame):
    #     raise TypeError('{!r} is not a frame or traceback object'.format(frame))

    filename = inspect.getsourcefile(frame) or inspect.getfile(frame)
    if context > 0:
        start = lineno - 1 - context // 2
        try:
            lines, lnum = inspect.findsource(frame)
        except IOError:
            lines = index = None
        else:
            start = max(start, 1)
            start = max(0, min(start, len(lines) - context))
            lines = lines[start:start + context]
            index = lineno - 1 - start
    else:
        lines = index = None

    return inspect.Traceback(filename, lineno, frame.f_code.co_name, lines, index)
inspect.getframeinfo = getframeinfo


def getfile(object):
    """Work out which source or compiled file an object was defined in."""
    if isinstance(object, frame):
        return object.f_code.co_filename
    return orig_getfile(object)
orig_getfile = inspect.getfile
inspect.getfile = getfile


def show_app_name():
    stack_data = inspect.stack()
    #print [(s[1], s[0].__class__, s[0].f_code.co_name) for s in stack_data]
    print 'From stack        :', stack_data[-1][1]
    print 'From code in frame:', stack_data[-1][0].f_code.co_filename
    print 'From frame info   :', inspect.getframeinfo(stack_data[-1][0]).filename


if __name__ == '__main__':
    show_app_name()

The customized versions of getframeinfo() and getfile()
are still required to avoid exceptions caused by the type checking.

$ python stack_modify4.py
From stack        : wrong
From code in frame: wrong
From frame info   : wrong

Enough of That

At this point I have proven to myself that, while it is unlikely that
anyone would bother to do it in a real program (and they would
certainly not do it by accident), it is possible to intercept the
introspection calls and insert bogus information to mislead a program
trying to discover information about itself. This technique does
not subvert pdb, because pdb does not use
inspect. Probably because it predates inspect,
pdb has its own stack-building function, which could be
replaced in the same way as above.

This investigation led me to several conclusions. First, I still don’t
know why the original code is looking at the stack to discover the
program name. I should ask on the OpenStack mailing list, but in the
meantime I had fun experimenting while researching the question.
Second, given that looking at __main__.__file__ produces a value
at least as correct as looking at the stack in all cases, and more
correct when a program is launched using the -m flag, it seems
like the solution with the best combination of reliability and
simplicity. A patch may be in order. And finally, monkey-patching can
drive you to excesses, madness, or both.

Updates

8 May – Updated styles around embedded source files and added a label
for each.

Preparing My First Patch for OpenStack

I joined DreamHost about four weeks ago on Feb 6, and am on the
team building a cloud hosting service based on the open source project
OpenStack. I spent the first couple of weeks at the new job doing
the usual sorts of new-hire activities: reading a lot of documentation
and learning my way around the development environment and tool
chain. I’ve done a little bit of work on some internal code already,
and I recently had a good opportunity to start working on OpenStack
itself.

The OpenStack project team was running a coordinated “Global
Hack-In” on March 1st to work on bugs for the upcoming Essex
release. Several members of the OpenStack Atlanta meetup, sponsored
by DreamHost, met to participate in the hack-in as a group. Since we
don’t have official offices in Atlanta, yet, we gathered over at
Duncan McGreggor’s house. Brent Scotten has already posted a quick
recap of the day on the OpenStack blog, but I wanted to go into a
little more detail about my experiences preparing my first patch for
the project because for me this was a new code base on a new project
using new version control and code review tools.

Becoming an OpenStack Contributor

The process for signing up to be a contributor to OpenStack isn’t
terribly complicated, but there are a lot of steps and the
instructions are spread out over a couple of different documents. I
took care of subscribing to the required mailing lists and signing the
contributor agreement a couple of weeks ago, but I am including those
steps here along with the technical stuff I did yesterday, to provide
an end-to-end view of the process.

Community

The first set of instructions, in the wiki under HowToContribute,
explains how to join the OpenStack community. I registered an account
on Launchpad, the code hosting service managed by Canonical and
used by Ubuntu and other open source projects. Then I joined all of
the OpenStack projects (Nova, Swift, Glance, and Keystone). Joining
the individual teams under the OpenStack umbrella gave me access to
the project mailing lists, so I can eventually participate in
conversations about designs and implementation details.

Contributor Agreement

Please refer to this wiki page for the current process for signing
the CLA.

Development Environment

I am familiar with source control and virtualization, but not the
specific flavors used for this project. My next step was to set up a
development environment where I could work on the Nova project
documentation, which meant setting up a virtual machine where I could
build the project.

Virtualbox

My desktop system is running Mac OS X 10.7.3 (Lion), and OpenStack is
intended to run on Linux (primarily Ubuntu). To set up an Ubuntu
system for the build, I used virtualbox and vagrant. Duncan and
Mark had already found a Vagrantfile that would set up an Oneiric
image and then use DevStack to deploy all of the parts of OpenStack
into the new VM using these chef cookbooks.

I saved the Vagrantfile to an empty directory and started a VM using
vagrant up. After a short wait while devstack downloaded and
installed all of the dependencies, the VM was ready and running a copy
of OpenStack based on the git repositories created by the devstack
script. The local repositories are created in /opt/stack within
the VM, but I wanted to work in a copy that was mounted via NFS from
the host OS so I could edit under OS X and build in the VM.

I was going to work on Nova, so I cloned its git repository from
github and modified the Vagrantfile to give the VM more RAM and to
mount the directory where my source code was checked out over NFS
at /home/vagrant/Devel.

# -*- mode: ruby -*-

Vagrant::Config.run do |config|
  sshdir = "#{ENV['HOME']}/.ssh/"
  cachedir = (ENV['CACHEDIR'] or "/Users/dhellmann/Documents/Devel/openstack-vm/cache")
  checkout = (ENV['COOKBOOKS'] or "/Users/dhellmann/Documents/Devel/openstack-vm/openstack-cookbooks")

  # Make /vagrant an NFS mount instead of using the default setting
  #config.vm.share_folder("v-root", "/vagrant", ".", :nfs => true)

  ip_prefix = (ENV['IP_PREFIX'] or "10.0.5.")
  mac_prefix = (ENV['MAC_PREFIX'] or "080027027")
  suffix = "100"
  ip = "#{ip_prefix}#{suffix}"
  config.vm.box = "oneiric"
  config.vm.box_url = "http://images.ansolabs.com/vagrant/oneiric64.box"
  config.vm.customize ['modifyvm', :id, '--memory', '2048']
  config.vm.network :hostonly, ip, :mac => "#{mac_prefix}#{suffix}"

  config.vm.share_folder("v-cache", "/home/vagrant/cache", cachedir, :nfs => true)
  config.vm.share_folder("v-ssh", "/home/vagrant/.host-ssh", sshdir)
  config.vm.share_folder("v-dev", "/home/vagrant/Devel", "/Users/dhellmann/Documents/Devel/nova-docs",
                         :nfs => true)

  config.vm.forward_port 80, 8080
  config.vm.provision :chef_solo do |chef|
    chef.cookbooks_path = "#{checkout}/cookbooks"
    chef.roles_path = "#{checkout}/roles"
    chef.log_level = :debug
    chef.run_list = [
      "recipe[anso::cache]",
      "recipe[nova::hostname]",
      "recipe[nova::source]",
      #"recipe[anso::settings]", # vim / screen / git settings for testing
    ]
    chef.json.merge!({
      :nova => {
        :source => {
          :mysql_password => "secrete",
          :rabbit_password => "secrete",
          :admin_password => "secrete",
          :service_token => "secrete",
          :flat_interface => "eth1",
          :public_interface => "eth1",
          :floating_range => (ENV['FLOATING'] or "#{ip_prefix}128/28"),
          :host_ip => ip,
        }
      },
    })
  end
end

After restarting the VM (vagrant halt && vagrant up) I was able to
login and switch Nova to use the git repository I had just checked
out:

$ vagrant ssh
$ cd Devel/nova
$ sudo python setup.py develop

I encountered some I/O issues using vagrant ssh to connect to the
VM, so in the end I switched to using ssh directly via the port that
was forwarded using the instructions in the Vagrantfile above and the
key vagrant used when setting up the VM.

$ ssh -i ~/.vagrant.d/insecure_private_key -p 2222 vagrant@127.0.0.1

Building the Documentation

The documentation (and test) build for Nova uses a local virtualenv
created within the source tree. The instructions for creating the
virtualenv are part of the Nova Developer Guide. We all had
issues with timeouts and installation failures of one sort or another
setting up the virtualenv within a virtualbox VM, and I eventually
resorted to installing the dependencies into the virtualenv by hand
using:

$ cd Devel/nova
$ source .venv/bin/activate
(.venv)$ pip install -r tools/pip-requires
(.venv)$ pip install -r tools/test-requires

Sphinx and the other related tools needed to build the documentation
are installed into the virtualenv so that, with the environment
active, it is easy to build the HTML documentation using make:

$ cd doc
$ make

Choosing Something to Fix

It took Sphinx a while to process all of the documentation, and it
produced a lot of warnings and errors along the way. Duncan and Mark
worked on some bugs in the OpenStack code, but I decided that since I
was new to the tools as well as the code-base I would focus on
shepherding a small documentation change all the way through the
process this time around, to make sure I had everything set up
correctly and that I understood the process. Fixing some of the errors
in the reStructuredText formatting within the documentation met my
criteria nicely.

Tracking Changes

OpenStack uses Launchpad for bug reports and task management.
Launchpad’s operating model uses “blueprints” to group related
bugs and feature tickets. After studying the errors produced by
Sphinx, I decided that there were basically three causes:

  1. Problems in the automatically generated markup used to produce
     the API guide.
  2. Problems in the manually written markup of the manual.
  3. Problems in the manually written reStructuredText within the Nova
     source code, used when the API guide is generated.

I created one blueprint called sphinx-doc-cleanup to track all of
the changes, and then opened three separate bug tickets to address the
different types of problems. After opening the bugs, I went back to
the blueprint page to associate the new tickets with the blueprint.

I decided to tackle the changes in the order they are listed above,
because the fix for the first one was just a few lines and it would
eliminate several error messages.

Crafting the Fix

When Sphinx runs to build the Nova documentation, the first thing it
does is generate skeleton files for some of the API documentation
using a script called nova/doc/generate_autodoc_index.sh. For each
module that needs to be documented and for which no documentation
exists, a reStructuredText file is emitted containing a header such
as:

The :mod:`nova.wsgi` Module
===========================
.. automodule:: nova.wsgi
   :members:
   :undoc-members:
   :show-inheritance:

The result is that even if there is no special local documentation for
the module, the doc-strings from the code will be processed and
included in the final output.

The bug I found was with the way generate_autodoc_index.sh
calculated the length of the underline to create the reStructuredText
heading. The original version of the script used a single long line of
=, but some of the module names were longer than the line. Rather
than just extend it, I changed the script to calculate the proper
length.
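
The idea behind the fix is simple enough to show in a few lines of
Python (an illustration of the approach, not the actual shell patch,
and the module name is a made-up example):

# Build an underline exactly as long as the heading it sits under.
module_name = 'nova.wsgi'
heading = 'The :mod:`%s` Module' % module_name
print heading
print '=' * len(heading)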

Because this was my first patch, I also added myself to the
Authors file at the top of the Nova source tree so I would be
listed among the project contributors. I committed the change to my
local repository, including the blueprint and bug id in the git commit
message. At that point I was ready to submit the patch for review.

Review Tools and Process

OpenStack uses github to host its code, but the project does not
accept pull requests in the usual way. Instead, it uses a copy of
Gerrit tied to Launchpad and Jenkins so that changes can be reviewed
by other developers and then run through the test suite before being
merged into the main repository automatically.

The instructions for setting up git to be able to submit change
requests to Gerrit are in the GerritWorkflow page of the OpenStack
wiki. After installing the git-review plug-in described there, I
ran into some trouble authenticating with Gerrit. I finally figured
out that I needed to upload the public half of my ssh key to the
Gerrit server. I don’t know if it was missing because I did not have
it installed on Launchpad when I authenticated the first time, or if I
would have had to take the extra step anyway. Installing the key
allowed me to complete the setup, though, and then submit my change
for review with git review.

While I was looking for instructions about how to make sure someone
saw that the change was ready for review, Vish Ishaya reviewed the
change (email notifications of reviews are sent to a review mailing
list populated, I assume, from people who have signed up for the
OpenStack Gerrit instance). I ended up needing to modify the git
metadata for my change to use my DreamHost email address instead of my
personal account, but after a bit of back-and-forth on IRC to ask for
another review, Vish approved the change. A little while later the
Jenkins build completed and the change was merged into the main
repository. My first commit was merged!

Final Thoughts

So far I have found the OpenStack community helpful and welcoming. I
especially want to thank Anne Gentle, who assisted me with finding
some of the development setup documentation I needed. I should also
thank Vish Ishaya and Brian Waldon for jumping right on the code
review as soon as I submitted the patch, which made it possible to
achieve my personal goal for the day. As I mentioned above, the
OpenStack developers are using several tools I hadn’t used extensively
before. The tools integrate well, especially those like gerrit that
can use Launchpad for authentication. I am confident that I have
everything I need set up now, which means I can start making more
significant contributions.

Using Fuzzy Matching to Search by Sound with Python

When you’re writing code to search a database, you can’t rely on
all those data entries being spelled correctly. Doug Hellmann,
developer at DreamHost and author of The Python Standard Library By Example,
reviews available options for searching databases by the sound of
the target’s name, rather than relying on the entry’s accuracy.

Searching for a person’s name in a database is a unique
challenge. Depending on the source and age of the data, you may not be
able to count on the spelling of the name being correct, or even the
same name being spelled the same way when it appears more than
once. Discrepancies between stored data and search terms may be
introduced due to personal choice or cultural differences in
spellings, homophones, transcription errors, illiteracy, or simply
lack of standardized spellings during some time periods. These sorts
of problems are especially prevalent in transcriptions of handwritten
historical records used by historians, genealogists, and other
researchers.

A common way to solve the string search problem is to look for values
“close to” the search target. Using a traditional fuzzy match
algorithm to compute the closeness of two arbitrary strings is
expensive, though, and is not appropriate for searching large
data sets. A better solution is to compute hash values for entries in
the database in advance, and several special hash algorithms have been
created for this purpose. These phonetic hash algorithms allow you to
compare two words or names based on how they sound, rather than the
precise spelling.

Early Efforts: Soundex

One such algorithm is Soundex, developed by Margaret K. Odell and
Robert C. Russell in the early 1900s. The Soundex algorithm appears
frequently in genealogical contexts because it is associated with the
U.S. Census and is specifically designed to encode names. A Soundex
hash value is calculated by using the first letter of the name and
converting the consonants in the rest of the name to digits using a
simple lookup table. Vowels and duplicate encoded values are dropped,
and the result is padded up to–or truncated down to–four
characters. The Fuzzy library includes a Soundex implementation for
Python programs.

#!/usr/bin/env python

import fuzzy

names = [ 'Catherine', 'Katherine', 'Katarina',
          'Johnathan', 'Jonathan', 'John',
          'Teresa', 'Theresa',
          'Smith', 'Smyth',
          'Jessica',
          'Joshua',
          ]

soundex = fuzzy.Soundex(4)

for n in names:
    print '%-10s' % n, soundex(n)

The output of show_soundex.py demonstrates that some of the names
with similar sounds are encoded with the same hash value, but the
results are not ideal. The variations “Theresa” and “Teresa” both
produce the same Soundex hash, but “Catherine” and “Katherine” start
with a different letter so even though they sound the same the hash
outputs are different. The last two names, “Jessica” and “Joshua,” are
not related at all but are given the same hash value because the
letters “J,” “S,” and “C” all map to the digit 2, and the algorithm
removes duplicates. These types of failures illustrate a major
shortcoming of Soundex.

$ python show_soundex.py
Catherine  C365
Katherine  K365
Katarina   K365
Johnathan  J535
Jonathan   J535
John       J500
Teresa     T620
Theresa    T620
Smith      S530
Smyth      S530
Jessica    J200
Joshua     J200
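
To make the lookup-table idea concrete, here is a drastically
simplified pure-Python version that happens to reproduce the hash
values shown above for these names (the real implementation in the
Fuzzy library handles more edge cases):

# A simplified Soundex sketch, for illustration only.
def simple_soundex(name, length=4):
    lookup = {}
    for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'),
                           ('dt', '3'), ('l', '4'),
                           ('mn', '5'), ('r', '6')]:
        for letter in letters:
            lookup[letter] = digit
    name = name.lower()
    encoded = name[0].upper()
    prev = ''
    for c in name[1:]:
        digit = lookup.get(c)
        if digit and digit != prev:
            encoded += digit
        if digit:
            prev = digit
    return (encoded + '000')[:length]

print simple_soundex('Catherine')  # C365
print simple_soundex('Jessica')    # J200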

Beyond English: NYSIIS

Algorithms developed after Soundex use different encoding schemes,
either building on Soundex by tweaking the lookup table or starting
from scratch with their own rules. All of them process phonemes
differently in an attempt to improve accuracy. For example, in the
1970s the New York State Identification and Intelligence System
algorithm
(NYSIIS) was published by Robert L. Taft. NYSIIS was
originally used by what is now the New York State Division of Criminal
Justice Services to help identify people in their database. It
produces better results than Soundex because it takes special care to
handle phonemes that occur in European and Hispanic surnames.

#!/usr/bin/env python

import fuzzy

names = [ 'Catherine', 'Katherine', 'Katarina',
          'Johnathan', 'Jonathan', 'John',
          'Teresa', 'Theresa',
          'Smith', 'Smyth',
          'Jessica',
          'Joshua',
          ]

for n in names:
    print '%-10s' % n, fuzzy.nysiis(n)

The output of show_nysiis.py is better than the results from
Soundex with this sample data because “Catherine”, “Katherine”, and
“Katarina” all map to the same hash value. The incorrect match of
“Jessica” and “Joshua” is also eliminated because more of the letters
from the names are used in the NYSIIS hash values.

$ python show_nysiis.py
Catherine  CATARAN
Katherine  CATARAN
Katarina   CATARAN
Johnathan  JANATAN
Jonathan   JANATAN
John       JAN
Teresa     TARAS
Theresa    TARAS
Smith      SNATH
Smyth      SNATH
Jessica    JASAC
Joshua     JAS

A New Approach: Metaphone

Metaphone, published in 1990 by Lawrence Philips, is another algorithm
that improves on earlier systems such as Soundex and NYSIIS. The
Metaphone algorithm is significantly more complicated than the others
because it includes special rules for handling spelling
inconsistencies and for looking at combinations of consonants in
addition to some vowels. An updated version of the algorithm called
Double Metaphone goes even further by adding rules for handling some
spellings and pronunciations from languages other than English.

#!/usr/bin/env python

import fuzzy

names = [ 'Catherine', 'Katherine', 'Katarina',
          'Johnathan', 'Jonathan', 'John',
          'Teresa', 'Theresa',
          'Smith', 'Smyth',
          'Jessica',
          'Joshua',
          ]

dmetaphone = fuzzy.DMetaphone(4)

for n in names:
    print '%-10s' % n, dmetaphone(n)

In addition to having a broader set of encoding rules, Double
Metaphone generates two alternate hashes for each input word. This
gives the caller the ability to present search results with two levels
of precision. In the results from the sample program, “Catherine” and
“Katherine” have the same primary hash value. Their secondary hash
value is the same as the primary hash for “Katarina,” finding the
match that Soundex did not, but giving it less weight than the results
from NYSIIS implied.

$ python show_dmetaphone.py
Catherine  ['K0RN', 'KTRN']
Katherine  ['K0RN', 'KTRN']
Katarina   ['KTRN', None]
Johnathan  ['JN0N', 'ANTN']
Jonathan   ['JN0N', 'ANTN']
John       ['JN', 'AN']
Teresa     ['TRS', None]
Theresa    ['0RS', 'TRS']
Smith      ['SM0', 'XMT']
Smyth      ['SM0', 'XMT']
Jessica    ['JSK', 'ASK']
Joshua     ['JX', 'AX']
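
Because each word gets a primary and a secondary hash, a caller can
rank matches in two tiers. Here is a small sketch of that idea; the
match_strength() helper is hypothetical, not part of the Fuzzy API:

import fuzzy

dmetaphone = fuzzy.DMetaphone(4)


def match_strength(name1, name2):
    p1, s1 = dmetaphone(name1)
    p2, s2 = dmetaphone(name2)
    if p1 == p2:
        return 'strong'  # primary hashes agree
    if p2 == s1 or p1 == s2 or (s1 and s1 == s2):
        return 'weak'    # match only through a secondary hash
    return 'none'

print match_strength('Catherine', 'Katherine')  # strong
print match_strength('Catherine', 'Katarina')   # weak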

Applying Phonetic Searches

Using phonetic searches in your application is straightforward, but
may require adding extensions to the database server or bundling a
third-party library with your application. MySQL, Postgresql, SQLite,
and Microsoft SQL Server all support Soundex through a string function
that can be invoked directly in queries. Postgresql also includes
functions to calculate hashes using the original Metaphone algorithm
and Double Metaphone.

Stand-alone implementations of all of the algorithms are also
available for major programming languages such as Python, PHP, Ruby,
Perl, C/C++, and Java. These libraries can be used with databases that
do not have support for phonetic hash functions built in, such as
MongoDB. For example, this script loads a series of names into a
database, saving each hash value as a precomputed value to make
searching easier later.

#!/usr/bin/env python

import argparse

import fuzzy
from pymongo import Connection

parser = argparse.ArgumentParser(description='Load names into the database')
parser.add_argument('name', nargs='+')
args = parser.parse_args()

c = Connection()
db = c.phonetic_search
dmetaphone = fuzzy.DMetaphone()
soundex = fuzzy.Soundex(4)

for n in args.name:
    # Compute the hashes. Save soundex
    # and nysiis as lists to be consistent
    # with dmetaphone return type.
    values = {'_id':n,
              'name':n,
              'soundex':[soundex(n)],
              'nysiis':[fuzzy.nysiis(n)],
              'dmetaphone':dmetaphone(n),
              }
    print 'Loading %s: %s, %s, %s' % \
        (n, values['soundex'][0], values['nysiis'][0],
         values['dmetaphone'])
    db.people.update({'_id':n}, values,
                     True,  # upsert: insert if not found
                     False, # manipulate
                     )

Run mongodb_load.py from the command line to save names for
retrieval later.

$ python mongodb_load.py Jonathan Johnathan Joshua Jessica
Loading Jonathan: J535, JANATAN, ['JN0N', 'ANTN']
Loading Johnathan: J535, JANATAN, ['JN0N', 'ANTN']
Loading Joshua: J200, JAS, ['JX', 'AX']
Loading Jessica: J200, JASAC, ['JSK', 'ASK']

$ python mongodb_load.py Catherine Katherine Katarina
Loading Catherine: C365, CATARAN, ['K0RN', 'KTRN']
Loading Katherine: K365, CATARAN, ['K0RN', 'KTRN']
Loading Katarina: K365, CATARAN, ['KTRN', None]

The search program mongodb_search.py lets the user select a hash
function and then constructs a MongoDB query to find all names with a
hash value matching the input name.

#!/usr/bin/env python

import argparse

import fuzzy
from pymongo import Connection

ENCODERS = {
    'soundex':fuzzy.Soundex(4),
    'nysiis':fuzzy.nysiis,
    'dmetaphone':fuzzy.DMetaphone(),
    }

parser = argparse.ArgumentParser(description='Search for a name in the database')
parser.add_argument('algorithm', choices=('soundex', 'nysiis', 'dmetaphone'))
parser.add_argument('name')
args = parser.parse_args()

c = Connection()
db = c.phonetic_search

encoded_name = ENCODERS[args.algorithm](args.name)
query = {args.algorithm:encoded_name}

for person in db.people.find(query):
    print person['name']

In some of these sample cases the extra values in the result set are
desirable because they are valid matches. On the other hand, the
Soundex search for “Joshua” returns the unrelated value “Jessica”
again. Although Soundex produces poor results when compared to the
other algorithms, it is still used in many cases because it is built
into the database server. Its simplicity also means it is faster than
NYSIIS or Double Metaphone. In situations where the results are
good enough, its speed may be a deciding factor in selecting it.

$ python mongodb_search.py soundex Katherine
Katherine
Katarina

$ python mongodb_search.py nysiis Katherine
Catherine
Katherine
Katarina

$ python mongodb_search.py soundex Joshua
Joshua
Jessica

$ python mongodb_search.py nysiis Joshua
Joshua

Final Thoughts

I hope this article demonstrates the power that phonetic hash
algorithms can add to the search features of your application, and the
ease with which you can implement them. Selecting the right algorithm
to use will depend on the nature of the data and the types of searches
being performed. If the right algorithm is not clear from the data
available, it may be best to provide an option to the user to let them
pick a hash algorithm. Offering the user a choice provides the most
flexibility for experimentation and for refining searches, although it
does require a little more work on your part to set up the
indexes. Many researchers, historians, and genealogists are familiar
with the names of the algorithms, if not the implementations, so
presenting them as options should not intimidate users.

References

Originally published on informit.com, 3 March 2012

How I Review a PyCon Talk Proposal

As the submission period for PyCon 2012 comes to a close and we
transition to reviewing the proposals from potential speakers, I
thought I would talk about the criteria I use for evaluating the
submissions. I am only one member of the Program Committee, and others
may use different criteria, so please take that into account while
reading this article.

I approach conference talk proposals in much the same way that I
approached submissions for Python Magazine. Just as the magazine had a
limited amount of space for articles each month, there are a limited
number of speaking slots at PyCon. The Program Committee’s job is to
choose the talks that maximize the benefit to the entire conference
audience, within the space and time constraints given. While reviewing
a proposal, I keep in mind that, as one person among 1,500, my opinion
only goes so far toward the makeup of the conference audience. While
my own enthusiasm and interest in a subject do matter, I try to
consider the rest of the audience first. The main question I keep in
my mind is, will the audience seeing this talk receive enough value to
make it worth bumping another talk? A lot of factors go into that
decision, and balancing them all can be a challenge.

The Abstract

The first step to reviewing the proposal is to read it. That seems
fairly obvious, but occasionally a title comes along that is so
compelling that I have to remind myself to keep reading before voting
+1 and moving on to the next talk. It isn’t enough for a proposal to
cover an interesting topic. It has to indicate that the talk will be
interesting, too. While I am reading, I look for several factors.

Is it clear, complete, and compelling?

First, is the abstract clear? The speaker should describe the topic
they plan to talk about in terms I can understand, even if I don’t
know anything about that subject area. The audience for PyCon is
diverse in experience and background, and so are the speakers. A
clearly written abstract, without a lot of domain-specific jargon,
tells me the speaker will be able to communicate with the audience.
If I can’t understand the proposal, I assume the audience won’t
understand the talk either.

Next, is the abstract complete? An incomplete proposal is the
largest negative factor I consider. If a proposal is incomplete, I
can’t really say what the speaker will talk about, or even if they
know the subject matter for their talk. If a proposal does not have a
detailed summary or outline, I ask the submitter to provide more
detail.

Finally, I consider whether the abstract is compelling. Without
regard to the actual subject, is the abstract written in a way to
attract an audience? Is it full of boring clichés, vaguely worded
assertions about the superiority of one tool over another, cute
metaphors, or hand waving? I look for an abstract that shows the
speaker is excited about the topic, and that they will be conveying
that same excitement to the audience.

The Topic

For some people, the subject matter of a talk is the most important,
or only, aspect taken into consideration when voting. I have seen
presentations on topics I thought would be boring, but which were
delivered with such enthusiasm that I enjoyed them more than talks I
thought would be interesting from the outset. In my mind, the topic is
an important factor, but not necessarily the most important, to be
considered.

How relevant, niche, immediately useful, and novel is the topic?

I look first at whether the topic is relevant to the conference
attendees. For PyCon, that largely means users of the Python
programming language. Attendees will have a range of experience
levels, interests, and backgrounds, though, so not every talk needs to
include examples of Python code showing how to code around a
problem. I have seen successful talks covering user interface
standards compliance, community building, Java-based testing tools,
and a host of other non-coding topics.

Although we want a broad set of topics, we do need to be careful to
avoid talks that are too narrowly focused on a niche. PyCon has
grown large enough that we run five tracks of talks
simultaneously. That means each talk is usually competing with four
others, and when the start and stop times don’t align it can be more
than that. Each talk needs to attract enough of that audience to
justify being included in the program instead of another of the many
rejected talks. There are no formal metrics for those criteria, because
we cannot know in advance how well attended each talk will be. Talks that
are unlikely to attract a significant audience at a large conference
such as PyCon may find a better fit at a regional or subject-matter
based conference. Examples of niche talks are those on topics that
require deep technical knowledge of a scientific field, unpopular
hardware platform, or industry. This is also why I tend to down-vote
talks on brand new libraries or tools, usually with an admonition to
resubmit the proposal when the project has matured and the community
around it has grown. Now that PyCon includes a poster session, I often
recommend that new projects which show a lot of promise convert their
talk proposal into a poster proposal.

As a counterpoint to considering whether a topic is too niche, I also
try to take into account whether the audience will take away something
immediately useful. The members of the Python community blend open
source and closed source solutions. Presentations on a closed source
application or service are less interesting, unless the techniques
presented can be applied elsewhere. Talks on open source projects are
more likely to involve reusable code, just by their nature. However,
new and incomplete projects don’t meet that standard.

The final criterion I consider for the proposal’s topic is whether it
is novel. Although PyCon has a significant amount of turnover in
the audience year to year, I don’t want the program to be made up of
the same topics over and over. The same goes for subjects that have
been covered in blog posts, articles, and books. I am unlikely to
up-vote an introductory talk for a project with excellent documentation
unless the talk presents material in a new and interesting way.

The Presenter

Foremost in my mind when thinking about the presenter is the question
of whether or not they will be successful at delivering their proposed
talk. That question covers a lot of ground, so I break it up into
several aspects and take each in turn.

How skilled, knowledgeable, and experienced is the speaker?

First, how skilled is the speaker? The PyCon audience deserves
speakers who take their task seriously by preparing and practicing.
Evaluating the presentation skills of an unknown speaker can be
challenging. I have not had the opportunity to attend a lot of other
conferences or user group meetings in person, so unless they have
spoken at PyCon in the recent past I am unlikely to know much about
their style or skill. Restricting myself to events I have attended
means I may tend to favor a small group of “usual suspects.” To avoid
that bias, I look for cases where they have spoken at user group
meetings, regional conferences, or other events. Having access to
videos of past presentations helps, and I try to review at least a few
minutes of video for a speaker, when it is available.

I also want to see an indication that the speaker understands the
subject well enough to speak at length about it. A good speaker does
not need to be a domain expert, but some knowledge and experience is
required. It is not necessary for every presentation on a tool to be
delivered by the tool’s developer. Users and other community members
with good public speaking skills can present a tool just as
effectively.

Finally, does the speaker have enough experience speaking to know
their limits? There are two main limits: venue and time. PyCon is
typically held in large hotel ballrooms and that type of venue
presents a couple of challenges. The average audience for any given
talk is 1/4 to 1/5 of the 1,500 conference attendees. Due to the
structure of the venue, interactive sessions tend to flow less well
than the more standard talk-followed-by-questions format. Presentations
that rely on showing code battle with the deep room layout: fonts have
to be large enough to be read from the back, which means big blocks of
code don’t fit on a single slide. All of these issues can be overcome,
but a presenter has to
know that they can’t just slap up their text editor together with a
terminal window using a low-contrast color scheme and expect the
audience to follow along during a live demo. Previous speaking
experience at smaller events shows me that a speaker is likely to
understand these issues.

The time issue is more difficult to balance. I want to see a talk with
enough depth that it shows something interesting, but reaching that
depth may require a time-consuming introduction. Setting the audience
level of the talk is one approach for avoiding this problem, because
the audience for an Intermediate or Experienced talk will not expect
as much of an introduction as someone in a Novice talk. On the other
hand, assuming too much knowledge may mean even an Experienced attendee
does not understand the main point of a presentation. Speakers can
reclaim time by removing redundant examples, tightening their focus,
and trimming some of the commentary, so I usually try to point out
material I think can be cut to give the speaker more time.

It is less common for an outline to seem too short to fill the space
given, but it does happen. In this case, I look for information the
presenter may need to add, suggest they work in extra examples, or
simply tell them it looks too short. If a talk can’t fill a 30 minute
slot, but is still important or interesting, I may propose they
convert their proposal into a poster, instead.

Voting Versus Steering

One point frequently lost in the heat of the reviewing process is that
the Program Committee is not just supposed to judge the
proposals. Our responsibility is to make the PyCon program
better. That mission includes guiding the speakers to improve their
talks by refining the original proposals.

I have already mentioned some of the common suggestions I make
during the course of a review. After considering all of the other
factors, I pause and pretend that the proposal I am reviewing is for
the only talk I will be able to see at the conference. Then I think of
ways the presentation could be improved, and engage with the speaker
to discuss them. I try to provide guidance based on past experience to
help newer speakers understand how their talk might be improved to fit
into PyCon better.

I also tend to ask a lot of questions, especially if I or other
reviewers have a negative impression of the proposal. Asking for
clarification or more detail is an important part of the process.
Sometimes the interaction with the speaker helps me decide whether to
champion a talk, or merely vote +0 or -0.

tl;dr

As I said at the outset, other reviewers may use different criteria
when evaluating a proposal. We want a big enough Program Committee
that we have a range of experience and input into the selection
process. But these are the criteria I use, so if I end up reviewing
your proposal, you know how I went about it.

  1. Is the abstract clear?
  2. Is the abstract complete?
  3. Is the abstract compelling?
  4. Is the topic relevant?
  5. Is the topic too niche?
  6. Is the material immediately useful?
  7. Is the presentation novel?
  8. Is the speaker skilled?
  9. Is the speaker knowledgeable?
  10. Is the speaker experienced?
  11. How can the proposal/talk be improved?

Creating a Spelling Checker for reStructuredText Documents

I write a lot using reStructuredText files as the source format,
largely because of the ease of automating the tools used to convert
reST to other formats. The number of files involved has grown to the
point that some of the post-writing tasks were becoming tedious, and I
was skipping steps like running the spelling checker. I finally
decided to do something about that by creating a spelling checker
plugin for Sphinx, released as *sphinxcontrib-spelling*.

I have written about *why I chose reST* before. All of the articles
on this site, including the Python Module of the Week series,
started out as .rst files. I also use Sphinx to produce
several developer manuals at my day job. I like reST and Sphinx
because they both can be extended to meet new needs easily. One area
that has been lacking, though, is support for a spelling checker.

Checking the spelling of the contents of an individual
reStructuredText file from within a text editor like Aquamacs is
straightforward, but I have on the order of 200 separate files making
up parts of this site alone, not to mention *my book*. Manually
checking each file, one at a time, is a tedious job, and not one I
perform very often. After finding a few typos recently, I decided I
needed to take care of the problem by using automation to eliminate
the drudgery and make it easier to run the spelling checker regularly.

The files are already configured to be processed by Sphinx when they
are converted to HTML and PDF format, so that seemed like a natural
way to handle the spelling checker, too. To add a step to the build to
check the spelling of every file, I would need two new tools: an
extension to Sphinx to drive the spelling checker and the spelling
checker itself. I did not find any existing Sphinx extensions that
checked spelling, so I decided to write my own. The first step was to
evaluate spelling checkers.

Choosing a Spelling Checker

I recently read Peter Norvig’s article How to Write a Spelling
Corrector, which shows how to create a spelling checker from scratch
in Python. As with most nontrivial applications, though, the algorithm
for testing the words is only part of the story when looking at a
spelling checker. An equally important aspect is the dictionary of
words known to be spelled correctly. Without a good dictionary, the
algorithm would report too many false negatives. Not wanting to build
my own dictionary, I decided to investigate existing spelling checkers
and concentrate on writing the interface layer to connect them to
Sphinx.

There are several open source spelling checkers with Python bindings.
I evaluated aspell-python and PyEnchant (bindings for
enchant, the spelling checker from the AbiWord project). Both tools
required some manual setup to get the engine working. The
aspell-python API was simple to use, but I decided to use PyEnchant
instead. It has an active development group and is more extensible
(with APIs to define alternate dictionaries, tokenizers, and filters).
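
For example, enchant.DictWithPWL combines a standard dictionary
with a personal word list. This fragment is only a sketch: it assumes
a file named spelling_wordlist.txt exists, with one known-good word
per line.

import enchant

# spelling_wordlist.txt holds one word per line, e.g. "sphinxcontrib".
d = enchant.DictWithPWL('en_US', 'spelling_wordlist.txt')
print d.check('sphinxcontrib')  # True when the word is in the file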

Installing PyEnchant

I started out by trying to install enchant and PyEnchant from source
under OS X with Python 2.7, but eventually gave up after having to
download several dependencies just to get configure to run for
enchant. I stuck with PyEnchant as a solution because installing
aspell was not really any easier (the installation experience for both
tools could be improved). The simplest solution for OS X and Windows
is to use the platform-specific binary installers for PyEnchant (not
the .egg), since they include all of the dependencies. That means it
cannot be installed into a virtualenv, but I was willing to live with
that for the sake of having any solution at all.

Linux platforms can probably install enchant via RPM or other system
package, so it is less of a challenge to get PyEnchant working there,
and it may even work with pip.

Using PyEnchant

There are several good examples in the PyEnchant tutorial, and I
will not repeat them here. I will cover some of the concepts, though,
as part of explaining the implementation of the new extension.

The PyEnchant API is organized around a “dictionary,” which can be
loaded at runtime based on a language name. Enchant does some work to
try to determine the correct language automatically based on the
environment settings, but I found it more reliable to set the language
explicitly. After the dictionary is loaded, its check() method can
be used to test whether a word is correct or not. For incorrect words,
the suggest() method returns a list of possible alternatives,
sorted by the likelihood they are the intended word.
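
For example, a short session with the dictionary API (assuming the
en_US dictionary is installed) looks something like this:

import enchant

d = enchant.Dict('en_US')

print d.check('misspelling')   # True
print d.check('mispelling')    # False
print d.suggest('mispelling')  # ['misspelling', 'dispelling', ...]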

The check() method works well for individual words, but cannot
process paragraphs. PyEnchant provides an API for checking larger
blocks of text, but I chose to use a lower level API instead. In
addition to the dictionary, PyEnchant includes a “tokenizer” API for
splitting text into candidate words to be checked. Using the tokenizer
API means that the new plugin can run some additional tests on words
not found in the dictionary. For example, I plan to provide an option
to ignore “misspelled” words that appear to be the name of an
importable Python module.
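
The tokenizer API is small. A minimal sketch, assuming the en_US
tokenizer is available, splits a block of text into (word, position)
pairs that can be checked one at a time:

from enchant.tokenize import get_tokenizer

tknzr = get_tokenizer('en_US')
for word, pos in tknzr('Most tokenizers yield words and offsets.'):
    print word, pos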

Integrating with Sphinx

The Sphinx Extension API includes several ways to add new features
to Sphinx, including markup roles, language domains, processing
events, and directives. I chose to create a new “builder” class,
because that would give me complete control over the way the document
is processed. The builder API works with a parsed document to create
output, usually in a format like HTML or PDF. In this case, the
SpellingBuilder does not generate any output files. It prints the
list of misspelled words to standard output, and includes the headings
showing where the words appear in the document.

The first step in creating the new extension is to define a
setup() function to be invoked when the module is loaded. The
function receives as its argument an instance of the Sphinx
application, ready to be configured. In
sphinxcontrib.spelling.setup(), the new builder and several
configuration options are added to the application. Although the
Sphinx configuration file can contain any Python code, only the
explicitly registered configuration settings affect the way the
environment is saved.

def setup(app):
    app.info('Initializing Spelling Checker')
    app.add_builder(SpellingBuilder)
    # Report guesses about correct spelling
    app.add_config_value('spelling_show_suggestions', False, 'env')
    # Set the language for the text
    app.add_config_value('spelling_lang', 'en_US', 'env')
    # Set a user-provided list of words known to be spelled properly
    app.add_config_value('spelling_word_list_filename', 'spelling_wordlist.txt', 'env')
    # Assume anything that looks like a PyPI package name is spelled properly
    app.add_config_value('spelling_ignore_pypi_package_names', False, 'env')
    # Assume words that look like wiki page names are spelled properly
    app.add_config_value('spelling_ignore_wiki_words', True, 'env')
    # Assume words that are all caps, or all caps with trailing s, are spelled properly
    app.add_config_value('spelling_ignore_acronyms', True, 'env')
    # Assume words that are part of __builtins__ are spelled properly
    app.add_config_value('spelling_ignore_python_builtins', True, 'env')
    # Assume words that look like the names of importable modules are spelled properly
    app.add_config_value('spelling_ignore_importable_modules', True, 'env')
    # Add any user-defined filter classes
    app.add_config_value('spelling_filters', [], 'env')
    # Register the 'spelling' directive for setting parameters within a document
    rst.directives.register_directive('spelling', SpellingDirective)
    return

The builder class is derived from sphinx.builders.Builder. The
important method is write_doc(), which processes the parsed
documents and saves the messages with unknown words to the output
file.

def write_doc(self, docname, doctree):
    self.checker.push_filters(self.env.spelling_document_filters[docname])

    for node in doctree.traverse(docutils.nodes.Text):
        if node.tagname == '#text' and node.parent.tagname in TEXT_NODES:

            # Figure out the line number for this node by climbing the
            # tree until we find a node that has a line number.
            lineno = None
            parent = node
            seen = set()
            while lineno is None:
                #self.info('looking for line number on %r' % node)
                seen.add(parent)
                parent = parent.parent
                if parent is None or parent in seen:
                    break
                lineno = parent.line
            filename = self.env.doc2path(docname, base=None)

            # Check the text of the node.
            for word, suggestions in self.checker.check(node.astext()):
                msg_parts = []
                if lineno:
                    msg_parts.append(darkgreen('(line %3d)' % lineno))
                msg_parts.append(red(word))
                msg_parts.append(self.format_suggestions(suggestions))
                msg = ' '.join(msg_parts)
                self.info(msg)
                self.output.write(u"%s:%s: (%s) %s\n" % (
                        self.env.doc2path(docname, None),
                        lineno, word,
                        self.format_suggestions(suggestions),
                        ))

                # We found at least one bad spelling, so set the status
                # code for the app to a value that indicates an error.
                self.app.statuscode = 1

    self.checker.pop_filters()
    return

The builder traverses all of the text nodes, skipping over formatting
nodes and container nodes that contain no text. Each node is converted
to plain text using its astext() method, and the text is given to
the SpellingChecker to be parsed and checked.

class SpellingChecker(object):
    """Checks the spelling of blocks of text.

    Uses options defined in the sphinx configuration file to control
    the checking and filtering behavior.
    """

    def __init__(self, lang, suggest, word_list_filename, filters=[]):
        self.dictionary = enchant.DictWithPWL(lang, word_list_filename)
        self.tokenizer = get_tokenizer(lang, filters)
        self.original_tokenizer = self.tokenizer
        self.suggest = suggest

    def push_filters(self, new_filters):
        """Add a filter to the tokenizer chain.
        """
        t = self.tokenizer
        for f in new_filters:
            t = f(t)
        self.tokenizer = t

    def pop_filters(self):
        """Remove the filters pushed during the last call to push_filters().
        """
        self.tokenizer = self.original_tokenizer

    def check(self, text):
        """Generator function that yields bad words and suggested alternate spellings.
        """
        for word, pos in self.tokenizer(text):
            correct = self.dictionary.check(word)
            if correct:
                continue
            yield word, self.dictionary.suggest(word) if self.suggest else []
        return

Finding Words in the Input Text

The blocks of text from the nodes are parsed using a language-specific
tokenizer provided by PyEnchant. The text is split into words, and
then each word is passed through a series of filters. The API defined
by enchant.tokenize.Filter supports two behaviors. Based on the
return value from _skip(), the word might be ignored entirely and
never returned by the tokenizer. Alternatively, the _split()
method can return a modified version of the text.

In addition to the filters for email addresses and “wiki words”
provided by PyEnchant, sphinxcontrib-spelling includes several
others. The AcronymFilter tells the tokenizer to skip words that
use all uppercase letters.

class AcronymFilter(Filter):
    """If a word looks like an acronym (all upper case letters),
    ignore it.
    """
    def _skip(self, word):
        return (word == word.upper() # all caps
                or
                # pluralized acronym ("URLs")
                (word[-1].lower() == 's'
                 and
                 word[:-1] == word[:-1].upper()
                 )
                )

The ContractionFilter expands common English contractions
that might appear in less formal blog posts. The small
list_tokenize helper class returns the replacement words one at a
time, in the (word, position) form the tokenizer API expects.

class list_tokenize(tokenize):
    def __init__(self, words):
        tokenize.__init__(self, '')
        self._words = words
    def next(self):
        if not self._words:
            raise StopIteration()
        word = self._words.pop(0)
        return (word, 0)

class ContractionFilter(Filter):
    """Strip common contractions from words.
    """
    splits = {
        "won't":['will', 'not'],
        "isn't":['is', 'not'],
        "can't":['can', 'not'],
        "i'm":['I', 'am'],
        }
    def _split(self, word):
        # Fixed responses
        if word.lower() in self.splits:
            return list_tokenize(self.splits[word.lower()])

        # Possessive
        if word.lower().endswith("'s"):
            return unit_tokenize(word[:-2])

        # * not
        if word.lower().endswith("n't"):
            return unit_tokenize(word[:-3])

        return unit_tokenize(word)

Because I write about Python a lot, I tend to use the names of
projects that appear on the Python Package Index
(PyPI). PyPIFilterFactory fetches a list of the packages from
the index and then sets up a filter to ignore all of them.

class IgnoreWordsFilter(Filter):
    """Given a set of words, ignore them all.
    """
    def __init__(self, tokenizer, word_set):
        self.word_set = set(word_set)
        Filter.__init__(self, tokenizer)
    def _skip(self, word):
        return word in self.word_set

class IgnoreWordsFilterFactory(object):
    def __init__(self, words):
        self.words = words
    def __call__(self, tokenizer):
        return IgnoreWordsFilter(tokenizer, self.words)

class PyPIFilterFactory(IgnoreWordsFilterFactory):
    """Build an IgnoreWordsFilter for all of the names of packages on PyPI.
    """
    def __init__(self):
        client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
        IgnoreWordsFilterFactory.__init__(self, client.list_packages())

PythonBuiltinsFilter ignores the names of functions and other
symbols built into the Python interpreter.

class PythonBuiltinsFilter(Filter):
    """Ignore names of built-in Python symbols.
    """
    def _skip(self, word):
        return word in __builtins__

Finally, ImportableModuleFilter ignores words that match the
names of modules found on the import path. It uses imp to search
for the module without actually importing it.

class ImportableModuleFilter(Filter):
    """Ignore names of modules that we could import.
    """
    def __init__(self, tokenizer):
        Filter.__init__(self, tokenizer)
        self.found_modules = set()
        self.sought_modules = set()
    def _skip(self, word):
        if word not in self.sought_modules:
            self.sought_modules.add(word)
            try:
                imp.find_module(word)
            except UnicodeEncodeError:
                return False
            except ImportError:
                return False
            else:
                self.found_modules.add(word)
                return True
        return word in self.found_modules

The SpellingBuilder creates the filter stack based on user
settings, so the filters can be turned on or off.

filters = [ ContractionFilter,
            EmailFilter,
            ]
if self.config.spelling_ignore_wiki_words:
    filters.append(WikiWordFilter)
if self.config.spelling_ignore_acronyms:
    filters.append(AcronymFilter)
if self.config.spelling_ignore_pypi_package_names:
    self.info('Adding package names from PyPI to local spelling dictionary...')
    filters.append(PyPIFilterFactory())
if self.config.spelling_ignore_python_builtins:
    filters.append(PythonBuiltinsFilter)
if self.config.spelling_ignore_importable_modules:
    filters.append(ImportableModuleFilter)
filters.extend(self.config.spelling_filters)

Using the Spelling Checker

PyEnchant and sphinxcontrib-spelling should be installed on the
import path for the same version of Python that Sphinx is using (refer
to the *project home page* for more details). Then the extension
needs to be explicitly enabled for a Sphinx project in order for the
builder to be recognized. To enable the extension, add it to the list
of extensions in conf.py.

extensions = [ 'sphinxcontrib.spelling' ]

The other options can be set in conf.py, as well. For example, to
turn on the filter to ignore the names of packages from PyPI, set
spelling_ignore_pypi_package_names to True.

spelling_ignore_pypi_package_names = True

Because the spelling checker is integrated with Sphinx using a new
builder class, it is not run when the HTML or LaTeX builders
run. Instead, it needs to run as a separate phase of the build by
passing -b spelling to sphinx-build. The output shows each
document name as it is processed, and if there are any errors the line
number and misspelled word are shown. When
spelling_show_suggestions is True, proposed corrections are
included in the output.

$ sphinx-build -b spelling -d build/doctrees source build/spelling
...
writing output... [ 31%] articles/how-tos/sphinxcontrib-spelling/index
(line 255) mispelling ["misspelling", "dispelling", "mi spelling",
"spelling", "compelling", "impelling", "rappelling"]
...

See Also

PyEnchant
Python interface to enchant.
*sphinxcontrib-spelling*
Project home page for the spelling checker.
sphinxcontrib
BitBucket repository for sphinxcontrib-spelling and several other
Sphinx extensions.
Sphinx Extension API
Describes methods for extending Sphinx.
*Defining Custom Roles in Sphinx*
Describes another way to extend Sphinx by modifying the
reStructuredText syntax.