PyATL meetup Oct. 11th

The Python Atlanta Meetup group meets tomorrow night at Turner, on
Techwood Drive. This month’s theme is “Zope Related Technologies”.
Here’s the schedule:

Oct. 11th Schedule: Round Table Discussion, Lightning Talks, Main Presentation

7:15-7:30 Meet at Turner Lobby.
7:30-7:45 Opening Remarks and setup.
7:45-8:25 20 Minute Interactive discussion Atlanta Plone and/or Derek Richardson
8:35-8:40 5 Minute Break
8:40-9:00 20 Minute Main Presentation: Drew Smathers, Zope 3
9:00-? General Discussion, Coding Sessions

PyMOTW: difflib

The difflib module contains several classes for comparing sequences,
especially of lines of text from files, and manipulating the results.
The SequenceMatcher class compares any two sequences of values, as
long as the values are hashable. It uses a recursive algorithm to
identify the longest contiguous matching blocks from the sequences,
eliminating “junk” values. The Differ class works on sequences of
text lines and produces human-readable deltas, including differences
within individual lines. The HtmlDiff class produces similar
results formatted as an HTML table.
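As a rough illustration of the two classes described above, here is a minimal sketch. It is written for Python 3, and the sample strings and line lists are invented for the example.

```python
import difflib

# SequenceMatcher accepts any two sequences of hashable items; plain
# strings are used here. The first argument (None) means "no junk filter".
matcher = difflib.SequenceMatcher(None, "abxcd", "abcd")
longest = matcher.find_longest_match(0, 5, 0, 4)

# Differ works on lists of text lines. Each output line is prefixed with
# '  ' (unchanged), '- ' (only in the first), '+ ' (only in the second),
# or '? ' (a hint about differences inside a line).
before = ["alpha\n", "beta\n", "gamma\n"]
after = ["alpha\n", "beta!\n", "gamma\n", "delta\n"]
delta = list(difflib.Differ().compare(before, after))

print(longest)
print("".join(delta), end="")
```

The same `before` and `after` lists could be passed to `difflib.HtmlDiff().make_table()` to get the HTML table form instead.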

Read more at pymotw.com: difflib

Python Magazine: First issue free!

The premier issue of Python Magazine is available for download
right now, completely free.

I haven’t mentioned it previously on this blog, but I’m the Technical
Editor for the magazine. That means I review and test the submitted
code, and write a monthly column. The column runs under the title “And
Now for Something Completely Different” and will focus on technical
topics (this month I talk about the GIL and 2 packages for managing
processes).

Other regular columnists include Brian Jones (Editor in Chief), Mark
Mruss (“Welcome to Python”, targeted at newer users or introductory
topics), and Steve Holden (“Random Hits”, the end-note editorial).

In addition to the regular columns, there are 4 feature articles this
month:

  1. John Berninger covers Extending Python using C, without using a
    binding generator. He’s Old School.
  2. Kevin Ryan introduces form processing in WSGI with some clever
    data-driven techniques using lambda.
  3. Sayamindu Dasgupta writes a PyGTK widget using Cairo primitives to
    draw the widget view.
  4. And I discuss a fun hack I came up with to pull iCalendar data out
    of an IMAP server.

I’m really excited about how this issue has turned out (Arbi
Arzoumani did a great job with the design and layout), and hope you
like it, too. Head over to http://pythonmagazine.com/c/issue/2007/10 and
download the PDF version. If you do like it, subscribe! If you think you
could do better, submit an idea for an article and write for us!

Besides soliciting articles from you, I’ll always be on the look-out
for good ideas to cover in my own column. If there is something you
want me to cover, email me directly (doug dot hellmann at
pythonmagazine dot com) or tag a link to a site or blog post
with pymagdifferent on del.icio.us.

Working with IMAP and iCalendar

How can you access group calendar information if your
Exchange-like mail and calendaring server does not provide
iCalendar feeds, and you do not, or cannot, use Outlook? Use
Python to extract the calendar data and generate your own feed, of
course! This article discusses a surprisingly simple program to
perform what seems like a complex series of operations: scanning
IMAP folders, extracting iCalendar attachments, and merging the
contained events together into a single calendar.

Background

I recently needed to access shared schedule information stored on an
Exchange-like mail and calendaring server. Luckily, I was able to
combine an existing third party open source library with the tools in
the Python standard library to create a command line program to
convert the calendar data into a format I could use with my desktop
client directly. The final product is called mailbox2ics. It ended up
being far shorter than I had anticipated when I started thinking about
how to accomplish my goal. The entire program is just under 140 lines
long, including command line switch handling, some error processing,
and debug statements. The output file produced can be consumed by any
scheduling client that supports the iCalendar standard.

Using Exchange, or a compatible replacement, for email and scheduling
makes sense for many businesses and organizations. The client program,
Microsoft Outlook, is usually familiar to non-technical staff members,
and therefore new hires can hit the ground running instead of being
stymied trying to figure out how to accomplish their basic, everyday
communication tasks. However, my laptop runs Mac OS X and I do not
have Outlook. Purchasing a copy of Outlook at my own expense, in
addition to inflicting further software bloat on my already crowded
computer, seemed like an unnecessarily burdensome hassle just to be
able to access schedule information.

Changing the server software was not an option. A majority of the
users already had Outlook and were accustomed to using it for their
scheduling, and I did not want to have to support a different server
platform. That left me with one option: invent a way to pull the data
out of the existing server, so I could convert it to a format that I
could use with my usual tools: Apple’s iCal and Mail.

With iCal (and many other standards-compliant calendar tools) it is
possible to subscribe to calendar data feeds. Unfortunately, the
server we were using did not have the ability to export the schedule
data in a standard format using a single file or URL. However, the
server did provide access to the calendar data via IMAP using shared
public folders. I decided to use Python to write a program to extract
the data from the server and convert it into a usable feed. The feed
would be passed to iCal, which would merge the group schedule with the
rest of my calendar information so I could see the group events
alongside my other meetings, deadlines, and reminders about when the
recycling is picked up on our street.

IMAP Basics

The calendar data was only accessible to me as attachments on email
messages accessed via an IMAP server. The messages were grouped into
several folders, with each folder representing a separate public
calendar used for a different purpose (meeting room schedules, event
planning, holiday and vacation schedules, etc.). I had read-only
access to all of the email messages in the public calendar
folders. Each email message typically had one attachment describing a
single event. To produce the merged calendar, I needed to scan several
folders, read each message in the folder, find and parse the calendar
data in the attachments, and identify the calendar events. Once I
identified the events to include in the output, I needed to add them
to an output file in a format iCal understands.

Python’s standard library includes the imaplib module for working
with IMAP servers. The IMAP4 and IMAP4_SSL classes provide a high
level interface to all of the features I needed: connecting to the
server securely, accessing mailboxes, finding messages, and
downloading them. To experiment with retrieving data from the IMAP
server, I started by establishing a secure connection to the server on
the standard port for IMAP-over-SSL, and logging in using my regular
account. This would not be a desirable way to run the final program on
a regular basis, but it works fine for development and testing.

mail_server = imaplib.IMAP4_SSL(hostname)
mail_server.login(username, password)

It is also possible to use IMAP over a non-standard port. In that
case, the caller can pass port as an additional option to
imaplib.IMAP4_SSL(). To work with an IMAP server without SSL
encryption, you can use the IMAP4 class, but using SSL is
definitely preferred.

mail_server = imaplib.IMAP4_SSL(hostname, port)
mail_server.login(username, password)

The connection to the IMAP server is “stateful”. The client remembers
which methods have been called on it, and changes its internal state
to reflect those calls. The internal state is used to detect logical
errors in the sequence of method calls without the round-trip to the
server.

On an IMAP server, messages are organized into “mailboxes”. Each
mailbox has a name and, since mailboxes might be nested, the full name
of the mailbox is the path to that mailbox. Mailbox paths work just
like paths to directories or folders in a filesystem. The paths are
single strings, with levels usually separated by a forward slash
(/) or period (.). The actual separator value used depends on
the configuration of your IMAP server; one of my servers uses a slash,
while the other uses a period. If you do not already know how your
server is set up, you will need to experiment to determine the correct
values for folder names.
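One way to experiment is to ask the server for its folder list with IMAP4.list() and read the separator out of the response lines. The sketch below is written for Python 3, where imaplib returns bytes, and parses lines of the usual `(flags) "delimiter" name` shape. The helper name parse_list_line and the sample lines are made up for illustration, and the pattern is deliberately simplified: it does not handle a NIL delimiter or mailbox names sent as literals.

```python
import re

# Simplified shape of one LIST response line: (flags) "delimiter" name
LIST_LINE = re.compile(
    rb'\((?P<flags>[^)]*)\) "(?P<delimiter>[^"]*)" (?P<name>.*)'
)


def parse_list_line(line):
    """Split one LIST response line into (flags, delimiter, mailbox name)."""
    match = LIST_LINE.match(line)
    if match is None:
        raise ValueError("unrecognized LIST response: %r" % (line,))
    flags = match.group("flags").decode("ascii").split()
    delimiter = match.group("delimiter").decode("ascii")
    name = match.group("name").decode("ascii").strip('"')
    return flags, delimiter, name


# Hypothetical response lines, one from a server using "/" and one from
# a server using "." as the hierarchy separator.
samples = [
    rb'(\HasNoChildren) "/" "Public Folders/EventCalendar"',
    rb'(\HasChildren) "." INBOX',
]
for line in samples:
    print(parse_list_line(line))
```

Against a live connection, the lines would come from something like `typ, data = mail_server.list()`, with each item of `data` passed to the helper.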

Once I had my client connected to the server, the next step was to
call select() to set the mailbox context to be used when searching
for and downloading messages.

mail_server.select('Public Folders/EventCalendar')
# or
mail_server.select('Public Folders.EventCalendar')

After a mailbox is selected, it is possible to retrieve messages from
the mailbox using search(). The IMAP method search() supports
filtering to identify only the messages you need. You can search for
messages based on the content of the message headers, with the rules
evaluated in the server instead of your client, thus reducing the
amount of information the server has to transmit to the client. Refer
to RFC 3501 (“Internet Message Access Protocol”) for details about the
types of queries which can be performed and the syntax for passing the
query arguments.

In order to implement mailbox2ics, I needed to look at all of the
messages in every mailbox the user named on the command line, so I
simply used the filter “ALL” with each mailbox. The return value
from search() includes a response code and a string with the
message numbers separated by spaces. A separate call is required to
retrieve more details about an individual message, such as the headers
or body.

(typ, [message_ids]) = mail_server.search(None, 'ALL')
message_ids = message_ids.split()

Individual messages are retrieved via fetch(). If only part of the
message is desired (size, envelope, body), that part can be fetched to
limit bandwidth. I could not predict which subset of the message body
might include the attachments I wanted, so it was simplest for me to
download the entire message. Calling fetch() with “(RFC822)” as the
message part returns the MIME-encoded version of the message, as a
string with all headers intact.

typ, message_parts = mail_server.fetch(
    message_ids[0], '(RFC822)')
message_body = message_parts[0][1]

Once the message body had been downloaded, the next step was to parse
it to find the attachments with calendar data. Beginning with version
2.2.3, the Python standard library has included the email package
for working with standards-compliant email messages. There is a
straightforward factory for converting message text to Message
objects. To parse the text representation of an email and create a
Message instance from it, use email.message_from_string().

msg = email.message_from_string(message_body)

Message objects are almost always made up of multiple parts. The parts
of the message are organized in a tree structure, with message
attachments supporting nested attachments. Subparts or attachments can
even include entire email messages, such as when you forward a message
which already contains an attachment to someone else. To iterate over
all of the parts of the Message tree recursively, use the walk()
method.

for part in msg.walk():
    print part.get_content_type()

Having access to the email package saved an enormous amount of time on
this project. Parsing multi-part email messages reliably is tricky,
even with (or perhaps because of) the many standards involved. With
the email package, in just a few lines of Python, you can parse and
traverse all of the parts of even the most complex standard-compliant
multi-part email message, giving you access to the type and content of
each part.
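To try the traversal without an IMAP server, a test message can be built locally and parsed back. This is a minimal sketch for Python 3; the subject line and the tiny calendar body are invented, and the message is assembled with the standard email.mime classes rather than downloaded.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a small invitation-style message: a plain-text note plus a
# text/calendar part. The content is made up for this example.
invitation = MIMEMultipart()
invitation["Subject"] = "Day off"
invitation.attach(MIMEText("Calendar entry attached.\n", "plain"))
invitation.attach(MIMEText("BEGIN:VCALENDAR\nEND:VCALENDAR\n", "calendar"))

# Round-trip through text, as if the message had come back from fetch().
msg = message_from_string(invitation.as_string())

# walk() visits the container first, then each part in order.
content_types = [part.get_content_type() for part in msg.walk()]

# Keep only the decoded payloads of the calendar parts.
calendar_parts = [
    part.get_payload(decode=True)
    for part in msg.walk()
    if part.get_content_type() == "text/calendar"
]

print(content_types)
print(calendar_parts[0])
```

On Python 3, imaplib hands back bytes rather than str, so email.message_from_bytes() would take the place of message_from_string() when parsing a fetched message.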

Accessing Calendar Data

The “Internet Calendaring and Scheduling Core Object Specification”,
or iCalendar, is defined in RFC 2445. iCalendar is a data format
for sharing scheduling and other date-oriented information. One
typical way to receive an iCalendar event notification, such as an
invitation to a meeting, is via an email attachment. Most standard
calendaring tools, such as iCal and Outlook, generate these email
messages when you initially “invite” another participant to a meeting,
or update an existing meeting description. The iCalendar standard says
the file should have filename extension ICS and mime-type
text/calendar. The input data for mailbox2ics came from email
attachments of this type.

The iCalendar format is text-based. A simple example of an ICS file
with a single event is provided in Listing 1. Calendar events have
properties to indicate who was invited to an event, who originated it,
where and when it will be held, and all of the other expected bits of
information important for a scheduled event. Each property of the
event is encoded on its own line, with long values wrapped onto
multiple lines in a well-defined way to allow the original content to
be reconstructed by a client receiving the iCalendar representation of
the data. Some properties also can be repeated, to handle cases such
as meetings with multiple invitees.

Listing 1

BEGIN:VCALENDAR
CALSCALE:GREGORIAN
PRODID:-//Big Calendar Corp//Server Version X.Y.Z//EN
VERSION:2.0
METHOD:PUBLISH
BEGIN:VEVENT
UID:20379258.1177945519186.JavaMail.root@imap.example.com
LAST-MODIFIED:20070519T000650Z
DTSTAMP:20070519T000650Z
DTSTART;VALUE=DATE:20070508
DTEND;VALUE=DATE:20070509
PRIORITY:5
TRANSP:OPAQUE
SEQUENCE:0
SUMMARY:Day off
LOCATION:
CLASS:PUBLIC
END:VEVENT
END:VCALENDAR
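The line wrapping mentioned above works by starting each continuation line with a single space or tab, which the reader removes along with the preceding line break. The sketch below (Python 3) shows that unfolding step and a naive split of a property line into name, parameters, and value. Both helper names are hypothetical, and the splitting does not handle quoted parameter values that contain a colon or semicolon; a real parser needs more care.

```python
def unfold(ics_text):
    """Join folded iCalendar content lines back into logical lines."""
    logical_lines = []
    for raw_line in ics_text.splitlines():
        if raw_line[:1] in (" ", "\t") and logical_lines:
            # Continuation line: drop the single leading whitespace
            # character and append the rest to the previous line.
            logical_lines[-1] += raw_line[1:]
        else:
            logical_lines.append(raw_line)
    return logical_lines


def split_property(line):
    """Naively split 'NAME;PARAM=VALUE:value' into (name, params, value)."""
    head, _, value = line.partition(":")
    name, *params = head.split(";")
    return name.upper(), params, value


# A made-up event with one folded SUMMARY line.
folded = (
    "BEGIN:VEVENT\r\n"
    "SUMMARY:In the office to work with Bob on the project prop\r\n"
    " osal\r\n"
    "DTSTART;VALUE=DATE:20070508\r\n"
    "END:VEVENT\r\n"
)

lines = unfold(folded)
for line in lines:
    print(split_property(line))
```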

In addition to having a variety of single or multi-value properties,
calendar elements can be nested, much like email messages with
attachments. An ICS file is made up of a VCALENDAR component,
which usually includes one or more VEVENT components. A
VCALENDAR might also include VTODO components (for tasks on a
to-do list). A VEVENT may contain a VALARM, which specifies
the time and means by which the user should be reminded of the event.
The complete description of the iCalendar format, including valid
component types and property names, and the types of values which are
legal for each property, is available in the RFC.

This sounds complex, but luckily, I did not have to worry about
parsing the ICS data at all. Instead of doing the work myself, I took
advantage of an open source Python library for working with iCalendar
data released by Max M. (maxm@mxm.dk). His iCalendar library
(available from codespeak.net) makes parsing ICS data sources very
simple. The API for the library was designed based on the email
package discussed previously, so working with Calendar instances and
email.Message instances is similar. Use the class method
Calendar.from_string() to parse the text representation of the
calendar data to create a Calendar instance populated with all of the
properties and subcomponents described in the input data.

from icalendar import Calendar, Event
cal_data = Calendar.from_string(open('sample.ics', 'rb').read())

Once you have instantiated the Calendar object, there are two
different ways to iterate through its components: via the walk()
method or subcomponents attribute. Using walk() will traverse
the entire tree and let you process each component in the tree
individually. Accessing the subcomponents list directly lets you
work with a larger portion of the calendar data tree at one time.
Properties of an individual component, such as the summary or start
date, are accessed via the __getitem__() API, just as with a
standard Python dictionary. The property names are not case sensitive.

For example, to print the “SUMMARY” field values from all top level
events in a calendar, you would first iterate over the subcomponents,
then check the name attribute to determine the component type. If
the type is VEVENT, then the summary can be accessed and printed.

for event in cal_data.subcomponents:
    if event.name == 'VEVENT':
        print 'EVENT:', event['SUMMARY']

While most of the ICS attachments in my input data would be made up of
one VCALENDAR component with one VEVENT subcomponent, I did
not want to require this limitation. The calendars are writable by
anyone in the organization, so while it was unlikely that anyone would
have added a VTODO or VJOURNAL to public data, I could not
count on it. Checking for VEVENT as I scanned each component let
me ignore components with types that I did not want to include in the
output.

Writing ICS data to a file is as simple as reading it, and only takes
a few lines of code. The Calendar class handles the difficult tasks of
encoding and formatting the data as needed to produce a fully
formatted ICS representation, so I only needed to write the formatted
text to a file.

ics_output = open('output.ics', 'wb')
try:
    ics_output.write(str(cal_data))
finally:
    ics_output.close()

Finding Max M’s iCalendar library saved me a lot of time and effort,
and demonstrates clearly the value of Python and open source in
general. The API is concise and, since it is patterned off of another
library I was already using, the idioms were familiar. I had not
embarked on this project eager to write parsers for the input data, so
I was glad to have libraries available to do that part of the work for
me.

Putting It All Together

At this point, I had enough pieces to build a program to do what I
needed. I could read the email messages from the server via IMAP,
parse each message, and then search through its attachments to find
the ICS attachments. Once I had the attachments, I could parse them
and produce another ICS file to be imported into my calendar client.
All that remained was to tie the pieces together and give it a user
interface. The source for the resulting program, mailbox2ics.py,
is provided in Listing 2.

Listing 2

#!/usr/bin/env python
# mailbox2ics.py

"""Convert the contents of an imap mailbox to an ICS file.

This program scans an IMAP mailbox, reads in any messages with ICS
files attached, and merges them into a single ICS file as output.
"""

# Import system modules
import imaplib
import email
import getpass
import optparse
import sys

# Import Local modules
from icalendar import Calendar, Event

# Module

def main():
    # Set up our options
    option_parser = optparse.OptionParser(
        usage='usage: %prog [options] hostname username mailbox [mailbox...]'
        )
    option_parser.add_option('-p', '--password', dest='password',
                             default='',
                             help='Password for username',
                             )
    option_parser.add_option('--port', dest='port',
                             help='Port for IMAP server',
                             type="int",
                             )
    option_parser.add_option('-v', '--verbose', 
                             dest="verbose", 
                             action="store_true", 
                             default=True,
                             help='Show progress',
                             )
    option_parser.add_option('-q', '--quiet', 
                             dest="verbose", 
                             action="store_false", 
                             help='Do not show progress',
                             )
    option_parser.add_option('-o', '--output', dest="output",
                             help="Output file",
                             default=None,
                             )

    (options, args) = option_parser.parse_args()
    if len(args) < 3:
        option_parser.print_help()
        print >>sys.stderr, '\nERROR: Please specify a username, hostname, and mailbox.'
        return 1
    hostname = args[0]
    username = args[1]
    mailboxes = args[2:]

    # Make sure we have the credentials to login to the IMAP server.
    password = options.password or getpass.getpass(stream=sys.stderr)

    # Initialize a calendar to hold the merged data
    merged_calendar = Calendar()
    merged_calendar.add('prodid', '-//mailbox2ics//doughellmann.com//')
    merged_calendar.add('calscale', 'GREGORIAN')

    if options.verbose:
        print >>sys.stderr, 'Logging in to "%s" as %s' % (hostname, username)

    # Connect to the mail server
    if options.port is not None:
        mail_server = imaplib.IMAP4_SSL(hostname, options.port)
    else:
        mail_server = imaplib.IMAP4_SSL(hostname)
    (typ, [login_response]) = mail_server.login(username, password)
    try:
        # Process the mailboxes
        for mailbox in mailboxes:
            if options.verbose: print >>sys.stderr, 'Scanning %s ...' % mailbox
            (typ, [num_messages]) = mail_server.select(mailbox)
            if typ == 'NO':
                raise RuntimeError('Could not find mailbox %s: %s' % 
                                   (mailbox, num_messages))
            num_messages = int(num_messages)
            if not num_messages:
                if options.verbose: print >>sys.stderr, '  empty'
                continue

            # Find all messages
            (typ, [message_ids]) = mail_server.search(None, 'ALL')
            for num in message_ids.split():

                # Get a Message object
                typ, message_parts = mail_server.fetch(num, '(RFC822)')
                msg = email.message_from_string(message_parts[0][1])

                # Look for calendar attachments
                for part in msg.walk():
                    if part.get_content_type() == 'text/calendar':
                        # Parse the calendar attachment
                        ics_text = part.get_payload(decode=1)
                        importing = Calendar.from_string(ics_text)

                        # Add events from the calendar to our merge calendar
                        for event in importing.subcomponents:
                            if event.name != 'VEVENT':
                                continue
                            if options.verbose: 
                                print >>sys.stderr, 'Found: %s' % event['SUMMARY']
                            merged_calendar.add_component(event)
    finally:
        # Disconnect from the IMAP server
        if mail_server.state != 'AUTH':
            mail_server.close()
        mail_server.logout()

    # Dump the merged calendar to our output destination
    if options.output:
        output = open(options.output, 'wt')
        try:
            output.write(str(merged_calendar))
        finally:
            output.close()
    else:
        print str(merged_calendar)
    return 0

if __name__ == '__main__':
    try:
        exit_code = main()
    except Exception, err:
        print >>sys.stderr, 'ERROR: %s' % str(err)
        exit_code = 1
    sys.exit(exit_code)

Since I wanted to set up the export job to run on a regular basis via
cron, I chose a command line interface. The main() function for
mailbox2ics.py starts out with the usual sort of
configuration for command line option processing via the optparse
module. Listing 3 shows the help output produced when the program is
run with the -h option.

Listing 3

Usage: mailbox2ics.py [options] hostname username mailbox [mailbox...]

Options:
  -h, --help            show this help message and exit
  -p PASSWORD, --password=PASSWORD
                        Password for username
  --port=PORT           Port for IMAP server
  -v, --verbose         Show progress
  -q, --quiet           Do not show progress
  -o OUTPUT, --output=OUTPUT
                        Output file

The --password option can be used to specify the IMAP account
password on the command line, but if you choose to use it consider the
security implications of embedding a password in the command line for
a cron task or shell script. No matter how you specify the password, I
recommend creating a separate mailbox2ics account on the IMAP server
and limiting the rights it has so no data can be created or deleted
and only public folders can be accessed. If --password is not
specified on the command line, the user is prompted for a password
when they run the program. While less useful with cron, providing the
password interactively can be a solution if you are unable, or not
allowed, to create a separate restricted account on the IMAP server.
The account name used to connect to the server is required on the
command line.

There is also a separate option for writing the ICS output data to a
file. The default is to print the sequence of events to standard
output in ICS format. Though it is easy enough to redirect standard
output to a file, the -o option can be useful if you are using the
-v option to enable verbose progress tracking and debugging.

The program uses a separate Calendar instance, merged_calendar, to
hold all of the ICS information to be included in the output. All of
the VEVENT components from the input are copied to merged_calendar
in memory, and the entire calendar is written to the output location
at the end of the program. After initialization,
merged_calendar is configured with some basic properties. PRODID
is required and specifies the name of the product which produced the
ICS file. CALSCALE defines the date system, or scale, used for the
calendar.

After setting up merged_calendar, mailbox2ics connects to the IMAP
server. It tests whether the user has specified a network port using
--port and only passes a port number to imaplib if the user
includes the option. The optparse library converts the option value to
an integer based on the option configuration, so options.port is
either an integer or None.
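The integer-or-None behavior is easy to check in isolation. The sketch below (Python 3, where optparse is still available) parses only the --port option; the helper names parse_port and connection_args are hypothetical and exist only for this illustration.

```python
import optparse


def parse_port(argv):
    """Return the --port value from argv as an int, or None if absent."""
    parser = optparse.OptionParser()
    parser.add_option("--port", dest="port", type="int",
                      help="Port for IMAP server")
    options, _args = parser.parse_args(argv)
    return options.port


def connection_args(hostname, port=None):
    """Build positional arguments for an IMAP4_SSL-style constructor."""
    return (hostname,) if port is None else (hostname, port)


print(parse_port([]))                 # None
print(parse_port(["--port", "993"]))  # 993
print(connection_args("imap.example.com", parse_port([])))
```

Building the argument tuple this way would let the connection be made with a single call such as `imaplib.IMAP4_SSL(*connection_args(hostname, options.port))`, as an alternative to the if/else in the listing.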

The names of all mailboxes to be scanned are passed as arguments to
mailbox2ics on the command line after the rest of the option
switches. Each mailbox name is processed one at a time, in the for
loop over the mailbox names. After calling select() to change the
IMAP context, the message ids of all of the messages in the mailbox
are retrieved via a call to search(). The full content of each
message in the mailbox is fetched in turn, and parsed with
email.message_from_string(). Once the message has been parsed, the
msg variable refers to an instance of email.Message.

Each message may have multiple parts containing different MIME
encodings of the same data, as well as any additional message
information or attachments included in the email which generated the
event. For event notification messages, there is typically at least
one human-readable representation of the event and frequently both
HTML and plain text are included. Of course, the message also includes
the actual ICS file, as well. For my purposes, only the ICS
attachments were important, but there is no way to predict where they
will appear in the sequence of attachments on the email message. To
find the ICS attachments, mailbox2ics walks through all of the parts
of the message recursively looking for attachments with mime-type
text/calendar (as specified in the iCalendar standard) and
ignoring everything else. Attachment names are ignored, since
mime-type is a more reliable way to identify the calendar data
accurately.

for part in msg.walk():
    if part.get_content_type() == 'text/calendar':
        # Parse the calendar attachment
        ics_text = part.get_payload(decode=1)
        importing = Calendar.from_string(ics_text)

When it finds an ICS attachment, mailbox2ics parses the text of the
attachment to create a new Calendar instance, then copies the
VEVENT components from the parsed Calendar to merged_calendar.
The events do not need to be sorted into any particular order when
they are added to merged_calendar, since the client reading the
ICS file will filter and reorder them as necessary to display them
on screen. It was important to take the entire event, including any
subcomponents, to ensure that all alarms are included. Instead of
traversing the entire calendar and accessing each component
individually, I simply iterated over the subcomponents of the
top-level VCALENDAR node. Most of the ICS files only included one
VEVENT anyway, but I did not want to miss anything important if
that ever turned out not to be the case.

for event in importing.subcomponents:
    if event.name != 'VEVENT':
        continue
    merged_calendar.add_component(event)

Once all of the mailboxes, messages, and calendars are processed, the
merged_calendar refers to a Calendar instance containing all of
the events discovered. The last step in the process is for
mailbox2ics to create the output. The event data is
formatted using str(merged_calendar), just as in the example
above, and written to the output destination selected by the user
(standard output or file).

Example

Listing 4 includes sample output from running mailbox2ics to merge two
calendars for a couple of telecommuting workers, Alice and Bob. Both
Alice and Bob have placed their calendars online at imap.example.com.
In the output of mailbox2ics, you can see that Alice has 2 events in
her calendar indicating the days when she will be in the office. Bob
has one event for the day he has a meeting scheduled with Alice.

Listing 4

$ mailbox2ics.py -o group_schedule.ics imap.example.com mailbox2ics  "Calendars.Alice" "Calendars.Bob"
Password: 
Logging in to "imap.example.com" as mailbox2ics
Scanning Calendars.Alice ...
Found: In the office to work with Bob on project proposal
Found: In the office
Scanning Calendars.Bob ...
Found: In the office to work with Alice on project proposal

The output file created by mailbox2ics containing the merged calendar
data from Alice and Bob’s calendars is shown in Listing 5. You can see
that it includes all 3 events as VEVENT components nested inside a
single VCALENDAR. There were no alarms or other types of
components in the input data.

Listing 5

BEGIN:VCALENDAR
CALSCALE:GREGORIAN
PRODID:-//mailbox2ics//doughellmann.com//
BEGIN:VEVENT
CLASS:PUBLIC
DTEND;VALUE=DATE:20070704
DTSTAMP:20070705T180246Z
DTSTART;VALUE=DATE:20070703
LAST-MODIFIED:20070705T180246Z
LOCATION:
PRIORITY:5
SEQUENCE:0
SUMMARY:In the office to work with Bob on project proposal
TRANSP:TRANSPARENT
UID:9628812.1182888943029.JavaMail.root@imap.example.com
END:VEVENT
BEGIN:VEVENT
CLASS:PUBLIC
DTEND;VALUE=DATE:20070627
DTSTAMP:20070625T154856Z
DTSTART;VALUE=DATE:20070626
LAST-MODIFIED:20070625T154856Z
LOCATION:Atlanta
PRIORITY:5
SEQUENCE:0
SUMMARY:In the office
TRANSP:TRANSPARENT
UID:11588018.1182542267385.JavaMail.root@imap.example.com
END:VEVENT
BEGIN:VEVENT
CLASS:PUBLIC
DTEND;VALUE=DATE:20070704
DTSTAMP:20070705T180246Z
DTSTART;VALUE=DATE:20070703
LAST-MODIFIED:20070705T180246Z
LOCATION:
PRIORITY:5
SEQUENCE:0
SUMMARY:In the office to work with Alice on project proposal
TRANSP:TRANSPARENT
UID:9628812.1182888943029.JavaMail.root@imap.example.com
END:VEVENT
END:VCALENDAR

Mailbox2ics In Production

To solve my original problem of merging the events into a sharable
calendar to which I could subscribe in iCal, I scheduled mailbox2ics
to run regularly via cron. With some experimentation, I found that
running it every 10 minutes caught most of the updates quickly enough
for my needs. The program runs locally on a web server which has
access to the IMAP server. For better security, it connects to the
IMAP server as a user with restricted permissions. The ICS output
file produced is written to a directory accessible to the web server
software. This lets me serve the ICS file as static content on the web
server to multiple subscribers. Access to the file through the web is
protected by a password, to prevent unauthorized access.

Thoughts About Future Enhancements

Mailbox2ics does everything I need it to do, for now. There are a few
obvious areas where it could be enhanced to make it more generally
useful to other users with different needs, though. Input and output
filtering for events could be added. Incremental update support would
help it scale to manage larger calendars. Handling non-event data in
the calendar could also prove useful. And using a configuration file
to hold the IMAP password would be more secure than passing it on the
command line.

At the time of this writing, mailbox2ics does not offer any way to
filter the input or output data other than by controlling which
mailboxes are scanned. Adding finer-grained filtering support could
be useful. The input data could be filtered at two different points,
based on IMAP rules or the content of the calendar entries themselves.

IMAP filter rules (based on sender, recipient, subject line, message
contents, or other headers) would use the capabilities of
IMAP4.search() and the IMAP server without much effort on my part.
All that would be needed are a few command line options to pass the
filtering rules, or code to read a configuration file. The only
difference in the processing by mailbox2ics would be to convert the
input rules to the syntax understood by the IMAP server and pass them
to search().

Filtering based on VEVENT properties would require a little more
work. The event data must be downloaded and checked locally, since the
IMAP server will not look inside the attachments to check the
contents. Filtering using date ranges for the event start or stop date
could be very useful, and not hard to implement. The Calendar class
already converts dates to datetime instances. The datetime
module makes it easy to test dates against rules such as “events in
the next 7 days” or “events since Jan 1, 2007”.
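The two rules quoted above could be written along these lines. This is only a sketch in Python 3 syntax: the function names are invented for illustration, and it assumes the start value of each event has already been converted to a `date` or `datetime` instance.

```python
import datetime


def _as_date(value):
    """Reduce a datetime to a plain date so the two types compare cleanly."""
    if isinstance(value, datetime.datetime):
        return value.date()
    return value


def starts_within(event_start, days, today=None):
    """True if the event starts between today and `days` days from today."""
    today = today or datetime.date.today()
    start = _as_date(event_start)
    return today <= start <= today + datetime.timedelta(days=days)


def starts_since(event_start, earliest):
    """True if the event starts on or after the `earliest` date."""
    return _as_date(event_start) >= earliest
```

Passing `today` explicitly keeps the first rule easy to test; leaving it out uses the current date.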

Another simple addition would be pattern matching against other
property values such as the event summary, organizer, location, or
attendees. The patterns could be regular expressions, or a simpler
syntax such as globbing. The event properties, when present in the
input, are readily available through the __getitem__() API of the
Calendar instance and it would be simple to compare them against the
pattern(s).
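A glob-based version of that comparison might look like the sketch below. It relies only on `__getitem__()`, so a plain dictionary stands in for the parsed event here; the function name and the shape of the `patterns` argument are assumptions made for the example, written in Python 3 syntax.

```python
import fnmatch


def event_matches(event, patterns):
    """Return True if every named property matches its glob pattern.

    patterns maps a property name (such as 'SUMMARY') to a glob pattern.
    A property that is missing from the event counts as a non-match.
    """
    for name, pattern in patterns.items():
        try:
            value = event[name]
        except KeyError:
            return False
        # fnmatchcase() compares case-sensitively on every platform.
        if not fnmatch.fnmatchcase(str(value), pattern):
            return False
    return True
```

Swapping `fnmatch.fnmatchcase()` for `re.search()` would give the regular expression variant with the same structure.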

If a large amount of data is involved, either spread across several
calendars or because there are a lot of events, it might also be
useful to be able to update an existing cached file, rather than
building the whole ICS file from scratch each time. Looking only at
unread messages in the folder, for example, would let mailbox2ics skip
downloading old events that are no longer relevant or already appear
in the local ICS file. It could then initialize merged_calendar by
reading from the local file before updating it with new events and
re-writing the file. Caching some of the results in this way would
place less load on the IMAP server, so the export could easily be run
more frequently than once every 10 minutes.
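The merge step of such an incremental update could be as small as the sketch below. It assumes that events are identified by their UID property and that a newly downloaded copy should replace the cached one; both are assumptions of this example rather than documented mailbox2ics behavior, and dictionaries stand in for the parsed events.

```python
def merge_events(cached_events, new_events):
    """Return the cached events updated with the newly downloaded ones."""
    merged = {event['UID']: event for event in cached_events}
    for event in new_events:
        # A newer copy with the same UID replaces the cached copy.
        merged[event['UID']] = event
    return list(merged.values())
```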

In addition to filtering to reduce the information included in the
output, it might also prove useful to add extra information by
including component types other than VEVENT. For example,
including VTODO would allow users to include a group action list
in the group calendar. Most scheduling clients support filtering the
to-do items and alarms out of calendars to which you subscribe, so if
the values are included in a feed, individual users can always ignore
the ones they choose.

As mentioned earlier, using the --password option to provide the
password to the IMAP server is convenient, but not secure. For
example, on some systems it is possible to see the arguments to
programs using ps. This allows any user on the system to watch for
mailbox2ics to run and observe the password used. A more secure way to
provide the password is through a configuration file. The file can
have filesystem permissions set so that only the owner can access
it. It could also, potentially, be encrypted, though that might be
overkill for this type of program. It should not be necessary to run
mailbox2ics on a server where there is a high risk that the password
file might be exposed.
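Reading the password from a protected file could be sketched as follows, in Python 3 syntax. The `[imap]` section, the `password` key, and the function name are all hypothetical, since mailbox2ics has no configuration file format yet, and the permission check assumes a POSIX filesystem.

```python
import configparser
import os
import stat


def read_imap_password(path):
    """Return the IMAP password stored in an owner-only configuration file."""
    mode = os.stat(path).st_mode
    # Refuse to use a file that group members or other users can access.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            '%s must be accessible only by its owner' % path)
    # Disable interpolation so a "%" in the password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    return parser.get('imap', 'password')
```

A file created with `chmod 600` passes the check, and anything more permissive is rejected before the password is read.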

Conclusion

Mailbox2ics was a fun project that took me just a few hours over a
weekend to implement and test. This project illustrates two reasons
why I enjoy developing with Python. First, difficult tasks are made
easier through the power of the “batteries included” nature of
Python’s standard distribution. And second, coupling Python with the
wide array of other open source libraries available lets you get the
job done, even when the Python standard library lacks the exact tool
you need. Using the ICS file produced by mailbox2ics, I am now able to
access the calendar data I need using my familiar tools, even though
iCalendar is not supported directly by the group’s calendar server.

Originally published in Python Magazine Volume 1 Issue 10, October, 2007

Multi-processing techniques in Python

Originally published in Python Magazine Volume 1 Number 10, October,
2007

Has your multi-threaded application grown GILs? Take a look at these
packages for easy-to-use process management and inter-process
communication tools.

There is no predefined theme for this column, so I plan to cover a
different, likely unrelated, subject every month. The topics will
range anywhere from open source packages in the Python Package Index
(formerly The Cheese Shop, now PyPI) to new developments from around
the Python community, and anything that looks interesting in
between. If there is something you would like for me to cover, send a
note with the details to doug dot hellmann at
pythonmagazine dot com and let me know, or add the link to
your del.icio.us account with the tag “pymagdifferent”.

I will make one stipulation for my own sake: any open source libraries
must be registered with PyPI and configured so that I can install them
with distutils. Creating a login at PyPI and registering your
project is easy, and only takes a few minutes. Go on, you know you
want to.

Scaling Python: Threads vs. Processes

In the ongoing discussion of performance and scaling issues with
Python, one persistent theme is the Global Interpreter Lock
(GIL). While the GIL has the advantage of simplifying the
implementation of CPython internals and extension modules, it prevents
users from achieving true multi-threaded parallelism by limiting the
interpreter to executing byte-codes in one thread at a time on a
single processor. Threads which block on I/O or use extension modules
written in another language can release the GIL to allow other threads
to take over control, of course. But if my application is written
entirely in Python, only a limited number of statements will be
executed before one thread is suspended and another is started.

Eliminating the GIL has been on the wish lists of many Python
developers for a long time – I have been working with Python since
1998 and it was a hotly debated topic even then. Around that time,
Greg Stein produced a set of patches for Python 1.5 that eliminated
the GIL entirely, replacing it with a whole set of individual locks
for the mutable data structures (dictionaries, lists, etc.) that had
been protected by the GIL. The result was an interpreter that ran at
roughly half the normal speed, a side-effect of acquiring and
releasing the individual locks used to replace the GIL.

The GIL issue is unique to the C implementation of the
interpreter. The Java implementation of Python, Jython, supports true
threading by taking advantage of the underlying JVM. The IronPython
port, running on Microsoft’s CLR, also has better threading. On the
other hand, those platforms are always playing catch-up with new
language or library features, so if you’re hot to use the latest and
greatest, like I am, the C reference-implementation is still your best
option.

Dropping the GIL from the C implementation remains a low priority for
a variety of reasons. The scope of the changes involved is beyond the
level of anything the current developers are interested in
tackling. Recently, Guido has said he would entertain patches
contributed by the Python community to remove the GIL, as long as
performance of single-threaded applications was not adversely
affected. As far as I know, no one has announced any plans to do so.

Even though there is a FAQ entry on the subject as part of the
standard documentation set for Python, from time to time a request
pops up on comp.lang.python or one of the Python-related mailing lists
to rewrite the interpreter so the lock can be removed. Each time it
happens, the answer is clear: use processes instead of threads.

That response does have some merit. Extension modules become more
complicated without the safety of the GIL. Processes typically have
fewer inherent deadlocking issues than threads. They can be
distributed between the CPUs on a host, and even more importantly, an
application that uses multiple processes is not limited by the size of
a single server, as a multi-threaded application would be.

Since the GIL is still present in Python 3.0, it seems unlikely that
it will be removed from a future version any time soon. This may
disappoint some people, but it is not the end of the world. There are,
after all, strategies for working with multiple processes to scale
large applications. I’m not talking about the well-worn, established
techniques from the last millennium that use a different collection of
tools on every platform, nor the time-consuming and error-prone
practices that lead to solving the same problem time and
again. Techniques using low-level, operating system-specific
libraries for process management are as passé as using compiled
languages for CGI programming. I don’t have time for this low-level
stuff any more, and neither do you. Let’s look at some modern
alternatives.

The subprocess module

Version 2.4 of Python introduced the subprocess module and finally
unified the disparate process management interfaces available in other
standard library packages to provide cross-platform support for
creating new processes. While subprocess solved some of my process
creation problems, it still primarily relies on pipes for inter-process
communication. Pipes are workable, but fairly low-level as far as
communication channels go, and using them for two-way message passing
while avoiding I/O deadlocks can be tricky (don’t forget to flush()).
Passing data through pipes is definitely not as transparent to the
application developer as sharing objects natively between threads.
And pipes don’t help when the processes need to scale beyond a single
server.
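For a single request and reply, `Popen.communicate()` takes care of the pipe handling. The sketch below uses Python 3 syntax, and the helper name and the one-line child program are invented for the example. It sends one string to a child Python process and reads everything the child writes back.

```python
import subprocess
import sys


def ask_child(text):
    """Send text to a child Python process and return what it prints."""
    # The child reads all of stdin and writes it back in upper case.
    child_code = 'import sys; sys.stdout.write(sys.stdin.read().upper())'
    proc = subprocess.Popen(
        [sys.executable, '-c', child_code],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange str values instead of bytes
    )
    # communicate() writes the input, closes the child's stdin, and reads
    # until end-of-file, which avoids the usual pipe deadlocks.
    reply, _ = proc.communicate(text)
    return reply
```

This only covers a one-shot exchange. An ongoing conversation over the same pipes still needs the careful reads, writes, and flushes described above.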

Parallel Python

Vitalii Vanovschi’s Parallel Python package (pp) is a more complete
distributed processing package that takes a centralized approach.
Jobs are managed from a “job server”, and pushed out to individual
processing “nodes”.

Those worker nodes are separate processes, and can be running on the
same server or other servers accessible over the network. And when I
say that pp pushes jobs out to the processing nodes, I mean just that
– the code and data are both distributed from the central server to
the remote worker node when the job starts. I don’t even have to
install my application code on each machine that will run the jobs.

Here’s an example, taken right from the Parallel Python Quick Start
guide:

import pp
job_server = pp.Server()
# Start tasks
f1 = job_server.submit(func1, args1, depfuncs1,
    modules1)
f2 = job_server.submit(func1, args2, depfuncs1,
    modules1)
f3 = job_server.submit(func2, args3, depfuncs2,
    modules2)
# Retrieve the results
r1 = f1()
r2 = f2()
r3 = f3()

When the pp worker starts, it detects the number of CPUs in the system
and starts one process per CPU automatically, allowing me to take full
advantage of the computing resources available. Jobs are started
asynchronously, and run in parallel on an available node. The callable
object returned when the job is submitted blocks until the response is
ready, so response sets can be computed asynchronously, then merged
synchronously. Load distribution is transparent, making pp excellent
for clustered environments.

One drawback to using pp is that I have to do a little more work up
front to identify the functions and modules on which each job depends,
so all of the code can be sent to the processing node. That’s easy (or
at least straightforward) when all of the jobs are identical, or use a
consistent set of libraries. If I don’t know everything about the job
in advance, though, I’m stuck. It would be nice if pp could
automatically detect dependencies at runtime. Maybe it will, in a
future version.

The processing Package

Parallel Python is impressive, but it is not the only option for
managing parallel jobs. The processing package from Richard Oudkerk
aims to solve the issues of creating and communicating with multiple
processes in a portable, Pythonic way. Whereas Parallel Python is
designed around a “push” style distribution model, the processing
package is set up to make it easy to create producer/consumer style
systems where worker processes pull jobs from a queue.
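A pull-style system of that kind can be sketched in a few lines. The processing package was later added to the standard library as multiprocessing (in Python 2.6), so this example uses that module and Python 3 syntax instead of the original package; the function names and the squaring "job" are placeholders.

```python
import multiprocessing


def worker(jobs, results):
    """Pull numbers from the jobs queue until the None sentinel arrives."""
    for number in iter(jobs.get, None):
        results.put((number, number * number))


def run_jobs(numbers, worker_count=2):
    """Square each number in worker processes and return {number: square}."""
    jobs = multiprocessing.Queue()
    results = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=worker, args=(jobs, results))
        for _ in range(worker_count)
    ]
    for w in workers:
        w.start()
    for number in numbers:
        jobs.put(number)
    for _ in workers:
        jobs.put(None)  # one sentinel per worker so every loop ends
    # Collect the results before joining so no worker blocks on a full pipe.
    squares = dict(results.get(timeout=30) for _ in numbers)
    for w in workers:
        w.join()
    return squares
```

Call `run_jobs([1, 2, 3, 4])` from under an `if __name__ == '__main__':` guard, as in the listings, so that platforms which start workers by re-importing the main module do not launch them recursively.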

The package hides most of the details of selecting an appropriate
communication technique for the platform by choosing reasonable
default behaviors at runtime. The API does include a way to explicitly
select the communication mechanism, in case I need that level of
control to meet specific performance or compatibility requirements.
As a result, I end up with the best of both worlds: usable default
settings that I can tweak later to improve performance.

To make life even easier, the processing.Process class was purposely
designed to match the threading.Thread class API. Since the processing
package is almost a drop-in replacement for the standard library’s
threading module, many of my existing multi-threaded applications can
be converted to use processes simply by changing a few import
statements. That’s the sort of upgrade path I like.

Listing 1 contains a simple example, based on the examples found in
the processing documentation, which passes a string value between
processes as an argument to the Process instance and shows the
similarity between processing and threading. How much easier could it
be?

Listing 1

#!/usr/bin/env python
# Simple processing example

import os
from processing import Process, currentProcess

def f(name):
    print 'Hello,', name, currentProcess()

if __name__ == '__main__':
    print 'Parent process:', currentProcess()
    p = Process(target=f, args=[os.environ.get('USER', 'Unknown user')])
    p.start()
    p.join()

In a few cases, I’ll have more work to do to convert existing code
that was sharing objects which cannot easily be passed from one
process to another (file or database handles, etc.). Occasionally, a
performance-sensitive application needs more control over the
communication channel. In these situations, I might still have to get
my hands dirty with the lower-level APIs in the processing.connection
module. When that time comes, they are all exposed and ready to be
used directly.

Sharing State and Passing Data

For basic state handling, the processing package lets me share data
between processes by using shared objects, similar to the way I might
with threads. There are two types of “managers” for passing objects
between processes. The LocalManager uses shared memory, but the types
of objects that can be shared are limited by a low-level interface
which constrains the data types and sizes. LocalManager is
interesting, but it’s not what has me excited. The SyncManager is the
real story.

SyncManager implements tools for synchronizing inter-process
communication in the style of threaded programming. Locks, semaphores,
condition variables, and events are all there. Special implementations
of Queue, dict, and list that can be used between processes safely are
included as well (Listing 2). Since I’m already comfortable with these
APIs, there is almost no learning curve for converting to the versions
provided by the processing module.

Listing 2

#!/usr/bin/env python
# Pass an object through a queue to another process.

from processing import Process, Queue, currentProcess

class Example:
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return '%s (%s)' % (self.name, currentProcess())


def f(q):
    print 'In child:', q.get()


if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=[q])
    p.start()
    o = Example('tester')
    print 'In parent:', o
    q.put(o)
    p.join()

For basic state sharing with SyncManager, using a Namespace is about
as simple as I could hope. A namespace can hold arbitrary attributes,
and any attribute attached to a namespace instance is available in all
client processes which have a proxy for that namespace. That’s
extremely useful for sharing status information, especially since I
don’t have to decide up front what information to share or how big the
values can be. Any process can change existing values or add new
values to the namespace, as illustrated in Listing 3. Changes to the
contents of the namespace are reflected in the other processes the
next time the values are accessed.

Listing 3

#!/usr/bin/env python
# Using a shared namespace.

import processing

def f(ns):
    print ns
    ns.old_coords = (ns.x, ns.y)
    ns.x += 10
    ns.y += 10

if __name__ == '__main__':
    # Initialize the namespace
    manager = processing.Manager()
    ns = manager.Namespace()
    ns.x = 10
    ns.y = 20

    # Use the namespace in another process
    p = processing.Process(target=f, args=(ns,))
    p.start()
    p.join()

    # Show the resulting changes in this process
    print ns

Remote Servers

Configuring a SyncManager to listen on a network socket gives me even
more interesting options. I can start processes on separate hosts, and
they can share data using all of the same high-level mechanisms
described above. Once they are connected, there is no difference in
the way the client programs use the shared resources remotely or
locally.
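The same idea survives in the standard library's multiprocessing module, the descendant of the processing package, and the sketch below uses it with Python 3 syntax. The class and function names are made up for the example. In real use the server and the clients would be separate programs on different hosts; here the server side is written so it can run in a background thread, which keeps the sketch self-contained.

```python
import threading
from multiprocessing.managers import BaseManager


class ServerManager(BaseManager):
    """Manager used on the host that owns the shared queue."""


class ClientManager(BaseManager):
    """Manager used by programs that connect over the network."""


def start_server(shared_queue, authkey, address=('127.0.0.1', 0)):
    """Serve shared_queue on a socket and return the (host, port) in use."""
    ServerManager.register('get_queue', callable=lambda: shared_queue)
    manager = ServerManager(address=address, authkey=authkey)
    server = manager.get_server()
    # A real server program would simply call server.serve_forever().
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server.address


def connect_client(address, authkey):
    """Connect to a running server and return a proxy for its queue."""
    ClientManager.register('get_queue')
    manager = ClientManager(address=address, authkey=authkey)
    manager.connect()
    return manager.get_queue()
```

A client that calls `connect_client(address, authkey)` gets back a proxy with the usual `put()` and `get()` methods, so code that uses the queue does not need to know whether it is local or remote. Port 0 asks the operating system for any free port; a real deployment would pick a fixed one.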

The objects are passed between client and server using pickles, which
introduces a security hole: because unpacking a pickle may cause code
to be executed, it is risky to trust pickles from an unknown
source. To mitigate this risk, all communication in the processing
package can be secured with digest authentication using the hmac
module from the standard library. Callers can pass authentication keys
to the manager explicitly, but default values are generated if no key
is given. Once the connection is established, the authentication and
digest calculation are handled transparently for me.
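The general shape of digest authentication with the hmac module is easy to show, even though the package hides it. The following is a simplified illustration in Python 3 syntax and not the wire protocol the processing package actually uses; the function names and the choice of SHA-256 are assumptions of the example.

```python
import hmac
import os


def make_challenge():
    """Server side: create a random message for the client to sign."""
    return os.urandom(20)


def answer_challenge(key, challenge):
    """Client side: prove knowledge of the key without sending the key."""
    return hmac.new(key, challenge, 'sha256').digest()


def verify_answer(key, challenge, answer):
    """Server side: check the client's digest against the expected one."""
    expected = hmac.new(key, challenge, 'sha256').digest()
    # compare_digest() takes the same time whether or not the values match.
    return hmac.compare_digest(expected, answer)
```

Only a peer that holds the shared key can produce the right digest for a fresh challenge, so pickles are accepted only from connections that have passed this check.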

Conclusion

The GIL is a fact of life for Python programmers, and we need to
consider it along with all of the other factors that go into planning
large scale programs. Both the processing package and Parallel Python
tackle the issues of multi-processing in Python head on, from
different directions. Where the processing package tries to fit itself
into existing threading designs, pp uses a more explicit distributed
job model. Each approach has benefits and drawbacks, and neither is
suitable for every situation. Both, however, save you a lot of time
over the alternative of writing everything yourself with low-level
libraries. What an age to be alive!