Multi-processing techniques in Python

Originally published in Python Magazine, Volume 1, Number 10, October 2007

Has your multi-threaded application grown GILs? Take a look at these
packages for easy-to-use process management and inter-process
communication tools.

There is no predefined theme for this column, so I plan to cover a
different, likely unrelated, subject every month. The topics will
range anywhere from open source packages in the Python Package Index
(formerly The Cheese Shop, now PyPI) to new developments from around
the Python community, and anything that looks interesting in
between. If there is something you would like me to cover, send a
note with the details to doug dot hellmann at
pythonmagazine dot com, or add the link to your del.icio.us account
with the tag “pymagdifferent”.

I will make one stipulation for my own sake: any open source libraries
must be registered with PyPI and configured so that I can install them
with distutils. Creating a login at PyPI and registering your
project is easy, and only takes a few minutes. Go on, you know you
want to.

Scaling Python: Threads vs. Processes

In the ongoing discussion of performance and scaling issues with
Python, one persistent theme is the Global Interpreter Lock
(GIL). While the GIL has the advantage of simplifying the
implementation of CPython internals and extension modules, it prevents
users from achieving true multi-threaded parallelism by limiting the
interpreter to executing byte-codes in one thread at a time on a
single processor. Threads that block on I/O or call into extension
modules written in another language can release the GIL and let other
threads take over, of course. But if my application is written
entirely in Python, only a limited number of byte-codes will be
executed before the interpreter suspends one thread and switches to
another.

Eliminating the GIL has been on the wish lists of many Python
developers for a long time – I have been working with Python since
1998 and it was a hotly debated topic even then. Around that time,
Greg Stein produced a set of patches for Python 1.5 that eliminated
the GIL entirely, replacing it with a whole set of individual locks
for the mutable data structures (dictionaries, lists, etc.) that had
been protected by the GIL. The result was an interpreter that ran at
roughly half the normal speed, a side-effect of acquiring and
releasing the individual locks used to replace the GIL.

The GIL issue is unique to the C implementation of the
interpreter. The Java implementation of Python, Jython, supports true
threading by taking advantage of the underlying JVM. The IronPython
port, running on Microsoft’s CLR, is free of the GIL as well. On the
other hand, those platforms are always playing catch-up with new
language or library features, so if you’re hot to use the latest and
greatest, like I am, the C reference implementation is still your best
option.

Dropping the GIL from the C implementation remains a low priority for
a variety of reasons. The scope of the changes involved is beyond the
level of anything the current developers are interested in
tackling. Recently, Guido has said he would entertain patches
contributed by the Python community to remove the GIL, as long as
performance of single-threaded applications was not adversely
affected. As far as I know, no one has announced any plans to do so.

Even though there is a FAQ entry on the subject as part of the
standard documentation set for Python, from time to time a request
pops up on comp.lang.python or one of the Python-related mailing lists
to rewrite the interpreter so the lock can be removed. Each time it
happens, the answer is clear: use processes instead of threads.

That response does have some merit. Extension modules become more
complicated without the safety of the GIL. Processes typically have
fewer inherent deadlocking issues than threads. They can be
distributed between the CPUs on a host, and even more importantly, an
application that uses multiple processes is not limited by the size of
a single server, as a multi-threaded application would be.

Since the GIL will still be present in Python 3.0, it seems unlikely
to be removed any time soon. This may
disappoint some people, but it is not the end of the world. There are,
after all, strategies for working with multiple processes to scale
large applications. I’m not talking about the well-worn, established
techniques from the last millennium that use a different collection of
tools on every platform, nor the time-consuming and error-prone
practices that lead to solving the same problem time and
again. Techniques using low-level, operating-system-specific
libraries for process management are as passé as using compiled
languages for CGI programming. I don’t have time for this low-level
stuff any more, and neither do you. Let’s look at some modern
alternatives.

The subprocess module

Version 2.4 of Python introduced the subprocess module and finally
unified the disparate process management interfaces available in other
standard library packages to provide cross-platform support for
creating new processes. While subprocess solved some of my process
creation problems, it still primarily relies on pipes for inter-process
communication. Pipes are workable, but fairly low-level as far as
communication channels go, and using them for two-way message passing
while avoiding I/O deadlocks can be tricky (don’t forget to flush()).
Passing data through pipes is definitely not as transparent to the
application developer as sharing objects natively between threads.
And pipes don’t help when the processes need to scale beyond a single
server.
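
To illustrate the pipe-based style, here is a minimal sketch of my own
(the child command, tr, is just an illustration and assumes a
Unix-like system). Popen’s communicate() method writes the input,
closes the pipe, and reads back all of the output in one step, which
sidesteps the most common deadlock traps:

#!/usr/bin/env python
# Two-way communication with a child process over pipes.

from subprocess import Popen, PIPE

# Connect both stdin and stdout of the child to pipes.
proc = Popen(['tr', 'a-z', 'A-Z'], stdin=PIPE, stdout=PIPE)

# communicate() sends the input, flushes and closes the pipe, then
# collects all of the output, avoiding the deadlocks that come from
# reading and writing the pipes by hand.
output, errors = proc.communicate('hello from the parent\n')
print 'Child returned:', output.strip()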

Parallel Python

Vitalii Vanovschi’s Parallel Python package (pp) is a more complete
distributed processing package that takes a centralized approach.
Jobs are managed from a “job server”, and pushed out to individual
processing “nodes”.

Those worker nodes are separate processes, and can be running on the
same server or other servers accessible over the network. And when I
say that pp pushes jobs out to the processing nodes, I mean just that
– the code and data are both distributed from the central server to
the remote worker node when the job starts. I don’t even have to
install my application code on each machine that will run the jobs.

Here’s an example, taken right from the Parallel Python Quick Start
guide:

import pp
job_server = pp.Server()
# Start tasks
f1 = job_server.submit(func1, args1, depfuncs1, modules1)
f2 = job_server.submit(func1, args2, depfuncs1, modules1)
f3 = job_server.submit(func2, args3, depfuncs2, modules2)
# Retrieve the results
r1 = f1()
r2 = f2()
r3 = f3()

When the pp worker starts, it detects the number of CPUs in the system
and starts one process per CPU automatically, allowing me to take full
advantage of the computing resources available. Jobs are started
asynchronously, and run in parallel on an available node. The callable
object returned when the job is submitted blocks until the response is
ready, so response sets can be computed asynchronously, then merged
synchronously. Load distribution is transparent, making pp excellent
for clustered environments.
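
To make that concrete, here is a small runnable sketch of my own (the
prime-counting functions are just an illustration, not part of the pp
distribution). Each submit() call names the function to run, its
argument tuple, the functions it depends on, and the modules it needs,
and calling the returned object blocks until the result comes back:

#!/usr/bin/env python
# Count primes in several ranges in parallel with pp.

import pp

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def count_primes(start, end):
    # Depends on is_prime(), so it is listed in depfuncs below.
    return sum([1 for n in range(start, end) if is_prime(n)])

if __name__ == '__main__':
    job_server = pp.Server()  # one worker per detected CPU
    jobs = [job_server.submit(count_primes, (start, start + 10000),
                              (is_prime,), ())
            for start in range(0, 50000, 10000)]
    # Calling each job object blocks until its result is ready.
    print 'Primes below 50000:', sum([job() for job in jobs])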

One drawback to using pp is that I have to do a little more work up
front to identify the functions and modules on which each job depends,
so all of the code can be sent to the processing node. That’s easy (or
at least straightforward) when all of the jobs are identical, or use a
consistent set of libraries. If I don’t know everything about the job
in advance, though, I’m stuck. It would be nice if pp could
automatically detect dependencies at runtime. Maybe it will, in a
future version.

The processing Package

Parallel Python is impressive, but it is not the only option for
managing parallel jobs. The processing package from Richard Oudkerk
aims to solve the issues of creating and communicating with multiple
processes in a portable, Pythonic way. Whereas Parallel Python is
designed around a “push” style distribution model, the processing
package is set up to make it easy to create producer/consumer style
systems where worker processes pull jobs from a queue.
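
As a rough sketch of that pattern (my own, with made-up job values), a
pool of workers can pull from a shared Queue until a sentinel value
tells them to stop:

#!/usr/bin/env python
# Producer/consumer sketch: workers pull jobs from a shared queue.

from processing import Process, Queue

def worker(q):
    # Keep pulling jobs until the sentinel value arrives.
    while True:
        job = q.get()
        if job is None:
            break
        print 'Processing job', job

if __name__ == '__main__':
    q = Queue()
    workers = [Process(target=worker, args=[q]) for i in range(2)]
    for w in workers:
        w.start()
    for job in range(10):
        q.put(job)
    for w in workers:
        q.put(None)   # one sentinel per worker
    for w in workers:
        w.join()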

The package hides most of the details of selecting an appropriate
communication technique for the platform by choosing reasonable
default behaviors at runtime. The API does include a way to explicitly
select the communication mechanism, in case I need that level of
control to meet specific performance or compatibility requirements.
As a result, I end up with the best of both worlds: usable default
settings that I can tweak later to improve performance.

To make life even easier, the processing.Process class was purposely
designed to match the threading.Thread class API. Since the processing
package is almost a drop-in replacement for the standard library’s
threading module, many of my existing multi-threaded applications can
be converted to use processes simply by changing a few import
statements. That’s the sort of upgrade path I like.

Listing 1 contains a simple example, based on the examples found in
the processing documentation, which passes a string value between
processes as an argument to the Process instance and shows the
similarity between processing and threading. How much easier could it
be?

Listing 1

#!/usr/bin/env python
# Simple processing example

import os
from processing import Process, currentProcess

def f(name):
    print 'Hello,', name, currentProcess()

if __name__ == '__main__':
    print 'Parent process:', currentProcess()
    p = Process(target=f, args=[os.environ.get('USER', 'Unknown user')])
    p.start()
    p.join()

In a few cases, I’ll have more work to do to convert existing code
that was sharing objects which cannot easily be passed from one
process to another (file or database handles, etc.). Occasionally, a
performance-sensitive application needs more control over the
communication channel. In these situations, I might still have to get
my hands dirty with the lower-level APIs in the processing.connection
module. When that time comes, they are all exposed and ready to be
used directly.
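
As a rough sketch of what that looks like (my own code, and I am
assuming the Listener and Client classes exposed by
processing.connection), the parent can listen on a socket while the
child connects and sends a pickled object back:

#!/usr/bin/env python
# Sketch of a raw channel built on processing.connection.
# Assumes the Listener/Client classes and their send()/recv() methods.

from processing import Process
from processing.connection import Listener, Client

ADDRESS = ('localhost', 6000)

def child():
    # Connect to the parent's listener and send one message.
    conn = Client(ADDRESS)
    conn.send({'status': 'ready'})
    conn.close()

if __name__ == '__main__':
    listener = Listener(ADDRESS)
    p = Process(target=child)
    p.start()
    conn = listener.accept()        # blocks until the child connects
    print 'Received:', conn.recv()  # objects arrive as pickles
    conn.close()
    listener.close()
    p.join()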

Sharing State and Passing Data

For basic state handling, the processing package lets me share data
between processes by using shared objects, similar to the way I might
with threads. There are two types of “managers” for passing objects
between processes. The LocalManager uses shared memory, but the types
of objects that can be shared are limited by a low-level interface
which constrains the data types and sizes. LocalManager is
interesting, but it’s not what has me excited. The SyncManager is the
real story.

SyncManager implements tools for synchronizing inter-process
communication in the style of threaded programming. Locks, semaphores,
condition variables, and events are all there. Special implementations
of Queue, dict, and list that can be used between processes safely are
included as well (Listing 2). Since I’m already comfortable with these
APIs, there is almost no learning curve for converting to the versions
provided by the processing module.

Listing 2

#!/usr/bin/env python
# Pass an object through a queue to another process.

from processing import Process, Queue, currentProcess

class Example:
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return '%s (%s)' % (self.name, currentProcess())


def f(q):
    print 'In child:', q.get()


if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=[q])
    p.start()
    o = Example('tester')
    print 'In parent:', o
    q.put(o)
    p.join()
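
The manager-provided dict and list types work the same way. Here is a
tiny sketch of my own in which a child process updates a shared
dictionary through its proxy:

#!/usr/bin/env python
# Sketch: share a dictionary between processes through a manager.

import processing

def f(d):
    # Changes made through the proxy are visible in the parent.
    d['child'] = 'hello from %s' % processing.currentProcess()

if __name__ == '__main__':
    manager = processing.Manager()
    d = manager.dict()
    d['parent'] = 'started'

    p = processing.Process(target=f, args=(d,))
    p.start()
    p.join()

    print 'Child set:', d['child']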

For basic state sharing with SyncManager, using a Namespace is about
as simple as I could hope. A namespace can hold arbitrary attributes,
and any attribute attached to a namespace instance is available in all
client processes which have a proxy for that namespace. That’s
extremely useful for sharing status information, especially since I
don’t have to decide up front what information to share or how big the
values can be. Any process can change existing values or add new
values to the namespace, as illustrated in Listing 3. Changes to the
contents of the namespace are reflected in the other processes the
next time the values are accessed.

Listing 3

#!/usr/bin/env python
# Using a shared namespace.

import processing

def f(ns):
    print ns
    ns.old_coords = (ns.x, ns.y)
    ns.x += 10
    ns.y += 10

if __name__ == '__main__':
    # Initialize the namespace
    manager = processing.Manager()
    ns = manager.Namespace()
    ns.x = 10
    ns.y = 20

    # Use the namespace in another process
    p = processing.Process(target=f, args=(ns,))
    p.start()
    p.join()

    # Show the resulting changes in this process
    print ns

Remote Servers

Configuring a SyncManager to listen on a network socket gives me even
more interesting options. I can start processes on separate hosts, and
they can share data using all of the same high-level mechanisms
described above. Once they are connected, there is no difference in
the way the client programs use the shared resources remotely or
locally.

The objects are passed between client and server using pickles, which
introduces a security hole: because unpacking a pickle may cause code
to be executed, it is risky to trust pickles from an unknown
source. To mitigate this risk, all communication in the processing
package can be secured with digest authentication using the hmac
module from the standard library. Callers can pass authentication keys
to the manager explicitly, but default values are generated if no key
is given. Once the connection is established, the authentication and
digest calculation are handled transparently for me.

Conclusion

The GIL is a fact of life for Python programmers, and we need to
consider it along with all of the other factors that go into planning
large scale programs. Both the processing package and Parallel Python
tackle the issues of multi-processing in Python head on, from
different directions. Where the processing package tries to fit itself
into existing threading designs, pp uses a more explicit distributed
job model. Each approach has benefits and drawbacks, and neither is
suitable for every situation. Both, however, save you a lot of time
over the alternative of writing everything yourself with low-level
libraries. What an age to be alive!