iTunes 9.1 Smart Playlist Order Fix

Today I reset my 30GB video iPod and all of the smart playlists broke.
Some were sorted in the wrong order (by name), some had no contents, and
some were missing entirely.

I found a few references to the same problem elsewhere.

None of those solutions worked for me. What did seem to work was to
set iTunes to sync all episodes of all podcasts (the default was to sync
only unplayed episodes).

Caching RSS Feeds With feedcache

Originally published in Python Magazine, Volume 1, Issue 11, November 2007

The past several years have seen a steady increase in the use of
RSS and Atom feeds for data sharing. Blogs, podcasts, social
networking sites, search engines, and news services are just a few
examples of data sources delivered via such feeds. Working with
internet services requires care, because inefficiencies in one
client implementation may cause performance problems with the
service that can be felt by all of the consumers accessing the
same server. In this article, I describe the development of the
feedcache package, and give examples of how you can use it to
optimize the use of data feeds in your application.

I frequently find myself wanting to listen to one or two episodes from
a podcast, but not wanting to subscribe to the entire series. In order
to scratch this itch, I built a web-based tool, hosted at
http://www.castsampler.com/, to let me pick and choose individual
episodes from a variety of podcast feeds, then construct a single feed
with the results. Now I subscribe to the single feed with my podcast
client, and easily populate it with new episodes when I encounter any
that sound interesting. The feedcache package was developed as part
of this tool to manage accessing and updating the feeds efficiently,
and has been released separately under the BSD license.

Example Feed Data

The two most widely used formats for syndicating web data are RSS (in
one of several versions) and Atom. Both formats have a
similar structure. Each feed begins with basic information about the
data source (title, link, description, etc.). The introductory
information is followed by a series of “items”, each of which
represents a resource like a blog post, news article, or podcast.
Each item, in turn, has a title, description, and other information
like when it was written. It may also refer to one or more
attachments, or enclosures.

Listing 1 shows a sample RSS 2.0 feed and Listing 2 shows a sample
Atom feed. Each sample listing contains one item with a single podcast
enclosure. Both formats are XML, and contain essentially the same
data. They use slightly different tag names, though, and handle podcast
enclosures differently, which can make supporting both formats extra
work in some environments. Fortunately, Python developers do not need to worry
about the differences in the feed formats, thanks to the Universal
Feed Parser.

Listing 1

<?xml version="1.0" encoding="utf-8"?>

<rss version="2.0">
  <channel>
    <title>Sample RSS 2.0 Feed</title>
    <link>http://www.example.com/rss.xml</link>
    <description>Sample feed using RSS 2.0 format.</description>
    <language>en-us</language>
    <item>
      <title>item title goes here</title>
      <link>http://www.example.com/items/1/</link>
      <description>description goes here</description>
      <author>authoremail@example.com (author goes here)</author>
      <pubDate>Sat, 4 Aug 2007 15:00:36 -0000</pubDate>
      <guid>http://www.example.com/items/1/</guid>
      <enclosure url="http://www.example.com/items/1/enclosure" length="100" type="audio/mpeg">
      </enclosure>
    </item>
  </channel>
</rss>

Listing 2

<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">
  <title>Sample Atom Feed</title>
  <link href="http://www.example.com/" rel="alternate"></link>
  <link href="http://www.example.com/atom.xml" rel="self"></link>
  <id>http://www.example.com/atom.xml</id>
  <updated>2007-08-04T15:00:36Z</updated>
  <entry>
    <title>title goes here</title>
    <link href="http://www.example.com/items/1/" rel="alternate"></link>
    <updated>2007-08-04T15:00:36Z</updated>
    <author>
      <name>author goes here</name>
      <email>authoremail@example.com</email>
    </author>
    <id>http://www.example.com/items/1/</id>
    <summary type="html">description goes here</summary>
    <link length="100" href="http://www.example.com/items/1/enclosure" type="audio/mpeg" rel="enclosure">
    </link>
  </entry>
</feed>

Universal Feed Parser

Mark Pilgrim’s Universal Feed Parser is an open source module that
manages most aspects of downloading and parsing RSS and Atom feeds.
Once the feed has been downloaded and parsed, the parser returns an
object with all of the parsed data easily accessible through a single
API, regardless of the original feed format.
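
For example, the podcast enclosure from Listing 1's <enclosure> element
and the one from Listing 2's rel="enclosure" link both appear in the
entry's enclosures list after parsing. A minimal sketch (the URL is a
placeholder for any RSS or Atom feed):

import feedparser

data = feedparser.parse('http://www.example.com/atom.xml')
for entry in data.entries:
    for enclosure in entry.enclosures:
        # href, type, and length look the same whether the original
        # feed was RSS 2.0 or Atom
        print enclosure.href, enclosure.type, enclosure.length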

Listing 3 shows a simple example program for accessing feeds with
feedparser. Each URL from the command line arguments is passed to
feedparser.parse() to be downloaded and parsed. The results are
returned as a FeedParserDict. The properties of the FeedParserDict
can be accessed via the dictionary API or using attribute names, as
the print statement in the loop illustrates.

Listing 3

#!/usr/bin/env python
"""Print contents of feeds specified on the command line.
"""

import feedparser
import sys

for url in sys.argv[1:]:
    data = feedparser.parse(url)
    for entry in data.entries:
        print '%s: %s' % (data.feed.title, entry.title)

When the sample program in Listing 3 is run with the URL for the feed
of feedcache project releases, it shows the titles for the
releases available right now:

$ python Listing3.py http://feeds.feedburner.com/FeedcacheReleases
feedcache Releases: feedcache 0.1
feedcache Releases: feedcache 0.2
feedcache Releases: feedcache 0.3
feedcache Releases: feedcache 0.4
feedcache Releases: feedcache 0.5
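
The same values are also available through the dictionary interface,
which is convenient when a key name is computed at runtime. A short
sketch of the equivalent lookups (not part of Listing 3):

import feedparser

data = feedparser.parse('http://feeds.feedburner.com/FeedcacheReleases')
# attribute-style and dictionary-style access return the same values
print data.feed.title
print data['feed']['title']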

Every time the program runs, it fetches the entire feed, whether the
contents have changed or not. That inefficiency might not matter a
great deal for a client program that is not run frequently, but the
inefficiencies add up on the server when many clients access the same
feed, especially if they check the feed on a regular basis. This
inefficient behavior can become an especially bad problem for the
server if the feed contents are produced dynamically, since each
client request costs the server CPU, I/O, and bandwidth to produce
the XML representation of the feed. Some sites are
understandably strict about how often a client can retrieve feeds, to
cut down on heavy bandwidth and CPU consumers. Slashdot, for example,
returns a special feed with a warning to any client that accesses
their RSS feed too frequently over a short span of time.

A Different Type of Podcast Aggregator

A typical aggregator design would include a monitor to regularly
download the feeds and store the fresh information about the feed and
its contents in a database. The requirements for CastSampler are a
little different, though.

CastSampler remembers the feeds to which a user has subscribed, but
unlike other feed aggregators, it only downloads the episode metadata
while the user is choosing episodes to add to their download
feed. Since the user does not automatically receive every episode of
every feed, the aggregator does not need to constantly monitor all of
the feeds. Instead, it shows a list of episodes for a selected feed,
and lets the user choose which episodes to download. Then it needs to
remember those selected episodes later so it can produce the combined
feed for the user’s podcast client.

If every item from every feed was stored in the database, most of the
data in the database would be for items that were never selected for
download. There would need to be a way to remove old data from the
database when it expired or was no longer valid, adding to the
maintenance work for the site. Instead, CastSampler only uses the
database to store information about episodes selected by the user.
The rest of the data about the feed is stored outside of the database
in a form that makes it easier to discard old data when the feed is
updated. This division eliminates a lot of the data management effort
behind running the site.

Feedcache Requirements

An important goal for this project was to make CastSampler a polite
consumer of feeds, and ensure that it did not overload servers while a
user was selecting podcast episodes interactively. By caching the feed
data for a short period of time, CastSampler could avoid accessing
feeds every time it needed to show the feed data. A persistent cache,
written to disk, would let the data be reused even if the application
was restarted, such as might happen during development. Using a cache
would also improve responsiveness, since reading data from the local
disk would be faster than fetching the feed from the remote server.
To further reduce server load, feedcache is designed to take
advantage of conditional GET features of HTTP, to avoid downloading
the full feed whenever possible.

Another goal was to have a small API for the cache. It should take
care of everything for the caller, so there would not need to be many
functions to interact with it. To retrieve the contents of a feed, the
caller should only have to provide the URL for that feed. All other
information needed to track the freshness of the data in the cache
would be managed internally.

It was also important for the cache to be able to store data in
multiple ways, to make it more flexible for other programmers who
might want to use it. Although CastSampler was going to store the
cache on disk, other applications with more computing resources or
tighter performance requirements might prefer to hold the cache in
memory. Using disk storage should not be hard coded into the cache
management logic.

These requirements led to a design which split the responsibility for
managing the cached data between two objects. The Cache object
tracks information about a feed so it can download the latest version
as efficiently as possible, only when needed. Persistent storage of
the data in the cache is handled by a separate back end storage
object. Dividing the responsibilities in this way maximizes the
flexibility of the Cache, since it can concentrate on tracking
whether the feed is up to date without worrying about storage
management. It also lets Cache users take advantage of multiple
storage implementations.
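
Because the Cache relies on only a few dictionary operations, a back
end can be anything that supports them. The hypothetical stub below is
not part of feedcache; it just shows how small a custom storage class
can be, since the Cache code in the later listings only ever calls
get() and item assignment on its storage object.

class LoggingStorage(object):
    """Hypothetical back end wrapping a dictionary and reporting writes."""

    def __init__(self):
        self.data = {}

    def get(self, url, default=None):
        return self.data.get(url, default)

    def __setitem__(self, url, value):
        print 'updating cached copy of', url
        self.data[url] = value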

The Cache Class

Once the basic requirements and a skeleton design were worked out, the
next step was to start writing tests so the implementation of
Cache could begin. Working with a few simple tests would clarify
how a Cache user would want to access feeds. The first test was to
verify that the Cache would fetch feed data.

import unittest, cache
class CacheTest(unittest.TestCase):
    def testFetch(self):
        c = cache.Cache({})
        parsed_feed = c.fetch('http://feeds.feedburner.com/FeedcacheReleases')
        self.failUnless(parsed_feed.entries)

Since the design separated storage and feed management
responsibilities, it was natural to pass the storage handler to the
Cache when it is initialized. The dictionary API is used for the
storage because there are several storage options available that
support it. The shelve module in the Python standard library stores
data persistently using an object that conforms to the dictionary API,
as does the shove library from L.C. Rees. Either library would work
well for the final application. For initial testing, using a simple
dictionary to hold the data in memory was convenient, since that meant
the tests would not need any external resources.

After constructing the Cache, the next step in the test is to
retrieve a feed. I considered using the __getitem__() hook,
but since Cache would not support any of the other dictionary
methods, I rejected it in favor of an explicit method,
fetch(). The caller passes a feed URL to fetch(), which
returns a FeedParserDict instance. Listing 4 shows the first version
of the Cache class that works for the test as it is written. No
actual caching is being done, yet. The Cache instance simply uses
the feedparser module to retrieve and parse the feed.

Listing 4

#!/usr/bin/env python
"""The first version of Cache
"""

import unittest
import feedparser

class Cache:
    def __init__(self, storage):
        self.storage = storage
        return

    def fetch(self, url):
        return feedparser.parse(url)

class CacheTest(unittest.TestCase):

    def testFetch(self):
        c = Cache({})
        parsed_feed = c.fetch('http://feeds.feedburner.com/FeedcacheReleases')
        self.failUnless(parsed_feed.entries)
        return

if __name__ == '__main__':
    unittest.main()

Throttling Downloads

Now that Cache could successfully download feed data, the first
optimization to make was to hold on to the data and track its age.
Then for every call to fetch(), Cache could first check to see
if fresh data was already available locally before going out to the
server to download the feed again.

Listing 5 shows the version of Cache with a download throttle, in
the form of a timeToLiveSeconds parameter. Items already in the
cache will be reused until they are older than
timeToLiveSeconds. The default value for timeToLiveSeconds
means that any given feed will not be checked more often than every
five minutes.

Listing 5

#!/usr/bin/env python
"""The first version of Cache
"""

import time
import unittest
import feedparser

class Cache:
    def __init__(self, storage, timeToLiveSeconds=300):
        self.storage = storage
        self.time_to_live = timeToLiveSeconds
        return

    def fetch(self, url):
        now = time.time()
        cached_time, cached_content = self.storage.get(url, (None, None))

        # Does the storage contain a version of the data
        # which is older than the time-to-live?
        if cached_time is not None:
            age = now - cached_time
            if age <= self.time_to_live:
                return cached_content

        parsed_data = feedparser.parse(url)
        self.storage[url] = (now, parsed_data)
        return parsed_data

class CacheTest(unittest.TestCase):

    def testFetch(self):
        c = Cache({})
        parsed_feed = c.fetch('http://feeds.feedburner.com/FeedcacheReleases')
        self.failUnless(parsed_feed.entries)
        return

    def testReuseContentsWithinTimeToLiveWindow(self):
        url = 'http://feeds.feedburner.com/FeedcacheReleases'
        c = Cache({ url:(time.time(), 'prepopulated cache')})
        cache_contents = c.fetch(url)
        self.failUnlessEqual(cache_contents, 'prepopulated cache')
        return

if __name__ == '__main__':
    unittest.main()

The new implementation of fetch() stores the current time along
with the feed data when the storage is updated. When fetch() is
called again with the same URL, the time in the cache is checked
against the current time to determine if the value in the cache is
fresh enough. The testReuseContentsWithinTimeToLiveWindow test verifies this behavior by
pre-populating the Cache‘s storage with data, and checking to
see that the existing cache contents are returned instead of the
contents of the feed.

Conditional HTTP GET

Conditional HTTP GET allows a client to tell a server something about
the version of a feed the client already has. The server can decide if
the contents of the feed have changed and, if they have not, send a
short status code in the HTTP response instead of a complete copy of
the feed data. Conditional GET is primarily a way to conserve
bandwidth, but if the feed has not changed and the server’s version
checking algorithm is efficient then the server may use fewer CPU
resources to prepare the response, as well.

When a server implements conditional GET, it uses extra headers with
each response to notify the client. There are two headers involved,
and the server can use either or both together, in case the client
only supports one. Cache supports both headers.

Although timestamps are an imprecise way to detect change, since the
time on different servers in a pool might vary slightly, they are
simple to work with. The Last-Modified header contains a timestamp
value that indicates when the feed contents last changed. The client
sends the timestamp back to the server in the next request as
If-Modified-Since. The server then compares the dates to determine
if the feed has been modified since the last request from the client.

A more precise way to determine if the feed has changed is to use an
Entity Tag in the ETag header. An ETag is a hashed
representation of the feed state, or of a value the server can use to
quickly determine if the feed has been updated. The data and algorithm
for computing the hash are left up to the server, but it should be less
expensive than returning the feed contents or there won’t be any
performance gains. When the client sees an ETag header, it can
send the associated value back to the server with the next request in
the If-None-Match request header. When the server sees
If-None-Match, it computes the current hash and compares it to the
value sent by the client. If they match, the feed has not changed.

When using either ETag or modification timestamps, if the server
determines that the feed has not been updated since the previous
request, it returns a response code of 304, or “Not Modified”, and
includes nothing in the body of the response. When it sees the 304
status in the response from the server, the client should reuse the
version of the feed it already has.
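
Stepping outside of feedparser for a moment, the exchange looks
roughly like this when made directly with urllib2. The URL, ETag, and
timestamp values are placeholders; a real client would echo back
whatever validators the server sent with its previous response.

import urllib2

request = urllib2.Request('http://www.example.com/atom.xml')
# Send back the validators saved from the previous response
request.add_header('If-None-Match', '"abc123"')
request.add_header('If-Modified-Since', 'Sat, 04 Aug 2007 15:00:36 GMT')

try:
    response = urllib2.urlopen(request)
    feed_xml = response.read()   # 200: the server sent a fresh copy
except urllib2.HTTPError, err:
    if err.code == 304:
        pass                     # 304: not modified, reuse the cached copy
    else:
        raise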

Creating a Test Server

In order to write correct tests that exercise conditional GET in
feedcache, more control over the server was needed. The feedburner
URL used in the earlier tests might be down, or might return
different data if the feed was updated. The server had to respond
reliably with data the test code knew in advance, and it could not
stop responding if the tests queried it too often. The tests also
needed to control which of the headers (ETag or Last-Modified) was
used to determine if the feed had changed, so both methods could be
tested independently. The
solution was to write a small test HTTP server that could be managed
by the unit tests and configured as needed. Creating the test server
was easy, using a few standard library modules.

The test server code, along with a base class for unit tests that use
it, can be found in Listing 6. The TestHTTPServer class is derived
from BaseHTTPServer.HTTPServer. Its serve_forever() method has been
overridden with an implementation that
checks a flag after each request to see if the server should keep
running. The test harness sets the flag to stop the test server after
each test. The serve_forever() loop also counts the requests
successfully processed, so the tests can determine how many times the
Cache fetches a feed.

Listing 6

#!/usr/bin/env python
"""Simple HTTP server for testing the feed cache.
"""

import BaseHTTPServer
import email.utils
import logging
import md5
import threading
import time
import unittest
import urllib


def make_etag(data):
    """Given a string containing data to be returned to the client,
    compute an ETag value for the data.
    """
    _md5 = md5.new()
    _md5.update(data)
    return _md5.hexdigest()


class TestHTTPHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    "HTTP request handler which serves the same feed data every time."

    FEED_DATA = """<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">
  <title>CacheTest test data</title>
  <link href="http://localhost/feedcache/" rel="alternate"></link>
  <link href="http://localhost/feedcache/atom/" rel="self"></link>
  <id>http://localhost/feedcache/</id>
  <updated>2006-10-14T11:00:36Z</updated>
  <entry>
    <title>single test entry</title>
    <link href="http://www.example.com/" rel="alternate"></link>
    <updated>2006-10-14T11:00:36Z</updated>
    <author>
      <name>author goes here</name>
      <email>authoremail@example.com</email>
    </author>
    <id>http://www.example.com/</id>
    <summary type="html">description goes here</summary>
    <link length="100" href="http://www.example.com/enclosure" type="text/html" rel="enclosure">
    </link>
  </entry>
</feed>"""

    # The data does not change, so save the ETag and modified times
    # as class attributes.
    ETAG = make_etag(FEED_DATA)
    MODIFIED_TIME = email.utils.formatdate(usegmt=True)

    def do_GET(self):
        "Handle GET requests."

        if self.path == '/shutdown':
            # Shortcut to handle stopping the server
            self.server.stop()
            self.send_response(200)

        else:
            incoming_etag = self.headers.get('If-None-Match', None)
            incoming_modified = self.headers.get('If-Modified-Since', None)

            send_data = True

            # Does the client have the same version of the data we have?
            if self.server.apply_modified_headers:
                if incoming_etag == self.ETAG:
                    self.send_response(304)
                    send_data = False

                elif incoming_modified == self.MODIFIED_TIME:
                    self.send_response(304)
                    send_data = False

            # Now optionally send the data, if the client needs it
            if send_data:
                self.send_response(200)
                self.send_header('Content-Type', 'application/atom+xml')
                self.send_header('ETag', self.ETAG)
                self.send_header('Last-Modified', self.MODIFIED_TIME)
                self.end_headers()

                self.wfile.write(self.FEED_DATA)
        return


class TestHTTPServer(BaseHTTPServer.HTTPServer):
    """HTTP Server which counts the number of requests made
    and can stop based on client instructions.
    """

    def __init__(self, applyModifiedHeaders=True):
        self.apply_modified_headers = applyModifiedHeaders
        self.keep_serving = True
        self.request_count = 0
        BaseHTTPServer.HTTPServer.__init__(self, ('', 9999), TestHTTPHandler)
        return

    def getNumRequests(self):
        "Return the number of requests which have been made on the server."
        return self.request_count

    def stop(self):
        "Stop serving requests, after the next request."
        self.keep_serving = False
        return

    def serve_forever(self):
        "Main loop for server"
        while self.keep_serving:
            self.handle_request()
            self.request_count += 1
        return


class HTTPTestBase(unittest.TestCase):
    "Base class for tests that use a TestHTTPServer"

    TEST_URL = 'http://localhost:9999/'

    CACHE_TTL = 0

    def setUp(self):
        self.server = self.getServer()
        self.server_thread = threading.Thread(target=self.server.serve_forever)
        self.server_thread.setDaemon(True) # so the tests don't hang if cleanup fails
        self.server_thread.start()
        return

    def getServer(self):
        "Return a web server for the test."
        return TestHTTPServer()

    def tearDown(self):
        # Stop the server thread
        ignore = urllib.urlretrieve('http://localhost:9999/shutdown')
        time.sleep(1)
        self.server.server_close()
        self.server_thread.join()
        return


class HTTPTest(HTTPTestBase):

    def testResponse(self):
        # Verify that the server thread responds
        # without error.
        filename, response = urllib.urlretrieve(self.TEST_URL)
        return

if __name__ == '__main__':
    unittest.main()

The test server processes incoming HTTP requests with
TestHTTPHandler, derived from
BaseHTTPServer.BaseHTTPRequestHandler. TestHTTPHandler
implements do_GET() to respond to HTTP GET requests.
Feed data for the tests is hard coded in the FEED_DATA class
attribute. The URL path /shutdown is used to tell the
server to stop responding to requests. All other paths are treated as
requests for the feed data. The requests are processed by checking the
If-None-Match and If-Modified-Since headers, and responding
either with a 304 status or with the static feed data.

HTTPTestBase is a convenience base class to be used by other
tests. It manages a TestHTTPServer instance in a separate thread,
so the tests can all run in a single process. Listing 7 shows what the
existing tests look like, rewritten to use the HTTPTestBase as a
base class. The only differences are the base class for the tests and
the use of self.TEST_URL, which points to the local test server
instead of the feedburner URL from Listing 5.

Listing 7

#!/usr/bin/env python
"""The first version of Cache
"""

import time
import unittest
import feedparser
from Listing5 import Cache
from Listing6 import HTTPTestBase

class CacheTest(HTTPTestBase):

    def testFetch(self):
        c = Cache({})
        parsed_feed = c.fetch(self.TEST_URL)
        self.failUnless(parsed_feed.entries)
        return

    def testReuseContentsWithinTimeToLiveWindow(self):
        c = Cache({ self.TEST_URL:(time.time(), 'prepopulated cache')})
        cache_contents = c.fetch(self.TEST_URL)
        self.failUnlessEqual(cache_contents, 'prepopulated cache')
        return

if __name__ == '__main__':
    unittest.main()

Implementing Conditional HTTP GET

With these testing tools in place, the next step was to enhance the
Cache class to monitor and use the conditional HTTP GET
parameters. Listing 8 shows the final version of Cache with these
features. The fetch() method has been enhanced to send the
ETag and modified time from the cached version of the feed to the
server, when they are available.

Listing 8

#!/usr/bin/env python
"""Cache class with conditional HTTP GET support.
"""

import feedparser
import time
import unittest
import UserDict

import Listing6 # For the test base class

class Cache:

    def __init__(self, storage, timeToLiveSeconds=300, userAgent='feedcache'):
        self.storage = storage
        self.time_to_live = timeToLiveSeconds
        self.user_agent = userAgent
        return

    def fetch(self, url):
        modified = None
        etag = None
        now = time.time()

        cached_time, cached_content = self.storage.get(url, (None, None))

        # Does the storage contain a version of the data
        # which is older than the time-to-live?
        if cached_time is not None:
            if self.time_to_live:
                age = now - cached_time
                if age <= self.time_to_live:
                    return cached_content

            # The cache is out of date, but we have
            # something.  Try to use the etag and modified_time
            # values from the cached content.
            etag = cached_content.get('etag')
            modified = cached_content.get('modified')

        # We know we need to fetch, so go ahead and do it.
        parsed_result = feedparser.parse(url,
                                         agent=self.user_agent,
                                         modified=modified,
                                         etag=etag,
                                         )

        status = parsed_result.get('status', None)
        if status == 304:
            # No new data, based on the etag or modified values.
            # We need to update the modified time in the
            # storage, though, so we know that what we have
            # stored is up to date.
            self.storage[url] = (now, cached_content)

            # Return the data from the cache, since
            # the parsed data will be empty.
            parsed_result = cached_content
        elif status == 200:
            # There is new content, so store it unless there was an error.
            error = parsed_result.get('bozo_exception')
            if not error:
                self.storage[url] = (now, parsed_result)

        return parsed_result


class SingleWriteMemoryStorage(UserDict.UserDict):
    """Cache storage which only allows the cache value
    for a URL to be updated one time.
    """

    def __setitem__(self, url, data):
        if url in self.keys():
            modified, existing = self[url]
            # Allow the modified time to change,
            # but not the feed content.
            if data[1] != existing:
                raise AssertionError('Trying to update cache for %s to %s' 
                                         % (url, data))
        UserDict.UserDict.__setitem__(self, url, data)
        return


class CacheConditionalGETTest(Listing6.HTTPTestBase):

    def setUp(self):
        Listing6.HTTPTestBase.setUp(self)
        self.cache = Cache(storage=SingleWriteMemoryStorage(),
                           timeToLiveSeconds=0, # so we do not reuse the local copy
                           )
        return

    def testFetchOnceForEtag(self):
        # Fetch data which has a valid ETag value, and verify
        # that while we hit the server twice the response
        # codes cause us to use the same data.

        # First fetch populates the cache
        response1 = self.cache.fetch(self.TEST_URL)
        self.failUnlessEqual(response1.feed.title, 'CacheTest test data')

        # Remove the modified setting from the cache so we know
        # the next time we check the etag will be used
        # to check for updates.  Since we are using an in-memory
        # cache, modifying response1 updates the cache storage
        # directly.
        response1['modified'] = None

        # Wait so the cache data times out
        time.sleep(1)

        # This should result in a 304 status, and no data from
        # the server.  That means the cache won't try to
        # update the storage, so our SingleWriteMemoryStorage
        # should not raise and we should have the same
        # response object.
        response2 = self.cache.fetch(self.TEST_URL)
        self.failUnless(response1 is response2)

        # Should have hit the server twice
        self.failUnlessEqual(self.server.getNumRequests(), 2)
        return

    def testFetchOnceForModifiedTime(self):
        # Fetch data which has a valid Last-Modified value, and verify
        # that while we hit the server twice the response
        # codes cause us to use the same data.

        # First fetch populates the cache
        response1 = self.cache.fetch(self.TEST_URL)
        self.failUnlessEqual(response1.feed.title, 'CacheTest test data')

        # Remove the etag setting from the cache so we know
        # the next time we check the modified time will be used
        # to check for updates.  Since we are using an in-memory
        # cache, modifying response1 updates the cache storage
        # directly.
        response1['etag'] = None

        # Wait so the cache data times out
        time.sleep(1)

        # This should result in a 304 status, and no data from
        # the server.  That means the cache won't try to
        # update the storage, so our SingleWriteMemoryStorage
        # should not raise and we should have the same
        # response object.
        response2 = self.cache.fetch(self.TEST_URL)
        self.failUnless(response1 is response2)

        # Should have hit the server twice
        self.failUnlessEqual(self.server.getNumRequests(), 2)
        return

if __name__ == '__main__':
    unittest.main()

The FeedParserDict object returned from feedparser.parse()
conveniently includes the ETag and modified timestamp, if the
server sent them. Once the cached feed is determined to be out of
date, the etag and modified values are retrieved from the cached
content so they can be passed to feedparser.parse() on the next request.

Since the updated client sends If-None-Match and If-Modified-Since
headers, the server may now respond with a status code indicating that
the cached copy of the data is still valid. It is no longer sufficient
to simply store the response from the server before returning it. The
status code must be checked in fetch(), and if the status is
304 then the timestamp of the cached copy is updated. If the
timestamp was not updated, then as soon as the cached copy of the feed
exceeded the time-to-live, the Cache would request a new copy of
the feed from the server every time the feed was accessed. Updating
the timestamp ensures that the download throttling remains enforced.

Separate tests for each conditional GET header are implemented in
CacheConditionalGETTest. To verify that the Cache handles the
304 status code properly and does not try to update the contents
of the storage on a second fetch, these tests use a special storage
class. The SingleWriteMemoryStorage raises an AssertionError
if a value is modified after it is set the first time. An
AssertionError is used, because that is how unittest.TestCase
signals a test failure, and modifying the contents of the storage is a
failure for these tests.

Each test method of CacheConditionalGETTest verifies handling for
one of the conditional GET headers at a time. Since the test server
always sets both headers, each test clears one value from the cache
before making the second request. The remaining header value is sent
to the server as part of the second request, and the server responds
with the 304 status code.

Persistent Storage With shelve

All of the examples and tests so far have used in-memory storage
options. For CastSampler, though, the cache of feed data needed to be
stored on disk. As mentioned earlier, the shelve module in the
standard library provides a simple persistent storage mechanism. It
also conforms to the dictionary API used by the Cache class.

Using shelve by itself works in a simple single-threaded case, but
it is not clear from its documentation whether shelve supports
write access from multiple concurrent threads. To ensure the shelf is
not corrupted, a thread lock should be used. CacheStorageLock is a
simple wrapper around shelve that uses a lock to prevent more than
one thread from accessing the shelf simultaneously. Listing 9 contains
the code for the CacheStorageLock and a test that illustrates
using it to combine a Cache and shelve.

Listing 9

#!/usr/bin/env python
"""Using Cache with shelve.
"""

from __future__ import with_statement

import os
import shelve
import tempfile
import threading
import unittest

from Listing6 import HTTPTestBase
from Listing8 import Cache

class CacheStorageLock:

    def __init__(self, shelf):
        self.lock = threading.Lock()
        self.shelf = shelf
        return

    def __getitem__(self, key):
        with self.lock:
            return self.shelf[key]

    def get(self, key, default=None):
        with self.lock:
            try:
                return self.shelf[key]
            except KeyError:
                return default

    def __setitem__(self, key, value):
        with self.lock:
            self.shelf[key] = value


class CacheShelveTest(HTTPTestBase):

    def setUp(self):
        HTTPTestBase.setUp(self)
        handle, self.shelve_filename = tempfile.mkstemp('.shelve')
        os.close(handle) # we just want the file name, so close the open handle
        os.unlink(self.shelve_filename) # remove empty file so shelve is not confused
        return

    def tearDown(self):
        try:
            os.unlink(self.shelve_filename)
        except AttributeError:
            pass
        HTTPTestBase.tearDown(self)
        return

    def test(self):
        storage = shelve.open(self.shelve_filename)
        locking_storage = CacheStorageLock(storage)
        try:
            fc = Cache(locking_storage)

            # First fetch the data through the cache
            parsed_data = fc.fetch(self.TEST_URL)
            self.failUnlessEqual(parsed_data.feed.title, 'CacheTest test data')

            # Now retrieve the same data directly from the shelf
            modified, shelved_data = storage[self.TEST_URL]

            # The data should be the same
            self.failUnlessEqual(parsed_data, shelved_data)
        finally:
            storage.close()
        return


if __name__ == '__main__':
    unittest.main()

The test setUp() method uses tempfile to create a temporary
filename for the cache. The temporary file has to be deleted in
setUp() because if the file exists but is empty, shelve
cannot determine which database module to use to open
it. The test() method fetches the data from the server, then
compares the returned data with the data in the shelf to verify that
they are the same.

CacheStorageLock uses a threading.Lock instance to control
access to the shelf. It only manages access for the methods known to
be used by Cache. The lock is acquired and released using the
with statement, which is not enabled by default until Python 2.6.
Since this code was written for Python 2.5, the module starts with a
from __future__ import statement to enable the with syntax.
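
For readers who have not used the new syntax yet, each with block in
CacheStorageLock is roughly equivalent to acquiring and releasing the
lock by hand:

# What the __getitem__() method's "with self.lock:" block does,
# spelled out with the older try/finally form:
def __getitem__(self, key):
    self.lock.acquire()
    try:
        return self.shelf[key]
    finally:
        self.lock.release()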

Other Persistence Options

At any one time, shelve only allows one process to open a shelf
file to write to it. In applications with multiple processes that need
to modify the cache, alternative storage options are
desirable. Cache treats its storage object as a dictionary, so any
class that conforms to the dictionary API can be used for back end
storage. The shove module, by L. C. Rees, uses the dictionary API
and offers support for a variety of back end storage options. The
supported options include relational databases, BSD-style databases,
Amazon’s S3 storage service, and others.

The filesystem store option was particularly interesting for
CastSampler. With shove's file store, each key is mapped to a
filename. The data associated with the key is pickled and stored in
the file. By using separate files, it is possible to have separate
threads and processes updating the cache simultaneously. Although the
shove file implementation does not handle file locking, for my
purposes it was unlikely that two threads would try to update the same
feed at the same time.

Listing 10 includes a test that illustrates using shove file
storage with feedcache. The primary difference in the APIs for
shove and shelve is the syntax for specifying the storage
destination. Shove uses a URL syntax to indicate which back end should
be used. The format for each back end is described in the docstrings.

Listing 10

#!/usr/bin/env python
"""Tests with shove filesystem storage.
"""

import os
import shove
import tempfile
import threading
import unittest

from Listing6 import HTTPTestBase
from Listing8 import Cache

class CacheShoveTest(HTTPTestBase):

    def setUp(self):
        HTTPTestBase.setUp(self)
        self.shove_dirname = tempfile.mkdtemp('shove')
        return

    def tearDown(self):
        try:
            os.system('rm -rf %s' % self.shove_dirname)
        except AttributeError:
            pass
        HTTPTestBase.tearDown(self)
        return

    def test(self):
        # First fetch the data through the cache
        storage = shove.Shove('file://' + self.shove_dirname)
        try:
            fc = Cache(storage)
            parsed_data = fc.fetch(self.TEST_URL)
            self.failUnlessEqual(parsed_data.feed.title, 'CacheTest test data')
        finally:
            storage.close()

        # Now retrieve the same data directly from the shove store
        storage = shove.Shove('file://' + self.shove_dirname)
        try:
            modified, shelved_data = storage[self.TEST_URL]
        finally:
            storage.close()

        # The data should be the same
        self.failUnlessEqual(parsed_data, shelved_data)
        return


if __name__ == '__main__':
    unittest.main()

Using feedcache With Multiple Threads

Up to this point, all of the examples have been running in a single
thread driven by the unittest framework. Now that integrating
shove and feedcache has been shown to work, it is possible to
take a closer look at using multiple threads to fetch feeds, and build
a more complex example application. Spreading the work of fetching
data into multiple processing threads is more complicated, but yields
better performance under most circumstances because while one thread
is blocked waiting for data from the network, another thread can take
over and process a different URL.

Listing 11 shows a sample application which accepts URLs as arguments
on the command line and prints the titles of all of the entries in the
feeds. The results may be mixed together, depending on how the
processing control switches between active threads. This example
program is more like a traditional feed aggregator, since it processes
every entry of every feed.

Listing 11

#!/usr/bin/env python
"""Example use of feedcache.Cache combined with threads.
"""

import Queue
import sys
import shove
import threading

from Listing8 import Cache

MAX_THREADS=5
OUTPUT_DIR='/tmp/feedcache_example'


def main(urls=[]):

    if not urls:
        print 'Specify the URLs to a few RSS or Atom feeds on the command line.'
        return

    # Add the URLs to a queue
    url_queue = Queue.Queue()
    for url in urls:
        url_queue.put(url)

    # Add poison pills to the url queue to cause
    # the worker threads to break out of their loops
    for i in range(MAX_THREADS):
        url_queue.put(None)

    # Track the entries in the feeds being fetched
    entry_queue = Queue.Queue()

    print 'Saving feed data to', OUTPUT_DIR
    storage = shove.Shove('file://' + OUTPUT_DIR)
    try:

        # Start a few worker threads
        worker_threads = []
        for i in range(MAX_THREADS):
            t = threading.Thread(target=fetch_urls,
                                 args=(storage, url_queue, entry_queue,))
            worker_threads.append(t)
            t.setDaemon(True)
            t.start()

        # Start a thread to print the results
        printer_thread = threading.Thread(target=print_entries, args=(entry_queue,))
        printer_thread.setDaemon(True)
        printer_thread.start()

        # Wait for all of the URLs to be processed
        url_queue.join()

        # Wait for the worker threads to finish
        for t in worker_threads:
            t.join()

        # Poison the print thread and wait for it to exit
        entry_queue.put((None,None))
        entry_queue.join()
        printer_thread.join()

    finally:
        storage.close()
    return


def fetch_urls(storage, input_queue, output_queue):
    """Thread target for fetching feed data.
    """
    c = Cache(storage)

    while True:
        next_url = input_queue.get()
        if next_url is None: # None causes thread to exit
            input_queue.task_done()
            break

        feed_data = c.fetch(next_url)
        for entry in feed_data.entries:
            output_queue.put( (feed_data.feed, entry) )
        input_queue.task_done()
    return


def print_entries(input_queue):
    """Thread target for printing the contents of the feeds.
    """
    while True:
        feed, entry = input_queue.get()
        if feed is None: # None causes thread to exit
            input_queue.task_done()
            break

        print '%s: %s' % (feed.title, entry.title)
        input_queue.task_done()
    return


if __name__ == '__main__':
    main(sys.argv[1:])

The design uses queues to pass data between two different types of
threads to work on the feeds. Multiple threads use feedcache to
fetch feed data. Each of these threads has its own Cache, but they
all share a common shove store. A single thread waits for the feed
entries to be added to its queue, and then prints each feed title and
entry title.

The main() function sets up two different queues for passing data
in and out of the worker threads. The url_queue
contains the URLs for feeds, taken from the command line arguments.
The entry_queue is used to pass feed content from the
threads that fetch the feeds to the thread that prints the results. A
shove filesystem store is used to cache the feeds. Once
all of the worker threads are started, the rest of the
main program simply waits for each stage of the work to be completed
by the threads.

The last entries added to the url_queue are None values, which
trigger the worker threads to exit. When the url_queue has been
drained, the worker threads can be cleaned up. After the
worker threads have finished, (None, None) is added to the
entry_queue to trigger the printing thread to exit when all of the
entries have been printed.

The fetch_urls() function runs in the worker
threads. It takes one feed URL at a time from the input queue,
retrieves the feed contents from a cache, then adds the feed entries
to the output queue. When the item taken out of the queue is None
instead of a URL string, it is interpreted as a signal that the thread
should break out of its processing loop. Each thread running
fetch_urls() creates a local Cache instance using a common
storage back end. Sharing the storage ensures that all of the feed
data is written to the same place, while creating a local Cache
instance ensures threads can fetch data in parallel.

The consumer of the queue of entries is print_entries(). It takes
one entry at a time from the queue and prints the
feed and entry titles. Only one thread runs print_entries(), but a
separate thread is used so that output can be produced as soon as
possible, instead of waiting for all of the fetch_urls() threads
to complete before printing the feed contents.

Running the program produces output similar to the example in Listing
3:

$ python Listing11.py http://feeds.feedburner.com/FeedcacheReleases
Saving feed data to /tmp/feedcache_example
feedcache Releases: feedcache 0.1
feedcache Releases: feedcache 0.2
feedcache Releases: feedcache 0.3
feedcache Releases: feedcache 0.4
feedcache Releases: feedcache 0.5

The difference is that it takes much less time to run the program in
Listing 11 when multiple feeds are passed on the command line, and
when some of the data has already been cached.

Future Work

The current version of feedcache meets most of the requirements
for CastSampler, but there is still room to improve it as a
general purpose tool. It would be nice if it offered finer control
over the length of time data stays in the cache, for example. And,
although shove is a completely separate project, feedcache
would be more reliable if shove's file storage used file
locking, to prevent corruption when two threads or processes write to
the same part of the cache at the same time.

Determining how long to hold the data in a cache can be a tricky
problem. With web content such as RSS and Atom feeds, the web server
may offer hints by including explicit expiration dates or caching
instructions. HTTP headers such as Expires and Cache-Control
can include details beyond the Last-Modified and ETag values
already being handled by the Cache. If the server uses additional
cache headers, feedparser saves the associated values in the
FeedParserDict. To support the caching hints, feedcache would
need to be enhanced to understand the rules for the Cache-Control
header, and to save the expiration time as well as the time-to-live
for each feed.
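
As a rough illustration of the idea, a future version might derive a
per-feed time-to-live from the max-age directive. This is only a
sketch, not part of the current feedcache code, and it assumes the
raw HTTP headers are available on the parsed result (feedparser
stores them under the headers key when the feed is fetched over HTTP):

import re

def ttl_from_headers(parsed_result, default_ttl=300):
    """Hypothetical helper: honor Cache-Control max-age when present."""
    headers = parsed_result.get('headers', {})
    cache_control = headers.get('cache-control',
                                headers.get('Cache-Control', ''))
    match = re.search(r'max-age=(\d+)', cache_control)
    if match:
        return int(match.group(1))
    return default_ttl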

Supporting a separate time-to-live value for each feed would let
feedcache use a different refresh throttle for different
sites. Data from relatively infrequently updated feeds, such as
Slashdot, would stay in the cache longer than data from more
frequently updated feeds, such as a Twitter feed. Applications that
use feedcache in a more traditional way would be able to adjust
the update throttle for each feed separately to balance the freshness
of the data in the cache and the load placed on the server.

Conclusions

Original sources of RSS and Atom feeds are being created all the time
as new and existing applications expose data for syndication. With the
development of mash-up tools such as Yahoo! Pipes and Google’s Mashup
Editor, these feeds can be combined, filtered, and expanded in new and
interesting ways, creating even more sources of data. I hope this
article illustrates how building your own applications to read and
manipulate syndication feeds in Python with tools like feedparser and
feedcache is easy, even while including features that make your
program cooperate with servers to manage load.

I would like to offer a special thanks to Mrs. PyMOTW for her help
editing this article.

Converting podcasts to regular tracks in iTunes

I have spent the better part of the morning trying to work out how to
convert podcasts to “regular” tracks in iTunes, so they would show up in
shuffle, etc. Mostly this was for my collection of Jonathan Coulton
“Thing a Week” episodes, but it would be useful for anything you
wanted to move out of your podcast list into the main audio portion of
the library. I suppose the reason it took so long to find the solution
is I started by searching for it on Google instead of just looking
through the iTunes menu options, though as you will see the solution
wasn’t immediately obvious even once I had found it.

I knew that iTunes would let me change settings like “Remember
playback position” and “Skip when shuffling” from the info dialog
(Cmd-I), so I started there. The only setting that even mentioned
podcast was the genre, and I already knew that was not what I needed to
change. Some of the tracks already had “real” genres and only some were
set to “Podcast”.

I also knew there is a separate podcast flag available for queries in
Smart Lists and AppleScript, since I used it to create my “Active Queue”
podcast playlist with selections of episodes from various podcast series
(it’s like creating your own mix tape, but for talk radio). I tried to
write a simple AppleScript to change the podcast flag of selected tracks
to true. It turns out the flag is a read-only attribute. No amount of
searching uncovered any way to change the setting using AppleScript.

The first useful-looking suggestion I ran into was to convert the
ID3 tag format to an older version, then convert it back. Doing that
erased most of the comments and other meta-data associated with the
tracks, though, so I didn’t like the results.

Next I found a few forum and blog posts that talked about an
ITUNESPODCAST setting in the extended ID3 tags. They all mentioned a
Windows program for removing or changing the flag, though. I examined a
few of the files with Ned Batchelder’s python module id3reader, but
didn’t see anything that looked like “ITUNESPODCAST” in the output.

Going back to Google, I finally found a reference to converting the
files to AAC using an option in the Advanced menu. That seemed like
overkill, but at this point I was becoming fed up and just wanted to be
done with the whole thing. I could always re-encode as MP3, after all.
Well, iTunes didn’t have a menu option to “Encode as AAC”. It did have
“Convert selection to MP3”, which didn’t make much sense to me. As far
as I knew, the tracks were already MP3 files. But lo and behold,
selecting that menu option did enable them in the iTunes Music Library.
It made copies of all of the tracks as it converted them, so I could
even delete the podcast subscription.

So, if you want to add podcast episodes you have already downloaded to
your music library and turn off the podcast flag, select the track and
choose Advanced->Convert selection to MP3.

PyMOTW: Queue

The Queue module provides a FIFO implementation suitable for
multi-threaded programming. It can be used to pass messages or other
data between producer and consumer threads safely. Locking is handled
for the caller, so it is simple to have as many threads as you want
working with the same Queue instance. A Queue’s size (number of
elements) may be restricted to throttle memory usage or processing.
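
A minimal producer/consumer sketch (my own example, not taken from
the article) looks like this:

import Queue
import threading

q = Queue.Queue(maxsize=10)       # bound the size to throttle memory usage

def consumer():
    while True:
        item = q.get()
        if item is None:          # None signals the consumer to stop
            break
        print 'received', item

t = threading.Thread(target=consumer)
t.start()

for i in range(5):
    q.put('message %d' % i)       # producer side
q.put(None)
t.join()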

Read more at pymotw.com: Queue

Things to Do

In no particular order:

  1. Cull my Google Reader subscriptions. 364 is too many.
  2. Finish reading Dreaming in Code.
  3. Add tagging support to codehosting.
  4. Verify all of the domains under my control with Google Web Master
    tools.
  5. Create a Trac plugin for code reviews based on the process we use
    at work.
  6. Change the monitor feeds on CastSampler.com so they do not include
    items without enclosures.
  7. Enhance BlogBackup to save enclosures and images linked from blog
    posts.
  8. Write a tool to convert an m3u file to an RSS/Atom feed for
    Patrick so he will set up a podcast of his demo recordings.
  9. Improve AppleScript support in Adium.
  10. Add support to Adium for notifications when a screen name appears in
    a chat message.

CastSampler.com monitoring feeds

On the plane back from Phoenix this week, I implemented some changes
to the way CastSampler.com republishes feeds for the sites a user
subscribes to. The user page used to link directly to the original feed
so it would be easy to copy it to a regular RSS reader to keep up to
date on new shows. That link has been replaced with a “monitor” feed
which uses the original description and title for each item, but
replaces the link with a new URL that causes the show to be added to
your CastSampler queue. The user page still links to the original home
page for the feed, so I think I am doing enough as far as attribution
and advertisement. Any author information included in the original feed
is also passed through to the monitor feed. The OPML file generated for
a user’s feeds links to these “monitor” feeds instead of the original
source, too.

The goal of these changes is to make it easy to use a feed-reader such
as Bloglines or Google Reader to monitor podcasts from CastSampler. To
add an episode to your queue, just click the link in the monitor feed to
be directed to the appropriate CastSampler.com page.

By the way, how cool is it to be able to develop a web app on my
Powerbook while I’m on a plane? What an age to be alive.

CastSampler.com

My most recent project is CastSampler.com, a tool for building a
personal “mix-tape” style podcast. I tend to listen to one or two
episodes from a lot of different shows, so I don’t want to subscribe to
the full show feed. Instead, I add the show to my CastSampler list, then
I can add only those episodes that I want to my personal feed.

I have plenty of work left to do, but the basic features all work now
so I would love to get some feedback.