Caching RSS Feeds With feedcache

The past several years have seen a steady increase in the use of RSS
and Atom feeds for data sharing. Blogs, podcasts, social networking
sites, search engines, and news services are just a few examples of data
sources delivered via such feeds. Working with internet services
requires care, because inefficiencies in one client implementation may
cause performance problems with the service that can be felt by all of
the consumers accessing the same server. In this article, I describe the
development of the feedcache package, and give examples of how you can
use it to optimize the use of data feeds in your application.

New project: feedcache

Back in mid-June I promised Jeremy Jones that I would clean up some
of the code I use for CastSampler.com to cache RSS and Atom feeds so
he could look at it for his podgrabber project. I finally found some
time to work on it this weekend (not quite 2 months later, sorry
Jeremy).

The result is feedcache, which I have released in “alpha” status for
now. I don’t usually release my code in an alpha state, because that
label usually means I’m not using it anywhere with enough regularity
to ensure that it is robust. I am going ahead and
releasing feedcache early because I am hoping for some feedback on the
API. I realized that the way I cache feeds for CastSampler.com is not
the way all applications will want to cache them, so the design might be
biased.

The Design

There are two aspects to caching feed data: high-level code that knows
it is working with RSS or Atom feeds, and low-level code that saves the
data with a timestamp. The high-level Cache class is responsible for
fetching, updating, and expiring feed content. The low-level storage
classes are responsible for saving and restoring feed content.

Since the storage handling is separated from the cache management, it
is possible to adapt the Cache to whatever sort of storage option might
work best for you. So far, I have implemented two backend storage
options. MemoryStorage keeps everything in memory, and is mostly useful
for testing. ShelveStorage uses the shelve module to store all
of the feed data in one file using pickles. I hope that the API for the
backend storage manager is simple enough to make it easy for you to tie
in your own backend if neither of these options is appealing. Something
that uses memcached would be very interesting, for example.
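
To give a sense of the backend API’s shape, here is a rough sketch of a
custom in-memory backend. The open() and close() methods mirror how
ShelveStorage is used in the example below; the get() and set() names
are my guesses for illustration, so check the real storage classes for
the actual interface.

class DictStorage:
    """Sketch of a custom backend that keeps feed data in a dict.

    The get/set names are illustrative guesses, not necessarily the
    real feedcache storage interface.
    """

    def __init__(self):
        self._data = {}

    def open(self):
        # Nothing to set up for an in-memory store.
        pass

    def close(self):
        # Nothing to release, either.
        pass

    def get(self, url):
        # Return the record stored for the URL, or None if not cached.
        return self._data.get(url)

    def set(self, url, record):
        self._data[url] = record

A memcached backend would follow the same pattern, delegating get() and
set() to the memcached client.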

The Cache class uses a fairly simple algorithm to decide whether it
needs to update the stored data (a sketch of the same logic appears
after this list):

  1. If there is nothing stored for the URL, fetch the data.
  2. If there is something stored for the URL and its time-to-live has not
    passed, use that data. (This throttles repeated requests for the same
    feed content.)
  3. If the stored data has expired, use any available ETag and
    modification time header data to perform a conditional GET of the
    data. If new data is returned, update the stored data. If no new data
    is returned, update the time-to-live for the stored data and return
    what is stored.
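
The real logic lives inside Cache, but here is a minimal sketch of the
same algorithm written directly against feedparser, which supports
conditional GETs through its etag and modified arguments. The fetch()
name, the one-hour TTL, and the record layout are all assumptions made
for illustration.

import time

import feedparser

TTL = 60 * 60  # one hour; an arbitrary time-to-live for this sketch

def fetch(url, stored=None):
    """Sketch of the update algorithm; not the actual Cache code."""
    now = time.time()
    # Step 2: if the stored copy is still fresh, skip the network.
    if stored is not None and now - stored['timestamp'] < TTL:
        return stored
    # Steps 1 and 3: pass any saved ETag and modification time so the
    # server can answer 304 Not Modified instead of resending the feed.
    etag = stored.get('etag') if stored else None
    modified = stored.get('modified') if stored else None
    parsed = feedparser.parse(url, etag=etag, modified=modified)
    if stored is not None and getattr(parsed, 'status', None) == 304:
        # No new data: just refresh the time-to-live on the stored copy.
        stored['timestamp'] = now
        return stored
    # First fetch, or new data: build a fresh record to store.
    return {'timestamp': now,
            'etag': parsed.get('etag'),
            'modified': parsed.get('modified'),
            'feed': parsed,
            }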

The feed data is retrieved and parsed by Mark Pilgrim’s feedparser
module, so the Cache really does just manage the contents of the
backend storage.

Another benefit of separating the cache manager from the storage
handler is that only the storage handler needs to be thread-safe. The storage
handler is given to each Cache as an argument to the constructor. In a
multi-threaded app, each thread can have its own Cache (which does the
fetching, when needed) and share a single backend storage handler.
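
In code, the pattern described above might look something like this
sketch, which assumes the same cache and shelvestorage modules as the
example below and uses placeholder URLs:

import threading

from feedcache import cache, shelvestorage  # assumed package layout

def worker(storage, urls):
    # Each thread builds its own Cache around the shared storage handler.
    fc = cache.Cache(storage)
    for url in urls:
        fc[url]  # fetch or refresh the feed as needed

storage = shelvestorage.ShelveStorage('.feedcache')
storage.open()
try:
    groups = [['http://example.com/feed1.xml'],  # placeholder URLs
              ['http://example.com/feed2.xml']]
    threads = [threading.Thread(target=worker, args=(storage, g))
               for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
finally:
    storage.close()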

Example

Here is a simple example program that uses a shelve file for storage.
The example does not use multiple threads, but should still illustrate
how to use the cache.

import sys

# Assumes the cache and shelvestorage modules from the feedcache
# package are importable like this; adjust for the actual layout.
from feedcache import cache, shelvestorage

def main(urls=[]):
    print 'Saving feed data to ./.feedcache'
    storage = shelvestorage.ShelveStorage('.feedcache')
    storage.open()
    try:
        fc = cache.Cache(storage)
        for url in urls:
            parsed_data = fc[url]
            print parsed_data.feed.title
            for entry in parsed_data.entries:
                print '\t', entry.title
    finally:
        storage.close()
    return

if __name__ == '__main__':
    main(sys.argv[1:])
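
Run with one or more feed URLs as arguments, the script prints each
feed’s title followed by its entry titles, and leaves the parsed
results in the .feedcache file so an immediate second run can be served
from the cache instead of the network.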

Additional Work

This project is still a work in progress, but I would appreciate any
feedback you have, good or bad. And of course, report bugs if you find
them!

Things to Do

In no particular order:

  1. Cull my Google Reader subscriptions. 364 is too many.
  2. Finish reading Dreaming in Code.
  3. Add tagging support to codehosting.
  4. Verify all of the domains under my control with Google Web Master
    tools.
  5. Create a Trac plugin for code reviews based on the process we use
    at work.
  6. Change the monitor feeds on CastSampler.com so they do not include
    items without enclosures.
  7. Enhance BlogBackup to save enclosures and images linked from blog
    posts.
  8. Write a tool to convert an m3u file to an RSS/Atom feed for
    Patrick so he will set up a podcast of his demo recordings.
  9. Improve AppleScript support in Adium.
  10. Add support to Adium for notifications when a screen name appears in
    a chat message.

CastSampler.com monitoring feeds

On the plane back from Phoenix this week, I implemented some changes
to the way CastSampler.com republishes feeds for the sites a user
subscribes to. The user page used to link directly to the original feed
so it would be easy to copy it to a regular RSS reader to keep up to
date on new shows. That link has been replaced with a “monitor” feed
which uses the original description and title for each item, but
replaces the link with a new URL that causes the show to be added to
your CastSampler queue. The user page still links to the original home
page for the feed, so I think I am doing enough in the way of attribution
and advertising. Any author information included in the original feed
is also passed through to the monitor feed. The OPML file generated for
a user’s feeds links to these “monitor” feeds instead of the original
source, too.

The goal of these changes is to make it easy to use a feed reader such
as Bloglines or Google Reader to monitor podcasts from CastSampler. To
add an episode to your queue, just click the link in the monitor feed to
be directed to the appropriate CastSampler.com page.

By the way, how cool is it to be able to develop a web app on my
PowerBook while I’m on a plane? What an age to be alive.

feed auto-discovery

I added feed auto-discovery to CastSampler.com today. It was pretty
easy using the feedfinder.py module, except for one small
problem. Something about the timelimit() decorator in that module
causes problems with Django or mod_python (probably
mod_python). When timelimit() is enabled, the finder either produces
no URLs at all or raises an exception about “unmarshalling code objects” in a
“restricted execution environment.” It works great in my development
environment, which does not use mod_python. To get it to work in
production, I disabled the timelimit() decorator. I hope that does not
come back to bite me in the future.
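
For reference, the discovery call itself is tiny; feedfinder exposes a
feeds() function that returns the feed URLs found at a page (the URL
below is a placeholder):

import feedfinder

# feeds() returns the feed URLs discovered at the given page.
for feed_url in feedfinder.feeds('http://www.example.com/blog/'):
    print feed_url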

CastSampler.com

My most recent project is CastSampler.com, a tool for building a
personal “mix-tape” style podcast. I tend to listen to one or two
episodes from a lot of different shows, so I don’t want to subscribe to
the full show feed. Instead, I add the show to my CastSampler list, then
I can add only those episodes that I want to my personal feed.

I have plenty of work left to do, but the basic features all work now,
so I would love to get some feedback.