New project: feedcache

Originally posted Aug 5, 2007

Back in mid-June I promised Jeremy Jones that I would clean up some of the code I use for CastSampler.com to cache RSS and Atom feeds so he could look at it for his podgrabber project. I finally found some time to work on it this weekend (not quite 2 months later, sorry Jeremy).

The result is feedcache, which I have released in “alpha” status, for now. I don’t usually bother releasing my code in alpha state, because that usually means I’m not actually using it anywhere with enough regularity to ensure that it is robust. I am going ahead and releasing feedcache early because I am hoping for some feedback on the API. I realized that the way I cache feeds for CastSampler.com is not the way all applications will want to cache them, so the design might be biased.

The Design

There are two aspects to caching the feed data: the high-level code that knows it is working with RSS or Atom feeds, and the low-level code that saves the data with a timestamp. The high-level Cache class is responsible for fetching, updating, and expiring feed content. The low-level storage classes are responsible for saving and restoring feed content.

Since the storage handling is separated from the cache management, it is possible to adapt the Cache to whatever sort of storage option might work best for you. So far, I have implemented two backend storage options. MemoryStorage keeps everything in memory, and is mostly useful for testing. ShelveStorage uses the shelve module to store all of the feed data in one file using pickles. I hope that the API for the backend storage manager is simple enough to make it easy for you to tie in your own backend if neither of these options is appealing. Something that uses memcached would be very interesting, for example.
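
To give a sense of how small such a backend could be, here is a minimal sketch of an in-memory storage class that saves data with a timestamp. The class name DictStorage and the get/set method names are assumptions made for illustration. They are not the actual feedcache storage API.

```python
import time


class DictStorage(object):
    """Hypothetical in-memory backend: maps a URL to (timestamp, data)."""

    def __init__(self):
        self._data = {}

    def get(self, url):
        # Return (timestamp, data), or (None, None) if nothing is stored.
        return self._data.get(url, (None, None))

    def set(self, url, data, now=None):
        # Save the data along with the time it was stored.
        if now is None:
            now = time.time()
        self._data[url] = (now, data)
```

A backend built on shelve or memcached could expose the same two operations and only change where the (timestamp, data) pair ends up.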

The Cache class uses a fairly simple algorithm to decide if it needs to update the stored data:

1. If there is nothing stored for the URL, fetch the data.
2. If there is something stored for the URL and its time-to-live has not passed, use that data. (This throttles repeated requests for the same feed content.)
3. If the stored data has expired, use any available ETag and modification time header data to perform a conditional GET of the data. If new data is returned, update the stored data. If no new data is returned, update the time-to-live for the stored data and return what is stored.
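
The decision steps above can be sketched roughly as follows. This is not the feedcache implementation. The function name get_feed, the plain dict used as storage, the ttl default, and the fetch callable (which returns None when the server reports no change) are all assumptions chosen to keep the sketch self-contained.

```python
import time


def get_feed(url, storage, fetch, ttl=300, now=None):
    """Return feed data for url, fetching only when the stored copy is stale.

    storage: dict mapping url -> (stored_at, data); a stand-in for a backend.
    fetch:   hypothetical callable fetch(url, etag, modified) returning new
             data, or None when the server reports nothing has changed.
    """
    if now is None:
        now = time.time()

    entry = storage.get(url)
    if entry is None:
        # Nothing stored: fetch unconditionally.
        data = fetch(url, None, None)
        storage[url] = (now, data)
        return data

    stored_at, data = entry
    if now - stored_at < ttl:
        # Still fresh: reuse the stored copy without touching the network.
        return data

    # Expired: conditional GET using whatever validators were saved.
    new_data = fetch(url, data.get('etag'), data.get('modified'))
    if new_data is not None:
        data = new_data
    # Either way, restart the time-to-live clock.
    storage[url] = (now, data)
    return data
```

With feedparser, a fetch callable along these lines could wrap feedparser.parse(url, etag=..., modified=...) and treat a result whose status is 304 as "no new data".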

The feed data is retrieved and parsed by Mark Pilgrim’s feedparser module, so the Cache really does just manage the contents of the backend storage.

Another benefit of separating the cache manager from the storage handler is that only the storage handler needs to be thread-safe. The storage handler is given to each Cache as an argument to the constructor. In a multi-threaded app, each thread can have its own Cache (which does the fetching, when needed) and share a single backend storage handler.
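
As a rough illustration of that arrangement, the sketch below shares one lock-protected storage object among several threads, each of which builds its own cache manager. LockedStorage, TinyCache, worker, and run are hypothetical stand-ins written for this example, not feedcache classes, and the "fetch" is only a placeholder string.

```python
import threading


class LockedStorage(object):
    """Hypothetical thread-safe backend shared by every thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value


class TinyCache(object):
    """Stand-in for a per-thread cache manager; holds no shared state itself."""

    def __init__(self, storage):
        self.storage = storage

    def __getitem__(self, url):
        data = self.storage.get(url)
        if data is None:
            data = 'contents of ' + url  # placeholder for a real fetch
            self.storage.set(url, data)
        return data


def worker(storage, urls):
    cache = TinyCache(storage)  # one cache manager per thread
    for url in urls:
        cache[url]


def run(urls, num_threads=4):
    storage = LockedStorage()  # one storage handler for the whole app
    threads = [threading.Thread(target=worker, args=(storage, urls))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return storage
```

The lock here only protects individual reads and writes, so two threads may still fetch the same URL at the same moment. That costs an extra request but does not corrupt the stored data.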

Example

Here is a simple example program that uses a shelve file for storage. The example does not use multiple threads, but should still illustrate how to use the cache.

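# Assumed import path; adjust to match how feedcache is installed.
from feedcache import cache, shelvestorage

def main(urls=()):
    print('Saving feed data to ./.feedcache')
    storage = shelvestorage.ShelveStorage('.feedcache')
    storage.open()
    try:
        fc = cache.Cache(storage)
        for url in urls:
            # Indexing the cache by URL returns the parsed feed,
            # fetching or refreshing it first if necessary.
            parsed_data = fc[url]
            print(parsed_data.feed.title)
            for entry in parsed_data.entries:
                print('\t' + entry.title)
    finally:
        # Always close the shelve file, even if a fetch fails.
        storage.close()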

Additional Work

This project is still a work in progress, but I would appreciate any feedback you have, good or bad. And of course, report bugs if you find them!