Back in mid-June I promised Jeremy Jones that I would clean up some
of the code I use for CastSampler.com to cache RSS and Atom feeds so
he could look at it for his podgrabber project. I finally found some
time to work on it this weekend (not quite two months later, sorry
for the delay).
The result is feedcache, which I have released in “alpha” status
for now. I don’t usually bother releasing my code in alpha state,
because that usually means I’m not actually using it anywhere with
enough regularity to ensure that it is robust. I am going ahead and
releasing feedcache early because I am hoping for some feedback on the
API. I realized that the way I cache feeds for CastSampler.com is not
the way all applications will want to cache them, so the design might
need to change.
There are two aspects to caching the feed data: high level code that
knows it is working with RSS or Atom feeds, and low level code that
saves the data with a timestamp. The high level Cache class is
responsible for fetching, updating, and expiring feed content. The low
level storage classes are responsible for saving and restoring that
content.
Since the storage handling is separated from the cache management, it
is possible to adapt the Cache to whatever sort of storage option might
work best for you. So far, I have implemented two backend storage
options. MemoryStorage keeps everything in memory, and is mostly useful
for testing. The ShelveStorage option uses the shelve module to store all
of the feed data in one file using pickles. I hope that the API for the
backend storage manager is simple enough to make it easy for you to tie
in your own backend if neither of these options is appealing. Something
that uses memcached would be very interesting, for example.
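To give a feel for what a custom backend might involve, here is a
minimal in-memory sketch. The method names are inferred from the
ShelveStorage example later in this post (open(), close(), and
dict-style item access), not taken from the actual feedcache source,
so treat them as assumptions about the storage API.

```python
class DictStorage:
    """Illustrative storage backend that keeps cached feed
    records in a plain dictionary.

    The interface (open/close plus dict-style access) is an
    assumption based on how ShelveStorage is used, not the
    real feedcache backend API.
    """

    def __init__(self):
        self._data = {}

    def open(self):
        # Nothing to set up for an in-memory store.
        pass

    def close(self):
        # Nothing to persist; drop the cached records.
        self._data.clear()

    def __getitem__(self, url):
        return self._data[url]

    def __setitem__(self, url, record):
        self._data[url] = record

    def __contains__(self, url):
        return url in self._data
```

A memcached-backed class could follow the same shape, replacing the
dictionary operations with client calls.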
The Cache class uses a fairly simple algorithm to decide if it needs
to update the stored data:
- If there is nothing stored for the URL, fetch the data.
- If there is something stored for the URL and its time-to-live has not
passed, use that data. (This throttles repeated requests for the same
feed.)
- If the stored data has expired, use any available ETag and
modification time header data to perform a conditional GET of the
data. If new data is returned, update the stored data. If no new data
is returned, update the time-to-live for the stored data and return
what is stored.
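The three rules above can be sketched as a pure decision function. The
record layout (a timestamp paired with the parsed feed) and the ttl
parameter are illustrative assumptions for this sketch, not feedcache’s
actual internals.

```python
def decide(record, now, ttl):
    """Return which action the cache should take for a URL.

    record -- None, or an assumed (stored_time, parsed_feed) pair
    now    -- current time in seconds
    ttl    -- time-to-live in seconds (illustrative parameter)
    """
    if record is None:
        # Rule 1: nothing stored, do a full fetch.
        return 'fetch'
    stored_time, _parsed = record
    if now - stored_time < ttl:
        # Rule 2: still fresh, reuse the stored data.
        return 'use_cached'
    # Rule 3: expired, revalidate with a conditional GET using
    # any saved ETag / Last-Modified header values; a 304
    # response means the stored data is still good.
    return 'conditional_get'
```

The conditional GET itself maps onto feedparser, which accepts etag
and modified arguments to feedparser.parse() and reports a 304 status
when the server says nothing has changed.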
The feed data is retrieved and parsed by Mark Pilgrim’s feedparser
module, so the Cache really does just manage the contents of the
storage.
Another benefit of separating the cache manager from the storage
handler is that only the storage handler needs to be thread-safe. The storage
handler is given to each Cache as an argument to the constructor. In a
multi-threaded app, each thread can have its own Cache (which does the
fetching, when needed) and share a single backend storage handler.
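Here is a minimal sketch of that sharing pattern. LockedStorage is an
illustrative stand-in for a real thread-safe backend, and the worker
just touches the shared store directly; in the real library each
thread would wrap the storage in its own Cache instance.

```python
import threading

class LockedStorage:
    """Illustrative thread-safe store: a dict guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def __getitem__(self, url):
        with self._lock:
            return self._data[url]

    def __setitem__(self, url, record):
        with self._lock:
            self._data[url] = record

def worker(storage, url, results):
    # Stand-in for per-thread Cache work: write to and read from
    # the shared storage handler.
    storage[url] = 'parsed-feed-for-%s' % url
    results.append(storage[url])

storage = LockedStorage()
results = []
threads = [threading.Thread(target=worker, args=(storage, u, results))
           for u in ('http://example.com/a', 'http://example.com/b')]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread gets its own front-end object while all of them read and
write through the single locked store.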
Here is a simple example program that uses a shelve file for storage.
The example does not use multiple threads, but should still illustrate
how to use the cache.
    from feedcache import cache
    from feedcache import shelvestorage

    def main(urls=[]):
        print 'Saving feed data to ./.feedcache'
        storage = shelvestorage.ShelveStorage('.feedcache')
        storage.open()
        try:
            fc = cache.Cache(storage)
            for url in urls:
                parsed_data = fc[url]
                print parsed_data.feed.title
                for entry in parsed_data.entries:
                    print '\t', entry.title
        finally:
            storage.close()
        return
This project is still a work in progress, but I would appreciate any
feedback you have, good or bad. And of course, report bugs if you find
them.