PyMOTW: Call for input

Tomorrow’s post will cover the ConfigParser module. Beyond that, I
have a few more weeks planned out, and am looking for suggestions for
which modules to cover next.

If you were stranded on a desert island, which standard library module
would you want, and why?

PyMOTW: Python Module of the Week

I am starting a new series of posts today that I am calling “Python
Module of the Week” (PyMOTW). I have two goals for this:

  1. to work my way through the standard library and learn something about
    each module
  2. to get into the habit of posting to the blog more regularly

I will cheat a little, and start out with some modules that I already
know something about. Hopefully that will give me enough of a head-start
that I can keep up a fairly regular flow.


Converting Python source to HTML

For my PyMOTW series, I have found that I want to convert a lot of
Python source code to HTML. In a perfect world it would be easy for me
to produce pretty XML/HTML and use CSS, but it is not obvious how to use
CSS from Blogger. Instead, I am using a CLI app based on this ASPN
recipe, which produces HTML snippets that I can paste directly into a
new blog post. The output HTML is more verbose than I wanted, but I like
the fact that it has no external dependencies.
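
For flavor, here is a minimal sketch of the same general technique (my
own illustration, not the ASPN recipe): escape the source and wrap
Python keywords with inline styles, so the output needs no stylesheet.

    # A minimal sketch, not the ASPN recipe: highlight keywords with
    # inline styles so the HTML snippet has no CSS dependency.
    # (Tokenizing is skipped, so keywords inside string literals are
    # highlighted too.)
    import html
    import keyword
    import re

    KEYWORD_RE = re.compile(r'\b(%s)\b' % '|'.join(keyword.kwlist))

    def source_to_html(source):
        escaped = html.escape(source)
        highlighted = KEYWORD_RE.sub(
            r'<span style="color: #204a87; font-weight: bold;">\1</span>',
            escaped)
        return '<pre>%s</pre>' % highlighted

    print(source_to_html('for name in names:\n    print(name)\n'))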

If you have any alternatives, I would appreciate hearing about them.

PyMOTW: fileinput

To start this series, let’s take a look at the fileinput module,
a very useful module for creating command line programs that process
text files in a filter-ish manner. For example, I used it in the
m3utorss app I recently wrote for my friend Patrick to convert some of
his demo recordings into a podcastable format.
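
As a taste of what the module does, a minimal filter might look like
this sketch, which echoes every input line prefixed with its origin:

    # A minimal fileinput-based filter: read the files named on the
    # command line (or stdin if none are given) and prefix each line
    # with its file name and line number.
    import fileinput

    for line in fileinput.input():
        print('%s:%d: %s' % (fileinput.filename(),
                             fileinput.filelineno(),
                             line.rstrip('\n')))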

Read more at pymotw.com: fileinput

Distributing django applications

I had a report that version 1.2 of my codehosting package did not
include all of the required files. It turns out I messed up the setup.py
file and left out the templates, images, and CSS files. Oops.

In the process of trying to fix the setup file, I discovered that
distutils does not include package data in the sdist output. That was
not a big deal, since I just created a MANIFEST.in file to work around it.
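
The workaround amounts to a few directives along these lines (the
paths here are illustrative, not the package's actual layout):

    # MANIFEST.in -- illustrative paths, not the real layout
    recursive-include codehosting/templates *.html
    recursive-include codehosting/media *.css *.png *.gif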

My next challenge (for this project) is how to write the templates in
a way that would let anyone actually reuse them. For example, the
project details page shows info about the most current release and a
complete release history. It uses a two-column layout for that, but the
way I have it implemented, the layout is defined in the base template
for my site. I want to move that layout from the site base template
down into the application base template, but I do not want to repeat
myself if I can avoid it. Maybe I need to get over that and just repeat
the layout instructions, or refactor the site base template somehow.
Obviously that needs more thought. I did find some useful advice in
DosAndDontsForApplicationWriters, but have not implemented all of
those suggestions.
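
As a sketch of the direction I mean (the template and block names here
are hypothetical), the application base template would own the
two-column layout and leave hooks for each page to fill:

    {% extends "site_base.html" %}
    {# app_base.html: a hypothetical application base template #}
    {# that owns the two-column layout instead of the site base. #}
    {% block content %}
      <div class="left-column">{% block left %}{% endblock %}</div>
      <div class="right-column">{% block right %}{% endblock %}</div>
    {% endblock %}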

In the meantime, release 1.4 of codehosting is more flexible than the
previous releases and is probably closer to something useful for people
other than me.

[Updated 28 Sept 2007 to correct typo in title]

How NOT to Back Up a Blogger Blog

Over at the Google Operating System blog, they offer a way to
“backup” your blog. It is mostly a manual hack to load the entire blog
into one page in a web browser, then save the resulting HTML, though a
similar technique is offered for saving the contents of your XML feed.

There are a few problems with this technique:

  1. It depends on knowing how many posts are in the blog, up front.
  2. The steps and tools given are manual.
  3. Comments are handled separately.

A backup needs to be automated. If I have to remember to do something
by hand, it isn’t going to be done on a regular basis. I want to add to
my blog without worrying about how many posts there are and tweaking
some backup procedure that depends on knowing all about the content of
the blog up front. I want comments saved automatically along with each
post, not in one big lump. And if I need to import the data into a
database, I want the backup format to support parsing the data easily.

What to do?

Enter BlogBackup, the unimaginatively named, fully automatic
backup software for your blog. Just point the command line tool at your
blog feed and a directory where the backup output should go. It will
automatically perform a full backup, including:

  1. Every blog post is saved to a separate file in an easily parsable
    format, including all of the meta-data provided by the feed
    (categories, tags, publish dates, author, etc.).
  2. Comments are saved in separate directories, organized around the post
    with which they are associated. Comments also include all of their
    meta-data.
  3. The content of blog posts and comments is copied to a separate text
    file for easy indexing by desktop search tools such as Spotlight.
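
To illustrate the general approach (the shape of the idea only, not
BlogBackup's actual code), a feed-driven backup loop can be sketched
with the third-party feedparser package:

    # The general shape of a feed-driven backup, not BlogBackup's
    # actual implementation. Requires the feedparser package.
    import os
    import feedparser

    def backup_feed(feed_url, output_dir):
        os.makedirs(output_dir, exist_ok=True)
        for entry in feedparser.parse(feed_url).entries:
            # One file per post, named from the entry id so that
            # re-running the backup overwrites instead of duplicating.
            name = entry.id.replace('/', '_').replace(':', '_')
            with open(os.path.join(output_dir, name + '.txt'), 'w') as f:
                f.write(entry.title + '\n\n')
                f.write(entry.get('summary', ''))

    backup_feed('http://example.com/feeds/posts/default', 'blog-backup')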

Since the tool is a command line program, it is easy to automate with
cron or a similar scheduling tool. Since it is fully automatic and reads
the feed itself, you do not need to reconfigure it as your blog grows.
And the data is stored in a format that makes it easy to parse and
load into another database of some sort.

So, go forth and automate.

Better Blogger backups

I have enhanced the blog backup script I wrote a while back to
automatically find and include comments feeds, so comments are now
archived along with the original feed data. The means for recognizing
“comments” feeds may make the script work only with blogger.com,
since it depends on having “comments” in the URL. It does what I need
for now, though.
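
The heuristic itself is about as simple as it sounds; roughly (a
sketch, not the script's exact code):

    # Roughly the recognition rule described above: a linked feed
    # with "comments" in its URL is treated as a comments feed.
    def is_comments_feed(url):
        return 'comments' in url

    print(is_comments_feed(
        'http://example.blogspot.com/feeds/123/comments/default'))  # True
    print(is_comments_feed(
        'http://example.blogspot.com/feeds/posts/default'))  # False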

testing regular expressions

I discovered Christof Hoeke’s retest program today. This is a very
slick use of Python’s standard library HTTP server module to package an
AJAX app for interactively testing out regular expressions. I used to
have a Tkinter app that did something similar, but Christof’s is much
lighter weight.
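
The concept is easy to sketch with the standard library alone (a toy
illustration of the idea, much cruder than retest itself): a tiny HTTP
server that applies a submitted pattern to submitted text.

    # A toy illustration of the retest idea, not Christof's code.
    import re
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    FORM = ('<form>Pattern: <input name="pattern"> '
            'Text: <input name="text"> '
            '<input type="submit" value="Test"></form>')

    class RegexHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            query = parse_qs(urlparse(self.path).query)
            body = FORM
            if 'pattern' in query and 'text' in query:
                try:
                    m = re.search(query['pattern'][0], query['text'][0])
                    result = m.group(0) if m else '(no match)'
                except re.error as err:
                    result = 'bad pattern: %s' % err
                body += '<p>Result: %s</p>' % result
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')
            self.end_headers()
            self.wfile.write(body.encode('utf-8'))

    HTTPServer(('localhost', 8000), RegexHandler).serve_forever()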

Now I need to figure out how to package it to run as an app when I
double-click on it in the Finder, instead of opening the .py file in an
editor.

Object-Relational Mappers

My friend Steve and I have spent some time discussing
object-relational mapping recently, partially initiated by his
comments on the ORM features in django.

    For some reason I’ve never quite understood, there seems to be an
    inherent fear of SQL in the web development community, and over the
    years there have been many efforts to hide the SQL completely (or in
    the case of Zope, encourage the use of a custom object database
    instead of a relational database). Personally I’m wary of any form
    of object relational mapping which works automatically. What I do
    want is a nice abstraction layer (sometimes called the data access
    object pattern), so that the code working with objects doesn’t know
    that the objects are actually stored in a relational database.

I tend to agree. I’m confused by the intense need to create a new way
to express a relational database schema in Python, Ruby, or any other
language. The DDL is a perfectly legitimate way to express the schema
of the database. Why not just use it?

We use an ORM like that at work. The whole thing was written several
years ago before the SQLObject and SQLAlchemy ORMs were available, of
course, or we would be using one of them. The database connection layer
scans the database table definitions when it connects for the first
time. The base class for persistent objects uses that information to
discover the names and types of attributes for classes. We do it all at
runtime now, though we have discussed caching the information somehow
for production servers (maybe using a pickle or even writing out a
Python module during the build by parsing the DDL itself). Scanning the
tables doesn’t take as long as you would think, though, so it hasn’t
become a hot-spot for performance tuning. Yet.
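
Reduced to a sketch, the idea looks something like this (the names are
hypothetical, with sqlite3 standing in for the real database):

    # A reduced sketch of the runtime-discovery approach; the names
    # are hypothetical and sqlite3 stands in for the real database.
    import sqlite3

    class Persistent:
        table = None  # each subclass names its table

        @classmethod
        def columns(cls, conn):
            # Ask the database for the column names instead of
            # repeating the schema in Python.
            info = conn.execute('PRAGMA table_info(%s)' % cls.table)
            return [row[1] for row in info]

        @classmethod
        def load(cls, conn, pk):
            cols = cls.columns(conn)
            row = conn.execute(
                'SELECT %s FROM %s WHERE id = ?'
                % (', '.join(cols), cls.table),
                (pk,)).fetchone()
            obj = cls()
            for name, value in zip(cols, row):
                setattr(obj, name, value)
            return obj

    class Project(Persistent):
        table = 'project'

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE project (id INTEGER, name TEXT)')
    conn.execute("INSERT INTO project VALUES (1, 'codehosting')")
    print(Project.load(conn, 1).name)  # codehosting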

Steve suggested a slightly different design. Use DDL to define the
schema, then convert the schema to base classes (one per table) with a
code generator. Then subclass the auto-generated classes to add
“business logic”. I’m not sure how well that would work, but it sounds
like an interesting idea. If the generated code is going to support
querying for and returning related objects, how does it know to use the
subclass to create instances instead of the generated class?
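
One possible answer, and this is my speculation rather than Steve's
design: have each concrete subclass register itself with the generated
base, so query results are always built from the most-derived class:

    # My speculation about how the generated layer could know which
    # class to instantiate: register the concrete subclass with the
    # generated base, and build query results from it.
    class GeneratedProjectBase:
        concrete = None  # set by register() below

        @classmethod
        def register(cls, subclass):
            cls.concrete = subclass
            return subclass

        @classmethod
        def from_row(cls, row):
            # Use the registered subclass when there is one, so the
            # business logic comes along for free.
            instance = (cls.concrete or cls)()
            instance.row = row
            return instance

    class Project(GeneratedProjectBase):
        def display_name(self):
            return self.row['name'].title()

    GeneratedProjectBase.register(Project)

    p = GeneratedProjectBase.from_row({'name': 'codehosting'})
    print(type(p).__name__)  # Project
    print(p.display_name())  # Codehosting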

I do like the automatic handling of queries for related objects, and
the system used by django is particularly elegant. Two features I
especially like are:

  1. The QuerySet isn’t resolved and executed until you start indexing
    into it.
  2. Modifying a QuerySet by applying a filter actually creates a new
    QuerySet.
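
Both properties are easy to mimic in plain Python, which may make the
design clearer (a toy sketch, not django's implementation):

    # A toy sketch of the two properties: a lazy query object whose
    # filter() returns a new instance instead of mutating in place.
    class QuerySet:
        def __init__(self, data, predicates=()):
            self._data = data
            self._predicates = predicates

        def filter(self, predicate):
            # A new QuerySet; the original is left untouched.
            return QuerySet(self._data, self._predicates + (predicate,))

        def __getitem__(self, index):
            # Nothing is evaluated until someone indexes in.
            results = [item for item in self._data
                       if all(p(item) for p in self._predicates)]
            return results[index]

    posts = QuerySet(range(10))
    recent = posts.filter(lambda n: n > 5)  # cheap: nothing evaluated
    print(recent[0])  # 6 -- the "query" runs here
    print(posts[0])   # 0 -- posts itself was not modified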

This means passing QuerySet instances around is inexpensive, and callers
do not have to worry about call-by-reference objects being modified
unexpectedly. I need to study SQLAlchemy again, to see how it handles
query result sets.