Book Review: Programming Collective Intelligence
The latest book I’ve been reading as part of the Atlanta Python Users Group Book Club is Programming Collective Intelligence by Toby Segaran.
Disclosure: My copy of the book was provided free, as part of O’Reilly Media’s support for the book club.
My Impressions
I have to admit, I was a little concerned when I picked up Programming Collective Intelligence that my rusty math skills would be a hindrance to really understanding the material. But all of the statistics and linear algebra needed (not much) is explained quite clearly in context (something my college professors could never seem to manage). It did take me longer than usual to read a book of this size, because this one is crammed full of great material. It has a high information density, but is still a pleasant read. Ending each chapter with a list of exercises for exploring the topics in more depth was a nice touch.
While the source code is not always as clear as the prose (mostly due to variable name choices), it is accompanied by plenty of clear descriptive text. Most of the chapters build the source along with your own knowledge, rather than presenting a large complete program after a lengthy description. In fact, many of the in-line examples are created using the Python interpreter command line, making it easy to work along with the text and experiment with the data on your own.
I definitely recommend this book. The algorithms covered are fascinating, and I’m already considering how I can use the optimization techniques to solve a sticky problem we’ve been trying to address at work.
Book Summary
The first chapter introduces collective intelligence (combining input from a large group of people to achieve insight) and machine learning (adaptive algorithms which can be trained to perform a task more accurately or make predictions).
Chapter 2 dives right into building a recommendation engine. The first small example program finds users with similar tastes in movies. This example is used to explore different ways to calculate similarity between data points, and how to use those values to rank other movie critics based on how similar they are to you. Critics with taste similar to your own can be used to find recommendations for movies you have not seen. These ideas are expanded in a larger example which recommends links from del.icio.us. This is the first of many real example programs throughout the book which use Web 2.0 APIs to pull data from public sites. Chapter 2 closes with a discussion of the pros and cons of user-based vs. item-based filtering and when each is appropriate.
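To give a flavor of the similarity calculations, here is a minimal Pearson correlation sketch along the lines of what the chapter builds; the function name and the `prefs` data layout are my own, not the book's:

```python
from math import sqrt

def pearson_similarity(prefs, person1, person2):
    # prefs maps each person to a dict of {movie: rating}.
    shared = [m for m in prefs[person1] if m in prefs[person2]]
    n = len(shared)
    if n == 0:
        return 0.0  # no movies in common, no basis for comparison

    x = [prefs[person1][m] for m in shared]
    y = [prefs[person2][m] for m in shared]
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))

    # Standard Pearson correlation: covariance divided by the product
    # of the standard deviations, ranging from -1 to 1.
    num = sum_xy - (sum_x * sum_y / n)
    den_sq = (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    if den_sq <= 0:
        return 0.0
    return num / sqrt(den_sq)
```

A score near 1 means two critics rate movies almost identically (even if one is systematically harsher), which is why the book prefers it over plain Euclidean distance for un-normalized ratings.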
In chapter 3, the similarity calculations developed in chapter 2 are used to build data clustering algorithms (hierarchical, column, and K-means). The first example groups blogs based on the words that appear in their posts. The example works through the entire process, from breaking the input into words to be counted all the way to visualizing the clustering results. Sample code for drawing dendrograms using PIL is included. Next, an example using Zebo.com discovers clusters in the preferences people have (Zebo lets users post lists of things they want).
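As a taste of the clustering material, here is a rough K-means sketch in the chapter's spirit; the names and structure are mine, and the book's version differs in its details:

```python
import random
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100):
    # points are equal-length tuples of numbers.
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    # Start with k centroids placed randomly inside the data's bounds.
    centroids = [tuple(random.uniform(lo[d], hi[d]) for d in range(dims))
                 for _ in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        moved = [tuple(sum(p[d] for p in c) / len(c) for d in range(dims))
                 if c else centroids[i]
                 for i, c in enumerate(clusters)]
        if moved == centroids:
            break  # assignments are stable, so we are done
        centroids = moved
    return clusters
```

Because the starting centroids are random, different runs can produce different clusterings, a point the chapter makes as well.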
Chapter 4 discusses the challenges experienced when building a full-text search engine. The example code starts out a little confusing because it stubs in the whole API instead of “evolving” the class throughout the chapter. But the discussion is clear, and once the code is complete it makes sense. The discussion of PageRank has especially good examples. Chapter 4 also introduces a simple neural network implementation and shows how to train it to include the click counts for search results in their rankings. The neural network code might have been clearer in a functional programming style, but that may just be a personal preference. In general, the implementation is very straightforward and it should be possible to use it for other purposes. This was my first exposure to neural networks, and they strike me as surprisingly simple for something with such an exotic sounding name.
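The PageRank calculation itself boils down to a short iterative loop. This sketch uses the same 0.15/0.85 damping constants the algorithm is usually described with; the data layout and names are my own:

```python
def pagerank(links, damping=0.85, iterations=20):
    # links maps each page to the list of pages it links to.
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {page: 1.0 for page in pages}

    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each page that links here passes along a share of its
            # own rank, split evenly among its outbound links.
            incoming = sum(rank[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_rank[page] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank

# Example: three pages linking to one another.
print(pagerank({'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}))
```

After enough iterations the values converge, so a page's rank no longer changes between passes.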
Chapter 5 covers “stochastic optimization” techniques for selecting the best solution from a large set of possibilities. Random searching, hill climbing, simulated annealing, and genetic algorithms are covered. The discussion also includes the limitations of optimization as an approach. Once the basic techniques are explained, the sample flight scheduling application is converted to use live data from Kayak.com.
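For illustration, here is a simulated annealing sketch roughly in the shape the chapter uses; the function and parameter names are mine, and the book's version differs in its details:

```python
import math
import random

def annealing(domain, cost, T=10000.0, cool=0.95, step=1):
    # domain is a list of (low, high) bounds for each element of the
    # solution vector; cost scores a candidate (lower is better).
    solution = [random.randint(lo, hi) for lo, hi in domain]
    while T > 0.1:
        # Nudge one randomly chosen element up or down, staying in bounds.
        i = random.randrange(len(domain))
        neighbor = solution[:]
        neighbor[i] = max(domain[i][0],
                          min(domain[i][1],
                              neighbor[i] + random.choice([-step, step])))
        old, new = cost(solution), cost(neighbor)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if new < old or random.random() < math.exp((old - new) / T):
            solution = neighbor
        T *= cool
    return solution
```

The willingness to accept worse solutions early on is what lets annealing escape local minima that plain hill climbing gets stuck in.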
In chapter 6, various algorithms for classifying documents are covered. A naive Bayesian spam filter is used to examine the challenges of breaking documents up into classifiable “features”. There is good coverage of the techniques for limiting false classifications using separate thresholds and a description of how to combine the probabilities for each feature to calculate the probability of the source document belonging in one category or another. The Fisher method, used by SpamBayes, is also discussed. Once the classifier is complete, an example program for filtering blog feeds is built with it. The code samples in chapter 6 start to suffer from abbreviated symbol names, but once you figure out the abbreviations the rest of the structure of the code makes sense.
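To show how the probability combination and per-category thresholds fit together, here is a hypothetical classifier skeleton. The `feature_prob` and `cat_prob` callables stand in for values learned from training counts, and all of the names here are mine rather than the book's:

```python
def classify(features, categories, feature_prob, cat_prob, thresholds):
    # Score each category: P(category) times the product of
    # P(feature | category), assuming features are independent
    # (the "naive" in naive Bayes).
    scores = {}
    for c in categories:
        p = cat_prob(c)
        for f in features:
            p *= feature_prob(f, c)
        scores[c] = p

    best = max(scores, key=scores.get)
    for c, score in scores.items():
        if c == best:
            continue
        # The winner must beat each rival by its threshold factor,
        # otherwise we refuse to commit to a category.
        if score * thresholds.get(best, 1.0) > scores[best]:
            return 'unknown'
    return best
```

Setting a high threshold for the spam category, for example, makes false positives (good mail marked as spam) much less likely at the cost of letting some spam through.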
The material in chapter 7, Modeling with Decision Trees, reminded me of an expert systems class I had in college. In class, we had to build our decision trees by hand, but chapter 7 shows how to “train” a tree from input data with known outcomes. The material covers methods for splitting the data into sets based on Gini impurity or entropy, and then building a tree recursively by repeatedly splitting sets until no more information is gained by having separate nodes in the decision tree. Again, once an example program is built with a simple dataset, the program is enhanced by introducing a web 2.0 site which can provide similar data. In this case, real estate price information from Zillow.com is used. To illustrate how the same decision tree code can be used with completely different types of data, a hot-or-not predictor is built with data from http://hotornot.com.
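The two splitting metrics are easy to express directly. This sketch assumes, as the book's examples do, that each row keeps its outcome label in the last column; the function names are my own:

```python
from collections import Counter
from math import log2

def gini_impurity(rows):
    # Probability that a randomly chosen row would be mislabeled if
    # labels were drawn at random from the set's own distribution.
    if not rows:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def entropy(rows):
    # Shannon entropy of the outcome labels, in bits; zero when the
    # set is pure, higher the more mixed it is.
    if not rows:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

Tree building then amounts to trying candidate splits and keeping the one that reduces the chosen metric the most.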
Chapter 8 leaves the realm of strict classification and introduces tools for building models that predict prices for items described by multiple variables. As with the earlier chapters, several techniques are presented and their pros and cons are covered in detail. There are plenty of graphs to illustrate the importance of selecting the right number of neighbors for the k-nearest neighbors calculation, for example. This chapter also discusses optimizing the scale of data from heterogeneous variables, and weighting different variables based on how much they affect the outcome. The real world dataset for chapter 8 comes from eBay pricing data.
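A weighted k-nearest-neighbors estimate looks roughly like this; it is a sketch with my own names, while the book builds up the pieces separately:

```python
from math import exp, sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gaussian_weight(dist, sigma=10.0):
    # Smooth falloff with distance; never reaches exactly zero, so
    # every neighbor contributes at least a little.
    return exp(-dist ** 2 / (2 * sigma ** 2))

def weighted_knn_estimate(data, query, k=5):
    # data is a list of (feature_vector, price) pairs; the estimate
    # is a weighted average over the k nearest neighbors.
    neighbors = sorted(data, key=lambda item: euclidean(item[0], query))[:k]
    total_weight = weighted_sum = 0.0
    for features, price in neighbors:
        w = gaussian_weight(euclidean(features, query))
        total_weight += w
        weighted_sum += w * price
    return weighted_sum / total_weight
```

Rescaling each variable before computing distances is the chapter's answer to features with wildly different units, such as bottle size versus wine age.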
Chapter 9 returns to classification, and covers tools for classifying data where the division of the data can be expressed as a function of two or more variables. The problems with decision trees and basic linear classification are discussed in the context of a dating site match-making application. This segues into a discussion of kernel methods for dealing with non-linear classification. The input data for the match-maker includes the distance between where the two parties live, as calculated using the Yahoo! Maps API. Although support vector machines are discussed in theory, the actual code for working with them uses the open source LIBSVM library due to the intense computational requirements. At the end of the chapter the completed match-maker is turned loose on Facebook data to predict “friends”.
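The kernel trick itself fits in a few lines. This is a sketch of a radial-basis-function kernel applied to a simple mean-similarity classifier, in the spirit of the chapter's pre-LIBSVM example; the names are my own:

```python
from math import exp

def rbf_kernel(v1, v2, gamma=0.1):
    # Similarity in an implicit high-dimensional space, computed
    # without ever mapping the points into that space.
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return exp(-gamma * sq_dist)

def nonlinear_classify(point, class0, class1, kernel=rbf_kernel):
    # Kernel variant of a mean-point linear classifier: compare the
    # average kernel similarity to each class instead of the distance
    # to each class's average point.
    sim0 = sum(kernel(point, x) for x in class0) / len(class0)
    sim1 = sum(kernel(point, x) for x in class1) / len(class1)
    return 0 if sim0 > sim1 else 1
```

Swapping the dot product for a kernel like this is what lets an otherwise linear method carve out curved decision boundaries.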
While the earlier chapters have focused on placing data into categories, chapter 10 covers techniques for discovering categories within the data itself. The first example presented is a tool for finding themes among news items in RSS feeds. The feature extraction technique discussed, non-negative matrix factorization, is implemented using the NumPy library for matrix math. The second example uses Yahoo! Finance APIs to examine trading volume for various stocks, looking for relationships.
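Non-negative matrix factorization is compact in NumPy. This sketch follows the standard multiplicative update rules of Lee and Seung, which is the general technique the chapter implements; the details and names here are mine:

```python
import numpy as np

def nmf(v, feature_count, iterations=50):
    # Factor v into w @ h with all entries non-negative, so each row
    # of h reads as a "theme" and w says how much of each theme every
    # input row contains.
    rows, cols = v.shape
    w = np.random.rand(rows, feature_count)
    h = np.random.rand(feature_count, cols)

    for _ in range(iterations):
        # Multiplicative updates keep every entry >= 0 by construction;
        # the small constant guards against division by zero.
        h *= (w.T @ v) / (w.T @ w @ h + 1e-9)
        w *= (v @ h.T) / (w @ h @ h.T + 1e-9)
    return w, h

# Example: factor a tiny articles-by-words count matrix into 2 themes.
v = np.array([[3., 0., 1.], [2., 0., 0.], [0., 4., 3.]])
w, h = nmf(v, 2)
print(np.round(w @ h, 2))  # should approximate the original matrix
```

The non-negativity constraint is what makes the factors interpretable: a news story can contain more or less of a theme, but never a negative amount.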
Chapter 11 introduces a few techniques for genetic programming, evolving applications through trials and mutations. The sample code includes classes to represent programs as trees of data, which are easier to mutate than raw text source would be. The chapter explains how to measure the success of any one program tree, apply random changes through mutation and crossover, then evolve a set of programs by identifying and retaining those which are becoming more successful at reaching the desired outcome. The importance of diversity for keeping the population from getting stuck at a local maximum is stressed. The sample programs include a formula building tool and a player for a simple game.
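A sketch of the tree representation and the mutation step might look like this; the book uses its own class names, and everything here is my own illustration:

```python
import copy
import random

class OpNode:
    """Interior node: applies an operator to its children's results."""
    def __init__(self, op, children):
        self.op, self.children = op, children
    def evaluate(self, params):
        return self.op(*[c.evaluate(params) for c in self.children])

class ParamNode:
    """Leaf that returns one of the program's input parameters."""
    def __init__(self, index):
        self.index = index
    def evaluate(self, params):
        return params[self.index]

class ConstNode:
    """Leaf that returns a fixed constant."""
    def __init__(self, value):
        self.value = value
    def evaluate(self, params):
        return self.value

def mutate(tree, grow_random_tree, prob=0.1):
    # With probability prob, replace this subtree with a freshly grown
    # random one; otherwise keep the node and recurse into children.
    if random.random() < prob:
        return grow_random_tree()
    if isinstance(tree, OpNode):
        return OpNode(tree.op, [mutate(c, grow_random_tree, prob)
                                for c in tree.children])
    return copy.deepcopy(tree)
```

Crossover works analogously: pick a random subtree in each of two parent programs and swap them to produce offspring.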
Chapter 12 wraps up the book with a summary of all of the algorithms and techniques presented earlier, and is intended to serve as a reference. There is less code, but all of the examples use new data, so the prose is not just a repetition of what has already been seen. There are additional diagrams to help explain the techniques; in particular, the neural network details are expanded.
Web 2.0 APIs/Sites
All of the examples throughout the book use either simple flat files for input, or a Web 2.0 API of some sort.
- del.icio.us
- MovieLens/GroupLens
- RSS/Atom
- Zebo
- Kayak
- Akismet
- Zillow
- hotornot.com
- eBay
- Yahoo! Maps
- Yahoo! Finance
Open Source Libraries
Many of the example programs use open source libraries to process or retrieve the data. All of those libraries are listed, with instructions for retrieving and installing them, in Appendix A.
- Python Imaging Library (PIL)
- BeautifulSoup
- SQLite
- pysqlite
- LIBSVM
- feedparser
- NumPy
- matplotlib
- pydelicious