PyMOTW: heapq

The heapq implements a min-heap sort algorithm suitable for use with Python’s lists.

Module: heapq
Purpose: In-place heap sort algorithm
Python Version: New in 2.3 with additions in 2.5

Description:

A heap is a tree-like data structure where the child nodes have a sort-order relationship with the parents. Binary heaps can be represented using a list or array organized so that the children of element N are at positions 2*N+1 and 2*N+2 (for zero-based indexes). This feature makes it possible to rearrange heaps in place, so it is not necessary to reallocate as much memory when adding or removing items.

A max-heap ensures that the parent is larger than or equal to both of its children. A min-heap requires that the parent be less than or equal to its children. Python’s heapq module implements a min-heap.

Creating a Heap:

There are 2 basic ways to create a heap, heappush() and heapify().

Using heappush(), the heap sort order of the elements is maintained as new items are added from a data source.

import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

heap = []
print 'random :', data
print

for n in data:
print 'add %3d:' % n
heapq.heappush(heap, n)
show_tree(heap)



$ python heapq_heappush.py
random : [19, 9, 4, 10, 11, 8, 2]

add 19:

19
------------------------------------

add 9:

9
19
------------------------------------

add 4:

4
19 9
------------------------------------

add 10:

4
10 9
19
------------------------------------

add 11:

4
10 9
19 11
------------------------------------

add 8:

4
10 8
19 11 9
------------------------------------

add 2:

2
10 4
19 11 9 8
------------------------------------



If the data is already in memory, it is more efficient to use heapify() to rearrange the items of the list in place.

import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

print 'random :', data
heapq.heapify(data)
print 'heapified :'
show_tree(data)



$ python heapq_heapify.py
random : [19, 9, 4, 10, 11, 8, 2]
heapified :

2
9 4
10 11 8 19
------------------------------------



Accessing Contents of a Heap:

Once the heap is organized correctly, use heappop() to remove the element with the lowest value. In this example, adapted from the stdlib documentation, heapify() and heappop() are used to sort a list of numbers.

import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

print 'random :', data
heapq.heapify(data)
print 'heapified :'
show_tree(data)
print

inorder = []
while data:
smallest = heapq.heappop(data)
print 'pop %3d:' % smallest
show_tree(data)
inorder.append(smallest)
print 'inorder :', inorder



$ python heapq_heappop.py
random : [19, 9, 4, 10, 11, 8, 2]
heapified :

2
9 4
10 11 8 19
------------------------------------


pop 2:

4
9 8
10 11 19
------------------------------------

pop 4:

8
9 19
10 11
------------------------------------

pop 8:

9
10 19
11
------------------------------------

pop 9:

10
11 19
------------------------------------

pop 10:

11
19
------------------------------------

pop 11:

19
------------------------------------

pop 19:

------------------------------------

inorder : [2, 4, 8, 9, 10, 11, 19]


To remove existing elements and replace them with new values in a single operation, use heapreplace().

import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

heapq.heapify(data)
print 'start:'
show_tree(data)

for n in [0, 7, 13, 9, 5]:
smallest = heapq.heapreplace(data, n)
print 'replace %2d with %2d:' % (smallest, n)
show_tree(data)



This technique lets you maintain a fixed size heap, such as a queue of jobs ordered by priority.


$ python heapq_heapreplace.py
start:

2
9 4
10 11 8 19
------------------------------------

replace 2 with 0:

0
9 4
10 11 8 19
------------------------------------

replace 0 with 7:

4
9 7
10 11 8 19
------------------------------------

replace 4 with 13:

7
9 8
10 11 13 19
------------------------------------

replace 7 with 9:

8
9 9
10 11 13 19
------------------------------------

replace 8 with 5:

5
9 9
10 11 13 19
------------------------------------



Data Extremes:

heapq also includes 2 functions to examine an iterable to find a range of the largest or smallest values it contains. Using nlargest() and nsmallest() are really only efficient for relatively small values of n > 1, but can still come in handy in a few cases.

import heapq
from heapq_heapdata import data

print 'all :', data
print '3 largest :', heapq.nlargest(3, data)
print 'from sort :', list(reversed(sorted(data)[-3:]))
print '3 smallest:', heapq.nsmallest(3, data)
print 'from sort :', sorted(data)[:3]



$ python heapq_extremes.py
all : [19, 9, 4, 10, 11, 8, 2]
3 largest : [19, 11, 10]
from sort : [19, 11, 10]
3 smallest: [2, 4, 8]
from sort : [2, 4, 8]


References:

heapq Theory
WikiPedia – Heap Data Structure
Python Module of the Week Home
Download Sample Code


Technorati Tags:
,


  • http://www.blogger.com/profile/17320536844472897403 István Albert

    What a great tutorial! Showing how the tree changes after pushing values onto the heap are fantastic.

    I’m putting “LazyWeb” on notice … from now on there is no excuse for not understanding how heaps work ;-)

  • http://www.blogger.com/profile/01892352754222143463 Doug Hellmann

    Thanks, István! By the way, the code to print the heap is included in the example source tarball, even though I don’t show it here in this post.

  • Beetle B.

    Thanks for highlighting the module.

    1. I haven’t looked at the source code, but usually heappush is slower (O(n*logn) vs O(n) for heapify). It’d be worth experimenting, but I suspect that for large enough n, it would be faster to read the elements into a list, and heapify it rather than heappush one by one.

    2. Regarding nlargest(), here’s a relevant blog post. The gist of it is that if you want to get the top n elements, you can either sort the whole list and read them off, or construct a heap and keep popping the largest elements. The assertion is that the latter approach is faster (as you need not sort the whole list, and creating a heap is O(n)).

    So I decided to try it out. I created a random list of 1 million integers. I then wrote my own (very simple) heap functions – I did not know at the time of the heapq module – and timed it.

    Then I found out about the heapq module and used nlargest(). I was quite surprised to see that my routine was 2-3 times faster (if I used psyco) than the heapq function.

    I can send the code over if you want to compare…

  • http://www.blogger.com/profile/01892352754222143463 Doug Hellmann

    Beetle B., the standard library documentation doesn’t talk a lot about performance of heappush vs. heapify, but I wouldn’t be surprised to find heapify more efficient.

    I’m also not surprised to find the code optimzied with psyco runs faster than native Python. That’s the point, right? :-)

  • Beetle B.

    Actually, in a sense, I compared psyco on both. psyco had little impact on the heapq function.

    I used the algorithms given in the Cormen Intro to Algorithms book.

    I suspect the reason I got a good speedup with psyco is that a key function I wrote is recursive. I’ll try writing a nonrecursive one and see what difference that makes.

    But in a more general sense, I would think trying to find the top 10 numbers among 10 million of them using heaps should be significantly faster than using sort. In Python, using heapq, it really wasn’t much faster (perhaps a speedup of 2 at most) – which made me wonder if there’s some inefficiency in the heapq code.

    Do you know where on my system (Linux) I can find the source for heapq? I honestly don’t have a clue.

  • http://www.blogger.com/profile/01892352754222143463 Doug Hellmann

    I’m sure a non-recursive key function would have an impact.

    As far as where the heapq source is, try this:

    $ python
    >>> import heapq
    >>> print heapq.__file__

  • Beetle B.

    Thanks for the tip – I didn’t know about that.

    So I looked into their source code and found out that the heapq module stores everything in the opposite order from what I do (root node is smallest element – not the largest one). So when I did nlargest, it simply sorted the whole list!

    nsmallest, which was “equivalent” to what I was doing, works.

    I also replaced my function with a nonrecursive one – no difference. So at the end: With psyco my version is much faster than heapq, but without it is much slower…