Thursday, September 15, 2011

A Hodgepodge of Python


Here at Adku, we do most of our development in Python. Python is concise, expressive, and well suited to rapid prototyping and readable code. I must admit though, I was initially skeptical of the choice to use Python. In my mind, dynamic typing = bugs, interpreted language = slow. Ahhhhh! But slowly my misgivings have disappeared, and I’ve come to really appreciate the Python Language for the wide coverage of its standard library, and the elegant design of many of the idioms.
For the rest of this post, I’d like to go over three of my favorite python features.
The timeit module
My initial worries about python’s performance versus traditional compiled languages have turned out to be irrelevant*. But in the few times when I’ve had some cause for worry, I found python’s timeit module to be a great help. Unlike other profiling modules, timeit can be run strictly from the command line. So we can quickly answer questions such as the following.
how much is xrange was faster than range?
:~/ $ python -m timeit 'for i in xrange(1000): continue'
100000 loops, best of 3: 19.1 usec per loop
:~/ $ python -m timeit 'for i in range(1000): continue'
10000 loops, best of 3: 24.6 usec per loop
What is the fastest way to reverse a list?
:~/ $ python -m timeit 'l=range(1000)' 'l[::-1]'
100000 loops, best of 3: 12.3 usec per loop
:~/ $ python -m timeit 'l=range(1000)' 'l.reverse()'
100000 loops, best of 3: 8.85 usec per loop
And as you’ll notice from the second example, even simple multiline programs can be timed; just separate lines with quotes.
* a poor algorithm seems to be more often responsible.
** Matt just pointed out to me that more detailed profiling can be obtained with a cProfile (two)liner
python -m cProfile (your script path) > timing.txt
cat timing.txt | sort -snk 4 | tail -n 50



zip
You can read about zip here: http://docs.python.org/library/functions.html#zip. The way I like to think about it is: if nested for loops are like walking through a matrix element by element, zip allows you to just “zip” down the diagonal.
Generally, zip is used to combine columns or rows of data. But there are two uses cases which I think are really cool.
Matrix transpose:
Suppose we are given a matrix expressed as a nested list:
a = [[...],...[...]]
Then we can find the transpose of this matrix by writing
zip(*a)
Grouping a list:
Suppose we have a list
a = [x1, x2, x3, …, xn]
Then we can group it into [(x1,x2), (x3,x4), … (xn-1,xn)] by writing
zip(*[iter(a)]*2)
Replace 2 with any integer k, and you can make k groups.
The main thing to understand in both of these examples is the * operator. In python, * has the usual meaning of multiplication when it’s a binary operator, but as a unary operator it unzip a list and pass it as arguments to a function. Aside from that though, both of these examples just require some staring and thinking to understand. But it’s worth it! I promise
collections.defaultdict
With collections.defaultdict, you can set a default value to the dictionaries you create. No more
a = {}
if not ‘foo’ in a:
a[‘foo’] = 0
else:
a[‘foo’] += 1
Just do
a = collections.defaultdict(int)
a[‘foo’] += 1
...
If you want, you can also do nested dictionaries. For example,
a = collections.defaultdict(lambda: collections.defaultdict(int))
will initialized a “matrix” of all zeroes.
That’s it for now. I would love to hear about more python features in the comments.

8 comments:

  1. a = {}
    a['foo'] = a.get('foo', 0) + 1

    ReplyDelete
  2. Very cool stuff. Post more.... :)
    Thanks

    ReplyDelete
  3. @Matt: defaultdict is faster in many real use situations (after you factor out the cost to import a module).

    python -m timeit 'from collections import defaultdict' 'd = {}' 'for i in xrange(1000): d.get(i, 0)'
    1000 loops, best of 3: 887 usec per loop

    python -m timeit 'from collections import defaultdict' 'd = defaultdict(int)' 'for i in xrange(1000): d[i]'
    1000 loops, best of 3: 1.74 msec per loop

    From that alone you'd be reasonable to prefer your dict.get method. But consider the more common use case (IMO) in which you hit the dictionary at the same key many times.

    python -m timeit 'from collections import defaultdict' 'd = defaultdict(int)' 'for i in xrange(100): [d[i] for j in xrange(100)]'
    100 loops, best of 3: 4.37 msec per loop

    python -m timeit 'from collections import defaultdict' 'd = {}' 'for i in xrange(100): [d.get(i, 0) for j in xrange(100)]'
    100 loops, best of 3: 9.69 msec per loop

    defaultdict(int) is twice as fast ;)

    ReplyDelete