8th August 2007

Posted in Blog :: Python Performance

I've been emailing back and forth with a friend who is also learning Python. He is a physicist, rather than a programmer, and is focusing on the data crunching abilities of Python in ways that I haven't explored.

One of the things he was trying to do was load some data representing a 2d array from a binary file and plot the points. He'd figured out the loading (via the struct module) and the plotting, but noticed that he had a few bogus values in the array that he needed to replace with zero.
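
I don't have his loading code in front of me, but a minimal struct-based sketch of that kind of loading might look something like this (the layout - 8-byte little-endian doubles in row-major order with known dimensions - is an assumption for illustration, not his actual format):

import struct

def load_grid(path, rows, cols):
    """Read a binary file of doubles into a list of lists (assumed layout)."""
    f = open(path, 'rb')
    fmt = '<%dd' % cols              # one row of little-endian doubles
    rowsize = struct.calcsize(fmt)
    grid = []
    for _ in range(rows):
        grid.append(list(struct.unpack(fmt, f.read(rowsize))))
    f.close()
    return grid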

We wrote the naive implementation together - his initial data structure for a 2d array was lists of lists and my first thought was to loop over the outer list and apply a filtering function to each of the inner lists like this:

def not70(x):
    """Replace the bogus value 70 with zero; pass everything else through."""
    if x == 70:
        return 0
    else:
        return x

def replaceList(a):
    """loop through 2d list with for loop, using map to find and replace
    values in lines of array"""
    for index, line in enumerate(a):
        a[index] = map(not70, line)

This worked, but it took a lot of time for his very large data sets (something like 10-15 seconds).

I was inspired to check out how we could improve this. My first thought was to switch from using lists to using arrays. Python has a built-in array type (the array module) that holds homogeneous base types (as opposed to lists, which can hold objects of any type heterogeneously). After one blunder (filtering an empty list is faster than filtering a big one, who knew) I found that using the same algorithm on a list of arrays was actually slower than on a list of lists!
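
For the record, that variant is essentially the same loop with each row rebuilt as a typed array - roughly this, assuming the rows are stored as array.array('d') objects (a sketch, not the exact code we ran):

from array import array

def replaceArrayList(a):
    """Same loop-and-map filter, but over a list of array.array rows."""
    for index, line in enumerate(a):
        # rebuild the typed array from the filtered values
        a[index] = array('d', map(not70, line))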

Hmm. Well, I know that the built-in arrays have a rep for being slow, hence the presence of NumPy. My box (Kubuntu 6.06) is still on Python 2.4 and didn't have NumPy installed by default, so I started out with the Numeric module. Now, interestingly enough, using the same algorithm was even slower still (those built-in lists sure are optimized). I knew we could do better, though, if we could express our filtering as a whole-array Numeric operation. After a little bit of thought (the Numeric API was not always obvious to me) I came up with:

import Numeric

def letsUseALittleMath(a):
    # b is 1 wherever a != 70 and 0 wherever a == 70
    b = Numeric.not_equal(a, 70)
    # element-wise multiplication zeroes out the bogus entries
    return a * b

The not_equal call results in an equally sized array "b" filled with 1s where the condition is met (an entry is not equal to 70, our supposed "bogus" value for illustration's sake) and 0s where the corresponding position had our target value. Simple multiplication in Numeric is implemented element-wise, so that in the resulting array c, c[i][j] = a[i][j] * b[i][j]. And of course multiplying a cell by zero yields zero and multiplying a cell by 1 yields the original value.
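
A tiny worked example of the idea, small enough to check by hand (the exact printed formatting will vary, but the values are the point):

import Numeric

a = Numeric.array([[1, 70],
                   [70, 4]])
b = Numeric.not_equal(a, 70)    # 1 where a != 70, 0 where a == 70
print(a * b)                    # the two 70s become zeros: [[1, 0], [0, 4]]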

This was indeed a lot faster - in the tests I'm running with the timeit module, using replaceList above on a large list of lists takes 28 seconds for 500 iterations, with 39 seconds and 59 seconds respectively for the same algorithm on the presumably more efficient data structures of a list of built-in arrays and a Numeric 2d array. Using Numeric's whole-array operations took only 3.5 seconds for 500 iterations.
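
For anyone curious about the measurement itself, the numbers come from the timeit module; the harness is roughly this shape (the makeBigListOfLists helper is a stand-in for however you build the test data, not a real function from our script):

import timeit

setup = """
from __main__ import replaceList, makeBigListOfLists  # makeBigListOfLists is hypothetical
data = makeBigListOfLists()
"""
t = timeit.Timer('replaceList(data)', setup)
print('Time using all lists: %s' % t.timeit(number=500))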

That was a lot better, but Justin wanted to try NumPy. I `sudo easy_install numpy`ed and, once gcc was finished chugging, ran his script. He re-implemented the filtering function using a logical condition in an array index, as NumPy allows (very cool - how do I support that in my own classes?), to come up with the one-liner filtering function:

def fastway(a):
    # boolean-mask assignment: zero every element equal to 70
    a[a == 70] = 0

Using NumPy and that built-in syntax for replacing values resulted in a 3x speedup over the algorithm I implemented in Numeric. Here's what the benchmark file is yielding right now:

Time using all lists: 30.540692091
Time using lists of arrays: 33.4445209503
Time using Numpy arrays and array multiplication: 3.20908308029
Time using fastway: 1.08567285538
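
As for my parenthetical question above - how do I support that kind of indexing in my own classes - my rough understanding is that it comes down to two special methods: __eq__ returning a mask object and __setitem__ accepting one. A toy sketch of the protocol (this is not how NumPy actually implements it, just the shape of the idea):

class MaskedGrid(object):
    """Toy grid supporting a[a == 70] = 0 style replacement (hypothetical class)."""
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __eq__(self, value):
        # comparison produces a boolean mask the same shape as the grid
        return [[cell == value for cell in row] for row in self.rows]

    def __setitem__(self, mask, value):
        # assignment through a mask replaces only the flagged cells
        for i, row in enumerate(mask):
            for j, flagged in enumerate(row):
                if flagged:
                    self.rows[i][j] = value

g = MaskedGrid([[1, 70], [70, 4]])
g[g == 70] = 0
print(g.rows)    # [[1, 0], [0, 4]]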

This seems like relatively decent performance (I haven't implemented this in any other languages for a baseline), and it's especially nice that it was relatively easy to optimize: we're both Python newbies, and I only have about an hour of hacking and reading invested in improving the performance of the operation. It's also possible that there's a more straightforward or obviously faster way of doing this - I wondered about filtering the values before going to the 2d array - but this was a fun learning experience. Lessons learned: stay away from looping over large datasets, and don't bother with the built-in arrays (go straight to NumPy).

Posted on August 8th 2007, 10:48 PM