6th June 2008

You're killing me, man

OK, you're killing me out there!

I'm turning into a crank, I know, but the quality of conversation around PHP development lately is really bugging me.

I ranted at length in the comments on the Sitepoint blog post about microbenchmarks and thought I'd gotten it out of my system.

No such luck - this morning in my feeds was a link to a blog post titled "in_array is quite slow". DO NOT WANT!

I'm going to explain one more time what's bugging me and then start ignoring it...

First take a look at the really badly titled in_array post. Basically the author is repeatedly importing a large dataset from XML to a database and wants to eliminate duplicate records. The script starts running slowly (as in hours) and he discovers that the duplicate-elimination logic consists of building an array of all the existing unique ids in the database and another array of all the ids from the XML data. Then, for each id in the list of new ids, it searches the array of existing ids to see if the id already exists, inserting it only if it's new...

With me so far? Notice how I cleverly didn't use the words "in_array". In fact, let's write some sample code to accomplish this algorithm without resorting to in_array.


// For every id in the import, scan the entire list of existing ids
foreach($new_ids as $new_id)
{
    $found = false;
    foreach($old_ids as $old_id)
    {
        if($old_id == $new_id)
        {
            $found = true;
            break;
        }
    }
    // Only insert ids we didn't find anywhere in the existing list
    if(!$found) insert_new_id($new_id);
}

So what's wrong with that code? Nothing, if my datasets stay small. The problem is that it's an O(n²) algorithm. This is Big O notation and completely worth reading about if you are or intend to be a programmer. Big O notation gives you a way to categorize the speed of algorithms. I've heard that O stands for "on the order of" but I don't see that on the Wikipedia page. In any event, Big O analysis doesn't care about the actual speed of an algorithm, only about the way the number of operations grows as the dataset the algorithm works on grows.

In this case it is pretty easy to see that if each of our lists has 10 items the inner loop will run up to 10x10 times (say, if there were no matches between the two lists). If there are 1000 items in each list it will run up to 1000x1000 times. And to abstract this: if there are "n" items in each list the inner loop runs n² times - hence the Big O label. N squared is really bad because quadratic growth doesn't scale the way we intuitively want things to scale. Instinctively I always assume algorithms are linear. If processing 1000 items in a list takes 1 second, processing 2000 items ought to take 2 seconds, right? That's true for O(n) algorithms only!
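If you want to see the quadratic blowup for yourself, here's a throwaway timing sketch of my own (the absolute numbers are machine-dependent and hypothetical - the ratio between them is the point): run the same nested scan at n = 1000 and n = 2000 and watch the elapsed time roughly quadruple rather than double.

// Same nested-scan algorithm as above, just counting instead of inserting
function naive_count_missing($new_ids, $old_ids)
{
    $missing = 0;
    foreach($new_ids as $new_id)
    {
        $found = false;
        foreach($old_ids as $old_id)
        {
            if($old_id == $new_id)
            {
                $found = true;
                break;
            }
        }
        if(!$found) $missing++;
    }
    return $missing;
}

foreach(array(1000, 2000) as $n)
{
    $ids = range(1, $n);
    $start = microtime(true);
    naive_count_missing($ids, $ids);
    // Doubling n roughly quadruples the elapsed time - the signature of O(n^2)
    printf("n = %d: %.3f seconds\n", $n, microtime(true) - $start);
}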

So back to the blog post - what's wrong with in_array() that makes it run so slowly? Nothing - in_array() is a search function. It's implemented in C, so it's probably faster than my hand-rolled loop, but it's essentially the same idea: a linear scan. Using in_array() in an inner loop gives you n-squared runtimes!
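For reference, the inner search in the original post was presumably something like this (my reconstruction - insert_new_id is still just a placeholder for whatever actually writes the row):

foreach($new_ids as $new_id)
{
    // in_array walks $old_ids from the start every single time
    if(!in_array($new_id, $old_ids)) insert_new_id($new_id);
}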

Now using the $old_ids list as a hash table instead (values in the keys) lets you convert this back to a linear runtime. Replacing the inner loop (or the in_array() call) with


    if(!isset($old_ids[$new_id])) insert_new_id($new_id);

makes the code run much faster - down from hours to 0.8 seconds in this case. But notice that this has nothing to do with in_array - instead it has to do with choices of data structures and algorithms. It doesn't help that PHP's hash table and list types are the same structure (that's not bad per se, as long as you are aware that two different data structures are hiding under one name)! We now have an O(n) algorithm because the speed of a hash table lookup is independent of the number of items it contains. The whole "inner loop" becomes a single constant-time operation that doesn't vary with the length of the input data. And of course a better blog title would be "Picking the right data structure" or perhaps even "Duh! Searching is slower than lookup!"
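To make the whole pattern concrete, here's a minimal sketch under my own assumptions (array_flip over a plain list of ids - if $old_ids comes straight from a database query you can just as easily assign into the keys as you fetch each row):

// Build the lookup table once - O(n). array_flip moves the values into the keys.
$old_lookup = array_flip($old_ids);

foreach($new_ids as $new_id)
{
    // isset on a key is a hash lookup: constant time, no scanning
    if(!isset($old_lookup[$new_id]))
    {
        insert_new_id($new_id);
        $old_lookup[$new_id] = true; // also skips duplicates within the import itself
    }
}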

And here's where my rant comes in. I'm not going to speculate about the causes (ease of use and ubiquitous deployment, in my book, plus the lingering awkwardness of the language that causes some of the solid developers to defect - e.g. I still have Paul Bissex and Simon Willison filed under PHP in my RSS reader...) but the quality of commentary in the PHP community is bothering me. Shallowness and bikeshedding abound. The Sitepoint post I mentioned at the beginning was pointing to yet another PHP microbenchmark discussion - should you use single quotes or double quotes around strings? Is while faster than foreach? To reference or not to reference when passing data structures around... All discussed un-ironically as "PHP best practices".

Stop it already people!

Here's some semi-constructive advice. Don't ever write about syntax unless you use the term newbie in the title. (The "for loop" for newbies is just fine.) For everybody else - syntax is not programming! Saying you are a programmer because you "know PHP" (i.e. understand the syntax of the language) is like me saying I'm a painter because I can name all the colors. Especially don't pair discussions of syntax and speed - in fact, don't talk about speed at all! Unless of course you cite the rules for optimization:

  1. Don't!
  2. (for experts only) Don't yet

Or at least Knuth's aphorism ("premature optimization is the root of all evil"). In fact... I'm coining my own aphorism here (naming your statements makes them sound more official). Henceforth metapundit's law must be respected: don't optimize (or blog about optimization) unless you know who Knuth is. And if you've got some solid additions to my PHP feeds (is Harry Fuecks still writing?) I'd be glad to hear them.
