17th October 2006

Posted in

Blog :: Code smells and design principles

I've had a variety of little PHP-howto tips floating around my brain lately. If I can post them here over the next couple of days I can make room in my brain for all the yummy python and TurboGears goodness it's been getting lately!

Basically the mental notes I've been making to myself all are demonstrations of Bad Code Smells and basic programming "craftmanship" guidelines. More and more, lately, I have been understanding and appreciating "programming as an artform" viewpoint on continued development of my skills as a programmer. If you haven't really run across this idea, you should read people like Martin Fowler, Andy Hunt, Bruce Eckel or (for a more hacker-ish take on some of the same ideas) read the archives of Paul Graham and Steven Yegge's blogs. Go read Eric Raymond's The Art of Unix Programming to top it off, and you'll know where I'm coming from.

Back already? You didn't go read any of that stuff and you're wondering about that reference to code smells, right? I'll try to explain (briefly) and then plunge right into the examples. The concept of Code Smells is that there are coding idioms that usually lead to bad code in some way. Perhaps the most famous example of this sort of thing is Edgar Dijkstra's much parodied Goto considered harmful article. Basically Dijkstra argued that one particular style of programming (using GOTO statements) was likely to lead to bugs and unmaintainable code. Dijkstra also offered a remedy (structured programming) in such convincing terms that today, GOTO use is widely deprecated and structured programming is by far the dominant idiom.

The concept of is not better programming through rules (if x than y). Instead it is part of the process of building up your intuition about how to keep your code readable and manageable. Coding principles or patterns (like DRY - Don't Repeat Yourself) are the positive converse of bad code smells. Again - they aren't meant to be rules you can just mindlessly follow. They're ideas and concepts you need to internalise as a programmer in order to become more skillful. My examples will, of necessity, be in the narrow domain of "web programming with PHP" but understanding the ideas will help you be a better programmer in whatever language (even Blub) you happen to use.

[they're] more what you'd call "guidelines" than actual rules
Captain Barbarossa

So let's start with a really simple idea. Loops.

I know, I know, you're thinking "LOOPS?" I could be reading about metaprogramming in ruby or macros in Lisp and this guy wants to lecture me about loops.

Well, yes, actually. Now syntactic loops are probably one of the very first ideas that mades programming easier than flipping toggle switches or typing 1s and 0s. Just a slightly higher order of abstraction than named variables right? This is CS 100 stuff really - if you can't write a for loop or a while loop then you aren't a programmer, end of story.

Granted. But I don't have in mind here syntax, but rather concepts. Here's the programming principle I have in mind "LAG - Loops are good!". Lets look at the application of this principle.

What's wrong with the following extremely common coding idiom?

  class foo{
    function foo($duration, $height, $width, $quality){
      $this->_duration = $duration;
      $this->_height = $height;
      $this->_width = $width;
      $this->_quality = $quality;
    }
    /* Useful code goes here*/
}

It's just a constructor, setting internal private variables from the initial arguments, right? What's wrong with this? Well nothing. Except... what if you want to add another argument to your constructor? (Aside from the fact that if you have more than 3 or 4 arguments to a function you might want to rethink your API). You'd just copy the bottom line, paste it, and do a s/quality/foo/ in your text editor of choice. Nothing simpler, right? You've got that M-x, CTRL-f, keystroke sequence down pat, right? Oh and don't forget to add a $foo to the parameter list of the function.

Hmm. Copy and paste is a bad code smell in its own right. And what about DRY? How many times is the identifier "duration" repeated in those four lines of code? Three times, right? Is there any way to eliminate the repetition? And how about eliminating the more subtle repetition where (essentially) "_x = x" is repeated 4 times. How about using a loop? (Line numbers added for reference)

  1class foo{
  2  function foo($duration, $height, $width, $quality){
  3    foreach(array('duration', 'height', 'width', 'quality') as $arg)
  4    {
  5      $this->{"_$arg"} = $$arg;
  5    }
  6  }
  8}

Now we can still get this a little cleaner with the use of a slightly advanced language feature. Even without it, however, we've reduced the repetition of the fundamental pattern (_x=x) down to the single ugly line which substitutes the appropriate strings for us. For those of you not used to PHP, object member references don't have to be declared, braces are used to clarify string substitution, and if $arg is a variable whose content is the string "duration", $$arg is the same thing as $duration. This code is ugly enough to be mistaken for Perl, but to those familiar with the idioms the meaning is immediately clear: set the object member variable named by concatenating "_" and the contents of $arg to equal the value of the variable whose name $arg contains.

Our repetition of identifiers is down to two instances: the argument list and the array. We can still eliminate this duplication, but unfortunately only by eliminating the argument list. The following would work as well:

 class foo{
 function foo(){
  $args = func_get_args();
  foreach(array('duration', 'height', 'width', 'quality') as $k => $name)
  {
    $this->{"_$name"} = $args[$k];
  }
}

Unfortunately this solution introduces other code smells. func_get_args() returns a numerically indexed array (in PHP the array type is also a dict, so "numerically indexed array" isn't actually a redundancy) of the arguments passed to the function. Our loop is a little cleaner with no $$ madness, but we've introduced some fatal flaws. One is a lack of clarity: using func_get_args() means you can't tell what arguments a function accepts by looking at its declaration. While in this case it's easy to see what's going on, doc generators won't be able to add any insight to this function and, more importantly, the runtime won't be able to do even its minimal function signature checking. Normally if I call a function and leave off a parameter, I get a warning. In this case, however, I cause an indexing error as I try to access a non-existent argument. These disadvantages are significant enough that I would stick with version two of the constructor, $$ badness and all. I might even go so far as to say that func_get_args() is itself a bad code smell. Further, if I needed features like default arguments I might go back to version one of the constructor.*

So you're getting the picture here, right? I'm evaluating the code I write: not on the basis of correctness (all three versions do the same thing if called correctly), not on the value of performance (the first version would presumably run the fastest since it doesn't do any string interpolation), but rather on the basis of maintainability.

Now maintainability is a subjective test with multiple axis. At least two of them, for me, are conciseness and legibility. Code should be as easy to understand and as short as possible. Conflicts between these two values must be resolved ... by the discretion of the programmer. Developing that discretion requires experimentation and study.

Let me give you one more example. This will be less contrived and less ambiguous - variable names have been changed to protect the innocent but the following code is in production in a real site.

$fp = fopen("data.txt", "r");
$values = array();
while($row = fgetcsv($fp, 1024))
{
    if($row[0] == '1015')#financial row
    {
        if(trim($row[1]) == 'TAX')
        {
            $values['tax'] = $row[3];
        }
        elseif(trim($row[1]) == 'NET SALES')
        {
            $values['sales'] = $row[3];
        }
    }
    elseif($row[0] == '1018') #inventory row
    {
        if(trim($row[1]) == 'ITEM1')
        {
            $values['item1'] = $row[2];
        }
        elseif(trim($row[1]) == 'ITEM2')
        {
            $values['item2'] = $row[2];
        }
    }
 }
 saveData($values);

This is also pretty familiar code. I'm importing data from a CSV file into a database. The CSV file looks like

    1015,'TAX          ',230,450,455
    1015,'NET SALES    ',1240,1480,9287
    1018,'ITEM1        ',424
    1018,'ITEM2        ',93
    ...

The code loops through the file, a line at a time, checks the first field to find out what kind of line we're on, then some other conditions to gather the appropriate data and store it in an array. When a single record has been read (by gathering all the relevant rows of a record in the file) the array of values is saved to the database.

Is there a better way to write this? Consider that I add an elseif condition and potentially another containing if/elseif clause every time I need to find a distinct value in the file. As it happens, I needed to read nearly 20 different types of rows, so given an average of 5 lines of code per row/type, I'd have an elseif statement at least 100 lines long. Large loops are a bad code smell themselves, and while this code is reasonably legible, it certainly isn't concise.

LAG! If you haven't figured out how to "loop" this sort of problem, it's because you think you can only loop over data: an array, a string, a file, a list... Code, however, is data. It's just that most languages obscure this reality with their syntax. Look at the following solution:

  class strategy{
    function strategy($key, $index1,$index2=Null)
    {
        $this->key = $key;
        $this->index1 = $index1;
        $this->index2 = $index2;
    }
    function apply($row)
    {
        if($row[0] == $this->index1)#now check description
        {
            if(!empty($this->index2))
            {
                if(trim($row[1]) == $this->index2)
                {
                    return $row[$this->key];
                }
            }
            else
                return $row[$this->key];
        }
        return Null;
    }
  }

  $strategies = array('waste'=> new strategy('3', '1010'),
                      'breakfast'=> new strategy('2', '1018', 'BKFST SALES'),
                      'lunch'=> new strategy(2, '1018', 'LUNCH SALES'),
                      'dinner'=> new strategy(2, '1018', 'DINNER SALES'),
                      'kidmeals'=> new strategy(2, '1018', 'TOTAL KID MEALS'),
                      'dtpct'=> new strategy(4, '1009', '03'),
                      'waste'=> new strategy(2, '1018', 'RAW WASTE'),
                      'wages'=> new strategy(2, '1018', 'LABOR DOLLARS'),
                      /* 20 more lines of definitiions here */
                      );
    $fp = fopen("data.txt", "r");
    $values = array();
    while($row = fgetcsv($fp, 1024))
    {
        $value = Null;
        foreach($strategies as $k => $v)
        {
            $value = $v->apply($row);
            if($value != Null)
            {
                $values[$k] = $value;
                break 1;
            }
        }
    }

First, notice that the big else/if section is gone! Replaced completely by a foreach loop. But what am I looping over? Code! My strategies array contains objects of type "strategy". As it turns out the repetition in my else/if statements basically consisted of checking the first field in a row for a value, optionally checking the second field for a secondary value, and then returning some field from the row. Given that description it's easy to write a function that does exactly that in generic fashion and a class that accepts as parameters the value to look for in column 1, the value to optionally look for in column 2, and the index of the column we're really after. And creating a list of objects corresponding to each kind of lookup I had to do before in an elseif statement allows me to condense the elseif list down to a foreach loop. The "break 1;" statement is purely to simulate the performance and logic benefits of an elseif over an if statement and could be left out.

What are the tradeoffs? Before I was copying and pasting (bad code smell!) an elseif clause for every additional row - now I simply add a row to my strategies array. This means my code size increases by a factor of 1x instead of say 5x lines for each additional row type I need to find. And of course eliminating duplication means printing out debug messages doesn't require adding a print statement to each clause of an elseif list, but adding a single statement in the apply method of my class. Conciseness has been improved tremendously and legibility hasn't suffered at all.

Those of you familiar with GOF design patterns will recognise the class name as tip-off to what's going on here. In dynamic languages like PHP, however, the recipe for implementing many design patterns is pretty basic: I don't have to worry about inheriting from a common base class, and really only used a class at all because PHP lacks closures. In a more advanced language I could have just created an array of functions and the whole "Strategy" design pattern pretty much boils down to "code as data" style coding. This is why I prefer the idea of principles over patterns: patterns end up being implementations that are somewhat language specific while principles are universal. DRY is a higher level of abstraction than Singleton.

But back to the rest of you - the maintainability win from this use of looping is pronounced enough that I might consider coining a new code smell: Elseif or switch statements considered harmful! What do you think?


*My contrived example with the constructor obviously has all sorts of tradeoffs and probably isn't the best place to look for conciseness. I have seen, however, literally hundreds of times, the pattern:


$row = $db->query('select * from foo');
$template->setVariable('fname',$row['fname']);
$template->setVariable('lname',$row['lname']);
$template->setVariable('zip',$row['zip']);
$template->setVariable('city',$row['city']);
$template->setVariable('state',$row['state']);
$template->setVariable('street',$row['street']);
$template->setVariable('county',$row['county']);
...

If I catch you writing this sort of code, I'm going to have to hurt you. First, use a decent template engine so you don't have to manually push values to template variables or that will simply discard values with no corresponding variable (so you can loop over the associative array you got back from the database). Even with your current stupid template library, however, you could at least write:


$row = $db->query('select * from foo');
foreach(array('fname','lname','zip','city','state','street','zip','county') as $field)
    $template->setVariable($field,$row[$field]);

Do it for me, will ya?

Update: www.nivas.hr isn't so sure about this. I may have to write my own definitive lesscode thesis to link to, but in the mean time I recommend Paul Graham's Succintness is Power. Graham is in my Geek Pantheon of Heroes and this essay is a good example why...

Posted on October 17th 2006, 11:55 PM