11th October 2008

Blog :: Pain (or why I am not a Systems Administrator)

Aaaaaaaaargh!

Ok - I feel a little better.

I hate computers. Specifically I hate being a systems administrator. I don't really have the experience or aptitude to be a sysadmin - while I don't mind debugging my own software (whose inner workings I designed), tweaking software settings to try to get something to work right is just frustrating to me. In the interest of sparing my humble reader any of my pain...

Some of the sites I run are on a VPS. It's a basic cPanel/CentOS setup which comes with the venerable Apache 1.3. I've got some sites running PHP on it and a few running Django. Now FastCGI wasn't installed, upgrading to Apache 2.x wasn't an option (I'm guessing it would cause cPanel breakage) and when I first set this up a year ago I wasn't aware of mod_wsgi. Plus I was in the mood to experiment. So... I have lighttpd running solely to serve my Django instances over FastCGI, and a mod_rewrite rule to proxy requests that should be dynamic (everything not starting with /media/ or /admin_media/) from Apache to the lighttpd instances. It's more or less worked for me and let me play around with the Django deployment side of things without messing with the stable and working Apache. A sample .htaccess file and lighty section follow:

.htaccess:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(django.fcgi)
RewriteRule ^(.*)$ http://127.0.0.1:9006/$1 [QSA,P]

and lighttpd.conf:

$SERVER["socket"] == "127.0.0.1:9006" {
  server.document-root = "/home/simeon/public_html"
  fastcgi.server = (
    "/lighty.fcgi" => (
        "main" => (
            "socket" => "/home/simeon/mysite.sock",
            "check-local" => "disable",
        )
    ),
  )
  url.rewrite-once = (
    "^(/.*)$" => "/lighty.fcgi$1",
  )
}
Pretty simple stuff. And that's the way I like it (did I mention that I'm not a sys-admin?)
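
To make the two hops concrete, here is a rough Python sketch of what the rules above do to a request path. The function names are mine, not anything Apache or lighttpd provide. The sketch ignores the RewriteCond checks and query strings and assumes the .htaccess file lives in the document root. It only mirrors the RewriteRule and the url.rewrite-once pattern.

```python
import re

# Backend address taken from the RewriteRule above.
LIGHTY_BACKEND = "http://127.0.0.1:9006"


def apache_proxy_target(path):
    """Roughly mimic `RewriteRule ^(.*)$ http://127.0.0.1:9006/$1 [QSA,P]`.

    In a per-directory (.htaccess) context the leading slash is stripped
    before the pattern is matched, so $1 is the path without it.
    """
    return LIGHTY_BACKEND + "/" + path.lstrip("/")


def lighty_rewrite(path):
    """Roughly mimic `"^(/.*)$" => "/lighty.fcgi$1"` from url.rewrite-once."""
    return re.sub(r"^(/.*)$", r"/lighty.fcgi\1", path)


print(apache_proxy_target("/foo/"))  # http://127.0.0.1:9006/foo/
print(lighty_rewrite("/foo/"))       # /lighty.fcgi/foo/
```

Note that every request reaches the Django process with /lighty.fcgi glued to the front of the path.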

Ok - the first problem I just solved wasn't all that painful and was really my own fault to boot. I've been running my Django sites off of trunk but stopped tracking it back when NFA (newforms-admin) merged. I figured I'd wait for 1.0 before doing any backwards compatibility breaking stuff. So the arrival of 1.0 was great for me - I'd played with NFA and set up a couple of Django sites on dedicated servers with mod_python. No surprises, no problems.

Until I went to deploy a 1.0 site on the VPS with Apache proxied to lighty. All of a sudden lighty.fcgi started showing up as the root URL for my Django site. The {% url %} tag was prefixing paths (/lighty.fcgi/foo/ instead of just plain /foo/). I tried modifying the lighttpd.conf file to use a blank string or just "/" as my root URL, I tried adding uri-strip clauses to the config... Nothing worked.

Let's stop right there to see how the 45 minutes of pain are all my fault so far. Where do we go, boys and girls, when we upgrade Django and stuff that used to work no longer does? That's right - the official list of Backwards Incompatible Changes. Now in fairness this list is getting pretty long (which is what happens when you go so long between releases - things should be better going forwards) but halfway down the list is item 52, "Changed the way URL paths are determined". It explains that various servers break the SCRIPT_NAME and PATH_INFO variables in various ways. Django now does the right thing by paying attention to SCRIPT_NAME, but has introduced a new setting, FORCE_SCRIPT_NAME, so that you can override this if your particular choice of server software is doing dumb things. The added line to my settings.py

FORCE_SCRIPT_NAME = "" #not "/" as they suggested, interestingly enough
put me back in business. No more phantom script names.
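
For anyone wondering why an empty string works, here is a simplified Python sketch of the idea. It is not Django's actual code. The function names are made up for illustration, and the SCRIPT_NAME value is what I assume lighty passes along given the rewrite to /lighty.fcgi.

```python
def effective_script_name(environ, force_script_name=None):
    """Pick the URL prefix: an explicit override wins, even if it is empty."""
    if force_script_name is not None:
        return force_script_name
    return environ.get("SCRIPT_NAME", "")


def build_url(script_name, path):
    """Join the prefix and an app-relative path with exactly one slash."""
    return script_name.rstrip("/") + "/" + path.lstrip("/")


# Assumed values for a request to /foo/ after lighty's rewrite.
environ = {"SCRIPT_NAME": "/lighty.fcgi", "PATH_INFO": "/foo/"}

print(build_url(effective_script_name(environ), "/foo/"))      # /lighty.fcgi/foo/
print(build_url(effective_script_name(environ, ""), "/foo/"))  # /foo/
```

The important detail is the `is not None` check: an empty override is still an override, so the server-supplied prefix is dropped.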

The next problem was harder and I still don't understand what's wrong. I have fixed it however... and I hope to spare some future tormented soul a bout of frenzied swearing. Here's the first manifestation of the problem I noticed: my Django sites are slow! This isn't a performance issue - memory usage is fine and the processor isn't loaded at all. It takes 5 seconds, however, for even a simple page to display. All the static media (served by Apache) comes across fast (100ms), but the main request that is proxied to lighttpd takes 5 seconds!

After some poking around I discover that running the Apache benchmark tool directly on the lighty instance (ab -n10 http://127.0.0.1:9006/) from a shell session on the VPS shows millisecond response times (<100ms) but running it on the actual domain (ab -n10 http://simeonfranklin.com/) generates response times that are all greater than 5 seconds! The issue isn't Django, it isn't lighty, it's the Apache proxy => lighty interaction that is somehow causing the slowdown.
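
If ab isn't handy, the same comparison can be roughed out in a few lines of Python. This is only a sketch: the helper names are mine, it makes sequential requests only, and it is no substitute for a real benchmarking tool.

```python
import time
import urllib.request


def fetch_once(url):
    """Fetch a URL and read the whole body so the full response is timed."""
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()


def time_requests(url, n=10, fetcher=fetch_once):
    """Return a list of wall-clock durations (seconds) for n sequential requests."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        fetcher(url)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of durations to min, max and mean."""
    return {
        "min": min(timings),
        "max": max(timings),
        "mean": sum(timings) / len(timings),
    }
```

Calling summarize(time_requests("http://127.0.0.1:9006/")) and then the same for the public domain from a shell on the VPS should show the same gap that ab reported, if the proxy hop really is the slow part.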

How do you go about troubleshooting this? My error logs are clean so I don't have any immediate clues. Googling "apache proxy slow" yields a host of non-helpful complaints. I started thinking about things that might be causing a delay and spent some time turning off KeepAlives and checking every setting that involved keeping an HTTP connection open. No joy.

Eventually (I'm seriously embarrassed to admit how much time I've spent on this) I find a post to the Zope mailing list with a possible answer - if Apache is forced to do DNS resolution it may cause consistent time delays on proxy requests. Aha! I add the FQDN to my /etc/hosts file, verify my resolv.conf, check my httpd.conf file to make sure there aren't any DNS problems... Still no joy! I replace hostnames with IP addresses and fiddle with Apache settings to keep it from doing DNS lookups. Still the same frustrating, maddening delay!
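
One quick way to check whether name resolution on the box is slow in general is to time it from Python. This is a hedged sketch with a helper name of my own invention. It measures the system resolver as seen by a Python process, which is not necessarily the same code path Apache 1.3 takes when proxying.

```python
import socket
import time


def resolution_time(host, port=80):
    """Return the seconds spent resolving host via the system resolver."""
    start = time.perf_counter()
    socket.getaddrinfo(host, port)
    return time.perf_counter() - start


# A numeric address needs no DNS query, so this should return almost instantly.
print(resolution_time("127.0.0.1", 9006))
```

Running it for "localhost" and for the site's FQDN as well would show whether any of them takes seconds rather than milliseconds.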

Fine. I get it. I'm not going to be able to fix this and I start looking at compiling mod_wsgi for Apache 1.3. First I put the domain names back in my httpd.conf file and for some reason feel moved to try a domain in my RewriteRule instead of an IP address. Un-be-lievable.

When my .htaccess file looks like this:

RewriteRule ^(.*)$ http://127.0.0.1:9006/$1 [QSA,P]
requests take 5+ seconds to return. When I put in this:
RewriteRule ^(.*)$ http://localhost:9006/$1 [QSA,P]
I suddenly start getting sub 200ms response times.

Does that make sense to anybody? Me neither... Forcing DNS resolution (well - localhost presumably gets looked up in /etc/hosts) instead of using an IP directly results in an order of magnitude speedup? I hate computers. Oh and if anybody else finds themselves in the same situation and this is helpful - email me for my home address - I like the chocolates with creamy fillings.

Posted on October 11th 2008, 12:22 AM


