Archive for November, 2005

Getting rid of errant HTTP requests

Posted on November 24, 2005, under apache, general, mirroring.

When you run a busy website, you’re bound to pick up a lot of wacky requests, and some downright broken clients. Annoying behaviour can range from repeated nuisance requests to a full-scale Denial of Service attack. Competent mail server administrators will be very familiar with protocol-level techniques for hindering these requests, but you don’t hear much about them within HTTP.

This is partly because, luckily, abuse is rarer in HTTP, and because not many people actually read their logs all that closely. Over at our site, though, we get many millions of requests per day, and the bad boys all add up. We see broken browsers, broken mirror scripts, huge “wget -r” or lftp grabs of massive portions of our tree, and paths on our server hardcoded into applications which consider us the most reliable place to fetch an XSL file or check for the latest version of a Perl package.

And we’re not alone: Dave Malone had to deal with a ridiculous NTP client which was using his webserver as a time source (yep, NTP via HTTP Date headers!), and it wasn’t even being polite enough to use a HEAD request; it was actually using “GET / HTTP/1.0\r\n\r\n”. We had to patch Apache to get around that one.

Over time, we’ve developed a few tactics for defeating these annoying requests and getting them off of our server as quickly as possible. The first trick, of course, is to identify them in the first place. Any decent log analysis package, or even just getting a regular “feel” for traffic through mod_status, will quickly identify odd requests. If you see 250,000 requests for an XSL file, you know something is up. Likewise, if you observe that a particular host is constantly connected to you, it’s possible there’s something that needs looking at.
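Even without a log analysis package, a quick shell pipeline over the access log will surface the worst offenders. This is just a sketch: it assumes the Common Log Format (where the request path is the seventh whitespace-separated field), and the log path in the comment is illustrative.

```shell
#!/bin/sh
# Count the most-requested paths in a Common Log Format access log.
# The request line is field 6 ("GET), 7 (/path) and 8 (HTTP/1.x"),
# so field 7 is the path itself.
count_paths() {
    awk '{ print $7 }' "$1" | sort | uniq -c | sort -rn | head -20
}

# e.g. count_paths /var/log/apache/access_log
```

If one URL or one client dominates the top of that list by orders of magnitude, it’s worth a closer look.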

The next thing to look at is whether these requests are really a problem or not. In our case we can tolerate 250,000 requests for an XSL file; after all, we’re not short of bandwidth, which is the main resource being used. But it’s not something we would want to leave unchecked; we’re there to serve all sorts of content, not just XSL files. Huge “wget -r”s, or clients which poll far too often, are a concern for us though, because we optimise the server for the long downloads that make up most of our traffic. We don’t want to see lots and lots of small requests and all of the context switching that entails; they slow down the responsiveness of the system.

Unfortunately, it’s pretty rare that these illegitimate requests come from single, fixed IP addresses, and when they do, more often than not that address is a proxy or a NAT box that also serves many legitimate requests, so just applying an ACL doesn’t always suit. Instead we use mod_setenvif and mod_rewrite to classify requests based on their more exact nature.
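As a sketch of what that classification can look like (the User-Agent pattern, file names and variable name here are illustrative, not our real configuration):

```apache
# Flag a hypothetical broken client by its User-Agent
BrowserMatch "BrokenFetcher" bad_client

# Requests with no User-Agent at all are often worth flagging too
SetEnvIf User-Agent "^$" bad_client

# Once flagged, the variable can drive ordinary access control...
<Files "blah.xsl">
    Order allow,deny
    Allow from all
    Deny from env=bad_client
</Files>

# ...or a mod_rewrite rule that quietly serves different content
RewriteEngine on
RewriteCond %{ENV:bad_client} =1
RewriteRule ^/foo/bar/blah\.xsl$ /broken/blah.xsl [L]
```

The point is that the match is on the nature of the request, not the source address, so legitimate users behind the same proxy are untouched.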

Once we’ve done that, how we deal with them falls into four categories:

  1. Malformed output

    The first thing we tried was simply returning malformed data at the URL they expect.

    So if a client was persistently querying, say, /foo/bar/blah.xsl, we would return an XSL file that was crafted to be utterly broken and to contain lots of comments explaining why (though of course only to this client; other users get the original file). This is the same tactic Dave Malone employed to combat the bold NTP clients. We patched Apache so that a Date header set with mod_headers would work (ordinarily Apache doesn’t let anything else set a Date header) and returned Dec 31st 1999 to every such client.

  2. Teergrubbing

    For a lot of cases, malformed output works pretty well. But for others, typically automated processes long forgotten by their owners, it does little. For those we next tried a variation on the “teergrubbing” used in some SMTP-level anti-spam defences. We just redirect to a CGI that does very little more than something like:

    #!/bin/sh
    echo "Content-type: text/plain"
    echo
    while true; do
        echo please contact
        sleep 10
    done

    That worked pretty well, and caught a lot of brokenness, including people who had hard-coded us as a DTD source. It still left us with some annoying stragglers, though.

  3. CPU exhaustion

    The next trick we try, and one which is really pretty dirty, is to mount a Denial of Service attack of our own on the HTTP client. Typically these clients aren’t well written, and if they have any loop-prevention at all, it’s basic in the extreme. We exploit this by causing their client to loop and loop between successive HTTP requests. Now, it’s relatively easy for anyone to detect a URI that redirects to itself, even with one or two levels of indirection, so instead we do:

    echo "Location: /thiscgi?$RANDOM"
    echo

    Now that’s what I call mean. Our system can easily take the load and bandwidth this causes; theirs cannot, and it can pretty quickly wear them out. Soon enough we see the requests die.

  4. Memory exhaustion

    Something else I’ve been playing with lately is using the Content-Length header, or the features of chunked encoding, to try and exhaust memory on these clients. Many of the dumb clients seem to allocate a single buffer for the entire response, especially when using chunked encoding. By trying to make the client allocate several gigabytes of memory, they can occasionally be stopped dead in their tracks.
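A minimal sketch of the Content-Length variant as a CGI; the advertised size and the trickle rate here are purely illustrative.

```shell
#!/bin/sh
# Hypothetical sketch: advertise a ~4GB response so that clients which
# pre-allocate a single buffer for the whole body try to grab several
# gigabytes up front.
emit_headers() {
    printf 'Content-type: application/octet-stream\r\n'
    printf 'Content-Length: 4294967295\r\n'
    printf '\r\n'
}

emit_headers
# Then trickle out a few bytes, so the client keeps holding the
# allocation while it waits for the rest of the body to arrive.
i=0
while [ "$i" -lt 3 ]; do
    printf 'x'
    sleep 1
    i=$((i + 1))
done
```

The chunked-encoding equivalent is to declare one enormous chunk size and then dribble the chunk out at the same glacial pace.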

In reality, we actually implement the above with tiny Apache modules rather than CGIs, for the sake of efficiency. Of course, all of these are really only appropriate when you’re dealing with remote tasks someone is inappropriately running, or with woefully outdated, harmful software; only if the requests are easily categorisable; and even then only if the CPU hit on the server results in a net drop in these requests over time.

Hallo Von München

Posted on November 20, 2005, under general.

I’m over in Munich, where it’s starting to get a bit cold. I haven’t had long here (though I’ll be back in 2 weeks) but it’s a very nice place. I’ve been staying out with Noirin, at the OlympiaZentrum, right beside the Olympic Stadium, the BMW building and lots more.

Colm in Munich

As usual, the integrated transport system is an awful lot better than Dublin’s mess, especially as Munich is actually smaller than Dublin in both population and geographical size. It’s a real pity that Dublin doesn’t have anything comparable, because a great public transport network has a great social impact too. Here, and in London, New York, Paris and everywhere else I’ve been that has a real network, there have always been examples of people from all walks of life, side by side on the same journeys.

In Dublin we have transport apartheid, with real class segregation. For example, if you live in south-east Dublin, the journey into town you take every day will go blissfully through leafy suburb after leafy suburb and classy Georgian Dublin, and then you’ll probably park right behind BTs and make use of the Powerscourt Centre, and maybe Grafton St. if you’re feeling adventurous. If you don’t feel like a drive, sure you can get the Dart anyway. All the while missing out on the real Dublin. Occasionally there might be a bus journey where you’ll have to tolerate two dozen teenagers with their 150 euro each to get plastered in the Wesley, but there won’t be much dealing with an upper deck full of scumbags smoking away. To me, that’s one of the biggest failings of a lack of transport integration. It helps perpetuate an attitude of real social ignorance, of comfortable isolation. It makes Ross O’Carroll-Kelly a possibility.

Unfortunately for us, Munich had the Olympics to solve its transport problems, and we’ve got Martin Cullen.