Getting rid of errant HTTP requests

Posted on November 24, 2005, under apache, general, mirroring.

When you run a busy website, you’re bound to pick up a lot of wacky requests, and some downright broken clients. Annoying behaviour ranges from repeated nuisance requests all the way to a full-scale Denial of Service attack. Competent mail server administrators will be very familiar with protocol-level techniques to hinder these requests, but you don’t hear much about them in the HTTP world.

This is partly because, luckily, abuse is rarer in HTTP, and partly because not many people actually read their logs all that closely. On our servers, though, we get many, many millions of requests per day and the bad boys all add up. We see broken browsers, broken mirror scripts, huge “wget -r” or lftp grabs of massive portions of our tree, and paths on our server hardcoded into applications which consider us the most reliable place to fetch an XSL file or to check for the latest version of a Perl package.

And we’re not alone: Dave Malone had to deal with a ridiculous NTP client which was using his web server as a time source (yep, NTP via HTTP Date headers!), and it wasn’t even being polite enough to use a HEAD request; it was actually using “GET / HTTP/1.0\r\n\r\n”. We had to patch Apache to get around that one.

Over time, we’ve developed a few tactics to help us defeat these annoying requests and get them off of our server as quickly as possible. The first trick, of course, is to identify them. Any decent log analysis package, or even just getting a regular “feel” for traffic through mod_status, will quickly identify odd requests. If you see 250,000 requests for an XSL file, you know something is up. Likewise, if you observe that a particular host is constantly connected to you, it’s possible there’s something that needs looking at.
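
For what it’s worth, a minimal mod_status setup is all that’s needed to get that “feel”; this is a generic sketch rather than our exact configuration:

    # Generic example: enable the status page and restrict who can read it.
    ExtendedStatus On
    <Location /server-status>
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Location>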

The next thing to look at is whether these requests are really a problem or not. In our case we can tolerate 250,000 requests for an XSL file; after all, we’re not short of bandwidth, which is the main resource being used. But it’s not something we would want to leave unchecked: we’re there to serve all sorts of content, not just XSL files. Huge “wget -r”s, or clients which poll us far too often, are a concern for us though, because we optimise the server for the long downloads that make up most of our traffic. We don’t want to see lots and lots of small requests and all of the context switching that entails; they slow down the responsiveness of the system.

Unfortunately, it’s pretty rare that these illegitimate requests come from single, fixed IP addresses, and when they do, more often than not that address is a proxy or a NAT box which also serves many legitimate requests, so just applying an ACL doesn’t always suit. Instead we use mod_setenvif and mod_rewrite to classify requests based on their more exact nature.
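
As an illustration of the classification step (the patterns and the variable name here are made up, not our real rules), mod_setenvif can tag suspect requests with an environment variable that later directives are able to test:

    # Hypothetical examples of flagging nuisance traffic.
    # "nuisance" is just an environment variable name; later configuration can
    # test it with "Deny from env=nuisance" or a RewriteCond on %{ENV:nuisance}.

    # A broken mirror script that announces itself in its User-Agent header
    BrowserMatch "ExampleBrokenMirror/1\.0" nuisance

    # An application that has hard-coded one of our files as its XSL source
    SetEnvIf Request_URI "^/foo/bar/blah\.xsl$" nuisance

    # A recursive grab arriving with wget's default User-Agent
    SetEnvIf User-Agent "^Wget" nuisance

mod_rewrite can then key off the same variable, which is how the tactics below get applied to the flagged clients and to nobody else.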

Once we’ve done that, how we deal with them falls into four different categories:

  1. Malformed output

    The first thing we tried was simply returning malformed data at the URL they expect.

    So if a client was persistently querying, say, /foo/bar/blah.xsl, we would return an XSL file crafted to be utterly broken and to contain lots of comments explaining why (though of course only to this client; other users get the original file). A configuration sketch of this selective-decoy trick appears after this list. This is the same tactic Dave Malone employed to combat the bold NTP clients. We patched Apache so that a Date header set with mod_headers would work (ordinarily Apache doesn’t let anything else set the Date header) and returned Dec 31st 1999 to every such client.

  2. Teergrubbing

    For a lot of cases, malformed output works pretty well. But for others, typically automated processes long forgotten by their owners, it does little. For those we next tried a variation on the “teergrubbing” used in some SMTP-level anti-spam defences. We just redirect to a CGI that does only a very little more than something like:

    echo "Content-type: text/plain"
    while true; do
        echo please contact
        sleep 10

    That worked pretty well, and caught a lot of brokenness, including people who had hard-coded us as a DTD source. It still left us with some annoying stragglers, though.

  3. CPU exhaustion

    The next trick we try, and one which is really pretty dirty, is to mount a Denial of Service attack of our own on the HTTP client. Typically these clients aren’t well written, and if they have any loop-prevention at all, it’s basic in the extreme. We exploit this by trying to cause the client to loop and loop between successive HTTP requests. Now, it’s relatively easy for anyone to detect a URI that redirects to itself, even with one or two levels of indirection, so instead we do:

    echo "Location: /thiscgi?$RANDOM"

    Now that’s what I call mean. Our system can easily take the load and bandwidth this causes; theirs cannot, and it can pretty quickly wear them out. Soon enough we see the requests die.

  4. Memory exhaustion

    Something else I’ve been playing with lately is using the Content-Length header, or the features of chunked encoding, to try and exhaust memory on these clients. Many of the dumb clients seem to allocate a single buffer for the entire response, especially when using chunked encoding. By trying to make the client allocate several gigabytes of memory, they can occasionally be stopped dead in their tracks. A sketch of this one also appears after this list.
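
To make the selective decoy from item 1 a little more concrete, here is a rough sketch rather than our actual configuration; it assumes the hypothetical “nuisance” variable from the mod_setenvif example earlier, and the decoy path is invented:

    # Flagged clients get a deliberately broken decoy; everyone else
    # continues to receive the real file.
    RewriteEngine On
    RewriteCond %{ENV:nuisance} =1
    RewriteRule ^/foo/bar/blah\.xsl$ /decoys/blah-broken.xsl [L]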
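
And a sketch of the item 4 idea, again in CGI form rather than the module we really use; the numbers are arbitrary, and it assumes the server passes the script’s Content-Length header through to the client untouched:

    #!/bin/sh
    # Advertise an enormous body so a naive client that pre-allocates a
    # buffer of Content-Length bytes hurts itself, then trickle data out.
    echo "Content-type: application/octet-stream"
    echo "Content-Length: 4294967295"
    echo
    while true; do
        echo "x"
        sleep 30
    done

The chunked-encoding variant, advertising an absurd chunk size, needs control below the CGI interface, which is one more reason a small module is a better fit there.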

In reality, we actually implement the above with tiny Apache modules rather than CGIs, for the sake of efficiency. Of course, all of these tactics are really only appropriate when you’re dealing with remote tasks someone is inappropriately running, or when they’re using woefully outdated, harmful software, and only if the requests are easily categorisable, and even then only if the CPU hit it causes on the server results in a net drop in these requests over time.

3 Replies to "Getting rid of errant HTTP requests"


jmason  on November 24, 2005

I tried something similar with a case of referrer spamming a while back.

It’s worth noting that CGIs aren’t a viable tactic for #2 on smaller sites, since keeping that CGI running ties up an Apache thread (as far as I could tell); too many of those, and your site stops responding because all of its threads are busy running teergrubes.


colmmacc  on November 24, 2005

Using CGI will add one extra process to the mix per teergrube, but it’s not going to be too resource intensive. Unfortunately the thread/process tie-up problem doesn’t go away with a module, and if the system has problems with process exhaustion, or if MaxClients is being hit, then the long-lived sessions will definitely cause problems.

However, the event MPM offers a good solution to this, as it can pool the long-lived connections the way it handles keepalive connections, leaving the ordinary worker threads free for responsive handling of real requests.

Another solution would be to send a regular httpd a graceful-stop and then a start signal every now and then. The existing connections will be maintained and continue to be teergrubed, but a new httpd instance with all of MaxClients available will be back. And the admin can renice the old httpd to slow it down even further.

We weren’t being harvested for email addresses, so at most we had to teergrube a few dozen connections at once.


Ian Holsman  on November 25, 2005

Hi Colm.
a better approach to the redirect would be to stick the random string inside the URL itself.
something like /thiscgi/$RANDOM instead of /thiscgi?$RANDOM, so you catch things which strip out query args.

and you might even be able to do this via mod_rewrite instead of a CGI, using something like the request time instead of RANDOM.

also..remember to put
Disallow: /thiscgi
into your robots.txt file, so you can claim you gave the bots fair warning ;-)

but from my experience, it doesn’t help. the stupid client bots still keep coming. I’ve had one (from the same IP) coming for over 2 years. yes, I’ve been feeding him crap for 2 years, and he still keeps on coming… the only solution I know is to wait until the machine he has his cron job on gets reinstalled.
