(Untitled)
lol fixed the scrapers scraping 4scrape problem (or, as much as I’m willing to).
The problem, as I see it, is that scrapers have a very abnormal usage pattern compared to normal visitors: they’re just pulling the big, heavy images, which cost a lot of bandwidth and CPU (in the form of I/O interrupts) to serve. My solution was to write a lighttpd module that throttles the number of “expensive” requests you can make. I don’t really give a shit that they’re scraping; what pisses me off is that they’re bringing the server to its knees by doing so.
Each IP address is associated with an integer representing the amount of resources it’s used. This value decreases by a configurable amount each second and increases whenever that IP accesses a marked resource. Once it gets over a maximum amount, the module starts throwing back 403 responses.
The way I’ve got it configured right now is to add 20 points for each full-size image fetched, then decay that value by 1 point/second. The maximum is 200 points before the server starts 403’ing, with a 600-point upper bound on the counter. Basically, this gives you a wallpaper every 20 seconds, allowing for bursts of activity.
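For the curious, here’s a rough Python sketch of the counter logic with those numbers plugged in. This is not the module’s actual code (that’s C, inside lighttpd); the `Throttle` class, its parameter names, and the injectable `clock` are all made up for illustration. I’m also assuming that denied requests still get charged, which is the only way the 600-point bound would ever matter.

```python
import time


class Throttle:
    """Per-IP decaying cost counter (illustrative sketch, hypothetical names)."""

    def __init__(self, cost=20, decay_per_sec=1, limit=200, cap=600,
                 clock=time.monotonic):
        self.cost = cost                    # points added per expensive request
        self.decay_per_sec = decay_per_sec  # points removed per second
        self.limit = limit                  # above this, answer with a 403
        self.cap = cap                      # counter never grows past this
        self.clock = clock                  # injectable so it can be tested
        self.state = {}                     # ip -> (points, time of last update)

    def _decayed(self, ip, now):
        # Apply the decay lazily, based on time elapsed since the last request.
        points, last = self.state.get(ip, (0.0, now))
        return max(0.0, points - (now - last) * self.decay_per_sec)

    def allow(self, ip):
        """Charge one expensive request; return False if it should get a 403."""
        now = self.clock()
        # Assumption: every request is charged, even ones that end up denied.
        points = min(self.cap, self._decayed(ip, now) + self.cost)
        self.state[ip] = (points, now)
        return points <= self.limit
```

With the defaults, a fresh IP can grab 10 images back to back (10 × 20 = 200 points) before the next one is refused, and after that it earns back one image every 20 seconds.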
I’ve also got code in place that stat()s the requested files and bases the cost on the file size, but haven’t gotten around to testing it yet. Deploying a fucking lighttpd plugin was a goddamn nightmare (at least, compared to an Apache module).
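As for the size-based cost, the idea is roughly the following. Again, a sketch and not the real thing: the `BYTES_PER_POINT` scale, the `MIN_COST` floor, and both function names are assumptions I picked so that a ~1 MiB image lands near the flat 20-point cost.

```python
import os

# Hypothetical scale: one point per 50 KiB, never less than one point.
BYTES_PER_POINT = 50 * 1024
MIN_COST = 1


def cost_for_size(size_bytes):
    """Map a file size to a point cost, rounding up to a whole point."""
    points = -(-size_bytes // BYTES_PER_POINT)  # ceiling division
    return max(MIN_COST, points)


def cost_for_path(path):
    """stat() the file on disk and price the request by its size."""
    return cost_for_size(os.stat(path).st_size)
```

The result would simply replace the fixed per-image cost when charging a request.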
Anyway, we’ll see how this pans out. [source]