So I’ve been wanking up a new upload system for Fakku: basically one that lets Jacob et al. just upload a zip file and attach metadata all through a webform (inb4 uploading large files over HTTP is silly), and it automatically unpacks the archive and uploads the data to the appropriate servers. It’s a pretty straightforward thing, except for fucking progress bars.
When I read through the project specs the progress bar raised a red flag. There simply isn’t any standardized way to query the current status of an upload. As far as I can see, there are only two ways to do it:
- Use flash to do the upload.
- Have the webserver provide a JSON interface to query.
The first is fucking stupid — I hate flash.
The second is much more reasonable. The idea is that the browser provides a random key in part of the POST data to identify the upload uniquely. The web server then provides a JSON endpoint to query the status of uploads by the client-specified key. Naturally, this is a pretty straightforward thing and there are implementations available for various servers. Unfortunately, Fakku is running lighttpd-1.4.x which has no such implementation. There’s a mod_uploadprogress for 1.5, but 1.5 is a worthless broken piece of shit.
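To make the idea concrete, here’s a rough sketch of the bookkeeping such an endpoint needs. This isn’t any real server’s implementation; `UploadTracker`, `new_upload_key` and the JSON field names are all made up for illustration, and in practice the web server itself would call `add_bytes` as POST data arrives.

```python
import json
import secrets


def new_upload_key():
    """Hypothetical helper: the random key the browser would send with its POST."""
    return secrets.token_hex(16)


class UploadTracker:
    """In-memory map from client-supplied key to upload progress (illustrative only)."""

    def __init__(self):
        self._uploads = {}

    def start(self, key, total_bytes):
        # total_bytes would come from the request's Content-Length header
        self._uploads[key] = {"received": 0, "total": total_bytes}

    def add_bytes(self, key, n):
        # Called by the server each time it reads n more bytes of the request body
        self._uploads[key]["received"] += n

    def status_json(self, key):
        # What the JSON endpoint would return when polled with a given key
        entry = self._uploads.get(key)
        if entry is None:
            return json.dumps({"state": "unknown"})
        state = "done" if entry["received"] >= entry["total"] else "uploading"
        return json.dumps({"state": state, **entry})
```

The browser would generate the key, kick off the upload, then poll the status endpoint with that key every second or so and draw the bar from `received / total`.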
Anyway, the reason 1.4 doesn’t have an upload progress module is apparently that Lighttpd buffers requests before sending them to a backend. This seems like a stupid reason to me. Why the hell would that prevent the web server from providing transfer information to the client? The backend has absolutely nothing to do with the transfer of POST data.
Before writing a full-blown Lighttpd module to implement this, I took a look at how Lighttpd actually handles the buffering of uploaded data. I have to say, it’s pretty hacky. It stores the data in /var/tmp/lighttpd-upload-* (with names generated via something like tempnam). It splits large files into many 1MB buffers.
So I hacked together a PHP script which provides two JSON methods: one to acquire the name of the most recently modified (I wish POSIX let me query file creation time) Lighttpd upload buffer, and another to look up the current size of a buffer based on its name. This results in a shitty interface which allows you to basically track how much data you’ve uploaded, except the transfer isn’t linked to the client, so it’s got a race condition by design. And you can’t determine the full size of the upload; you only get the amount uploaded so far (so you couldn’t implement a progress bar or ETA).
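For the curious, the two methods look roughly like this. The original was PHP; this is a Python sketch of the same idea, not the actual script, and the function names, JSON fields and the file-name check are my own assumptions. It has the same race condition described above: "most recently modified" is only a guess at which buffer is yours.

```python
import glob
import json
import os

UPLOAD_PREFIX = "lighttpd-upload-"  # buffer file prefix seen under /var/tmp


def latest_buffer(directory="/var/tmp"):
    """Return JSON naming the most recently modified upload buffer, if any."""
    paths = glob.glob(os.path.join(directory, UPLOAD_PREFIX + "*"))
    if not paths:
        return json.dumps({"name": None})
    newest = max(paths, key=os.path.getmtime)  # mtime, since creation time isn't available
    return json.dumps({"name": os.path.basename(newest)})


def buffer_size(name, directory="/var/tmp"):
    """Return JSON with the current size in bytes of the named buffer."""
    # Only accept a bare buffer file name, so callers can't stat arbitrary paths
    if os.path.basename(name) != name or not name.startswith(UPLOAD_PREFIX):
        return json.dumps({"error": "bad name"})
    try:
        size = os.path.getsize(os.path.join(directory, name))
    except OSError:
        return json.dumps({"error": "not found"})
    return json.dumps({"name": name, "size": size})
```

Since large uploads get split across several buffers, a single buffer’s size is only part of the picture; a less hacky version would have to add up all the buffers belonging to one request.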
But I’m willing to bet it wouldn’t be impossible to write a Lighttpd module which provided an interface to query this shit. Dunno if I’ll get around to hammering it out though; Lighttpd modules are a pain in the dick to code — there’s close to no documentation aside from other existing modules.
lol fixed the scrapers scraping 4scrape problem (or, as much as I’m willing to).
The problem, as I see it, is that scrapers have a very abnormal usage pattern compared to normal visitors: they’re just pulling the big heavy images, which cost a lot of bandwidth and CPU (in the form of I/O interrupts) to serve. My solution to the problem was to write a lighttpd module which basically throttles the number of “expensive” requests you can make. I don’t really give a shit that they’re scraping; what pisses me off is that they’re bringing the server to its knees by doing so.
Each IP address is associated with an integer representing the amount of resources it’s used. This value is reduced by a configurable amount each second, and increases whenever that address accesses a marked resource. Once it gets over a maximum amount, the module starts throwing back 403 responses.
The way I’ve got it configured right now is to add 20 points for each full-size image fetched, then decay that value by 1 point/second. The maximum is 200 points before the server starts 403'ing, with a 600-point upper bound. Basically, this gives you a wallpaper every 20 seconds, allowing for bursts of activity.
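Here’s the scheme above as a small Python sketch, using the numbers from my config. The real thing is a lighttpd module written in C; the `Throttle` class and its names are invented for illustration, and I’m assuming here that a request which gets a 403 still adds points (which is what makes the 600-point bound matter).

```python
import time


class Throttle:
    """Per-IP point counter with linear decay (illustrative sketch)."""

    def __init__(self, cost=20, decay_per_sec=1, limit=200, cap=600,
                 clock=time.monotonic):
        self.cost = cost                    # points added per expensive request
        self.decay_per_sec = decay_per_sec  # points removed per second
        self.limit = limit                  # above this, respond with 403
        self.cap = cap                      # points never exceed this
        self.clock = clock                  # injectable so it can be tested
        self._state = {}                    # ip -> (points, time of last update)

    def _decayed(self, ip, now):
        # Apply the decay lazily, based on time elapsed since the last request
        points, last = self._state.get(ip, (0.0, now))
        return max(0.0, points - (now - last) * self.decay_per_sec)

    def hit(self, ip):
        """Record one expensive request and return the HTTP status to send."""
        now = self.clock()
        points = min(self.cap, self._decayed(ip, now) + self.cost)
        self._state[ip] = (points, now)
        return 403 if points > self.limit else 200
```

Decaying lazily on each request means there’s no per-second timer walking over every known IP; the table only gets touched when that IP actually shows up.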
I’ve also got code in place that stat's the requested files and bases the cost on the file size, but haven’t gotten around to testing it yet. Deploying a fucking lighttpd plugin was a goddamn nightmare (at least, compared to an Apache module) because the lighttpd developers are retards or something.
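The size-based cost could look something like the sketch below. Every number and name in it is a placeholder (`request_cost`, `bytes_per_point` and the clamping bounds are not from the module), so treat it as the general shape rather than the actual scaling.

```python
import os


def request_cost(path, bytes_per_point=50_000, min_cost=1, max_cost=100):
    """Charge more points for bigger files, clamped to [min_cost, max_cost]."""
    size = os.stat(path).st_size
    cost = -(-size // bytes_per_point)  # integer ceiling division
    return max(min_cost, min(max_cost, cost))
```

The clamp keeps tiny files from being free and stops one enormous file from blowing through the whole budget in a single request.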
Anyway, we’ll see how this pans out. [source]
I mentioned earlier that I was looking at lighttpd's features and was really impressed. I mean, how could I not be? Built-in load distribution! I actually went ahead and installed it on one of my machines after writing that post, then forgot about it.
So I came back to it tonight, and started going through the docs. Well. What docs there were…
suigintou# man lighttpd
NAME
    lighttpd - a fast, secure and flexible webserver
SYNOPSIS
    lighttpd -D -f
DESCRIPTION
FILES
    /etc/lighttpd/lighttpd.conf
CONFORMING TO
    HTTP/1.0 HTTP/1.0 HTTP-Authentification - Basic, Digest FastCGI CGI/1.1
SEE ALSO
    spawn-fcgi(1)
AUTHOR
    email@example.com
Okay, so no real man page. Whatever.
Like Apache, the lighttpd.conf.sample is littered with lots of comments about the syntax, configurable options, etc. So the first thing I try to do is change the port it’s running on, because I don’t need to be broadcasting that I’m running a webserver. So I open up the config file and add in the appropriate line:
## bind to port (default: 80)
#server.port = 81
server.port = 7000
Then I go to start it.
suigintou# lighttpd
2007-12-14 22:20:08: (server.c.548) No configuration available. Try using -f option.
Well, okay. I guess it’s looking in /etc instead of the place ports put its configuration (/usr/local/etc), but whatever.
suigintou# lighttpd -f /usr/local/etc/lighttpd.conf
And we’re running. The first thing I do is attempt to connect with my browser, and I get a “connection refused”. Well, okay. So I then try w3m from one of my other machines, and despite giving an error about how it can’t connect to the test page it displays the contents fine.
So I poke around and find this:
suigintou# netstat -an | grep LISTEN
tcp4 0 0 *.80 *.* LISTEN
tcp6 0 0 *.7000 *.* LISTEN
So apparently the server.port directive only applies to IPv6 connections. I can connect to it fine with my browser on port 80. Goddammit what the fuck.
EDIT: It was something with fucking lighttpd’s .conf.sample, which is just bloat bloat bloat. Starting a new lighttpd.conf from scratch eliminated the problem.
I made delicious ramen today. I finally managed to use the exact right amount of water to essentially steam the noodles so they come out both cooked and without the “soup” portion. I then seasoned them with some cumin, red pepper, Tabasco and salt (and MSG) and it was delicious. Yum yum. Overcooked them a little bit, but overall pretty good and spicy.
Anyway, two points of interest for this post:
- I got Erlang running off two machines (one master, one node)! I kind of cheated and NFS mounted the master’s filesystem on the node by hand; I need to figure out how I’m going to export things like applications and stuff. I tried just booting the node from the root filesystem but that didn’t go over so well.
- lighttpd has built-in load-balancing. This, in addition to lighttpd already sounding fucking sweet, is just fucking awesome. I’m trying to install and configure it now.
I also went ahead and downloaded BProc just to see how many modifications were made to the Linux kernel to get it to work, and I lol’ed like a motherfucker. I doubt I’ll be able to get that shit working on FreeBSD for a long, long time, simply because I’m not at all familiar with the structure of FreeBSD’s kernel.
Ha ha ha oh wow.