So I came up with a brilliant solution to the bandwidth/disk abuse: write a small PHP script (Meimei) which caches images from 4scrape (Suigintou), give it out to people, and just randomly direct visitors to those mirrors with JavaScript. Validation issues aside (“oh hey let’s just serve lemonparty for every image”), because those can be “fixed” by only allowing trusted mirrors, I don’t think it’s going to work.
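For the record, the “randomly direct people to mirrors” part is the easy bit. Here is a minimal sketch of what I mean, assuming a hard-coded list of trusted mirrors. The hostnames and the `pickMirror` / `imageUrl` names are made up for illustration, not anything that actually exists.

```javascript
// Hypothetical list of trusted Meimei mirrors (placeholder hostnames).
const TRUSTED_MIRRORS = [
  "http://meimei-a.example/",
  "http://meimei-b.example/",
];

// Hypothetical address of the authoritative server (Suigintou).
const ORIGIN = "http://suigintou.example/";

// Pick one mirror uniformly at random; fall back to the origin if none exist.
// `random` is injectable so the choice can be made deterministic in tests.
function pickMirror(mirrors, random = Math.random) {
  if (mirrors.length === 0) return ORIGIN;
  return mirrors[Math.floor(random() * mirrors.length)];
}

// Build the URL an <img> tag would point at for a given image path.
function imageUrl(imagePath, mirrors = TRUSTED_MIRRORS, random = Math.random) {
  return pickMirror(mirrors, random) + imagePath;
}

console.log(imageUrl("images/12345.jpg"));
```

Uniform random choice is the dumbest thing that could work; it knows nothing about which mirror is overloaded or what it has cached.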
I threw up a test version of the script on my DreamHost account and it completely shit itself. It worked for a while, then it started spewing 503 errors everywhere. I don’t know if this is an issue with DreamHost or my shitty code (or the fact that by putting it up to serve 100% of the requests, it was bombarded with almost 1000 hits in the few minutes that it was up), but it definitely needs a lot more tweaking before it’s usable. It’s one of those “ugh PHP” things and it’s just a bitch.
The version I have running right now is really stupid — it basically just acts as a caching HTTP proxy. To have the thing fully functional, it needs to do a lot more than just that (and the server stuff on Suigintou needs to be a bit more sophisticated too). Each Meimei instance needs to track how much bandwidth it’s used (it already tracks disk consumption); it needs a means to communicate back to Suigintou that it’s all filled up or out of bandwidth or whatever.
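Roughly the bookkeeping I have in mind, as a sketch only. Meimei itself is PHP; this is JavaScript purely for illustration, and the `MirrorUsage` class, its limits, and the status strings are all assumptions rather than a real protocol.

```javascript
// Tracks how much disk and bandwidth one mirror has consumed and derives
// the status it would report back to the authoritative server.
class MirrorUsage {
  constructor({ diskLimitBytes, bandwidthLimitBytes }) {
    this.diskLimitBytes = diskLimitBytes;
    this.bandwidthLimitBytes = bandwidthLimitBytes;
    this.diskUsedBytes = 0;
    this.bandwidthUsedBytes = 0;
  }

  // Call when a new image is written to the local cache.
  recordCached(bytes) {
    this.diskUsedBytes += bytes;
  }

  // Call whenever an image is sent to a visitor.
  recordServed(bytes) {
    this.bandwidthUsedBytes += bytes;
  }

  // Hypothetical status values; bandwidth exhaustion takes priority.
  status() {
    if (this.bandwidthUsedBytes >= this.bandwidthLimitBytes) return "out_of_bandwidth";
    if (this.diskUsedBytes >= this.diskLimitBytes) return "disk_full";
    return "ok";
  }

  // The payload a mirror might send back to the authoritative server.
  report() {
    return {
      status: this.status(),
      diskUsedBytes: this.diskUsedBytes,
      bandwidthUsedBytes: this.bandwidthUsedBytes,
    };
  }
}

const usage = new MirrorUsage({ diskLimitBytes: 1000, bandwidthLimitBytes: 5000 });
usage.recordCached(400);
usage.recordServed(400);
console.log(usage.report());
```

Suigintou would then stop handing requests to any mirror whose last report wasn't `"ok"`.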
There also needs to be something which gives some notion of cache locality, i.e., a Meimei might say “I only want images in the range 30000-40000 that are SFW and from /wg/”. That way it’ll consistently get handed those requests and hopefully hit the cache more often (though eventually, assuming unlimited disk, each Meimei instance will be a full mirror, and unlimited disk isn’t a sound assumption).
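The matching itself could look something like this sketch. The shape of the preference object (`minId`, `maxId`, `sfwOnly`, `boards`) and the function names are my own invention for illustration.

```javascript
// Does this mirror's declared preference cover the given image?
function accepts(pref, image) {
  return (
    image.id >= pref.minId &&
    image.id <= pref.maxId &&
    (!pref.sfwOnly || image.sfw) &&
    pref.boards.includes(image.board)
  );
}

// Return the URLs of every mirror willing to serve this image.
function mirrorsFor(image, mirrors) {
  return mirrors.filter((m) => accepts(m.pref, image)).map((m) => m.url);
}

// Hypothetical mirrors: the first is the "30000-40000, SFW, /wg/" example.
const mirrors = [
  { url: "http://meimei-a.example/", pref: { minId: 30000, maxId: 40000, sfwOnly: true, boards: ["wg"] } },
  { url: "http://meimei-b.example/", pref: { minId: 0, maxId: 29999, sfwOnly: false, boards: ["wg", "w"] } },
];

console.log(mirrorsFor({ id: 31337, board: "wg", sfw: true }, mirrors));
```

If nothing matches, the request just goes to Suigintou directly.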
And then you hit mirror validation. I’d want to do it in a distributed manner with JavaScript, i.e., check the MD5 of a subset of the image data against a known value from the authoritative Suigintou, then report back to Suigintou if the mirror is giving bad data and not display the image. I don’t even know if that’s going to be possible in a realistic manner, or how evadable my checks would be. I think if there aren’t more than 5-7 Meimei instances they’ll either get overloaded with requests, not be clustered enough to achieve decent locality, or just be otherwise ineffective at staving off bandwidth consumption.
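The check itself is simple enough on paper. A sketch, assuming the image bytes are already in hand and that Suigintou supplies an offset, a length, and the expected digest (that `expected` object is hypothetical). It uses Node's `crypto` module so it actually runs; as far as I know browsers have no built-in MD5, so the real thing would need a JavaScript MD5 implementation plus some way to read the image bytes at all.

```javascript
const crypto = require("crypto");

// MD5 (hex) of a slice of the image data.
function sampleDigest(bytes, offset, length) {
  const slice = bytes.subarray(offset, offset + length);
  return crypto.createHash("md5").update(slice).digest("hex");
}

// Compare the mirror's bytes against the value from the authoritative server.
// `expected` is a hypothetical record: { offset, length, md5 }.
function isValid(bytes, expected) {
  return sampleDigest(bytes, expected.offset, expected.length) === expected.md5;
}

// Stand-in data; a real check would use the downloaded image bytes.
const original = Buffer.from("hello world");
const tampered = Buffer.from("jello world");
const expected = { offset: 0, length: 5, md5: sampleDigest(original, 0, 5) };

console.log(isValid(original, expected)); // true
console.log(isValid(tampered, expected)); // false
```

A fixed slice is trivially evadable: a hostile mirror only has to serve that slice correctly. Having Suigintou vary the offset per request would presumably make that harder, but I haven't thought it through.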
Would be nice to have it all working and shit but ugh it’s almost too much work. Bleh.