I no longer have the time or energy to continue maintaining the 4scrape codebase. There's currently a crapton of issues with it:
- Cache for searches (potentially just post searches) is broken.
- Threads need to be cached as a whole unit — assembling them from a 500,000-row table is too slow.
- General consistency errors — there’s a bunch of images missing (???)
- The scraper likes to shit itself to keep things lively.
- The backend would occasionally crash/spinlock (???)
I took the entire server offline for now, but I’m going to re-post the shit code here along with a complete dump of the entire database (over a year’s worth of posts, lawds). I don’t really know how to offload the image data (~150GB) — I’ll talk to Shuugo about setting up a torrent (or just turn the web server back on and let people hammer the extra 150mbps).
It was a fun run; I had a good time. Sorry for crapping out on you :(
EDIT: 4scrape source code [178KB]. I should have the full database dump uploaded some time tomorrow (it's xbox hueg: a little over 500MB compressed, comprising almost a million posts and metadata for over 1/3 million images). I'm not quite sure how I'm going to offload the image data yet, but it'll either be over bittorrent or via unthrottled HTTP access to the image data (but not the API/front-end, so you'll need the database dump, or just 1.jpg 2.jpg 3.jpg etc. (though you'll miss *.png etc), to accidentally the whole thing).
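The 1.jpg 2.jpg 3.jpg approach is just walking integer filenames over HTTP. A minimal sketch, assuming the images really are served by numeric post ID (the base URL here is a placeholder, not the real server, and only .jpg is tried, so *.png etc. get missed as noted above):

```python
# Hypothetical sequential fetcher. BASE_URL is a placeholder, not the
# real image server; only .jpg is tried, so *.png etc. are missed.
import urllib.error
import urllib.request

BASE_URL = "http://example.com/images"  # placeholder host

def image_urls(start, stop):
    """Yield (post_id, url) pairs for sequential integer filenames."""
    for i in range(start, stop):
        yield i, f"{BASE_URL}/{i}.jpg"

def fetch_all(start, stop):
    for i, url in image_urls(start, stop):
        try:
            urllib.request.urlretrieve(url, f"{i}.jpg")
        except urllib.error.HTTPError:
            pass  # gap in the IDs, or the post wasn't a .jpg
```

With the database dump you'd instead pull the real filenames (and extensions) out of the image metadata table rather than guessing.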
EDIT2: And the SQL dump [505MB]. I was running PostgreSQL 8.3; you can probably massage the dump to fit into something else, but it uses some Postgres-only features (e.g. for the fulltext indexes), so it isn't going to load into anything else (MySQL/MSSQL/whatever) as-is. Additionally, I turned image access back on (for what's still there) and turned off the access restrictions. Feel free to hammer the balls out of the server while I figure out a better way to get the images off it.
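For reference, the Postgres-only bits are along the lines of tsvector columns with GIN fulltext indexes (8.3 is the release where fulltext search moved into core). A hypothetical fragment of that shape, with invented names (the real schema is in the dump):

```sql
-- Hypothetical fragment; the real table/column names are in the dump.
CREATE TABLE posts (
    id       integer PRIMARY KEY,
    body     text,
    body_fts tsvector  -- PostgreSQL-specific type; no MySQL/MSSQL equivalent
);
-- GIN fulltext index: also Postgres-only
CREATE INDEX posts_fts_idx ON posts USING gin (body_fts);
```

These are the statements you'd have to rewrite or drop before the dump will load anywhere else.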
EDIT3: Go to 4walled.org instead.