Archive for April, 2008
SQLite is great, until you try to throw it into a threaded environment (i.e., Apache+mod_python). It worked for a while, then just started crapping out as multiple interpreter instances started hammering it. So I rewrote the scraper and viewer to use a MySQL backend instead, which was surprisingly easy since THE FORCED INDENTATION OF THE CODE lays out a standard database interface in PEP 249.
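To give a rough idea of why the switch was cheap, here is a minimal sketch of code written against the PEP 249 interface. It is not the actual scraper: the `images` table and the `count_images` helper are made up for illustration, and `sqlite3` stands in as the backend so the snippet runs on its own.

```python
import sqlite3

def count_images(conn):
    """Count rows in a hypothetical images table using only PEP 249 calls."""
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM images")
    (n,) = cur.fetchone()
    cur.close()
    return n

# sqlite3 is the stand-in backend; any PEP 249 driver offers the same
# connect / cursor / execute / fetchone sequence.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE images (img_id INTEGER PRIMARY KEY, hash TEXT)")
cur.executemany("INSERT INTO images (hash) VALUES (?)", [("aa",), ("bb",)])
conn.commit()

total = count_images(conn)
print(total)
```

The one thing that does not carry over for free is the placeholder syntax: each driver reports its own style in its `paramstyle` attribute (`sqlite3` uses `?`), so parameterized queries usually need a small touch-up when changing backends.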
mysql> SELECT board_sname, COUNT(*) AS `img_count`
    -> FROM 4s_images i
    -> JOIN 4s_posts p ON p.img_id=i.img_id
    -> JOIN 4s_boards b ON b.board_id=p.board_id
    -> GROUP BY b.board_id;
+-------------+-----------+
| board_sname | img_count |
+-------------+-----------+
| w           |      2751 |
| e           |      2435 |
+-------------+-----------+
2 rows in set (0.39 sec)
Those counts used to be over 9000, but I cbf’d converting the SQLite data over to MySQL because I restructured the database a bit. So I just deleted the 3 days’ worth of images scraped, not a big deal.
The web frontend uses Apache+mod_python, as mentioned. I took one look at all those heavy frameworks (Django, Pylons, etc) and lol’d. Way too much flashy wank; I ended up writing my own 70-line “web-framework” which gets the job done. I am using Cheetah for a template engine, because I cbf’d writing yet another template engine. I wrote one in PHP once, it was a couple of lines long and doesn’t afraid of anything.
As slow and bloated as Python may be, I’m fucking glad I don’t have to use PHP, and that I don’t have to use it through an FCGI interface.
Also, I ordered an Atheros wifi PCI NIC, so the porn server which is now also scraping images will probably be serving as my primary router some time in the near future. Which means I’ll be able to configure the goddamn thing the way I want and have it externally accessible. And I’ll get to dick around with bpf hur hur hur.
hax my anus
Also, I’ve got the 4chan dump script set up as a cron job, running every 12 hours. It’s currently dumping both /w/ and /e/, I’m considering adding either /s/ or /hc/ into that mix as well. It’s running on a dedicated machine which has 1TB of free space (2x 500GB, no redundancy), so hopefully that’ll be enough for a long while. Or at least long enough for me to get a job and money and etc.
Script stores the image dimensions as metadata in the database now, which is cool. What I need now is a better viewing mechanism; using gthumb over a 100bT wireless NFS is, well, slower than I’d hoped.
I sat down tonight, started and finished a project. I wrote a tool which goes through and dumps imageboards; wrote it to basically archive /w/ since I like mah animu wallpapers. Gave me an opportunity to dick around with SQLite, something I’ve been meaning to do for a while.
The dumper isn’t a dumb dumper like wget is — you give it a host, a board path and a database (and the number of pages of threads you’d like to grab) and it goes through a several-step process.
First, it goes through all the index pages and grabs the IDs of all the threads, in addition to the IDs of the latest posts in each thread. It has a table which keeps track of the last post it’s seen in each thread, so the threads which haven’t received posts since the last pass can be thrown out without even fetching them.
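The index pass boils down to something like the sketch below. The `thread_state` table, its columns, and the `threads_to_fetch` function are names invented for this example, and it assumes post IDs only ever increase on a board, so "newer" just means "bigger ID".

```python
import sqlite3

def threads_to_fetch(conn, index_listing):
    """Return (thread_id, last_seen) for threads with posts we haven't seen.

    index_listing: iterable of (thread_id, latest_post_id) pairs, as scraped
    from the index pages.
    """
    cur = conn.cursor()
    pending = []
    for thread_id, latest_post_id in index_listing:
        cur.execute(
            "SELECT last_post_id FROM thread_state WHERE thread_id = ?",
            (thread_id,),
        )
        row = cur.fetchone()
        last_seen = row[0] if row else 0  # never-seen threads start at 0
        if latest_post_id > last_seen:
            pending.append((thread_id, last_seen))
    return pending

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thread_state ("
    "thread_id INTEGER PRIMARY KEY, last_post_id INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO thread_state VALUES (?, ?)", [(100, 150), (200, 260)])

# Thread 100 is unchanged, 200 has new posts, 300 is brand new.
listing = [(100, 150), (200, 275), (300, 301)]
pending = threads_to_fetch(conn, listing)
print(pending)  # [(200, 260), (300, 0)]
```

Thread 100 gets thrown out without a single extra HTTP request, which is the whole point of keeping that table.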
It then goes through each of the remaining threads and grabs the page. Since it knows the last post it’s seen in the thread, it can intelligently grab only the images and posts it hasn’t seen.
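A sketch of that per-thread step, under the same assumptions as before: post IDs increase over time, and the `thread_state` table plus the `unseen_posts` and `mark_seen` helpers are hypothetical names, not the real script. Parsing the thread page into a list of dicts is left out.

```python
import sqlite3

def unseen_posts(posts, last_seen):
    """Keep only posts whose ID is above the last one already processed."""
    return [p for p in posts if p["post_id"] > last_seen]

def mark_seen(conn, thread_id, posts):
    """Advance the stored high-water mark for a thread after processing."""
    if posts:
        newest = max(p["post_id"] for p in posts)
        conn.execute(
            "INSERT OR REPLACE INTO thread_state (thread_id, last_post_id) "
            "VALUES (?, ?)",
            (thread_id, newest),
        )
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thread_state ("
    "thread_id INTEGER PRIMARY KEY, last_post_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO thread_state VALUES (200, 260)")

# Pretend this came out of the parsed thread page.
thread_posts = [{"post_id": n} for n in (250, 260, 270, 275)]
fresh = unseen_posts(thread_posts, last_seen=260)
mark_seen(conn, 200, fresh)

fresh_ids = [p["post_id"] for p in fresh]
stored = conn.execute(
    "SELECT last_post_id FROM thread_state WHERE thread_id = 200"
).fetchone()[0]
print(fresh_ids, stored)  # [270, 275] 275
```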
When it fetches an image, it hashes the image data and checks to see if there’s an existing entry for that hash. If there is, it links the post to that entry (such that multiple posts can link to the same file on disk), otherwise it saves it to disk and notes the hash.
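The dedup step looks roughly like this. The choice of SHA-1 is an assumption for the sketch (any stable digest would do), and the `images` and `posts` tables, the `store_image` function, and the hash-as-filename scheme are all made up for illustration.

```python
import hashlib
import os
import sqlite3
import tempfile

def store_image(conn, image_dir, post_id, data, ext):
    """Link a post to an image, writing the file only if the hash is new."""
    digest = hashlib.sha1(data).hexdigest()
    row = conn.execute(
        "SELECT img_id FROM images WHERE hash = ?", (digest,)
    ).fetchone()
    if row:
        img_id = row[0]  # already on disk, just reuse the entry
    else:
        path = os.path.join(image_dir, digest + ext)
        with open(path, "wb") as f:
            f.write(data)
        cur = conn.execute(
            "INSERT INTO images (hash, path) VALUES (?, ?)", (digest, path)
        )
        img_id = cur.lastrowid
    conn.execute(
        "INSERT INTO posts (post_id, img_id) VALUES (?, ?)", (post_id, img_id)
    )
    conn.commit()
    return img_id

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (img_id INTEGER PRIMARY KEY, hash TEXT UNIQUE, path TEXT)"
)
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, img_id INTEGER)")
image_dir = tempfile.mkdtemp()

first = store_image(conn, image_dir, 1, b"same bytes", ".jpg")
repost = store_image(conn, image_dir, 2, b"same bytes", ".jpg")
other = store_image(conn, image_dir, 3, b"different bytes", ".png")
files_on_disk = len(os.listdir(image_dir))
print(first == repost, files_on_disk)  # True 2
```

Three posts, two files on disk: the repost just points at the existing entry.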
Really, it isn’t that much, but I’m fairly happy with it. It’s trivial enough to implement that I’m not going to bother posting the source. I’ve got an extra TB of unpurposed disk sitting around, so I’ll probably throw that in my porn server and have that fucker run the script twice a day or something. On average, each image weighs about 1MB, thus the 1TB will accommodate around a million posts. Should be more than enough; /w/ has only accumulated 450,000 posts so far.
I’m kind of interested to see how SQLite scales to that number of entries. The database is only being used to track metadata (nothing stupid like putting the images in there as blobs), so I have high hopes for it.
My brain is pretty burnt out right now, but I need to go through and have the thing glean the image size data before it writes it to disk, then throw that in with the metadata. Then I guess I can write some kind of browsing/tagging system or some kind of organization instead of having thousands of images sitting around in a single file.
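For the size gleaning, one way to do it without pulling in an imaging library is to read the dimensions straight out of the header bytes while the image is still in memory. This is only a sketch: `image_size` is a hypothetical helper, it handles PNG and GIF only, and JPEG (which needs a scan through its segment markers) is left out, so it returns None for anything it doesn't recognize.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def image_size(data):
    """Return (width, height) parsed from PNG or GIF header bytes, else None."""
    # PNG: 8-byte signature, then the IHDR chunk (4-byte length, b"IHDR"),
    # followed by width and height as big-endian 32-bit integers.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte version tag, then width and height as little-endian
    # 16-bit integers.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    return None

# Hand-built headers, just enough bytes to exercise the parser.
fake_png = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 1920, 1200)
fake_gif = b"GIF89a" + struct.pack("<HH", 640, 480)

png_dims = image_size(fake_png)
gif_dims = image_size(fake_gif)
unknown = image_size(b"not an image")
print(png_dims, gif_dims, unknown)  # (1920, 1200) (640, 480) None
```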
I also kind of want to point this fucker at /e/.
It’s kind of weird having an entire board available for one’s offline perusing. Oh well, I’ll get used to it. Lawds.
I got a phone call this morning from SICP guy, apparently I don’t really have that job — there’s quite a few hoops left to jump through. Mostly HR stuff, real pain in the ass.
Long story short, being a public institution and all, they have to put out the job offer publicly, then wait for people to apply for it (including myself), then select from the applicants. This, as far as I can tell, has several implications:
- Someone more qualified than myself (not uncommon) might show up.
- Added temporal overhead might push me to financial desperation (lol rent?)
Even if I manage to get the job, there are some other gotchas. Specifically, the position would be limited to 1500 hours total, after which it would be terminated, without any room for renegotiation. Which means after 6 months, I’d either be promoted into a permanent staff position or looking for another job. Really, this is something I anticipated happening anyway, so it’s not that big of a deal.
What is a big deal is that, if I don’t have a job within the next 10 days or so, I won’t be able to pay rent for June. Which means I will have a job in the next 10 days or so, whether it’s writing code or sucking dick I will have one.
Going to finish up my bullshit resume today and shotgun it off to a bunch of positions around here. I know no one in the Charlottesville area reads this, but if you have a job opening…