Before the shitstorm hits: 4scrape is dead for a while due to an HDD failure. Yes, there are offsite backups for all the stuff on there, but meh, it’s probably going to be a while before I can be arsed to actually restore it. Kind of want to rewrite the damn thing again. I cbf’d to put the images in a torrent or whatever because they’re all shit anyway. RIP.
And I csup’d a bad kernel last night on my laptop or something, because it panicked a couple minutes ago. Naturally this means spending the rest of the night dicking around with it to get a crash dump (it didn’t leave one FUUUUUUUUUUU) then dicking around in gdb to find the cause of the crash (or just posting the traces to the mailing list). <3
Pic potentially related to binge drinking for the past week.
I know I’ve been ranting a lot about Meimei, but hopefully I’ll have an early version ready to deploy in a couple days (want to have a couple instances deployed on some friends’ hosting first so I can make sure everything works). Once that’s taken care of, there are plenty more sub-projects to tackle:
User-Aided Autonomous Metadata Creation
A user requests an image (landing page). The server checks the cookies and sees that no other images are in the cookies. No data is modified server-side, but the client now has that image’s ID in its cookies. It requests another image. The server checks the cookies and sees that there is one image ID in the cookies. It modifies the “correlation files” for both the image requested and the one in the cookies to reflect the correlation (that is, it increments a count in each file for the other’s ID, creating an entry for the image if there wasn’t one). The client’s cookies now contain both images’ IDs. The client requests a third image. The server checks the cookies and sees two image IDs. It modifies the files for both of those images to reflect the correlation with the current one, and the file for the current one to reflect the correlation with the other two. The client’s cookies now contain three image IDs.
And so on.
Meanwhile, any time anyone requests an image, the server checks the file for that image. The top ten or so images with the highest seen count are displayed.
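Rough sketch of the cookie bookkeeping above, in Python. This is not actual 4scrape code: the class and method names are made up, the “correlation files” are just an in-memory dict here, and skipping repeat views of an image already in the cookies is an assumption the description doesn’t cover.

```python
from collections import Counter


class CorrelationTracker:
    """In-memory stand-in for the per-image "correlation files" (hypothetical)."""

    def __init__(self):
        # image_id -> Counter of {other_image_id: times seen in the same session}
        self.correlations = {}

    def record_view(self, image_id, cookie_ids):
        """Update counts for one request; return the IDs to store back in the cookie."""
        cookie_ids = frozenset(cookie_ids)
        # Assumption: re-requesting an image already in the cookie changes nothing.
        if image_id not in cookie_ids:
            for other in cookie_ids:
                # Increment the count in each image's entry for the other's ID,
                # creating the entry if there wasn't one.
                self.correlations.setdefault(image_id, Counter())[other] += 1
                self.correlations.setdefault(other, Counter())[image_id] += 1
        return cookie_ids | {image_id}

    def suggestions(self, image_id, limit=10):
        """The `limit` images most often seen in the same session as image_id."""
        counts = self.correlations.get(image_id, Counter())
        return [other for other, _ in counts.most_common(limit)]


# One session walking through three images, as in the description above.
tracker = CorrelationTracker()
cookie = tracker.record_view("a", [])      # first request: nothing stored server-side
cookie = tracker.record_view("b", cookie)  # a <-> b
cookie = tracker.record_view("c", cookie)  # c <-> a, c <-> b
```

A real version would have to worry about cookie size and concurrent writes to the same file, neither of which this touches.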
Basically, the idea is that people tend to cluster into groups which have similar image tastes. By tracking what images are viewed in a single session, you can cluster images by preference group. Those clusterings can then be used as a smart image suggestion system.
A similar system could be implemented, in which the images clicked during a search are auto-tagged with the search terms used. This would be kind of useful, but ultimately wouldn’t really do anything with the current search system (since there are no relevancy rankings — everything’s sorted by date).
Image Analysis — Common Colors
Tag Scrape/MD5 Lookup Service
There’s a bunch of sites out there which index basically the same image data — just do an image search on IQDB and you’ll get the same image back from multiple sites. I think all the sites IQDB indexes (except 4scrape) are Danbooru instances, which means they have delicious tags which can be scraped by MD5.
It’s kind of what Anisearch already does — aggregate tags from a bunch of Danbooru instances and throw them into a search index. Taking it a step further though, the service could not only aggregate tags, but provide an API to query against the image MD5 (to get a list of tags) and to adjust the tag weights.
Once there was a service, it would be cool to integrate with other software like pImgDB (and have a Danbooru plugin to facilitate a push model rather than aggressive scraping) and have a massive image MD5 <-> tags thing.
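To make the MD5 <-> tags idea a bit more concrete, here’s a minimal Python sketch of the data side of such a service. Everything in it is hypothetical (the `TagIndex` class, its method names, weighting each source’s vote as 1.0); it only shows the shape of “aggregate tags per MD5, look them up, adjust weights”, not how any existing Danbooru instance or Anisearch does it.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Hex MD5 of the raw image bytes, used as the lookup key."""
    return hashlib.md5(data).hexdigest()


class TagIndex:
    """Hypothetical aggregate store: image MD5 -> {tag: weight}."""

    def __init__(self):
        self._weights = {}

    def ingest(self, md5, tags, weight=1.0):
        """Add one source's tags for an image; repeated tags accumulate weight."""
        entry = self._weights.setdefault(md5, {})
        for tag in tags:
            entry[tag] = entry.get(tag, 0.0) + weight

    def adjust(self, md5, tag, delta):
        """Nudge a single tag's weight up or down (the 'adjust weights' call)."""
        self.ingest(md5, [tag], delta)

    def tags_for(self, md5):
        """Tags for an image, heaviest first (ties broken alphabetically)."""
        entry = self._weights.get(md5, {})
        return sorted(entry.items(), key=lambda kv: (-kv[1], kv[0]))


# Two made-up sources reporting tags for the same image.
key = md5_of(b"not really an image")
index = TagIndex()
index.ingest(key, ["touhou", "hat"])
index.ingest(key, ["touhou"])
```

A push-model plugin would just call something like `ingest` over HTTP instead of the service scraping each site.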
Better Fulltext Indexing
Right now 4scrape uses PostgreSQL’s fulltext search engine. While it works and all, it’s kind of gross. At the very least, it needs a natural language wrapper which parses the queries and formats the search query properly (right now it just AND’s all terms together).
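A sketch of what such a wrapper might look like, in Python. The user-facing syntax here (bare terms are ANDed, `-term` excludes, uppercase `OR` separates alternatives) is invented for illustration and is not what 4scrape does; the output uses PostgreSQL’s `&`, `|` and `!` tsquery operators, and the idea would be to hand the resulting string to `to_tsquery` as a bound parameter.

```python
import re


def to_tsquery_string(user_query: str) -> str:
    """Turn a loose search string into a tsquery expression (hypothetical syntax)."""
    groups = []
    # Uppercase OR between terms splits the query into alternatives.
    for alternative in re.split(r"\s+OR\s+", user_query.strip()):
        terms = []
        for token in alternative.split():
            negated = token.startswith("-")
            # Drop anything that isn't a word character so stray operators
            # typed by the user can't end up in the tsquery.
            word = re.sub(r"\W+", "", token).lower()
            if word:
                terms.append(("!" if negated else "") + word)
        if terms:
            groups.append(" & ".join(terms))
    if len(groups) == 1:
        return groups[0]
    return " | ".join(f"({group})" for group in groups)
```

For example, `to_tsquery_string("touhou hat -nsfw")` gives `touhou & hat & !nsfw`, and `to_tsquery_string("cat OR dog -hat")` gives `(cat) | (dog & !hat)`.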
Still, it would be kind of cool to re-implement an existing fulltext index like Whoosh (a Python one) in Haskell. Some of the projects at work use Solr, which is a gross REST-based webservice (with support for some other stuff) built on top of Lucene. And it works really well except that it’s slow as fuck. There are a couple of other fulltext search solutions (Oracle, etc) but they’re ugh.
It’d be nice to have a fancy pants fulltext search index written in Haskell.
I don’t have nearly enough free time :(
PS: If anyone wants to implement any of this (since they can all mostly be implemented as services external to 4scrape), let me know and we can work something out :3
I started watching I got trolled into watching Maria+Holic. Despite the fact that the genre doesn’t usually appeal to me I can’t stop fucking watching it. It’s kind of disturbing how enthralled I am by it. lol faggots.
Rimi hit the 100Mbps bandwidth limit last night, which is quite the incentive to finish up work on Meimei. I’ve got the script and installer done (I think); I just need to put the auto-update functionality in and write the Haskell-based load balancer and she should be all set to deploy. Hopefully that’ll take some of the strain off Rimi without being too much of a pain.
Finally, Drupal can suck my fucking balls. I’ve known for a long time that CMSes are shitty, I just wasn’t able to comprehend how shitty they could possibly be. Now I know (and have to put up with it, unfortunately). Ugh.
lol fixed the scrapers scraping 4scrape problem (or, as much as I’m willing to).
The problem, as I see it, is that scrapers have a very abnormal usage pattern compared to normal visitors: they’re just pulling the big heavy images which cost a lot of bandwidth and CPU (in the form of I/O interrupts) to serve. My solution to the problem was to write a lighttpd module which basically throttles the number of “expensive” requests you can make. I don’t really give a shit that they’re scraping; what pisses me off is that they’re bringing the server to its knees by doing so.
Each IP address is associated with an integer representing the amount of resources it’s used. This value is reduced by a configurable amount each second, and increases whenever they access a marked resource. Once it gets over a maximum amount, the module starts throwing back 403 responses.
The way I’ve got it configured right now is to add 20 points for each full-size image fetched, then decay that value by 1 point/second. The maximum is 200 points before the server starts 403’ing, with a 600-point upper bound. Basically, this gives you a wallpaper every 20 seconds, allowing for bursts of activity.
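Here’s the scoring logic as a small Python sketch, using the numbers above (20 points per image, 1 point/second decay, 200-point limit, 600-point ceiling). It is not the lighttpd module itself, which is a C plugin; the `Throttle` class is made up, and two details are assumptions: the score keeps climbing (up to the ceiling) even while a client is being 403’d, and a request that lands exactly on the limit is still served.

```python
from dataclasses import dataclass, field


@dataclass
class Throttle:
    cost: float = 20.0           # points added per expensive request
    decay_per_sec: float = 1.0   # points removed per second
    limit: float = 200.0         # above this, requests get a 403
    ceiling: float = 600.0       # the score never grows past this
    _state: dict = field(default_factory=dict)  # ip -> (score, last_seen)

    def hit(self, ip: str, now: float) -> int:
        """Record one expensive request at time `now`; return the HTTP status."""
        score, last_seen = self._state.get(ip, (0.0, now))
        # Decay lazily: subtract what would have drained since the last request.
        score = max(0.0, score - (now - last_seen) * self.decay_per_sec)
        # Assumption: rejected requests still cost points, capped at the ceiling.
        score = min(self.ceiling, score + self.cost)
        self._state[ip] = (score, now)
        return 200 if score <= self.limit else 403


# A burst of 11 requests in the same instant: ten get through, the last doesn't.
throttle = Throttle()
burst = [throttle.hit("203.0.113.7", now=0.0) for _ in range(11)]
```

Decaying on access like this avoids a per-second timer sweeping every IP, at the cost of stale entries sticking around until something cleans them up.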
I’ve also got code in place that stat‘s the requested files and bases the cost on the file size, but haven’t gotten around to testing it yet. Deploying a fucking lighttpd plugin was a goddamn nightmare (at least, compared to an Apache module) because the lighttpd developers are retards or something.
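The size-based variant could look something like this in Python. The scale is invented for the example (20 points per MiB, clamped between 1 and 200); the post doesn’t say what mapping the untested code actually uses.

```python
import math
import os

POINTS_PER_MIB = 20.0  # hypothetical scale: a 1 MiB image costs 20 points
MIN_COST = 1           # hypothetical floor, so tiny files aren't free
MAX_COST = 200         # hypothetical cap, so one huge file can't cost more than the limit


def cost_for_size(size_bytes: int) -> int:
    """Map a file size in bytes to a point cost, clamped to [MIN_COST, MAX_COST]."""
    raw = math.ceil(size_bytes / 2**20 * POINTS_PER_MIB)
    return max(MIN_COST, min(MAX_COST, raw))


def cost_for_file(path: str) -> int:
    """stat the requested file and base the cost on its size."""
    return cost_for_size(os.stat(path).st_size)
```

The result would simply replace the flat per-image cost when the score is bumped.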
Anyway, we’ll see how this pans out. [source]