- Raid is surprisingly effective against wasps. One blast and they were all dead, writhing corpses on my windowsill (their pupaesque young included).
- Those fucks still haven’t put my fucking toilet back.
- hack (on hackage) is a Haskell version of Rack which looks excellent.
- Of course, the creator of Hack has named his framework Loli (on hackage), which means he is a dirty child molesting pedophile /b/tard.
- I feel dirty for wanting to use the loli.
I like having lots of data, but I hate having it in a completely unusable mess. I have a couple of friends who have fairly massive repositories of images and video all strewn in a big pile. I’m kind of guilty of doing it myself. The problem is, these big piles are absolutely unusable because they don’t provide any means of attaching non-trivial metadata to the content.
One of my goals for the next month or so is to write a Danbooru-like system which is geared towards taxonomic categorization of a folder of shit. The idea is that you point the software (which is effectively a web application written in Haskell) at a directory. Any file in that directory can then be turned into a Danbooru-style post — a process which moves the file out of the watch directory and into the system’s internal repository and attaches some initial metadata.
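The "file becomes a post" step could look something like the Python sketch below. Everything about the layout is an assumption made for illustration: the `ingest` and `file_md5` names, naming stored files by their MD5, and the one-JSON-sidecar-per-post metadata format are all hypothetical, not how the real system works.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash the file in chunks so a 200MB+ video never sits fully in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(watch_file, repo_dir):
    """Move one file out of the watch directory and record initial metadata."""
    watch_file = Path(watch_file)
    repo_dir = Path(repo_dir)
    md5 = file_md5(watch_file)
    size = watch_file.stat().st_size  # read before the file is moved
    stored = repo_dir / (md5 + watch_file.suffix.lower())
    repo_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(watch_file), str(stored))
    post = {
        "md5": md5,
        "original_name": watch_file.name,
        "size_bytes": size,
        "stored_as": stored.name,
        "tags": [],  # filled in later by scrapers and users
    }
    # Hypothetical layout: one JSON sidecar per post, next to the stored file.
    (repo_dir / (md5 + ".json")).write_text(json.dumps(post, indent=2))
    return post
```

Naming the stored file after its hash means the same file dropped in twice lands on the same path, which is probably what you want for deduplication, though that is a design guess rather than a requirement.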
Even if you just have a raw file, you can scrape quite a bit of metadata from it. I spent a bit of time tonight writing a small wrapper around mplayer which can be used to dump all the metadata from a video file and dump arbitrary screenshots from the video. Combine this with a system which has an idea of how fansubbers normally name their files and how to look up series names on AniDB/MAL and you’ve got a good amount of automatically-generated metadata right off the bat (which can be then supplemented by users).
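As a rough idea of what the scraping side might look like, here is a Python sketch (not the actual wrapper). It assumes the usual `[Group] Series Name - 03 [ABCD1234].mkv` naming style and parses the `ID_KEY=value` lines that `mplayer -identify` prints. The function names and the regex are made up for illustration, and real release names are messier than this handles.

```python
import re
import subprocess

# Assumed naming convention: [Group] Series Name - 03v2 [1080p][ABCD1234].mkv
FANSUB_NAME = re.compile(
    r"^\[(?P<group>[^\]]+)\]\s*"            # [Group]
    r"(?P<series>.+?)\s*-\s*"               # Series Name -
    r"(?P<episode>\d+)(?:v\d+)?"            # 03 or 03v2
    r"(?P<rest>(?:\s*[\[(][^\])]*[\])])*)"  # trailing [..] or (..) blocks
    r"\.(?P<ext>\w+)$"
)
CRC32 = re.compile(r"\[([0-9A-Fa-f]{8})\]")


def parse_fansub_name(filename):
    """Return group/series/episode/crc32, or None if the name doesn't fit."""
    match = FANSUB_NAME.match(filename.replace("_", " "))
    if match is None:
        return None
    crc = CRC32.search(match.group("rest"))
    return {
        "group": match.group("group"),
        "series": match.group("series"),
        "episode": int(match.group("episode")),
        "crc32": crc.group(1).upper() if crc else None,
    }


def parse_identify(output):
    """Collect the ID_KEY=value lines from `mplayer -identify` output."""
    info = {}
    for line in output.splitlines():
        if line.startswith("ID_") and "=" in line:
            key, _, value = line.partition("=")
            info[key[3:].lower()] = value
    return info


def identify(path):
    """Run mplayer without decoding any frames and parse what it reports."""
    result = subprocess.run(
        ["mplayer", "-identify", "-frames", "0",
         "-vo", "null", "-ao", "null", path],
        capture_output=True, text=True, errors="replace", check=False,
    )
    return parse_identify(result.stdout)
```

The series name that comes out of `parse_fansub_name` is what you would then feed to an AniDB/MAL lookup; that part is left out because it depends on whichever API you end up using.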
You can do the same thing with images — there are tons of existing massive repositories for image metadata. Aside from looking at the on-file stuff, you can take the MD5 of the image and pass it to a fairly wide variety of services (4scrape, IQDB, Danbooru instances, etc) and pull back tags from any hits that you find. Sure, you won’t find stuff for every image, but you’d be surprised how many images are already posted and categorized elsewhere.
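A minimal Python sketch of the MD5 lookup idea follows. The services themselves are deliberately abstracted away: each "source" is just a hypothetical callable that takes an MD5 string and returns tags, since the real endpoints and response formats differ per site and are not spelled out here.

```python
import hashlib
from collections import Counter


def md5_of_bytes(data):
    """Hex MD5 of the raw image bytes, which is what the lookups key on."""
    return hashlib.md5(data).hexdigest()


def lookup_tags(md5, sources):
    """Ask every source for tags and count how many sources agree on each.

    `sources` maps a service name to a callable md5 -> iterable of tags.
    The callables are placeholders for real HTTP clients.
    """
    votes = Counter()
    for name, fetch in sources.items():
        try:
            tags = set(fetch(md5))
        except (OSError, LookupError):
            continue  # a service being down or missing the image is fine
        votes.update(tags)
    return votes
```

Counting agreement instead of just taking the union gives you a cheap confidence score: a tag three sites agree on is probably better than one that shows up once.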
I want to use a watch directory because it makes it really easy to tie into an environment where people are uploading content via NFS/SMB. I’m most interested in good categorization of video data (since there are already good existing image repositories), and uploading 200MB+ files over anything but NFS/SMB is a massive pain in the ass.
Additionally, it makes it really easy to tie into an automated download system: hook an RSS agent up to download torrents into rtorrent’s watch directory, then have a script move the finished files into the repository’s watch directory and boom, your stuff automatically gets pulled and indexed. Hook it up to a twitter feed and have a monitor set up in the living room to display that feed, and you’ve got a pretty good media discovery system.
Anyway, we’ll see how it turns out. I’m hoping it’s going to be fucking awesome, but it’s going to take a while to get everything up and running, I think. And so, art imitates life.
I ditched work a little early today. I’m migrating our website from a pig disgusting ColdFusion mess over to pig disgusting Drupal. It’s a fucking nightmare. How the hell did I turn into a fucking PHP developer? Times like these make me want to build up a non-webshit portfolio and go shopping for a new job, though the market around here is kind of shitty for that right now.
Started tinkering around with Haskell + OpenGL last night. Using some old C code of mine as a reference, I got textures and vertex arrays working fine; once I hammer out VBOs I should have everything I need to render with. Don’t know what kind of game or program I’d write though, so I’ll probably just get bored with it. I put the shit code on github because as shitty as github is (slow as fuck Ruby web interface) it’s kind of nice to have a free remote, public VCS. And git itself isn’t too bad.
Right now I’m compiling CURRENT from source, over NFS, on a netbooted machine (because I forgot the goddamn install CD at work fucking AGAIN and I’m sick of having shit strewn around my desk). It’s pulling the source off NFS, but /usr/obj is on a disk. Figure I’ll buildkernel + buildworld then install them to the disk, then boot off that. Pulling the source over NFS takes surprisingly little bandwidth; I assumed the network link would be the bottleneck. I guess it’s pulling on-demand, so it just adds latency to each request. Should have unionfs’d the UFS mount over the NFS mount and pulled all the stuff over concurrently with the install (via find . | xargs touch or similar).
Took me fucking ages to sift through all the shit I post to find the nfsbooting notes from 2007 though. I guess I should start tagging posts or something.
I know I’ve been ranting a lot about Meimei, but hopefully I’ll have an early version ready to deploy in a couple days (want to have a couple instances deployed on some friends’ hosting first so I can make sure everything works). Once that’s taken care of, there are plenty more sub-projects to tackle:
User-Aided Autonomous Metadata Creation
A user requests an image (landing page). The server checks the cookies and sees that no other images are in the cookies. No data is modified server-side, but the client now has that image’s ID in its cookies.
The client requests another image. The server checks the cookies and sees that there is one image ID in them. It modifies the “correlation files” for both the image requested and the one in the cookies to reflect the correlation (that is, it increments a count in each file for the other’s ID, creating an entry for the image if there wasn’t one). The client’s cookies now contain both images’ IDs.
The client requests a third image. The server checks the cookies and sees two image IDs. It modifies the files for both of those images to reflect the correlation with the current one, and the file for the current one to reflect the correlation with the other two. The client’s cookies now contain three image IDs.
And so on.
Meanwhile, any time anyone requests an image, the server checks the file for that image. The top ten or so images with the highest seen count are displayed.
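The bookkeeping described above is small enough to sketch in Python. This keeps the "correlation files" in memory as counters and passes the cookie contents in as a plain list; the class and method names are invented for the example, and ignoring repeat views of the same image within a session is my assumption, not something decided above.

```python
from collections import Counter


class Correlations:
    """Per-image counts of which other images were seen in the same session."""

    def __init__(self):
        self.counts = {}  # image ID -> Counter of co-viewed image IDs

    def record_view(self, image_id, seen):
        """Update counts for one request; `seen` is the ID list from the cookie.

        Returns the ID list the client's cookie should hold afterwards.
        """
        if image_id in seen:
            return list(seen)  # assumption: repeat views add no new pairs
        for other in seen:
            self.counts.setdefault(image_id, Counter())[other] += 1
            self.counts.setdefault(other, Counter())[image_id] += 1
        return list(seen) + [image_id]

    def suggestions(self, image_id, limit=10):
        """IDs with the highest seen-together count for this image."""
        ranked = self.counts.get(image_id, Counter()).most_common(limit)
        return [other for other, _ in ranked]
```

One thing the sketch makes obvious: every request costs one update per ID already in the cookie, so a long session gets progressively more expensive. Capping the cookie at the last N images would be one way to bound that.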
Basically, the idea is that people tend to cluster into groups which have similar image tastes. By tracking what images are viewed in a single session, you can cluster images by preference group. Those clusterings can then be used as a smart image suggestion system.
A similar system could be implemented, in which the images clicked during a search are auto-tagged with the search terms used. This would be kind of useful, but ultimately wouldn’t really do anything with the current search system (since there are no relevancy rankings — everything’s sorted by date).
Image Analysis — Common Colors
Tag Scrape/MD5 Lookup Service
There’s a bunch of sites out there which index basically the same image data — just do an image search on IQDB and you’ll get the same image back from multiple sites. I think all the sites IQDB indexes (except 4scrape) are Danbooru instances, which means they have delicious tags which can be scraped by MD5.
It’s kind of what Anisearch already does — aggregate tags from a bunch of Danbooru instances and throw them into a search index. Taking it a step further though, the service could not only aggregate tags, but provide an API to query against the image MD5 (to get a list of tags) and to adjust the tag weights.
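To make the "query by MD5, adjust the weights" part concrete, here is a tiny in-memory Python sketch of what the core of such a service might hold. `TagIndex`, `push`, `adjust` and `tags_for` are hypothetical names, and simply summing a weight per source is just one possible weighting scheme.

```python
class TagIndex:
    """Maps an image MD5 to weighted tags aggregated from several sources."""

    def __init__(self):
        self._weights = {}  # md5 -> {tag: weight}

    def push(self, md5, tags, source_weight=1.0):
        """Record that one source reported these tags for this image."""
        entry = self._weights.setdefault(md5, {})
        for tag in set(tags):
            entry[tag] = entry.get(tag, 0.0) + source_weight

    def adjust(self, md5, tag, delta):
        """Nudge a single tag's weight up or down (e.g. from user feedback)."""
        entry = self._weights.setdefault(md5, {})
        entry[tag] = entry.get(tag, 0.0) + delta

    def tags_for(self, md5):
        """Tags with positive weight, heaviest first, ties broken by name."""
        entry = self._weights.get(md5, {})
        live = [(tag, weight) for tag, weight in entry.items() if weight > 0]
        return sorted(live, key=lambda pair: (-pair[1], pair[0]))
```

Wrapping this in an HTTP API is mostly plumbing; the interesting decisions are how much to trust each source and who gets to call `adjust`.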
Once there was a service, it would be cool to integrate with other software like pImgDB (and have a Danbooru plugin to facilitate a push model rather than aggressive scraping) and have a massive image MD5 <-> tags thing.
Better Fulltext Indexing
Right now 4scrape uses PostgreSQL’s fulltext search engine. While it works and all, it’s kind of gross. At the very least, it needs a natural language wrapper which parses the queries and formats the search query properly (right now it just ANDs all terms together).
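A query wrapper along those lines might look like this Python sketch. The input syntax is an assumption picked for the example (bare words are ANDed, an uppercase `OR` joins its neighbours, a leading `-` negates), and the output is a string in PostgreSQL's tsquery operator syntax (`&`, `|`, `!`) meant to be handed to `to_tsquery` as a bound parameter.

```python
import re


def to_tsquery_string(query):
    """Turn 'a b OR c -d' into 'a & (b | c) & !d' for to_tsquery."""
    groups = []        # AND-ed groups; each group is a list of OR alternatives
    join_next = False  # True right after an OR keyword
    for token in query.split():
        if token == "OR":
            join_next = bool(groups)  # a leading OR has nothing to join to
            continue
        negated = token.startswith("-")
        # Keep letters and digits only so user input can't break the syntax.
        term = re.sub(r"[\W_]+", "", token).lower()
        if not term:
            continue
        if negated:
            term = "!" + term
        if join_next:
            groups[-1].append(term)
        else:
            groups.append([term])
        join_next = False
    parts = [g[0] if len(g) == 1 else "(" + " | ".join(g) + ")" for g in groups]
    return " & ".join(parts)
```

Stripping punctuation is crude (it mangles hyphenated words), but it keeps the generated string syntactically valid no matter what gets typed into the search box. An empty result means there is nothing to search for, so the caller should skip the query in that case.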
Still, it would be kind of cool to re-implement an existing fulltext index like Whoosh (a Python one) in Haskell. Some of the projects at work use Solr, which is a gross REST-based webservice (with support for some other stuff) built on top of Lucene. And it works really well except that it’s slow as fuck. There are a couple of other fulltext search solutions (Oracle, etc) but they’re ugh.
It’d be nice to have a fancy pants fulltext search index written in Haskell.
I don’t have nearly enough free time :(
PS: If anyone wants to implement any of this (since they can all mostly be implemented as services external to 4scrape), let me know and we can work something out :3