I like having lots of data, but I hate having it in a completely unusable mess. I have a couple of friends who have fairly massive repositories of images and video all strewn in a big pile. I’m kind of guilty of doing it myself. The problem is, these big piles are absolutely unusable because they don’t provide any means of attaching non-trivial metadata to the content.
One of my goals for the next month or so is to write a Danbooru-like system which is geared towards taxonomic categorization of a folder of shit. The idea is that you point the software (which is effectively a web application written in Haskell) at a directory. Any file in that directory can then be turned into a Danbooru-style post — a process which moves the file out of the watch directory and into the system’s internal repository and attaches some initial metadata.
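Just to make the idea concrete, here's a very rough sketch of what that ingest step could look like. The Post type, the flat repository layout, and the function names are all made up for illustration; this isn't the real design:

```haskell
import System.Directory (renameFile, createDirectoryIfMissing)
import System.FilePath ((</>))

-- A post is just a file in the repository plus whatever tags we have so far.
data Post = Post { postFile :: FilePath, postTags :: [String] }
  deriving Show

-- | Turn a file sitting in the watch directory into a bare post: move it
-- into the repository and return a skeleton record for the metadata
-- scrapers to fill in. (renameFile only works within one filesystem; a
-- real version would copy and delete across mounts.)
ingest :: FilePath -> FilePath -> FilePath -> IO Post
ingest watchDir repoDir name = do
  createDirectoryIfMissing True repoDir
  let src  = watchDir </> name
      dest = repoDir  </> name
  renameFile src dest
  return (Post dest [])
```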
Even if you just have a raw file, you can scrape quite a bit of metadata from it. I spent a bit of time tonight writing a small wrapper around mplayer which can be used to dump all the metadata from a video file and grab arbitrary screenshots from it. Combine this with a system which knows how fansubbers usually name their files and how to look up series names on AniDB/MAL, and you've got a good amount of automatically-generated metadata right off the bat (which can then be supplemented by users).
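The wrapper basically shells out to mplayer twice: -identify with -frames 0 dumps ID_KEY=value metadata lines on stdout, and -vo png after a -ss seek grabs frames. Something along these lines (module and function names are placeholders, and error handling is omitted):

```haskell
module MPlayerWrap (identify, screenshot) where

import System.Process (readProcess)
import Data.List (isPrefixOf)

-- | Run mplayer in identify mode and return the ID_KEY=value pairs it prints.
identify :: FilePath -> IO [(String, String)]
identify file = do
  out <- readProcess "mplayer"
           ["-vo", "null", "-ao", "null", "-frames", "0", "-identify", file] ""
  return [ (k, drop 1 v)
         | l <- lines out
         , "ID_" `isPrefixOf` l
         , let (k, v) = break (== '=') l ]

-- | Dump a single PNG frame at the given offset (in seconds) into outDir.
screenshot :: FilePath -> Int -> FilePath -> IO ()
screenshot file seconds outDir = do
  _ <- readProcess "mplayer"
         [ "-ss", show seconds, "-nosound"
         , "-vo", "png:outdir=" ++ outDir
         , "-frames", "1", file ] ""
  return ()
```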
You can do the same thing with images; there are tons of existing massive repositories of image metadata. Aside from looking at the metadata embedded in the file itself, you can take the MD5 of the image and pass it to a fairly wide variety of services (4scrape, IQDB, Danbooru instances, etc.) and pull back tags from any hits you find. Sure, you won't find matches for every image, but you'd be surprised how many images are already posted and categorized elsewhere.
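A minimal sketch of that lookup, assuming the pureMD5 and HTTP packages and the md5:&lt;hash&gt; tag query that Danbooru-style APIs expose (the exact endpoint path is an assumption on my part):

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Digest.Pure.MD5 (md5)
import Network.HTTP (simpleHTTP, getRequest, getResponseBody)

-- | Hex MD5 of a file on disk.
fileMD5 :: FilePath -> IO String
fileMD5 path = (show . md5) `fmap` BL.readFile path

-- | Ask a Danbooru-style instance for any post matching the hash;
-- returns the raw JSON response (an empty list means no hit).
lookupByMD5 :: String -> String -> IO String
lookupByMD5 booru hash = do
  let url = booru ++ "/post/index.json?tags=md5:" ++ hash
  simpleHTTP (getRequest url) >>= getResponseBody
```

Feeding the hash from fileMD5 to lookupByMD5 against each service in turn would give you a pile of candidate tags to merge into the post.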
The reason I want to use a watch directory is that it makes it really easy to tie into an environment where people upload content via NFS/SMB. I'm most interested in good categorization of video data (since there are already good existing image repositories), and uploading 200MB+ files over anything but NFS/SMB is a massive pain in the ass.
Additionally, it makes it really easy to tie into an automated download system: hook an RSS agent up to download torrents into rtorrent's watch directory, then have a script move the finished files into the repository's watch directory, and boom, your stuff automatically gets pulled and indexed. Hook it up to a Twitter feed and have a monitor set up in the living room to display that feed, and you've got a pretty good media discovery system.
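That "script to move the finished files" could be as dumb as a periodic sweep. Something like the following, with made-up paths and a crude "nothing has touched it for five minutes" heuristic standing in for a proper rtorrent completion hook:

```haskell
import Control.Monad (forM_, when)
import System.Directory (getDirectoryContents, getModificationTime, renameFile)
import System.FilePath ((</>))
import Data.Time.Clock (getCurrentTime, diffUTCTime)

main :: IO ()
main = do
  let doneDir  = "/srv/rtorrent/done"   -- hypothetical paths
      watchDir = "/srv/repo/watch"
  now   <- getCurrentTime
  names <- getDirectoryContents doneDir
  forM_ [n | n <- names, n `notElem` [".", ".."]] $ \n -> do
    mtime <- getModificationTime (doneDir </> n)
    -- Assume anything untouched for five minutes is a finished download.
    when (diffUTCTime now mtime > 300) $
      renameFile (doneDir </> n) (watchDir </> n)
```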
Anyway, we’ll see how it turns out. I’m hoping it’s going to be fucking awesome, but I think it’s going to take a while to get everything up and running. And so, art imitates life.