The random rantings of a concerned programmer.

4scrape — further ideas

March 10th, 2009 | Category: Random

I know I’ve been ranting a lot about Meimei, but hopefully I’ll have an early version ready to deploy in a couple days (want to have a couple instances deployed on some friends’ hosting first so I can make sure everything works). Once that’s taken care of, there are plenty more sub-projects to tackle -

User-Aided Autonomous Metadata Creation

A user requests an image (landing page). The server checks the cookies and sees that no other images are in the cookies. No data is modified server-side, but the client now has that image’s ID in its cookies. It requests another image. The server checks the cookies and sees that there is one image ID in the cookies. It modifies the “correlation files” for both the image requested and the one in the cookies to reflect the correlation (that is, it increments a count in each file for the other’s ID, creating an entry for the image if there wasn’t one). The client’s cookies now contain both image’s IDs. The client requests a third image. The server checks the cookies and sees two image IDs. It modifies the files for both of those images to reflect the correlation with the current one and the file for the current one to reflect the correlation with the other two. The client’s cookies now contain three image IDs.

And so on.

Meanwhile, any time anyone requests an image, the server checks the file for that image. The top ten or so images with the highest seen count are displayed.

Basically, the idea is that people tend to cluster into groups which have similar image tastes. By tracking what images are viewed in a single session, you can cluster images by preference group. Those clusterings can then be used as a smart image suggestion system.

A similar system could be implemented, in which the images clicked during a search are auto-tagged with the search terms used. This would be kind of useful, but ultimately wouldn’t really do anything with the current search system (since there are no relevancy rankings — everything’s sorted by date).

Image Analysis — Common Colors

Matt wrote a nice Python script like 6 months ago that (as I remember) would take an image, quantize it, then return the N most recurring colors. If you ran each image through this, a color index could be created, such that you select a color (or two) with a JavaScript color picker and the search would be able to limit images by that color.

Tag Scrape/MD5 Lookup Service

There’s a bunch of sites out there which index basically the same image data — just do an image search on IQDB and you’ll get the same image back from multiple sites. I think all the sites IQDB indexes (except 4scrape) are Danbooru instances, which means they have delicious tags which can be scraped by MD5.

It’s kind of what Anisearch already does — aggregate tags from a bunch of Danbooru instances and throw them into a search index. Taking it a step further though, the service could not only aggregate tags, but provide an API to query against the image MD5 (to get a list of tags) and to adjust the tag weights.

Once there was a service, it would be cool to integrate with other software like pImgDB (and have a Danbooru plugin to facilitate a push model rather than aggressive scraping) and have a massive image MD5 <-> tags thing.

Better Fulltext Indexing

Right now 4scrape uses PostgreSQL’s fulltext search engine. While it works and all, it’s kind of gross. At the very least, it needs a natural language wrapper which parses the queries and formats the search query properly (right now it just AND’s all terms together).

Still, it would be kind of cool to re-implement an existing fulltext index like Whoosh (a Python one) in Haskell. Some of the projects at work use Solr, which is a gross REST-based webservice (with support for some other stuff) built on top of Lucene. And it works really well except that it’s slow as fuck. There are a couple of other fulltext search solutions (Oracle, etc) but they’re ugh.

I’d be nice to have a fancy pants fulltext search index written in Haskell.

I don’t have nearly enough free time :(

PS: If anyone wants to implement any of this (since they can all mostly be implemented as services external to 4scrape), let me know and we can work something out :3

7 comments

(Untitled)

February 09th, 2009 | Category: Random

Apparently, the spicyness scale of a local Thai place goes up to 50 (they only list up to 5 on the menu). My roommate was ordering from the phone when he learned of this and ordered me a 20 (since I told him to just get me whatever was the spiciest, thinking it would be a 5). My god was that some hot fucking curry. I don’t even want to think about what 50 would be like, but if I ever order that (and successfully finish it) it had better be fucking free!

I guess I could go into a mini-rant about how I was stupid enough to mis-interpret how PostGRES chooses indexes on expressions — apparently if you define an index like

CREATE INDEX "idx_normal" ON s_posts AS gin (to_tsvector('english',
    ((post_subject || ' ' || post_comment || ' ' || post_origimg))
))

Then do a query like

... WHERE to_tsvector('english',
    post_subject || ' ' || post_comment || ' ' || post_origimg
) @@ plainto_tsquery('english', 'pantsu')

It doesn’t actually use the existing index. It recomputes the entire thing, which is kind of bollocks. I suspect it’s because I built the expression as the concatentation, rather than the generation of the ts_vector (though that’s a function, so it should work just as well). In any case, breaking it into three separate expressions OR’d together gave a megassa nyoro~n performance boost. I can take it a step further and throw the concatenated expression into a separate column and index over that, but I’ll wait until morning to dick with it.

2 comments