Before the shitstorm hits: 4scrape is dead for a while due to an HDD failure. Yes, there are offsite backups for all the stuff on there, but meh, it’s probably going to be a while before I’m arsed to actually restore it. Kind of want to rewrite the damn thing again. I cbf’d to put the images in a torrent or whatever because they’re all shit anyway. RIP.
And I csup’d a bad kernel last night on my laptop or something, because it panicked a couple of minutes ago. Naturally this means spending the rest of the night dicking around with it to get a crash dump (it didn’t leave one FUUUUUUUUUUU), then dicking around in gdb to find the cause of the crash (or just posting the traces to the mailing list). <3
Pic potentially related to binge drinking for the past week.
Okay, I have to stumble through the Haddock for this shit every time I want to get something done with Haskell, so I’m going to type some of this up for the next time I have to deal with it. Hopefully this will help me remember the specifics better.
Fetching shit over HTTP
First, we need to fetch the raw data. Before we can do that, we need to parse the target URL into a Network.URI.URI –
ghci> import Network.URI
ghci> let url = "http://feeds.jlist.com/SEARCH/evangelion/feed.xml"
ghci> let Just uri = parseURI url
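Side note: that let Just uri = ... binding only works as long as parseURI actually hands back a Just. If the URL is junk, you get a pattern match failure the first time uri is used. Here's a minimal sketch of doing it less recklessly; parseTarget is a name I made up for illustration, it is not part of Network.URI.

```haskell
import Network.URI (URI, parseURI)

-- Hypothetical helper (not a library function): parse a URL string,
-- or return a message saying why it could not be used.
parseTarget :: String -> Either String URI
parseTarget url =
  case parseURI url of
    Nothing  -> Left ("not an absolute URI: " ++ url)
    Just uri -> Right uri
```

In ghci, parseTarget "feeds.jlist.com/feed.xml" should come back as a Left, since parseURI wants an absolute URI with a scheme on the front.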
There’s a lot of ways to fetch shit over HTTP, but I prefer Network.HTTP.simpleHTTP because it’s simple. We just have to construct a Request object from our URI (and you can put headers and shit in there too — but for posting stuff there’s better interfaces I think) -
ghci> import Network.HTTP
ghci> :t Request
Request :: URI -> RequestMethod -> [Header] -> String -> Request
ghci> let req = Request uri GET [] ""
And then perform the simpleHTTP call, which returns an Either we have to deal with (I just let the shit error out) –
ghci> res <- simpleHTTP req
Right HTTP/1.1 200 OK
Date: Wed, 01 Apr 2009 22:47:42 GMT
Server: Apache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/xml
Content-Length: 65190
ghci> let Right rsp = res
There’s various accessor functions for grabbing all of the pieces of the response object, but the only thing I’m interested in is the body –
ghci> let xml = rspBody rsp
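For anything that isn't a throwaway ghci session, letting the Either error out is a bit much. Below is a rough sketch of dealing with it, under a few assumptions: it targets a recent HTTP package (4000.x), where Request and Response take a body type parameter, unlike the plain Request shown above; fetchBody and bodyOrError are names I made up; and treating every non-2xx status as a failure is my own choice, not something the library forces on you.

```haskell
import Network.HTTP (Request(..), RequestMethod(GET), Response(..), simpleHTTP)
import Network.URI (URI)

-- Hypothetical helper: GET a URI and return the body,
-- or a description of what went wrong.
fetchBody :: URI -> IO (Either String String)
fetchBody uri = do
  res <- simpleHTTP (Request uri GET [] "")
  return $ case res of
    Left connErr -> Left (show connErr)   -- connection-level failure
    Right rsp    -> bodyOrError rsp

-- Keep the body for 2xx responses; anything else becomes an error message.
bodyOrError :: Response String -> Either String String
bodyOrError rsp =
  case rspCode rsp of
    (2, _, _) -> Right (rspBody rsp)
    code      -> Left ("HTTP error " ++ show code ++ ": " ++ rspReason rsp)
```

rspCode is a triple of digits, so a 404 shows up as (4,0,4), which is why the match is on the first component.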
Parsing with Text.XML.Light
Now that we have the data, we need to parse it. There’s a bajillion Haskell libraries to parse and manipulate XML documents (HXT, HaXml, libxml bindings, etc), but for lightweight work I find Text.XML.Light to be simple enough. Arguably, this would be faster to deal with using some XPath but ugh that shit (and XSLT) makes me kind of queasy (for no good reason). Fucking XML.
Anyway, have to parse the XML document –
ghci> import Text.XML.Light
ghci> let Just doc = parseXMLDoc xml
In this case, I’m dealing with RSS, which is fairly straightforward — I just want to grab all the “item” elements, then from each of those, grab the title, link and description. The only muddy thing about traversal with XML.Light is using QName (qualified names?) which really is just an extra step in the process. Fucking XML.
ghci> let items = findElements (QName "item" Nothing Nothing) doc
ghci> length items
58
And this is where I introduce totally unmaintainable code
ghci> let stuff = map (\e -> map (\n -> maybe "" strContent (findElement (QName n Nothing Nothing) e)) ["title", "guid", "description"]) items
ghci> stuff !! 0 !! 0
"Evangelion Head Interface ~ Rei Ayanami ver. "
Could probably clean that up a lot by sticking the lambdas and the ["title", ...] into where clauses but I am too lazy. Anyway, that’s all our stuff and we can just throw that into JSON with Text.JSON.encode and be on our way. Hooray.
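For the record, here's roughly what the less lazy version might look like, with the JSON step bolted on. It's a sketch, not gospel: fields, field, itemFields and feedToJSON are all names I made up, unqual is Text.XML.Light's shorthand for a QName with no URI or prefix, and emitting one JSON object per item (rather than a bare list of lists) is my own call.

```haskell
import Text.JSON (encode, toJSObject)
import Text.XML.Light
  (Element, findElement, findElements, parseXMLDoc, strContent, unqual)

-- Child tags pulled out of each <item>, same list as the one-liner.
fields :: [String]
fields = ["title", "guid", "description"]

-- Text of the first element named `name` under `item`, or "" if missing.
field :: Element -> String -> String
field item name = maybe "" strContent (findElement (unqual name) item)

-- One (tag, text) association list per <item> in the document.
itemFields :: Element -> [[(String, String)]]
itemFields doc = map row (findElements (unqual "item") doc)
  where
    row item = [(name, field item name) | name <- fields]

-- Parse a feed and encode its items as a JSON array of objects.
-- Nothing means the input did not parse as XML at all.
feedToJSON :: String -> Maybe String
feedToJSON xml = do
  doc <- parseXMLDoc xml
  return (encode (map toJSObject (itemFields doc)))
```

With the xml string fetched earlier, feedToJSON xml should give back a Just wrapping one big JSON array.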
Not only did I spend way too much fucking money at Katsucon, but the Korean mafia is after my head. Goddammit.