Okay, I have to stumble through the Haddock for this shit every time I want to get something done with Haskell, so I’m going to type some of this up for the next time I have to deal with it. Hopefully this will help me remember the specifics better.
I want to implement a “feature” on 4scrape where when someone does a search, the JavaScript also does a search on JList and injects some product suggestions into the results. Since JList only provides HTML/XML interfaces (and not JSONP), I have to write a proxy service, and if I’m going to do that, I might as well have it cache, parse and convert the JList product data into JSON so the JavaScript side is really easy to write.
Fetching shit over HTTP
First, we need to fetch the raw data. Before we can do that, we need to parse the target URL into a Network.URI.URI –
ghci> import Network.URI
ghci> let url = "http://feeds.jlist.com/SEARCH/evangelion/feed.xml"
ghci> let Just uri = parseURI url
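That `let Just uri = ...` pattern is fine for poking around in ghci, but it blows up (lazily, when `uri` is first used) if the string isn't a valid absolute URI. A minimal sketch of a safer version, where `parseTarget` is a made-up name of mine and not something from Network.URI:

```haskell
import Network.URI (URI, parseURI)

-- Hypothetical helper (not part of Network.URI): report a bad URL as a
-- Left instead of crashing on a failed pattern match.
parseTarget :: String -> Either String URI
parseTarget s = maybe (Left ("not an absolute URI: " ++ s)) Right (parseURI s)
```

Load that in ghci and `parseTarget url` gives a `Right` for the feed URL above and a `Left` with a message for garbage input.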
There are a lot of ways to fetch shit over HTTP, but I prefer Network.HTTP.simpleHTTP because it’s simple. We just have to construct a Request object from our URI (you can put headers and shit in there too, but for posting stuff there are better interfaces I think) –
ghci> import Network.HTTP
ghci> :t Request
Request :: URI -> RequestMethod -> [Header] -> String -> Request
ghci> let req = Request uri GET [] ""
And then perform the simpleHTTP call, which returns an Either we have to deal with (I just let the shit error out) –
ghci> res <- simpleHTTP req
Right HTTP/1.1 200 OK
Date: Wed, 01 Apr 2009 22:47:42 GMT
Server: Apache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/xml
Content-Length: 65190
ghci> let Right rsp = res
There’s various accessor functions for grabbing all of the pieces of the response object, but the only thing I’m interested in is the body –
ghci> let xml = rspBody rsp
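For anything that runs outside ghci, the `let Right rsp = res` trick should probably go. Here is a rough sketch of the same fetch with the Either handled, assuming a reasonably recent version of the HTTP package (one that has `getRequest` and the `Response` record). `bodyOrError` and `fetchBody` are names I made up for illustration:

```haskell
import Network.HTTP (Response (..), getRequest, simpleHTTP)
import Network.Stream (Result)

-- Hypothetical helper: collapse the Either from simpleHTTP into either
-- an error message or the response body. Only 2xx counts as success.
bodyOrError :: Result (Response String) -> Either String String
bodyOrError (Left err) = Left ("connection error: " ++ show err)
bodyOrError (Right rsp) =
  case rspCode rsp of
    (2, _, _) -> Right (rspBody rsp)
    code      -> Left ("unexpected status " ++ show code ++ " " ++ rspReason rsp)

-- Hypothetical helper: fetch a URL given as a String. getRequest builds
-- the GET request and calls error itself if the string is not a valid URI.
fetchBody :: String -> IO (Either String String)
fetchBody url = bodyOrError <$> simpleHTTP (getRequest url)
```

Keeping `bodyOrError` pure means the status handling can be checked without touching the network; `fetchBody` is just the thin IO wrapper around it.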
Parsing with Text.XML.Light
Now that we have the data, we need to parse it. There are a bajillion Haskell libraries to parse and manipulate XML documents (HXT, HaXml, libxml bindings, etc), but for lightweight work I find Text.XML.Light to be simple enough. Arguably, this would be faster to deal with using some XPath but ugh that shit (and XSLT) makes me kind of queasy (for no good reason). Fucking XML.
Anyway, have to parse the XML document –
ghci> import Text.XML.Light
ghci> let Just doc = parseXMLDoc xml
In this case, I’m dealing with RSS, which is fairly straightforward: I just want to grab all the “item” elements, then from each of those, grab the title, link and description. The only muddy thing about traversal with XML.Light is QName (a qualified name: the local name plus an optional namespace URI and an optional prefix), which really is just an extra step in the process. Fucking XML.
ghci> let items = findElements (QName "item" Nothing Nothing) doc
ghci> length items
58
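Text.XML.Light also exports `unqual`, which builds a QName with no namespace URI and no prefix, so the `QName "item" Nothing Nothing` noise can go away. A small sketch that also deals with the Maybe from `parseXMLDoc` (the name `rssItems` is mine, not the library's):

```haskell
import Text.XML.Light (Element, findElements, parseXMLDoc, unqual)

-- Hypothetical helper: parse a document and return its <item> elements,
-- or an empty list if parseXMLDoc cannot find a root element at all.
rssItems :: String -> [Element]
rssItems xml = maybe [] (findElements (unqual "item")) (parseXMLDoc xml)
```

With that loaded, `length (rssItems xml)` should give the same count as `length items` above, since `unqual "item"` is the same QName written the short way.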
And this is where I introduce totally unmaintainable code –
ghci> let stuff = map (\e -> map (\n -> maybe "" strContent (findElement (QName n Nothing Nothing) e)) ["title", "guid", "description"]) items
ghci> stuff !! 0 !! 0
"Evangelion Head Interface ~ Rei Ayanami ver. "
Could probably clean that up a lot by sticking the lambdas and the ["title", ...] into where clauses but I am too lazy. Anyway, that’s all our stuff and we can just throw that into JSON with Text.JSON.encode and be on our way. Hooray.
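For the record, here is roughly what the less lazy version might look like: the two lambdas get names, and the result goes through Text.JSON (from the json package) as an array of objects. This is a sketch, not what 4scrape actually runs; `fields`, `fieldText`, `itemToJSON` and `feedToJSON` are all names I invented for it.

```haskell
import Text.JSON (JSValue, encode, makeObj, showJSON)
import Text.XML.Light
  (Element, findElement, findElements, parseXMLDoc, strContent, unqual)

-- The child elements pulled out of every <item>.
fields :: [String]
fields = ["title", "guid", "description"]

-- Text of the first descendant with the given name, or "" if there is none.
fieldText :: Element -> String -> String
fieldText item name = maybe "" strContent (findElement (unqual name) item)

-- One <item> as a JSON object keyed by field name.
itemToJSON :: Element -> JSValue
itemToJSON item = makeObj [(name, showJSON (fieldText item name)) | name <- fields]

-- A whole feed (as an XML string) to a JSON array of objects.
-- Input that does not parse yields an empty array.
feedToJSON :: String -> String
feedToJSON xml = encode (map itemToJSON items)
  where
    items = maybe [] (findElements (unqual "item")) (parseXMLDoc xml)
```

Emitting objects keyed by field name instead of nested lists means the JavaScript side can say `item.title` rather than remembering that index 0 is the title.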