Parsing XML in Clojure
Problem:
We need to parse an XML string and query it with an XPath-style list of tags.
Example:
<friends> <person> <name>Siva</name> </person> </friends>
I need a function that can do this:
(get-value xml :person :name)
which returns "Siva".
Solution:
To parse and query XML in Clojure we need to do the following three things.
1) Convert XML string (file) to Struct Map
Clojure core comes with a built-in XML library (clojure.xml http://clojure.github.com/clojure/clojure.xml-api.html). Its parse function takes an InputStream and returns a struct map that represents the XML.
(defn get-struct-map [xml]
  (let [stream (ByteArrayInputStream. (.getBytes (.trim xml)))]
    (xml/parse stream)))
user> (get-struct-map xml)
{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]}
This struct map is cumbersome to query.
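To see why, here is a minimal sketch that parses the sample document and pulls the name out by hand. The names sample-xml, parse-str, parsed and name-by-hand are made up for this illustration, and the explicit "UTF-8" charset is my assumption rather than something the function above does.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream))

;; The sample document from the problem statement.
(def sample-xml "<friends> <person> <name>Siva</name> </person> </friends>")

(defn parse-str
  "Parses an XML string with clojure.xml (hypothetical helper).
  The UTF-8 charset is an assumption made for this sketch."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes (.trim s) "UTF-8"))))

(def parsed (parse-str sample-xml))

;; Walking the nested :content vectors by hand: every level needs its own
;; :content / first pair, and the path is hard-wired to the first child.
(def name-by-hand
  (-> parsed
      :content first    ; the <person> element
      :content first    ; the <name> element
      :content first))  ; the text inside <name>
```

This only works because each element here has exactly one child. With several person elements we would have to filter each :content vector by :tag ourselves.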
2) Convert Struct Map to Zipper Data Structure
“A zipper is a technique of representing an aggregate data structure so that it is convenient for writing programs that traverse the structure arbitrarily and update its contents, especially in purely functional programming languages.” http://en.wikipedia.org/wiki/Zipper_%28data_structure%29
To make traversal easier we will convert the struct map to a zipper data structure. Clojure comes with a zip library (zip is short for zipper) http://clojure.github.com/clojure/clojure.zip-api.html
We will use this zip library to convert the XML struct map to a zipper data structure.
user> (clojure.zip/xml-zip xml-struct)
[{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]} nil]
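The printed zipper is just a pair of the root node and a path (nil at the root). As a rough sketch of what it buys us, the functions in clojure.zip can move around it. The names xml-struct, root-loc and name-loc below are only for this illustration, and the map is typed in by hand so the snippet runs on its own.

```clojure
(require '[clojure.zip :as zip])

;; The parsed structure shown above, written out as a plain map.
(def xml-struct
  {:tag :friends, :attrs nil,
   :content [{:tag :person, :attrs nil,
              :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]})

(def root-loc (zip/xml-zip xml-struct))

;; zip/down moves to the leftmost child of the current location.
(def name-loc (-> root-loc zip/down zip/down))

(:tag (zip/node name-loc))            ; tag of the node we landed on
(zip/node (zip/down name-loc))        ; the text child of <name>
(:tag (zip/node (zip/up name-loc)))   ; the parent element's tag
(= xml-struct (zip/root name-loc))    ; the whole tree is reachable from any location
```

Each location remembers how it was reached, which is what lets zip/up and zip/root work from anywhere in the tree.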
3) Use the zip-filter library to query the zipper data structure
Now that we have our XML in a zipper data structure, we can query it with the zip-filter library from clojure.contrib. http://clojure.github.com/clojure-contrib/zip-filter-api.html
user> (clojure.contrib.zip-filter.xml/xml-> zipper-struct :person :name clojure.contrib.zip-filter.xml/text)
("Siva")
Putting it all together
(ns com.sivajag.utils.xml
  (:import (java.io ByteArrayInputStream))
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]
            [clojure.contrib.zip-filter.xml :as zf]))

;; Parses an XML string into a struct map; returns nil for nil or empty input.
(defn get-struct-map [xml]
  (if-not (empty? xml)
    (let [stream (ByteArrayInputStream. (.getBytes (.trim xml)))]
      (xml/parse stream))))

;; Follows tags from the root and returns the text of the first match.
(defn get-value [xml & tags]
  (apply zf/xml1-> (zip/xml-zip (get-struct-map xml)) (conj (vec tags) zf/text)))
user> (get-value xml :person :name)
"Siva"
Happy Coding!!!

Can you explain what’s happening with apply zf/xml1-> in the get value function? It looks like (conj (vec tags) zf/text) is building a list of functions to invoke but I don’t see how this all comes together. Thanks.
Scott Hickey
August 26, 2010 at 10:58 am
Hi Scott:
zf/xml1-> takes a location and a set of predicates, and returns the first item that zf/xml-> would return.
From the xml-> doc,
“The loc is passed to the first predicate. If the predicate returns
a collection, each value of the collection is passed to the next
predicate. If it returns a location, the location is passed to the
next predicate. If it returns true, the input location is passed to
the next predicate. If it returns false or nil, the next predicate
is not called.
This process is repeated, passing the processed results of each
predicate to the next predicate. xml-> returns the final sequence.
The entire chain is evaluated lazily.
There are also special predicates: keywords are converted to tag=,
strings to text=, and vectors to sub-queries that return true if
they match.”
In the get-value function, we create the location with
(zip/xml-zip (get-struct-map xml))
and the set of predicates with
(conj (vec tags) zf/text)
After the conj we have a vector of
[:person :name text-fn]
The xml1-> call treats keywords as special predicates by converting them to tag= fns. From the tag= doc,
Usage: (tag= tagname)
Returns a query predicate that matches a node when it is a tag named tagname.
So when we pass the ROOT location, (tag= :person) is called first and returns the PERSON location.
The PERSON location is then passed to the (tag= :name) predicate, which returns the NAME location.
To get the text from the NAME location we use the built-in text fn (zf/text).
If you take zf/text out of the get-value fn, you can see that it returns the NAME location itself instead of its text:
([{:tag :name, :attrs nil, :content ["Siva"]} {:l [], :pnodes [{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]} {:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}], :ppath {:l [], :pnodes [{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]}], :ppath nil, :r nil}, :r nil}])
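To make the chaining concrete, here is a rough imitation of it written with clojure.zip only. This is not how zip-filter is implemented. The names child-locs, children-tagged and first-text are invented for this sketch, and first-text only joins the direct text children, which is a simplification of what zf/text does.

```clojure
(require '[clojure.zip :as zip])

;; The parsed sample document, written out as a plain map.
(def xml-struct
  {:tag :friends, :attrs nil,
   :content [{:tag :person, :attrs nil,
              :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]})

(defn child-locs
  "Child locations of loc, left to right (hypothetical helper)."
  [loc]
  (take-while some? (iterate zip/right (zip/down loc))))

(defn children-tagged
  "Stand-in for a keyword predicate: the children of loc whose tag is tag."
  [loc tag]
  (filter #(= tag (:tag (zip/node %))) (child-locs loc)))

(defn first-text
  "Follows tags one level at a time from the root and returns the direct
  text of the first matching location, or nil when nothing matches."
  [root & tags]
  (let [step (fn [locs tag] (mapcat #(children-tagged % tag) locs))
        loc  (first (reduce step [(zip/xml-zip root)] tags))]
    (when loc
      (apply str (filter string? (:content (zip/node loc)))))))

(first-text xml-struct :person :name)           ; finds the NAME location
(first-text xml-struct :friends :person :name)  ; no match
```

In this sketch each tag is looked up among the children of the current location, so the tag list starts below the root. The ROOT location already is the friends element, and asking for :friends first looks for a friends child that does not exist. As far as I understand, keyword predicates in xml-> behave the same way when given the root location.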
I hope I answered your question. Let me know if it is still unclear.
regards,
– Siva Jagadeesan
Siva Jagadeesan
August 26, 2010 at 3:01 pm
Hey Siva,
Nice post! You may like to replace in the URL the text “richhickey.github.com” with “clojure.github.com” as the former is not being maintained anymore. I know it’s easy to overlook – just a gentle reminder.
Regards,
Shantanu
Shantanu Kumar
March 6, 2011 at 11:07 am
Thanks Shantanu.
I have updated the urls.
Siva Jagadeesan
March 8, 2011 at 1:50 pm
Hey, Siva,
I have been playing with your example, making slight changes to understand what each function does. Could you explain why this code doesn’t work?
(get-value xml :friends :person :name)
Thank you
Otavio Macedo
January 2, 2012 at 4:50 am