Tech behind Tech

Raw information. No finesse :)

Parsing XML in Clojure

Problem:

We need to parse an XML string and be able to query it with an XPath-style list of tags.

Example:

<friends>
  <person>
   <name>Siva</name>
  </person>
</friends>

I need a function that can do this:

(get-value xml :person :name)

which returns "Siva".

Solution:

To parse and query XML in Clojure we need to do the following three things.

1) Convert the XML string (or file) to a struct map

Clojure ships with a built-in XML library, clojure.xml (http://clojure.github.com/clojure/clojure.xml-api.html), whose parse function takes an InputStream and returns a struct map that represents the XML.

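;; Assumes the xml alias for clojure.xml and the ByteArrayInputStream import
;; from the full namespace declaration further down.
(defn get-struct-map [xml]
  (let [stream (ByteArrayInputStream. (.getBytes (.trim xml)))]
    (xml/parse stream)))

user> (get-struct-map xml)
{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]}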

This struct map is cumbersome to query.
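
To make "cumbersome" concrete, here is a rough sketch of pulling the name out of the parsed map by hand. The names sample-xml, parsed and children-with-tag are made up for this illustration and are not part of clojure.xml.

```clojure
(require '[clojure.xml :as xml])
(import 'java.io.ByteArrayInputStream)

(def sample-xml
  "<friends><person><name>Siva</name></person></friends>")

;; Parse the string into the nested map shown above.
(def parsed
  (xml/parse (ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))))

;; Positional lookup: short, but it breaks as soon as element order changes.
(def name-by-position
  (get-in parsed [:content 0 :content 0 :content 0]))

(defn children-with-tag
  "Hypothetical helper: child elements of node whose :tag equals tag."
  [node tag]
  (filter #(= tag (:tag %)) (:content node)))

;; Lookup by tag: one filter step for every level of nesting.
(def name-by-tag
  (-> parsed
      (children-with-tag :person) first
      (children-with-tag :name) first
      :content first))
```

Both lookups give "Siva" for this document, but every extra level of nesting means another hand-written step.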

2) Convert Struct Map to Zipper Data Structure

“A zipper is a technique of representing an aggregate data structure so that it is convenient for writing programs that traverse the structure arbitrarily and update its contents, especially in purely functional programming languages.” http://en.wikipedia.org/wiki/Zipper_%28data_structure%29

To make traversal easier, we will turn the struct map into a zipper data structure. Clojure comes with a zip library (zip is short for zipper): http://clojure.github.com/clojure/clojure.zip-api.html

We will use this zip library to convert the XML struct map to a zipper data structure.

user> (clojure.zip/xml-zip xml-struct)
[{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]} nil]
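
A small sketch of what the zipper buys us, using only functions from clojure.zip (down, node, root). The struct map is written out literally here so the snippet runs on its own; root-loc, person-loc and name-loc are names chosen just for this example.

```clojure
(require '[clojure.zip :as zip])

;; The same nested map that clojure.xml/parse produced, written literally.
(def xml-struct
  {:tag :friends, :attrs nil,
   :content [{:tag :person, :attrs nil,
              :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]})

(def root-loc (zip/xml-zip xml-struct))

;; zip/down moves to the leftmost child of a location.
(def person-loc (zip/down root-loc))
(def name-loc   (zip/down person-loc))

;; zip/node returns the element (or text) stored at a location.
(:tag (zip/node person-loc))       ; :person
(:tag (zip/node name-loc))         ; :name
(zip/node (zip/down name-loc))     ; "Siva"

;; A location remembers its path, so we can get back to the whole tree.
(= xml-struct (zip/root name-loc)) ; true
```

Walking with zip/down one step at a time is still manual, which is where the next step comes in.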

3) Use Zip-filter library and query zipper data structure.

Now that we have our XML in a zipper data structure, we can query it with the zip-filter library from clojure.contrib. http://clojure.github.com/clojure-contrib/zip-filter-api.html

user> (clojure.contrib.zip-filter.xml/xml-> zipper-struct :person :name)
("Siva")

Putting this all together

(ns com.sivajag.utils.xml
  (:import (java.io ByteArrayInputStream))
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]
            [clojure.contrib.zip-filter.xml :as zf]))

(defn get-struct-map
  "Parses an XML string into a struct map; returns nil for a nil or empty string."
  [xml]
  (if-not (empty? xml)
    (let [stream (ByteArrayInputStream. (.getBytes (.trim xml)))]
      (xml/parse stream))))

(defn get-value
  "Returns the text of the first element found by following tags down from the root element."
  [xml & tags]
  (apply zf/xml1-> (zip/xml-zip (get-struct-map xml)) (conj (vec tags) zf/text)))

user> (get-value xml :person :name)
"Siva"

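For anyone curious what a tag query like this does underneath, here is a minimal sketch that imitates it with clojure.xml and clojure.zip alone. It is not the zip-filter implementation: child-locs, tag-step, node-text and get-value-sketch are hypothetical names, and node-text only collects the strings directly inside the matched element. The one assumption carried over from the example above is that each keyword is matched against the children of the current location, which is why the query starts at :person even though the root is friends.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import 'java.io.ByteArrayInputStream)

(defn child-locs
  "Hypothetical helper: zipper locations of all children of loc."
  [loc]
  (take-while some? (iterate zip/right (zip/down loc))))

(defn tag-step
  "Hypothetical helper: from every loc in locs, descend to the child
  elements whose :tag equals tag."
  [locs tag]
  (for [loc   locs
        child (child-locs loc)
        :when (= tag (:tag (zip/node child)))]
    child))

(defn node-text
  "Hypothetical helper: the strings directly inside the element at loc."
  [loc]
  (apply str (filter string? (:content (zip/node loc)))))

(defn get-value-sketch
  "Follows tags down from the root element and returns the text of the
  first match, or nil when nothing matches."
  [^String xml-string & tags]
  (let [stream (ByteArrayInputStream. (.getBytes (.trim xml-string) "UTF-8"))
        root   (zip/xml-zip (xml/parse stream))]
    (when-let [loc (first (reduce tag-step [root] tags))]
      (node-text loc))))

(def sample-xml
  "<friends><person><name>Siva</name></person></friends>")

(get-value-sketch sample-xml :person :name)          ; "Siva"
(get-value-sketch sample-xml :friends :person :name) ; nil
```

In this sketch a query that starts with the root tag finds nothing, because friends has no child that is itself named friends.
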
Happy Coding!!!

Written by Siva Jagadeesan

June 25, 2010 at 1:57 pm

Posted in Clojure

5 Responses

  1. Can you explain what’s happening with apply zf/xml1-> in the get value function? It looks like (conj (vec tags) zf/text) is building a list of functions to invoke but I don’t see how this all comes together. Thanks.

    Scott Hickey

    August 26, 2010 at 10:58 am

    • Hi Scott:

      zf/xml1-> takes a location and a set of predicates.

      From the xml-> doc (xml1-> returns the first item of the same query),

      “The loc is passed to the first predicate. If the predicate returns
      a collection, each value of the collection is passed to the next
      predicate. If it returns a location, the location is passed to the
      next predicate. If it returns true, the input location is passed to
      the next predicate. If it returns false or nil, the next predicate
      is not called.

      This process is repeated, passing the processed results of each
      predicate to the next predicate. xml-> returns the final sequence.
      The entire chain is evaluated lazily.

      There are also special predicates: keywords are converted to tag=,
      strings to text=, and vectors to sub-queries that return true if
      they match.”

      In the get-value function, we create the location with

      (zip/xml-zip (get-struct-map xml))

      and the set of predicates with

      (conj (vec tags) zf/text)

      After the conj we have the vector

      [:person :name text-fn]

      The xml1-> call treats keywords as special predicates by converting them to tag= functions.

      tag=
      Usage: (tag= tagname)
      Returns a query predicate that matches a node when its is a tag named tagname.

      So when we pass the ROOT location, (tag= :person) is called first and returns the PERSON location.

      The PERSON location is then passed to the (tag= :name) predicate, which returns the NAME location.

      To get the text from the NAME location we use a built-in predicate (text).

      If you take zf/text out of the get-value fn, you can see that it returns

      ([{:tag :name, :attrs nil, :content ["Siva"]} {:l [], :pnodes [{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]} {:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}], :ppath {:l [], :pnodes [{:tag :friends, :attrs nil, :content [{:tag :person, :attrs nil, :content [{:tag :name, :attrs nil, :content ["Siva"]}]}]}], :ppath nil, :r nil}, :r nil}])

      I hope I answered your question. Let me know if it is still unclear.

      regards,

      — Siva Jagadeesan

      Siva Jagadeesan

      August 26, 2010 at 3:01 pm

  2. Hey Siva,

    Nice post! You may like to replace in the URL the text “richhickey.github.com” with “clojure.github.com” as the former is not being maintained anymore. I know it’s easy to overlook – just a gentle reminder. :)

    Regards,
    Shantanu

    Shantanu Kumar

    March 6, 2011 at 11:07 am

  3. Hey, Siva,

    I have been playing with your example, making slight changes to understand what each function does. Could you explain why this code doesn’t work?

    (get-value xml :friends :person :name)

    Thank you

    Otavio Macedo

    January 2, 2012 at 4:50 am

