Zen and the art of...

2010-01-19

A Recursive Walk on a String

Sometimes you don't pay attention to some details and this lead you right into the wrong direction. That's exactly what happened to me the other day. I had a specific problem I intended to solve for a while, nothing particularly difficult, but an interesting one. After hacking around for an hour, the situation was getting weird and I decided to ask for help on the Clojure Google Group. This post is an account of how that situation arose and what have been done to solve it.

Walking Recursively

The task is nearly as simple as explaining it. I needed a way to apply a function to each strings contained in a tree of nested data structures in Clojure. Not that easy to explain as you see, well you didn't really saw me rewrite the previous sentence a thousand times but you can probably picture it. My first attempt is really naive and leverages the clojure.walk library written by Stuart Sierra. It's a small library that provides functions to perform non-recursive walking of generic Clojure data structures. Lets first have a look at the walk function documentation.

user> (doc clojure.walk/walk)
-------------------------
clojure.walk/walk
([inner outer form])
  Traverses form, an arbitrary data structure.  inner and outer are
  functions.  Applies inner to each element of form, building up a
  data structure of the same type, then applies outer to the result.
  Recognizes all Clojure data structures except sorted-map-by.
  Consumes seqs as with doall. 
nil

It isn't hard to take this function and use it in a recursive one to do the job. Our function to transform strings could go in the inner argument, it's also where we'll add the recursive call. We don't need the outer one, so we will just use the identity function.

(use 'clojure.walk)

(defn recursive-string-walk [f form]
  (walk #(if (string? %) (f %) (recursive-string-walk f %))
    identity form))

It works well but there's an issue with this way of doing things, recursion will end up filling up the stack which will ultimately overflow. For the use I intend to make of this function, it isn't really a big problem, but I always like to take these kind of challenges.

A Buggy Test

The first thing that came to my mind to test the above issue was to create a nested list of arbitrary depth. We can take the iterate function to create an infinite sequence of nested function calls to list, then take and last to fetch the pit we want to test with.

user> (def pit (iterate list "bottom!"))
#'user/pit
user> (take 6 pit)
("bottom!" ("bottom!") (("bottom!")) ((("bottom!"))) (((("bottom!")))) ((((("bottom!"))))))
user> (last (take 6 pit))
((((("bottom!")))))
user> (recursive-string-walk identity (last (take 6 pit)))
((((("bottom!")))))
user> (recursive-string-walk #(str "reached " %) (last (take 6 pit)))
((((("reached bottom!")))))
user> (recursive-string-walk #(str "reached " %) (last (take 1000 pit)))
; Evaluation aborted.
user> (recursive-string-walk #(str "reached " %) (last (take 100 pit)))
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((("reached bottom!")))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))

Work great! Now lets create an informal test function. This function should take a range and a recursive walk function then will test it using pits starting with the shallowest. This way we can find exactly at which point the overflow happen.

(defn test-walk [walker shallowest deepest]
  (doseq [depth (range shallowest deepest)]
    (println (walker #(str depth " reached " %)
               (last (take depth pit))))))

We'll use it to make sure it works and see which depth our recursive-string-walk can reach.

user> (test-walk recursive-string-walk 400 500)
...
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((("486 reached bottom!")))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
(...(; Evaluation aborted.

On my computer, this function cannot go deeper than 486 (I wonder how it would fare on a 486?) which isn't that much. It shouldn't be all that difficult to make a lazy version of that function. It boils down to look for sequences and wrap their recursive calls with lazy-seq.

(defn lazy-recursive-string-walk [f form]
  (walk #(cond
           (string? %) (f %)
           (seq? %)    (lazy-seq (lazy-recursive-string-walk f %))
           :default    (lazy-recursive-string-walk f %))
    identity form))

From my understanding of lazy data structures and the walk library, this version should be able to handle data structures of infinite depth given infinite resources. However the test function didn't seems to be in agreement with me as it was crashing at 828. That's about the time I asked for help and stopped thinking about this problem for a while.

Lazy Zippers

A week-end passed by and in the meantime, some people from the group tried to help me, yet nobody found what was the real problem. That would have been miraculous given the actual source of the bug. Furthermore, I was now more convinced than before than there was no bug in the lazy version. Tom Hicks provided a lazy version of walk, which was behaving in the same way. Laurent Petit also suggested that I try to write a version using zippers.

It was an intriguing suggestion, I never used zippers but heard often about them. Additionally, that day, MarkCC from Good Math, Bad Math just posted an article about them. So I learned what they are and how to use them. I must say this is a very neat functional programming technique, really worth the time to learn. It's not that hard, Mark has done a great job explaining it. I won't go into the details, but this version is quite easy to understand when you know that the next function do a depth-first traversal automatically.

(require '[clojure.zip :as z])

(defn lazy-recursive-string-walk-with-zipper [f form]
  (loop [loc (z/seq-zip form)]
    (if (z/end? loc)
      (z/root loc)
      (recur (z/next
               (if (string? (z/node loc))
                 (z/replace loc (f (z/node loc)))
                 loc))))))

To my surprise, the test function still overflowed the stack using this version. And it could simply not throw the StackOverflowError as it executes in a loop. Maybe the error wasn't coming from the lazy walk function after all.

So Where's The Bug?

I've been negligent about one thing, I didn't look closely enough at the stack traces. The one thrown by recursive-string-walk happen to be completely different than the one thrown by lazy-recursive-string-walk. Here's both of them one after the other.

No message.
  [Thrown class java.lang.StackOverflowError]

Restarts:
 0: [ABORT] Return to SLIME's top level.

Backtrace:
  0: clojure.lang.RT.boundedLength(RT.java:1128)
  1: clojure.lang.RestFn.applyTo(RestFn.java:135)
  2: clojure.core$apply__4370.invoke(core.clj:436)
  3: clojure.walk$walk__7942.invoke(walk.clj:43)
  4: recursive_string_walk$recursive_string_walk__4137.invoke(NO_SOURCE_FILE:1)
  5: recursive_string_walk$recursive_string_walk__4137$fn__4139.invoke(NO_SOURCE_FILE:1)
---
No message.
  [Thrown class java.lang.StackOverflowError]

Restarts:
 0: [ABORT] Return to SLIME's top level.

Backtrace:
  0: clojure.lang.RT.assoc(RT.java:666)
  1: clojure.core$assoc__4259.invoke(core.clj:146)
  2: clojure.lang.AFn.applyToHelper(AFn.java:179)
  3: clojure.lang.RestFn.applyTo(RestFn.java:137)
  4: clojure.lang.Ref.alter(Ref.java:174)
  5: clojure.core$alter__4907.doInvoke(core.clj:1542)

The actual source of the exception is the print function that is recursive on nested sequences. This make sense, why would someone want to print such monsters. Because of that inattention I lost time searching for an imaginary bug. Only a small modification is needed to make our test function working.

(defn get-bottom [p]
  (loop [p (first p)]
    (if (seq? p) (recur (first p)) p)))

(defn test-walk [walker shallowest deepest]
  (doseq [depth (range shallowest deepest)]
    (println (get-bottom
               (walker #(str depth " reached " %)
                 (lazy-seq (last (take depth pit))))))))

We simply search for the bottom item and print it, this will also make the test output clearer. We can now see what our original walk function can handle.

user> (test-walk lazy-recursive-string-walk 100000 100001)
...
; Evaluation aborted.
---
Java heap space
  [Thrown class java.lang.OutOfMemoryError]

Well, the heap too has its limit! We would need to augment the JVM heap memory size to be able to run the previous example. Lets reduce our ambitions and do some benchmarking instead.

user> (time (test-walk lazy-recursive-string-walk 10000 10001))
10000 reached bottom!
"Elapsed time: 25.138697 msecs"
nil
user> (time (test-walk lazy-recursive-string-walk-with-zipper 10000 10001))
10000 reached bottom!
"Elapsed time: 122.691972 msecs"
nil

As you see, the version using clojure.walk is much faster. That's perfectly understandable considering the more complex concept of zippers which have multiple uses aside walking trees. As both versions are fully lazy, the first one is the clear winner here.

What Can We Use This For

After all these trials and tribulations, you may wonder what use I may have for such a function. It's to help create functions or macros that perform string replacement on all strings in a Clojure form. This can be useful with Compojure while deploying your application to a different root path than "/" for example. The idea come from my experience with ASP.NET in which URLs can contain a character (~) to denote the web site root path.

(use 'clojure.contrib.str-utils)

(defn expand* [expansions string]
  (reduce (fn [s [regex replacement]]
            (re-gsub regex replacement s))
    string (partition 2 expansions)))

(defn expand-form [url form]
  (recursive-string-walk (partial expand* [#"^~" url]) form))

Having the same functionality in a macro would be nice too.

(defmacro with-expansions [expansions & body]
  `(do ~@(recursive-string-walk (partial expand* expansions) body)))

(defmacro expand [url & body]
  `(with-expansions [#"^~" ~(str url)] ~@body))

To put this to use I've taken a function written by James Reeves a few months ago during a discussion on the Compojure group. A function with the same name (with-context) already exists in Compojure though, so we'll rename it to with-root-path. The way it work is by binding the given context to a var and then we can use a function called url to add the context so that internal links still work once the application is deployed elsewhere. Using the expand-form function on html's arguments can reduce repetition and make the code less cluttered if there's a lot of such URLs.

(use 'compojure)

(declare *context*)

(defn with-root-path
  [ctx & route-seq]
  (let [handler (apply routes route-seq)
         pattern (re-pattern (str "^" ctx "(/.*)?"))]
    (fn [request]
      (if-let [[_ uri] (re-matches pattern (:uri request))]
        (binding [*context* ctx]
          (handler (assoc request :uri uri)))))))

(defn url [path]
  (str *context* path))

(defn html-expand [& trees]
  (apply html (expand-form *context* trees)))

(defroutes test-routes
  (with-root-path "/foo"
    (GET "/"    (html-expand [:h1 "test"] [:p "~/"]))
    (GET "/bar" (html [:h1 "test"] [:p (url "/bar")]))
    (ANY "*" (page-not-found))))

(defn start-test-server []
  (run-server {:port 8000} "/*" (servlet test-routes)))

The first route use a special version of the html function, it's only a slight improvement but I really like that way of doing things. There's plenty of other possible use cases though, but that should be enough for this post.

P.S.: I know the title isn't correct, but it sounds way better than A Recursive Walk on Strings Contained in Arbitrary Nested Data Structures.

No comments:

Post a Comment

About Me

My photo
Quebec, Canada
Your humble servant.