Giter Club home page Giter Club logo

clj-tightly-packed-trie's Introduction

Clojure Tightly Packed Trie

https://img.shields.io/clojars/v/com.owoga/tightly-packed-trie.svg

What does this do?

Tries as hash-maps are common, but hash-maps take up a lot of memory (relatively speaking).

For example, creating a hash-map trie of 1, 2, and 3-grams of short story by Edgar Allen Poe results in a hash-map that consumes over 2 megabytes of memory. See this markov language model example.

If you’re dealing with much larger corpuses, the memory footprint could become an issue.

A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That’s ~4% of the original trie’s size, even after the original trie’s keys/values have all been condensed to numbers!

How do you use library?

A trie is created similar to a hash-map by passing a variable number of “trie entries” to trie.

A “trie entry” is basically the same thing as a map entry. It’s just a key and a value.

But for a Trie, the key must be seqable and for a tightly-packed trie all keys must be comparable.

(require '[com.owoga.trie :as trie])

(def loosely-packed-trie (trie/make-trie "dog" :dog "dot" :dot "do" :do "day" :day))
loosely-packed-trie
;; => {[\d \a \y] :day, [\d \o \g] :dog, [\d \o \t] :dot, [\d \o] :do}

You’ll see from the output of that last line above that the default REPL representation of a Trie is a flat hash-map-looking-thing. It’s actually a sorted-hash-map-looking-thing, because if you seq over it, you’ll get the trie-entries in depth-first post-order traversal.

In some ways, a Trie behaves a lot like a map.

`get` returns the value at the key.

(get loosely-packed-trie "dog")
;; => :dog
(get loosely-packed-trie "do")
;; => :do
(get (assoc loosely-packed-trie "dove" {:value "dove" :count 10}) "dove")
;; => {:value "dove", :count 10}

But there’s a couple cool Trie-specific functions.

`lookup` returns the Trie at the key. This way, you have access to all of the node’s descendants.

(trie/lookup loosely-packed-trie "do")
;; => {[\g] :dog, [\t] :dot}
(seq (trie/lookup loosely-packed-trie "do"))

`children` returns the direct children of a node.

(trie/children (trie/lookup loosely-packed-trie "do"))
;; => ({} {})

That’s odd… there’s two things in there that look like empty maps.

(map #(get % []) (trie/children (trie/lookup loosely-packed-trie "do")))
;; => (:dog :dot)

The REPL representation of a Trie only shows children key/values. The “root” node (not necessarily the “true” root node if you’ve travsersed down with `lookup`) doesn’t print any data to REPL. So if you’re looking ata node with no children, you’ll see `{}` in the REPL. But you can get the value of that node with `(get node [])`

Tightly Packed Tries

The trie above is backed by regular old Clojure data structures: hash-maps and vectors.

It’s not very efficient. All of the strings, nested maps, pointers… it all adds up to a lot of wasted memory.

A tightly packed trie provides the same functionality at an impressively small fraction of the memory footprint.

One restriction though: all keys and values must be integers. To convert them from integer identifiers back into the values that your biological self can process, you’ll need to keep some type of database or in-memory map of ids to human-parseable things.

Here’s a similar example to that above, but with values that we can tightly pack.

(require '[com.owoga.tightly-packed-trie :as tpt]
         '[com.owoga.tightly-packed-trie.encoding :as encoding])

(defn encode-fn [v]
  (if (nil? v)
    (encoding/encode 0)
    (encoding/encode v)))

(defn decode-fn [byte-buffer]
  (let [v (encoding/decode byte-buffer)]
    v
    (if (zero? v) nil v)))

(def tight-ready-loosely-packed-trie
  (trie/make-trie '(1 2 3) 123 '(1 2 1) 121 '(1 2 2) 122 '(1 3 1) 131))

(def tightly-packed-trie
  (tpt/tightly-packed-trie
   tight-ready-loosely-packed-trie
   encode-fn
   decode-fn))

(get tightly-packed-trie [1 2 3])
;; => 123

(map #(get % []) (trie/children (trie/lookup tightly-packed-trie [1 2])))
;; => (121 122 123)

(seq tightly-packed-trie)
;; => ([[1 2 1] 121]
;;     [[1 2 2] 122]
;;     [[1 2 3] 123]
;;     [[1 2] nil]
;;     [[1 3 1] 131]
;;     [[1 3] nil]
;;     [[1] nil])

Instead of a map with all of its pointers, we are storing all of the information necessary for this trie in just 39 bytes!

(require '[cljol.dig9 :as d])

(.capacity (.byte-buffer tightly-packed-trie))
;; => 39

It’s backed by a byte-buffer so saving to disk is trivial, but there’s a helper for that.

Here’s the process of saving to and loading from disk. (Only works for tightly-packed tries.)

(tpt/save-tightly-packed-trie-to-file "/tmp/tpt.bin" tightly-packed-trie)

(def saved-and-loaded-tpt
  (tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin" decode-fn))

(get saved-and-loaded-tpt '(1 2 3))
;; => 123

Credits

Ulrich Germann, Eric Joanis, and Samuel Larkin of the National Research Institute of Canada for the paper Tightly Packed Tries: How to Fit Large Models into Memory,and Make them Load Fast, Too.

Lots of credit also goes to the Clojurians community.

Why would you want a trie data structure?

TODO: The below is closer to a CSCI lesson than library documentation. If it’s necessary, figure out where to put it, how to word it, etc… It might not be worth cluttering documentation with so much detail.

Autocomplete

A user types in the characters “D” “O” and you want to show all possible autocompletions.

Typical “List” data structure

  • Iterate through each word starting from the beginning.
  • When you get to the first word that starts with the letters “D” “O”, start keeping track of words
  • When you get to the next word that doesn’t start with “D” “O”, you have all the words you want to use for autocomplete.
(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])

Problems with a list.

It’s slow if you have a big list. If you have a dictionary with hundreds of thousands of words and the user is typing in letters that don’t show up until the end of the list, then you’re searching through the first few hundred thousand items in the list before you get to what you need.

If you’re familiar with binary search over sorted lists, you’ll know this is a contrived example.

Typical “Trie” in Clojure

{"A" {:children {"P" {,,,} :value nil}}
 "D" {:children {"O"
                 :children {"G" {:children {} :value "DOG"}
                            "T" {:children {} :value "DOT"}}
                 :value "DO"}
      :value nil}}

How is a trie faster?

Development

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.