binary's Introduction

binary-dsl

This library is a high-performance binary parser combinator. It enables reading and writing arbitrary binary data from Java's I/O streams. The focus is on enabling parsing of externally defined binary structures. If you have a format specification for any binary structure, this library is for you!

It is inspired by Gloss but focuses on Java's stream classes. The individual codecs do not require explicit knowledge about the length of data that needs to be read.

Artifacts

Binary artifacts are released to Clojars. If you are using Maven, add the following repository definition to your pom.xml:

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>

The Most Recent Release

With Leiningen:

[smee/binary "0.5.5"]

With Maven:

<dependency>
  <groupId>smee</groupId>
  <artifactId>binary</artifactId>
  <version>0.5.5</version>
</dependency>

Note

All functions given in this document refer to the namespace org.clojars.smee.binary.core, which needs to be required (or used) in your namespace declaration.

Examples / Demo

Please refer to the tests for now. The repository also contains several demos.

Codec

To read binary data we need two things: a codec that knows how to read and write its binary representation and convert it to a Clojure data structure, and an instance of java.io.InputStream. The codec needs to satisfy the protocol BinaryIO.

Codecs are composable, you may combine them as you like.

Each codec can have two attached functions:

  • pre-encode - to convert clojure data to something that can be written to binary
  • post-decode - to convert something read by a codec to a clojure/java data structure

Example: Let's represent an instance of java.util.Date as a unix epoch and write it as a little-endian long:

(compile-codec :long-le (fn [^java.util.Date date] (.getTime date)) (fn [^long number] (java.util.Date. number)))

The compiler hints are not necessary. They are just a clarification in this example.
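A round trip through in-memory streams shows both attached functions at work. This is only a sketch: it assumes the library is on the classpath and aliased as `b`, and the names `date-codec` and `round-trip` are illustrative, not part of the library.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Illustrative name: stores a java.util.Date as a little-endian long.
(def date-codec
  (b/compile-codec :long-le
                   (fn [^java.util.Date date] (.getTime date))    ; pre-encode
                   (fn [^long number] (java.util.Date. number)))) ; post-decode

;; Illustrative helper: encode a value, then decode it again from the written bytes.
(defn round-trip [codec value]
  (let [out (ByteArrayOutputStream.)]
    (b/encode codec out value)
    (let [written (.toByteArray out)]
      {:byte-count (count written)
       :decoded    (b/decode codec (ByteArrayInputStream. written))})))

(def result (round-trip date-codec (java.util.Date. 1400000000000)))
;; (:byte-count result) should be 8, since a long occupies eight bytes
;; (:decoded result) should be a java.util.Date equal to the original
```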

API

  • encode takes an instance of codec, a java.io.OutputStream and a value and writes the binary representation of this value into the stream.
  • decode takes a codec and a java.io.InputStream and returns a Clojure/Java value that was read from the stream. Individual reads via decode are eager!
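Both functions work with any stream, including ones backed by byte arrays. The sketch below assumes the library is aliased as `b`; `encode->bytes` is a hypothetical helper introduced here only to make the written bytes visible.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Hypothetical helper: encode `value` with `codec` and return the written bytes.
(defn encode->bytes [codec value]
  (let [out (ByteArrayOutputStream.)]
    (b/encode codec out value)
    (vec (.toByteArray out))))

(def le-bytes (encode->bytes :int-le 1)) ; least significant byte first
(def be-bytes (encode->bytes :int-be 1)) ; most significant byte first

;; Decoding reads from a java.io.InputStream, here one wrapping a byte array.
(def decoded
  (b/decode :int-be (ByteArrayInputStream. (byte-array [0 0 0 1]))))
```

With these definitions, `le-bytes` should hold `1 0 0 0`, `be-bytes` should hold `0 0 0 1`, and `decoded` should be `1`.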

Features/Available codecs

Primitives

Encodes primitive data types, either big-endian or little-endian:

; signed
:byte
:short-le
:short-be
:int-le
:int-be
:long-le
:long-be
:float-le
:float-be
:double-le
:double-be
; unsigned
:ubyte
:ushort-le
:ushort-be
:uint-le
:uint-be
:ulong-le
:ulong-be

Please be aware that since Java doesn't support unsigned data types, the unsigned codecs consume/produce a bigger data type than their signed counterparts: unsigned bytes are shorts, unsigned shorts are integers, unsigned integers are longs, unsigned longs are Bigints!
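The difference shows up when the same bit pattern is read with a signed and an unsigned codec. A minimal sketch, assuming the library is aliased as `b`; `decode-bytes` is a hypothetical helper.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream])

;; Hypothetical helper: decode one value from the given byte values (0-255).
(defn decode-bytes [codec byte-values]
  (b/decode codec
            (ByteArrayInputStream. (byte-array (map unchecked-byte byte-values)))))

(def signed   (decode-bytes :byte  [0xFF])) ; bit pattern 11111111 read as signed
(def unsigned (decode-bytes :ubyte [0xFF])) ; same bit pattern read as unsigned
```

Here `signed` should be `-1`, while `unsigned` should be `255`, a value that no longer fits into a Java byte.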

Sequences

If you want several codecs in a specific order, use a vector:

[:int-le :float-le :float-le]

Maps

To name elements in a binary data source maps are ideal. Unfortunately the order of the keys is unspecified. We need to use a map constructor that respects the order of the keys:

(require '[org.clojars.smee.binary.core :as b])
(b/ordered-map :foo :int-le :bar [:float-le :double-le])

As you can see arbitrary nesting of codecs is possible. You can define maps of maps of ... If you use clojure's map literals, the order of the binary values is unspecified (it is determined by the sequence of keys and values within the map's implementation).
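As a sketch of how such a nested codec behaves, the following encodes a map and reads it back. It assumes the library is aliased as `b`; `record-codec`, `written` and `decoded` are illustrative names.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Illustrative codec: an int named :foo followed by a float and a double named :bar.
(def record-codec
  (b/ordered-map :foo :int-le
                 :bar [:float-le :double-le]))

(def written
  (let [out (ByteArrayOutputStream.)]
    (b/encode record-codec out {:foo 42 :bar [1.5 2.5]})
    (.toByteArray out)))

(def decoded
  (b/decode record-codec (ByteArrayInputStream. written)))
;; (count written) should be 16: 4 bytes (int) + 4 bytes (float) + 8 bytes (double)
;; (:foo decoded) should be 42, (:bar decoded) should hold 1.5 and 2.5
```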

Repeated

repeated uses another codec repeatedly until the stream is exhausted. To restrict how often the codec should be used, you can explicitly give one of three parameters:

  • :length gives a fixed length. E.g. (repeated :int-le :length 5) will try to read/write exactly five little-endian 32bit integers from/to a stream
  • :prefix takes another codec that will get read/written first. This codec contains the length for the successive read/write of the repeated values. Example: (repeated :int-le :prefix :short-le) will first read a short and then try to read as many integers as specified in this short value.
  • :separator will read values using the codec until the value read is the same as the given separator value. An example would be (repeated :byte :separator (byte 0)) for null-tokenized c-strings. If the separator would be the last element in the stream, it is optional (think of comma-separated values where the last column may not have a trailing comma).

Caution: When writing the data there WILL be a final separator. This means, the written data may have more bytes than initially read!

  • No parameter means: read until exhaustion of the stream (EOF).
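The :prefix and :length options can be tried out on in-memory streams. This is a sketch under the assumption that the library is aliased as `b`; `encode->bytes` is a hypothetical helper, not a library function.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Hypothetical helper: encode `value` with `codec` and return the written bytes.
(defn encode->bytes [codec value]
  (let [out (ByteArrayOutputStream.)]
    (b/encode codec out value)
    (vec (.toByteArray out))))

;; :prefix, the element count is written as a little-endian short before the elements.
(def prefixed (b/repeated :int-le :prefix :short-le))
(def prefixed-bytes (encode->bytes prefixed [1 2 3]))
;; expected size: 2 bytes for the count + 3 * 4 bytes for the integers = 14 bytes

(def decoded-prefixed
  (b/decode prefixed (ByteArrayInputStream. (byte-array prefixed-bytes))))

;; :length, only the first two integers are decoded.
(def first-two
  (b/decode (b/repeated :int-le :length 2)
            (ByteArrayInputStream. (byte-array [1 0 0 0, 2 0 0 0, 3 0 0 0]))))
```

`decoded-prefixed` should contain `1 2 3` again, and `first-two` should contain `1 2`.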

Blob

blob is essentially an optimized version of (repeated :byte ...) that produces and consumes Java byte arrays. It takes the same options as repeated, except for :separator.
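A length-prefixed chunk of raw bytes might look as follows. The sketch assumes the library is aliased as `b`; `chunk-codec` is an illustrative name.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Illustrative codec: a big-endian int holding the byte count, then the raw bytes.
(def chunk-codec (b/blob :prefix :int-be))

(def written
  (let [out (ByteArrayOutputStream.)]
    (b/encode chunk-codec out (byte-array [10 20 30]))
    (vec (.toByteArray out))))
;; expected bytes: 0 0 0 3 (the prefix) followed by 10 20 30

(def decoded
  (b/decode chunk-codec (ByteArrayInputStream. (byte-array written))))
;; decoded is a Java byte array; (seq decoded) should yield 10 20 30
```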

String

Reads and writes bytes and converts them from/to strings with a specific string encoding. This codec uses repeated, that means it takes either :length or :prefix as parameter to determine the length of the string.

(string "ISO-8859-1" :length 3) ; read three bytes, interpret them as a string with encoding "ISO-8859-1"

C strings

Similar to string, but reads bytes until it finds a null byte:

(c-string "UTF8")
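The two string codecs differ in how the end of the string is marked. A small sketch, assuming the library is aliased as `b`; `encode->bytes` is a hypothetical helper.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Hypothetical helper: encode `value` with `codec` and return the written bytes.
(defn encode->bytes [codec value]
  (let [out (ByteArrayOutputStream.)]
    (b/encode codec out value)
    (vec (.toByteArray out))))

;; Length-prefixed string: one byte with the length, then the characters.
(def prefixed-bytes (encode->bytes (b/string "UTF8" :prefix :byte) "abc"))

;; C string: the characters followed by a terminating null byte.
(def c-bytes (encode->bytes (b/c-string "UTF8") "abc"))

(def decoded
  (b/decode (b/c-string "UTF8")
            (ByteArrayInputStream. (byte-array c-bytes))))
```

`prefixed-bytes` should be `3 97 98 99`, `c-bytes` should be `97 98 99 0`, and `decoded` should be `"abc"`.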

Bits

If you have a byte where each bit has a specific meaning, you can use a sequence of keywords as input. For example, the following definition says that the lowest bit in a byte gets the value :a, the next one :b, then :c. The following four bits are ignored, and the highest bit has the value :last:

(decode (bits [:a :b :c nil nil nil nil :last]) instream); let's assume the next byte in instream is 2r11011010
=> #{:b :last}

If you now read a byte with the value 2r11011001 using this codec you will get the Clojure set #{:a :last} as a value.
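The first example can be reproduced against an in-memory stream, and a set of flags can be written and read back the same way. This sketch assumes the library is aliased as `b`; `flags-codec` is an illustrative name.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

(def flags-codec (b/bits [:a :b :c nil nil nil nil :last]))

;; Decode the byte 2r11011010: of the named bits, only bit 1 (:b) and bit 7 (:last) are set.
(def decoded
  (b/decode flags-codec
            (ByteArrayInputStream. (byte-array [(unchecked-byte 2r11011010)]))))

;; Write a set of flags and read it back again.
(def re-decoded
  (let [out (ByteArrayOutputStream.)]
    (b/encode flags-codec out #{:a :last})
    (b/decode flags-codec (ByteArrayInputStream. (.toByteArray out)))))
```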

Header

Decodes a header using header-codec. Passes this data structure to header->body, which returns the codec to use to parse the body. For writing, this codec calls body->header with the data as parameter and expects a value to use for writing the header information.
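A typical use is a length field that decides how many bytes of body follow. The sketch below assumes the library is aliased as `b`; `framed` and the two anonymous functions are illustrative, not part of the library.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream])

;; Illustrative codec: a big-endian int header containing the length of the body.
(def framed
  (b/header :int-be
            (fn [length] (b/blob :length length)) ; header->body: choose the body codec
            (fn [body] (count body))))            ; body->header: derive the header value

(def written
  (let [out (ByteArrayOutputStream.)]
    (b/encode framed out (byte-array [7 8 9]))
    (vec (.toByteArray out))))
;; expected bytes: 0 0 0 3 (the header) followed by 7 8 9 (the body)

(def decoded
  (b/decode framed (ByteArrayInputStream. (byte-array written))))
;; only the body is returned: a byte array holding 7 8 9
```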

Padding

Make sure there is always a minimum byte length when reading/writing values. Works by reading length bytes into a byte array, then reading from that array using inner-codec. Currently there are three options:

  • :length is the number of bytes that should be present after writing
  • :padding-byte is the numeric value of the byte used for padding (default is 0)
  • :truncate? is a boolean flag that determines the behaviour if inner-codec writes more bytes than padding can handle: false is the default, meaning throw an exception. True will lead to truncating the output of inner-codec.

Example:

(padding (repeated :int-le :length 100) :length 1024 :padding-byte (byte \x))
=> [...] ; sequence of 100 integers, the stream will have 1024 bytes read, though

(encode (padding (repeated (string "UTF8" :separator 0)) :length 11 :truncate? true) outstream ["abc" "def" "ghi"])
=> ; writes bytes [97 98 99 0 100 101 102 0 103 104 105]
   ; observe: the last separator byte was truncated!

Align

This codec is related to padding in that it makes sure that the number of bytes written/read to/from a stream is always aligned to a specified byte boundary. For example, if a format requires aligning all data to 8 byte boundaries, this codec will pad the written data with padding-byte to make sure that the count of bytes written is divisible by 8.

Parameters:

  • :modulo is the byte boundary modulo, should be positive
  • :padding-byte is the numeric value of the byte used for padding (default is 0)

Example:

(encode (align (repeated :short-be :length 3) :modulo 9 :padding-byte 55) output-stream [1 2 3])
;==> writes these bytes: [0 1 0 2 0 3 55 55 55]

Constant

If a binary format uses fixed elements (like the three bytes 'ID3' in mp3), you can use this codec. It needs a codec and a fixed value. If the value read using this codec does not match the given fixed value, an exception will be thrown.

(constant (string "ISO-8859-1" :length 3) "ID3")

Alternatively, you may treat strings and byte arrays as constant encoders.
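The check performed by constant can be observed by decoding a matching and a non-matching input. This is a sketch that assumes the library is aliased as `b`; `magic` and `decode-str` are hypothetical names, and the concrete exception type is not specified here, so the example catches any Exception.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream])

(def magic (b/constant (b/string "ISO-8859-1" :length 3) "ID3"))

;; Hypothetical helper: decode from the ISO-8859-1 bytes of a string.
(defn decode-str [codec ^String s]
  (b/decode codec (ByteArrayInputStream. (.getBytes s "ISO-8859-1"))))

;; Matching input: decoding succeeds without an exception.
(def matching (decode-str magic "ID3"))

;; Non-matching input: decoding is expected to throw.
(def mismatch
  (try
    (decode-str magic "XYZ")
    :no-exception
    (catch Exception _ :exception-thrown)))
```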

Union

Union is a C-style union. A fixed number of bytes may represent different values depending on the interpretation of the bytes. The value returned by read-data is a map of all valid interpretations according to the specified unioned codecs. Parameter is the number of bytes needed for the longest codec in this union and a map of value names to codecs. This codec will read the specified number of bytes from the input streams and then successively try to read from this byte array using each individual codec.

Example: Four bytes may represent an integer, two shorts, four bytes, a list of bytes with prefix or a string.

(union 4 {:integer :int-be 
          :shorts (repeated :short-be :length 2)
          :bytes (repeated :byte :length 4)
          :prefixed (repeated :byte :prefix :byte)
          :str (string "UTF8" :prefix :byte)})
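Decoding with a reduced version of this union shows how the same four bytes yield several interpretations. The sketch assumes the library is aliased as `b`; `int-or-shorts` is an illustrative name.

```clojure
(require '[org.clojars.smee.binary.core :as b])
(import '[java.io ByteArrayInputStream])

;; Illustrative union: four bytes seen as one integer or as two shorts.
(def int-or-shorts
  (b/union 4 {:integer :int-be
              :shorts  (b/repeated :short-be :length 2)}))

(def decoded
  (b/decode int-or-shorts
            (ByteArrayInputStream. (byte-array [0 1 0 2]))))
;; (:integer decoded) should be 65538, i.e. 0x00010002
;; (:shorts decoded) should hold 1 and 2
```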

License

Copyright © 2014 Steffen Dienst

Distributed under the Eclipse Public License, the same as Clojure.

binary's People

Contributors

harrigan, ilyapomaskin, paulschulz, peteut, smee, whittlesjr, wjoel, zsau

binary's Issues

[PATCH] [fix] typo fixed

---
 src/org/clojars/smee/binary/core.clj | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/src/org/clojars/smee/binary/core.clj b/src/org/clojars/smee/binary/core.clj
index e85f6f7..2223ffe 100644
--- a/src/org/clojars/smee/binary/core.clj
+++ b/src/org/clojars/smee/binary/core.clj
@@ -52,7 +52,7 @@
    :short    (primitive-codec .readShort .writeShort short :be)
    :short-le (primitive-codec .readShort .writeShort short :le)
    :short-be (primitive-codec .readShort .writeShort short :be)
-   
+
    :ushort    (primitive-codec .readUnsignedShort .writeUnsignedShort int :be)
    :ushort-le (primitive-codec .readUnsignedShort .writeUnsignedShort int :le)
    :ushort-be (primitive-codec .readUnsignedShort .writeUnsignedShort int :be)
@@ -315,7 +315,7 @@ Flag names `null` are ignored. Bit count will be padded up to the next multiple
                    (fn [bytes] (set (map idx->flags (filter #(bit-set? bytes %) bit-indices)))))))

 (defn header
-  "Decodes a header using `header-codec`. Passes this datastructure to `header->body` which returns the codec to
+  "Decodes a header using `header-codec`. Passes this datastructure to `header->body-codec` which returns the codec to
 use to parse the body. For writing this codec calls `body->header` with the data as parameter and
 expects a value to use for writing the header information.
 If the optional flag `:keep-header` is set, read will return a vector of `[header body]`
@@ -327,11 +327,11 @@ else only the `body` will be returned."
         (let [header (read-data header-codec big-in little-in)
               body-codec (header->body-codec header)
               body (read-data body-codec big-in little-in)]
-          (if keep-header? 
-            {:header header 
+          (if keep-header?
+            {:header header
              :body body}
             body)))
-      (write-data [_ big-out little-out value] 
+      (write-data [_ big-out little-out value]
         (let [body (if keep-header? (:body value) value)
               header (if keep-header? (:header value) (body->header body))
               body-codec (header->body-codec header)]
@@ -354,9 +354,9 @@ Example:
     (encode (padding (repeated (string \"UTF8\" :separator 0)) :length 11 :truncate? true) outstream [\"abc\" \"def\" \"ghi\"])
     => ; writes bytes [97 98 99 0 100 101 102 0 103 104 105]
        ; observe: the last separator byte was truncated!"
-  [inner-codec & {:keys [length 
+  [inner-codec & {:keys [length
                          padding-byte
-                         truncate?] 
+                         truncate?]
                   :or {padding-byte 0
                        truncate? false}
                   :as opts}]
@@ -426,12 +426,12 @@ Example:
     Object (toString [_] (str "<BinaryIO aligned, options=" opts ">"))))


-(defn union 
+(defn union
   "Union is a C-style union. A fixed number of bytes may represent different values depending on the
 interpretation of the bytes. The value returned by `read-data` is a map of all valid interpretations according to
 the specified unioned codecs.
 Parameter is the number of bytes needed for the longest codec in this union and a map of value names to codecs.
-This codec will read the specified number of bytes from the input streams and then successively try to read 
+This codec will read the specified number of bytes from the input streams and then successively try to read
 from this byte array using each individual codec.

 Example: Four bytes may represent an integer, two shorts, four bytes, a list of bytes with prefix or a string.
@@ -442,7 +442,7 @@ Example: Four bytes may represent an integer, two shorts, four bytes, a list of
               :prefixed (repeated :byte :prefix :byte)
               :str (string \"UTF8\" :prefix :byte)})"
   [bytes-length codecs-map]
-  (padding 
+  (padding
     (reify BinaryIO
       (read-data  [_ big-in _]
         (let [arr (byte-array bytes-length)
@@ -480,14 +480,14 @@ Example: Four bytes may represent an integer, two shorts, four bytes, a list of
   "An enumerated value. `m` must be a 1-to-1 mapping of names (e.g. keywords) to their decoded values.
 Only names and values in `m` will be accepted when encoding or decoding."
   (let [pre-encode (strict-map m lenient?)
-        post-decode (strict-map (map-invert m) lenient?)] 
+        post-decode (strict-map (map-invert m) lenient?)]
     (compile-codec codec pre-encode post-decode)))

-#_(defn at-offsets 
+#_(defn at-offsets
   "Read from a stream at specific offsets. Problems are we are skipping data inbetween and we miss data earlier in the stream."
   [offset-name-codecs]
   {:pre [(every? #(= 3 (count %)) offset-name-codecs)]}
-  (let [m (reduce (fn [m [offset name codec]] (assoc m offset [name codec])) (sorted-map) offset-name-codecs)] 
+  (let [m (reduce (fn [m [offset name codec]] (assoc m offset [name codec])) (sorted-map) offset-name-codecs)]
     (reify BinaryIO
       (read-data [this big-in little-in]
         (loop [pos (.size big-in), pairs (seq m), res {}]
@@ -495,7 +495,7 @@ Only names and values in `m` will be accepted when encoding or decoding."
             res
             (let [[seek-pos [name codec]] (first pairs)
                   _ (.skipBytes big-in (- seek-pos pos))
-                  obj (read-data codec big-in little-in)]              
+                  obj (read-data codec big-in little-in)]
               (recur (.size big-in) (next pairs) (assoc res name obj))))))
       (write-data [this big-out little-out values]
         (throw :not-implemented)))))
@@ -513,7 +513,7 @@ Only names and values in `m` will be accepted when encoding or decoding."
       bytes))
   (write-data [this out _ _]
     (.write ^OutputStream out (.getBytes ^String this)))
-  
+
   java.lang.String
   (read-data [this big-in _]
     (let [^bytes bytes (read-bytes big-in (count this))
@@ -522,7 +522,7 @@ Only names and values in `m` will be accepted when encoding or decoding."
       res))
   (write-data [this out _ _]
     (.write ^OutputStream out (.getBytes ^String this)))
-  
+
   clojure.lang.ISeq
   (read-data [this big-in little-in]
     (map #(read-data % big-in little-in) this))
-- 
2.1.4

Trailing bytes with repeated :separator-using strings

I'm not sure if this is intended behavior or not:

(let [str-seq (repeated (string "UTF-8" :separator 0))
      in (java.io.ByteArrayInputStream. (.getBytes "abc\u0000def\u0000ghi" "UTF-8"))]
  [(decode str-seq in) (.read in)])
; [["abc" "def"] -1]

Should the parser be consuming the trailing bytes ("ghi") in this case? If so, is there a way for my code to access those bytes?

Q: header example

Hi,

I'm extending your Bitcoin protocol example (demo/bitcoin.clj) to
handle Bitcoin messages that are sent over the wire. The format is:

  • magic (4 bytes)
  • command (12 bytes)
  • length (4 bytes)
  • checksum (4 bytes)
  • payload (variable length)

The problem I have run into is the checksum field between the length
and the payload. I tried using something like:

(def payload (binary/blob :prefix length-and-checksum))

and having length-and-checksum reify BinaryIO so that it ignores the
checksum when reading and just returns the length. However, for
writing, I don't have access to the payload from here so I can't
compute the checksum. Also, I'd prefer to compute the checksum outside
of the codec.

Do you know of any way of doing this? Sorry if I missed something
obvious and thank you for creating smee/binary.

Regards,
@harrigan

Terminated strings

Some codecs use null-terminated strings whose length isn't known in advance, which is very awkward to parse at the moment. An optional :suffix or :terminator argument to string and/or repeated would be very useful.

Can't omit final separator when encoding

(Related to Issue #3)

I'm having trouble encoding a fixed-length string sequence while omitting the final separator:

(defn fixed-string-seq [size]
    (padding (repeated (string "UTF-8" :separator 0)) size))

(let [out (java.io.ByteArrayOutputStream.)]
    (encode (fixed-string-seq 11) out ["abc" "def" "ghi"]))
; IllegalArgumentException Data should be max. 11 bytes, but attempting to write 0 bytes more!  org.clojars.smee.binary.core/padding/reify--1466 (core.clj:302)

The content should be exactly 11 bytes without the trailing null, but it seems the encoder doesn't like this.

needs docs

This library worked great for me, but it needs docs. ;) Some use examples, e.g. reading from a byte array, would be nice.

Public BinaryIO protocol or codec for `nil`

Thanks for awesome library!

I need to serialize no value in some cases. For example:

(b/header :int-be
          (fn header->body-codec [length]
            (if (= -1 length)
              codec/null
              (b/blob :length length)))
          "not used")

So I add null codec. But it's ugly:

;; hack
(def null
  (b/compile-codec
   (byte-array 0)
   (constantly (byte-array 0))
   (constantly nil)))

Is it possible to make public BinaryIO protocol or add nil primitive codec?

Binary utilities

Hi! So I recently (finally) (sort of) finished my BACnet implementation. I ended up with a bunch of generic utility functions. I'd much prefer to contribute to an existing project, rather than spin off a new one, so I thought I'd ask if you'd want to merge any or all of the following as a separate util namespace?

https://gist.github.com/WhittlesJr/dd94e7e4d9e21460b4dd9cd31b9fcaa1

The "util.core" namespace has more generic functions. I'm thinking of making a separate library for my map-matching functions, but I included them in the gist so you could see what they are.

I included the npdu example so you can get a sense for my use case, but it's just a small part of the BACnet protocol.

Native arrays for repeated primitives

Would it be possible for repeated to output a native array when given a primitive codec? This would be especially useful for codecs that include binary blobs, since wrapping each byte in a java.lang.Byte is quite wasteful.

A separate codec like repeated-prim or even just bytes would also work.

Conditionals and complex bit handling

Is there currently a way to use "conditional" fields? My use case is for the BACnet protocol, which is somewhat complex. Many fields are included in the spec that only show up if a previous field matches a certain value (or some other more complicated condition is met). Or sometimes, based on an earlier condition, the parsing rules for further segments will change...

I'm not sure how to do that with this library... maybe I'm missing something? I'm investigating header further to see if it can do everything I need, and if I find out I'll close this issue.
