Streaming Histograms for Clojure/Java

Overview

This project is an implementation of the streaming, one-pass histograms described in Ben-Haim's Streaming Parallel Decision Trees. Inspired by Tyree's Parallel Boosted Regression Trees, the histograms are extended so that they may track multiple values.

The histograms act as an approximation of the underlying dataset. They can be used for learning, visualization, discretization, or analysis. The histograms may be built independently and merged, making them convenient for parallel and distributed algorithms.

While the core of this library is implemented in Java, it includes a full-featured Clojure wrapper. This readme focuses on the Clojure interface, but Java developers can find documented methods in com.bigml.histogram.Histogram.

Installation

histogram is available as a Maven artifact from Clojars.

For Leiningen:

[bigml/histogram "4.1.2"]

For Maven:

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>
<dependency>
  <groupId>bigml</groupId>
  <artifactId>histogram</artifactId>
  <version>4.1.2</version>
</dependency>

Basics

In the following examples we use Incanter to generate data and for charting.

The simplest way to use a histogram is to create one and then insert! points. In the example below, ex/normal-data refers to a sequence of 200K samples from a normal distribution (mean 0, variance 1).

user> (ns examples
        (:use [bigml.histogram.core])
        (:require (bigml.histogram.test [examples :as ex])))
examples> (def hist (reduce insert! (create) ex/normal-data))

You can use the sum fn to find the approximate number of points less than a given threshold:

examples> (sum hist 0)
99814.63248

The density fn gives us an estimate of the point density at the given location:

examples> (density hist 0)
80936.98291

The uniform fn returns a list of points that separate the distribution into equal population areas. Here's an example that produces quartiles:

examples> (uniform hist 4)
(-0.66904 0.00229 0.67605)

Arbitrary percentiles can be found using percentiles:

examples> (percentiles hist 0.5 0.95 0.99)
{0.5 0.00229, 0.95 1.63853, 0.99 2.31390}

We can plot the sums and density estimates as functions. The red line represents the sum and the blue line represents the density. If we normalized the values (dividing by 200K), these lines would approximate the cumulative distribution function and the probability density function of the normal distribution.

examples> (ex/sum-density-chart hist) ;; also see (ex/cdf-pdf-chart hist)

The histogram approximates distributions using a constant number of bins. This bin limit is a parameter when creating a histogram (:bins, defaults to 64). A bin contains a :count of the points within the bin along with the :mean for the values in the bin. The edges of the bin aren't captured. Instead the histogram assumes that points of a bin are distributed with half the points less than the bin mean and half greater. This explains the fractional sum in the example below:

examples> (def hist (-> (create :bins 3)
                        (insert! 1)
                        (insert! 2)
                        (insert! 3)))
examples> (bins hist)
({:mean 1.0, :count 1} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
examples> (sum hist 2)
1.5
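
The 1.5 above can be reproduced by hand. What follows is a minimal Java sketch of the sum procedure as described in Ben-Haim's paper, not the library's own code. The class name HistogramSumSketch and the array-based bin layout are assumptions made for illustration, and points outside the first and last bin means are handled in a deliberately simplified way (the library also tracks a minimum and maximum).

```java
/** Sketch of the paper's sum procedure; not the library's implementation. */
final class HistogramSumSketch {

    /**
     * Estimates how many points lie below p.
     * means must be strictly ascending; counts[i] is the population of bin i.
     */
    static double sum(double[] means, double[] counts, double p) {
        int n = means.length;
        double total = 0;
        for (double c : counts) total += c;

        // Simplification: ignore the half-bin tails outside the first and last means.
        if (n == 0 || p < means[0]) return 0;
        if (p >= means[n - 1]) return total;

        // Find bin i such that means[i] <= p < means[i + 1].
        int i = 0;
        while (means[i + 1] <= p) i++;

        // Bins left of i count fully; bin i contributes the half below its mean.
        double s = counts[i] / 2;
        for (int j = 0; j < i; j++) s += counts[j];

        // Between two means the bin heights are interpolated linearly (a trapezoid).
        double ratio = (p - means[i]) / (means[i + 1] - means[i]);
        double heightAtP = counts[i] + (counts[i + 1] - counts[i]) * ratio;
        return s + (counts[i] + heightAtP) / 2 * ratio;
    }

    public static void main(String[] args) {
        double[] means = {1.0, 2.0, 3.0};
        double[] counts = {1, 1, 1};
        System.out.println(sum(means, counts, 2.0)); // prints 1.5
    }
}
```

At p = 2 the first bin counts fully (1) and the second bin contributes half of its single point (0.5), which gives the fractional result.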

As mentioned earlier, the bin limit constrains the number of unique bins a histogram can use to capture a distribution. The histogram above was created with a limit of just three bins. When we add a fourth unique value it will create a fourth bin and then merge the nearest two.

examples> (bins (insert! hist 0.5))
({:mean 0.75, :count 2} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
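
A rough Java sketch of this insert-then-merge step follows. It assumes bins are plain {mean, count} pairs, that the "nearest two" bins are the neighbours with the smallest difference between their means, and that the merged mean is the count-weighted average. BinLimitSketch and its insert method are hypothetical names, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

final class BinLimitSketch {

    /** Returns a new bin list with value inserted; each bin is {mean, count}. Requires maxBins >= 1. */
    static List<double[]> insert(List<double[]> bins, double value, int maxBins) {
        List<double[]> result = new ArrayList<>(bins);

        // Add the value as its own bin, keeping the list sorted by mean.
        int pos = 0;
        while (pos < result.size() && result.get(pos)[0] < value) pos++;
        if (pos < result.size() && result.get(pos)[0] == value) {
            result.set(pos, new double[] {value, result.get(pos)[1] + 1});
        } else {
            result.add(pos, new double[] {value, 1});
        }

        // Over the limit: merge the two neighbours with the smallest gap.
        while (result.size() > maxBins) {
            int best = 0;
            for (int i = 1; i < result.size() - 1; i++) {
                double gap = result.get(i + 1)[0] - result.get(i)[0];
                if (gap < result.get(best + 1)[0] - result.get(best)[0]) best = i;
            }
            double[] a = result.get(best);
            double[] b = result.remove(best + 1);
            double count = a[1] + b[1];
            result.set(best, new double[] {(a[0] * a[1] + b[0] * b[1]) / count, count});
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> bins = new ArrayList<>();
        for (double v : new double[] {1, 2, 3, 0.5}) bins = insert(bins, v, 3);
        // The first bin now has mean 0.75 and count 2, as in the REPL output above.
        for (double[] bin : bins) System.out.println("mean=" + bin[0] + " count=" + bin[1]);
    }
}
```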

A larger bin limit means a higher quality picture of the distribution, but it also means a larger memory footprint. In the chart below, the red line represents a histogram with 8 bins and the blue line represents 64 bins.

examples> (ex/multi-pdf-chart
           [(reduce insert! (create :bins 8) ex/mixed-normal-data)
            (reduce insert! (create :bins 64) ex/mixed-normal-data)])

Another option when creating a histogram is to use gap weighting. When :gap-weighted? is true, the histogram is encouraged to spend more of its bins capturing the densest areas of the distribution. For the normal distribution that means better resolution near the mean and less resolution near the tails. The chart below shows a histogram without gap weighting in blue and with gap weighting in red. Near the center of the distribution, red uses more bins and better captures the Gaussian distribution's true curve.

examples> (ex/multi-pdf-chart
           [(reduce insert! (create :bins 8 :gap-weighted? true)
                    ex/normal-data)
            (reduce insert! (create :bins 8 :gap-weighted? false)
                    ex/normal-data)])

Merging

A strength of the histograms is their ability to merge with one another. Histograms can be built on separate data streams and then combined to give a better overall picture.

In this example, the blue line shows a density distribution from a histogram after merging 300 noisy histograms. The red shows one of the original histograms:

examples> (let [samples (partition 1000 ex/mixed-normal-data)
                hists (map #(reduce insert! (create) %) samples)
                merged (reduce merge! (create) (take 300 hists))]
            (ex/multi-pdf-chart [(first hists) merged]))

Targets

While a simple histogram is nice for capturing the distribution of a single variable, it's often important to capture the correlation between variables. To that end, the histograms can track a second variable called the target.

The target may be either numeric or categorical. The insert! fn is overloaded to accept either type of target. Each histogram bin will contain information summarizing the target. For numeric targets the sum and sum-of-squares are tracked. For categoricals, a map of counts is maintained.

examples> (-> (create)
              (insert! 1 9)
              (insert! 2 8)
              (insert! 3 7)
              (insert! 3 6)
              (bins))
({:target {:sum 9.0, :sum-squares 81.0, :missing-count 0.0},
  :mean 1.0,
  :count 1}
 {:target {:sum 8.0, :sum-squares 64.0, :missing-count 0.0},
  :mean 2.0,
  :count 1}
 {:target {:sum 13.0, :sum-squares 85.0, :missing-count 0.0},
  :mean 3.0,
  :count 2})
examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 :b)
              (insert! 3 :c)
              (insert! 3 :d)
              (bins))
({:target {:counts {:a 1.0}, :missing-count 0.0},
  :mean 1.0,
  :count 1}
 {:target {:counts {:b 1.0}, :missing-count 0.0},
  :mean 2.0,
  :count 1}
 {:target {:counts {:d 1.0, :c 1.0}, :missing-count 0.0},
  :mean 3.0,
  :count 2})
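
To see why a sum and a sum of squares are enough for a numeric target, here is a small Java sketch. NumericTargetSketch is a hypothetical class written for this readme, not the library's NumericTarget. In the output above the count lives on the bin rather than on the target; the sketch keeps its own count only so that it is self-contained.

```java
final class NumericTargetSketch {
    double count;
    double sum;
    double sumSquares;

    /** Records one target value. */
    void add(double target) {
        count += 1;
        sum += target;
        sumSquares += target * target;
    }

    /** Merging two summaries needs only addition, so bins can be combined cheaply. */
    void merge(NumericTargetSketch other) {
        count += other.count;
        sum += other.sum;
        sumSquares += other.sumSquares;
    }

    double mean() {
        return sum / count;
    }

    /** Population variance, computed as E[y^2] - E[y]^2. */
    double variance() {
        double m = mean();
        return sumSquares / count - m * m;
    }
}
```

Adding the targets 7 and 6 gives a sum of 13 and a sum of squares of 85, the same values as the bin with mean 3.0 above, from which the sketch derives a mean of 6.5 and a variance of 0.25.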

Mixing target types isn't allowed:

examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 999))
Can't mix insert types
  [Thrown class com.bigml.histogram.MixedInsertException]

insert-numeric! and insert-categorical! allow target types to be set explicitly:

examples> (-> (create)
              (insert-categorical! 1 1)
              (insert-categorical! 1 2)
              (bins))
({:target {:counts {2 1.0, 1 1.0}, :missing-count 0.0}, :mean 1.0, :count 2})

The extended-sum fn works similarly to sum, but returns a result that includes the target information:

examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 :b)
              (insert! 3 :c)
              (extended-sum 2))
{:sum 1.5, :target {:counts {:c 0.0, :b 0.5, :a 1.0}, :missing-count 0.0}}

The average-target fn returns the average target value given a point. To illustrate, the following histogram captures a dataset where the input field is a sample from the normal distribution while the target value is the sine of the input. The density is in red and the average target value is in blue:

examples> (def make-y (fn [x] (Math/sin x)))
examples> (def hist (let [target-data (map (fn [x] [x (make-y x)])
                                           ex/normal-data)]
                      (reduce (fn [h [x y]] (insert! h x y))
                              (create)
                              target-data)))
examples> (ex/pdf-target-chart hist)

Continuing with the same histogram, we can see that average-target produces values close to the original target:

examples> (def view-target (fn [x] {:actual (make-y x)
                                    :approx (:sum (average-target hist x))}))
examples> (view-target 0)
{:actual 0.0, :approx -0.00051}
examples> (view-target (/ Math/PI 2))
{:actual 1.0, :approx 0.9968169965429206}
examples> (view-target Math/PI)
{:actual 0.0, :approx 0.00463}

Missing Values

Information about missing values is captured whenever the input field or the target is nil. The missing-bin fn retrieves information summarizing the instances with a missing input. For a basic histogram, that is simply the count:

examples> (-> (create)
              (insert! nil)
              (insert! 7)
              (insert! nil)
              (missing-bin))
{:count 2}

For a histogram with a target, the missing-bin includes target information:

examples> (-> (create)
              (insert! nil :a)
              (insert! 7 :b)
              (insert! nil :c)
              (missing-bin))
{:target {:counts {:a 1.0, :c 1.0}, :missing-count 0.0}, :count 2}

Targets can also be missing, in which case the target missing-count is incremented:

examples> (-> (create)
              (insert! nil :a)
              (insert! 7 :b)
              (insert! nil nil)
              (missing-bin))
{:target {:counts {:a 1.0}, :missing-count 1.0}, :count 2}

Array-backed Categorical Targets

By default a histogram with categorical targets stores the category counts as Java HashMaps. Building and merging HashMaps can be expensive. Alternatively, the category counts can be backed by an array. This can give better performance but requires the set of possible categories to be declared when the histogram is created. To do this, set the :categories parameter:

examples> (def categories (map (partial str "c") (range 50)))
examples> (def data (vec (repeatedly 100000
                                     #(vector (rand) (str "c" (rand-int 50))))))
examples> (doseq [hist [(create) (create :categories categories)]]
            (time (reduce (fn [h [x y]] (insert! h x y))
                          hist
                          data)))
"Elapsed time: 1295.402 msecs"
"Elapsed time: 516.72 msecs"

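The idea behind array-backed counts can be sketched in Java as follows. This is an illustration under assumptions, not the library's code: ArrayCategoricalSketch is a hypothetical class, and it assumes one category-to-index table is shared by all bins so that each bin only has to hold a double[].

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ArrayCategoricalSketch {
    // Built once from the declared categories and shared by every bin.
    private final Map<String, Integer> indexOf = new HashMap<>();

    ArrayCategoricalSketch(List<String> categories) {
        for (String category : categories) {
            indexOf.putIfAbsent(category, indexOf.size());
        }
    }

    /** A fresh per-bin count array with one slot per declared category. */
    double[] newCounts() {
        return new double[indexOf.size()];
    }

    /** Counts one occurrence; undeclared categories are rejected. */
    void add(double[] counts, String category) {
        Integer index = indexOf.get(category);
        if (index == null) {
            throw new IllegalArgumentException("undeclared category: " + category);
        }
        counts[index] += 1;
    }

    /** Merging two bins is a plain element-wise addition, with no hashing involved. */
    static void mergeInto(double[] target, double[] source) {
        for (int i = 0; i < target.length; i++) {
            target[i] += source[i];
        }
    }
}
```

The trade-off matches the one described above: merges avoid rebuilding maps, but every category has to be known up front.
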
Group Targets

Group targets allow the histogram to track multiple targets at the same time. Each bin contains a sequence of target information. Optionally, the target types in the group can be declared when creating the histogram. Declaring the types on creation allows the targets to be missing in the first insert:

examples> (-> (create :group-types [:categorical :numeric])
              (insert! 1 [:a nil])
              (insert! 2 [:b 8])
              (insert! 3 [:c 7])
              (insert! 1 [:d 6])
              (bins))
({:target
  ({:counts {:d 1.0, :a 1.0}, :missing-count 0.0}
   {:sum 6.0, :sum-squares 36.0, :missing-count 1.0}),
  :mean 1.0,
  :count 2}
 {:target
  ({:counts {:b 1.0}, :missing-count 0.0}
   {:sum 8.0, :sum-squares 64.0, :missing-count 0.0}),
  :mean 2.0,
  :count 1}
 {:target
  ({:counts {:c 1.0}, :missing-count 0.0}
   {:sum 7.0, :sum-squares 49.0, :missing-count 0.0}),
  :mean 3.0,
  :count 1})

Rendering

There are multiple ways to render the charts; see examples.clj. Here is an example of rendering a single function, namely the cumulative probability:

examples> (def hist (reduce hst/insert! (hst/create) [1 1 2 3 4 4 4 5]))
examples> (let [{:keys [min max]} (hst/bounds hist)]
            (core/view (charts/function-plot (hst/cdf hist) min max)))

(core and charts are Incanter namespaces; hst is an alias for bigml.histogram.core.)

To render multiple functions on the same chart, you would use add-function with the result of function-plot:

examples> (let [{:keys [min max]} (hst/bounds hist)]
            (core/view (-> (charts/function-plot (hst/cdf hist) min max :legend true)
                           (charts/add-function (hst/pdf hist) min max))))

Performance-related concerns

Freezing a Histogram

While the ability to adapt to non-stationary data streams is a strength of the histograms, it is also computationally expensive. If your data stream is stationary, you can increase the histogram's performance by setting the :freeze parameter. Once the number of inserts into the histogram has exceeded the :freeze parameter, the histogram bins are locked into place. As the bin means no longer shift, inserts become computationally cheap. However, the quality of the histogram can suffer if the :freeze parameter is too small.

examples> (time (reduce insert! (create) ex/normal-data))
"Elapsed time: 333.5 msecs"
examples> (time (reduce insert! (create :freeze 1024) ex/normal-data))
"Elapsed time: 166.9 msecs"

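One way to picture a frozen insert is the Java sketch below. It is an interpretation of the description above rather than the library's code: FrozenInsertSketch is a hypothetical name, and the sketch assumes that a frozen histogram simply credits each new point to the bin with the closest mean.

```java
import java.util.Arrays;

final class FrozenInsertSketch {

    /**
     * Adds one point to the bin whose mean is closest to value and returns that bin's index.
     * means must be sorted ascending and non-empty; it is never modified.
     */
    static int insertFrozen(double[] means, double[] counts, double value) {
        int index = Arrays.binarySearch(means, value);
        if (index < 0) {
            int next = -index - 1; // index of the first mean greater than value
            if (next == 0) {
                index = 0;
            } else if (next == means.length) {
                index = means.length - 1;
            } else {
                // Pick whichever neighbour is closer; ties go to the lower bin.
                index = (value - means[next - 1] <= means[next] - value) ? next - 1 : next;
            }
        }
        counts[index] += 1;
        return index;
    }
}
```

Because no bin is created or merged, each insert costs one binary search and one increment.
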
Performance

There are two implementations of bin reservoirs (which support the insert! and merge! functions). Either of the two implementations, :tree and :array, can be explicitly selected with the :reservoir parameter. The :tree option is useful for histograms with many bins, as the insert time scales at O(log n) with respect to the number of bins. The :array option is good for a small number of bins: inserts are O(n), but there is less overhead. If :reservoir is left unspecified, :array is used for histograms with <= 256 bins and :tree is used for anything larger.

examples> (time (reduce insert! (create :bins 16 :reservoir :tree)
                        ex/normal-data))
"Elapsed time: 554.478 msecs"
examples> (time (reduce insert! (create :bins 16 :reservoir :array)
                        ex/normal-data))
"Elapsed time: 183.532 msecs"

Insert times using reservoir defaults:

![timing chart](https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc&oid=2&zx=mppmmoe214jm)

License

Distributed under the Apache License, Version 2.0.

histogram's Issues

_sumToBinMap not updated after Histogram.merge()

clearCacheMap() is called after insertBin(.) but not after histogram.merge() - this causes NPE in later calls to findPointForSum()

Heatmap example

Hi,

First of all, congratulations for all the work! This is super interesting 😃

I saw you published here (https://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/) several examples.
I am unable to find any code example on how to do a heatmap, would you be so kind as to give some pointers?

Much appreciated!

Gracefully handle merging two empty histograms

(.mergeHistogram (Histogram. 10) (Histogram. 10))

Leads to:

Can't mix insert types (two value bins and three value bins)
  [Thrown class com.bigml.histogram.MixedInsertException]

Restarts:
 0: [QUIT] Quit to the SLIME top level

Backtrace:
  0: com.bigml.histogram.Histogram.checkType(Histogram.java:331)
  1: com.bigml.histogram.Histogram.mergeHistogram(Histogram.java:274)
...

Unexpected uniform exception

histogram.core> (uniform (reduce insert! (create) [-0.0 0 1 2]) 5)

No message.
  [Thrown class java.lang.NullPointerException]

Restarts:
 0: [QUIT] Quit to the SLIME top level

Backtrace:
  0:       (Unknown Source) com.bigml.histogram.Histogram.findPointForSum
  1:       (Unknown Source) com.bigml.histogram.Histogram.uniform
  2:           core.clj:186 histogram.core/uniform
  3:       NO_SOURCE_FILE:1 histogram.core/eval48391
  4:     Compiler.java:6511 clojure.lang.Compiler.eval
  5:     Compiler.java:6477 clojure.lang.Compiler.eval
  6:          core.clj:2797 clojure.core/eval
  7:           basic.clj:47 swank.commands.basic/eval-region
  8:           basic.clj:37 swank.commands.basic/eval-region
  9:           basic.clj:71 swank.commands.basic/eval873[fn]
 10:           Var.java:415 clojure.lang.Var.invoke
 11:       (Unknown Source) histogram.core/eval48389
 12:     Compiler.java:6511 clojure.lang.Compiler.eval
 13:     Compiler.java:6477 clojure.lang.Compiler.eval
 14:          core.clj:2797 clojure.core/eval
 15:            core.clj:96 swank.core/eval-in-emacs-package
 16:           core.clj:250 swank.core/eval-for-emacs
 17:           Var.java:423 clojure.lang.Var.invoke
 18:           AFn.java:167 clojure.lang.AFn.applyToHelper
 19:           Var.java:532 clojure.lang.Var.applyTo
 20:           core.clj:601 clojure.core/apply
 21:           core.clj:103 swank.core/eval-from-control
 22:           core.clj:108 swank.core/eval-loop
 23:           core.clj:323 swank.core/spawn-repl-thread[fn]
 24:           AFn.java:159 clojure.lang.AFn.applyToHelper
 25:           AFn.java:151 clojure.lang.AFn.applyTo
 26:           core.clj:601 clojure.core/apply
 27:           core.clj:320 swank.core/spawn-repl-thread[fn]
 28:        RestFn.java:397 clojure.lang.RestFn.invoke
 29:            AFn.java:24 clojure.lang.AFn.run
 30:        Thread.java:680 java.lang.Thread.run

Bin count is double as opposed to long (or integer)

I understand why mean has to be double, but I don't understand why count is double. I could cast it to int in my code, but I just want to understand whether it could ever be fractional.

P.S.: Are you guys on IRC? I just think raising issues for questions like these is overkill :)

Histogram is not Serializable

When trying to serialize a Histogram I get a NotSerializableException. This seems like a pretty simple object; serialization should be a trivial matter.

Since these Histograms serve to summarize pretty huge datasets, it's an important feature to serialize them for re-use after a service restart.

Thank you

Round-tripping fails when freezeThreshold is set

I ran into this problem while writing tests for my own serialization code. Luckily, I don't think I actually need freezeThreshold right now.

So far, I just have a failing test.

Seems like there are two different errors:

IndexOutOfBounds, I'm guessing when the freeze threshold is exceeded on the first bin. Don't understand this one fully yet.
Bins mismatch after round-trip. If we want to round-trip with insert-bin!, seems like we need some flag for "insert, but ignore freezeThreshold for this insert". Otherwise lots of bins at the end get squashed into one bin as soon as the freezeThreshold is exceeded.

Also, note that the existing round trip test makes no assertions :).

bigml.histogram.core/pdf fails for input 0

Hi,

a histogram with a bucket at position 0 always returns NaN when calling the pdf function returned by bigml.histogram.core/pdf.

For example:

(let [hist (h/create)
  _ (h/insert! hist 0)]
  ((h/pdf hist) 0))

This should return some value > 0 imho. The problem seems to be located in Histogram.java, line 394. If p == 0, then Double.longBitsToDouble(Double.doubleToLongBits(0) - 1) == Double.NaN.
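
The bit arithmetic described in this report can be checked in isolation. The sketch below is standalone Java and does not touch Histogram.java. It assumes the intent of the expression is "the largest double below p" and compares it with the standard Math.nextDown; NextDownSketch and bitsMinusOne are names made up for the demonstration.

```java
final class NextDownSketch {

    /** The expression from the report: subtract one from the raw bits of p. */
    static double bitsMinusOne(double p) {
        return Double.longBitsToDouble(Double.doubleToLongBits(p) - 1);
    }

    public static void main(String[] args) {
        // For positive p the trick yields the next smaller double.
        System.out.println(bitsMinusOne(1.0) == Math.nextDown(1.0)); // prints true

        // For 0.0 the bits are all zero, so subtracting one gives all ones: a NaN pattern.
        System.out.println(Double.isNaN(bitsMinusOne(0.0)));         // prints true

        // For negative p the trick moves towards zero, i.e. upwards.
        System.out.println(bitsMinusOne(-1.0) > -1.0);               // prints true

        // Math.nextDown handles zero and negative inputs.
        System.out.println(Math.nextDown(0.0));                      // prints -4.9E-324
    }
}
```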

Are histogram memory sizes bounded practically?

By keeping the number of bins constant we ensure that the size of the histograms in memory is bounded, theoretically.

I did some investigation by inserting 100k values into a few histograms and got their average size in memory using sizeof.

I've included the results below, which show a small positive trend as we continue adding values,
after all the bins (using 100 here) have been populated.

Is this expected behavior? Is there a way to impose a hard limit on the size of the histograms in memory?

Double.valueOf fails in locales where comma is the decimal separator

Why is this idiom used here? Double.valueOf expects a decimal point as the separator, but the German locale uses a comma instead. This leads to a NumberFormatException in line 30.

IMHO the correct way to parse the number is reusing the NumberFormat instance here, although I don't grok why the double gets formatted and reparsed in the same line.
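
The failure mode can be reproduced without the library. The Java sketch below is only an illustration: LocaleRoundTripSketch and its methods are hypothetical, the "0.#####" pattern is an assumption rather than the pattern used in the referenced line, and the comma separator is set explicitly on the format symbols so the result does not depend on which locale data the JVM ships with.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

final class LocaleRoundTripSketch {

    /** A formatter that writes decimals the way a comma locale such as German does. */
    static DecimalFormat commaFormat() {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(',');
        symbols.setGroupingSeparator('.');
        return new DecimalFormat("0.#####", symbols);
    }

    /** Formats with a comma separator, then parses with Double.valueOf. */
    static boolean valueOfFails(double value) {
        try {
            Double.valueOf(commaFormat().format(value));
            return false;
        } catch (NumberFormatException e) {
            return true; // "1,5" is not a valid Java double literal
        }
    }

    /** Parsing with the formatter that produced the text agrees on the separator. */
    static double sameFormatRoundTrip(double value) {
        DecimalFormat format = commaFormat();
        try {
            return format.parse(format.format(value)).doubleValue();
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Rounding without a text round trip sidesteps the locale question. */
    static double round(double value, int decimals) {
        return BigDecimal.valueOf(value).setScale(decimals, RoundingMode.HALF_UP).doubleValue();
    }
}
```

Here valueOfFails(1.5) returns true, while sameFormatRoundTrip(1.5) and round(1.5, 5) both return 1.5.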

Histogram.sum(double) not thread safe

When calling Histogram's sum(double) a thousand times inside a parallel loop, I sometimes get the NullPointerException below. It is fixed by adding "synchronized" to the extendedSum(double p) method.

java.lang.NullPointerException
at java.util.TreeMap.rotateLeft(TreeMap.java:2060)
at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2127)
at java.util.TreeMap.put(TreeMap.java:574)
at com.bigml.histogram.Histogram.refreshCacheMaps(Histogram.java:689)
at com.bigml.histogram.Histogram.getPointToSumMap(Histogram.java:707)
at com.bigml.histogram.Histogram.extendedSum(Histogram.java:349)
at com.bigml.histogram.Histogram.sum(Histogram.java:296)
....
at scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:120)
at scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:67)
at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1057)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1054)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
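
The trace points at a TreeMap being filled by refreshCacheMaps while other threads use it. The following Java sketch shows the general pattern and the fix suggested above. It is a simplified stand-in, not the library's code: LazyCacheSketch is hypothetical and returns a plain cumulative count instead of the interpolated sum.

```java
import java.util.Map;
import java.util.TreeMap;

final class LazyCacheSketch {
    private final double[] means;
    private final double[] counts;
    private TreeMap<Double, Double> pointToSum; // built on first use, shared by all callers

    LazyCacheSketch(double[] means, double[] counts) {
        this.means = means.clone();
        this.counts = counts.clone();
    }

    /**
     * Total count of the bins whose mean is at most p.
     * synchronized ensures a single thread builds the cache and nobody reads it half-built;
     * without it, concurrent callers can run TreeMap.put on the same map at the same time.
     */
    synchronized double countUpTo(double p) {
        if (pointToSum == null) {
            pointToSum = new TreeMap<>();
            double running = 0;
            for (int i = 0; i < means.length; i++) {
                running += counts[i];
                pointToSum.put(means[i], running);
            }
        }
        Map.Entry<Double, Double> entry = pointToSum.floorEntry(p);
        return entry == null ? 0 : entry.getValue();
    }
}
```

An alternative with the same effect would be to build the map completely in a local variable and only then publish it through a volatile field, which keeps reads lock-free at the cost of occasionally building the cache twice.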

Support for min and max of targets

I'm interested in being able to track the min and max values of targets for each bin. It seems like it would be trivial to implement this in NumericTarget. Does that seem right to you, or am I missing something? Are you interested in having changes contributed back? Should we make it possible to configure which target metrics you're interested in tracking, so we don't waste space or computation on metrics you don't care about? Any design input on how that should work?

Gap weighting appears to be broken

(def foo
  [213000 4000 4000 4000 4000 4000 5000 4000 5000 5000 6000 5000
   6000 5000 4000 6000 5000 6000 5000 7000 6000 4000 6000 6000 6000
   5000 6000 6000 6000 6000 6000 4000 6000 5000 6000 6000 5000 6000
   6000 6000])

(hst/total-count (reduce hst/insert!
                         (hst/create :bins 4 :gap-weighted? true)
                         foo))

Notion of targets is confusing

I am trying to use the library for implementing histograms in https://github.com/rackerlabs/blueflood

Thanks for writing this wonderful library and making it available for community. I appreciate it.

My use case:
I have a use case where I compute histograms from raw samples and have to persist them to disk in a compact format. I also have a use case of generating coarser histograms from finer ones. So when I write histograms to disk, I would get the bins and, for each bin, persist the mean value and also the count (which is surprisingly a double). When I read the bins back from disk, I want to create histogram objects from them and merge those into one giant histogram.

My problem:
I am just not clear which of the insert methods I should be using to create a histogram from its constituent bins. I looked at insertBin(Bin<T>). The problem with T is that it extends Target, which is either numeric or categorical. That's when I got confused as to what target actually means. So when I read the bins from disk, which would be a list of pairs of mean value and count, how would I reconstruct the histogram? Should I just use a numeric target?

I tried to look at the tests and they don't help me much.

Other thoughts:
I looked at the README and I just don't comprehend why histograms have to track something. As the README notes, histogram has to give the distribution of a numeric variable. I think there is a mixup of concerns there. Can you provide a concrete example of how someone would use targets?

Also, it would be nice to have a histogram constructed from a JSON representation. I like the toJSONArray method. I'd like a static method createFromJSONArray(). That way we have both serialization and deserialization. I can code this up but wanted to know if I am thinking about the problem correctly.

Please clarify uniform's "equal population"

Could you be so kind as to provide a little more detail about what uniform does? I believed that it returns points such that the number of elements in the histogram's series between points n and n+1 is always the same, i.e. that each of these intervals will contain approximately (length of series / number of bins) values, but that doesn't seem to be true.

In my case, I had 1347 values between 0 and 52,000 and uniform produced these 6 bins (rounded) (4 37 68 131 279 681) with the following populations:
({4 300} {37 173} {68 152} {131 151} {279 183} {681 196} {:rest 192}) (300 values between 0 and 4, 173 values between 4 and 37 etc.)

(6 was the max number of bins that uniform produced, no matter what larger number max-bins was set to)

I suppose I just do not correctly understand what "bins with equal population" in uniform's docstring means. I would be happy if the documentation could be improved to make it clearer to those of us who are less savvy in the domain of statistics.

Thank you! And thanks for the great library!

Maven Dependency issue

I've been using your library for a while, but I can no longer pull your library from Maven. I have not built this project for about a year.
I need to build a new version, but I cannot because the build can no longer locate your library. I tried 4.1.3 and 4.1.4 and they have the same problem.

In my .pom file I have
<dependency>
  <groupId>bigml</groupId>
  <artifactId>histogram</artifactId>
  <version>4.1.4</version>
</dependency>
...
<repositories>
  <repository>
    <id>Clojars.repository</id>
    <url>http://clojars.org/repo/</url>
  </repository>
</repositories>

In Jenkins I can see it getting the histogram-4.1.4.pom from the clojars repo, but then it gets the same file from repo.maven.apache.org which does not exist, then I get an error about a different repo. ( See output below)

Jenkins OUTPUT:
Downloading: http://clojars.org/repo/bigml/histogram/4.1.4/histogram-4.1.4.pom
Downloading: https://repo.maven.apache.org/maven2/bigml/histogram/4.1.4/histogram-4.1.4.pom
...
Failed to read artifact descriptor for bigml:histogram:jar:4.1.4: Could not transfer artifact bigml:histogram:pom:4.1.4 from/to jboss (https://repository.jboss.org/nexus/content/groups/public-jboss): Received fatal alert: internal_error -> [Help 1]

This is one of many libraries I'm using and all the others are pulling fine. I think there is a problem with your maven setup. Can you please advise me on what to do to fix this?

Ways to freeze bins and re-process data?

Hi,

First of all, great job on this lib. I totally love it! I have a question, though, about how it can be used to properly capture the distribution of multiple samples while preserving the bin locations.

At the moment, all I see possible is to merge all the histograms from each sample into one, then freeze the bins, and somehow clear the histogram while keeping the frozen bins, and then re-add each sample one by one to have a fixed bin placement across all samples.

Is there any better way to do this? Is there another approach I should be aware of for this particular case?

Thank you in advance!

Strange behaviour of bigml.histogram.core/pdf

Hi,

the bigml.histogram.core/pdf function should return the probability density function, right? That means the integral of this function is 1 and there is no x where ((pdf hist) x) is outside of the interval [0,1].

Now, if I run the snippet below (a simple histogram of a bimodal distribution), the pdf returns values > 50!

(require '[incanter 
           [core :refer [view]] 
           [charts :refer [xy-plot]]])
(require '[bigml.histogram.core :as h])

(let [hist-in-clj {:group-types [], 
                   :max-bins 10, 
                   :minimum 0.9905790666510712, 
                   :gap-weighted? false, 
                   :bins [{:mean 0.9905790666510712, :count 1} 
                          {:mean 0.9945287243737819, :count 71} 
                          {:mean 0.9984828454567461, :count 341} 
                          {:mean 1.0023886309412584, :count 325} 
                          {:mean 1.0087166176960565, :count 12} 
                          {:mean 1.103900907800562, :count 22} 
                          {:mean 1.1077538698109557, :count 192} 
                          {:mean 1.1114131757311698, :count 336} 
                          {:mean 1.1153047415614752, :count 195} 
                          {:mean 1.1200568358242649, :count 5}], 
                   :maximum 1.1207679176043404}
      xs (range 0.95 1.15 0.01)
      hist (h/clj-to-hist hist-in-clj)] 
  (view (xy-plot xs (map (h/pdf hist) xs))))

I tested the last two versions, 3.2.0 and 3.2.1, both show the same behaviour.

Report serialization status

As I discussed with Jao by email, it would be nice to have a way of knowing what point of the serialization we are at.

This would allow me to show the user a nice progress bar with a message saying "We are now serializing", if that happens to mean anything to our average user.

Cheers,
Miguel