Giter Club home page Giter Club logo

isarn-sketches's Introduction

isarn-sketches

Sketching data structures

API documentation

Compatibility

isarn-sketches can operate with Algebird via the isarn-sketches-algebird-api

isarn-sketches can also operate with Apache Spark via the isarn-sketches-spark library

How to use in your project

// isarn-sketches
libraryDependencies += "org.isarnproject" %% "isarn-sketches" % "0.3.0"

// isarn-sketches-java
libraryDependencies += "org.isarnproject" % "isarn-sketches-java" % "0.3.0"

t-digest

scala> import org.isarnproject.sketches.TDigest
import org.isarnproject.sketches.TDigest

scala> val data = Vector.fill(10000) { scala.util.Random.nextGaussian() }
data: scala.collection.immutable.Vector[Double] = Vector(1.6046163970051968, 0.44151418924289004, ...

scala> val sketch = TDigest.sketch(data)
sketch: org.isarnproject.sketches.TDigest = TDigest(0.5,0,74,TDigestMap(-3.819069044174932 -> (1.0, 1.0), ...

scala> sketch.cdf(0)
res0: Double = 0.4984362744530557

scala> sketch.cdfInverse(0.5)
res1: Double = 0.0038481195948969205

t-digest resources

isarn-sketches's People

Contributors

erikerlandson avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

isarn-sketches's Issues

Is the monoidal sum over multiple t-digests stable?

If I take the monoidal sum over some t-digests, it gives me some distribution behavior that looks a bit skewed:

reproducer:

scala> import org.isarnproject.sketchesAlgebirdAPI.implicits._
import org.isarnproject.sketchesAlgebirdAPI.implicits._

scala> val data = Vector.fill(100) { Vector.fill(10000) { scala.util.Random.nextGaussian() } }
data: scala.collection.immutable.Vector[scala.collection.immutable.Vector[Double]] = Vector(Vector(0.1778102040514962, ...

scala> val tdvec = data.map(TDigest.sketch(_))
tdvec: scala.collection.immutable.Vector[org.isarnproject.sketches.TDigest] = Vector(TDigest(0.5,75,TDigestMap(-4.215406387806561 -> (1.0, 1.0), ...

scala> val tdsum = com.twitter.algebird.Monoid.sum(tdvec)
tdsum: org.isarnproject.sketches.TDigest = TDigest(0.5,113,TDigestMap(-4.776867999224537 -> (1.0, 1.0), ...

scala> tdsum.cdf(0)
res5: Double = 0.46485982756854377    // should be close to 0.5

scala> tdsum.cdfInverse(0.5)
res6: Double = 0.19202097508054178    // should be close to 0

A theory (unverified) is that as the sum grows on the left-hand-side, the cluster mass disparities between LHS and RHS make the current ++ logic unstable.

C (compression) and M (maxDiscrete) parameters return the same results

Greetings, Mr. Erlandson (@erikerlandson),

The team I represent recently started using this solution and we've been trying to tune accuracy and performance via
different C (compression) and M (maxDiscrete) parameters.

In our case, we had a dataframe called "originDf" which has 3 columns ("id": String, "fruit_type": String, "count": Integer)

+------+--------------------+-----------+
|id |fruit_type |count |
+------+--------------------+-----------+
|001|Apples |2 |
|002|Apricots |79 |
|001|Avocados |4 |
|003|Watermelon |13 |
|007|Blueberries |5 |
|007|Cherries |6 |
|007|Clementine |41 |
|007|Cucumbers |5 |
|007|Elderberry |3 |
|007|Eggfruit |1 |
|008|Eggfruit |19 |
|012|Clementine |61 |
|013|Blueberries |21 |
|014|Blueberries |4 |
|...|Lime |4 |
|...|Rambutan |3 |
|...|Strawberries |6 |
|...|Watermelon |5 |
|...|Tangerine |3 |
|...|Tangerine |6 |
+------+--------------------+-----------+
This dataframe has roughly 0.2 billion rows in total.

We then did the following:

val t_digest_udf1= TDigestAggregator.udf[Double](compression = C, maxDiscrete = M)
val groupDf1 = originDf.groupBy(col("fruit_type")).agg(t_digest_udf1(col("count")) as "t_digests",)
val udf1 = groupDf1.first()
val t1 = udf1.getAsTDigest

where (C, M) is respectively (100, 100), (0.01, 100), (100, 10000), and the resulting single TDigest from each set as t1, t2, and t3

t1: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t2: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t3: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

It turns out t1, t2 & t3 (all of the org.isarnproject.sketches.java.TDigest class) look exactly the same value-wise.
Though Boolean checks such as t1.equal(t2) all return false, which indicates these are different TDigest entities somehow.

Do you see anything off in this usage? Did we use the TDigest correctly, and more importantly, is our understanding of the algorithm correct?

Thank you and we look forward to your responses.

Save and load TDigest?

Hi Erik,

I'm using isarn-sketches-spark package to compute T-digest on Spark and would like to persist TDigestSQL or TDigest data into a PostgreSQL database. Potentially I also would like to be able to load the TDigest data from the database.

What would be the best way of doing that, if ever possible? Thanks very much!

Richard

more mathematically grounded heuristic for recluster triggering

the main triggering scenario for re-clustering is if the distribution of the incoming data begins to move. In the "null-hypothesis" case of "the distribution is stationary" then the expected number of new clusters (as a function of N, the number of samples seen so far) is 2log(N) - if the actual number of clusters is growing substantially faster than that, then it is most likely because new extrema are appearing faster than they "should be" due to new extrema being encountered.

The idea would be to work out the details of how to turn this basic idea into a more statistically grounded test for when to trigger a re-clustering.

Related: if the sketch needs a re-clustering due to the above scenario, then it implies that only the tails of the distribution need to be re-clustered. That is, a bunch of singleton-clusters that are accumulating at one or both ends. Figuring out a good way to re-cluster only these tails would be faster and probably more numerically stable. See for example this. Possibly there is a way to re-insert these tail clusters while using some kind of discounted measure of the quantile, so the insert logic is persuaded to make larger clusters closer in to the distribution body.

Upgrading to sbt 1.5 leads to failure to resolve dependency

Upgrading to sbt 1.5 leads to failure to resolve genjavadoc-plugin transitive dependency

Not an issue with the project functionality but attempting to upgrade and feels wrong to not be able to resolve dependencies.

Upgrading to sbt 1.5.5 produces this

error] insecure HTTP request is unsupported 'Patterns(ivyPatterns=Vector(http://dl.bintray.com/content/sbt/sbt-plugin-releases/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branc
h]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), artifactPatterns=Vector(http://dl.bintray.com/content/sbt/sbt-plugin-releases/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)
([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), isMavenCompatible=false, descriptorOptional=false, skipConsistencyCheck=false)'; switch to HTTPS or opt-in as Resolver.url("bintray-sbt-p
lugin-releases", url(...)).withAllowInsecureProtocol(true), or by using allowInsecureProtocol in repositories file                                                                                       
[error] java.lang.RuntimeException: insecure protocol is unsupported                                                                                                                                     
[error]         at scala.sys.package$.error(package.scala:30)                                       
[error]         at sbt.Classpaths$.errorInsecureProtocol(Defaults.scala:3312)                       
[error]         at sbt.Classpaths$.$anonfun$mkIvyConfiguration$1(Defaults.scala:3906)               
[error]         at scala.Function1.$anonfun$compose$1(Function1.scala:49)                           
[error]         at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)       
[error]         at sbt.std.Transform$$anon$4.work(Transform.scala:68)                               
[error]         at sbt.Execute.$anonfun$submit$2(Execute.scala:282)                                 
[error]         at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:23)             
[error]         at sbt.Execute.work(Execute.scala:291)                                              
[error]         at sbt.Execute.$anonfun$submit$1(Execute.scala:282)                                 
[error]         at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error]         at sbt.CompletionService$$anon$2.call(CompletionService.scala:64)                   
[error]         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)               
[error]         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[error]         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)               
[error]         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[error]         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[error]         at java.base/java.lang.Thread.run(Thread.java:829)

Fixing the resolver urls to use https produces this

sbt:isarn-sketches> compile
[info] Updating 
[info] Resolved  dependencies
[warn] 
[warn]  Note: Unresolved dependencies path:
[error] stack trace is suppressed; run last isarn_sketches_java / update for the full output
[error] (isarn_sketches_java / update) sbt.librarymanagement.ResolveException: Error downloading com.typesafe.genjavadoc:genjavadoc-plugin_2.12.14:0.15
[error]   Not found
[error]   Not found
[error]   not found: /development/projects/10_misc/isarn-sketches/isarn-sketches/project/.ivy/localcom.typesafe.genjavadoc/genjavadoc-plugin_2.12.14/0.15/ivys/ivy.xml
[error]   not found: https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.14/0.15/genjavadoc-plugin_2.12.14-0.15.pom

Have tried a number of things locating this plugin but am stuck.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.