twitter / algebird
Abstract Algebra for Scala
Home Page: https://twitter.github.io/algebird
License: Apache License 2.0
Here is an idea:
// sigil trait
trait CommutativePlus
// sigil trait
trait CommutativeTimes
Then we can mix CommutativePlus into most of the Monoids/Semigroups/etc., and we can use isInstanceOf to test at runtime and type constraints to check at compile time.
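A minimal sketch of how this could look, assuming the sigil traits above and algebird's Monoid (IntAdditionMonoid, sumInAnyOrder and canReorder are only illustrative names):

import com.twitter.algebird.Monoid

// Tag an instance with the sigil:
implicit object IntAdditionMonoid extends Monoid[Int] with CommutativePlus {
  def zero = 0
  def plus(l: Int, r: Int) = l + r
}

// Compile-time: only callable when the monoid is tagged commutative.
def sumInAnyOrder[T](items: Seq[T])(implicit m: Monoid[T] with CommutativePlus): T =
  items.foldLeft(m.zero)(m.plus)

// Run-time: fall back to an isInstanceOf check.
def canReorder[T](m: Monoid[T]): Boolean = m.isInstanceOf[CommutativePlus]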
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> implicit val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@56fb2ac4
scala> val two = monoid.create(Seq(1L, 1L))
two: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
scala> val three = monoid.create(Seq(1L, 1L, 1L))
three: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
scala> two.frequency(1L)
res0: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)
scala> three.frequency(1L)
res1: com.twitter.algebird.Approximate[Long] = Approximate(3,3,3,0.99)
scala> monoid.plus(two, three)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.frequency(1L)
res3: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)
scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,2), HeavyHitter(1,3)))
Summing two and three should give 5 as the frequency, but it's giving 2. Might be related to the duplicated Heavy Hitters issue.
There is something going on with some of the data-structures we have here: BloomFilter is a kind of approximate Set. HyperLogLog is an approximate set with Cardinality. Count-min-sketch is an approximate Map (so Count-min-sketch has an approximate set on the keys).
I think we could define a few traits to represent these concepts.
Something like:
trait ApproximateSet[V] {
def mayContain(v: V): Boolean
def containsFalsePosProb: Double
def containsFalseNegProb: Double
}
In fact, most of these have false neg prob == 0.0, so maybe we should encode that in the type (for cases where you want to be sure you don't miss anything).
trait ApproxAcceptingSet[V] extends ApproximateSet[V] {
final def containsFalseNegProb = 0.0
}
Maybe we want to punt and avoid returning the false-negative probability entirely, and support set union (which these, as monoids, already have):
trait ApproximateSet[V,S<:ApproximateSet[V,S]] {
// if we return false, the item is definitely NOT in this set.
def mayContain(v: V): Boolean
// f-bounded polymorphism to preserve the types:
def union(that: S): S
}
For size, maybe something like:
trait ApproximateCardinality {
def approxSize: Long
}
For Count-min-sketch, it looks like we need something like:
trait ApproximateMap[K,V] {
def approxValue(k: K): Option[V]
}
In this view, I think CMS extends ApproximateMap, BloomFilter extends ApproximateSet, and HyperLogLog extends ApproximateSet with ApproximateCardinality
You can think of Count-min-sketch as an approximate Map[Long,Long].
The algorithm seems to generalize to Map[Long, V : Ordering : Monoid], because we use the count (monoid) and the min (ordering).
If we include a 64-bit hash from K => Long, then we get
SketchMap[K, V](implicit keyhash: Hashable[K,Long], monoidv: Monoid[V], ord: Ordering[V])
and generalize the CMS code in the obvious way.
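A rough sketch of that generalized core (this is not the real CMS code; SketchTable is a hypothetical name, and the hash below is only a placeholder for the pairwise-independent hashes CMS actually uses):

import com.twitter.algebird._

// A depth x width table of V cells: updates sum into one cell per row with the
// Monoid, and queries take the smallest estimate across rows with the Ordering.
case class SketchTable[V](cells: Vector[Vector[V]])(implicit mon: Monoid[V], ord: Ordering[V]) {
  // Placeholder hash; the real thing would use pairwise-independent hash functions.
  private def hash(row: Int, k: Long, width: Int): Int = {
    val h = (k ^ (k >>> 32)).toInt * (2 * row + 1)
    ((h % width) + width) % width
  }

  def updated(k: Long, v: V): SketchTable[V] =
    SketchTable(cells.zipWithIndex.map { case (rowCells, row) =>
      val i = hash(row, k, rowCells.size)
      rowCells.updated(i, mon.plus(rowCells(i), v))
    })

  // Every row over-counts, so report the minimum estimate across rows.
  def estimate(k: Long): V =
    cells.zipWithIndex.map { case (rowCells, row) => rowCells(hash(row, k, rowCells.size)) }.min
}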
Should we remove CMS and move to the above? I don't know.
scala> SummingQueue[Int](0)
java.lang.IllegalArgumentException
at java.util.concurrent.ArrayBlockingQueue.<init>(ArrayBlockingQueue.java:169)
at SummingQueue.<init>(<console>:27)
at SummingQueue$.apply(<console>:21)
at .<init>(<console>:22)
at .<clinit>(<console>)
at .<init>(<console>:11)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:704)
at scala.tools.nsc.interpreter.IMain$Request$$anonfun$14.apply(IMain.scala:920)
at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
at java.lang.Thread.run(Thread.java:680)
Algebird is now much more than just the Monoid trait. Anyone want to take a stab at updating the readme with all the algebird goodness?
This would allow libraries that use Algebird to write scalacheck tests using the laws.
See eg http://www.cse.cuhk.edu.hk/~taoyf/course/5705f10/lec8.pdf for how this works.
I have a basic working implementation of the DyadicRange logic (for positive integer keys, which you can map whatever else to), but it needs to be hooked up to CMS instances (one per level). We also want to implement binary search using this to get quantile queries.
case class DyadicRange(maxValue: Long = Long.MaxValue) {
  val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt

  def indicesForPoint(v: Long) = (1 to levels).map { level => (level, indexForPoint(v, level)) }
  def indicesForRange(start: Long, end: Long): List[(Int, Long)] = indicesForRange(start, end, levels)

  def indexForPoint(v: Long, level: Int) = v >> (level - 1)
  def rangeForIndex(i: Long, level: Int) = (i << (level - 1), ((i + 1) << (level - 1)) - 1)

  def indicesForRange(start: Long, end: Long, maxLevel: Int): List[(Int, Long)] = {
    if (start > end) {
      Nil
    } else if (start == end) {
      List((1, start))
    } else {
      val startIndex = indexForPoint(start, maxLevel)
      val (a, b) = rangeForIndex(startIndex, maxLevel)
      if (a >= start) {
        if (b <= end) {
          List((maxLevel, startIndex)) ++ indicesForRange(start, a - 1, maxLevel) ++ indicesForRange(b + 1, end, maxLevel)
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      } else {
        val (a2, b2) = rangeForIndex(startIndex + 1, maxLevel)
        if (b2 <= end && a2 <= end) {
          List((maxLevel, startIndex + 1)) ++ indicesForRange(start, a2 - 1, maxLevel - 1) ++ indicesForRange(b2 + 1, end, maxLevel)
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      }
    }
  }
}
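For reference, a hypothetical usage sketch (not yet wired to CMS) of how the decomposition would drive one CMS instance per level:

val dr = DyadicRange(1L << 20)
val pointCells = dr.indicesForPoint(7L)      // one (level, index) cell per level, updated on insert
val rangeCells = dr.indicesForRange(3L, 12L) // the cells whose union covers [3, 12]
// A range-count query sums the per-level CMS estimates over rangeCells;
// quantile queries then binary-search on such range counts.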
Useful in cases where we might need very large exact counts.
I'd like to see a monoid for detecting bursts in streams of data:
http://www.cs.cornell.edu/home/kleinber/bhs.pdf
/cc @johnynek
We should implement Q-Digest for efficient quantile estimations - this is probably better than using CMS as in #58. For reference, there's a Java implementation here:
https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/QDigest.java
This module should contain thrift mappings for all Algebird case classes. Out of this we'll publish scrooge and java jars of these structs.
This will be of use in tests and elsewhere.
trait Equatable[T] extends Function2[T, T, Boolean] {
  def apply(left: T, right: T): Boolean
}

object Equatable {
  def apply[T](left: T, right: T)(implicit eq: Equatable[T]): Boolean = eq(left, right)
  def from[T](fn: (T, T) => Boolean) = new Equatable[T] { def apply(l: T, r: T) = fn(l, r) }
  // Make usual ones for numbers and all the containers we use in groups/monoids/etc...
  // Might be useful to have something like:
  def withEps[T: Metric](eps: Double) = from[T] { Metric(_, _) <= eps }
}
Once we pull in a Bijection dependency,
case class InvertibleAffineFunction[R](slope: R, intercept: R)(implicit field: Field[R])
    extends Bijection[R, R] {
  lazy val toFn: Function1[R, R] = { x => this.apply(x) }
  override def apply(x: R): R = field.plus(field.times(slope, x), intercept)
  override def invert(x: R): R = field.minus(field.div(x, slope), field.div(intercept, slope))
  // apply this, then other: other(this(x))
  def andThen(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] =
    InvertibleAffineFunction[R](
      field.times(other.slope, slope),
      field.plus(field.times(other.slope, intercept), other.intercept))
  // apply other, then this: this(other(x))
  def compose(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] = other.andThen(this)
}
http://suereth.blogspot.com/2011/03/annotate-your-type-classes.html
We should put something like that on Semigroup/Monoid/Ring/Field/Metric/VectorSpace.
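Concretely, that would look something like this on (simplified versions of) the existing traits; the message strings are just strawmen:

import scala.annotation.implicitNotFound

@implicitNotFound(msg = "Cannot find Semigroup type class for ${T}")
trait Semigroup[T] {
  def plus(l: T, r: T): T
}

@implicitNotFound(msg = "Cannot find Monoid type class for ${T}")
trait Monoid[T] extends Semigroup[T] {
  def zero: T
}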
Starting here: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Ring.scala#L104
This seems to preclude some stuff using the scalding matrix API, for example templated functions which operate on generic matrices and do multiplication. I guess the reason is that the compiler gives up trying to figure out all possible kinds of values that can be in the matrix and still lead to a valid function.
See http://comonad.com/reader/2008/linear-bloom-filters/:
So to recap, we took a normal (or counted or spectral) Bloom filter, crossbred it with Litwin's linear hash table and found that the mutant offspring is an approximation of a set that is better suited to sharing over the network than either structure alone, with a memory usage profile similar to that of a linear hash table. Interestingly as a side effect you can go one step further and allow for transmission of a requested subset of the exact hashes present in which case we've really only used the Blooms to provide partial information about the underlying linear hash table, which can aid in the subsequent join process.
And yes, they are probably better named something like Bloomed linear hash tables, but that doesn't roll off the tongue.
We should write some reusable code in algebird-test for asserting a scalacheck property that is only true with probability at least P. Then, by repeating the check and applying the Chernoff bound, we can verify the claim in such a way that a correct property fails the test only with very small probability (something like 10^-9), so that the tests will almost always pass.
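A rough sketch of the shape of this, assuming scalacheck's Prop (the bound arithmetic is illustrative, not tuned):

import org.scalacheck.Prop

// Run a check that should succeed with probability >= p `trials` times and require
// at least `minSuccesses` successes, where `minSuccesses` sits far enough below the
// mean (via a Chernoff bound) that a true claim fails only with probability ~1e-9.
def probabilisticProp(p: Double, trials: Int = 1000)(check: => Boolean): Prop = {
  val mean = trials * p
  // Chernoff: P(successes < (1 - eps) * mean) <= exp(-eps^2 * mean / 2); solve for 1e-9.
  val eps = math.sqrt(2.0 * math.log(1e9) / mean)
  val minSuccesses = math.floor((1.0 - eps) * mean).toInt
  val successes = (0 until trials).count(_ => check)
  if (successes >= minSuccesses) Prop.passed else Prop.falsified
}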
http://eng.42go.com/a-simple-time-decaying-approximate-membership-filter/
This stores each item T in each of its hash bucket indices (so it is considerably larger than a normal bloom filter), but it considers something present if t: T exists in one of the hashed locations.
This is interesting because it can only remember recent items, which is often useful for some streaming join, or some decayed approximation algorithm.
Currently a lot of the tests contain code for doing approximate equality of maps, lists, etc. It would be nice to move these to a common trait.
// String/List are instances here
trait LowerBounded[T] {
  def ordering: Ordering[T]
  def min: T
}

trait UpperBounded[T] {
  def ordering: Ordering[T]
  def max: T
}

// primitive types are instances here
trait Bounded[T] {
  def ordering: Ordering[T]
  def min: T
  def max: T
}
// Make an implicit Lower/Upper if you have a Bounded (don't subclass)
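For example, the derivation (rather than subclassing) might look like:

object Bounded {
  implicit def lowerFromBounded[T](implicit b: Bounded[T]): LowerBounded[T] =
    new LowerBounded[T] {
      def ordering = b.ordering
      def min = b.min
    }
  implicit def upperFromBounded[T](implicit b: Bounded[T]): UpperBounded[T] =
    new UpperBounded[T] {
      def ordering = b.ordering
      def max = b.max
    }
}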
As Oscar pointed out in #44, we're starting to get a lot of approximate collections. We tend to implement them right now as returning, say, a Long, and then having various other methods for describing the degree of error. Maybe instead they should be returning something like
trait Approximate[T] {
  // traits can't take a context bound, so carry the Numeric as a member
  implicit def numeric: Numeric[T]
  def estimate: T
  def max: T
  def min: T
  def probWithinBounds: Double
}
It seemed like there was some progress here: #35
Did anything get decided? It would be nice to be able to use other types in bloomfilter (and implement a bloom join).
I'd be happy to help out with it but I don't want to step on anything in progress.
Not trolling. The definitions of Semigroup, Monoid, VectorSpace etc in https://github.com/non/spire are not obviously incompatible with us, and it might be nicer for the broader Scala community if we could unify.
Consider:
ApproximateBoolean(false, 0.6) && ApproximateBoolean(false, 0.4)
This yields ApproximateBoolean(false, 0.6), which is actually inaccurate. The false is correct, but the probability is low. The actual probability should be as follows: (0.6 + 0.4) - (0.6 * 0.4) = 0.76. Which is to say, the probability that the left is false, or the right is false (non-exclusively).
There is a similar problem with the || operation.
The implementations are nearly correct, they just require a bit of refinement. Here is a corrected && operator (bonus: also much faster!):
def &&(that: ApproximateBoolean): ApproximateBoolean = {
  if (isTrue && that.isTrue) {
    // We need both to be correct:
    ApproximateBoolean(true, withProb * that.withProb)
  } else if (this.isTrue || that.isTrue) {
    ApproximateBoolean(false, if (!this.isTrue) withProb else that.withProb)
  } else {
    ApproximateBoolean(false, this.withProb + that.withProb - (this.withProb * that.withProb))
  }
}
The || operator should be analogous.
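For completeness, a sketch of the analogous || (mirroring the cases above, as a method on ApproximateBoolean):

def ||(that: ApproximateBoolean): ApproximateBoolean = {
  if (!isTrue && !that.isTrue) {
    // Both claim false; both must be correct for the result to be false:
    ApproximateBoolean(false, withProb * that.withProb)
  } else if (isTrue ^ that.isTrue) {
    // Exactly one claims true; the result is only as good as that claim:
    ApproximateBoolean(true, if (isTrue) withProb else that.withProb)
  } else {
    // Both claim true; the result is true if either is actually true:
    ApproximateBoolean(true, withProb + that.withProb - (withProb * that.withProb))
  }
}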
When adding two CountMinSketch objects, is it intentional that the heavy hitters are duplicated when they're summed using the monoid? The heavyHitters method on CMS is correct because it filters out the duplicates. Here's an example where I add two CMS objects and inspect the heavy hitters internal representation.
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@3219ab8d
scala> monoid.create(Seq(1L, 1L, 1L))
res0: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.create(Seq(1L, 1L, 1L, 1L))
res1: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.plus(res0, res1)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.asInstanceOf[CMSInstance].heavyHitters
res3: Set[Long] = TreeSet(1)
scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,3), HeavyHitter(1,4)))
I would expect that the internal representation of hhs would remove the duplicate value. Since I serialize the data, it becomes a major waste of space to store a bunch of old values. For now, I just remove duplicate 'item' HeavyHitter keys when serializing, but I was wondering if this was intentional, since I can see that being the case for faster summing.
There are often optimizations that can be done if a semigroup/monoid/group is commutative.
How can we best communicate that?
There are two obvious solutions. The first is a property on the instance (.isCommutative). This rules out the ability to have some methods that only apply to commutative objects (because that check is at runtime).
The second, a type-class approach, would be something like:
// use this to check at runtime:
sealed trait CommutativeStatus[O] {
  def commutes: Boolean
}

trait IsNotCommutative[O] extends CommutativeStatus[O] { val commutes = false }

// use this when you want to require commutativity
trait IsCommutative[O] extends CommutativeStatus[O] { val commutes = true }

trait LowP {
  implicit def defaultIsNot[O]: CommutativeStatus[O] = new IsNotCommutative[O] { }
}

object CommutativeStatus extends LowP {
  implicit def commutes[O](implicit isc: IsCommutative[O]): CommutativeStatus[O] = isc
}
We can add instances of IsCommutative to the Monoid/Group/Semigroup objects.
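Usage would then look something like this (IntSum, sumOutOfOrder and checkCommutes are only illustrative names):

import com.twitter.algebird.Monoid

// Tag an instance:
implicit object IntSum extends Monoid[Int] with IsCommutative[Int] {
  def zero = 0
  def plus(l: Int, r: Int) = l + r
}

// Compile-time: only callable when an IsCommutative[T] instance exists.
def sumOutOfOrder[T: Monoid: IsCommutative](items: Seq[T]): T =
  Monoid.sum(scala.util.Random.shuffle(items))

// Run-time: query the status through the low-priority fallback.
def checkCommutes[T](implicit status: CommutativeStatus[T]): Boolean = status.commutes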
protected def mergeLookup[T, U](keys: Iterable[T], lookup: (T) => Option[V], present: (T) => U): Map[U, V] = {
  val kvPairs = keys map { key =>
    val v = lookup(key) getOrElse monoid.zero
    (present(key), v): (U, V)
  }
  MapAlgebra.sum(kvPairs)
}
Use this in CMS.
Use the new VectorSpace to make an analog of DecayedValue, except on a VectorSpace[Double, C] for all container spaces C. This would be of use for IndexedSeq and Map.
I got the following failure, unrelated to any code change that I did:
[error] x HyperLogLog should
[info] + count with 5-bits
[error] x count with 6-bits
[error] 0.3370433439011089 is not less than 0.325 (HyperLogLogTest.scala:56)
Is this intentional?
As it is now, anything that depends on algebird somewhere pulls in algebird-test and, transitively, its dependencies (specs2 and scalacheck) onto the runtime classpath.
For counts-based data, this may be a better approach than the current exponential decay. See http://word.bitly.com/post/41284219720/forget-table
http://en.wikipedia.org/wiki/Fenwick_tree
Use SortedMap[K, V: Group] and make a group on SortedMap[K, V: Group] such that at K you have the sum(k <= K) V_k. Then, add a method on the Group object to get the sum over a range by getting the sum at the lower bound and subtracting that from the value at the upper bound of the range.
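A minimal sketch of the cumulative-map half of this, assuming algebird's Group (a real Fenwick tree would make point updates O(log n); this only shows the range-by-subtraction algebra):

import scala.collection.SortedMap
import com.twitter.algebird.Group

// Turn a map of per-key values into a map of running sums.
def cumulative[K, V](counts: SortedMap[K, V])(implicit g: Group[V]): SortedMap[K, V] =
  counts
    .foldLeft((g.zero, counts.empty)) { case ((running, acc), (k, v)) =>
      val next = g.plus(running, v)
      (next, acc + (k -> next))
    }
    ._2

// Cumulative value at k: the running sum at the largest key <= k (zero if none).
def valueAt[K, V](cum: SortedMap[K, V], k: K)(implicit g: Group[V]): V =
  cum.to(k).lastOption.map(_._2).getOrElse(g.zero)

// Sum over the half-open range (lower, upper]:
def rangeSum[K, V](cum: SortedMap[K, V], lower: K, upper: K)(implicit g: Group[V]): V =
  g.minus(valueAt(cum, upper), valueAt(cum, lower))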
The recent refactoring of HLL does almost exactly what I am contemplating doing with MinHash, except that HLL is Map[Int,Max[Byte]] and I would want Map[Int,Min[Int]]. I'm wondering if we want to generalize to an OptionallySparseVector[V], with an OptionallySparseVectorMonoid[V: Monoid]. (A better name would be appreciated.)
We need to add something like:
object Aggregator {
  def join2[A1, B1, C1, A2, B2, C2](ag1: Aggregator[A1, B1, C1], ag2: Aggregator[A2, B2, C2]):
      Aggregator[(A1, A2), (B1, B2), (C1, C2)] = new Aggregator[(A1, A2), (B1, B2), (C1, C2)] {
    def prepare(a: (A1, A2)) = (ag1.prepare(a._1), ag2.prepare(a._2))
    // etc..., too bad this kind of lifting is not easier to do automatically. I don't see how...
  }
  // and so on for join3, join4, up to join22. Ideally these are all code-gen in scala.
}
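A sketch of how join2 could be completed, assuming an Aggregator with prepare, reduce, and present (the actual trait may expose a Semigroup for the reduce step instead):

def join2[A1, B1, C1, A2, B2, C2](ag1: Aggregator[A1, B1, C1], ag2: Aggregator[A2, B2, C2]):
    Aggregator[(A1, A2), (B1, B2), (C1, C2)] = new Aggregator[(A1, A2), (B1, B2), (C1, C2)] {
  def prepare(a: (A1, A2)) = (ag1.prepare(a._1), ag2.prepare(a._2))
  def reduce(l: (B1, B2), r: (B1, B2)) = (ag1.reduce(l._1, r._1), ag2.reduce(l._2, r._2))
  def present(b: (B1, B2)) = (ag1.present(b._1), ag2.present(b._2))
}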
Here is the idea, two functions:
sealed abstract class EitherAlgebra[+L,+R]
case class LeftAlg[+L](get: L) extends EitherAlgebra[L, Nothing]
case class RightAlg[+R](get: R) extends EitherAlgebra[Nothing,R]
//Then we have a monoid/semigroup on L,R and:
def monoid(shouldConvert: (R) => Boolean, convert: (R) => L): Monoid[EitherAlgebra[L,R]]
Then, shouldConvert(s: Set[T]) = s.size > threshold, and convert(s: Set[T]): HLLInstance = iterate over all the items and insert them.
In this way, we can make a monoid that is exact when size is small, and converts to approximate when we cross some size boundary (say about the size of the serialized hyperloglog).
We can do the same with count-min-sketch and a Map[K,Long]. The pattern is: use the exact data structure while it is smaller than the approximate one, and then switch on demand.
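A rough sketch of the combining logic for such a monoid, assuming semigroups/monoids on L and R (the exact side is R and the approximate side is L, as in the Set/HLL example; the type parameters and implicits below are added for illustration):

import com.twitter.algebird.{ Monoid, Semigroup }

def monoid[L, R](shouldConvert: R => Boolean, convert: R => L)(
    implicit lSemigroup: Semigroup[L], rMonoid: Monoid[R]): Monoid[EitherAlgebra[L, R]] =
  new Monoid[EitherAlgebra[L, R]] {
    val zero: EitherAlgebra[L, R] = RightAlg(rMonoid.zero)
    // Promote the exact representation to the approximate one once it crosses the threshold.
    private def promote(r: R): EitherAlgebra[L, R] =
      if (shouldConvert(r)) LeftAlg(convert(r)) else RightAlg(r)
    def plus(a: EitherAlgebra[L, R], b: EitherAlgebra[L, R]): EitherAlgebra[L, R] = (a, b) match {
      case (RightAlg(x), RightAlg(y)) => promote(rMonoid.plus(x, y))
      case (LeftAlg(x), RightAlg(y))  => LeftAlg(lSemigroup.plus(x, convert(y)))
      case (RightAlg(x), LeftAlg(y))  => LeftAlg(lSemigroup.plus(convert(x), y))
      case (LeftAlg(x), LeftAlg(y))   => LeftAlg(lSemigroup.plus(x, y))
    }
  }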
CountMinSketch should work on any T with a Hashable, or at least T %> Array[Byte] (to use an injection/bijection).
Also, it should be easy to serialize the count-min-sketch (this complicates always including the heavy hitters).
Any last minute (binary compatible) fixes should be submitted in the next 24 hours.
Goal is to publish on Thursday.
http://en.wikipedia.org/wiki/Pseudo-ring#Adjoining_an_identity_element
What we want is something like:
// in the Ring object:
def adjoinOne[T](implicit pseudoring: Ring[T]): Ring[(BigInt, T)] = new Ring[(BigInt, T)] {
  lazy val one = (BigInt(1), pseudoring.zero)
  lazy val zero = (BigInt(0), pseudoring.zero)
  // and follow the rules above.
}
This would allow us to use a Map as a true ring, not a pseudo-ring, which could be useful in Matrix.scala (but honestly, this is mostly a correctness issue, I don't think matrix.scala often assumes ring.one exists).
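A sketch of how those rules could be filled in (the intFold helper is only illustrative and linear-time; repeated doubling would be better):

import com.twitter.algebird.Ring

// Elements are pairs (n, t) in Z x T with
//   (n1, t1) + (n2, t2) = (n1 + n2, t1 + t2)
//   (n1, t1) * (n2, t2) = (n1 * n2, n1.t2 + n2.t1 + t1 * t2)
// where n.t means t added to itself n times (negated when n < 0).
def adjoinOne[T](implicit pseudoring: Ring[T]): Ring[(BigInt, T)] = new Ring[(BigInt, T)] {
  private def intFold(n: BigInt, t: T): T = {
    val pos = (BigInt(1) to n.abs).foldLeft(pseudoring.zero) { (acc, _) => pseudoring.plus(acc, t) }
    if (n < 0) pseudoring.negate(pos) else pos
  }
  val one = (BigInt(1), pseudoring.zero)
  val zero = (BigInt(0), pseudoring.zero)
  def plus(l: (BigInt, T), r: (BigInt, T)) = (l._1 + r._1, pseudoring.plus(l._2, r._2))
  override def negate(v: (BigInt, T)) = (-v._1, pseudoring.negate(v._2))
  def times(l: (BigInt, T), r: (BigInt, T)) =
    (l._1 * r._1,
      pseudoring.plus(
        pseudoring.plus(intFold(l._1, r._2), intFold(r._1, l._2)),
        pseudoring.times(l._2, r._2)))
}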
Algebird is much more than just the Monoid trait now. Isn't it time we added the Cokleisli star operators?
Max, Min, First and Last should be specialized on the usual primitives (which are the salient use cases) in order to minimize (un-)boxing overhead
Cryptographic hashes are often slower than non-cryptographic. And MD5 suffers from known collision problems.
I've been using MurmurHash in my hyperloglog impl with good results. Scala is not my language of choice, but I'd provide a patch if you guys wanted.
Not totally sure about this. I think it is:
trait Countable[T] {
  def next(t: T): Option[T]
  def prev(t: T): Option[T]
}
An infinite T might never return None, but Int, Long, Double, etc. instances would.
Could this be useful inside of Monoid?
import scala.util.Sorting.stableSort
def sortedSum[T: Ordering: Manifest: Monoid](vs: Seq[T]): T = Monoid.sum(stableSort(vs))
https://github.com/sbt/sbt-boilerplate
That might be a cleaner way to generate the tuple/product algebras.