twitter / algebird
Abstract Algebra for Scala
Home Page: https://twitter.github.io/algebird
License: Apache License 2.0
Here is an idea:
// sigil trait
trait CommutativePlus
// sigil trait
trait CommutativeTimes
Then we can mix CommutativePlus into most of the Monoids/Semigroups/etc., and we can use isInstanceOf to test at runtime and type constraints to check at compile time.
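A minimal sketch of how this could look, assuming the sigil traits above and algebird's Monoid (IntAdditionMonoid, sumInAnyOrder and canReorder are only illustrative names):

import com.twitter.algebird.Monoid

// Tag an instance with the sigil:
implicit object IntAdditionMonoid extends Monoid[Int] with CommutativePlus {
  def zero = 0
  def plus(l: Int, r: Int) = l + r
}

// Compile-time: only callable when the monoid is tagged commutative.
def sumInAnyOrder[T](items: Seq[T])(implicit m: Monoid[T] with CommutativePlus): T =
  items.foldLeft(m.zero)(m.plus)

// Run-time: fall back to an isInstanceOf check.
def canReorder[T](m: Monoid[T]): Boolean = m.isInstanceOf[CommutativePlus]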
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> implicit val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@56fb2ac4
scala> val two = monoid.create(Seq(1L, 1L))
two: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
scala> val three = monoid.create(Seq(1L, 1L, 1L))
three: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
scala> two.frequency(1L)
res0: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)
scala> three.frequency(1L)
res1: com.twitter.algebird.Approximate[Long] = Approximate(3,3,3,0.99)
scala> monoid.plus(two, three)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.frequency(1L)
res3: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)
scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,2), HeavyHitter(1,3)))
Summing two and three should give 5 as the frequency, but it's giving 2. Might be related to the duplicated Heavy Hitters issue.
There is something going on with some of the data-structures we have here: BloomFilter is a kind of approximate Set. HyperLogLog is an approximate set with Cardinality. Count-min-sketch is an approximate Map (so Count-min-sketch has an approximate set on the keys).
I think we could define a few traits to represent these concepts.
Something like:
trait ApproximateSet[V] {
def mayContain(v: V): Boolean
def containsFalsePosProb: Double
def containsFalseNegProb: Double
}
In fact, most of these have false neg prob == 0.0, so maybe we should encode that in the type (for cases where you want to be sure you don't miss anything).
trait ApproxAcceptingSet[V] extends ApproximateSet[V] {
final def containsFalseNegProb = 0.0
}
Maybe we want to punt and avoid returning the false-negative probability entirely, and support set union (which these, as monoids, already have):
trait ApproximateSet[V,S<:ApproximateSet[V,S]] {
// if we return false, the item is definitely NOT in this set.
def mayContain(v: V): Boolean
// f-bounded polymorphism to preserve the types:
def union(that: S): S
}
For size, maybe something like:
trait ApproximateCardinality {
def approxSize: Long
}
For Count-min-sketch, it looks like we need something like:
trait ApproximateMap[K,V] {
def approxValue(k: K): Option[V]
}
In this view, I think CMS extends ApproximateMap, BloomFilter extends ApproximateSet, and HyperLogLog extends ApproximateSet with ApproximateCardinality
You can think of Count-min-sketch as an approximate Map[Long,Long].
The algorithm seems to generalize to Map[Long, V : Ordering : Monoid], because we use the count (monoid) and the min (ordering).
If we include a 64-bit hash from K => Long, then we get
SketchMap[K, V](implicit keyhash: Hashable[K,Long], monoidv: Monoid[V], ord: Ordering[V])
and generalize the CMS code in the obvious way.
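A rough sketch of that generalized core (this is not the real CMS code; SketchTable is a hypothetical name, and the hash below is only a placeholder for the pairwise-independent hashes CMS actually uses):

import com.twitter.algebird._

// A depth x width table of V cells: updates sum into one cell per row with the
// Monoid, and queries take the smallest estimate across rows with the Ordering.
case class SketchTable[V](cells: Vector[Vector[V]])(implicit mon: Monoid[V], ord: Ordering[V]) {
  // Placeholder hash; the real thing would use pairwise-independent hash functions.
  private def hash(row: Int, k: Long, width: Int): Int = {
    val h = (k ^ (k >>> 32)).toInt * (2 * row + 1)
    ((h % width) + width) % width
  }

  def updated(k: Long, v: V): SketchTable[V] =
    SketchTable(cells.zipWithIndex.map { case (rowCells, row) =>
      val i = hash(row, k, rowCells.size)
      rowCells.updated(i, mon.plus(rowCells(i), v))
    })

  // Every row over-counts, so report the minimum estimate across rows.
  def estimate(k: Long): V =
    cells.zipWithIndex.map { case (rowCells, row) => rowCells(hash(row, k, rowCells.size)) }.min
}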
Should we remove CMS and move to the above? I don't know.
scala> SummingQueue[Int](0)
java.lang.IllegalArgumentException
at java.util.concurrent.ArrayBlockingQueue.<init>(ArrayBlockingQueue.java:169)
at SummingQueue.<init>(<console>:27)
at SummingQueue$.apply(<console>:21)
at .<init>(<console>:22)
at .<clinit>(<console>)
at .<init>(<console>:11)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:704)
at scala.tools.nsc.interpreter.IMain$Request$$anonfun$14.apply(IMain.scala:920)
at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
at java.lang.Thread.run(Thread.java:680)
Algebird is now much more than just the Monoid trait. Anyone want to take a stab at updating the readme with all the algebird goodness?
This would allow libraries that use Algebird to write scalacheck tests using the laws.
See eg http://www.cse.cuhk.edu.hk/~taoyf/course/5705f10/lec8.pdf for how this works.
I have a basic working implementation of the DyadicRange logic (for positive integer keys, which you can map whatever else to), but it needs to be hooked up to CMS instances (one per level). We also want to implement binary search using this to get quantile queries.
case class DyadicRange(maxValue: Long = Long.MaxValue) {
  val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt

  def indicesForPoint(v: Long) = (1 to levels).map { level => (level, indexForPoint(v, level)) }
  def indicesForRange(start: Long, end: Long): List[(Int, Long)] = indicesForRange(start, end, levels)

  def indexForPoint(v: Long, level: Int) = v >> (level - 1)
  def rangeForIndex(i: Long, level: Int) = (i << (level - 1), ((i + 1) << (level - 1)) - 1)

  def indicesForRange(start: Long, end: Long, maxLevel: Int): List[(Int, Long)] = {
    if (start > end) {
      Nil
    } else if (start == end) {
      List((1, start))
    } else {
      val startIndex = indexForPoint(start, maxLevel)
      val (a, b) = rangeForIndex(startIndex, maxLevel)
      if (a >= start) {
        if (b <= end) {
          List((maxLevel, startIndex)) ++ indicesForRange(start, a - 1, maxLevel) ++ indicesForRange(b + 1, end, maxLevel)
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      } else {
        val (a2, b2) = rangeForIndex(startIndex + 1, maxLevel)
        if (b2 <= end && a2 <= end) {
          List((maxLevel, startIndex + 1)) ++ indicesForRange(start, a2 - 1, maxLevel - 1) ++ indicesForRange(b2 + 1, end, maxLevel)
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      }
    }
  }
}
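For reference, a hypothetical usage sketch (not yet wired to CMS) of how the decomposition would drive one CMS instance per level:

val dr = DyadicRange(1L << 20)
val pointCells = dr.indicesForPoint(7L)      // one (level, index) cell per level, updated on insert
val rangeCells = dr.indicesForRange(3L, 12L) // the cells whose union covers [3, 12]
// A range-count query sums the per-level CMS estimates over rangeCells;
// quantile queries then binary-search on such range counts.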
Useful in cases where we might need very large exact counts.
I'd like to see a monoid for detecting bursts in streams of data:
http://www.cs.cornell.edu/home/kleinber/bhs.pdf
/cc @johnynek
We should implement Q-Digest for efficient quantile estimations - this is probably better than using CMS as in #58. For reference, there's a Java implementation here:
https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/QDigest.java
This module should contain thrift mappings for all Algebird case classes. Out of this we'll publish scrooge and java jars of these structs.
This will be of use in tests and elsewhere.
trait Equatable[T] extends Function2[T, T, Boolean] {
  def apply(left: T, right: T): Boolean
}

object Equatable {
  def apply[T](left: T, right: T)(implicit eq: Equatable[T]): Boolean = eq(left, right)
  def from[T](fn: (T, T) => Boolean) = new Equatable[T] { def apply(l: T, r: T) = fn(l, r) }
  // Make usual ones for numbers and all the containers we use in groups/monoids/etc...
  // Might be useful to have something like:
  def withEps[T: Metric](eps: Double) = from[T] { Metric(_, _) <= eps }
}
Once we pull in a Bijection dependency,
case class InvertibleAffineFunction[R](slope: R, intercept: R)(implicit field: Field[R])
    extends Bijection[R, R] {
  lazy val toFn: Function1[R, R] = { x => this.apply(x) }
  override def apply(x: R): R = field.plus(field.times(slope, x), intercept)
  override def invert(x: R): R = field.minus(field.div(x, slope), field.div(intercept, slope))
  // apply this, then other: other(this(x))
  def andThen(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] =
    InvertibleAffineFunction[R](
      field.times(other.slope, slope),
      field.plus(field.times(other.slope, intercept), other.intercept))
  // apply other, then this: this(other(x))
  def compose(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] = other.andThen(this)
}
http://suereth.blogspot.com/2011/03/annotate-your-type-classes.html
We should put something like that on Semigroup/Monoid/Ring/Field/Metric/VectorSpace.
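Concretely, that would look something like this on (simplified versions of) the existing traits; the message strings are just strawmen:

import scala.annotation.implicitNotFound

@implicitNotFound(msg = "Cannot find Semigroup type class for ${T}")
trait Semigroup[T] {
  def plus(l: T, r: T): T
}

@implicitNotFound(msg = "Cannot find Monoid type class for ${T}")
trait Monoid[T] extends Semigroup[T] {
  def zero: T
}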
Starting here: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Ring.scala#L104
This seems to preclude some stuff using the scalding matrix API, for example templated functions which operate on generic matrices and do multiplication. I guess the reason is that the compiler gives up trying to figure out all possible kinds of values that can be in the matrix and still lead to a valid function.
See http://comonad.com/reader/2008/linear-bloom-filters/:
So to recap, we took a normal (or counted or spectral) Bloom filter, crossbred it with Litwin's linear hash table and found that the mutant offspring is an approximation of a set that is better suited to sharing over the network than either structure alone, with a memory usage profile similar to that of a linear hash table. Interestingly as a side effect you can go one step further and allow for transmission of a requested subset of the exact hashes present in which case we've really only used the Blooms to provide partial information about the underlying linear hash table, which can aid in the subsequent join process.
And yes, they are probably better named something like Bloomed linear hash tables, but that doesn't roll off the tongue.
We should write some reusable code in algebird-test for asserting a scalacheck property that is only true with probability at least P. Then, by repeating the check and applying the Chernoff bound, we can verify the claim in such a way that a correct property fails the test only with very small probability (something like 10^-9), so that the tests will almost always pass.
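A rough sketch of the shape of this, assuming scalacheck's Prop (the bound arithmetic is illustrative, not tuned):

import org.scalacheck.Prop

// Run a check that should succeed with probability >= p `trials` times and require
// at least `minSuccesses` successes, where `minSuccesses` sits far enough below the
// mean (via a Chernoff bound) that a true claim fails only with probability ~1e-9.
def probabilisticProp(p: Double, trials: Int = 1000)(check: => Boolean): Prop = {
  val mean = trials * p
  // Chernoff: P(successes < (1 - eps) * mean) <= exp(-eps^2 * mean / 2); solve for 1e-9.
  val eps = math.sqrt(2.0 * math.log(1e9) / mean)
  val minSuccesses = math.floor((1.0 - eps) * mean).toInt
  val successes = (0 until trials).count(_ => check)
  if (successes >= minSuccesses) Prop.passed else Prop.falsified
}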
http://eng.42go.com/a-simple-time-decaying-approximate-membership-filter/
This stores each item T in each of its hash bucket indices (so it is considerably larger than a normal bloom filter), but it considers something present if t: T exists in one of the hashed locations.
This is interesting because it can only remember recent items, which is often useful for some streaming join, or some decayed approximation algorithm.
Currently a lot of the tests contain code for doing approximate equality of maps, lists, etc. It would be nice to move these to a common trait.
// String/List are instances here
trait LowerBounded[T] {
  def ordering: Ordering[T]
  def min: T
}

trait UpperBounded[T] {
  def ordering: Ordering[T]
  def max: T
}

// primitive types are instances here
trait Bounded[T] {
  def ordering: Ordering[T]
  def min: T
  def max: T
}
// Make an implicit Lower/Upper if you have a Bounded (don't subclass)
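For example, the derivation (rather than subclassing) might look like:

object Bounded {
  implicit def lowerFromBounded[T](implicit b: Bounded[T]): LowerBounded[T] =
    new LowerBounded[T] {
      def ordering = b.ordering
      def min = b.min
    }
  implicit def upperFromBounded[T](implicit b: Bounded[T]): UpperBounded[T] =
    new UpperBounded[T] {
      def ordering = b.ordering
      def max = b.max
    }
}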
As Oscar pointed out in #44, we're starting to get a lot of approximate collections. We tend to implement them right now as returning, say, a Long, and then having various other methods for describing the degree of error. Maybe instead they should be returning something like
trait Approximate[T] {
  // traits can't take a context bound, so carry the Numeric as a member
  implicit def numeric: Numeric[T]
  def estimate: T
  def max: T
  def min: T
  def probWithinBounds: Double
}
It seemed like there was some progress here: #35
Did anything get decided? It would be nice to be able to use other types in bloomfilter (and implement a bloom join).
I'd be happy to help out with it but I don't want to step on anything in progress.
Not trolling. The definitions of Semigroup, Monoid, VectorSpace etc in https://github.com/non/spire are not obviously incompatible with us, and it might be nicer for the broader Scala community if we could unify.
Consider:
ApproximateBoolean(false, 0.6) && ApproximateBoolean(false, 0.4)
This yields ApproximateBoolean(false, 0.6), which is actually inaccurate. The false is correct, but the probability is low. The actual probability should be as follows: (0.6 + 0.4) - (0.6 * 0.4) = 0.76. Which is to say, the probability that the left is false, or the right is false (non-exclusively).
There is a similar problem with the || operation.
The implementations are nearly correct, they just require a bit of refinement. Here is a corrected && operator (bonus: also much faster!):
def &&(that: ApproximateBoolean): ApproximateBoolean = {
  if (isTrue && that.isTrue) {
    // We need both to be correct:
    ApproximateBoolean(true, withProb * that.withProb)
  } else if (this.isTrue || that.isTrue) {
    ApproximateBoolean(false, if (!this.isTrue) withProb else that.withProb)
  } else {
    ApproximateBoolean(false, this.withProb + that.withProb - (this.withProb * that.withProb))
  }
}
The || operator should be analogous.
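For completeness, a sketch of the analogous || (mirroring the cases above, as a method on ApproximateBoolean):

def ||(that: ApproximateBoolean): ApproximateBoolean = {
  if (!isTrue && !that.isTrue) {
    // Both claim false; both must be correct for the result to be false:
    ApproximateBoolean(false, withProb * that.withProb)
  } else if (isTrue ^ that.isTrue) {
    // Exactly one claims true; the result is only as good as that claim:
    ApproximateBoolean(true, if (isTrue) withProb else that.withProb)
  } else {
    // Both claim true; the result is true if either is actually true:
    ApproximateBoolean(true, withProb + that.withProb - (withProb * that.withProb))
  }
}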
When adding two CountMinSketch objects, is it intentional that the heavy hitters are duplicated when they're summed using the monoid? The heavyHitters method on CMS is correct because it filters out the duplicates. Here's an example where I add two CMS objects and inspect the heavy hitters internal representation.
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@3219ab8d
scala> monoid.create(Seq(1L, 1L, 1L))
res0: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.create(Seq(1L, 1L, 1L, 1L))
res1: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.plus(res0, res1)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.asInstanceOf[CMSInstance].heavyHitters
res3: Set[Long] = TreeSet(1)
scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,3), HeavyHitter(1,4)))
I would expect that the internal representation of hhs would remove the duplicate value. Since I serialize the data, it becomes a major waste of space to store a bunch of old values. For now, I just remove duplicate 'item' HeavyHitter keys when serializing, but I was wondering if this was intentional, since I can see that being the case for faster summing.
There are often optimizations that can be done if a semigroup/monoid/group is commutative.
How can we best communicate that?
There are two obvious solutions. The first is a property on the instance (.isCommutative). This rules out the ability to have some methods that only apply to commutative objects (because that check is at runtime).
The second, a type-class approach, would be something like:
// use this to check at runtime:
sealed trait CommutativeStatus[O] {
  def commutes: Boolean
}

trait IsNotCommutative[O] extends CommutativeStatus[O] { val commutes = false }

// use this when you want to require commutativity
trait IsCommutative[O] extends CommutativeStatus[O] { val commutes = true }

trait LowP {
  implicit def defaultIsNot[O]: CommutativeStatus[O] = new IsNotCommutative[O] { }
}

object CommutativeStatus extends LowP {
  implicit def commutes[O](implicit isc: IsCommutative[O]): CommutativeStatus[O] = isc
}
We can add instances of IsCommutative to the Monoid/Group/Semigroup objects.
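Usage would then look something like this (IntSum, sumOutOfOrder and checkCommutes are only illustrative names):

import com.twitter.algebird.Monoid

// Tag an instance:
implicit object IntSum extends Monoid[Int] with IsCommutative[Int] {
  def zero = 0
  def plus(l: Int, r: Int) = l + r
}

// Compile-time: only callable when an IsCommutative[T] instance exists.
def sumOutOfOrder[T: Monoid: IsCommutative](items: Seq[T]): T =
  Monoid.sum(scala.util.Random.shuffle(items))

// Run-time: query the status through the low-priority fallback.
def checkCommutes[T](implicit status: CommutativeStatus[T]): Boolean = status.commutes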
protected def mergeLookup[T, U](keys: Iterable[T], lookup: (T) => Option[V], present: (T) => U): Map[U, V] = {
  val kvPairs = keys map { key =>
    val v = lookup(key) getOrElse monoid.zero
    (present(key), v): (U, V)
  }
  MapAlgebra.sum(kvPairs)
}
Use this in CMS.
Use the new VectorSpace to make an analog of DecayedValue, except on a VectorSpace[Double, C] for all container spaces C. This would be of use for IndexedSeq and Map.
I got the following failure, unrelated to any code change that I did:
[error] x HyperLogLog should
[info] + count with 5-bits
[error] x count with 6-bits
[error] 0.3370433439011089 is not less than 0.325 (HyperLogLogTest.scala:56)
Is this intentional?
As it is now, anything that depends on algebird somewhere pulls in algebird-test and, transitively, its dependencies (specs2 and scalacheck) onto the runtime classpath.
For counts-based data, this may be a better approach than the current exponential decay. See http://word.bitly.com/post/41284219720/forget-table
http://en.wikipedia.org/wiki/Fenwick_tree
Use SortedMap[K, V: Group] and make a group on SortedMap[K, V: Group] such that at K you have the sum(k <= K) V_k. Then, add a method on the Group object to get the sum over a range by getting the sum at the lower bound and subtracting that from the value at the upper bound of the range.
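A minimal sketch of the cumulative-map half of this, assuming algebird's Group (a real Fenwick tree would make point updates O(log n); this only shows the range-by-subtraction algebra):

import scala.collection.SortedMap
import com.twitter.algebird.Group

// Turn a map of per-key values into a map of running sums.
def cumulative[K, V](counts: SortedMap[K, V])(implicit g: Group[V]): SortedMap[K, V] =
  counts
    .foldLeft((g.zero, counts.empty)) { case ((running, acc), (k, v)) =>
      val next = g.plus(running, v)
      (next, acc + (k -> next))
    }
    ._2

// Cumulative value at k: the running sum at the largest key <= k (zero if none).
def valueAt[K, V](cum: SortedMap[K, V], k: K)(implicit g: Group[V]): V =
  cum.to(k).lastOption.map(_._2).getOrElse(g.zero)

// Sum over the half-open range (lower, upper]:
def rangeSum[K, V](cum: SortedMap[K, V], lower: K, upper: K)(implicit g: Group[V]): V =
  g.minus(valueAt(cum, upper), valueAt(cum, lower))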
The recent refactoring of HLL does almost exactly what I am contemplating doing with MinHash, except that HLL is Map[Int,Max[Byte]] and I would want Map[Int,Min[Int]]. I'm wondering if we want to generalize to an OptionallySparseVector[V], with an OptionallySparseVectorMonoid[V: Monoid]. (A better name would be appreciated.)
We need to add something like:
object Aggregator {
  def join2[A1, B1, C1, A2, B2, C2](ag1: Aggregator[A1, B1, C1], ag2: Aggregator[A2, B2, C2]):
      Aggregator[(A1, A2), (B1, B2), (C1, C2)] = new Aggregator[(A1, A2), (B1, B2), (C1, C2)] {
    def prepare(a: (A1, A2)) = (ag1.prepare(a._1), ag2.prepare(a._2))
    // etc..., too bad this kind of lifting is not easier to do automatically. I don't see how...
  }
  // and so on for join3, join4, up to join22. Ideally these are all code-gen in scala.
}
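A sketch of how join2 could be completed, assuming an Aggregator with prepare, reduce, and present (the actual trait may expose a Semigroup for the reduce step instead):

def join2[A1, B1, C1, A2, B2, C2](ag1: Aggregator[A1, B1, C1], ag2: Aggregator[A2, B2, C2]):
    Aggregator[(A1, A2), (B1, B2), (C1, C2)] = new Aggregator[(A1, A2), (B1, B2), (C1, C2)] {
  def prepare(a: (A1, A2)) = (ag1.prepare(a._1), ag2.prepare(a._2))
  def reduce(l: (B1, B2), r: (B1, B2)) = (ag1.reduce(l._1, r._1), ag2.reduce(l._2, r._2))
  def present(b: (B1, B2)) = (ag1.present(b._1), ag2.present(b._2))
}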
Here is the idea, two functions:
sealed abstract class EitherAlgebra[+L,+R]
case class LeftAlg[+L](get: L) extends EitherAlgebra[L, Nothing]
case class RightAlg[+R](get: R) extends EitherAlgebra[Nothing,R]
//Then we have a monoid/semigroup on L,R and:
def monoid(shouldConvert: (R) => Boolean, convert: (R) => L): Monoid[EitherAlgebra[L,R]]
Then, shouldConvert(s: Set[T]) = s.size > threshold, and convert(s: Set[T]): HLLInstance = iterate over all the items and insert them.
In this way, we can make a monoid that is exact when size is small, and converts to approximate when we cross some size boundary (say about the size of the serialized hyperloglog).
We can do the same with count-min-sketch and a Map[K,Long]. The pattern is: use the exact data structure while it is smaller than the approximate one, and then switch on demand.
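A rough sketch of the combining logic for such a monoid, assuming semigroups/monoids on L and R (the exact side is R and the approximate side is L, as in the Set/HLL example; the type parameters and implicits below are added for illustration):

import com.twitter.algebird.{ Monoid, Semigroup }

def monoid[L, R](shouldConvert: R => Boolean, convert: R => L)(
    implicit lSemigroup: Semigroup[L], rMonoid: Monoid[R]): Monoid[EitherAlgebra[L, R]] =
  new Monoid[EitherAlgebra[L, R]] {
    val zero: EitherAlgebra[L, R] = RightAlg(rMonoid.zero)
    // Promote the exact representation to the approximate one once it crosses the threshold.
    private def promote(r: R): EitherAlgebra[L, R] =
      if (shouldConvert(r)) LeftAlg(convert(r)) else RightAlg(r)
    def plus(a: EitherAlgebra[L, R], b: EitherAlgebra[L, R]): EitherAlgebra[L, R] = (a, b) match {
      case (RightAlg(x), RightAlg(y)) => promote(rMonoid.plus(x, y))
      case (LeftAlg(x), RightAlg(y))  => LeftAlg(lSemigroup.plus(x, convert(y)))
      case (RightAlg(x), LeftAlg(y))  => LeftAlg(lSemigroup.plus(convert(x), y))
      case (LeftAlg(x), LeftAlg(y))   => LeftAlg(lSemigroup.plus(x, y))
    }
  }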
CountMinSketch should work on any T with a Hashable, or at least T %> Array[Byte] (to use an injection/bijection).
Also, it should be easy to serialize the count-min-sketch (this complicates always including the heavy hitters).
Any last minute (binary compatible) fixes should be submitted in the next 24 hours.
Goal is to publish on Thursday.
http://en.wikipedia.org/wiki/Pseudo-ring#Adjoining_an_identity_element
What we want is something like:
// in the Ring object:
def adjoinOne[T](implicit pseudoring: Ring[T]): Ring[(BigInt, T)] = new Ring[(BigInt, T)] {
  lazy val one = (BigInt(1), pseudoring.zero)
  lazy val zero = (BigInt(0), pseudoring.zero)
  // and follow the rules above.
}
This would allow us to use a Map as a true ring, not a pseudo-ring, which could be useful in Matrix.scala (but honestly, this is mostly a correctness issue, I don't think matrix.scala often assumes ring.one exists).
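A sketch of how those rules could be filled in (the intFold helper is only illustrative and linear-time; repeated doubling would be better):

import com.twitter.algebird.Ring

// Elements are pairs (n, t) in Z x T with
//   (n1, t1) + (n2, t2) = (n1 + n2, t1 + t2)
//   (n1, t1) * (n2, t2) = (n1 * n2, n1.t2 + n2.t1 + t1 * t2)
// where n.t means t added to itself n times (negated when n < 0).
def adjoinOne[T](implicit pseudoring: Ring[T]): Ring[(BigInt, T)] = new Ring[(BigInt, T)] {
  private def intFold(n: BigInt, t: T): T = {
    val pos = (BigInt(1) to n.abs).foldLeft(pseudoring.zero) { (acc, _) => pseudoring.plus(acc, t) }
    if (n < 0) pseudoring.negate(pos) else pos
  }
  val one = (BigInt(1), pseudoring.zero)
  val zero = (BigInt(0), pseudoring.zero)
  def plus(l: (BigInt, T), r: (BigInt, T)) = (l._1 + r._1, pseudoring.plus(l._2, r._2))
  override def negate(v: (BigInt, T)) = (-v._1, pseudoring.negate(v._2))
  def times(l: (BigInt, T), r: (BigInt, T)) =
    (l._1 * r._1,
      pseudoring.plus(
        pseudoring.plus(intFold(l._1, r._2), intFold(r._1, l._2)),
        pseudoring.times(l._2, r._2)))
}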
Algebird is much more than just the Monoid trait now. Isn't it time we added the Cokleisli star operators?
Max, Min, First and Last should be specialized on the usual primitives (which are the salient use cases) in order to minimize (un-)boxing overhead
Cryptographic hashes are often slower than non-cryptographic. And MD5 suffers from known collision problems.
I've been using MurmurHash in my hyperloglog impl with good results. Scala is not my language of choice, but I'd provide a patch if you guys wanted.
Not totally sure about this. I think it is:
trait Countable[T] {
  def next(t: T): Option[T]
  def prev(t: T): Option[T]
}
An infinite T might never return None, but Int, Long, Double, etc. instances would.
Could this be useful inside of Monoid?
import scala.util.Sorting.stableSort
def sortedSum[T: Ordering: Manifest: Monoid](vs: Seq[T]): T = Monoid.sum(stableSort(vs))
https://github.com/sbt/sbt-boilerplate
That might be a cleaner way to generate the tuple/product algebras.