
OnlineStats.jl's Introduction

Online Algorithms for Statistics, Models, and Big Data Viz

  • ⚡ High-performance single-pass algorithms for statistics and data viz.
  • ➕ Updated one observation at a time.
  • ✅ Algorithms use O(1) memory.
  • 📈 Perfect for streaming and big data.


🚀 Quickstart

import Pkg

Pkg.add("OnlineStats")

using OnlineStats

# Create several statistics
o = Series(Mean(), Variance(), Extrema())

# Update with single data point
fit!(o, 1.0)

# Iterate through and update with lots of data
fit!(o, randn(10^6))

# Get the values of the statistics
value(o)  # (value(mean), value(variance), value(extrema))

📖 Documentation


✨ Contributing

  • Pull requests are very welcome!
  • For major changes, you'll probably want to first discuss them via issue/email/Slack with @joshday.

✏️ Authors

See also the list of contributors to OnlineStats.


OnlineStats.jl's Issues

Inf in Variance value

I don't like the behavior when trying to get a running vector of the variance of a series:

using OnlineStats
v = Variance(ExponentialWeight(10))
x = sin.(1:100) .* rand(100)
xvar = Float64[(fit!(v, xi); std(v)) for xi in x]

gives the error:

LoadError: DomainError:
while loading In[4], in expression starting on line 4

 in std at /home/tom/.julia/v0.4/OnlineStats/src/summary.jl:69
 in anonymous at no file

The value of v is -Inf, but that doesn't feel like what we want. Even if it's "technically correct", it makes working with a Variance really tricky.

v
Out[5]:
▌ Variance 
  ▶     value: -Inf
  ▶      nobs: 1
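
A guard like the following would sidestep the problem until the underlying behavior is settled (safestd is a hypothetical helper, not part of the package):

safestd(v) = nobs(v) < 2 ? 0.0 : std(v)   # report 0.0 until the estimate is based on enough observations
xvar = Float64[(fit!(v, xi); safestd(v)) for xi in x]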

IPCA

Update PCA loading matrix incrementally
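
A minimal sketch of one possible update, Oja's subspace rule (an assumption; the issue doesn't name an algorithm), where U is the d×k loading matrix and η is a learning rate:

# One Oja-style gradient step toward the leading k-dimensional eigenspace.
# U is d×k and x is a single d-dimensional observation.
function oja_update!(U::Matrix{Float64}, x::Vector{Float64}, η::Float64)
    y = U' * x                            # k-dimensional projection of x
    U .+= η .* (x * y' .- U * (y * y'))
    return U   # re-orthonormalize periodically in practice, e.g. via QR
end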

ambiguity warnings

@joshday I assume you're getting these as well? I'm on master.

INFO: Recompiling stale cache file /home/tom/.julia/lib/v0.4/OnlineStats.ji for module OnlineStats.
WARNING: New definition 
    fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with: 
    fit!(OnlineStats.StatLearn, AbstractArray{#T<:Real, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/statlearn.jl:200.
To fix, define 
    fit!(OnlineStats.StatLearn, AbstractArray{#T<:Real, 1}, Integer)
before the new definition.
WARNING: New definition 
    fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with: 
    fit!(OnlineStats.StatLearnSparse, AbstractArray{T<:Any, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/statlearnextensions.jl:39.
To fix, define 
    fit!(OnlineStats.StatLearnSparse, AbstractArray{T<:Any, 1}, Integer)
before the new definition.
WARNING: New definition 
    fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with: 
    fit!(OnlineStats.StatLearnCV, AbstractArray{T<:Any, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/statlearnextensions.jl:80.
To fix, define 
    fit!(OnlineStats.StatLearnCV, AbstractArray{T<:Any, 1}, Integer)
before the new definition.
WARNING: New definition 
    fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with: 
    fit!(OnlineStats.LinReg, AbstractArray{T<:Any, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/linreg.jl:16.
To fix, define 
    fit!(OnlineStats.LinReg, AbstractArray{T<:Any, 1}, Integer)
before the new definition.
WARNING: New definition 
    fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with: 
    fit!(OnlineStats.QuantReg, AbstractArray{#T<:Real, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/quantreg.jl:27.
To fix, define 
    fit!(OnlineStats.QuantReg, AbstractArray{#T<:Real, 1}, Integer)
before the new definition.

User-specified weighting

It would be a nice feature to allow users to specify their own weights along with data. This would likely only be a method available for MyStatistic{EqualWeighting} types.

update!(o, y, w) where w is a weight vector (a sketch follows the checklist below).

edit: adding a checklist by directory:

  • distributions
  • glm
  • linearmodel
  • multivariate
  • quantileregression
  • streamstats
  • summary
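
A minimal sketch of what the weighted method could look like for a Mean-like statistic (the field names μ and n, and the use of the smooth helper, are assumptions for illustration):

# Each yᵢ counts as wᵢ observations; γ = wᵢ / total weight so far.
function update!(o::Mean, y::AbstractVector, w::AbstractVector)
    length(y) == length(w) || error("y and w must have equal length")
    for (yi, wi) in zip(y, w)
        o.n += wi
        o.μ = smooth(o.μ, yi, wi / o.n)
    end
    o
end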

Code cleanup

There are a bunch of places with unnecessary parametric types/signatures and wasteful work. For example...
Base.push!{T <: ScalarStat}(df::DataFrame, obj::T) = append!(df, DataFrame(obj))
might be better as:
Base.push!(df::DataFrame, obj::ScalarStat) = push!(df, value(obj))

Note that:

  • No need for a parametric method here: T is never used. It's no slower to give the type as obj::ScalarStat; it still compiles to the same method (i.e. Base.push!(df::DataFrame, obj::Var) = ...). You only want the parametric form if you need T on the right-hand side of the assignment, if you need types to match in the signature (e.g. add{T<:Integer}(x::T, y::T) = ... ensures x and y are the same type of integer), or if you want an abstract type as the parameter of another type (e.g. Base.push!{T<:Integer}(df::DataFrame, obj::Vector{T}) = ... is appropriate).
  • It's inefficient to create a new one-row DataFrame just to merge it into an existing DataFrame. You can push! any iterable of row values to a DataFrame and it will push! each of those values onto the columns (see the example after this list).
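
For example (a sketch; the exact row syntax for push! has varied across DataFrames versions):

using DataFrames
df = DataFrame(mean = Float64[], nobs = Int[])
push!(df, (1.5, 100))   # pushes the row values directly onto the columns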

I'll make changes like these as I see them... If you have any questions on why I made a certain change, don't hesitate to talk it through here.

updatebatch!() for statistics that benefit from batch updates

Some statistics will be much faster with batch updates than with singleton updates. I think we should be explicit about how the update is performed (also because updating by singletons is not equivalent to updating by batch unless EqualWeighting is used).

What is preferred here for batch y?

  • updatebatch!(o, y)
  • update!(o, y, batch::Bool = false)

Generalized regularization?

I'd like to be able to define an arbitrary regularization function beyond the standard L1/L2. For example, it would be great to incorporate the change in beta, lambda * (beta_{t+1} - beta_t), into the minimization, rather than relying on a fixed speed of variation as defined in online flexible least squares. I'd also like this without needing to implement a brand-new specialized OnlineStat. Ideally, lots of these concepts could be a little more plug-and-play (matching a user-defined regularization function with a user-defined link function, for example).

I have hopes that generalized Adagrad as a Weighting may provide this sort of flexibility, so for now I'm leaving this here as a "what if", and we can revisit later.

Reorganization/rewrite

I started toying around with changes to OnlineStats in a separate package, https://github.com/joshday/OnlineStatistics.jl. (It only exists as a separate package to run comparisons between the old and the new; it is not, nor will it be, a replacement.) What was meant to be just a few performance tests turned into a major rewrite (details on the changes below). I've managed some performance improvements (some marginal, some orders of magnitude):

julia> include("test/performance.jl")
WARNING: replacing module Performance

  =======================================
  Performance on 10 million observations
  =======================================

                Mean new :  0.040254 seconds (5 allocations: 192 bytes)
                Mean old :  0.042746 seconds (5 allocations: 192 bytes)

        Mean (batch) new :  0.005795 seconds (6 allocations: 224 bytes)
        Mean (batch) old :  0.005977 seconds (6 allocations: 224 bytes)

            Variance new :  0.042452 seconds (5 allocations: 208 bytes)
            Variance old :  0.060829 seconds (5 allocations: 192 bytes)

             Extrema new :  0.049575 seconds (4 allocations: 160 bytes)
             Extrema old :  0.051199 seconds (4 allocations: 160 bytes)

         QuantileSGD new :  0.618786 seconds (8 allocations: 448 bytes)
         QuantileSGD old :  1.510063 seconds (94 allocations: 76.310 MB, 0.25% gc time)

          QuantileMM new :  0.772390 seconds (10 allocations: 656 bytes)
          QuantileMM old :  1.720187 seconds (96 allocations: 76.310 MB, 1.33% gc time)

             Moments new :  0.099977 seconds (6 allocations: 288 bytes)
             Moments old :  0.071233 seconds (5 allocations: 208 bytes)


  ============================================
  Performance on .2 million × 500 observations
  ============================================

               Means new :  0.010688 seconds (17 allocations: 2.469 KB)
        Means old (VERY SLOW) :

       Means (batch) new :  0.009797 seconds (16 allocations: 1.578 KB)
       Means (batch) old :  0.009701 seconds (17 allocations: 1.609 KB)

           Variances new :  0.041235 seconds (40 allocations: 7.766 KB)
    Variances old (VERY SLOW) :

   Variances (batch) new :  0.042492 seconds (37 allocations: 5.109 KB)
   Variances (batch) old :  0.037979 seconds (38 allocations: 5.156 KB)

           CovMatrix new :  2.819448 seconds (200.00 k allocations: 9.155 MB)
    CovMatrix old (VERY SLOW) :

   CovMatrix (batch) new :  0.072263 seconds (23 allocations: 237.063 KB)
   CovMatrix (batch) old :  0.065741 seconds (19 allocations: 80.625 KB)



  ===========================================
  Performance on 1 million × 5 design matrix
  ===========================================

              LinReg new :  0.033935 seconds (35 allocations: 45.779 MB, 9.72% gc time)
              LinReg old :  0.053115 seconds (37 allocations: 45.779 MB, 43.14% gc time)
           SparseReg old :  0.092843 seconds (33 allocations: 45.779 MB, 71.34% gc time)

Changes:

  • remove state(o) and statenames(o), replace with value(o)
    • value(o) returns only the statistic; nonessential information (e.g. a Vector of quantiles) should show up in Base.show methods
  • Change Weighting to Weight, EqualWeighting to EqualWeight, etc.
  • change update! to StatsBase.fit!
  • faster and easier-to-understand sweep! operator, with an additional method that takes a placeholder vector to avoid GC. sweep! has also been changed to store the upper triangular matrix rather than the lower.
  • Rather than let each Distribution have its own type, fitting distributions can all be done through FitDistribution and FitMvDistribution types
  • Remove SparseReg; its functionality (coefficients from penalized likelihood) is now handled by LinReg
  • cleanup
    • I had too many files floating around. For example, everything in summary/ is now in one file, summary.jl.
  • rename StochasticModel to StatLearn. I think hinting at statistical learning is better than "algorithms based on a stochastic subgradient". If anyone has a better name for something that incorporates SVMs and linear, logistic, Poisson, Huber, L1-loss, and quantile regression, I'm all ears.
  • probably other things I forgot

I wanted to get this out in the open before I start moving things over to OnlineStats. The biggest-impact change is how weightings are handled.

Ensure consistent interface

I've noticed in one or two places that the update! function is defined inconsistently... sometimes it's update!(o, x, y) and sometimes update!(o, y, x). There might be similar inconsistencies for other methods... need to check. This may all be my fault, but we should pick one convention and make sure everything conforms.

Lasso with sparse solution

L1Penalty() is currently functional only for SGD. It also typically sets coefficients "close" to zero rather than producing a truly sparse solution.

Given the usefulness of the Lasso, this needs to be improved.
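
For reference, the soft-thresholding (proximal) step that produces exact zeros, as a sketch rather than current package code:

# Proximal map of λ‖β‖₁: coefficients within λ of zero become exactly zero.
soft_threshold(βj::Float64, λ::Float64) = sign(βj) * max(abs(βj) - λ, 0.0)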

switch testing to FactCheck.jl?

Pros:

  • Nicer looking tests/reports
  • Can have groupings to easily see sections that are failing
  • Testing will continue beyond a failed test (for example, I was changing some tests from @test to @test_approx_eq one at a time, because any tests after the failed one were not getting run)

Cons:

  • Might want to convert existing tests (annoying coding, but we don't have to switch everything if we don't want to)

You can see how to use it on the github page, but as examples:
@test x == y converts to @fact x => y
@test_approx_eq x y converts to @fact x => roughly(y)

I'll use FactCheck for new stuff that I'm working on, and you can make the judgement call on whether you want to switch to it.

High level model builder

Sometimes it can be a daunting task to select the appropriate model for a given dataset. It would be great to provide a helper framework (possibly a separate package or business) that could help choose/setup the best model given high-level information about the data set:

  • Dimensionality?
  • Time-varying/Ordered Data?
  • Expected multicollinearity? (we can check this ourselves, of course)
  • Classification? 2 classes or more than 2?
  • Regression? Nonlinear?
  • etc

I've seen many graphics which essentially create a decision tree given lots of high-level information about a data problem, and point you at the right solution type (linear regression vs logistic regression vs dimensionality reduction vs SVM vs random forests vs ???). I think this is something that could be implemented alongside an ensemble framework which could choose lots of candidate models for you and drop/average/vote on the best predictions. In the online setting, ensembles could be relatively cheap, even for large datasets (especially if the online algorithm allows for parallel fitting)

(It's conceivable that this could be a SaaS business in its own right... high level online data science platform built on top of OnlineStats and OnlineAI)

Reactive framework

It would be cool to have some helper functions (or macros?) to set up a Reactive pipeline for OnlineStats. Here's a toy example of what it might do under the hood. I set up two OnlineStats, a Diff and a Mean, with the goal of reproducing the equivalent of mean(diff(x)) when x is the full data vector. liftdiff is of type Reactive.Lift{OnlineStats.Diff{Float64}}. Every time you push! to input, Reactive calls the function you passed and places the return value inside the Lift object. Then you can chain the first Lift to another one that updates an OnlineStats.Mean object. One call to push! on the Reactive.Input triggers the necessary updates and stores the return values.

julia> using OnlineStats, Reactive

julia> input = Input(0.0);

julia> d = Diff(); m = Mean();

julia> liftdiff = lift(x -> (update!(d, x); d), input);

julia> liftdiff
Online OnlineStats.Diff{Float64}
 *     diff:  0.000000

 *     last:  0.000000

 *     nobs:  1

julia> typeof(liftdiff)
Reactive.Lift{OnlineStats.Diff{Float64}}

julia> liftmean = lift(d -> (if nobs(d) > 1; update!(m, diff(d)); end; mean(m)), liftdiff);

julia> empty!(d); empty!(m);

julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=0.0 last=0.0 nobs=0}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=0.0 nobs=0}

julia> push!(input, 5.0)   # this passes our first data point through the reactive pipeline we set up

julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=0.0 last=5.0 nobs=1}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=0.0 nobs=0}

julia> push!(input, 8.0)

julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=3.0 last=8.0 nobs=2}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=3.0 nobs=1}

julia> push!(input, 3.0)

julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=-5.0 last=3.0 nobs=3}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=-1.0 nobs=2}

julia> @assert value(liftmean) == mean(diff([5.,8.,3.]))  # compare to vector version

Logging

Did log.jl break on 0.3.7? Maybe we can debug together? It's nice to use... you can do:

x = "world"
LOG("Hello", x)  # prints "INFO: Hello world [filename:linenumber]"
log_severity(ERROR)
LOG("Hello", x)  # doesn't print... severity not high enough
LOG(ERROR, "Hi") # prints "ERROR: Hi [filename:linenumber]"

log_severity(INFO)
y = 5
@LOG x y   # prints "INFO: x: world y: 5 [filename:linenumber]"

I'll probably add @DEBUG and @ERROR macros too.

This is cool so you can add verbose logging to your stuff and turn it on/off for different runs (using log_severity()) without reloading.

fix 0.4 deprecation warnings

There are a bunch; here are some examples:

WARNING: [a,b,...] concatenation is deprecated; use [a;b;...] instead
 in depwarn at ./deprecated.jl:62
 in oldstyle_vcat_warning at ./abstractarray.jl:29
 in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/analyticalpca_test.jl:24
 in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:133
 in include at ./boot.jl:254
 in include_from_node1 at loading.jl:133
 in process_options at ./client.jl:306
 in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/analyticalpca_test.jl, in expression starting on line 5
WARNING: float64(x) is deprecated, use Float64(x) instead.
 in depwarn at ./deprecated.jl:62
 in float64 at deprecated.jl:49
 in _kmeans! at /home/tom/.julia/v0.4/Clustering/src/kmeans.jl:156
 in kmeans! at /home/tom/.julia/v0.4/Clustering/src/kmeans.jl:37
 in kmeans at /home/tom/.julia/v0.4/Clustering/src/kmeans.jl:53
 in emstart at /home/tom/.julia/v0.4/OnlineStats.jl/src/distributions/offlinenormalmix.jl:19
 in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/normalmix_test.jl:13
 in context at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:341
 in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/normalmix_test.jl:8
 in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:133
 in include at ./boot.jl:254
 in include_from_node1 at loading.jl:133
 in process_options at ./client.jl:306
 in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/normalmix_test.jl, in expression starting on line 7
WARNING: int(x::AbstractArray) is deprecated, use round(Int64,x) instead.
 in depwarn at ./deprecated.jl:62
 in int at deprecated.jl:49
 in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:17
 in context at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:341
 in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:9
 in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:133
 in include at ./boot.jl:254
 in include_from_node1 at loading.jl:133
 in process_options at ./client.jl:306
 in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl, in expression starting on line 8
WARNING: [a] concatenation is deprecated; use collect(a) instead
 in depwarn at ./deprecated.jl:62
 in oldstyle_vcat_warning at ./abstractarray.jl:29
 in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:15
 in context at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:341
 in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:9
 in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:133
 in include at ./boot.jl:254
 in include_from_node1 at loading.jl:133
 in process_options at ./client.jl:306
 in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl, in expression starting on line 8

Plots dependency

cc: @johnmyleswhite @petercolberg

Continuing the comments from METADATA... I agree; I also prefer to keep dependencies small. A core part of the design of Plots is to be lightweight with minimal dependencies so that it can be a requirement for packages like this one (Colors.jl is the only additional dependency). You don't need to install Gtk or Cairo, etc., to be able to load the package.

That said... I'd like to hear other ideas for how to structure the organization of statistical plots. Should there be a StatPlots.jl package with plotting recipes for common needs? Of course, then you may run into the reverse problem... all I wanted was a trace plot of a linear regression, and now I have dependencies on lots of statistics packages.

What's the cleanest separation?

Regression should accept vectors

I was surprised that this wasn't allowed:


julia> using OnlineStats

julia> LinReg(randn(100),randn(100))
ERROR: MethodError: `convert` has no method matching convert(::Type{OnlineStats.LinReg{W<:OnlineStats.Weighting}}, ::Array{Float64,1}, ::Array{Float64,1})
This may have arisen from a call to the constructor OnlineStats.LinReg{W<:OnlineStats.Weighting}(...),
since type constructors fall back to convert methods.
Closest candidates are:
  OnlineStats.LinReg(::AbstractArray{Float64,2}, ::AbstractArray{Float64,1})
  OnlineStats.LinReg(::AbstractArray{Float64,2}, ::AbstractArray{Float64,1}, ::OnlineStats.Weighting)
  call{T}(::Type{T}, ::Any)
  ...
 in call at essentials.jl:57

julia> LinReg(randn(100,1),randn(100))
OnlineStat: OnlineStats.LinReg{OnlineStats.EqualWeighting}
 *        β:  [0.06140403868184703]
 *     nobs:  100

We should make sure both vectors and matrices are handled as expected.

Need help understanding LinReg

I expect my FLS (flexible least squares) implementation will be similar to your LinReg. Do you have a paper or documentation explaining the method you're using? I don't quite follow what the sweep method is doing, and how that fits in.
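
For reference, here is a minimal sketch of the symmetric sweep operator (following Goodnight's formulation; not necessarily the package's exact code). Sweeping columns 1:p of the augmented matrix [X'X X'y; y'X y'y] leaves (X'X)⁻¹ in the upper-left block, the least-squares coefficients in the last column, and the residual sum of squares in the corner:

function sweep!(A::AbstractMatrix{Float64}, k::Integer)
    n = size(A, 1)
    d = A[k, k]
    for j in 1:n
        A[k, j] /= d                 # scale the pivot row
    end
    for i in 1:n
        i == k && continue
        b = A[i, k]
        for j in 1:n
            A[i, j] -= b * A[k, j]   # eliminate column k from row i
        end
        A[i, k] = -b / d
    end
    A[k, k] = 1 / d
    return A
end

# Usage: sweep on 1:p, then read off the coefficients.
X = randn(100, 3); y = X * [1.0, 2.0, 3.0] + randn(100)
A = [X'*X X'*y; y'*X y'*y]
foreach(k -> sweep!(A, k), 1:3)
β = A[1:3, 4]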

Better update!

update! should be able to do:

  • singleton and minibatch updates
  • support callbacks

onlinefit! will be removed. updatebatch! will no longer be exported. Callbacks aren't incorporated yet. Here's what I have so far:

# b is batch size
function update!(o::OnlineStat, y::Union{AVec, AMat}, b::Integer = 1)
    b = Int(b)
    n = size(y, 1)
    @assert 0 < b <= n "batch size must be positive and no larger than the data size"
    if b == 1
        for i in 1:n
            update!(o, row(y, i))
        end
    else
        i = 1
        while i <= n
            rng = i:min(i + b - 1, n)
            updatebatch!(o, rows(y, rng))
            i += b
        end
    end
end

# If an OnlineStat doesn't have an updatebatch method, just use update
updatebatch!(o::OnlineStat, data...) = update!(o, data...)

Generic types for SGD and variants (averaging, momentum, stochastic proximal gradient)

Implementing these should be easy and useful; just follow the work done with Adagrad.

It took about a minute to add quantile regression to @tbreloff's Adagrad framework, so I'm thinking this is the best approach. These generic types would also unify a few things (LogRegSGD, QuantRegSGD, etc.). A sketch of the momentum update follows the list below.

  • SGD
  • Averaging (somewhat taken care of with work going on in react.jl)
  • Momentum
  • Stochastic Proximal Gradient Descent
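
For concreteness, the momentum variant amounts to a per-observation step like this (a sketch with assumed names, not package code):

# Accumulate a velocity v = ρ*v - η*g, then move β by v.
# g is the stochastic (sub)gradient at the current observation; ρ ∈ [0, 1).
function momentum_step!(β::Vector{Float64}, v::Vector{Float64},
                        g::Vector{Float64}, η::Float64, ρ::Float64)
    @. v = ρ * v - η * g
    @. β += v
    return β
end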

Abstract Type Flexibility

Starting an issue to discuss improving the abstract type structure.

Here are the problems I see with the current structure:

  1. Ambiguity when using Distributions.VariateForm. For estimating a parametric density, say a normal distribution, we have a parameter vector [:μ, :σ]. With the current naming convention, OnlineStats.FitNormal (or similar) would be a subtype of ContinuousMultivariateOnlineStat (we are estimating a vector), which is confusing since we're looking at a ContinuousUnivariate distribution.
  2. What is the type of a Gaussian mixture model? Each component has a mean vector and covariance matrix, which is beyond Matrixvariate. Cubevariate?
  3. I don't see the usefulness of labeling estimates as continuous/discrete.
  4. I foresee two main use cases: one where a user needs to call update! very often (streaming data), and one where a user wants to call update! the fewest times possible while still getting a good estimate (data doesn't fit in RAM). For any given statistical method, there could be two types available. In the case of linear regression, there would be the "slow" analytical version (LinReg) and the "fast" SGD version (FastLinReg). I think it would be beneficial to at least define the class of algorithm being used.

Here are my suggested types:

# OnlineAlgorithm
abstract OnlineAlgorithm
abstract Approximate <: OnlineAlgorithm
abstract Analytical <: OnlineAlgorithm   # Analytical update
abstract SGD <: Approximate              # Stochastic gradient descent
abstract ASGD <: Approximate             # Averaged SGD
abstract SGD2 <: Approximate             # 2nd order SGD
abstract OEM <: Approximate              # Online EM       
abstract OMM <: Approximate              # Online MM

# ParameterForm
abstract ParameterForm
abstract ScalarParameter <: ParameterForm
abstract VectorParameter <: ParameterForm
abstract MatrixParameter <: ParameterForm
abstract OtherParameter  <: ParameterForm

# OnlineStat
abstract OnlineStat{F <: ParameterForm, A <: OnlineAlgorithm}

Expected behavior for state() and statenames() with matrix parameters?

  • What should state() and statenames() return?

I'm thinking state() should return Vector{Any} so that state(o)[someinteger] coincides with the value of statenames(o)[someinteger]

An example is for fitting a multivariate normal distribution where mean(o.c) is a Vector and cov(o.c) is a Matrix.

statenames(o::FitMvNormal) = [:μ, :Σ, :nobs]

state(o::FitMvNormal) = Any[mean(o.c), cov(o.c), o.n]

Adagrad issues

I thought Adagrad needed its own issue. Also, I apparently decided to write a novel.

Here's a bunch of stuff. Sorry for the length:

  • I think the default η = 0.1 is too small and it's converging slowly. The second paper talks about using η = 1.0.
  • I should've noticed this earlier:
    • y is being generated incorrectly and that's probably messing with things.
    • Each y_i should be 1 with probability 1 / (1 + exp(-x'b))

It's less pretty, but it should look something like

probvec = [1 / (1 + exp(-u)) for u in x*β]
y = [Float64(rand(Bernoulli(p))) for p in probvec]

and generating y this way makes the results look much nicer. It still seems a bit sensitive to the choice of η:

# β = collect(1.:p)
julia> o = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), η = 0.1)
Online Adagrad
 *        β:  [0.615126,1.2406,1.83746,2.50282,3.07797,3.68487,4.34718,4.93414,5.57917,6.16753]

 *     nobs:  1000000

julia> o = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), η = 1.0)
Online Adagrad
 *        β:  [1.01088,1.97945,2.99252,4.03143,4.97021,5.86467,6.88853,7.89626,8.86106,9.84362]

 *     nobs:  1000000

julia> o = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), η = 5.0)
Online Adagrad
 *        β:  [1.26286,1.89395,3.08134,4.37615,5.30177,6.22554,7.15833,8.34897,9.29574,10.2929]

 *     nobs:  1000000
  • I decided to try out Adagrad vs. LinReg and this was a big surprise. LinReg is a lot faster and uses a lot less allocated memory. After profiling, it looks like gᵢ = ∇f(o.loss, ε, x[i]) + ∇Ψ(o.reg, o.β, i) is the bottleneck. You've probably figured all this out by the time I finish writing this.
julia> x = randn(n, p);

julia> β = collect(1.:p);

julia> y = x * β + randn(n)*10;

julia> @time Adagrad(x, y)
elapsed time: 2.668726562 seconds (784000544 bytes allocated, 51.37% gc time)
Online Adagrad
 *        β:  [0.9744,1.94384,3.00235,4.01059,4.98608,6.07463,7.03342,8.03884,8.97063,9.99778]

 *     nobs:  1000000



julia> @time LinReg(x, y)
elapsed time: 0.176378306 seconds (176004712 bytes allocated, 46.22% gc time)
Online LinReg{EqualWeighting}
 *        β:  [0.99402,1.9986,3.00373,4.013,5.01179,6.00631,6.99868,8.00633,8.98498,9.99851]

 *     nobs:  1000000

Implement IPLS

Partial Least Squares (PLS) is a technique that effectively combines PCA and regression in the same algorithm. The goal is to find reduced-dimension linear combinations of X (i.e. T = XW) that have high covariance with Y (i.e. regressing Y = beta * T + error).

Reviewing Zeng et al. (2014), "Incremental partial least squares analysis of big streaming data", as a possible algorithm reference.
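
As a reminder of the offline computation being made incremental, here is a one-component PLS sketch (assuming X and y are centered):

using LinearAlgebra

# The first PLS weight vector w maximizes cov(Xw, y); t = Xw is the score,
# and y is regressed on t to get the component's coefficient.
function pls1(X::Matrix{Float64}, y::Vector{Float64})
    w = X' * y
    w /= norm(w)
    t = X * w
    β = dot(t, y) / dot(t, t)
    return w, t, β
end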

Weighting for stochastic methods

Stochastic types use weights of the form nbatches^(-κ) where 0.5 < κ <= 1, which doesn't fit into the current weighting scheme. I need to think a bit more about how to handle this.
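
In other words, something like this (a sketch):

# γ_t = nbatches^(-κ) with κ ∈ (0.5, 1] satisfies the usual
# stochastic-approximation conditions (Σγ = ∞, Σγ² < ∞).
weight(nbatches::Int, κ::Float64 = 0.6) = Float64(nbatches)^-κ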

Error during sweep test

Here's the error:

ERROR: range must be non-empty
 in RandIntGen at random.jl:180
 in anonymous at /home/tom/.julia/v0.3/OnlineStats/test/sweep_test.jl:9
 in facts at /home/tom/.julia/v0.3/FactCheck/src/FactCheck.jl:355
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in include at ./boot.jl:245
 in include_from_node1 at loading.jl:128
 in process_options at ./client.jl:285
 in _start at ./client.jl:354
while loading /home/tom/.julia/v0.3/OnlineStats/test/sweep_test.jl, in expression starting on line 6
while loading /home/tom/.julia/v0.3/OnlineStats/test/runtests.jl, in expression starting on line 20

and the offending code in sweep_test.jl:

n = rand(1:10_000, 1)[1]
p = rand(1:min(n-1, 1000), 1)[1]
A = rand(n, p)

Is there a good reason why you're doing n = rand(...)[1] here? It seems unnecessary and also unstable: rand(1:10_000, 1)[1] is just rand(1:10_000), and whenever n == 1 the range 1:min(n - 1, 1000) is empty, which is exactly the RandIntGen error above.

state() returns the distribution object instead of the params?

It would be a slight cleanup in places, and then you'd automatically have the correct distribution objects with df[:dist]. This isn't important, so you can close the issue immediately if you disagree.

For example you now have:

type FitBernoulli{W <: Weighting} <: OnlineStat
    d::Bernoulli
    p::Float64  # success probability
    n::Int64
    weighting::W
end

function onlinefit{T <: Integer}(::Type{Bernoulli},
                                 y::Vector{T},
                                 wgt::Weighting = default(Weighting))
    o = FitBernoulli(wgt)
    update!(o, y)
    o
end

FitBernoulli{T <: Integer}(y::Vector{T}, wgt::Weighting = default(Weighting)) =
    onlinefit(Bernoulli, y, wgt)

FitBernoulli(wgt::Weighting = default(Weighting)) =
    FitBernoulli(Bernoulli(0), 0., 0, wgt)


#-----------------------------------------------------------------------# state
statenames(o::FitBernoulli) = [:p, :nobs]

state(o::FitBernoulli) = [o.d.p, o.n]


#---------------------------------------------------------------------# update!
function update!(obj::FitBernoulli, y::Integer)
    λ = weight(obj)
    obj.p = smooth(obj.p, @compat(Float64(y)), λ)
    obj.d = Bernoulli(obj.p)
    obj.n += 1
    return
end

But could probably change it to:

type FitBernoulli{W <: Weighting} <: OnlineStat
    d::Bernoulli
    n::Int64
    weighting::W
end

function onlinefit{T <: Integer}(::Type{Bernoulli},
                                 y::Vector{T},
                                 wgt::Weighting = default(Weighting))
    o = FitBernoulli(wgt)
    update!(o, y)
    o
end

FitBernoulli{T <: Integer}(y::Vector{T}, wgt::Weighting = default(Weighting)) =
    onlinefit(Bernoulli, y, wgt)

FitBernoulli(wgt::Weighting = default(Weighting)) =
    FitBernoulli(Bernoulli(0), 0, wgt)


#-----------------------------------------------------------------------# state
statenames(o::FitBernoulli) = [:dist, :nobs]

state(o::FitBernoulli) = [o.d, o.n]


#---------------------------------------------------------------------# update!
function update!(obj::FitBernoulli, y::Integer)
    λ = weight(obj)
    obj.d = Bernoulli(smooth(obj.d.p, @compat(Float64(y)), λ))
    obj.n += 1
    return
end

Allow matrices row or column based

Need to think on this topic a little more... I'd like to be able to support passing in an X matrix as either N×T or T×N, but without annoyance for the user and without coding everything twice.

Overhaul of StatLearn

I plan on making a few changes to StatLearn in the next few weeks. Here's a rough outline of what I'd like to do:

  • The Model, Penalty, and Algorithm types will be imported from https://github.com/joshday/SparseRegression.jl. SparseRegression.jl is essentially the offline version of StatLearn.
  • Accept a grid of regularization parameters. Currently each StatLearn object only works for a single parameter.
  • General code cleanup

Another change to SGD/Momentum/Adagrad

I touched a lot of files here, so I wanted to make an issue of it.

I added two fields to all three types: β0::Float64 (the intercept/bias term) and intercept::Bool (whether β0 should be updated). intercept = true by default.

This solves two things:

  • Adds an intercept without making a copy. BiasVector/Matrix is awesome, but there's no rowvec_view() method, so Adagrad(BiasMatrix(x),y) throws an error. There may be a way to fix this.
  • The intercept term will no longer be affected by the penalty (as it shouldn't).

Moving window online stats

It would be nice if these online statistics could return the statistic over the last X samples rather than since the start. For example, a moving average.

Documentation/Wiki

What's your final intention for documentation? Should we try to use the doc"""...""" notation that Base is moving toward? Let's pool those thoughts here...

Move away from ArrayViews

We can replace it with slice, which produces a SubArray. ArrayViews is apparently obsolete at this point.

use AbstractArray

Consider changing VecF/MatF to:

typealias VecF AbstractVector{Float64}
typealias MatF AbstractMatrix{Float64}

and use views freely to avoid temporary copies. If we need a copy, require an explicit call to copy.

edit: adding a checklist by directory:

  • distributions
  • glm
  • linearmodel
  • multivariate
  • quantileregression
  • streamstats
  • summary

Window functions?

This departs only slightly from the spirit of OnlineStats, but sometimes it's useful to have statistics on a rolling window (mean/variance/high/low over the last 30 periods, for example). We could require QuickStructs.jl (one of my packages) and use its CircularBuffer to store the required history, or there's probably another implementation elsewhere if you don't want to add a dependency. This may be too big a change... need to think on it more.

Separately, I also have a RollingWindow type in QuickStructs that will hold a window of data given pre-specified lags, and could make various time-series modelling a little nicer (ARMA, VARMA, VECM, etc)
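
A minimal sketch of what a windowed statistic could look like (CircularBuffer here is assumed from DataStructures.jl; QuickStructs' version would work the same way):

using DataStructures

# Moving mean over the last k observations; push! on a full CircularBuffer
# overwrites the oldest value.
struct MovingMean
    buf::CircularBuffer{Float64}
end
MovingMean(k::Int) = MovingMean(CircularBuffer{Float64}(k))

function update!(o::MovingMean, y::Real)
    push!(o.buf, y)
    return o
end

value(o::MovingMean) = isempty(o.buf) ? NaN : sum(o.buf) / length(o.buf)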

Can fit return something useful?

If I want to create a trace vector of fitted values, I have to do something like:

Float64[(fit!(o, yi); value(o)) for yi in y]

But if fit! returned the OnlineStat object, then I could do something like:

Float64[fit!(o,yi).value for yi in y]
# or
map(yi -> value(fit!(o,yi)), y)

I don't think it should return the value itself, because many times you'll want to do something else with the object.

There is precedent for this sort of thing... for example, push! returns the object you are pushing to.

Remove dependency on DataFrames?

I'm concerned that the methods that depend on DataFrames (the tracedata stuff) aren't particularly useful to people who aren't me (I'm using it to generate plots for my dissertation).

I think tracedata can/should be replaced with a function that generates Vector{T <: OnlineStat} instead of a DataFrame.

Any objections?

Function Names

While changes are being made, let's nail down some function names.

  • nobs(obj): return the number of observations used
  • state(obj): create a DataFrame in the form of DataFrames.melt
  • addstate!(df, obj): append state(obj) to new row(s) in df
  • param(obj): return Vector{Symbol} of parameter names
  • value(obj): return simplest form of current estimate
  • tracedata(Stat, y, b): return DataFrame for making traceplot for statistic Stat on data y using batch size b
  • onlinefit(Distribution, y): create object for online estimation of the parametric density Distribution using first batch y

Plotting methods

Continued from JuliaPlots/Plots.jl#30

Since Plots.jl exists, we can now implement plotting methods with less worry about a user's preferred plotting package.

In the josh branch, I currently have a new traceplot!(o, b, data...; f) function, which plots x = nobs(o), y = f(o) for every b observations. f defaults to v -> state(v)[1], but any function that operates on the OnlineStat and returns a Real scalar or Vector should work.

Edit: an example:

x = randn(10_000, 10)
beta = collect(1:10)
y = x*beta + randn(10_000)

o = SGModel(10)
traceplot!(o, 100, x, y)
