joshday / OnlineStats.jl
⚡ Single-pass algorithms for statistics
Home Page: https://joshday.github.io/OnlineStats.jl/latest/
License: MIT License
To prepare for adding OnlineStats into JuliaStats, we should add the following functionality from StreamStats:
Implementing these things should be easy and also useful; just follow the work done with Adagrad.
It took about a minute to add quantile regression to @tbreloff 's Adagrad framework, so I'm thinking this is the best approach. These generic types would also unify a few things (LogRegSGD, QuantRegSGD, etc.).
Partial Least Squares (PLS) is a technique which effectively combines PCA and regression in the same algorithm. The goal is to find reduced-dimension linear combinations of X (i.e. T = XW) that have high covariance with Y (i.e. regressing Y = beta * T + error).
Reviewing Zeng et al (2014): "Incremental partial least squares analysis of big streaming data" as a possible algorithm reference.
Need to think on this topic a little more... I'd like to be able to support passing in an X matrix either NxT or TxN, but without annoyance for the user, and without re-coding everything twice.
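One possible way to support both layouts without writing everything twice is to dispatch on a tiny orientation type. This is purely a sketch with hypothetical names in current Julia syntax, not an existing OnlineStats API:

```julia
# Hypothetical orientation tags: dispatch picks the right slicing,
# so the per-observation code is written only once.
abstract type Orientation end
struct RowsAreObs <: Orientation end   # T×N: each row is one observation
struct ColsAreObs <: Orientation end   # N×T: each column is one observation

obs(X, ::RowsAreObs, i) = view(X, i, :)
obs(X, ::ColsAreObs, i) = view(X, :, i)
numobs(X, ::RowsAreObs) = size(X, 1)
numobs(X, ::ColsAreObs) = size(X, 2)

# Example consumer: mean of each observation vector, layout-agnostic
obsmeans(X, o::Orientation) =
    [sum(obs(X, o, i)) / length(obs(X, o, i)) for i in 1:numobs(X, o)]

X = [1.0 2.0; 3.0 4.0]
obsmeans(X, RowsAreObs())   # → [1.5, 3.5]
obsmeans(X, ColsAreObs())   # → [2.0, 3.0]
```

The user pays one extra argument (or a keyword default) instead of a transposed copy.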
This departs only slightly from the spirit of OnlineStats, but sometimes it's useful to have statistics on a rolling window (mean/variance/high/low over the last 30 periods, for example). We could require QuickStructs.jl (one of my packages) to use the CircularBuffer to store the required history, or there's probably another implementation elsewhere if you don't want to add a dependency. This may be too big a change... need to think on it more.
Separately, I also have a RollingWindow type in QuickStructs that will hold a window of data given pre-specified lags, and could make various time-series modelling a little nicer (ARMA, VARMA, VECM, etc.).
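For the rolling-window idea itself, a minimal sketch of a fixed-window mean on top of a hand-rolled circular buffer might look like the following. This is illustrative only, not the QuickStructs.jl CircularBuffer API:

```julia
# Fixed-window mean: a circular buffer evicts the oldest value once full.
mutable struct RollingMean
    buf::Vector{Float64}
    idx::Int        # next slot to overwrite
    n::Int          # valid entries so far (≤ window length)
    total::Float64  # running sum of the window
end
RollingMean(window::Int) = RollingMean(zeros(window), 1, 0, 0.0)

function update!(o::RollingMean, y::Real)
    w = length(o.buf)
    if o.n < w
        o.n += 1
    else
        o.total -= o.buf[o.idx]   # evict the oldest observation
    end
    o.buf[o.idx] = y
    o.total += y
    o.idx = o.idx == w ? 1 : o.idx + 1
    o.total / o.n                 # mean over the current window
end

o = RollingMean(3)
update!(o, 1.0)   # → 1.0
update!(o, 2.0)   # → 1.5
update!(o, 3.0)   # → 2.0
update!(o, 4.0)   # → 3.0 (window now holds 2, 3, 4)
```

The same buffer would support high/low/variance over the window; only the per-eviction bookkeeping changes.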
It would be a nice feature to allow users to specify their own weights along with data. This would likely only be a method available for MyStatistic{EqualWeighting} types:
update!(o, y, w)
where w is a weight vector.
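As a sketch of what that method could compute, here is a weighted running mean where each observation carries its own weight. The type and field names are hypothetical, not the package's implementation:

```julia
# Weighted running mean: each yᵢ contributes with user-supplied weight wᵢ.
mutable struct WeightedMean
    μ::Float64
    W::Float64   # total weight observed so far
end
WeightedMean() = WeightedMean(0.0, 0.0)

function update!(o::WeightedMean, y::AbstractVector, w::AbstractVector)
    length(y) == length(w) || throw(DimensionMismatch("y and w lengths differ"))
    for (yi, wi) in zip(y, w)
        o.W += wi
        o.μ += (wi / o.W) * (yi - o.μ)   # incremental weighted-mean update
    end
    o
end

o = update!(WeightedMean(), [1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
o.μ   # → 2.25, i.e. (1 + 2 + 2*3) / 4
```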
edit: adding a checklist by directory:
I'm concerned that the methods that depend on DataFrames (the tracedata stuff) aren't particularly useful to people who aren't me (I'm using it to generate plots for my dissertation).
I think tracedata can/should be replaced with a function that generates a Vector{T <: OnlineStat} instead of a DataFrame.
Any objections?
I've noticed in one or two places the update! function is defined differently... sometimes it's update!(o, x, y) and sometimes update!(o, y, x). There might be similar things for other methods... need to check. This all may be my fault, but we should pick one and make sure everything conforms.
Here's the error:
ERROR: range must be non-empty
in RandIntGen at random.jl:180
in anonymous at /home/tom/.julia/v0.3/OnlineStats/test/sweep_test.jl:9
in facts at /home/tom/.julia/v0.3/FactCheck/src/FactCheck.jl:355
in include at ./boot.jl:245
in include_from_node1 at ./loading.jl:128
in include at ./boot.jl:245
in include_from_node1 at loading.jl:128
in process_options at ./client.jl:285
in _start at ./client.jl:354
while loading /home/tom/.julia/v0.3/OnlineStats/test/sweep_test.jl, in expression starting on line 6
while loading /home/tom/.julia/v0.3/OnlineStats/test/runtests.jl, in expression starting on line 20
and the offending code in sweep_test.jl:
n = rand(1:10_000, 1)[1]
p = rand(1:min(n-1, 1000), 1)[1]
A = rand(n, p)
Is there a good reason why you're doing the n = rand(...)[1] here? Seems unnecessary and also unstable...
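For what it's worth, rand with a range argument already returns a scalar, so the one-element vector is avoidable. Starting the range at 2 also keeps 1:min(n-1, 1000) from being empty when n == 1, which is the likely source of the "range must be non-empty" error above. A suggested sketch, not the current test code:

```julia
# rand(range) draws a single scalar; rand(range, 1)[1] builds a throwaway vector.
n = rand(2:10_000)            # n ≥ 2, so the range below is never empty
p = rand(1:min(n - 1, 1000))
A = rand(n, p)                # n×p matrix of U(0,1) draws
```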
I started working on generating documentation from the OnlineStats doc strings.
This:
https://github.com/joshday/OnlineStats.jl/blob/dev/doc/knit.jl
generates this:
https://github.com/joshday/OnlineStats.jl/blob/dev/doc/api.md
There's clearly a lot of stuff that needs to be documented better.
If I want to create a trace vector of fitted values, I have to do something like:
Float64[(fit!(o, yi); value(o)) for yi in y]
But if fit! returned the OnlineStat object then I can do something like:
Float64[fit!(o,yi).value for yi in y]
# or
map(yi -> value(fit!(o,yi)), y)
I don't think it should return the value itself, because many times you'll want to do something else with the object.
There is precedent for this sort of thing... for example push! will return the object you are pushing to.
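A minimal sketch of the proposal, using a hypothetical RunningMean type; only the return-the-object convention matters here:

```julia
# fit! returns the stat itself, so calls can be chained like push!.
mutable struct RunningMean
    μ::Float64
    n::Int
end
RunningMean() = RunningMean(0.0, 0)
value(o::RunningMean) = o.μ

function fit!(o::RunningMean, y::Real)
    o.n += 1
    o.μ += (y - o.μ) / o.n
    return o              # the object, not the value
end

o = RunningMean()
trace = map(yi -> value(fit!(o, yi)), [1.0, 2.0, 3.0])   # → [1.0, 1.5, 2.0]
```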
I touched a lot of files here, so I wanted to make an issue of it.
I added two fields for all three types: β0::Float64 (intercept/bias term) and intercept::Bool (whether β0 needs to be updated). intercept = true by default.
This solves two things:
- BiasVector/Matrix is awesome, but there's no rowvec_view() method, so Adagrad(BiasMatrix(x), y) throws an error. There may be a way to fix this.
- Stochastic types use weights of the form nbatches ^ -const where const > .5 && const <= 1, which doesn't fit into the current weighting scheme. I need to think a bit more about how to handle this.
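That stochastic weighting could be wrapped in its own weight type. A sketch under the stated constraint 0.5 < const ≤ 1, with hypothetical names rather than the package's Weighting hierarchy:

```julia
# Weight of the t-th batch update is nbatches^(-r), with 0.5 < r ≤ 1.
struct LearningRate
    r::Float64
    function LearningRate(r::Real)
        0.5 < r <= 1 || throw(ArgumentError("need 0.5 < r <= 1"))
        new(Float64(r))
    end
end
weight(w::LearningRate, nbatches::Integer) = Float64(nbatches)^(-w.r)

weight(LearningRate(1.0), 4)   # → 0.25
weight(LearningRate(0.6), 1)   # → 1.0
```

The existing weighting scheme would then only need to accept a weight that depends on the update count rather than on nobs.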
Consider changing VecF/MatF to:
typealias VecF AbstractVector{Float64}
typealias MatF AbstractMatrix{Float64}
and use views freely to avoid temporary copies. If we need a copy, require an explicit call to copy.
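In post-0.6 Julia the same idea reads as follows (const replaces the old typealias; the rowvec helper is illustrative, not a package function):

```julia
# Abstract aliases: accept any Float64 vector/matrix, including SubArrays.
const VecF = AbstractVector{Float64}
const MatF = AbstractMatrix{Float64}

rowvec(A::MatF, i::Integer) = view(A, i, :)   # a view: no copy is made

A = [1.0 2.0; 3.0 4.0]
r = rowvec(A, 2)
r[1] = 99.0        # mutates A too, since r aliases A's storage
A[2, 1]            # → 99.0
```

The aliasing is exactly why an explicit copy should be required whenever ownership matters.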
edit: adding a checklist by directory:
Implement interface of: https://github.com/JuliaDB/DataStreams.jl
Continued from JuliaPlots/Plots.jl#30
Since Plots.jl exists, we can now implement plotting methods with less worry about a user's preferred plotting package.
In the josh branch, I currently have a new traceplot!(o, b, data...; f) function which plots the value (x = nobs(o), y = f(o)) for every b observations. f defaults to v -> state(v)[1], but any function that works on the OnlineStat and returns a Real scalar or Vector should work.
Edit for example:
x = randn(10_000, 10)
beta = collect(1:10)
y = x*beta + randn(10_000)
o = SGModel(10)
traceplot!(o, 100, x, y)
Would be a slight cleanup in places, plus then you automatically have the correct distribution objects with df[:dist]. This isn't important so you can close the issue immediately if you disagree.
For example you now have:
type FitBernoulli{W <: Weighting} <: OnlineStat
    d::Bernoulli
    p::Float64  # success probability
    n::Int64
    weighting::W
end

function onlinefit{T <: Integer}(::Type{Bernoulli},
                                 y::Vector{T},
                                 wgt::Weighting = default(Weighting))
    o = FitBernoulli(wgt)
    update!(o, y)
    o
end

FitBernoulli{T <: Integer}(y::Vector{T}, wgt::Weighting = default(Weighting)) =
    onlinefit(Bernoulli, y, wgt)

FitBernoulli(wgt::Weighting = default(Weighting)) =
    FitBernoulli(Bernoulli(0), 0., 0, wgt)

#-----------------------------------------------------------------------# state
statenames(o::FitBernoulli) = [:p, :nobs]
state(o::FitBernoulli) = [o.d.p, o.n]

#---------------------------------------------------------------------# update!
function update!(obj::FitBernoulli, y::Integer)
    λ = weight(obj)
    obj.p = smooth(obj.p, @compat(Float64(y)), λ)
    obj.d = Bernoulli(obj.p)
    obj.n += 1
    return
end
But could probably change it to:
type FitBernoulli{W <: Weighting} <: OnlineStat
    d::Bernoulli
    n::Int64
    weighting::W
end

function onlinefit{T <: Integer}(::Type{Bernoulli},
                                 y::Vector{T},
                                 wgt::Weighting = default(Weighting))
    o = FitBernoulli(wgt)
    update!(o, y)
    o
end

FitBernoulli{T <: Integer}(y::Vector{T}, wgt::Weighting = default(Weighting)) =
    onlinefit(Bernoulli, y, wgt)

FitBernoulli(wgt::Weighting = default(Weighting)) =
    FitBernoulli(Bernoulli(0), 0, wgt)

#-----------------------------------------------------------------------# state
statenames(o::FitBernoulli) = [:dist, :nobs]
state(o::FitBernoulli) = [o.d, o.n]

#---------------------------------------------------------------------# update!
function update!(obj::FitBernoulli, y::Integer)
    λ = weight(obj)
    obj.d = Bernoulli(smooth(obj.d.p, @compat(Float64(y)), λ))
    obj.n += 1
    return
end
Distribution estimation should have optional Weighting.
Would be really cool to have online (or even batch TMLE) either in the package or interfacing with it.
Intro: http://blog.revolutionanalytics.com/2015/03/
http://www.bu.edu/cphs/files/2013/01/Mark-van-der-Laan-Targeted-Learning-with-Big-Data.pdf
Online targeted learning: http://biostats.bepress.com/ucbbiostat/paper330/
Julia implementation of batch TMLE: https://lendle.github.io/TargetedLearning.jl/
I don't like the behavior when trying to get a running vector of the variance of a series:
using OnlineStats
v = Variance(ExponentialWeight(10))
x = sin(1:100) .* rand(100)
xvar = Float64[(fit!(v, xi); std(v)) for xi in x]
gives the error:
LoadError: DomainError:
while loading In[4], in expression starting on line 4
in std at /home/tom/.julia/v0.4/OnlineStats/src/summary.jl:69
in anonymous at no file
The value of v is -Inf, but that doesn't feel like what we want. Even if it's "technically correct", it makes working with a Variance really tricky.
v
Out[5]:
▌ Variance
▶ value: -Inf
▶ nobs: 1
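One defensive option, sketched here as a suggestion rather than the current std implementation: clamp the variance estimate at zero before the square root, so degenerate early updates give 0.0 instead of a DomainError:

```julia
# Guard against negative or -Inf variance estimates from early or
# exponentially weighted updates before taking the square root.
safe_std(var_estimate::Real) = sqrt(max(var_estimate, 0.0))

safe_std(-Inf)   # → 0.0 (instead of a DomainError)
safe_std(4.0)    # → 2.0
```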
Update PCA loading matrix incrementally
cc: @johnmyleswhite @petercolberg
Continuing the comments from METADATA... I agree that I also prefer to keep dependencies small. A core part of the design of Plots is to be lightweight with minimal dependencies so that it could be a requirement for packages like this (Colors.jl is the only additional dependency). You don't need to install Gtk or Cairo, etc to be able to load the package.
That said... I'd like to hear other ideas for how to structure the organization of statistical plots. Should there be a StatPlots.jl package with plotting recipes for common needs? Of course then you may run into the reverse problem... all I wanted was to trace plot a linear regression, and now I have dependencies on lots of statistics packages.
What's the cleanest separation?
As the title says.
Can replace with slice, which produces a SubArray. ArrayViews is apparently obsolete at this point.
L1Penalty() is currently functional only for SGD. It also typically sets coefficients "close" to zero and doesn't create a sparse solution.
Given the usefulness of the Lasso, this needs to be improved.
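One standard remedy, sketched here for illustration (this is not the package's L1Penalty code): follow each gradient step with the soft-thresholding proximal operator, which snaps small coefficients to exactly zero rather than merely near zero:

```julia
# Proximal (soft-threshold) operator for the L1 penalty.
soft_threshold(β::Real, γ::Real) = sign(β) * max(abs(β) - γ, 0.0)

# One SGD-with-prox update: gradient step, then shrink toward zero.
function prox_step!(β::Vector{Float64}, g::Vector{Float64}, η::Real, λ::Real)
    for j in eachindex(β)
        β[j] = soft_threshold(β[j] - η * g[j], η * λ)
    end
    β
end

β = [1.0, 0.05]
prox_step!(β, [0.0, 0.0], 0.1, 1.0)   # β becomes ≈ [0.9, 0.0]: exact sparsity
```

This is how proximal-gradient methods recover true sparsity from an L1 penalty that plain SGD only approximates.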
@joshday I assume you're getting these as well? I'm on master.
INFO: Recompiling stale cache file /home/tom/.julia/lib/v0.4/OnlineStats.ji for module OnlineStats.
WARNING: New definition
fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with:
fit!(OnlineStats.StatLearn, AbstractArray{#T<:Real, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/statlearn.jl:200.
To fix, define
fit!(OnlineStats.StatLearn, AbstractArray{#T<:Real, 1}, Integer)
before the new definition.
WARNING: New definition
fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with:
fit!(OnlineStats.StatLearnSparse, AbstractArray{T<:Any, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/statlearnextensions.jl:39.
To fix, define
fit!(OnlineStats.StatLearnSparse, AbstractArray{T<:Any, 1}, Integer)
before the new definition.
WARNING: New definition
fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with:
fit!(OnlineStats.StatLearnCV, AbstractArray{T<:Any, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/statlearnextensions.jl:80.
To fix, define
fit!(OnlineStats.StatLearnCV, AbstractArray{T<:Any, 1}, Integer)
before the new definition.
WARNING: New definition
fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with:
fit!(OnlineStats.LinReg, AbstractArray{T<:Any, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/linreg.jl:16.
To fix, define
fit!(OnlineStats.LinReg, AbstractArray{T<:Any, 1}, Integer)
before the new definition.
WARNING: New definition
fit!(OnlineStats.OnlineStat, Union{AbstractArray{T<:Any, 2}, AbstractArray{T<:Any, 1}}, Integer) at /home/tom/.julia/v0.4/OnlineStats/src/OnlineStats.jl:159
is ambiguous with:
fit!(OnlineStats.QuantReg, AbstractArray{#T<:Real, 1}, Real) at /home/tom/.julia/v0.4/OnlineStats/src/modeling/quantreg.jl:27.
To fix, define
fit!(OnlineStats.QuantReg, AbstractArray{#T<:Real, 1}, Integer)
before the new definition.
I expect my FLS (flexible least squares) implementation will be similar to your LinReg. Do you have a paper or documentation explaining the method you're using? I don't quite follow what the sweep method is doing, and how that fits in.
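For reference, here is a self-contained sketch of the sweep operator as it is typically used for regression; this is my own illustrative implementation, not necessarily OnlineStats'. Sweeping the first p pivots of the augmented cross-product matrix [X'X X'y; y'X y'y] leaves the coefficients in the upper-right block and the residual sum of squares in the lower-right corner:

```julia
# Sweep the k-th pivot of a symmetric matrix in place.
function sweep!(A::Matrix{Float64}, k::Int)
    n = size(A, 1)
    d = A[k, k]
    rowk = A[k, :] ./ d                    # scaled copy of the pivot row
    for j in 1:n, i in 1:n
        (i == k || j == k) && continue
        A[i, j] -= A[i, k] * rowk[j]       # rank-one elimination
    end
    for i in 1:n
        i == k && continue
        A[i, k] /= d
        A[k, i] = rowk[i]
    end
    A[k, k] = -1 / d
    return A
end

# Regression by sweeping: y = 2x fits exactly, so β ≈ [0, 2] and RSS ≈ 0.
X  = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]   # intercept column + x
y  = [2.0, 4.0, 6.0, 8.0]
Xy = [X y]
M  = Xy' * Xy                                # [X'X X'y; y'X y'y]
for k in 1:2
    sweep!(M, k)
end
β   = M[1:2, 3]   # ≈ [0.0, 2.0]
rss = M[3, 3]     # ≈ 0.0
```

The appeal for online use is that only the cross-product matrix needs to be maintained; sweeping it yields coefficients on demand.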
There's a bunch, here are some examples:
WARNING: [a,b,...] concatenation is deprecated; use [a;b;...] instead
in depwarn at ./deprecated.jl:62
in oldstyle_vcat_warning at ./abstractarray.jl:29
in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/analyticalpca_test.jl:24
in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
in include at ./boot.jl:254
in include_from_node1 at ./loading.jl:133
in include at ./boot.jl:254
in include_from_node1 at loading.jl:133
in process_options at ./client.jl:306
in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/analyticalpca_test.jl, in expression starting on line 5
WARNING: float64(x) is deprecated, use Float64(x) instead.
in depwarn at ./deprecated.jl:62
in float64 at deprecated.jl:49
in _kmeans! at /home/tom/.julia/v0.4/Clustering/src/kmeans.jl:156
in kmeans! at /home/tom/.julia/v0.4/Clustering/src/kmeans.jl:37
in kmeans at /home/tom/.julia/v0.4/Clustering/src/kmeans.jl:53
in emstart at /home/tom/.julia/v0.4/OnlineStats.jl/src/distributions/offlinenormalmix.jl:19
in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/normalmix_test.jl:13
in context at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:341
in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/normalmix_test.jl:8
in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
in include at ./boot.jl:254
in include_from_node1 at ./loading.jl:133
in include at ./boot.jl:254
in include_from_node1 at loading.jl:133
in process_options at ./client.jl:306
in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/normalmix_test.jl, in expression starting on line 7
WARNING: int(x::AbstractArray) is deprecated, use round(Int64,x) instead.
in depwarn at ./deprecated.jl:62
in int at deprecated.jl:49
in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:17
in context at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:341
in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:9
in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
in include at ./boot.jl:254
in include_from_node1 at ./loading.jl:133
in include at ./boot.jl:254
in include_from_node1 at loading.jl:133
in process_options at ./client.jl:306
in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl, in expression starting on line 8
WARNING: [a] concatenation is deprecated; use collect(a) instead
in depwarn at ./deprecated.jl:62
in oldstyle_vcat_warning at ./abstractarray.jl:29
in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:15
in context at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:341
in anonymous at /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl:9
in facts at /home/tom/.julia/v0.4/FactCheck/src/FactCheck.jl:315
in include at ./boot.jl:254
in include_from_node1 at ./loading.jl:133
in include at ./boot.jl:254
in include_from_node1 at loading.jl:133
in process_options at ./client.jl:306
in _start at ./client.jl:406
while loading /home/tom/.julia/v0.4/OnlineStats.jl/test/logreg_test.jl, in expression starting on line 8
While changes are being made, let's nail down some function names.
- nobs(obj): return the number of observations used
- state(obj): create a DataFrame in the form of DataFrames.melt
- addstate!(df, obj): append state(obj) to new row(s) in df
- param(obj): return Vector{Symbol} of parameter names
- value(obj): return simplest form of current estimate
- tracedata(Stat, y, b): return DataFrame for making a traceplot for statistic Stat on data y using batch size b
- onlinefit(Distribution, y): create object for online estimation of the parametric density Distribution using first batch y
Sometimes it can be a daunting task to select the appropriate model for a given dataset. It would be great to provide a helper framework (possibly a separate package or business) that could help choose/setup the best model given high-level information about the data set:
I've seen many graphics which essentially create a decision tree given lots of high-level information about a data problem, and point you at the right solution type (linear regression vs logistic regression vs dimensionality reduction vs SVM vs random forests vs ???). I think this is something that could be implemented alongside an ensemble framework which could choose lots of candidate models for you and drop/average/vote on the best predictions. In the online setting, ensembles could be relatively cheap, even for large datasets (especially if the online algorithm allows for parallel fitting)
(It's conceivable that this could be a SaaS business in its own right... high level online data science platform built on top of OnlineStats and OnlineAI)
I was surprised that this wasn't allowed:
julia> using OnlineStats
julia> LinReg(randn(100),randn(100))
ERROR: MethodError: `convert` has no method matching convert(::Type{OnlineStats.LinReg{W<:OnlineStats.Weighting}}, ::Array{Float64,1}, ::Array{Float64,1})
This may have arisen from a call to the constructor OnlineStats.LinReg{W<:OnlineStats.Weighting}(...),
since type constructors fall back to convert methods.
Closest candidates are:
OnlineStats.LinReg(::AbstractArray{Float64,2}, ::AbstractArray{Float64,1})
OnlineStats.LinReg(::AbstractArray{Float64,2}, ::AbstractArray{Float64,1}, ::OnlineStats.Weighting)
call{T}(::Type{T}, ::Any)
...
in call at essentials.jl:57
julia> LinReg(randn(100,1),randn(100))
OnlineStat: OnlineStats.LinReg{OnlineStats.EqualWeighting}
* β: [0.06140403868184703]
* nobs: 100
We should make sure both vectors and matrices are handled as expected.
Pros:
- Tests don't stop at the first failure (I was changing @test to @test_approx_eq one at a time, because any tests after the failed one were not getting run)
Cons:
You can see how to use it on the github page, but as examples:
- @test x == y converts to @fact x => y
- @test_approx_eq x y converts to @fact x => roughly(y)
I'll use FactCheck for new stuff that I'm working on, and you can make the judgement call on whether you want to switch to it.
There's a bunch of places where there's unnecessary parametric types/signatures and wasteful work. For example...
Base.push!{T <: ScalarStat}(df::DataFrame, obj::T) = append!(df, DataFrame(obj))
might be better as:
Base.push!(df::DataFrame, obj::ScalarStat) = push!(df, value(obj))
Note that:
- With the abstract signature obj::ScalarStat, it's still compiled to the same method (i.e. Base.push!(df::DataFrame, obj::Var) = ...). You only want the parametric version if you need T on the RHS of the assignment, if you need types to match in the signature (i.e. add{T<:Integer}(x::T, y::T) = ... will make sure x and y are the same type of integer), or if you want an abstract type as the parameter of another type (i.e. Base.push!{T<:Integer}(df::DataFrame, obj::Vector{T}) = ... is appropriate).
- You can push!() any iterable of row values to a DataFrame and it will just push!() each of those values to the columns.
I'll make changes like these as I see them... If you have any questions on why I made a certain change, don't hesitate to talk it through here.
Some statistics will be much faster in batch updates instead of singletons. I think we should be explicit in how the update is performed (also because updating by singletons is not equivalent to updating by batch unless it uses EqualWeighting).
What is preferred here for batch y?
updatebatch!(o, y)
update!(o, y, batch::Bool = false)
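A toy illustration of the non-equivalence claim, using a generic smoothing helper rather than package code: with a fixed exponential weight λ, smoothing once per observation does not equal a single smooth against the batch mean.

```julia
# m ← m + λ(y - m): the standard exponential smoothing step.
smooth(m, y, λ) = m + λ * (y - m)

λ = 0.1
y = [1.0, 2.0, 3.0]

# Singleton updates: apply the smoothing step once per observation.
m_singleton = foldl((m, yi) -> smooth(m, yi, λ), y; init = 0.0)   # 0.561

# Batch update: one smoothing step against the batch mean.
m_batch = smooth(0.0, sum(y) / length(y), λ)                      # 0.2

m_singleton == m_batch   # → false
```

With EqualWeighting the two orderings coincide, which is why only that case can treat them interchangeably.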
@tbreloff Any arguments for or against switching to use https://github.com/kmsquire/Logging.jl? Logging in OnlineStats is undocumented/untested. It'd be nice to shift that responsibility to a maintained package.
I'd like to be able to arbitrarily define a regularization function beyond the standard L1/L2. For example it would be great to incorporate the change in beta: {lambda * (beta_t+1 - beta_t)} in the minimization, and not simply rely on the fixed speed of variation such as defined in online flexible least squares. Also, I'd like this without needing to implement a brand new specialized OnlineStat. Ideally lots of these concepts could be a little more plug and play (match user-defined regularization function with user defined link function, for example)
I have hopes that generalized Adagrad as a Weighting may provide this sort of flexibility, so for now I'm leaving this here as a "what if", and we can revisit later.
modeled on http://www2.econ.iastate.edu/tesfatsi/FLSTemporalDataMining.GMontana2009.pdf
I plan on making a few changes to StatLearn in the next few weeks. Here's a rough outline of what I'd like to do:
Model, Penalty, and Algorithm types will be imported from https://github.com/joshday/SparseRegression.jl. SparseRegression.jl is essentially the offline version of StatLearn.

It would be cool to have some helper functions (or macros?) to set up a Reactive pipeline for OnlineStats. Here's a toy example of what it may do under the hood. I set up 2 OnlineStats: a Diff and a Mean, with the goal of reproducing the equivalent of mean(diff(x)) when x is the full data vector. liftdiff is of type Reactive.Lift{OnlineStats.Diff{Float64}}. Every time you push! to input, Reactive will take care of calling the function you pass it and place the return value inside the Lift object. Then you can chain the first Lift to another one that updates an OnlineStats.Mean object. One call to push! on the Reactive.Input will trigger the necessary updates and store the return values.
julia> using OnlineStats, Reactive
julia> input = Input(0.0);
julia> d = Diff(); m = Mean();
julia> liftdiff = lift(x -> (update!(d, x); d), input);
julia> liftdiff
Online OnlineStats.Diff{Float64}
* diff: 0.000000
* last: 0.000000
* nobs: 1
julia> typeof(liftdiff)
Reactive.Lift{OnlineStats.Diff{Float64}}
julia> liftmean = lift(d -> (if nobs(d) > 1; update!(m, diff(d)); end; mean(m)), liftdiff);
julia> empty!(d); empty!(m);
julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=0.0 last=0.0 nobs=0}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=0.0 nobs=0}
julia> push!(input, 5.0) # this passes our first data point through the reactive pipeline we set up
julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=0.0 last=5.0 nobs=1}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=0.0 nobs=0}
julia> push!(input, 8.0)
julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=3.0 last=8.0 nobs=2}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=3.0 nobs=1}
julia> push!(input, 3.0)
julia> println(d); println(m)
OnlineStats.Diff{Float64}{diff=-5.0 last=3.0 nobs=3}
OnlineStats.Mean{OnlineStats.EqualWeighting}{μ=-1.0 nobs=2}
julia> @assert value(liftmean) == mean(diff([5.,8.,3.])) # compare to vector version
It would be nice if these online statistics could return the statistic over the last X samples rather than since the start. For example, a moving average.
update! should be able to handle both singleton and batch updates. onlinefit! will be removed. updatebatch! will no longer be exported. Callbacks aren't incorporated yet. Here's what I got so far:
# b is batch size
function update!(o::OnlineStat, y::Union{AVec, AMat}, b::Integer = 1)
    b = Int(b)
    n = size(y, 1)
    @assert 0 < b <= n "batch size must be positive and no greater than data size"
    if b == 1
        for i in 1:n
            update!(o, row(y, i))
        end
    else
        i = 1
        while i <= n
            rng = i:min(i + b - 1, n)
            updatebatch!(o, rows(y, rng))
            i += b
        end
    end
end

# If an OnlineStat doesn't have an updatebatch! method, just use update!
updatebatch!(o::OnlineStat, data...) = update!(o, data...)
I started toying around with changes to OnlineStats in a separate package, https://github.com/joshday/OnlineStatistics.jl (It only exists as a separate package to run comparisons between the old and the new. It is not, nor will it be, a replacement). What was meant to be just a few performance tests turned into a major rewrite (details on changes below). I've managed some performance improvements (some marginal, some orders of magnitude):
julia> include("test/performance.jl")
WARNING: replacing module Performance
=======================================
Performance on 10 million observations
=======================================
Mean new : 0.040254 seconds (5 allocations: 192 bytes)
Mean old : 0.042746 seconds (5 allocations: 192 bytes)
Mean (batch) new : 0.005795 seconds (6 allocations: 224 bytes)
Mean (batch) old : 0.005977 seconds (6 allocations: 224 bytes)
Variance new : 0.042452 seconds (5 allocations: 208 bytes)
Variance old : 0.060829 seconds (5 allocations: 192 bytes)
Extrema new : 0.049575 seconds (4 allocations: 160 bytes)
Extrema old : 0.051199 seconds (4 allocations: 160 bytes)
QuantileSGD new : 0.618786 seconds (8 allocations: 448 bytes)
QuantileSGD old : 1.510063 seconds (94 allocations: 76.310 MB, 0.25% gc time)
QuantileMM new : 0.772390 seconds (10 allocations: 656 bytes)
QuantileMM old : 1.720187 seconds (96 allocations: 76.310 MB, 1.33% gc time)
Moments new : 0.099977 seconds (6 allocations: 288 bytes)
Moments old : 0.071233 seconds (5 allocations: 208 bytes)
============================================
Performance on .2 million × 500 observations
============================================
Means new : 0.010688 seconds (17 allocations: 2.469 KB)
Means old (VERY SLOW) :
Means (batch) new : 0.009797 seconds (16 allocations: 1.578 KB)
Means (batch) old : 0.009701 seconds (17 allocations: 1.609 KB)
Variances new : 0.041235 seconds (40 allocations: 7.766 KB)
Variances old (VERY SLOW) :
Variances (batch) new : 0.042492 seconds (37 allocations: 5.109 KB)
Variances (batch) old : 0.037979 seconds (38 allocations: 5.156 KB)
CovMatrix new : 2.819448 seconds (200.00 k allocations: 9.155 MB)
CovMatrix old (VERY SLOW) :
CovMatrix (batch) new : 0.072263 seconds (23 allocations: 237.063 KB)
CovMatrix (batch) old : 0.065741 seconds (19 allocations: 80.625 KB)
===========================================
Performance on 1 million × 5 design matrix
===========================================
LinReg new : 0.033935 seconds (35 allocations: 45.779 MB, 9.72% gc time)
LinReg old : 0.053115 seconds (37 allocations: 45.779 MB, 43.14% gc time)
SparseReg old : 0.092843 seconds (33 allocations: 45.779 MB, 71.34% gc time)
- Remove state(o) and statenames(o); replace with value(o). value(o) returns only the statistic; nonessential information (ex: Vector of quantiles) should show up in Base.show methods.
- Rename Weighting to Weight, EqualWeighting to EqualWeight, etc.
- Add nup (number of updates) to each OnlineStat, which is useful for LearningRate or similar stochastic weighting mechanisms to be added in the future. weight!(o, n2) updates the n and nup fields and returns the weight. Take a look at https://github.com/joshday/OnlineStatistics.jl/blob/master/src/summary.jl to see how this changes update! methods.
- Change update! to StatsBase.fit!
- Give the sweep! operator an additional method with a placeholder vector to avoid gc. sweep! has also been changed to store the upper triangular matrix, rather than lower.
- Add FitDistribution and FitMvDistribution types.
- Remove SparseReg; its functionality (coefficients from penalized likelihood) is now handled by LinReg.
- Everything in summary/ is now in one file, summary.jl.
- Rename StochasticModel to StatLearn. I think hinting at statistical learning is better than "algorithms based on a stochastic subgradient". If anyone has a better name for something that incorporates SVMs and linear, logistic, poisson, huber, l1 loss, and quantile regression, I'm all ears.

I wanted to get this out in the open before I start moving things over to OnlineStats. The biggest impact change is how weightings are handled.
Did log.jl break on 0.3.7? Maybe we can debug together? It's nice to use... you can do:
x = "world"
LOG("Hello", x) # prints "INFO: Hello world [filename:linenumber]"
log_severity(ERROR)
LOG("Hello", x) # doesn't print... severity not high enough
LOG(ERROR, "Hi") # prints "ERROR: Hi [filename:linenumber]"
log_severity(INFO)
y = 5
@LOG x y # prints "INFO: x: world y: 5 [filename:linenumber]"
I'll probably add @DEBUG and @ERROR macros too.
This is cool so you can add verbose logging to your stuff and turn it on/off for different runs (using log_severity()) without reloading.
What's your final intention for documentation? Should we try to use the doc"""...""" notation that they're moving towards in Base? Let's pool those thoughts here...
Starting an issue to discuss improving the abstract type structure.
Here are the problems I see with the current structure:
- The naming convention borrows from Distributions.VariateForm. For estimating a parametric density, say a normal distribution, we have a parameter vector [:μ, :σ]. With the current naming convention, OnlineStats.FitNormal (or similar) would be a subtype of ContinuousMultivariateOnlineStat (we are estimating a vector), which is confusing since we're looking at a ContinuousUnivariate distribution.
- There's nothing beyond Matrixvariate. Cubevariate?
- One use case is a user who calls update! very often (streaming data). The other is a user who wants to call update! the fewest number of times possible to get a good estimate (data doesn't fit in RAM). For any given statistical method, there could be two types available. In the case of linear regression, there would be the "slow" (LinReg) analytical version and the "fast" (FastLinReg) SGD version. I think it would be beneficial to at least define the class of algorithm that is being used.

Here are my suggested types:
# OnlineAlgorithm
abstract OnlineAlgorithm
abstract Approximate <: OnlineAlgorithm
abstract Analytical <: OnlineAlgorithm # Analytical update
abstract SGD <: Approximate # Stochastic gradient descent
abstract ASGD <: Approximate # Averaged SGD
abstract SGD2 <: Approximate # 2nd order SGD
abstract OEM <: Approximate # Online EM
abstract OMM <: Approximate # Online MM
# ParameterForm
abstract ParameterForm
abstract ScalarParameter <: ParameterForm
abstract VectorParameter <: ParameterForm
abstract MatrixParameter <: ParameterForm
abstract OtherParameter <: ParameterForm
# OnlineStat
abstract OnlineStat{F <: ParameterForm, A <: OnlineAlgorithm}
I thought Adagrad needed its own issue. Also, I apparently decided to write a novel. Here's a bunch of stuff. Sorry for the length:
- y is being generated incorrectly and that's probably messing with things. The success probabilities should be 1 / (1 + exp(-x'b)). It's less pretty, but it should look something like
probvec = [1 / (1 + exp(-u)) for u in x*β]
y = [Float64(rand(Bernoulli(p))) for p in probvec]
which makes things look nicer. It still seems a bit sensitive to the choice of η:
# β = collect(1.:p)
julia> o = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), η = 0.1)
Online Adagrad
* β: [0.615126,1.2406,1.83746,2.50282,3.07797,3.68487,4.34718,4.93414,5.57917,6.16753]
* nobs: 1000000
julia> o = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), η = 1.0)
Online Adagrad
* β: [1.01088,1.97945,2.99252,4.03143,4.97021,5.86467,6.88853,7.89626,8.86106,9.84362]
* nobs: 1000000
julia> o = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), η = 5.0)
Online Adagrad
* β: [1.26286,1.89395,3.08134,4.37615,5.30177,6.22554,7.15833,8.34897,9.29574,10.2929]
* nobs: 1000000
- gᵢ = ∇f(o.loss, ε, x[i]) + ∇Ψ(o.reg, o.β, i) is the bottleneck. You've probably figured all this out by the time I finish writing this.
julia> x = randn(n, p);
julia> β = collect(1.:p);
julia> y = x * β + randn(n)*10;
julia> @time Adagrad(x, y)
elapsed time: 2.668726562 seconds (784000544 bytes allocated, 51.37% gc time)
Online Adagrad
* β: [0.9744,1.94384,3.00235,4.01059,4.98608,6.07463,7.03342,8.03884,8.97063,9.99778]
* nobs: 1000000
julia> @time LinReg(x, y)
elapsed time: 0.176378306 seconds (176004712 bytes allocated, 46.22% gc time)
Online LinReg{EqualWeighting}
* β: [0.99402,1.9986,3.00373,4.013,5.01179,6.00631,6.99868,8.00633,8.98498,9.99851]
* nobs: 1000000
- What should state() and statenames() return? I'm thinking state() should return Vector{Any} so that state(o)[someinteger] coincides with the value of statenames(o)[someinteger]. An example is fitting a multivariate normal distribution, where mean(o.c) is a Vector and cov(o.c) is a Matrix.
statenames(o::FitMvNormal) = [:μ, :Σ, :nobs]
state(o::FitMvNormal) = Any[mean(o.c), cov(o.c), o.n]