
onlinestatsbase.jl's Introduction



OnlineStatsBase

This package defines the basic types and interface for OnlineStats.



Interface

Required

  • _fit!(stat, y): Update the "sufficient statistics" of the estimator from a single observation y.

Required (with Defaults)

  • value(stat, args...; kw...) = <first field of struct>: Calculate the value of the estimator from the "sufficient statistics".
  • nobs(stat) = stat.n: Return the number of observations.

Optional

  • _merge!(stat1, stat2): Merge stat2 into stat1 (an error by default in OnlineStatsBase versions >= 1.5).
  • Base.empty!(stat): Return the stat to its initial state (an error by default).



Example

  • Make a subtype of OnlineStat and give it a _fit!(::OnlineStat{T}, y::T) method.
  • T is the type of a single observation. Make sure it's adequately wide.
using OnlineStatsBase

mutable struct MyMean <: OnlineStat{Number}
    value::Float64
    n::Int
    MyMean() = new(0.0, 0)
end
function OnlineStatsBase._fit!(o::MyMean, y)
    o.n += 1
    o.value += (1 / o.n) * (y - o.value)
end



That's all there is to it!

y = randn(1000)

o = fit!(MyMean(), y)
# MyMean: n=1_000 | value=0.0530535
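The optional methods listed under Interface can be sketched in the same way. This is a minimal sketch, not code from the package: MergeableMean is a hypothetical name (chosen so that MyMean is not redefined), and the merge rule is the usual weighted combination of two running means.

```julia
using OnlineStatsBase

# Hypothetical stat for illustration only.
mutable struct MergeableMean <: OnlineStat{Number}
    value::Float64
    n::Int
    MergeableMean() = new(0.0, 0)
end

# Required: update the running mean from a single observation.
function OnlineStatsBase._fit!(o::MergeableMean, y)
    o.n += 1
    o.value += (1 / o.n) * (y - o.value)
end

# Optional: fold the observations summarized by `b` into `a`.
function OnlineStatsBase._merge!(a::MergeableMean, b::MergeableMean)
    a.n += b.n
    a.value += (b.n / a.n) * (b.value - a.value)
    return a
end

# Optional: return the stat to its initial state.
function Base.empty!(o::MergeableMean)
    o.value = 0.0
    o.n = 0
    return o
end

y = randn(1000)
a = fit!(MergeableMean(), y[1:500])
b = fit!(MergeableMean(), y[501:end])
merge!(a, b)  # `a` now summarizes all 1000 observations
```

Merging two halves should agree with fitting the whole vector, up to floating-point rounding.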

onlinestatsbase.jl's People

Contributors

brucala, felixcremer, github-actions[bot], grahamas, joshday, juliatagbot, krynju, tkf


onlinestatsbase.jl's Issues

Extrema() only available for Date time types

extrema_init(T::Type{Date}) = typemax(Date), typemin(Date), Union{Date, Dates.AbstractDateTime}

As I understand from the code above, it looks like Extrema() only works for Date time types. Is there a reason for not extending it to a generic TimeType?
For instance, by replacing the code with something like:

extrema_init(T::Type{<:TimeType}) = typemax(T), typemin(T), TimeType
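To check that the proposed signature is workable, here is a minimal sketch that depends only on the Dates standard library. The names extrema_init_sketch and running_extrema are hypothetical, and the sketch assumes typemax and typemin are defined for the TimeType subtype in use (they are for Date, DateTime, and Time).

```julia
using Dates

# Hypothetical generalization of the initializer quoted above:
# start the minimum at typemax(T) and the maximum at typemin(T).
extrema_init_sketch(::Type{T}) where {T<:TimeType} = typemax(T), typemin(T), TimeType

# Running min/max over a vector, starting from the sentinels.
function running_extrema(xs::AbstractVector{T}) where {T<:TimeType}
    lo, hi, _ = extrema_init_sketch(T)
    for x in xs
        lo = min(lo, x)
        hi = max(hi, x)
    end
    return lo, hi
end

dates = [Date(2020, 5, 1), Date(2019, 1, 15), Date(2021, 12, 31)]
times = [Time(9, 30), Time(8, 0), Time(17, 45)]

running_extrema(dates)  # (Date(2019, 1, 15), Date(2021, 12, 31))
running_extrema(times)  # (Time(8, 0), Time(17, 45))
```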

Dispatch issue in 1.9.

julia> ymat = randn(10, 5) ;

1.9:

julia> @which fit!(CovMatrix(), eachrow(ymat))
fit!(o::OnlineStat{T}, yi::T) where T
     @ OnlineStatsBase /OnlineStatsBase.jl/src/OnlineStatsBase.jl:111

1.8:

julia> @which fit!(CovMatrix(), eachrow(ymat))
fit!(o::OnlineStat{I}, y::T) where {I, T}
    in OnlineStatsBase at OnlineStatsBase.jl/src/OnlineStatsBase.jl:133

This causes the tests to fail.

I think the difference is that the result of eachrow is an AbstractVector on 1.9 but not on 1.8.
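The suspected mechanism can be reproduced without the package. The sketch below uses toy stand-ins (ToyStat, ToyCov, and which_path are hypothetical names) that mirror the two method signatures shown by @which. It only illustrates how dispatch changes once the argument is an AbstractVector, and does not claim to match the real type hierarchy.

```julia
# Toy stand-ins for the two fit! signatures shown above.
abstract type ToyStat{T} end
struct ToyCov <: ToyStat{AbstractVector} end

# Mirrors fit!(o::OnlineStat{T}, yi::T) where T
which_path(::ToyStat{T}, ::T) where {T} = :single_observation
# Mirrors fit!(o::OnlineStat{I}, y::T) where {I, T}
which_path(::ToyStat{I}, ::S) where {I, S} = :iterate

ymat = randn(10, 5)
rows = eachrow(ymat)

rows isa AbstractVector                   # result depends on the Julia version
which_path(ToyCov(), rows)                # follows the isa check above
which_path(ToyCov(), (r for r in rows))   # a generator always takes the :iterate path
```

Wrapping the rows in a generator is one possible way to force the iterating method regardless of the Julia version.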

Improve doc ? value of type different from Float64 and from type of single observation value

Hello,

I'm working on https://github.com/femtotrader/IncTA.jl/tree/OnlineStatsBase and I'm looking for a way to pass a candlestick as a single observation value.
My single observation value looks like

struct OHLCV{Ttime,Tprice,Tvol}
    open::Tprice
    high::Tprice
    low::Tprice
    close::Tprice
    volume::Tvol
    time::Ttime

    function OHLCV(
        open::Tprice,
        high::Tprice,
        low::Tprice,
        close::Tprice;
        volume::Tvol = missing,
        time::Ttime = missing,
    ) where {Ttime,Tprice,Tvol}
        new{Ttime,Tprice,Tvol}(open, high, low, close, volume, time)
    end

end

and I'm trying to calculate AccuDist and store the "sufficient statistics" as a value of type Tprice (in this case it's Float64, but in general it can be different).

I did

"""
    AccuDist{T}()

The AccuDist type implements an Accumulation and Distribution indicator.
"""
mutable struct AccuDist{T} <: OnlineStat{T}
    value::Union{Missing,Float64}
    n::Int

    function AccuDist{T}() where {T}
        new{T}(missing, 0)
    end
end

and would like to do something like

mutable struct AccuDist{T, S} <: OnlineStat{T, S}
    value::Union{Missing, S}
    n::Int

    function AccuDist{T, S}() where {T, S}
        new{T, S}(missing, 0)
    end
end

function OnlineStatsBase._fit!(ind::AccuDist, candle::OHLCV)
    ind.n += 1
    if candle.high != candle.low
        # Calculate MFI and MFV
        mfi =
            ((candle.close - candle.low) - (candle.high - candle.close)) /
            (candle.high - candle.low)
        mfv = mfi * candle.volume
    else
        # In case high and low are equal (division by zero), return previous value if exists, otherwise return missing
        ind.value = value(ind)
        return
    end

    if !has_output_value(ind)
        ind.value = mfv
    else
        ind.value = value(ind) + mfv
    end
end

Unfortunately I have to use

value::Union{Missing,Float64}

I'd like to be able to do

value::Union{Missing,Tprice}

How can I subtype OnlineStat with OHLCV{Missing, Float64, Float64} as the single observation type and Tprice as the type of the sufficient statistic?

Maybe the docs can be slightly improved on this point (and I could provide a PR once I understand how to tackle it).

Especially in README example

using OnlineStatsBase

mutable struct MyMean <: OnlineStat{Number}
    value::Float64
    n::Int
    MyMean() = new(0.0, 0)
end
function OnlineStatsBase._fit!(o::MyMean, y)
    o.n += 1
    o.value += (1 / o.n) * (y - o.value)
end

value is of type Float64, but I think it could be more general (the type of y, or another type).

Some help from Julia guys experienced with types and inheritance could greatly help

Kind regards
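One possible approach, sketched under assumptions: OnlineStat takes a single type parameter (the observation type), but the struct itself can carry a second parameter for the stored value. Candle and AccuDistSketch are hypothetical, simplified stand-ins for the types in the question, and the has_output_value helper is replaced by a plain ismissing check.

```julia
using OnlineStatsBase

# Simplified, hypothetical candle type (no time field).
struct Candle{P,V}
    open::P
    high::P
    low::P
    close::P
    volume::V
end

# T is the observation type passed to OnlineStat; S is the type of the stored value.
mutable struct AccuDistSketch{T,S} <: OnlineStat{T}
    value::Union{Missing,S}
    n::Int
    AccuDistSketch{T,S}() where {T,S} = new{T,S}(missing, 0)
end

function OnlineStatsBase._fit!(ind::AccuDistSketch, c)
    ind.n += 1
    # Equal high and low would divide by zero: keep the previous value.
    c.high == c.low && return ind
    mfi = ((c.close - c.low) - (c.high - c.close)) / (c.high - c.low)
    mfv = mfi * c.volume
    ind.value = ismissing(ind.value) ? mfv : ind.value + mfv
    return ind
end

ind = AccuDistSketch{Candle{Float64,Float64},Float64}()
fit!(ind, Candle(10.0, 12.0, 8.0, 11.0, 100.0))  # contributes 0.5 * 100.0
fit!(ind, Candle(11.0, 11.0, 11.0, 11.0, 50.0))  # high == low, value unchanged
value(ind)  # 50.0
```

With Candle{Float32,Float32} and S = Float32 the same struct stores a Float32 value, which is the generality the question asks for.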

Fails to precompile on Julia 1.0

In the file stats.jl you have used a new keyword syntax that is not compatible with Julia 1.0:

CircBuff(b::Int, T = Float64; rev=false) = CircBuff(T, b; rev)

This leads to an error when precompiling the package.
If you change the line to

CircBuff(b::Int, T = Float64; rev=false) = CircBuff(T, b; rev = rev) 

the precompilation works again.
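For context, the `; rev` shorthand (a bare name after the semicolon meaning `rev = rev`) was added in Julia 1.5, which is why the explicit form is needed for older versions. A small standalone illustration, with hypothetical function names inner and outer:

```julia
# `inner` stands in for the method being forwarded to.
inner(x; rev = false) = rev ? reverse(x) : x

# Explicit keyword forwarding: parses on Julia 1.0 and later.
outer(x; rev = false) = inner(x; rev = rev)

# On Julia 1.5 and later, `inner(x; rev)` is shorthand for the line above.
outer([1, 2, 3]; rev = true)  # [3, 2, 1]
```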

Inefficient Mean()?

First of all, thanks for the fantastic OnlineStats package!

I've noticed that Mean() is quite slow when compared to (e.g.) Variance(), which I would expect to be slower. Some timing examples below (only giving time from second runs):

julia> @time reduce(Mean(), table, select=:C14)
  4.406728 seconds (40.45 M allocations: 618.137 MiB, 1.26% gc time)
Mean: n=40428967 | value=18841.8

julia> @time reduce(Variance(), table, select=:C14)
  0.152765 seconds (24.03 k allocations: 1.174 MiB)
Variance: n=40428967 | value=2.45962e7

A more than 10x difference is observed in favor of Variance(). Note: I chose to give the examples based on JuliaDB instead of purely OnlineStats because the times are more striking, but the effect is also noticeable with pure OnlineStats calls.

Inspecting the code for the fit function of Mean():

o.μ = (o.n += 1) == 1 ? x : smooth(o.μ, x, T(o.weight(o.n)))

I believe the inefficiency comes from the condition in the ternary operator, which I believe is unnecessary. Indeed if I modify the code to:

    o.μ = smooth(o.μ, x, T(o.weight(o.n += 1)))

I obtain the following timing measurements:

julia> @time reduce(Mean(), table, select=:C14)
  0.135243 seconds (24.03 k allocations: 1.174 MiB)
Mean: n=40428967 | value=18841.8

Does this make sense? Can it be fixed?

I don't know if the right approach here is for me to open a pull request. Please, let me know if that's the case.
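The claim that the branch is unnecessary can be checked on a toy version. The sketch below assumes smooth(a, b, γ) = a + γ * (b - a) and an equal weight of 1 / n, so the first update has γ = 1 and returns x either way. ToyMean, fit_branch!, and fit_plain! are hypothetical names. For a weight function where the first weight is not 1, the two forms would differ.

```julia
# Assumed definition of the smoothing step.
smooth(a, b, γ) = a + γ * (b - a)

mutable struct ToyMean
    μ::Float64
    n::Int
end

# Form with the ternary, as in the current code.
function fit_branch!(o::ToyMean, x)
    o.μ = (o.n += 1) == 1 ? x : smooth(o.μ, x, 1 / o.n)
    return o
end

# Form without the branch, as proposed.
function fit_plain!(o::ToyMean, x)
    o.μ = smooth(o.μ, x, 1 / (o.n += 1))
    return o
end

data = [3.0, 1.5, 4.0, 1.0, 5.5]
a = foldl(fit_branch!, data; init = ToyMean(0.0, 0))
b = foldl(fit_plain!, data; init = ToyMean(0.0, 0))
a.μ == b.μ  # both forms produce the same running mean here
```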

Enable precompilation

Enabling precompilation for OnlineStatsBase.jl (and OnlineStats.jl, eventually) would make it possible to use them as building blocks for larger packages that need precompilation themselves.

error regarding NamedTuples for Julia 0.7

I get the error below on Julia 0.7, and I am not entirely sure how to fix it.

Maybe line 4 of the main source file needs to become:
using NamedTuples.NamedTuples


julia> using OnlineStatsBase
[ Info: Precompiling module OnlineStatsBase
WARNING: both NamedTuples and Core export "NamedTuple"; uses of it in module OnlineStatsBase must be qualified
ERROR: LoadError: UndefVarError: NamedTuple not defined
Stacktrace:
 [1] top-level scope
 [2] include(::Module, ::String) at .\boot.jl:305
 [3] include_relative(::Module, ::String) at .\loading.jl:1072
 [4] include(::Module, ::String) at .\sysimg.jl:29
 [5] top-level scope
 [6] eval at .\boot.jl:308 [inlined]
 [7] top-level scope at .\<missing>:3
in expression starting at C:\Users\bernhard.konig\.julia\v0.7\OnlineStatsBase\src\OnlineStatsBase.jl:8
ERROR: Failed to precompile OnlineStatsBase to C:\Users\bernhard.konig\.julia\compiled\v0.7\OnlineStatsBase.ji.
Stacktrace:
 [1] error at .\error.jl:33 [inlined]
 [2] compilecache(::Base.PkgId) at .\loading.jl:1206
 [3] _require(::Base.PkgId) at .\loading.jl:1008
 [4] require(::Base.PkgId) at .\loading.jl:879
 [5] require(::Module, ::Symbol) at .\loading.jl:874

julia> versioninfo()
Julia Version 0.7.0-DEV.4474
Commit e542b28ac2* (2018-03-06 17:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Environment:
  JULIA_BINDIR = C:\Julia-0.7.X\bin\
  JULIA_HOME = C:\Julia-0.7.X\bin\

julia>

GroupBy Group

I was hoping to be able to do a collection of stats on each group with one pass. For example, something like o = fit!(GroupBy(eltype(group_key), Group(Mean(), Variance(), Extrema())), zip(group_key, obs)).

However, this example errors out with what appears to be a recursive iteration that reaches the group keys. I can't see how else OnlineStatsBase would end up evaluating a Char, which exists only in the group keys of my example.

I'm not certain whether this is a bug report, feature request, or misuse on my part. Any assistance would be much appreciated. Also, this issue seems to be similar to joshday/OnlineStats.jl#145.

Below is a complete mockup.

julia> using OnlineStats

# Mockup data.
julia> vals = [1 2 3 4; 5 6 7 8; 9 10 11 12] #Note: would come from an iterator.
3×4 Array{Int64,2}:
 1   2   3   4
 5   6   7   8
 9  10  11  12

julia> (nrows, ncols) = size(vals)
(3, 4)

julia> attr1 = repeat(["a", "s", "a"], outer=ncols) #Note: would come from an iterator.
12-element Array{String,1}:
 "a"
 "s"
 "a"
 "a"
 "s"
 "a"
 "a"
 "s"
 "a"
 "a"
 "s"
 "a"

julia> attr2 = repeat(["q", "w", "e", "r"], inner=nrows) #Note: would come from an iterator.
12-element Array{String,1}:
 "q"
 "q"
 "q"
 "w"
 "w"
 "w"
 "e"
 "e"
 "e"
 "r"
 "r"
 "r"

# Define group keys.
julia> group_key = zip(attr1, attr2)
Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}}((["a", "s", "a", "a", "s", "a", "a", "s", "a", "a", "s", "a"], ["q", "q", "q", "w", "w", "w", "e", "e", "e", "r", "r", "r"]))

julia> eltype(group_key)
Tuple{String,String}

# Mockup iterator.
julia> iter = zip(group_key, vals)
Base.Iterators.Zip{Tuple{Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}},Array{Int64,2}}}((Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}}((["a", "s", "a", "a", "s", "a", "a", "s", "a", "a", "s", "a"], ["q", "q", "q", "w", "w", "w", "e", "e", "e", "r", "r", "r"])), [1 2 3 4; 5 6 7 8; 9 10 11 12]))

julia> eltype(iter)
Tuple{Tuple{String,String},Int64}

julia> (i1, state) = iterate(iter)
((("a", "q"), 1), ((2, 2), 2))

julia> (i2, state) = iterate(iter, state)
((("s", "q"), 5), ((3, 3), 3))

julia> (i3, state) = iterate(iter, state)
((("a", "q"), 9), ((4, 4), 4))

# Of the 12 observations, there are 8 unique groups.
julia> first.(collect(iter)) |> unique
8-element Array{Tuple{String,String},1}:
 ("a", "q")
 ("s", "q")
 ("a", "w")
 ("s", "w")
 ("a", "e")
 ("s", "e")
 ("a", "r")
 ("s", "r")

# Setup stat.
julia> o = fit!(GroupBy(eltype(group_key), Mean()), iter)
GroupBy: Tuple{String,String} => Mean{Float64,EqualWeight}
  ├── ("a", "q"): Mean: n=2 | value=5.0
  ├── ("s", "q"): Mean: n=1 | value=5.0
  ├── ("a", "w"): Mean: n=2 | value=6.0
  ├── ("s", "w"): Mean: n=1 | value=6.0
  ├── ("a", "e"): Mean: n=2 | value=7.0
  ├── ("s", "e"): Mean: n=1 | value=7.0
  ├── ("a", "r"): Mean: n=2 | value=8.0
  └── ("s", "r"): Mean: n=1 | value=8.0

julia> o_desired = fit!(GroupBy(eltype(group_key), Group(Mean(), Variance(), Extrema())), iter)
ERROR: The input for GroupBy is a Union{Pair{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}, NamedTuple{names,Tuple{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}}, Tuple{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}} where names.  Found Char.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] fit!(::GroupBy{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names,Group{Tuple{Mean{Float64,EqualWeight},Variance{Float64,EqualWeight},Extrema{Float64,Number}},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}}, ::Char) at /Users/comara/.julia/packages/OnlineStatsBase/L6i9N/src/OnlineStatsBase.jl:108
 [3] fit! at /Users/comara/.julia/packages/OnlineStatsBase/L6i9N/src/OnlineStatsBase.jl:110 [inlined] (repeats 4 times)
 [4] top-level scope at REPL[29]:1
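A possible alternative, offered as a sketch rather than a confirmed fix: the error message suggests that Group expects each observation to carry one element per stat (here three numbers), whereas Series feeds the same single observation to every stat it holds. Assuming Series accepts the three stats as shown, the mockup can be rewritten like this:

```julia
using OnlineStats

# Same mockup data as above, flattened in column-major order.
vals = vec([1 2 3 4; 5 6 7 8; 9 10 11 12])
attr1 = repeat(["a", "s", "a"], outer = 4)
attr2 = repeat(["q", "w", "e", "r"], inner = 3)
group_key = zip(attr1, attr2)

# Series (instead of Group) applies each stat to the same scalar observation.
o = fit!(
    GroupBy(eltype(group_key), Series(Mean(), Variance(), Extrema())),
    zip(group_key, vals),
)
```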

Warning from Julia 1.1 on "eachcol"

After I import OnlineStatsBase, the following warning appears in the Julia terminal:

WARNING: both OnlineStatsBase and Base export "eachcol"; uses of it in module OnlineStats must be qualified

Is there a way to remove this? Is it because Julia Base also has a function called "eachcol"?

Bug (or maybe 2) in v1.5.0

I think a bug (or potentially 2) was introduced in version 1.5.0.

Bug 1:
After upgrading to version 1.5.0, I started to get a dispatch error saying there was no method matching the signature for replace(). Looking at the diff between 1.4.9 and 1.5.0 in src/OnlineStatsBase.jl, the call to replace() in the function name() was updated to a one-line call (line 82). If I open a Julia REPL, create a dummy string and those 2 bool vars, and then run that call to replace, I get the same error.

The full error is:

ERROR: MethodError: no method matching replace(::String, ::Pair{Regex, String}, ::Pair{String, String})
Closest candidates are:
  replace(::AbstractString, ::Pair, ::Pair) at set.jl:613
  replace(::Any, ::Pair...; count) at set.jl:555
  replace(::Union{Function, Type}, ::Pair, ::Pair; count) at set.jl:612
  ...

Potential bug 2:
As part of that same change, it looks like the logic was flipped regarding which regex is used based on the values of those 2 bools. E.g. the regex for replacing periods used to be used when withmodule was false, but now it's used when withmodule is true. And same goes for withparams. I'm not sure if this was correct before and wrong now or vice-versa, hence me not being sure if this is a bug.
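Regarding bug 1: passing several pattern pairs to replace on a string requires Julia 1.7 or later, which would explain the MethodError on older versions. A version-portable sketch applies the pairs one at a time. The helper replace_each and the two patterns below are hypothetical and are not taken from the package source.

```julia
# Apply each pattern => replacement pair in turn, so only the
# single-pair form of `replace` is ever called.
replace_each(s::AbstractString, pairs::Pair...) =
    foldl((acc, p) -> replace(acc, p), pairs; init = s)

# Hypothetical patterns: drop a module prefix and a type-parameter suffix.
strip_module = r"([a-zA-Z]*\.)" => ""
strip_params = r"\{(.*)" => ""

replace_each("OnlineStatsBase.Mean{Float64}", strip_module, strip_params)  # "Mean"
```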

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Possible type instability with `Mean`, `Moments`, `Sum`, `Variance`

Why are some statistics types subtyped with OnlineStat{Number}? For example:

mutable struct Mean{T,W} <: OnlineStat{Number}
    μ::T
    weight::W
    n::Int
end
Mean(T::Type{<:Number} = Float64; weight = EqualWeight()) = Mean(zero(T), weight, 0)

Is there a reason we can't have mutable struct Mean{T,W} <: OnlineStat{T} instead? As it stands, when input() is called on statistics like Mean() it will always return Number instead of the actual input type (e.g. Float32). The same is true for Mean, Moments, Sum, and Variance.


I noticed this while playing around with a Mean/Stdev filter. My original code is as follows (and feel free to offer any suggestions on better/more efficient ways to do this, I'm new to this package).

using BenchmarkTools
using OnlineStatsBase

mutable struct MeanStdFilter{T}
    nu::Int
    tracker::OnlineStat
end

function MeanStdFilter(nu::Int; T::DataType=Float32)
    s = [Series(Mean(T), Variance(T)) for _ in 1:nu]
    return MeanStdFilter{T}(nu, Group(s...))
end

function _get_mean_var(m::MeanStdFilter{T}) where T
    vals = value.(value(m.tracker))
    return reinterpret(reshape, T, collect(vals))
end

function (m::MeanStdFilter)(x::AbstractVector)
    fit!(m.tracker, x)
    μσ2 = _get_mean_var(m)
    return (x .- μσ2[1,:]) ./ sqrt.(μσ2[2,:])
end

# Test runtime
nu = 4
T = Float32
m = MeanStdFilter(nu; T)

# @btime m(randn(T,nu));
@btime _get_mean_var(m);

Running with T = Float32 I get:

1.014 μs (18 allocations: 608 bytes)

and with T = Float64 it drops to:

549.342 ns (6 allocations: 480 bytes)

I suspect this is to do with having to convert Float64 to Float32 at some point in the pipeline because of the issue raised above.

Thanks in advance for any help!
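On the request for suggestions: separately from the OnlineStat{Number} question, the field tracker::OnlineStat in the code above has an abstract type, which the Julia performance tips advise against because the compiler cannot infer the concrete type of the field. Whether this accounts for the timing difference is not established here. A package-free sketch of the usual remedy, with hypothetical toy types:

```julia
abstract type AbstractTracker end

struct ToyTracker <: AbstractTracker
    total::Float64
end

# Abstract field type, analogous to `tracker::OnlineStat` above.
struct LooseFilter
    tracker::AbstractTracker
end

# The field type becomes concrete through a type parameter.
struct TightFilter{S<:AbstractTracker}
    tracker::S
end

loose = LooseFilter(ToyTracker(1.0))
tight = TightFilter(ToyTracker(1.0))

isconcretetype(fieldtype(typeof(loose), :tracker))  # false
isconcretetype(fieldtype(typeof(tight), :tracker))  # true
```

Applied to the filter above, this would mean a second type parameter on MeanStdFilter for the tracker type.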

Handle transform errors in `FTSeries`

Motivation

A possible use of transformations in FTSeries is converting a stream of strings into a more appropriate type, e.g.:

> fit!(FTSeries(String, Mean(); transform=x->parse(Int, x)), string.(1:5))
FTSeries
└─ Mean: n=5 | value=3.0

but in real life it isn't uncommon that these streams can have some elements with a wrong or unexpected format, e.g.:

> fit!(FTSeries(String, Mean(); transform=x->parse(Int, x)), [string.(1:5); "6.0"])
ERROR: ArgumentError: invalid base 10 digit '.' in "6.0"

Within the vision of FTSeries, would it make sense to protect against failures of the transform?

Possible solution

Below is a possible solution using a try-catch approach that keeps track of the number of failed transformations:

mutable struct FTSeries{IN, OS, F, T} <: StatCollection{IN}
    stats::OS
    filter::F
    transform::T
    nfiltered::Int
    nfailed::Int    # added to keep track of transform failures
end
function FTSeries(stats::OnlineStat...; filter=x->true, transform=identity)
    IN, OS = Union{map(input, stats)...}, typeof(stats)
    FTSeries{IN, OS, typeof(filter), typeof(transform)}(stats, filter, transform, 0, 0)
end
function FTSeries(T::Type, stats::OnlineStat...; filter=x->true, transform=identity)
    FTSeries{T, typeof(stats), typeof(filter), typeof(transform)}(stats, filter, transform, 0, 0)
end
# Counts failed transformations as observations, although I'm not convinced this is the best approach.
nobs(o::FTSeries) = nobs(o.stats[1]) + o.nfailed
@generated function _fit!(o::FTSeries{N, OS}, y) where {N, OS}
    n = length(fieldnames(OS))
    quote
        if o.filter(y)
            try
                yt = o.transform(y)
                Base.Cartesian.@nexprs $n i -> @inbounds begin
                    _fit!(o.stats[i], yt)
                end
            catch e
                o.nfailed += 1
            end
        else
            o.nfiltered += 1
        end
    end
end
function _merge!(o::FTSeries, o2::FTSeries)
    o.nfiltered += o2.nfiltered
    o.nfailed += o2.nfailed
    _merge!.(o.stats, o2.stats)
end

I can make a PR with this solution if you think it makes sense.
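The counting idea can be tried in isolation with a package-free toy. SafeMeanSketch and fit_safe! are hypothetical names and this is not the FTSeries implementation. The sketch catches any error thrown by the transform, counts it, and skips the update.

```julia
mutable struct SafeMeanSketch
    μ::Float64
    n::Int
    nfailed::Int
end
SafeMeanSketch() = SafeMeanSketch(0.0, 0, 0)

function fit_safe!(o::SafeMeanSketch, y; transform = identity)
    x = try
        transform(y)
    catch
        # Transform failed: record it and leave the mean untouched.
        o.nfailed += 1
        return o
    end
    o.n += 1
    o.μ += (x - o.μ) / o.n
    return o
end

o = SafeMeanSketch()
for s in [string.(1:5); "6.0"]
    fit_safe!(o, s; transform = x -> parse(Int, x))
end
(o.μ, o.n, o.nfailed)  # (3.0, 5, 1)
```

An alternative that avoids exceptions is a transform based on tryparse, which returns nothing on failure, combined with a check for nothing before fitting.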
