juliastats / statsbase.jl

Basic statistics for Julia

License: Other

Julia 100.00%

julia statistics summarization statistical-models

statsbase.jl's Introduction

StatsBase.jl

StatsBase.jl is a Julia package that provides basic support for statistics. In particular, it implements a variety of statistics-related functions, such as scalar statistics, high-order moment computation, counting, ranking, covariances, sampling, and empirical density estimation.


statsbase.jl's People


simonster rened aviks jonasrauber curiousleo bjarkehs superxroot guria65 johansigfrids nalimilan bigcrunsh koreanfoodcomics goretkin vunguyene wildart skumagai jpata bdeonovic mjlong bicycle1885 keno tkelman bjarthur islandsofsleep karenroma yuyichao rybern hayd ken-b greenbuttonlab abhijithch getzdan panlanfeng ragiaaboulfadl blimasouza halmoni100 davidavdav oschulz dcarrera c123w conning grero mkborregaard vinayprakashsingh wookay nkottary evanfields invenia rofinn jeffreysarnoff lewishein claireh93 jeffwong alexmorley edulemasson iwabuchikentaro formulas-and-numbers valdart benjaminborn jayvn joshday lnsongxf shashi zgornel djsegal fredrikekre tbeason spencerx floswald samuelwiqvist juliatakingfittingapisseriously martinholters jeffbezanson thofma jamesonquinn bellamkondaprakash benluteijn jw3126 abhiyad teresy asbisen mschauer diegozea oxinabox hhc2tech jgoldfar mateuszbaran qingxufish eulerkochy smldis mutaz94 jacobxk yuehhua jbahire sumegh-git nayyarv findmyway lfelipegomez bayesthm bkamins

statsbase.jl's Issues

Sampling without replacement produces repeated elements

The following (using Distributions)

for i=1:100 print(length(union(sample(1:100000,200,replace=false))),"\n") end

should always print the value 200, one hundred times. It doesn't. Sometimes the value is 199 or even 198, meaning that not all values in the sample are unique.
Strangely if 200 is replaced by 1000, the sampling function seems to always return 1000 unique numbers as expected.

Originally raised by @qfn

[PkgEval] StatsBase may have a testing issue on Julia 0.4 (2014-10-28)

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.3) and the nightly build of the unstable version (0.4). The results of this script are used to generate a package listing enhanced with testing results.

On Julia 0.4

On 2014-10-27 the testing status was "Package doesn't load."
On 2014-10-28 the testing status changed to "Tests fail, but package loads."

"Tests fail, but package loads." means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with `using` worked.

"Package doesn't load." means that PackageEvaluator did not find tests for your package. Additionally, trying to load your package with `using` failed.

This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.

Test log:

>>> 'Pkg.add("StatsBase")' log
INFO: No packages to install, update or remove
INFO: Package database updated
INFO: METADATA is out-of-date a you may not have the latest version of StatsBase
INFO: Use `Pkg.update()` to get the latest versions of your packages

>>> 'using StatsBase' log
Warning: could not import Base.evaluate into StatsBase
Julia Version 0.4.0-dev+1330
Commit 7fdc860 (2014-10-28 03:56 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

>>> test log
Warning: could not import Base.evaluate into StatsBase
Running tests:
 * mathfuns.jl ...
 * weights.jl ...

ERROR: evaluate not defined
 in _wsum_general! at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/weights.jl:144
 in _wsumN! at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/weights.jl:128
 in wsum at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/weights.jl:216
 in include at ./boot.jl:242
 in include_from_node1 at ./loading.jl:128
 in anonymous at no file:24
 in include at ./boot.jl:242
 in include_from_node1 at loading.jl:128
 in process_options at ./client.jl:293
 in _start at ./client.jl:362
 in _start_3B_3769 at /home/idunning/julia04/usr/bin/../lib/julia/sys.so
while loading /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/weights.jl, in expression starting on line 59
while loading /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/runtests.jl, in expression starting on line 21
INFO: Testing StatsBase
==============================[ ERROR: StatsBase ]==============================

failed process: Process(`/home/idunning/julia04/usr/bin/julia /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/runtests.jl`, ProcessExited(1)) [1]

================================================================================
INFO: No packages to install, update or remove
ERROR: StatsBase had test errors
 in error at error.jl:21
 in test at pkg/entry.jl:719
 in anonymous at pkg/dir.jl:28
 in cd at ./file.jl:20
 in cd at pkg/dir.jl:28
 in test at pkg.jl:68
 in process_options at ./client.jl:221
 in _start at ./client.jl:362
 in _start_3B_3769 at /home/idunning/julia04/usr/bin/../lib/julia/sys.so


>>> end of log

renaming complete?

From a new build of julia (commit ad65af1) with wiped ~/.julia packages directory, the package Distributions has difficulty:

julia> Pkg.add("Distributions")
INFO: Initializing package repository /Users/alpert/.julia
INFO: Cloning METADATA from git://github.com/JuliaLang/METADATA.jl
INFO: Cloning cache of Distributions from git://github.com/JuliaStats/Distributions.jl.git
INFO: Cloning cache of NumericExtensions from git://github.com/lindahua/NumericExtensions.jl.git
INFO: Cloning cache of StatsBase from git://github.com/JuliaStats/StatsBase.jl.git
INFO: Installing Distributions v0.2.13
INFO: Installing NumericExtensions v0.3.5
INFO: Installing StatsBase v0.3.5
INFO: Package database updated

julia> using Distributions
ERROR: Stats not found
 in require at loading.jl:39
 in reload_path at loading.jl:146
 in _require at loading.jl:59
 in require at loading.jl:43
while loading /Users/alpert/.julia/Distributions/src/Distributions.jl, in expression starting on line 4

Checking further, I see this:

$ ls ~/.julia
Distributions     NumericExtensions StatsBase
METADATA          REQUIRE
$ more ~/.julia/Distributions/REQUIRE 
NumericExtensions
Stats

Changing "Stats" to "StatsBase" in the REQUIRE file doesn't do the trick. Do you see what's needed?

nan* family of methods

I have passing tests for a nan* family of methods in my TimeSeries timestamp branch. These methods remove NaN from an array and then do some basic statistical computation on the resulting array. I can do a pull request if it belongs here. The code is:

for (nam, func) = ((:nanmax, :max), (:nanmin, :min), (:nansum, :sum),
                   (:nanmean, :mean), (:nanmedian, :median), (:nanvar, :var),
                   (:nanstd, :std), (:nanskewness, :skewness), (:nankurtosis, :kurtosis))
  @eval begin
    function ($nam)(x::Array)
      # collect the non-NaN elements into a new array
      newa = typeof(x[1])[]
      for i in 1:length(x)
        if ~isnan(x[i])
          push!(newa, x[i])
        end
      end
      # apply the underlying statistic to the filtered array
      ($func)(newa)
    end
  end
end

Generating the new array causes a slowdown:
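A single pass that skips NaN values would avoid building the intermediate array. The following is a minimal sketch for the mean only, written for current Julia; `nanmean_sketch` is a hypothetical name and is not part of the TimeSeries branch or of StatsBase.

```julia
# Hypothetical helper: mean of the non-NaN entries, computed in one pass
# without allocating a filtered copy of the input.
function nanmean_sketch(x::AbstractArray{<:Real})
    s = zero(float(eltype(x)))   # running sum of non-NaN values
    n = 0                        # number of non-NaN values seen
    for v in x
        if !isnan(v)
            s += v
            n += 1
        end
    end
    return s / n                 # NaN if every entry was NaN (0.0 / 0)
end

nanmean_sketch([1.0, NaN, 3.0])
```

The same loop structure extends to sums and extrema; statistics such as the median still need the filtered values in some form.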

julia> qux = [rand(1000), NaN];

julia> var(qux)
NaN

julia> nanvar(qux)    
0.08380883170731807

julia> timetrial(nanvar, qux, 1000)
7.9002385e-5

julia> pop!(qux);

julia> var(qux)
0.08380883170731807

julia> timetrial(var, qux, 1000)
7.718842000000008e-6

Histograms

Should we try to move the hist functionality from Base? If there is general support here for this, I'll open an issue in the main julia repo.
It would also probably make sense to create a Histogram type, mirroring the KDE type.

2D KDE relies on Distributions and therefore is broken

The 2D KDE calls pdf and Normal from Distributions.jl, but of course Distributions.jl is not imported. Therefore, 2D KDE is currently not working.

Should we move kde back to Distributions?

Version dependency

Is everything here 0.2 capable? If so, it'd be nice to specify that.

range and minmax

Currently, midrange returns (max - min) / 2, while range returns (min, max). This is a little bit confusing.

What about the following semantics?
minmax returns (min, max).
range returns max - min
midrange returns half of the range (or call this halfrange?)
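The proposed semantics can be sketched as follows. The `_sketch` suffix marks these as hypothetical illustrations of the proposal rather than the package's API; Base's `extrema` is used to obtain both bounds in one pass.

```julia
# Hypothetical illustrations of the proposed semantics (not StatsBase's API).

# (min, max) in a single pass over the data
minmax_sketch(x) = extrema(x)

# width of the data: max - min
function range_sketch(x)
    lo, hi = extrema(x)
    return hi - lo
end

# half of the range, as proposed in this issue
midrange_sketch(x) = range_sketch(x) / 2

x = [3.0, 1.0, 4.0, 1.5]
minmax_sketch(x)
range_sketch(x)
midrange_sketch(x)
```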

norepeat function in StatsBase/src/misc.jl

I don't see that this function is used anywhere, so it could probably be deleted. In case it is used it should be rewritten as

norepeat(a::AbstractArray) = length(Set(a)) == length(a)

There is no need to sort the array to determine whether or not there are repeats.
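For comparison, current Julia versions provide `allunique` in Base, which answers the same question. A small sketch, with `norepeat_sketch` as a hypothetical stand-in for the function discussed here:

```julia
# Set-based check from this issue: no sorting required.
norepeat_sketch(a::AbstractArray) = length(Set(a)) == length(a)

# Base's allunique gives the same answer on these inputs.
norepeat_sketch([1, 2, 3]) == allunique([1, 2, 3])
norepeat_sketch([1, 2, 2]) == allunique([1, 2, 2])
```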

Generic statistics methods

I am writing some code for regression based time series models where I define e.g. logLik and residuals methods. I think it would make sense to define these functions in this package and maybe introduce abstract statistical models. I know that some of them are in GLM but I think that this place could be more suitable. Any thoughts?

cc: @dmbates

pacf() gives strange values

When I use pacf() with method=:regression, the value at lag 0 is almost zero. As far as I know, it should always be one. This does not happen with method=:yulewalker. Am I the only one experiencing this?

If I compare the output of pacf(method=:yulewalker) with aryule() in MATLAB, I get the same values but with the opposite sign. Is there a good reason for that, or is it a bug?

Weighted variance, standard deviation, covariance & correlation.

I am going to add these functionalities to the package soon.

One question needs to be decided: should we correct the scale like we do in unweighted cases? like:

m = mean(x, w)
# shall we do:
var(x, w) = sum(abs2(x - m)) / (sum(w) - 1)
# or do :
var(x, w) = sum(abs2(x - m)) / sum(w)
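To make the two candidates concrete, here is a runnable sketch in current Julia syntax. It assumes the weights are meant to multiply the squared deviations in the numerator (the pseudocode above omits them), and all function names are hypothetical.

```julia
# Weighted mean: sum(w .* x) / sum(w)
wmean_sketch(x, w) = sum(w .* x) / sum(w)

# Numerator shared by both candidates: weighted sum of squared deviations.
# Assumption: each squared deviation is multiplied by its weight.
function wsumsq_sketch(x, w)
    m = wmean_sketch(x, w)
    return sum(w .* abs2.(x .- m))
end

# Candidate 1: scale by sum(w) - 1
wvar_corrected(x, w) = wsumsq_sketch(x, w) / (sum(w) - 1)

# Candidate 2: scale by sum(w)
wvar_uncorrected(x, w) = wsumsq_sketch(x, w) / sum(w)

x = [1.0, 2.0, 3.0, 4.0]
w = ones(4)
wvar_corrected(x, w)
wvar_uncorrected(x, w)
```

With unit weights the first candidate reduces to the usual sample variance. The `sum(w) - 1` divisor is only meaningful when the weights behave like counts, which is part of what needs deciding.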

[PkgEval] StatsBase may have a testing issue on Julia 0.4 (2014-10-08)

On Julia 0.4

On 2014-10-05 the testing status was "Tests pass."
On 2014-10-08 the testing status changed to "Tests fail, but package loads."

"Tests pass." means that PackageEvaluator found the tests for your package, executed them, and they all passed.

"Tests fail, but package loads." means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with `using` worked.

Special message from @IainNZ: This change may be due to breaking changes to Dict in JuliaLang/julia#8521, or the removal of deprecated syntax in JuliaLang/julia#8607.

Test log:

>>> 'Pkg.add("StatsBase")' log
INFO: Installing ArrayViews v0.4.6
INFO: Installing StatsBase v0.6.6
INFO: Package database updated
INFO: METADATA is out-of-date a you may not have the latest version of StatsBase
INFO: Use `Pkg.update()` to get the latest versions of your packages

>>> 'using StatsBase' log

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/scalarstats.jl:98.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/scalarstats.jl:122.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>Float64)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/counts.jl:162.
Use "Dict{T,Float64}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/counts.jl:192.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>W)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/counts.jl:193.
Use "Dict{T,W}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/misc.jl:66.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/misc.jl:77.
Use "Dict{T,Int}()" instead.
Julia Version 0.4.0-dev+998
Commit e24fac0 (2014-10-07 22:02 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

>>> test log

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/scalarstats.jl:98.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/scalarstats.jl:122.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>Float64)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/counts.jl:162.
Use "Dict{T,Float64}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/counts.jl:192.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>W)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/counts.jl:193.
Use "Dict{T,W}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/misc.jl:66.
Use "Dict{T,Int}()" instead.

WARNING: deprecated syntax "(T=>Int)[]" at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/misc.jl:77.
Use "Dict{T,Int}()" instead.
Running tests:
 * mathfuns.jl ...
 * weights.jl ...
 * moments.jl ...
 * scalarstats.jl ...

ERROR: function median! does not accept keyword arguments
 in mad! at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/scalarstats.jl:182
 in mad at /home/idunning/pkgtest/.julia/v0.4/StatsBase/src/scalarstats.jl:187
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in anonymous at no file:24
 in include at ./boot.jl:245
 in include_from_node1 at loading.jl:128
 in process_options at ./client.jl:293
 in _start at ./client.jl:362
 in _start_3B_3789 at /home/idunning/julia04/usr/bin/../lib/julia/sys.so
while loading /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/scalarstats.jl, in expression starting on line 77
while loading /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/runtests.jl, in expression starting on line 21
INFO: Testing StatsBase
==============================[ ERROR: StatsBase ]==============================

failed process: Process(`/home/idunning/julia04/usr/bin/julia /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/runtests.jl`, ProcessExited(1)) [1]

================================================================================
INFO: No packages to install, update or remove
ERROR: StatsBase had test errors
 in error at error.jl:21
 in test at pkg/entry.jl:719
 in anonymous at pkg/dir.jl:28
 in cd at ./file.jl:20
 in cd at pkg/dir.jl:28
 in test at pkg.jl:68
 in process_options at ./client.jl:221
 in _start at ./client.jl:362
 in _start_3B_3789 at /home/idunning/julia04/usr/bin/../lib/julia/sys.so


>>> end of log

Design of the frequency/contingency tables

Efficient implementation of such functions on generic data types (e.g. strings) can be done via pooled data arrays. Therefore, I feel that it makes sense to implement these in DataArrays.jl and thus take advantage of the pooled array machinery.

For the Stats.jl package, we will still maintain the counts function that can be used to compute such tables based on integer variables, like

counts([1, 1, 1, 2, 2, 2, 3, 3, 3, 3], 1:3)  # ==> [3, 3, 4]

We may also continue to keep the countmap function (or whatever we finally decide to call it).
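As a rough illustration of the two functions that would stay in this package, the sketch below tallies integer values over a range and builds a value-to-count dictionary. The `_sketch` names are hypothetical, and skipping out-of-range values is an assumption of this example, not a statement about the package's behavior.

```julia
# Count occurrences of each integer level in `levels`.
# Assumption: values outside `levels` are silently skipped.
function counts_sketch(x::AbstractVector{<:Integer}, levels::UnitRange{<:Integer})
    r = zeros(Int, length(levels))
    lo = first(levels)
    for v in x
        if v in levels
            r[v - lo + 1] += 1
        end
    end
    return r
end

# Map each distinct value to its number of occurrences.
function countmap_sketch(x::AbstractVector{T}) where {T}
    cm = Dict{T,Int}()
    for v in x
        cm[v] = get(cm, v, 0) + 1
    end
    return cm
end

counts_sketch([1, 1, 1, 2, 2, 2, 3, 3, 3, 3], 1:3)
countmap_sketch(["a", "b", "a"])
```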

What I suggest is to implement two-way or multi-way contingency tables in DataArrays.jl

Please look below to see my detailed consideration about the interface design.

Cholesky-like covariance decomposition

Is there a function anywhere in Julia stats (or linear algebra) similar to MATLAB's cholcov function? Here is the relevant page in the MATLAB statistics documentation:

http://www.mathworks.co.uk/help/stats/cholcov.html

I need this for some work I am doing, so any help is appreciated.

Decide on an opensource license

It would be nice to get this released under an open-source license. :)

Interface for histograms (AbstractHistogram)

This issue is meant to discuss the possibility of adding an AbstractHistogram interface which would allow methods of Histogram such as push! and append! to be reused. This would allow additional *Histogram types to be defined in a coherent way.

For a concrete case:
If one bins data with unequal weights and assumes the bin heights are sampled from independent Poisson distributions, the sum of weights squared for each bin is useful for keeping track of the estimated variance of the bin. This in turn can be used to tell whether two histograms are likely to arise from the same underlying distribution [1]. This would be a simple add-on for Histograms, see an example in https://github.com/jpata/HEP.jl/blob/master/src/hist.jl. This method is used heavily in high-energy physics experiments.

[1] http://arxiv.org/pdf/0712.4250.pdf
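A minimal sketch of the concrete case described above: a one-dimensional histogram that tracks, per bin, both the sum of weights and the sum of squared weights. The type name `ErrorHistogram` and its fields are hypothetical and are not taken from StatsBase or HEP.jl.

```julia
# Hypothetical type; not part of StatsBase or HEP.jl.
struct ErrorHistogram
    edges::Vector{Float64}   # sorted bin edges, length nbins + 1
    sumw::Vector{Float64}    # sum of weights per bin (bin height)
    sumw2::Vector{Float64}   # sum of squared weights per bin (variance estimate)
end

# Construct an empty histogram from a set of sorted edges.
function ErrorHistogram(edges::AbstractVector{<:Real})
    nbins = length(edges) - 1
    return ErrorHistogram(collect(Float64, edges), zeros(nbins), zeros(nbins))
end

# Add one observation x with weight w. Bins are left-closed:
# edges[i] <= x < edges[i + 1]. Out-of-range values are ignored.
function Base.push!(h::ErrorHistogram, x::Real, w::Real=1.0)
    i = searchsortedlast(h.edges, x)
    if 1 <= i < length(h.edges)
        h.sumw[i] += w
        h.sumw2[i] += w^2
    end
    return h
end

h = ErrorHistogram(0:1:3)
push!(h, 0.5, 2.0)
push!(h, 0.7, 3.0)
push!(h, 2.5)
push!(h, 5.0)   # outside the edges, ignored
```

With an abstract supertype, only the per-bin update would need to differ between histogram types; the bin lookup could be shared.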

Basic computation routines

In the process of removing NumericExtensions from packages, I found that I keep rewriting some simple routines everywhere (i.e. many stats-related packages).

I feel that some of the commonly used routines should be placed here.

I am considering writing a small set of functions to support basic statistical computing, so that people won't have to write them again and again.

Here is a tentative list of functions to be added:

in-place arithmetic (add!, subtract!, multiply!, etc.)
add/subtract a vector (or a scaled version of it) to/from each column/row in place
in-place versions of several widely used math functions (abs!, abs2!, sqrt!, exp!, and log!)
weighted sum
sum/mean of abs/square (both weighted & non-weighted)
maximum absolute value
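A few of the listed routines can be sketched with broadcasting in current Julia. The names `add!` and `abs2!` follow the list above; `addcols!`, `wsum_sketch`, and `maxabs_sketch` are hypothetical names chosen for this example, and none of this is the package's implementation.

```julia
# In-place elementwise addition: x[i] += y[i]
add!(x::AbstractArray, y::AbstractArray) = (x .+= y; x)

# In-place squared magnitude: x[i] = abs2(x[i])
abs2!(x::AbstractArray) = (x .= abs2.(x); x)

# Add c * v to every column of X in place (v has one entry per row).
function addcols!(X::AbstractMatrix, v::AbstractVector, c::Real=1)
    X .+= c .* v
    return X
end

# Weighted sum without allocating a temporary array.
wsum_sketch(x, w) = sum(xi * wi for (xi, wi) in zip(x, w))

# Maximum absolute value.
maxabs_sketch(x) = maximum(abs, x)

X = zeros(2, 3)
addcols!(X, [1.0, 2.0], 2)
```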

This wouldn't make the package much larger, but it will provide a lot of convenience to statistical computing.

If there's no objection, I will take the lead to implement these. Opinions?

Feature request: weighted quantiles

Something along the lines of the R function wtd.quantile from the Hmisc package (http://cran.r-project.org/web/packages/Hmisc/Hmisc.pdf) would be useful.

quantile(x::Vector,w::WeightVec)
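Several definitions of a weighted quantile exist. The sketch below uses one simple choice, assumed here for illustration only: the smallest value whose cumulative weight reaches a fraction p of the total weight. It does not interpolate and is not claimed to match Hmisc's result. `wquantile_sketch` is a hypothetical name and plain vectors stand in for WeightVec.

```julia
# Smallest x[i] whose cumulative weight reaches p * sum(w).
# Assumes x is non-empty and w is non-negative with positive total weight.
function wquantile_sketch(x::AbstractVector{<:Real}, w::AbstractVector{<:Real}, p::Real)
    length(x) == length(w) || throw(DimensionMismatch("x and w must have equal length"))
    0 <= p <= 1 || throw(ArgumentError("p must lie in [0, 1]"))
    order = sortperm(x)              # visit the values in increasing order
    target = p * sum(w)
    acc = zero(float(eltype(w)))     # cumulative weight so far
    for i in order
        acc += w[i]
        acc >= target && return x[i]
    end
    return x[order[end]]             # guards against rounding in the running sum
end

wquantile_sketch([1, 2, 3, 4], [1, 1, 1, 1], 0.5)
wquantile_sketch([4, 1, 3, 2], [10, 1, 1, 1], 0.5)
```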

Fused computation of statistics

I have been considering a uniform interface for computing multiple statistics at once, while allowing them to share part of the computation.

Consider the following example. We want to compute sum, mean, var, and std from x:

s, m, v, sd = sum(x), mean(x), var(x), std(x)

This clearly wastes a lot of computation (e.g. it actually computes the sum four times, the mean three times, and the variance twice).

A more efficient way would be

s = sum(x)
m = s / length(x)
v = varm(x, m)
sd = sqrt(v)

This is more efficient, but not as concise and convenient.

I am considering the following way:

s, m, v, sd = stats(x, (sum_, mean_, var_, std_))

Internally, it should find an efficient routine that computes them altogether. Here, sum_ and mean_ are typed indicators defined as

type Sum_ end
type Mean_ end
type Var_ end
type Std_ end

const sum_ = Sum_()
const mean_ = Mean_()
const var_ = Var_()
const std_ = Std_()

Different combinations of statistics are different tuple types, and therefore we can leverage Julia's multiple dispatch mechanism to choose the optimal computation paths.
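The dispatch idea can be sketched in current Julia, where `struct` replaces the older `type` keyword. The helper `stat` and the generic fallback method are assumptions added for this example; only the `stats(x, (sum_, mean_, var_, std_))` call shape comes from the proposal.

```julia
# Typed indicators, one singleton type per statistic.
struct Sum_ end
struct Mean_ end
struct Var_ end
struct Std_ end

const sum_ = Sum_()
const mean_ = Mean_()
const var_ = Var_()
const std_ = Std_()

# Unfused building blocks (hypothetical helper `stat`).
stat(x, ::Sum_) = sum(x)
stat(x, ::Mean_) = sum(x) / length(x)
function stat(x, ::Var_)
    m = stat(x, mean_)
    return sum(xi -> abs2(xi - m), x) / (length(x) - 1)
end
stat(x, ::Std_) = sqrt(stat(x, var_))

# Generic fallback: compute each requested statistic independently.
stats(x, which::Tuple) = map(s -> stat(x, s), which)

# Fused path, selected by dispatch on the tuple type: the sum and the mean
# are each computed once and reused.
function stats(x, ::Tuple{Sum_,Mean_,Var_,Std_})
    s = sum(x)
    m = s / length(x)
    v = sum(xi -> abs2(xi - m), x) / (length(x) - 1)
    return (s, m, v, sqrt(v))
end

x = [1.0, 2.0, 3.0, 4.0]
s, m, v, sd = stats(x, (sum_, mean_, var_, std_))
```

Any combination without a specialized method still works through the fallback, so fused paths can be added one at a time.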

This is not urgent, but would be really nice to have. I am not going to implement this in the near future; I am just opening this thread to collect ideas, suggestions, and opinions.

Warning: using StatsBase.histrange in module Main conflicts with an existing identifier.

julia> using Data<tab><tab>
DataArrays     DataFrames      DataStructures
julia> using DataFrames    # I typed F and hit tab to complete
Warning: using StatsBase.histrange in module Main conflicts with an existing identifier.
Warning: using StatsBase.midpoints in module Main conflicts with an existing identifier.

julia> VERSION
v"0.3.3"

Now, what's really interesting is this:

julia> using DataFrames   # I typed the entire thing in after restarting Julia

julia>

That is, the warnings only occur when I use tabs to see completion options. 100% reproducible. This occurs with DataArrays and DataStructures as well. (See JuliaData/DataFrames.jl#740 for the original issue.)

Add function to standardize variable

Standardizing a variable comes up in many statistical applications. It might be a good idea to provide an optimized function for this in StatsBase, since the naive implementation (x - mean(x)) / std(x) calculates the mean twice.
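A minimal sketch of such a function, computing the mean once and reusing it for the standard deviation. `standardize_sketch` is a hypothetical name, and the sample (n - 1) normalization is an assumption chosen to match the default of `std`.

```julia
# Center and scale a vector, computing the mean only once.
function standardize_sketch(x::AbstractVector{<:Real})
    n = length(x)
    m = sum(x) / n                                    # mean, computed once
    s = sqrt(sum(xi -> abs2(xi - m), x) / (n - 1))    # sample standard deviation
    return (x .- m) ./ s
end

standardize_sketch([1.0, 2.0, 3.0])
```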

ERROR: Func not defined

I'm getting this error on Ubuntu 14.04 (using julia-nightlies, version 0.3.0-prerelease)

julia> using StatsBase
Warning: could not import Base.Func into StatsBase
Warning: could not import Base.evaluate into StatsBase
Warning: could not import Base.IdFun into StatsBase
Warning: could not import Base.Abs2Fun into StatsBase
ERROR: Func not defined
 in anonymous at no file (repeats 2 times)
 in include at boot.jl:244
while loading /home/dzea/.julia/v0.3/StatsBase/src/weights.jl, in expression starting on line 134
while loading /home/dzea/.julia/v0.3/StatsBase/src/StatsBase.jl, in expression starting on line 140

Define method fit! akin to predict!

Method fit is defined but no fit!: https://github.com/JuliaStats/StatsBase.jl/blob/master/src/statmodels.jl#L13

Would it be possible to define fit! as well, analogous to predict and predict!?

cor_spearman deals with NaNs incorrectly

from JuliaLang/julia#1142

julia> using Stats

julia> cor_spearman([NaN,NaN,3.0,5.0,6.0,7.0],[NaN,NaN,3.0,5.0,6.0,7.0])
1.0

julia> cor_spearman([3.0,5.0,6.0,7.0],[3.0,NaN,6.0,NaN])
0.7999999999999999

According to JuliaLang/julia#1142, both should be NaN.

the indicators methods

The methods for creating sparse indicator matrices are less efficient than they should be. In general it is a bad idea to use spzeros to initialize a sparse matrix and then set values in the matrix to a non-zero value. In the worst case, the entire matrix is rewritten for every nonzero value that is set.

If you use a PooledDataVector from the DataFrames package (soon to be the DataArrays package), the indicators function is a one-liner (well, one longish line):

indicators(x::PooledDataVector; sparse::Bool=false) = (nl = length(x.pool); (sparse ? speye(nl) : eye(nl))[:,x.refs])

Generally I would recommend going through the pool constructor for a PooledDataArray as the logic for obtaining the indices into the pool of values is already there. However, that constructor is in the DataFrames package at present and we probably don't want to have the Stats package depend on DataFrames. Perhaps when DataArrays is more widely available these should be rewritten.
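Independently of pooled arrays, one way to avoid the repeated rewrites is to assemble the row and column indices first and build the matrix in a single call to `sparse(I, J, V, m, n)` from the SparseArrays standard library. The sketch below assumes the caller supplies the distinct levels; `indicators_sketch` is a hypothetical name.

```julia
using SparseArrays

# Build a (levels x observations) indicator matrix in one construction step.
# Assumption: every element of x appears in `levels`.
function indicators_sketch(x::AbstractVector, levels::AbstractVector)
    index = Dict(l => i for (i, l) in enumerate(levels))   # level -> row number
    rows = [index[v] for v in x]                           # row of each observation
    cols = collect(1:length(x))                            # one column per observation
    return sparse(rows, cols, ones(length(x)), length(levels), length(x))
end

M = indicators_sketch(["a", "b", "a"], ["a", "b"])
```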

Broken on 0.2

StatsBase is broken on 0.2:

julia> using StatsBase
ERROR: StatsBase not found
 in require at loading.jl:39

The issue is that there is no src/StatsBase.jl. How can we fix this quickly?

@IainNZ

autocov/autocor tests don't work on 32-bit platform

$ ./julia -E 'versioninfo();println(Pkg.installed("StatsBase"));Pkg.test("StatsBase")'
Julia Version 0.4.0-dev+119
Commit 3a252ee (2014-08-14 03:46 UTC)
Platform Info:
  System: Linux (i686-redhat-linux)
  CPU: Genuine Intel(R) CPU           T2250  @ 1.73GHz
  WORD_SIZE: 32
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Banias)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
0.6.3
INFO: Testing StatsBase
Running tests:
 * mathfuns.jl ...
 * weights.jl ...
 * moments.jl ...
 * scalarstats.jl ...
 * deviation.jl ...
 * cov.jl ...
 * counts.jl ...
 * ranking.jl ...
 * empirical.jl ...
 * hist.jl ...
 * rankcorr.jl ...
 * signalcorr.jl ...
ERROR: `fptype` has no method matching fptype(::Type{Int32})
 in autocov at /home/rick/.julia/v0.4/StatsBase/src/signalcorr.jl:81
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in anonymous at no file:24
 in include at ./boot.jl:245
 in include_from_node1 at loading.jl:128
 in process_options at ./client.jl:285
 in _start at ./client.jl:354
 in _start_3B_1680 at /usr/local/src/julia/julia/usr/bin/../lib/julia/sys.so
while loading /home/rick/.julia/v0.4/StatsBase/test/signalcorr.jl, in expression starting on line 29
while loading /home/rick/.julia/v0.4/StatsBase/test/runtests.jl, in expression starting on line 21

==============================[ ERROR: StatsBase ]==============================

failed process: Process(`/usr/local/src/julia/julia/usr/bin/julia /home/rick/.julia/v0.4/StatsBase/test/runtests.jl`, ProcessExited(1)) [1]

================================================================================
INFO: No packages to install, update or remove
ERROR: StatsBase had test errors
 in error at error.jl:21
 in test at pkg/entry.jl:711
 in anonymous at pkg/dir.jl:28
 in cd at ./file.jl:20
 in cd at pkg/dir.jl:28
 in test at pkg.jl:67
 in process_options at ./client.jl:219
 in _start at ./client.jl:354
 in _start_3B_1680 at /usr/local/src/julia/julia/usr/bin/../lib/julia/sys.so

Implement diagnostic tests

TimeModels needs a few diagnostic tests (mentioned here: JuliaStats/TimeModels.jl#23). Do these belong in StatsBase instead? Here is the list:

Augmented Dickey-Fuller (unit root)
Phillips-Perron (unit root)
KPSS (unit root)
Shapiro-Wilk (normality)
Jarque-Bera (3rd and 4th moments for normality)
Ljung-Box (auto correlation)
Durbin-Watson (auto correlation)
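To give a sense of scale, two of the listed statistics fit in a few lines each. The sketch below computes only the test statistics (no p-values or critical values) using their textbook definitions; the function names are hypothetical and do not refer to anything in TimeModels or StatsBase.

```julia
# Durbin-Watson statistic for a residual vector e:
# sum of squared successive differences over the sum of squares.
durbin_watson_sketch(e) = sum(abs2, diff(e)) / sum(abs2, e)

# Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4), where S and K are the
# sample skewness and kurtosis based on central moments with divisor n.
function jarque_bera_sketch(x)
    n = length(x)
    m = sum(x) / n
    m2 = sum(xi -> (xi - m)^2, x) / n
    m3 = sum(xi -> (xi - m)^3, x) / n
    m4 = sum(xi -> (xi - m)^4, x) / n
    S = m3 / m2^1.5
    K = m4 / m2^2
    return n / 6 * (S^2 + (K - 3)^2 / 4)
end

durbin_watson_sketch([1.0, -1.0, 1.0, -1.0])
jarque_bera_sketch([1.0, 2.0, 3.0, 4.0, 5.0])
```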

Performance of ordered sampling with replacement

I wrote a script to benchmark sampling algorithms.

Specifically, perf/sampling1.jl is to compare the performance of direct sampling and ordered sampling under different settings of n and k. (Note: we are to draw k samples from a population with n elements).

Results are shown here:
https://docs.google.com/spreadsheets/d/1UTfdhp60SzFuAfqOKphd-6K-9Hpon7WLou3BUsXdYZA/edit?pli=1#gid=328392192

Obviously, ordered_sample performs very well when k is much greater than n. However, it is very slow in the opposite case (i.e. n >> k).

From the results, it looks like we should use ordered_sample when k > 10n; otherwise, it would be better to simply sort the results obtained by direct sampling.
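The selection rule can be sketched as below. The counting-based routine is a simple stand-in used only for illustration; it is not the package's ordered_sample algorithm, and whether the k > 10n threshold suits this stand-in has not been measured. All function names are hypothetical.

```julia
# Strategy 1: draw k indices from 1:n with replacement, then sort them.
sample_then_sort(n::Integer, k::Integer) = sort!(rand(1:n, k))

# Strategy 2 (stand-in): tally how often each index is drawn, then emit the
# indices in increasing order. The output is sorted by construction.
function sample_by_counting(n::Integer, k::Integer)
    tally = zeros(Int, n)
    for _ in 1:k
        tally[rand(1:n)] += 1
    end
    out = Vector{Int}(undef, k)
    pos = 0
    for i in 1:n
        for _ in 1:tally[i]
            pos += 1
            out[pos] = i
        end
    end
    return out
end

# Pick a strategy using the k > 10n rule suggested by the benchmark.
sorted_sample_sketch(n, k) = k > 10n ? sample_by_counting(n, k) : sample_then_sort(n, k)

a = sorted_sample_sketch(5, 100)     # k > 10n: counting path
b = sorted_sample_sketch(1000, 10)   # n >> k: sort path
```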

cc: @one-more-minute

Compute weighted mean from a dataframe?

I can only compute a weighted mean from a DataFrame column after converting the column x to a float array. I don't understand why, since DataArray <: AbstractArray is true.

julia> df = DataFrame(x = rand(3),w=rand(3))
3x2 DataFrame
| Row | x        | w         |
|-----|----------|-----------|
| 1   | 0.226874 | 0.266516  |
| 2   | 0.346735 | 0.57437   |
| 3   | 0.232796 | 0.0512731 |

julia> mean(df[:x],WeightVec(df[:w]))
ERROR: `start` has no method matching start(::WeightVec{Float64,DataArray{Float64,1}})
 in reduced_dims at reducedim.jl:17
 in mean at /Users/florianoswald/.julia/v0.3/DataArrays/src/reducedim.jl:335

julia> mean(array(df[:x]),WeightVec(df[:w]))
0.30438070685782537

julia> methods(mean)
...
mean(v::AbstractArray{T,N},w::WeightVec{W<:Real,Vec<:AbstractArray{T<:Real,1}}) at /Users/florianoswald/.julia/v0.3/StatsBase/src/weights.jl:234
...

julia> typeof(df[:x])
DataArray{Float64,1} (constructor with 1 method)

julia> DataArray <: AbstractArray
true

Is sample! intended to be exported?

The documentation for sampling describes both sample and sample! but the latter is not exported.

[PkgEval] StatsBase may have a testing issue on Julia 0.4 (2014-09-30)

On Julia 0.4

On 2014-09-29 the testing status was "Tests pass."
On 2014-09-30 the testing status changed to "Tests fail, but package loads."

"Tests pass." means that PackageEvaluator found the tests for your package, executed them, and they all passed.

"Tests fail, but package loads." means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with `using` worked.

This error on Julia 0.4 is possibly due to recently merged pull request JuliaLang/julia#8493.
This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.

Test log:

>>> 'Pkg.add("StatsBase")' log
INFO: No packages to install, update or remove
INFO: Package database updated

>>> 'using StatsBase' log
Julia Version 0.4.0-dev+856
Commit 46c7bbf (2014-09-30 04:44 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

>>> test log
ERROR: test error in expression: fit(Histogram,[]).weights == []
`histrange` has no method matching histrange(::Array{Any,1}, ::Int64, ::Symbol)
 in anonymous at test.jl:83
 in do_test at test.jl:47
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in anonymous at no file:24
 in include at ./boot.jl:245
 in include_from_node1 at loading.jl:128
 in process_options at ./client.jl:285
 in _start at ./client.jl:354
 in _start_3B_3624 at /home/idunning/julia04/usr/bin/../lib/julia/sys.so
while loading /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/hist.jl, in expression starting on line 3
while loading /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/runtests.jl, in expression starting on line 21
Running tests:
 * mathfuns.jl ...
 * weights.jl ...
 * moments.jl ...
 * scalarstats.jl ...
 * deviation.jl ...
 * cov.jl ...
 * counts.jl ...
 * ranking.jl ...
 * empirical.jl ...
 * hist.jl ...

INFO: Testing StatsBase
==============================[ ERROR: StatsBase ]==============================

failed process: Process(`/home/idunning/julia04/usr/bin/julia /home/idunning/pkgtest/.julia/v0.4/StatsBase/test/runtests.jl`, ProcessExited(1)) [1]

================================================================================
INFO: No packages to install, update or remove
ERROR: StatsBase had test errors
 in error at error.jl:21
 in test at pkg/entry.jl:719
 in anonymous at pkg/dir.jl:28
 in cd at ./file.jl:20
 in cd at pkg/dir.jl:28
 in test at pkg.jl:68
 in process_options at ./client.jl:213
 in _start at ./client.jl:354
 in _start_3B_3624 at /home/idunning/julia04/usr/bin/../lib/julia/sys.so


>>> end of log

Deprecate kernel density estimator

I would like to remove the kde functionality in favour of the new KernelDensity.jl package (see discussion in #58).

@dcjones I know you use this in Gadfly.jl: how can we do this so as to avoid breaking anything?

RunningMoments code: does it belong here?

I have implemented some Julia code to compute the first four moments of an array (right now just Array{T, 1}) in a single pass. It is essentially a Julia port of the C++ version here.

This code also allows users to compute "running" versions of these moments. This means that if users need to track moments for data that is becoming available in smaller chunks, no computation will be repeated as new data is added to the sample. I did some testing against the code here and these are my results:

julia> moms(x) = [mean(x), std(x), var(x), skewness(x), kurtosis(x)]
moms (generic function with 1 method)

julia> x = randn(5000000);

julia> @time run_m = RunningMoments(); push!(run_m, x); moms(run_m);
elapsed time: 3.666e-6 seconds (128 bytes allocated)

julia> @time moms(x);
elapsed time: 0.032538688 seconds (67904 bytes allocated)

julia> moms(x) - moms(run_m)
5-element Array{Float64,1}:
-1.1601e-17
-2.85327e-14
-5.70655e-14
-2.08167e-17
 3.75255e-13

We can also see that adding data little by little results in the correct moments (this runs without any issues):

srand(42)
N = 70
data = rand(N)
run_m = RunningMoments()

push!(run_m, data[1])

for i=2:N
    push!(run_m, data[i])
    @test_approx_eq moms(data[1:i]) moms(run_m)
end
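
For readers unfamiliar with the one-pass idea, here is a minimal sketch limited to the mean and variance, using Welford's update. It is not the RunningMoments code from this issue: the `RunningMeanVar` type and the `running_mean` / `running_var` helpers are hypothetical names chosen for illustration, and skewness and kurtosis would need additional accumulators.

```julia
# Hypothetical accumulator for illustration (not the RunningMoments type above).
mutable struct RunningMeanVar
    n::Int          # number of observations seen so far
    mean::Float64   # running mean
    m2::Float64     # running sum of squared deviations from the mean
end
RunningMeanVar() = RunningMeanVar(0, 0.0, 0.0)

# Welford's update: fold in one observation without revisiting old data.
function Base.push!(s::RunningMeanVar, x::Real)
    s.n += 1
    δ = x - s.mean
    s.mean += δ / s.n
    s.m2 += δ * (x - s.mean)
    return s
end

running_mean(s::RunningMeanVar) = s.mean
running_var(s::RunningMeanVar) = s.m2 / (s.n - 1)   # unbiased; requires n ≥ 2
```

Each `push!` costs constant time and memory, which is what makes it suitable for data arriving in chunks.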

My question is whether this code belongs in src/moments.jl in this package. Right now I have it sitting in a repo where I keep useful stuff that doesn't belong anywhere else. Some tests can be found here.

Let me know if anyone thinks this belongs here. If it does, we will probably need to ask John Cook for permission before including it, as I am not sure what license is attached to the content of his blog (link from above).

More tests are needed

For such a fundamental package, we only have a small test script 01.jl which has about 50 lines.

Despite being widely used, most of the functions have not been properly tested. We have to work together to add test cases for them.

Weighted Sampling without replacement

We now support non-weighted sampling (with & without replacement) + weighted sampling with replacement.

Weighted sampling without replacement is not supported yet. This is not as easy to implement.

A naive implementation might look like

n = length(a)     # the pool size
k = length(x)     # the number of samples to draw

for j = 1 : k
    i = wsample(1:n, w)  # sample a value in 1:n according to weights w
    x[j] = i
    w[i] = 0     # preclude i from being sampled in future
end

Unfortunately, this is mathematically incorrect.

Let π be the weights, summing to 1, and suppose you want to draw x1 and x2 such that x1 != x2. Then the marginal distribution of x1 is not π.

This can be illustrated with a simple example. Suppose π = [0.4, 0.6], and you want to draw x1 and x2 that are different. Clearly, (x1 = 1, x2 = 2) and (x1 = 2, x2 = 1) should have the same joint probability, which means that the marginal probability p(x1 = 1) equals 0.5 (instead of 0.4).
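
The two-element example can be checked by direct enumeration. The sketch below computes the exact marginal of the first draw when two independent weighted draws are conditioned on being distinct, which is the target distribution assumed in this issue. The function name `first_draw_marginal` is hypothetical and only covers the case k = 2.

```julia
# Exact marginal of x1 when (x1, x2) are two independent weighted draws
# conditioned on x1 != x2. Illustration only; handles k = 2.
function first_draw_marginal(p::AbstractVector{<:Real})
    n = length(p)
    # Joint probability of each ordered pair, with repeated pairs excluded.
    joint = [i == j ? 0.0 : float(p[i] * p[j]) for i in 1:n, j in 1:n]
    joint ./= sum(joint)               # renormalize, i.e. condition on x1 != x2
    return vec(sum(joint, dims = 2))   # sum over x2 to get the marginal of x1
end

marginal = first_draw_marginal([0.4, 0.6])
```

For π = [0.4, 0.6] both entries of `marginal` come out equal, matching the argument above that the conditioned marginal differs from π.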

A correct way to do this is rejection sampling, which may look like:

x = wsample(1:n, w, k)       # draw a sample sequence (with replacement)
while !allunique(x)          # repeat until no element is repeated
    x = wsample(1:n, w, k)   # sample again
end

This is very inefficient: as k increases, it becomes very unlikely that a drawn sequence has all distinct elements.

Any suggestions?

Long standing travis failure

This package, Distributions.jl, and any downstream packages have failed Travis tests for weeks. The issue is mainly due to the renaming of libRmath to libRmath-julia in Julia Base (see the messages below).

The command "julia -e 'Pkg.init(); run(`ln -s $(pwd()) $(Pkg.dir("StatsBase"))`); Pkg.pin("StatsBase"); Pkg.resolve()'" exited with 0.
$ julia ./runtests.jl
Running tests:
 * test/means.jl ...
 * test/scalarstats.jl ...
 * test/counts.jl ...
 * test/ranking.jl ...
 * test/corr.jl ...
 * test/sampling.jl ...
ERROR: error compiling __sample!#38__: error compiling ordered_sample!: could not load module libRmath-julia: libRmath-julia: cannot open shared object file: No such file or directory

We need to address this and restore the green travis logo!

cc: @ViralBShah @staticfloat

Faster Categorical sampler

The method described by the following paper claims to be 10x faster than prior ones:

Fast Generation of Discrete Random Variables. George Marsaglia, Wai Wan Tsang, Jingbo Wang. Journal of Statistical Software. July 2004, Volume 11, Issue 3.

Anyone interested in taking a shot?

I am not sure whether it outperforms the alias table method that we currently use. It is not urgent, but it would be interesting to do a comparison.

mad computes estimation of normal std

I find it strange that the mad! function does not compute the MAD but rather 1.4826*MAD (estimate of the std of a normal distribution). Is there a reason for this? The doc does not even mention this.
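
To make the distinction concrete, here is a small sketch of the two quantities. The names `raw_mad` and `scaled_mad` are hypothetical and are not the package's API; the constant 1.4826 is approximately 1 / Φ⁻¹(3/4), the factor that makes the scaled value a consistent estimate of the standard deviation for normally distributed data.

```julia
using Statistics

# Plain median absolute deviation about the median.
raw_mad(x) = median(abs.(x .- median(x)))

# Scaled variant: estimates the standard deviation under normality.
scaled_mad(x; constant = 1.4826) = constant * raw_mad(x)

x = [1, 2, 3, 4, 100]
raw_mad(x)      # unscaled MAD
scaled_mad(x)   # MAD multiplied by 1.4826
```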

ar() method for fitting an autoregressive time series model

Fitting an autoregressive time series model to data (by default selecting the model order using AIC) is needed on many occasions beyond time series analysis itself. For instance, it is used for computing MCMC convergence diagnostics. R considers the function so important that it ships with its "base" code rather than being part of a package. My view is that we need this function. Before embarking on coding it, I wanted to discuss it and get your views: do you also find it worth having somewhere in JuliaStats? If yes, would you want it in Stats or in a separate JuliaStats package called TimeSeries?

Rename range

range is now a function in Base. The signatures don't conflict, but the functionality is very different, so we should probably rename range in StatsBase.

License?

StatsBase.kde deprecated

Still getting this deprecated warning. Will the code be corrected to use KernelDensity.kde?

WARNING: StatsBase.kde(...) is deprecated, use KernelDensity.kde(...) instead.
 in kde at /home/benjamin/.julia/v0.3/StatsBase/src/deprecates.jl:60

in tiedrank: amap not defined

julia> cor_spearman(data["rmsd"],data["contactos"],true)
ERROR: in tiedrank: amap not defined
 in tiemapslicesdrank at /home/dzea/.julia/Stats/src/Stats.jl:112
 in cor_spearman at /home/dzea/.julia/Stats/src/Stats.jl:152

julia> versioninfo()
Julia Version 0.2.0-714.r1840
Commit 184054dcde 2013-03-24 16:23:04
Platform Info:
  OS_NAME: Linux
Using: (64-bit interface)
  Blas: libopenblas
  Lapack: libopenblas
  Libm: libopenlibm

possible name changes

How about renaming range to spread? I would expect range(x) to give (minimum(x), maximum(x)) but it gives maximum(x)-minimum(x) instead.
Since minmax(x) doesn't compute (min(x),max(x)), the naming is a little unfortunate now; perhaps range(x) would be a better name for that.
The name prctile strikes me as unnecessarily abbreviated. Why not just spell it out as percentile?
Should the *rank functions maybe start with rank instead so that tab completion gives the options easily? I actually think that rankordinal, rankcompete, rankdense and ranktied sound pretty good – maybe even better than the current order, as well as being more tab-completion friendly.
Maybe change cor_spearman to spearmancor and cor_kendall to kendallcor? Or keeping with the tab-completion friendly order corspearman and corkendall?
Rename inverse_rle to irle?
partial_autocor is a bit of a mouthful and I'm not super excited about the underscore. How about autocorpar, or parautocor, or at least partialautocor, so that there's no underscore?
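
As an illustration of the first suggestion above, the sketch below contrasts the two meanings. `spread` is the hypothetical name from the proposal, defined here only for the example; `extrema` is an existing Base function that returns the (minimum, maximum) pair.

```julia
# Hypothetical name from the proposal; defined here for illustration only.
spread(x) = maximum(x) - minimum(x)

x = [3, 1, 4, 1, 5]
width = spread(x)       # difference between the largest and smallest value
lo, hi = extrema(x)     # Base function returning (minimum, maximum)
```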

Data standardization (z-scores)

I am considering adding the following functionality:

z = (x - μ) / σ

This is widely used in statistics.

There are several questions before I implement this:

Naming.

People call it by different names: z-score, z-value, standard score, normal score, etc. (See Wikipedia.)
Both MATLAB and SciPy use zscore. We might follow this convention?
API design.

There are basically two kinds of usage: (1) you just have the data, so you compute both the mean and the standard deviation and then use them to compute the z-scores. In this case you may want to return the z-scores as well as the mean and standard deviation (for future use); (2) you already know the mean and standard deviation, and simply apply them to compute the z-scores.

I have the following proposals. I am not sure which one is the best (or whether there are better ways):

API-A: The outputs depend on the inputs (but still type stable for each method)
```
z, μ, σ = zscore(x)     
z, μ, σ = zscore(x, dim)
z = zscore(x, μ, σ)
z = zscore(x, μ, σ, dim)
```
API-B: Use different function names for these two purposes:
```
μ, σ = mean_and_std(x)
μ, σ = mean_and_std(x, dim)
zscore(x, μ, σ) = ...   # compute z using μ and σ
zscore(x, μ, σ, dim)
zscore(x) = ((μ, σ) = mean_and_std(x); zscore(x, μ, σ))
zscore(x, dim) = ((μ, σ) = mean_and_std(x, dim); zscore(x, μ, σ, dim))
```
API-C: Always return the triple z, μ, σ; callers can use keyword arguments to supply whatever they already know. The output might seem a little redundant when the purpose is only to apply a known mean and standard deviation to compute the z-scores.
```
z, μ, σ = zscore(x; mean=..., std=...)
z, μ, σ = zscore(x, dim; mean=..., std=...)
```
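
To make the trade-off easier to judge, here is a minimal sketch in the style of API-B for the vector case, without the dim argument. The `_sketch` suffixes mark these as hypothetical helpers written for this illustration, not the functions the proposal would add.

```julia
using Statistics

# Hypothetical helper: compute both statistics, reusing the mean for std.
function mean_and_std_sketch(x)
    m = mean(x)
    return m, std(x; mean = m)
end

# Apply a known mean and standard deviation.
zscore_sketch(x, μ, σ) = (x .- μ) ./ σ

# Estimate the statistics from the data, then standardize.
function zscore_sketch(x)
    μ, σ = mean_and_std_sketch(x)
    return zscore_sketch(x, μ, σ)
end

z = zscore_sketch([1.0, 2.0, 3.0])
```

With this split, a caller who needs μ and σ later calls `mean_and_std_sketch` explicitly, and every method of `zscore_sketch` returns only the standardized values.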

Any thoughts?

StatsBase vs DataFrames; tagging of versions messed up?

using DataFrames gives the error

julia> using DataFrames
ERROR: Stats not found
 in require at loading.jl:39
 in reload_path at loading.jl:146
 in _require at loading.jl:59
 in require at loading.jl:43
while loading /home/theodore/.julia/DataFrames/src/DataFrames.jl, in expression starting on line 5

On the other hand, using StatsBase seems to be working. I ran Pkg.update() and haven't used the Pkg.checkout() or Pkg.free() functions, so this does not seem to be a problem specific to my package management. Do we need to bump DataFrames, or is something else going wrong?

Compute statistics along dimensions

It is useful to compute certain statistics along specific dimensions. I am opening this thread to drive the development along this line.

Some other functions, like mode and quantile, require more sophisticated data structures to compute, and are thus not included in this list. We may look at those in the future.
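
As a point of reference for what "along a dimension" means, the sketch below uses the `dims` keyword from the Statistics standard library in current Julia. This is an assumption about the reader's Julia version; at the time of this issue the dimension was passed as a positional argument.

```julia
using Statistics

A = [1.0 2.0 3.0;
     4.0 5.0 6.0]

col_means = mean(A, dims = 1)   # 1×3 matrix: one mean per column
row_stds  = std(A, dims = 2)    # 2×1 matrix: one standard deviation per row
```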

Add documentation

This package comprises a collection of useful functions. It is good to document them, so that people know they are there instead of reinventing the wheel whenever they need them.

Obviously, the documentation is lacking.

@johnmyleswhite Can you document some of the functions that you write in the readme? I will add the documentation of my part.
