juliastats / distributions.jl

A Julia package for probability distributions and associated functions.

License: Other

Julia 100.00%

julia statistics data-science probability-distributions

distributions.jl's Introduction

Distributions.jl

A Julia package for probability distributions and associated functions. Particularly, Distributions implements:

Moments (e.g., mean, variance, skewness, and kurtosis), entropy, and other properties
Probability density/mass functions (pdf) and their logarithm (logpdf)
Moment generating functions and characteristic functions
Sampling from a population or from a distribution
Maximum likelihood estimation

Note: The functionalities related to conjugate priors have been moved to the ConjugatePriors package.

Resources

Documentation: https://JuliaStats.github.io/Distributions.jl/stable/
Support: We use GitHub for the development of the Julia package Distributions itself. For support and questions, please use the Julia Discourse forum. Also, for casual conversation and quick questions, there are the channels #helpdesk and #statistics in the official Julia chat (https://julialang.slack.com). To get an invitation, please visit https://julialang.org/slack/.

Contributing

Reporting issues

If you need help or an explanation of how to use Distributions, ask in the forum (https://discourse.julialang.org) or, for informal questions, visit the chat (https://julialang.slack.com).

If you have found a bug in Distributions, first check that it has not already been reported in the repository's issues. If it has not, file a new issue and include the version of the package you are using, which you can get with this command in the Julia REPL:

julia> ]status Distributions

Be exhaustive in your report, summarize the bug, and provide: a Minimal Working Example (MWE), what happens, and what you expected to happen.

Workflow with Git and GitHub

To contribute to the package, fork the repository on GitHub, clone it, and make your modifications on a new branch; do not commit modifications on master. Once your changes are made, push them to your fork and create the Pull Request on the main repository.

Requirements

Distributions is a central package that many others rely on, so the following are required for contributions to be accepted:

Docstrings must be added to all interface and non-trivial functions.
Tests validating the modified behavior must be added in the test folder. If new test files are added, do not forget to add them in test/runtests.jl. Cover possible edge cases. Run the tests locally before submitting the PR.
At the end of the tests, Test.detect_ambiguities(Distributions) is run to check method ambiguities. Verify that your modified code did not yield method ambiguities.
Make corresponding modifications to the docs folder, build the documentation locally and verify that your modifications display correctly and did not yield warnings. To build the documentation locally, you first need to instantiate the docs/ project:
```
julia --project=docs/
pkg> instantiate
pkg> dev .
```
Then use julia --project=docs/ docs/make.jl to build the documentation.

Citing

See CITATION.bib, or use the DOI badge above.

distributions.jl's People

kmsquire nfoti sbos vtjnash simonbyrne gusl jiahao philipp-neubauer aviks mewo2 malmaud ingmarschuster jamesjohndrow eriktaubeneck doobwa binarybana simonster alanedelman catchmeifyoutry gaz3ll3 alyst mmaechler michielcottaar cbecker micklat davidanthoff chiunghua aerotuck bigcrunsh dirkschumacher eauvero vunguyene bicycle1885 anj1 setzler prgao carloslesmes goretkin sglyon hua-zhou getzdan simondanisch jariou afniedermayer fairbrot panlanfeng chrodan wildart kshramt nagyist zenna awpxii daubyp eford brian-j-smith ranjanan travisbarrydick slundberg limwei islandsofsleep augustinyiptong jeffku77 rawls238 yuyichao woest bobernhardsson darf-ferrara evanmason etotheipluspi jdtuck mlubin gajomi amgad-naiem leclere maximsch2 danielarndt laohuyanggp abhijithch jmxpearson krisdm lmescheder elcritch ph0non hhzzhz ericproffitt alexndrejoly chipkent jkbest2 copperheadcrotch halmoni100 davidavdav tkelman stepan-a robertdj sammy-suyama hedgefair anhecon mbeltagy ajdunlap anriseth

distributions.jl's Issues

MAP estimation

Maximum a Posteriori estimation is widely used in Bayesian analysis.

It would be very useful to implement MAP estimation for some common distributions in this package.

I am starting this issue to invite suggestions about the interface for such methods.

Version for 0.1.2?

Could you update METADATA so that it is possible to use Distributions with 0.1.2?

RFC: Functions mapping distributions to distributions (i.e. combinator language)

@mewo2 had the really good idea of exposing a set of primitives for transforming and combining distributions. This issue is meant to expand on that suggestion and call for suggestions. Here are some basic functions on distributions that return distributions:

truncate: Truncate a distribution at specified lower and upper bounds
shift: Alter the location parameter of a distribution
scale: Alter the scale parameter of a distribution
update: Perform a conjugate Bayesian update of a distribution in response to data
mix: Create a mixture model from any set of distributions

For many popular distributions, multiple dispatch can be used to make these methods very efficient.

Separate distributions by univariate/multivariate distinction?

As I've started to add multivariate distributions like multinomials and multivariate normals, it's clear that the distinction between univariate and multivariate distributions needs to go directly into the type system for the Distributions package. Methods like rand() need to work differently for multivariate distributions, while methods like quantile() need to throw an error for multivariate distributions.

I made an initial pass at this in my fork of Julia in the past: johnmyleswhite/julia@0b18d8d

A lot more thought needs to go into getting this distinction totally clean. Also, because of single inheritance, we need to decide whether the primary distinction is between discrete and continuous or between univariate and multivariate.

pventropy not defined

When I run the tests, I get ERROR: pventropy not defined. It is called by the entropy function for Multinomial and Categorical, but I can't find this function in Base or in Stats.

MixtureModel fixes: pdf does not return density, and assumes Univariate distribution

I tried to create a Mixture of (Multivariate) Gaussians, and noticed that
a) the method pdf(d::MixtureModel, x::Real) does not return density p, and
b) assumes that the mixture components are univariate distributions (i.e. x::Real)

Adding return p and changing x::Real to just x resolves the problem for me.

Inconsistent Behavior of quantile

quantile sometimes generates out-of-support values, and the behavior is inconsistent across distributions:

julia> quantile(Bernoulli(), 1)
NaN

julia> quantile(DiscreteUniform(0, 2), 1)
3

julia> quantile(Normal(), 1)
Inf

We should agree on a consistent behavior when implementing this.
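One candidate convention is the generalized inverse: return the smallest support point whose cumulative probability reaches p, so that p = 1 maps to the largest support point rather than to NaN or to a value outside the support. The sketch below illustrates this for a finite discrete distribution. It is only an illustration of the convention; `discrete_quantile` and its arguments are hypothetical names, not part of Distributions.

```julia
# Hypothetical helper: generalized-inverse quantile for a finite discrete
# distribution given by sorted support points and their probabilities.
function discrete_quantile(support::AbstractVector{<:Integer},
                           probs::AbstractVector{<:Real}, p::Real)
    0 <= p <= 1 || throw(DomainError(p, "p must lie in [0, 1]"))
    acc = 0.0
    for (x, w) in zip(support, probs)
        acc += w
        # Return the first support point whose cumulative probability reaches p.
        acc >= p && return x
    end
    # Round-off can leave acc slightly below 1; stay inside the support.
    return last(support)
end

discrete_quantile(0:2, fill(1 / 3, 3), 1)   # stays inside the support 0:2
```

Under this convention a discrete quantile never leaves the support, while continuous distributions with unbounded support would still return Inf at p = 1.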

modes no longer available in Stats

Import of the Distributions package yields the following warning:

Warning: could not import Stats.modes into Distributions

Univariate distributions that fail tests

I re-organized the tests for univariate distributions as of 696abb4.

The distributions that currently fail the tests (for various reasons) have been temporarily commented out.

Below is a list of distributions that we need to fix:

Please tick the items that have been fixed and pass all tests.

Characteristic function

Add a function which gives the characteristic function for each distribution. This would make it easier to implement the fft-based kde (#29 and #31) for different kernels.

I don't know what the best name would be: characteristic, charfun, cf?
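For reference, the characteristic function of a normal variable with mean mu and standard deviation sigma is exp(i*mu*t - sigma^2*t^2/2). A minimal sketch of what such a method could compute follows; `normal_cf` is a placeholder name, since the naming question above is still open.

```julia
# Characteristic function of Normal(mu, sigma) evaluated at t.
# `normal_cf` is a placeholder name, not an existing function.
normal_cf(mu::Real, sigma::Real, t::Real) = exp(im * mu * t - 0.5 * sigma^2 * t^2)

# The modulus depends only on sigma and t; the mean only rotates the phase.
abs(normal_cf(2.0, 1.0, 1.0))
```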

Possible typo of insupport() methods in Distributions.jl

Distributions.jl:961 and Distributions.jl:999 define isupport() methods where it appears that insupport() was intended.

Appear to need "using Base" or something like that in Distributions

This is with the Distributions.jl from JuliaLang but I think the behavior will be the same with this version. The integer_valued predicate is defined in the REPL but apparently not accessible within the Distributions module.

julia> require("Distributions")

julia> using Distributions

julia> insupport(Poisson(), 3)
no method integer_valued(Int64,)
 in method_missing at base.jl:81
 in insupport at /home/bates/.julia/Distributions/src/Distributions.jl:695

julia> insupport(Poisson(), 3.)
no method integer_valued(Float64,)
 in method_missing at base.jl:81
 in insupport at /home/bates/.julia/Distributions/src/Distributions.jl:695

What needs to be added to the Distributions module?

Thorough testing of the package

This package is of fundamental significance to statistics. It is important to ensure correctness through more thorough testing.

I caught a bug (see #19) when developing another package. This bug was supposed to be caught by the tests associated with this package.

I looked at the test file. It seems to be far from complete in its current shape.

For example, there is a test that tests the pdf function for MultivariateNormal, but unfortunately the covariance matrix is chosen to be the simplest one -- an identity matrix -- so that z and cov \ z and cov * z are just the same. This led to the failure to detect the bug in #19.
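The point can be seen in a few lines. The sketch below compares the correct quadratic form with a deliberately wrong one; `quad_buggy` is a made-up bug for illustration, not the actual defect behind #19, and the non-identity matrix is just an example of a well-conditioned covariance.

```julia
using LinearAlgebra

quad_correct(cov, z) = dot(z, cov \ z)   # z' * inv(cov) * z
quad_buggy(cov, z)   = dot(z, cov * z)   # hypothetical bug: multiplies instead of solving

z  = [3.0, 4.0, 5.0]
I3 = Matrix{Float64}(I, 3, 3)
c  = [4.0 -2.0 -1.0; -2.0 5.0 -1.0; -1.0 -1.0 6.0]

# With an identity covariance the two forms agree, so a test cannot tell them apart.
same_under_identity = quad_correct(I3, z) ≈ quad_buggy(I3, z)

# With a non-trivial covariance the wrong form is exposed.
differs_under_c = !(quad_correct(c, z) ≈ quad_buggy(c, z))
```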

Bizarre Multinomial sampling speeds

Our multinomial sampler is always slower than R's rmultinom function. There are two types of sampling we need to speed up:

Repeated samples with a size of 1
A single sample with a large size

In the first case, we are 3x slower than R:

s <- rmultinom(1000000, 1, rep(0.1, 10))

This takes 0.4 seconds on JMW's laptop

rand(Multinomial(10), 1_000_000)

This takes 1.2 seconds on JMW's laptop

In the second case, we are 7000x slower:

s <- rmultinom(1, 1000000, rep(0.1, 10))

This takes 0.0001 seconds on JMW's laptop

rand(Multinomial(1_000_000, 10))

This takes 0.7 seconds on JMW's laptop

Default to `pdf()` for both Discrete and Continuous Distributions

In my fork of the original Distributions code inside of the main Julia repo (johnmyleswhite/julia@0b18d8d) I started to make several changes to the basic ontology of distributions.

One change was to take a counting measure view of discrete distributions and use pdf() for them as well as for continuous distributions. This makes some code much simpler, such as the MixtureModel type I introduced. We could then add pmf(dd::DiscreteDistribution) = pdf(dd).

Make Distributions into immutable types

All of the univariate distributions (and perhaps all of the multivariate ones as well) should probably become immutable types. We should see how much doing this improves benchmarks, which we should start to set up more rigorously.

Problem with beta distribution for parameters less than one

I have l'Ecuyer's TestU01 up and running in the package RNGTest. It is possible to test the uniformity of variates generated within Julia and the good news is most of the distributions I have tested do not fail when transformed by their cdf. However, for parameters less than one the beta distribution fails the SmallCrush test consistently.

using RNGTest

julia> testBeta()=cdf(Beta(0.1,0.2), rand(Beta(0.1,0.2)))
# method added to generic function testBeta

julia> RNGTest.smallcrush(testBeta)

========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        
 Number of statistics:  15
 Total CPU time:   00:04:43.10
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  3  Gap                            3.6e-12
  6  MaxOft                           eps  
  6  MaxOft AD                      1 - eps1
 10  RandomWalk1 H                    eps  
 ----------------------------------------------
 All other tests were passed

It can be either the random beta variates or the beta cdf, but the Gamma distribution works just fine and hence my suspicion is the beta cdf. We use the version from RMath which I think is the best available algorithm so I don't know how to solve the problem.

Generalised truncation operator

Looking over the code I've written for the truncated normal distribution (#63), it seems that a large part of it would be generalisable to truncated distributions generally (or at least truncated continuous univariate distributions). Is it worth adding a truncate method, which returns either a specialised type (e.g. truncate(d::Normal) returns a TruncatedNormal, truncate(d::Uniform) returns a Uniform), or some sort of parametric type which wraps the original distribution? It should be relatively straightforward to write slow/inaccurate but usable generic methods for such a type (e.g. rejection sampler for rand), and replace these with specialised versions as they seem necessary.
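As a rough illustration of the "slow but usable" generic fallback mentioned above, a rejection sampler only needs a way to draw from the untruncated distribution. In this sketch `rand_truncated`, the `draw` callback, and `maxiter` are all hypothetical names, and a standard normal drawn with `randn` stands in for an arbitrary distribution.

```julia
using Random

# Hypothetical generic fallback: draw from the untruncated distribution
# until a sample lands in [lower, upper].
function rand_truncated(rng::AbstractRNG, draw, lower::Real, upper::Real;
                        maxiter::Integer = 10_000)
    lower < upper || throw(ArgumentError("lower must be less than upper"))
    for _ in 1:maxiter
        x = draw(rng)
        lower <= x <= upper && return x
    end
    error("no sample accepted after $maxiter draws; the truncation region may have negligible mass")
end

# Standard normal truncated to [-1, 2].
x = rand_truncated(MersenneTwister(1), randn, -1.0, 2.0)
```

The acceptance rate equals the probability mass of the truncation interval, which is why specialised methods would be worth adding for narrow or far-tail intervals.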

show() methods

As we add more complex distributions like the multivariate normal it might be helpful to make formal show() methods for different distributions. I would propose the following defaults:

Show states the type of the distribution
Show shows the parameters of the distribution line-by-line
Show provides the mean and variance of the distribution as the last two lines

The last pieces are arguably superfluous, but I think they would help when working with things like the Beta and Gamma distributions for which the mean and variance are never obvious to me without doing an explicit calculation.

Bandwidth for KDE's

I've merged my KDE branch so that we now have Gaussian KDE's via the FFT. But the bandwidth suggestion from the original paper on Algorithm AS 176 tends to oversmooth multimodal data. It would be great to have a reference for a gentler bandwidth heuristic.

Gaussian distribution with different kinds of covariance

It is not uncommon that covariance matrices of special structures (e.g. diagonal covariance, or those in the form of s * eye) are used in practice.

The NumericExtensions.jl package provides several covariance types that implement efficient computation on different kinds of covariance matrices, while maintaining a uniform interface.

I have been working on a new Gaussian distribution: https://github.com/lindahua/BayesModels.jl/blob/master/src/gauss.jl

This new type allows covariance of different structures to be used, and leverages the facilities in NumericExtensions for efficient computation.

I think the Distributions.jl package would be the best location to host this. I am considering moving this new class here, and reimplement MultivariateNormal as follows:

immutable MultivariateNormal{Cov<:AbstractPDMat}
    mu::Vector{Float64}
    cov::Cov
    ldcov::Float64    # log determinant of cov
end

I can open a new branch to implement this if there is no objection, and merge these efforts after review.

Incorrect calculation of logpdf for multivariate normal distribution

I implemented a multivariate gaussian distribution in my Bayesian package, and compared the results from my implementation with those yielded by Distributions.jl. I found that the logpdf method for MultivariateNormal produces incorrect results.

Here is a simple example:

mu = zeros(3)
c = [4. -2. -1.; -2. 5. -1.; -1. -1. 6.]
g = MultivariateNormal(mu, c)
x = [3., 4., 5.]
logpdf(g, x)

This produces -17.348506846053837.

The correct result should be -15.75539253001834. -- You may confirm this with MATLAB or other scientific software.

Also, cond(c) = 3.43, so it is in pretty good condition. Hence, such a big difference should not be due to numerical errors. There should be a bug somewhere in this package.
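The expected value can be checked independently using only the standard library. The sketch below evaluates the multivariate normal log-density through a Cholesky factorization; `mvnormal_logpdf` is a hypothetical standalone helper, not the package's implementation.

```julia
using LinearAlgebra

# Log-density of a multivariate normal with mean `mu` and covariance `cov` at `x`.
function mvnormal_logpdf(mu::AbstractVector, cov::AbstractMatrix, x::AbstractVector)
    F = cholesky(Symmetric(cov))   # cov = L * L'
    z = x - mu
    quad = dot(z, F \ z)           # z' * inv(cov) * z
    return -0.5 * (length(mu) * log(2 * pi) + logdet(F) + quad)
end

mu = zeros(3)
c = [4.0 -2.0 -1.0; -2.0 5.0 -1.0; -1.0 -1.0 6.0]
x = [3.0, 4.0, 5.0]
lp = mvnormal_logpdf(mu, c, x)
```

By hand, det(c) = 83 and the quadratic form is 1791/83, which gives -(3 log(2π) + log 83 + 1791/83) / 2, in agreement with the value -15.7553925... quoted above.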

I think this package needs a thorough test suite to ensure the correctness of implementation -- the role of this package is fundamental for statistics.

Typing issues and vectorization

First, thanks for putting in all the work to make this package. My research is mostly Bayesian statistics. I recently discovered Julia and it's a great environment for coding up MCMC. Obviously I'd be nowhere without random number generation. That said, a couple of minor suggestions.

The univariate distribution functions throw an error if any of the parameters are 1-by-1 arrays instead of scalars. It seems this could be easily fixed by checking to see if either parameter is an array and converting it to a scalar. In MCMC one very commonly calculates parameters by a series of matrix multiplications and it gets a bit tedious to have to write a=a[1,1] all the time.
It would be very useful if these functions could return arrays of random numbers of the same dimension as the inputs. For example, if A and B are m-by-n arrays of positive reals, then I'd like to call rand(Gamma(A,B)) and get back an m-by-n array of gamma random numbers with the parameters specified by the corresponding entries of A and B. Obviously I can get arrays of randoms all with the same parameters with the current functionality, but in most cases I need arrays with heterogeneous parameters.

I'd be happy to help on this kind of thing once I'm a little more comfortable with the language.

rand!(d::MultivariateNormal, X::Matrix) has wrong sample mean

Hi, I'm new to Julia, and not very experienced with Github, but I think I found a bug that hasn't been reported yet (I'm using version 105ec97 ).

mu = [6., 1.]
d = MultivariateNormal(mu, eye(2))
mean(rand(d, 1000), 1) # should be approx. mu

# julia> mean(rand(d, 1000), 1)
# 1x2 Float64 Array:
#   0.00947563  0.0472824

The problem appears to be in rand!(d::MultivariateNormal, X::Matrix) which doesn't add the mean value. Here is a fix that just changes the return statement (but maybe there is a more efficient way) at Distributions.jl:1112

  return bsxfun(+, X, mean(d)') # instead of return X

See also Issue #20 on better test coverage.

Batch and inplace logpdf/pdf

I think the following is useful.

r = logpdf(d, x) # x is a set of samples
logpdf!(r, d, x)  # r is the output array

Most of the important distributions (except for Uniform distribution) are exponential family. It means that the core part in computing logpdf is to evaluate dot-product between parameters and the sufficient statistics. When evaluating logpdf for a set of samples, BLAS functions can be used to speed up the computation (often drastically).

Currently, batch evaluation is implemented for many univariate distributions, but it is still lacking for some multivariate distributions.

In-place evaluation is also important. In a lot of inference/estimation algorithms (e.g. EM), one has to repeatedly evaluate logpdf at each iteration (on the same set of samples). It would be much more efficient to put the results into a pre-allocated array, rather than creating a new array every time.

Generally, I think we can do it in this way: implement a specialized method logpdf! for each distribution type, and write a logpdf on abstract distributions in the following way

function logpdf{T<:Real}(d::UnivariateContinuousDistribution, x::Array{T})
    r = Array(T, size(x))
    logpdf!(r, d, x)
    r
end

function logpdf{T<:Real}(d::MultivariateContinuousDistribution, x::Matrix{T})
    r = Array(T, size(x, 2))
    logpdf!(r, d, x)
    r
end

Similar things can be done for discrete distributions, and we should do the same for pdf.
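In current Julia syntax the same allocate-then-fill pattern might look like the sketch below. To keep it self-contained it uses a standalone normal log-density rather than distribution types; `normal_logpdf` and `normal_logpdf!` are hypothetical names chosen for the illustration.

```julia
# Scalar log-density of Normal(mu, sigma) at x.
normal_logpdf(mu::Real, sigma::Real, x::Real) =
    -0.5 * log(2 * pi) - log(sigma) - 0.5 * ((x - mu) / sigma)^2

# In-place batch version: writes into the pre-allocated output `r`.
function normal_logpdf!(r::AbstractVector, mu::Real, sigma::Real, x::AbstractVector{<:Real})
    length(r) == length(x) || throw(DimensionMismatch("r and x must have equal length"))
    for i in eachindex(r, x)
        r[i] = normal_logpdf(mu, sigma, x[i])
    end
    return r
end

# Allocating batch version, defined once in terms of the in-place one.
normal_logpdf(mu::Real, sigma::Real, x::AbstractVector{<:Real}) =
    normal_logpdf!(similar(x, Float64), mu, sigma, x)
```

An iterative algorithm such as EM can then allocate `r` once and call `normal_logpdf!(r, mu, sigma, x)` inside the loop.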

`fit` does not work

julia> require("Distributions/test/fit.jl")
ERROR: in fit: _jl_libRmath not defined
in fit at /Users/aviks/.julia/Distributions/src/fit.jl:65
in include_from_node1 at loading.jl:92
in reload_path at loading.jl:112
in require at loading.jl:48
at /Users/aviks/.julia/Distributions/test/fit.jl:15

Sampling functions

Currently, there are several sampling functions:

The functions for sampling based on a probability vector/weight vector are in Distributions.jl
The functions for sampling without replacement are in Stats.jl.

These functions should all live in one of the two packages rather than being split between them. I am fine with either choice, but we need to decide on this.

Direct access to the random sampling functions

I think that for certain applications it would be useful to be able to generate random variables (other than uniform and Gaussian) without having to instantiate the distribution objects. For instance, when writing MCMC samplers the parameters of the distributions will change from one iteration to the next and the overhead of having to create those distribution objects will add up and

I ran a simple example to compare Julia and Python. Generating 1000 samples of a 500-dimensional Dirichlet random vector with rand(Dirichlet(gam), 500) (where gam = rand(Gamma(1), 500)) takes about 150 milliseconds. Doing the same thing with Python takes about 86 milliseconds.

I wrote a simple function in Julia to generate a Dirichlet random vector that is comparable to the Python code for generating 1000 samples of the 500-dimensional vectors. The function is:

function rdirich(alph)
  nt = length(alph)
  t = zeros(nt)
  for i in 1:nt
    t[i] = ccall(("rgamma", Rmath), Float64, (Float64,Float64), alph[i], 1.0)  # shape alph[i], scale 1.0
  end
  t = t / sum(t)
end

where Rmath = :libRmath as in Distributions.jl

I think that the current interface is useful for certain tasks, but having functions like the one above that can generate a sample (or more than one, as in R) without having to create the object would be really useful if the distribution object will be changing a lot.

I am happy to implement this functionality if you think it belongs in the package. Otherwise, I could start a discussion on julia-dev to see where this functionality should go.

Thanks.

Multinomial distribution issue

The following code generates a BoundsError():

rand(Multinomial(1000, [1/3, 1/3, 1/3]), 1000)

The error seems to be stochastic, and related to how the individual counts are generated. The following both result in the same error:

for i in 1:1000
    rand(Multinomial(1000, [1/3, 1/3, 1/3]))
end

for i in 1:(1000^2)
    rand(Multinomial(1, [1/3, 1/3, 1/3]))
end

The same error occurs when using the Categorical distribution for the same purpose.

rand(Categorical([1/3, 1/3, 1/3]), 1000^2)

Move glmtools.jl to the GLM package

At one point I thought it would be a good idea to keep the definitions and methods for link functions in the Distributions package but now I think they more properly belong in the GLM package. I want to incorporate the NumericExtensions package by @lindahua into GLM and use methods from NumericExtensions on the link and inverse link functions. Eventually we probably want to use NumericExtensions in Distributions as well but for the time being I think it is best to compartmentalize.

Draw samples from non-standard discrete distributions using the alias method

See http://www.keithschwarz.com/darts-dice-coins/
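For readers who do not want to follow the link, here is a minimal sketch of the alias method in Vose's formulation: an O(n) table setup followed by O(1) draws. The names `AliasSampler` and `draw` are hypothetical, and this is an illustration of the technique rather than the package's implementation.

```julia
using Random

# Hypothetical alias table: column i keeps outcome i with probability
# accept[i] and otherwise yields alias[i].
struct AliasSampler
    accept::Vector{Float64}
    alias::Vector{Int}
end

function AliasSampler(probs::AbstractVector{<:Real})
    n = length(probs)
    scaled = collect(Float64, probs) .* (n / sum(probs))  # mean of scaled is 1
    accept = ones(Float64, n)
    alias = collect(1:n)
    small = [i for i in 1:n if scaled[i] < 1.0]
    large = [i for i in 1:n if scaled[i] >= 1.0]
    while !isempty(small) && !isempty(large)
        s = pop!(small)
        l = pop!(large)
        accept[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]      # l donates mass to fill column s
        push!(scaled[l] < 1.0 ? small : large, l)
    end
    # Columns left over keep accept == 1.0, which absorbs round-off.
    return AliasSampler(accept, alias)
end

# One draw: pick a column uniformly, then keep it or jump to its alias.
function draw(rng::AbstractRNG, s::AliasSampler)
    i = rand(rng, 1:length(s.accept))
    return rand(rng) < s.accept[i] ? i : s.alias[i]
end

s = AliasSampler([0.5, 0.25, 0.25])
k = draw(MersenneTwister(1), s)
```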

methods for insupport where first argument is a DataType

For many distributions the support doesn't depend on the parameters. In those cases I would like to define methods like

insupport(::Type{Bernoulli}, x::Number) = x == 0 || x== 1

and a fallback method of

function insupport(t::DataType, X::Array)
    for x in X; insupport(t,x) || return false; end
    true
end

The only advantage this will provide is the ability to check

insupport(Bernoulli, v)

instead of

insupport(Bernoulli(), v)

but I have found that saving the user a few keystrokes, especially keystrokes requiring the Shift key, keeps them happier.

Any objections?
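In current Julia syntax, the proposal could be sketched as follows. The `Toy*` types are stand-ins defined only for this example so that it runs without the package; the dispatch pattern is the point, not the names.

```julia
# Stand-in types for illustration only.
abstract type ToyDistribution end

struct ToyBernoulli <: ToyDistribution
    p::Float64
end

# Support check keyed on the type: no instance is needed.
insupport(::Type{ToyBernoulli}, x::Number) = x == 0 || x == 1

# Instances forward to the type-level method.
insupport(d::ToyDistribution, x) = insupport(typeof(d), x)

# Array fallback: every element must be in the support.
insupport(t::Type{<:ToyDistribution}, X::AbstractArray) = all(x -> insupport(t, x), X)

insupport(ToyBernoulli, [0, 1, 1])     # type as tag
insupport(ToyBernoulli(0.3), 0)        # instance still works
```

This only applies to distributions whose support is parameter-free; a distribution such as Uniform(a, b) would still need the instance.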

Improve Categorical performance

On my system, the following diagnostic suggests that our alias table is much slower than a naive sampler for a small number of categories.

using Distributions

d = fit(Categorical, [1, 1, 2, 3])

for n in 10.^[4:7]
    @printf "Naive sampler: n = %d, time = %f\n" n @elapsed rand(d, n)
end

s = sampler(d)

for n in 10.^[4:7]
    @printf "Efficient sampler: n = %d, time = %f\n" n @elapsed rand(s, n)
end

This gives:

Naive sampler: n = 10000, time = 0.007320
Naive sampler: n = 100000, time = 0.001980
Naive sampler: n = 1000000, time = 0.017820
Naive sampler: n = 10000000, time = 0.181685
Efficient sampler: n = 10000, time = 0.039919
Efficient sampler: n = 100000, time = 0.070547
Efficient sampler: n = 1000000, time = 0.642467
Efficient sampler: n = 10000000, time = 6.907287

We might need a polyalgorithm like we started using for Multinomial, but the non-linear growth for the alias table suggests that something deeper is going wrong.

Distributions are too heavy

Some distributions such as Categorical and MultivariateNormal build facilities for sampling upon construction. For example, an instance of AliasTable is constructed when a Categorical distribution is created. This is often not necessary.

If I just want to compute some statistics or evaluate logpdf, the time spent on constructing the sampler, which may be more complicated than the distribution itself, would be wasteful.

What about the following strategy:

Use lightweight distributions. Each instance only maintains the necessary parameters.
rand(d) and rand(d, n) use an algorithm that only relies on the parameters maintained in the distribution.
If fast sampling is important, we can do:

s = sampler(d)
x = rand(s, n)
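A sketch of this split, under the assumption that the distribution stores only its parameters and a separate object holds precomputed sampling state. `ToyCategorical`, `CumulativeSampler`, and `draw` are hypothetical names, and a cumulative-sum table stands in for whatever structure a real sampler would precompute.

```julia
using Random

# Lightweight distribution: parameters only, cheap to construct.
struct ToyCategorical
    probs::Vector{Float64}
end

# Sampler: holds precomputed state (here, cumulative probabilities).
struct CumulativeSampler
    cdf::Vector{Float64}
end

sampler(d::ToyCategorical) = CumulativeSampler(cumsum(d.probs))

# Draw by inverting the cumulative table with a binary search.
function draw(rng::AbstractRNG, s::CumulativeSampler)
    u = rand(rng) * s.cdf[end]
    return min(searchsortedfirst(s.cdf, u), length(s.cdf))
end

# Convenience path: builds the table on every call, fine for occasional draws.
draw(rng::AbstractRNG, d::ToyCategorical) = draw(rng, sampler(d))

d = ToyCategorical([1.0, 0.0])
s = sampler(d)              # pay the setup cost once
rng = MersenneTwister(1)
xs = [draw(rng, s) for _ in 1:100]
```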

Include fit() methods?

I have code to fit several simple distributions to observed data at https://github.com/johnmyleswhite/fit_distributions.jl

Should we move this code into the Distributions module?

Consistent semantics of rand!

When I looked at Distributions.jl, I saw the following implementation of rand! for MultivariateNormal (in lines 1044 - 1055)

function rand!(d::MultivariateNormal, X::Matrix)
  k = length(mean(d))
  m, n = size(X)
  if m == k 
    X = d.covchol.LR'randn(m, n)
  elseif n == k
    X = randn(m, n) * d.covchol.LR
  else
    error("Wrong dimensions")
  end
  return X
end

Clearly, the semantics of this function depend on which dimension happens to match the distribution's dimension. What if X is a square matrix of size k-by-k? From the code, I know that it treats each column as a sample in that case.

In the rand! function for Dirichlet, it just uniformly considers each row as a sample.

Honestly, I think such inconsistent behavior is asking for subtle bugs that are difficult to track down.

I think a much more consistent approach is to make a choice (treating either each column or each row as a sample) and enforce it throughout the package. Since Julia is column-major, I think considering each column as a sample would be a better choice in terms of performance.

Or, we may do it this way:

rand!(d, X, dim)

This lets the user explicitly specify whether columns or rows are treated as samples.
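A sketch of the column-as-sample convention, written against the standard library only. `mvnormal_rand!` is a hypothetical standalone function; it takes the mean and covariance directly instead of a distribution object, rejects any other layout instead of guessing, and adds the mean to every column.

```julia
using LinearAlgebra, Random

# Fill X so that each COLUMN is one draw from N(mu, cov).
function mvnormal_rand!(rng::AbstractRNG, X::AbstractMatrix{Float64},
                        mu::AbstractVector, cov::AbstractMatrix)
    size(X, 1) == length(mu) ||
        throw(DimensionMismatch("X must have length(mu) rows; each column is one sample"))
    L = cholesky(Symmetric(cov)).L   # cov = L * L'
    randn!(rng, X)                   # standard normal entries
    X .= L * X .+ mu                 # correlate, then shift every column by mu
    return X
end

mu = [6.0, 1.0]
X = mvnormal_rand!(MersenneTwister(1), zeros(2, 10_000), mu, [1.0 0.0; 0.0 1.0])
sample_mean = vec(sum(X; dims = 2)) ./ size(X, 2)
```

The product `L * X` allocates a temporary here for clarity; a tuned version could multiply in place.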

Make this the official Distributions module?

@StefanKarpinski, @johnmyleswhite Should we go ahead and submit a pull request on METADATA.jl to make this the official Distributions module?

Depend on Stats?

Should Distributions depend on the Stats package?

rename fit?

It would be better to have a name that conveys that it is doing Maximum Likelihood Estimation.

There are different ways to estimate a distribution from data. MLE is just one approach. The name fit sounds overly generic.

What about mle, mlestimate, fit_mle, or mle_fit?

Changing this will also leave room for implementing other kinds of estimation algorithms in the future.
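To make the distinction concrete, here is what a maximum likelihood fit computes for a normal distribution: the sample mean and the square root of the biased (divide-by-n) variance. `fit_mle_normal` is a hypothetical standalone function that uses one of the candidate names above; it returns a named tuple rather than a distribution object so that it runs without the package.

```julia
# Maximum likelihood estimates for a normal distribution.
function fit_mle_normal(x::AbstractVector{<:Real})
    n = length(x)
    n > 0 || throw(ArgumentError("need at least one observation"))
    mu = sum(x) / n
    # The MLE divides by n, not n - 1, so it differs from the unbiased estimator.
    sigma = sqrt(sum(abs2, x .- mu) / n)
    return (mu = mu, sigma = sigma)
end

est = fit_mle_normal([1.0, 2.0, 3.0])
```

An explicitly named MLE method leaves room for, say, a method-of-moments or unbiased estimator that would return a different sigma for the same data.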

error: Cholesky not defined

I get the following error message from the Julia command prompt (I used Pkg.update() first):

julia> using Distributions
ERROR: Cholesky not defined
in include_from_node1 at loading.jl:92
in reload_path at loading.jl:112
in require at loading.jl:48
at /home/theodore/.julia/Distributions/src/Distributions.jl:1094

These are the relevant lines from ~/.julia/Distributions.jl:

sed -n 1094,1104p ~/.julia/Distributions/src/Distributions.jl 
immutable MultivariateNormal <: ContinuousMultivariateDistribution
  mean::Vector{Float64}
  covchol::Cholesky{Float64}
  function MultivariateNormal(m, c)
    if length(m) == size(c, 1) == size(c, 2)
      new(m, c)
    else
      error("Dimensions of mean vector and covariance matrix do not match")
    end
  end
end

MultivariateNormal error(?)

Am I specifying the input to the multivariate Normal wrongly, or is there a bug?

julia> MultivariateNormal(zeros(2), 100*eye(2))
MultivariateNormal distribution
Mean:
[0.0, 0.0]
Covchol: CholeskyDense{Float64}(2x2 Float64 Array:
10.0 0.0
0.0 10.0,'U')
Mean:
[0.0, 0.0]
Error showing value of type MultivariateNormal:
ERROR: type CholeskyDense has no field LR
in show at /home/theodore/.julia/Distributions/src/show.jl:23
in repl_show at repl.jl:12

Syntactic sugar for sprand()?

Currently the only way to generate sparse random matrices with nonzero elements following a particular Distribution is via a command like

    using Distributions
    M=sprand(10,20,0.5,n->rand(Chisq(5),n))

which is rather clunky; it would appear that something like

sprand(m, n, density, Distribution(parameters))

would be much cleaner.

I know there's been some talk of overhauling the sparse matrix routines in Julia, but having something like

sprand(m, n, density, X::Distribution) = sprand(m, n, density, n->rand(X, n))

would be nice.

Use types as tags for insupport()

We should allow both insupport(Normal, 0.0) and insupport(Normal(0.0, 1.0), 0.0).

Code cleanup

The Distributions codebase is getting a little unwieldy. I'm currently cleaning it up and splitting things into separate files. This issue is just a heads up to everyone else that large changes will occur soon.

Should the Link type be part of Distributions?

Link functions for generalized linear models are, to some extent, connected with distributions. In particular, distributions in the exponential family have canonical link functions. In R an object in the "family" class is generated from a distribution and, optionally, the name of a link.

> class(binomial())
[1] "family"
> binomial()

Family: binomial 
Link function: logit 

> binomial("probit")

Family: binomial 
Link function: probit 

> poisson()

Family: poisson 
Link function: log 

> gaussian()

Family: gaussian 
Link function: identity 

> Gamma()

Family: Gamma 
Link function: inverse

Currently I have the Link types in the Glm package (not yet a package but it will be) but that leaves methods for canonicallink(d::Distribution) being defined in Glm.jl whereas these really are properties of the distribution. Would it be too confusing to incorporate the Link abstract type and the concrete Link types in the Distributions module?

I have the feeling that we may end up with too many interlinking parts if we create fine-grained packages and we may be better off using fewer but more coarse-grained packages, which is why I would like to move many things that may be associated with probability distributions into this package.

Categorical's show broken

julia> yo=Categorical([0.5, 0.5])
Categorical distribution
Prob:
[0.5, 0.5]
Error showing value of type Categorical:
ERROR: no method start(Categorical,)
in mean at statistics.jl:2
in show at /home/iain/.julia/Distributions/src/show.jl:17
in repl_show at repl.jl:12

I'm at latest Distributions in METADATA, but my Julia is not at latest (but is recent)

InexactError in Categorical constructor if called with int vector

Calling the Categorical constructor with an Int vector throws an InexactError.

Sampling a multinomial fails with probs = [1., 0.]

rand(Multinomial(1, [1., 0.]))

which should be valid input, gives,

ERROR: BoundsError()
in getindex at array.jl:252

Use erfinv for cdf of Normal?

base/math.jl contains a definition of erfinv using that beautiful horner macro. Should that be used for calculating the cdf of the Normal distribution? It gets us away from one more dependence on the Rmath library.

Entropy of Binomial distribution is incorrect

Currently, we have

julia> entropy(Binomial(1, 0.5))
0.7257913526447274    # wrong

julia> entropy(Bernoulli())
0.6931471805599453    # correct

These two should both be equal to log(2).
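The expected value can be confirmed by summing -p log p over the probability mass function directly. `binomial_entropy` below is a hypothetical reference implementation intended for small n, where `binomial(n, k)` does not overflow.

```julia
# Entropy (in nats) of Binomial(n, p) by direct summation over the pmf.
function binomial_entropy(n::Integer, p::Real)
    h = 0.0
    for k in 0:n
        pk = binomial(n, k) * p^k * (1 - p)^(n - k)
        # Terms with zero probability contribute nothing to the entropy.
        pk > 0 && (h -= pk * log(pk))
    end
    return h
end

binomial_entropy(1, 0.5)   # Binomial(1, 0.5) is a fair Bernoulli trial
```

A brute-force sum like this is too slow for large n but makes a convenient oracle in tests for a closed-form or approximate implementation.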