juliastats / glm.jl
Generalized linear models in Julia
License: Other
One thing I'd really like is for Julia to tell the user when the data is linearly separable under a logistic model. This could be done by having glm, for logistic models, finish with a call to predict and check whether there are any mispredicted responses. If there are none, it would be nice to print a message noting this.
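Such a check could be sketched in plain Julia as follows (the function name and the 0.5 probability threshold are assumptions for illustration, not existing GLM.jl API):

```julia
# Hypothetical helper: the data is perfectly separated under the fitted
# model when every fitted probability lands on the correct side of 0.5,
# i.e. there are no mispredicted responses.
function perfectly_separated(y::AbstractVector{Bool}, phat::AbstractVector{<:Real})
    return all((y .& (phat .> 0.5)) .| (.!y .& (phat .< 0.5)))
end

y    = [true, true, false, false]
phat = [0.9, 0.8, 0.2, 0.1]       # stand-in for predict() output
perfectly_separated(y, phat)
```

glm could run a check like this after fitting and print the warning when it returns true.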
On OS X, no package can contain two files whose names differ only in case. Right now, there are src/GLM.jl and src/glm.jl files in this package. One has to be renamed.
The variable y in the final line of this function is not defined anywhere.
PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3). The results of this script are used to generate a package listing enhanced with testing results.
Tests pass.
means that PackageEvaluator found the tests for your package, executed them, and they all passed.
Tests fail, but package loads.
means that PackageEvaluator found the tests for your package, executed them, and they didn't pass. However, trying to load your package with using worked.
This issue was filed because your testing status became worse. No additional issues will be filed if your package remains in this state, and no issue will be filed if it improves. If you'd like to opt-out of these status-change messages, reply to this message saying you'd like to and @IainNZ will add an exception. If you'd like to discuss PackageEvaluator.jl please file an issue at the repository. For example, your package may be untestable on the test machine due to a dependency - an exception can be added.
Version 0.2.0
Commit 855a9220ec 2013-04-08 14:51:03
julia> using GLM
ERROR: QRDense not defined
in include_from_node1 at loading.jl:92
in reload_path at loading.jl:112
in require at loading.jl:48
at /home/iain/.julia/GLM/src/GLM.jl:173
This:
using GLM, RDatasets
form = dataset("datasets","Formaldehyde")
lm1 = lm(OptDen ~ Carb, form)
results in:
code for 3rd and higher order interactions not yet written
while loading In[1], in expression starting on line 3
in expandcols at C:\Users\admin\.julia\v0.3\DataFrames\src\formula\formula.jl:216
in ModelMatrix at C:\Users\admin\.julia\v0.3\DataFrames\src\formula\formula.jl:238
This is on Windows 8.1 x64.
versioninfo()
Julia Version 0.3.0-prerelease+1570
Commit a6f6461* (2014-02-14 21:07 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
Pkg.status()
8 required packages:
- Benchmark 0.0.3
- GLM 0.3.0
- Gadfly 0.2.5
- IJulia 0.1.4
- ProfileView 0.0.1
- RDatasets 0.1.1
- Resampling 0.0.0
- Winston 0.9.0
38 additional packages:
- ArrayViews 0.4.1
- BinDeps 0.2.12
- Blocks 0.0.2
- Cairo 0.2.12
- Cartesian 0.1.4
- Codecs 0.1.0
- Color 0.2.8
- Compose 0.1.26
- DataArrays 0.1.3
- DataFrames 0.5.3
- Datetime 0.1.2
- Distance 0.3.1
- Distributions 0.4.0
- GZip 0.2.12
- HTTPClient 0.1.0
- Hexagons 0.0.1
- ImageView 0.0.14
- Images 0.2.30
- IniFile 0.2.2
- Iterators 0.1.2
- JSON 0.3.3
- LibCURL 0.1.0
- LibExpat 0.0.3
- Loess 0.0.2
- Nettle 0.1.3
- NumericExtensions 0.5.4
- PDMats 0.1.0
- REPLCompletions 0.0.0
- SIUnits 0.0.1
- SortingAlgorithms 0.0.1
- StatsBase 0.3.7
- TexExtensions 0.0.1
- Tk 0.2.11
- URIParser 0.0.1
- URLParse 0.0.0
- WinRPM 0.0.12
- ZMQ 0.1.8
- Zlib 0.1.5
I'm trying to run the first example in the readme (using the formaldehyde data set), and I'm getting an error (ERROR: no method fit(Type{LmMod}, Formula, DataFrame)) when trying to run lm1 = fit(LmMod, OptDen ~ Carb, form). Is this an outdated example? I've pasted some diagnostic info below:
Julia version: 0.3.0-prerelease+2265
Results of Pkg.status():
9 required packages:
28 additional packages:
Is there an equivalent to R's use of "." to select all columns for the model?
e.g. glm(:(outcome ~ .), df, Bernoulli());
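There is no built-in "." equivalent, but the right-hand side can be assembled programmatically from the column names. A minimal sketch in plain Julia (column names hypothetical; a string is built here just to illustrate the idea):

```julia
# Build "outcome ~ c1 + c2 + ..." from a list of column symbols.
cols = [:outcome, :c1, :c2, :c3]                      # e.g. names of the DataFrame
rhs  = join(string.(filter(!=(:outcome), cols)), " + ")
fstr = "outcome ~ " * rhs
```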
julia> glm(:(d1 ~ dc), my_data_frame, Bernoulli())
ERROR: no method cols(Array{Float64,1},)
in ModelMatrix at /home/laurence/.julia/DataFrames/src/formula.jl:229
in glm at /home/laurence/.julia/GLM/src/glmfit.jl:62
in glm at /home/laurence/.julia/GLM/src/glmfit.jl:70
I'm pretty sure I'm using the latest versions of everything.
Taking a look at the ModelMatrix code, it seems that cols is defined for DataVector, but not Vector.
I should know how to do this but I don't. Could someone else flip the switch on this, please?
ftest (#182), aic, bic, loglikelihood
julia> using GLM
WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:58.
Use "x[i:end]" instead.
WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:71.
Use "x[i:end]" instead.
WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:77.
Use "x[i:end]" instead.
WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:92.
Use "x[i:end]" instead.
WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:106.
Use "x[i:end]" instead.
WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:111.
Use "x[i:end]" instead.
WARNING: deprecated syntax "x[i:]" at /home/diego/.julia/DataFrames/src/formula.jl:122.
Use "x[i:end]" instead.
ERROR: TernaryFunctor not defined
in include at boot.jl:240
while loading /home/diego/.julia/GLM/src/lm.jl, in expression starting on line 22
while loading /home/diego/.julia/GLM/src/GLM.jl, in expression starting on line 76
- Benchmark 0.0.3
- BioSeq 0.2.1
- Cairo 0.2.12
- DataFrames 0.5.1
- Distributions 0.3.0
- FastaIO 0.1.2
- GLM 0.2.2
- Gadfly 0.1.31
- IJulia 0.1.1
- NumericExtensions 0.3.6
- PyCall 0.4.1
- Stats 0.1.0
- StatsBase 0.3.5
- Winston 0.8.0
with newest Julia and GLM:
using DataFrames,GLM
df = DataFrame(A = [0.1,0.12,0.31,0.2,0.1], B =[0.21, 0.3, 0.1, 0.2,0.2])
ab=lm((A~B),df)
stderr(ab)
no method copytri!(Float64, Char)
while loading In[22], in expression starting on line 2
in symmetrize! at deprecated.jl:21
in vcov at /home/gaofeng/.julia/GLM/src/linpred.jl:65
in stderr at /home/gaofeng/.julia/GLM/src/linpred.jl:67
there is no error with "ab=lmc((A~B),df);stderr(ab)"
I just tried to test GLM following the README.md, but it simply does not work no matter what I try (different packages loaded in different orders); lm() simply does not work.
+++++++++++++++++++
julia> using DataFrames, Distributions, GLM
Warning: New definition refModResp not defined
at C:\Users\omorjan\AppData\Roaming\julia\packages\GLM\src\GLM.jl:18
julia>
(DataArray{T,N},Union(BitArray{1},Array{Bool,1})) is ambiguous with ref(DataArray{T<:Number,N},Union(BitArray{1},Ranges{T},Array{T,1})) at C:\Users\omorjan\AppData\Roaming\julia\packages\DataFrames\src\dataarray.jl:339.
Make sure ref(DataArray{T<:Number,N},Union(BitArray{1},Array{Bool,1})) is defined first.
julia> using RDatasets, GLM
invalid redefinition of constant GlmResp
at C:\Users\omorjan\AppData\Roaming\julia\packages\GLM\src\GLM.jl:18
julia> form = data("datasets", "Formaldehyde")
6x3 DataFrame:
carb optden
[1,] 1 0.1 0.086
[2,] 2 0.3 0.269
[3,] 3 0.5 0.446
[4,] 4 0.6 0.538
[5,] 5 0.7 0.626
[6,] 6 0.9 0.782
julia> fm1 = lm(:(optden ~ carb), form)
lm not defined
++++++++++++++++++++++
When I try to load GLM, I got an error, and cannot load GLM.
> using Distributions
> using GLM
ERROR: invalid redefinition of constant GlmResp
in include_from_node1 at loading.jl:76
in reload_path at loading.jl:96
in require at loading.jl:48
at /Users/kiwamu/.julia/GLM/src/GLM.jl:18
How can I solve this problem? I run Julia (Version 0.1.2+113667293.r7252) on Mac OS X 10.8, and GLM and Distributions were installed by Pkg.add("GLM") without any errors or warnings. Thanks.
I need to do generalized least squares and would like to use this package.
Is there an existing way to specify the variance of the residual, so as to do weighted least squares?
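In the meantime, weighted least squares can be done by hand. A minimal sketch in plain Julia (data hypothetical; the weights are taken as inverse residual variances):

```julia
# WLS: minimize sum(w .* (y - X*beta).^2) by rescaling rows with sqrt.(w)
# and solving the ordinary least-squares problem on the rescaled system.
X = [1.0 0.1; 1.0 0.3; 1.0 0.5; 1.0 0.6]
y = [0.086, 0.269, 0.446, 0.538]
v = [1.0, 0.5, 1.0, 0.5]          # assumed residual variances
w = 1 ./ v                        # precision weights

s    = sqrt.(w)
beta = (s .* X) \ (s .* y)        # QR solve of the reweighted system
```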
I can't seem to find any functions for evaluating the fit for a linear model. Does GLM have ways to get things like R^2, adjusted R^2, sigma, AIC, BIC?
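Pending proper accessors, these statistics are straightforward to compute from the residuals. A sketch in plain Julia (fitted values hypothetical):

```julia
y    = [0.086, 0.269, 0.446, 0.538, 0.626, 0.782]
yhat = [0.09, 0.26, 0.45, 0.54, 0.62, 0.78]    # stand-in fitted values
p    = 2                                        # parameters incl. intercept
n    = length(y)

sse   = sum(abs2, y .- yhat)                    # residual sum of squares
sst   = sum(abs2, y .- sum(y) / n)              # total sum of squares
r2    = 1 - sse / sst
adjr2 = 1 - (1 - r2) * (n - 1) / (n - p)
sigma = sqrt(sse / (n - p))                     # residual standard error
```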
"Deviance" in the context of a GLM is typically defined in terms of the difference between the log likelihood of the model and the log likelihood of a full model with a parameter for each observation (see Wikipedia), but the value GLM.jl reports is just -2 times the log likelihood of the model. This makes no difference if the likelihood of the full model is 1, but that isn't always the case. This explains the difference between the deviance of the Poisson example from the README in R and in GLM.jl.
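The discrepancy can be checked numerically. A sketch in plain Julia (hypothetical counts and fitted means; no GLM.jl involved):

```julia
# Poisson log likelihood for counts y with fitted means mu.
loglik(y, mu) = sum(y .* log.(mu) .- mu .- log.(factorial.(Int.(y))))

y  = [2.0, 3.0, 6.0, 7.0, 8.0, 9.0]
mu = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

neg2ll = -2 * loglik(y, mu)                  # what GLM.jl reports
dev    = 2 * (loglik(y, y) - loglik(y, mu))  # textbook deviance vs. saturated model
# The two differ by -2 * loglik(y, y), which is generally nonzero.
```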
Running this same data set through R does give me a reasonable result...
The entire set is available upon request.
julia> trndf = readtable("trn.ds.fix.11")
409x12 DataFrame:
outcome c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
[1,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[2,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
[3,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 2.0
[4,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 4.0
[5,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 7.0
[6,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0 0.0 8.0
[7,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
[8,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
[9,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0
[10,] 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
[11,] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11.0 0.0 0.0 11.0
[12,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0
[13,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0 0.0
[14,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 2.0 0.0
[15,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 1.0 0.0 1.0 1.0
[16,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 0.0 0.0 0.0 0.0
[17,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 0.0 0.0 2.0 0.0
[18,] 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 3.0 0.0 2.0 3.0
[19,] 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
[20,] 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 2.0 0.0
:
[390,] 1.0 9.0 15.0 7.0 1.0 0.0 157.0 12.0 7.0 0.0 70.0 7.0
[391,] 0.0 0.0 3.0 0.0 0.0 17.0 113.0 53.0 25.0 0.0 39.0 22.0
[392,] 0.0 6.0 10.0 17.0 10.0 1.0 74.0 45.0 1.0 0.0 26.0 1.0
[393,] 0.0 7.0 15.0 3.0 3.0 10.0 27.0 21.0 23.0 0.0 8.0 23.0
[394,] 1.0 0.0 0.0 0.0 0.0 9.0 108.0 30.0 19.0 0.0 17.0 18.0
[395,] 1.0 0.0 15.0 31.0 14.0 7.0 32.0 32.0 0.0 0.0 12.0 0.0
[396,] 1.0 0.0 2.0 6.0 1.0 0.0 105.0 29.0 43.0 0.0 76.0 41.0
[397,] 1.0 12.0 12.0 38.0 6.0 0.0 24.0 0.0 33.0 0.0 2.0 21.0
[398,] 1.0 0.0 0.0 0.0 0.0 27.0 104.0 66.0 85.0 0.0 57.0 84.0
[399,] 1.0 16.0 22.0 31.0 21.0 4.0 32.0 20.0 0.0 0.0 14.0 0.0
[400,] 1.0 0.0 0.0 0.0 0.0 11.0 147.0 14.0 13.0 0.0 108.0 13.0
[401,] 1.0 0.0 0.0 2.0 0.0 33.0 201.0 50.0 50.0 0.0 155.0 34.0
[402,] 0.0 10.0 17.0 15.0 11.0 51.0 100.0 88.0 6.0 0.0 98.0 6.0
[403,] 1.0 0.0 0.0 18.0 0.0 103.0 360.0 107.0 9.0 0.0 338.0 7.0
[404,] 1.0 0.0 3.0 11.0 3.0 19.0 123.0 63.0 13.0 0.0 117.0 11.0
[405,] 1.0 1.0 18.0 0.0 0.0 49.0 255.0 70.0 56.0 0.0 127.0 53.0
[406,] 1.0 6.0 18.0 37.0 18.0 21.0 45.0 39.0 23.0 0.0 44.0 21.0
[407,] 1.0 27.0 28.0 12.0 12.0 34.0 162.0 40.0 53.0 0.0 61.0 46.0
[408,] 1.0 21.0 43.0 32.0 32.0 22.0 68.0 39.0 124.0 0.0 49.0 101.0
[409,] 0.0 49.0 52.0 31.0 31.0 64.0 198.0 96.0 164.0 0.0 137.0 149.0
julia> lm(:(outcome ~ c1 + c2 + c3 + c4 + c5 + c6 + c7 + c8 + c9 + c10 + c11), trndf)
ERROR: SingularException(10)
in \ at linalg/triangular.jl:46
in \ at linalg/factorization.jl:310
in delbeta! at /Users/daljohnson/.julia/GLM/src/linpred.jl:27
in lm at /Users/daljohnson/.julia/GLM/src/lm.jl:39
in lm at /Users/daljohnson/.julia/GLM/src/lm.jl:42
Issue #19 relates to the use of the glmfit function. At present the constructor for GlmMod always calls glmfit to fit the model. In the MixedModels package I took a different approach in which the type representing the model does not call the model-fitting function; it just creates the representation of the model, which includes a boolean indicator of whether the fit has been performed. Any of the extractor functions (coef, deviance, etc.) check first if the fit has been performed and call fit if needed.
Separating fitting the model from constructing the representation has the advantage that the representation can be reused, say in a simulation, without going through its construction again. The fit methods can take other parameters such as one for verbose output.
Does this seem reasonable? Are more details needed?
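It seems reasonable. The proposed split might look like this sketch (plain Julia, all names hypothetical; a linear model stands in for the GLM case):

```julia
# The constructor only builds the representation; extractors fit on demand.
mutable struct LazyModel
    X::Matrix{Float64}
    y::Vector{Float64}
    beta::Vector{Float64}
    fitted::Bool
end
LazyModel(X, y) = LazyModel(X, y, zeros(size(X, 2)), false)

function fit!(m::LazyModel)
    m.fitted || (m.beta = m.X \ m.y; m.fitted = true)
    return m
end

coef(m::LazyModel) = (fit!(m); m.beta)   # triggers the fit on first access
```

This lets a simulation reuse the representation, e.g. by resetting y and the fitted flag, without rebuilding it.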
Currently the glmfit command has a println in it - is it possible to make this off by default and add a verbose option?
I'm getting this error when trying to run a simple lm model on two Float64 values.
In [138]: mod = lm(:(offtask ~ testing), bystudent)
Out [138]: no method symmetrize!(Float64)
When I do Pkg.add("GLM") on my Mac, my .julia/GLM/src folder is missing GLM.jl. I discovered that the reason is that my Mac does not distinguish case in file names, so it replaces GLM.jl with glm.jl during the install. The only way to fix it is to rename one of the two files: I changed glm.jl to glm2.jl and changed GLM.jl:199 to include("glm2.jl").
using DataFrames,GLM
df = DataFrame(A = 0.1*(1:4), B =[0.1, 0.3, 0.1, 0.2])
lm(A~B,df)
ERROR: no method convert(Type{QR{Float64}}, QRCompactWY{Float64})
in convert at base.jl:11
in DensePredQR at /home/gaofeng/.julia/GLM/src/linpred.jl:22
in lm at /home/gaofeng/.julia/GLM/src/lm.jl:38
lm(:(A~B),df)
ERROR: no method Formula(Expr)
in lm at /home/gaofeng/.julia/GLM/src/lm.jl:42
Hello, i'm getting some errors running code from the documentation... i'm running:
using GLM, RDatasets
form = dataset("datasets","Formaldehyde")
lm1 = fit(LmMod, OptDen ~ Carb, form)
i'm getting the error:
Error OptDen not defined
i'm running the stable release of julia.
Any help would be appreciated
Jason
I've set up things so that the user can do:
lm("y ~ x", df)
I feel like this is less intimidating to new users. Of course, it does so by separating users from the work being done under the hood. I'd like to know whether others think this is a good thing. Any thoughts @dmbates, @andreasnoackjensen, @HarlanH, @doobwa or @tshort?
This is more a feature request or policy question than a bug report.
I'm wondering whether you would like to add an argument allowing to easily compute sandwich (heteroskedasticity-robust), bootstrap, jackknife and possibly other types of variance-covariance matrix and standard errors, instead of the asymptotic ones. Or whether you would like to provide a function to compute these after the model has been fitted.
R makes it relatively cumbersome to get these common standard errors (see e.g. [1] or [2]), which either means that researchers with limited programming skills are incited to keep using Stata, or to keep relying only on asymptotic standard errors. I see no good reason why such an option shouldn't be provided in the standard modeling package.
1: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf
2: http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf
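For the OLS case, the HC0 sandwich estimator is short enough to sketch in plain Julia (data hypothetical; HC0 is the simplest member of the family, without small-sample corrections):

```julia
using LinearAlgebra

X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
y = [1.1, 2.3, 2.8, 4.2]

beta = X \ y
u    = y .- X * beta                 # residuals
B    = inv(X' * X)                   # "bread"
M    = X' * (u .^ 2 .* X)            # "meat": X' * diag(u.^2) * X
vcov_hc0 = B * M * B                 # robust covariance of beta
```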
Shouldn't it? This is useful in signal processing. Right now I am just solving my complex regression problems with pinv(X'X)*X'*Y, but it doesn't produce the nice output of lm and it is not scalable to problems with tons of predictors.
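Even without lm support, pinv is not needed: the backslash operator performs a QR-based least-squares solve for Complex element types too. A sketch (data hypothetical):

```julia
# Complex least squares via thin QR; avoids forming X'X and scales
# better than pinv(X'X) * X' * Y.
X = [1.0+0im 2.0+1im; 0.5-1im 1.0+0im; 2.0+0im 0.0+3im]
b = [1.0 + 1im, 2.0 - 1im]
y = X * b                     # consistent system, for checking

beta = X \ y                  # QR-based solve, Complex supported
```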
Shall we keep the coeftable methods and use them in the show methods? The reason that I wrote them is because there are often requests on R mailing lists dealing with extracting standard errors or p-values that are displayed by print.lm. The usual recommendation in R is to use
coef(summary(fm))
We should consider whether we want to go through the intermediate stage of summary in the GLM package. If so, we can fold coeftable into the summary methods.
On a related issue, there is an R function printCoefmat for printing the coefficient table. This isolates the code for printing small p-values. There is a reticence to displaying small p-values as 0.000, and I feel it is justified.
I'm looking for something like R's glm.fit function, which is called by glm after the design matrix and response vector are constructed and can also be called directly. This would be useful if you already have a design matrix and response and don't need the extra features provided by DataFrames. It would also avoid the overhead of having a DataFrame in memory in addition to your design matrix.
It looks like it would not be too hard to provide this. In glmfit.jl, fit takes a GlmMod. Other than the ModelFrame, the GlmMod object can be constructed from a response vector and design matrix instead of a Formula and DataFrame, and it looks like the ModelFrame is only used for column names in the coef table. (I'm not exactly sure what the assign field in a ModelMatrix is for; is it important here?)
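The matrix-level entry point being asked for can be sketched as a bare IRLS loop (plain Julia, function name hypothetical; the logistic case is shown):

```julia
# glm.fit-style fitting from a design matrix and response, no DataFrame.
function glmfit_logit(X::Matrix{Float64}, y::Vector{Float64}; iters=25)
    beta = zeros(size(X, 2))
    for _ in 1:iters
        eta = X * beta
        mu  = 1 ./ (1 .+ exp.(-eta))               # logistic mean
        w   = mu .* (1 .- mu)                      # IRLS weights
        z   = eta .+ (y .- mu) ./ max.(w, 1e-10)   # working response
        beta = (sqrt.(w) .* X) \ (sqrt.(w) .* z)   # weighted LS step
    end
    return beta
end
```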
I am playing with the Julia 0.3.0 prerelease.
using GLM, RDatasets
duncan = dataset("car","Duncan")
lm1=lm(Prestige~Income, duncan)
ERROR: no method LmResp{V<:StoredArray{T<:FloatingPoint,1}}(Array{Int32,1})
This issue is being filed by a script, but if you reply, I will see it.
PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their test (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3).
The results of this script are used to generate a package listing enhanced with testing results.
The status of this package, GLM, on...
'No tests, but package loads.' can be due to there being no tests (you should write some if you can!) but can also be due to PackageEvaluator not being able to find them. Consider adding a test/runtests.jl file.
'Package doesn't load.' is the worst-case scenario. Sometimes this arises because your package doesn't have BinDeps support, or needs something that can't be installed with BinDeps. If this is the case for your package, please file an issue and an exception can be made so your package will not be tested.
This automatically filed issue is a one-off message. Starting soon, issues will only be filed when the testing status of your package changes in a negative direction (gets worse). If you'd like to opt-out of these status-change messages, reply to this message.
See discussion in #67. We should at least be able to make the examples work.
Doing the steps in the README.md works, up to
julia> fm1 = lm(:(optden ~ carb), form)
Formula: optden ~ carb
Coefficients:
Evaluation succeeded, but an error occurred while showing value of type LmMod:
ERROR: no method display(LmMod)
in display at multimedia.jl:158
julia> confint(fm1)
ERROR: no method symmetrize!(Float64)
in vcov at /Users/adr/.julia/GLM/src/linpred.jl:65
in coeftable at /Users/adr/.julia/GLM/src/lm.jl:64
in confint at /Users/adr/.julia/GLM/src/lm.jl:75
in confint at /Users/adr/.julia/GLM/src/lm.jl:79
Not sure if this was the cause, but I started using Julia-0.2.0-rc2. On attempting using GLM, I got the error:
ERROR: Logistic not defined
in include at boot.jl:238
in include_from_node1 at loading.jl:119
in include at boot.jl:238
in include_from_node1 at loading.jl:119
in reload_path at loading.jl:145
in _require at loading.jl:63
in require at loading.jl:48
at /Users/seanmackesey/.julia/GLM/src/glmtools.jl:3
at /Users/seanmackesey/.julia/GLM/src/GLM.jl:192
I tried changing line 3 of glmtools.jl from using NumericExtensions.Logistic to using NumericExtensions.LogisticFun, and this seemed to fix the problem. I guess there was a name change in NumericExtensions?
There have been some changes in the API for NumericExtensions that break GLM.
julia> using GLM
ERROR: TernaryFunctor not defined
in reload_path at loading.jl:140
in _require at loading.jl:58
in require at loading.jl:43
while loading /Users/andreasnoackjensen/.julia/GLM/src/lm.jl, in expression starting on line 22
while loading /Users/andreasnoackjensen/.julia/GLM/src/GLM.jl, in expression starting on line 76
julia> using GLM
ERROR: Distribution not defined
in include_from_node1 at loading.jl:90
in reload_path at loading.jl:113
in require at loading.jl:48
at /Users/Adam/.julia/GLM/src/GLM.jl:18
Not sure this is the right place nor whether this may be of any use to you, but anyway...
I feel it's great that you write a GLM package from scratch while knowing very well the existing R implementation so that you take the best of it and improve it where needed. With that idea in mind, I'd like to share two points that I consider limits of the current handling of contrasts in R.
So IMHO it would be nice to be able to save the reference level separately from the order of levels.
Thus, it would be nice if the models could somewhat remember these contrast levels, and allow retrieving them via a function similar to coef(). I would even hold that they should be printed by the default show() method.
I realize that for sum contrasts things are a little more complex since the last contrast level is not equal to 0, but to the sum of all other coefficients. So this needs a more generic handling, which could likely be part of a contrast object.
Right now, a Formula specification looks like:
:(y ~ x1 + x2 + x3)
This is essentially just an enumeration of the columns of the LHS and the RHS. We would like to support more complex Formulas for use with both DataFrames and DataStreams. The streaming case poses some challenges, but should be soluble with a little more work.
There are several issues that need to be addressed. At a minimum, we need:
Interactions
Harlan has noted before that specifying rules for interactions is non-trivial because : has excessively high precedence in Julia. We can't support things like:
:(y ~ x1*x3 + x2:x3)
because they are not parsed the way that R would parse them.
At present, I believe the solution is that R's : is written as * and R's * is written as &. As such, the previous formula becomes:
:(y ~ x1&x3 + x2*x3)
This is not a bad solution, although I have mixed feelings about &. It provides the right intuitions for interactions of dummy variables, in which bitwise operations do actually capture the computations being done, but it doesn't provide the right intuition for interactions of continuous variables. Of course, : doesn't provide much intuition either, so this may be a moot point.
Functions of Columns
I believe that we support things like :(y ~ log(x1)), but am not sure. In general, it would be nice to support things like:
:(y ~ clamp(x1, x2, x3))
This should be possible using the with function when we create a ModelMatrix.
Categorical Factors => Numbers
The representation of factors is Julia's weak point right now. We have a PooledDataArray object that provides a means for storing factors with a small number of levels that is memory-efficient.
But this backend is not equivalent to the concept of a factor. We should be able to treat DataArrays as factors as well. For a categorical factor f, what is essential is the ability to replace f with a k-column matrix that indicates whether f[i] == k for each of the k unique values that f takes on. We could do this by supporting either levels or unique for anything that will be treated as a categorical factor.
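The k-column replacement described above is only a few lines in plain Julia (a sketch; real factor handling would also need to track level ordering and a contrasts scheme):

```julia
# Replace a categorical vector f with a k-column indicator matrix,
# one column per unique level, in order of first appearance.
function indicators(f::AbstractVector)
    levs = unique(f)
    return Float64[f[i] == levs[j] for i in eachindex(f), j in eachindex(levs)]
end

f = ["a", "b", "a", "c"]
Z = indicators(f)    # 4x3; Z[i, j] == 1 iff f[i] is the j-th level
```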
A major difficulty arises when working with categorical factors in DataStreams: if we call unique on a minibatch of 25 rows pulled from a 100,000 row data set, many of the k levels will not be present. To deal with this case, I would propose extending the functions we use for converting factors to allow an optional list of unique levels for each factor. If this list is blank, then we call levels or factors.
Ordered Factors => Numbers
We currently have no mechanism for working with ordered factors. In part, this is because our PooledDataArray is not like R's factor: a PooledDataArray dispatches all comparison operators to the contents so that PooledDataArray's of strings use string ordering for comparisons.
To provide something like R's ordered factor, we need to provide:
A lot of thought needs to go into this still.
Because the GLM package so strongly depends upon the DataFrames and Distributions packages, it is not really usable without importing the exports from both DataFrames and Distributions.
It would be nice if the following example worked out of the box:
using GLM
dobson = DataFrame({[18.,17,15,20,10,20,25,13,12], gl(3,1,9), gl(3,3)},
["counts","outcome","treatment"])
fm3 = glm(:(counts ~ outcome + treatment), dobson, Poisson())
To do this, we would have to force execution of using DataFrames, Distributions when someone executes using GLM. The RDatasets package is doing this sort of thing already, but we need to establish some community rules about what's acceptable and what's bad form. Input from @StefanKarpinski, @ViralBShah and @JeffBezanson would be helpful here.
Trying to apply lm to Int values gives me this error:
ERROR: no method LmResp(Array{Int64,1},)
i am getting an error when running a lm:
julia> using GLM, RDatasets
julia> form = dataset("datasets","Formaldehyde");
julia> lm1 = lm(OptDen ~ Carb, form)
Formula: OptDen ~ Carb
Coefficients:
Evaluation succeeded, but an error occurred while showing value of type LmMod:
ERROR: .+= not defined
in show at .julia\v0.3\StatsBase\src\statmodels.jl:68
julia>
i am running Julia 0.3.0 1897 and i've updated my packages.
Might be a new user question, but I have the following output upon entering 'using GLM':
Warning: could not import Base.foldl into NumericExtensions
Warning: could not import Base.foldr into NumericExtensions
Warning: could not import Base.sum! into NumericExtensions
Warning: could not import Base.maximum! into NumericExtensions
Warning: could not import Base.minimum! into NumericExtensions
ERROR: TernaryFunctor not defined
in include at boot.jl:238
in include_from_node1 at loading.jl:114
in include at boot.jl:238
in include_from_node1 at loading.jl:114
in reload_path at loading.jl:140
in _require at loading.jl:58
in require at loading.jl:43
at /Users/dhaivat/.julia/GLM/src/lm.jl:22
at /Users/dhaivat/.julia/GLM/src/GLM.jl:76
Any ideas how to resolve?
Running on Mac OS X, w/ Julia 0.2.0.
For logistic regression in R I always use family="binomial", but in GLM.jl it's Bernoulli(). A slight distinction, but is there any reason to change it?
julia> gm1 = glm(:(counts ~ outcome + treatment), dobson, Poisson())
ERROR: no method Formula(Expr)
in glm at /Users/gustavolacerda/.julia/v0.3/GLM/src/glmfit.jl:155
Hi,
Not major, but just wanted to report it.
Regards, Rob
[Julia startup banner]
Version 0.3.0-prerelease+3241 (2014-05-20 14:04 UTC)
master/6980728* (fork: 755 commits, 46 days)
x86_64-apple-darwin13.2.0
julia> using Distributions
julia> using GLM
Warning: Method definition convert(Type{BigFloat},MathConst{:sqrt2}) in module Distributions at constants.jl:38 overwritten in module GLM at constants.jl:38.
Warning: Method definition convert(Type{Float64},MathConst{:sqrt2}) in module Distributions at constants.jl:43 overwritten in module GLM at constants.jl:43.
Warning: Method definition convert(Type{Float32},MathConst{:sqrt2}) in module Distributions at constants.jl:44 overwritten in module GLM at constants.jl:44.
julia>
I'm a complete noob to Julia. Please be patient with me whilst I find my feet.
For binomial GLMs, separation is a common problem in some areas of application. The general "solution" is MAP estimation, or penalized MLE. The methods of Firth and of Gelman et al work well (but ridge estimation would also do).
In R, these are implemented in various different functions in various different packages: brglm::brglm, arm::bayesglm. Both of these use fairly straightforward modifications to the IRLS algorithm. (Firth's method had a couple of previous incarnations using different fitting algorithms in the logistf and brlr packages.)
So far as I understand the Julia GLM package, it implements MLE. To avoid multiple future packages implementing alternatives such as those mentioned above, a general interface for penalized MLE would be useful, with predefined options for at least Firth's method.
Is such an interface reasonable, or does the required flexibility make it more sensible to implement separate fitting methods explicitly elsewhere? If total flexibility is unreasonable, can Firth's method be offered as an option (even SAS has it!)
Thanks,
Harry
http://biomet.oxfordjournals.org/content/80/1/27.short
http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf
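As a proof of concept, a ridge penalty (not Firth's bias-reduction, which penalizes by the Jeffreys prior, but the simplest member of the penalized-MLE family mentioned above) slots into the same Newton/IRLS loop. A sketch in plain Julia (names and data hypothetical):

```julia
using LinearAlgebra

# Ridge-penalized logistic regression: maximize loglik - (lambda/2)*||beta||^2.
# Stays finite even under complete separation, where the plain MLE diverges.
function ridge_logit(X, y; lambda = 1.0, iters = 50)
    beta = zeros(size(X, 2))
    for _ in 1:iters
        mu = 1 ./ (1 .+ exp.(-(X * beta)))
        w  = mu .* (1 .- mu)
        g  = X' * (y .- mu) .- lambda .* beta   # penalized score
        H  = X' * (w .* X) + lambda * I         # penalized information
        beta += H \ g                           # Newton step
    end
    return beta
end

# Perfectly separated data: y flips exactly at x = 0.
X = [1.0 -2.0; 1.0 -1.0; 1.0 1.0; 1.0 2.0]
y = [0.0, 0.0, 1.0, 1.0]
beta = ridge_logit(X, y)    # finite, unlike the unpenalized MLE
```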
The functions delbeta and vcov use the internal representation of the QR object, but that representation changed in the update of the linear algebra code. The QR factorization now uses xgeqrt3.
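Since the internal fields keep changing, a more robust pattern is to go through the factorization's public solve interface rather than its representation. A sketch (data hypothetical):

```julia
using LinearAlgebra

X = [1.0 0.1; 1.0 0.3; 1.0 0.5]
y = [0.09, 0.27, 0.45]

F    = qr(X)      # whatever storage (xgeqrt3 etc.) is used internally...
beta = F \ y      # ...the least-squares solve interface is stable
```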