I'm running reg() on data set with 9 million rows and nearly 100 non-absorbed regresso

Hi both, just for reference - in R, lm() calls <code

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Wrong results in a large data set with one set of FE,about fixedeffects/fixedeffectmodels.jl

Comments (51)

nilshg commented on May 27, 2024 1

I should have guessed that you're an advanced enough user to know this, but I thought I mention it just in case. Have learned a lot over the years from your blogs, good to see you involved here!

from fixedeffectmodels.jl.

s3alfisc commented on May 27, 2024 1

Hi both, just for reference - in R, lm() calls .lm.fit(), which uses the QR decomposition. Here's the documentation: link.

lm.fit (x, y,    offset = NULL, method = "qr", tol = 1e-7,
       singular.ok = TRUE, ...)

method: currently, only method = "qr" is supported.
tol: tolerance for the [qr](https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/qr-methods.html) decomposition. Default is 1e-7.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024 1

The example I posted about OpenBlas had about half the columns dropped. I found that the problem went away with even small changes to the number of rows, so I figured I hit diminishing returns and stopped.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Thanks for the report. Does this happen without the survey fixed effects too?

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Random thought: with WildBootTests.jl, I have struggled with the lack of a function like Stata's invsym()--a matrix inversion function that is precise and gracefully handles rank-deficient, symmetric, positive-semi-definite matrices like X'X when X doesn't have full column rank. It leads to strange numerical behavior in sporadic instances. I've found that pinv() is often imprecise, I think even when I follow the recommendation to set rtol = sqrt(eps(real(float(oneunit(eltype(M)))))).

Perhaps something like that is happening here?

from fixedeffectmodels.jl.

nilshg commented on May 27, 2024

Don't have anything to contribute on the numerical issues here but just thought I'd point out that when you've got a formula that long it's often useful to constrict it programmatically, here I think your formula is equivalent to:

term("part") ~ sum([term("_IBirthyr_$i") for i ∈ 1907:2000]) + fe(:survey)

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Thanks @nilshg. In fact that long command is programmatically generated. I've written a Stata program called reghdfejl that is meant to be a slot-in replacement for reghdfe but calls FixedEffectModels.jl. A distinct component is a Stata-Julia bridge written in C, which so far I've only compiled in Windows. Once I get it compiled in Linux MacOs too, and documented everything, I'll share it all.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

I'm actually using some version of invsym to get the basis of a set of vectors:
https://github.com/FixedEffects/FixedEffectModels.jl/blob/master/src/utils/basecol.jl#L56

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Oh interesting. I was looking at how GLM does it:. It calls cholesky() with check = false, zeros out certain entries, then divides X'X into X'Y using LAPACK.potrs!(). https://github.com/JuliaStats/GLM.jl/blob/b1ba4c5fdd5030b98a8cf9fe9c46319e5f5eb20e/src/linpred.jl#L186
But I presume your solution works well too.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

@matthieugomez, yes, it happens even with lm()! Whereas it does not happen in R and Stata. I found a GLM issue with a long discussion of instability.

One idea in that discussion is that it is more stable to use the QR decomposition of X than the Cholesky decomposition of X'X. This says that's what R does. That may be the root of the matter.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Ah. GLM is gaining a method=:qr option. FixedEffectModels could copy that option, possibly switching to the QR as the default, as seems to be done in R and Stata(?); or use a hybrid approach in which it uses the Cholesky decomposition is used when the condition number is below some threshold.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

I cannot reproduce this issue on my computer (Mac). Can you try

using CSV, DataFrames, LinearAlgebra, FixedEffectModels, Plots
df = CSV.read(".../FEbug.csv", DataFrame)
f = @formula(part ~ _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
     _Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
     _Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
     _Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
     _Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
     _Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
     _Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
     _Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
     _Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
     _Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)

X = modelmatrix(f, df) 
y = response(f, df) 

coef_qr = qr(X) \ y
coef_chol = cholesky!(X' * X) \ (X' * y)

plot([coef_qr coef_chol])

And tell me whether coef_qr and coef_chol give correct results?

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Yes, they look identical, and they look good.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Thanks. Can you try reg(df, f) then? If it reports correct results, can you try adding fe(survey) and _00001 to the formula, separately, to see which one is creating the incorrect result?

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Also, I have tried the original example on a Mac with an M2 and an Intel Mac, and it worked fine on both.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Alas, reg(df,f) is failing:

julia> m=reg(df,f)
                                  FixedEffectModel
=====================================================================================
Number of obs:                    8830997   Converged:                           true
dof (model):                           93   dof (residuals):                  8830902
R²:                                 0.020   R² adjusted:                        0.020
F-statistic:                      2467.09   P-value:                            0.000
=====================================================================================
                  Estimate   Std. Error      t-stat  Pr(>|t|)   Lower 95%   Upper 95%
─────────────────────────────────────────────────────────────────────────────────────
_Ibirthyr_1907  -0.273123   0.0400529      -6.81906    <1e-11  -0.351625   -0.194621
_Ibirthyr_1908  -0.266322   0.0201056     -13.2462     <1e-39  -0.305729   -0.226916
_Ibirthyr_1909  -0.283199   0.0198092     -14.2963     <1e-45  -0.322025   -0.244374
_Ibirthyr_1910  -0.273939   0.0129585     -21.1398     <1e-98  -0.299337   -0.248541
_Ibirthyr_1911  -0.27852    0.0138402     -20.1239     <1e-89  -0.305646   -0.251394
_Ibirthyr_1912  -0.27633    0.0123648     -22.3481     <1e-99  -0.300565   -0.252095
_Ibirthyr_1913  -0.272623   0.0109313     -24.9397     <1e-99  -0.294047   -0.251198
_Ibirthyr_1914  -0.275257   0.00984198    -27.9676     <1e-99  -0.294546   -0.255967
_Ibirthyr_1915  -0.278617   0.00771855    -36.0971     <1e-99  -0.293745   -0.263489
_Ibirthyr_1916  -0.272556   0.00727289    -37.4756     <1e-99  -0.28681    -0.258301
_Ibirthyr_1917  -0.269847   0.00720218    -37.4674     <1e-99  -0.283963   -0.255731
_Ibirthyr_1918  -0.272885   0.00571428    -47.755      <1e-99  -0.284085   -0.261686
_Ibirthyr_1919  -0.269995   0.00549542    -49.1309     <1e-99  -0.280765   -0.259224
_Ibirthyr_1920  -0.267947   0.00459044    -58.3707     <1e-99  -0.276945   -0.25895
_Ibirthyr_1921  -0.26684    0.00440162    -60.623      <1e-99  -0.275467   -0.258212
_Ibirthyr_1922  -0.264173   0.00449576    -58.7605     <1e-99  -0.272984   -0.255361
_Ibirthyr_1923  -0.265196   0.00396114    -66.9493     <1e-99  -0.27296    -0.257432
_Ibirthyr_1924  -0.261272   0.0038648     -67.6032     <1e-99  -0.268847   -0.253698
_Ibirthyr_1925  -0.263561   0.0032994     -79.8814     <1e-99  -0.270028   -0.257094
_Ibirthyr_1926  -0.260826   0.00329492    -79.1601     <1e-99  -0.267284   -0.254368
_Ibirthyr_1927  -0.262224   0.00311456    -84.193      <1e-99  -0.268329   -0.25612
_Ibirthyr_1928  -0.259789   0.00275803    -94.1937     <1e-99  -0.265195   -0.254383
_Ibirthyr_1929  -0.259259   0.00270146    -95.9699     <1e-99  -0.264554   -0.253964
_Ibirthyr_1930  -0.257863   0.00233074   -110.636      <1e-99  -0.262432   -0.253295
_Ibirthyr_1931  -0.257178   0.00233809   -109.995      <1e-99  -0.26176    -0.252595
_Ibirthyr_1932  -0.257484   0.00226968   -113.445      <1e-99  -0.261933   -0.253036
_Ibirthyr_1933  -0.256838   0.00208717   -123.056      <1e-99  -0.260929   -0.252747
_Ibirthyr_1934  -0.25508    0.00204603   -124.671      <1e-99  -0.25909    -0.25107
_Ibirthyr_1935  -0.253795   0.00187253   -135.536      <1e-99  -0.257465   -0.250125
_Ibirthyr_1936  -0.250249   0.00185886   -134.625      <1e-99  -0.253892   -0.246606
_Ibirthyr_1937  -0.250794   0.00185182   -135.431      <1e-99  -0.254424   -0.247165
_Ibirthyr_1938  -0.248538   0.00174025   -142.818      <1e-99  -0.251949   -0.245127
_Ibirthyr_1939  -0.246448   0.00170952   -144.162      <1e-99  -0.249799   -0.243098
_Ibirthyr_1940  -0.246684   0.00156368   -157.758      <1e-99  -0.249749   -0.243619
_Ibirthyr_1941  -0.242315   0.00161241   -150.281      <1e-99  -0.245476   -0.239155
_Ibirthyr_1942  -0.241501   0.00156544   -154.271      <1e-99  -0.244569   -0.238433
_Ibirthyr_1943  -0.239828   0.00152961   -156.79       <1e-99  -0.242826   -0.23683
_Ibirthyr_1944  -0.23724    0.00156606   -151.488      <1e-99  -0.240309   -0.23417
_Ibirthyr_1945  -0.236017   0.00143321   -164.677      <1e-99  -0.238826   -0.233208
_Ibirthyr_1946  -0.231602   0.00145891   -158.75       <1e-99  -0.234461   -0.228742
_Ibirthyr_1947  -0.229845   0.00147762   -155.551      <1e-99  -0.232741   -0.226949
_Ibirthyr_1948  -0.224731   0.00141518   -158.801      <1e-99  -0.227505   -0.221958
_Ibirthyr_1949  -0.221772   0.00140861   -157.44       <1e-99  -0.224533   -0.219011
_Ibirthyr_1950  -0.222545   0.00130881   -170.035      <1e-99  -0.22511    -0.219979
_Ibirthyr_1951  -0.213693   0.00133732   -159.792      <1e-99  -0.216315   -0.211072
_Ibirthyr_1952  -0.211062   0.00132429   -159.378      <1e-99  -0.213657   -0.208466
_Ibirthyr_1953  -0.206814   0.00127914   -161.682      <1e-99  -0.209321   -0.204307
_Ibirthyr_1954  -0.201267   0.00126368   -159.271      <1e-99  -0.203744   -0.198791
_Ibirthyr_1955  -0.148427   0.00121979   -121.683      <1e-99  -0.150818   -0.146036
_Ibirthyr_1956  -0.148154   0.00121726   -121.711      <1e-99  -0.15054    -0.145768
_Ibirthyr_1957  -0.145949   0.00121263   -120.358      <1e-99  -0.148326   -0.143573
_Ibirthyr_1958  -0.146751   0.00117288   -125.12       <1e-99  -0.14905    -0.144452
_Ibirthyr_1959  -0.141585   0.00116838   -121.18       <1e-99  -0.143875   -0.139295
_Ibirthyr_1960  -0.1423     0.00111462   -127.667      <1e-99  -0.144485   -0.140116
_Ibirthyr_1962  -0.137285   0.001137     -120.744      <1e-99  -0.139514   -0.135057
_Ibirthyr_1963  -0.139382   0.00111215   -125.327      <1e-99  -0.141562   -0.137202
_Ibirthyr_1964  -0.13438    0.00111691   -120.313      <1e-99  -0.136569   -0.13219
_Ibirthyr_1965  -0.140646   0.00108565   -129.55       <1e-99  -0.142773   -0.138518
_Ibirthyr_1966  -0.13261    0.00110388   -120.13       <1e-99  -0.134773   -0.130446
_Ibirthyr_1967  -0.131491   0.00111893   -117.515      <1e-99  -0.133684   -0.129298
_Ibirthyr_1968  -0.129631   0.00109942   -117.909      <1e-99  -0.131786   -0.127476
_Ibirthyr_1969  -0.127519   0.00108838   -117.163      <1e-99  -0.129652   -0.125385
_Ibirthyr_1970  -0.131476   0.00106977   -122.901      <1e-99  -0.133573   -0.12938
_Ibirthyr_1971  -0.126526   0.00108828   -116.262      <1e-99  -0.128659   -0.124393
_Ibirthyr_1972  -0.129166   0.00108103   -119.484      <1e-99  -0.131284   -0.127047
_Ibirthyr_1973  -0.133421   0.00108229   -123.277      <1e-99  -0.135543   -0.1313
_Ibirthyr_1974  -0.134938   0.00109439   -123.3        <1e-99  -0.137083   -0.132793
_Ibirthyr_1975  -0.134999   0.00108154   -124.821      <1e-99  -0.137119   -0.132879
_Ibirthyr_1976  -0.134898   0.00109868   -122.781      <1e-99  -0.137051   -0.132744
_Ibirthyr_1977  -0.135823   0.00111433   -121.887      <1e-99  -0.138007   -0.133639
_Ibirthyr_1978  -0.136901   0.00112702   -121.472      <1e-99  -0.139109   -0.134692
_Ibirthyr_1979  -0.137868   0.00113635   -121.325      <1e-99  -0.140095   -0.135641
_Ibirthyr_1980  -0.13839    0.00111066   -124.602      <1e-99  -0.140567   -0.136214
_Ibirthyr_1981  -0.137445   0.00114061   -120.501      <1e-99  -0.13968    -0.135209
_Ibirthyr_1982  -0.135815   0.00112787   -120.417      <1e-99  -0.138025   -0.133604
_Ibirthyr_1983  -0.0870598  0.00114109    -76.2956     <1e-99  -0.0892963  -0.0848233
_Ibirthyr_1984  -0.0800584  0.00116387    -68.7865     <1e-99  -0.0823396  -0.0777773
_Ibirthyr_1985  -0.0726835  0.00118441    -61.3669     <1e-99  -0.0750049  -0.0703621
_Ibirthyr_1986  -0.059753   0.001222      -48.8977     <1e-99  -0.0621481  -0.0573579
_Ibirthyr_1987  -0.0509262  0.00127088    -40.0715     <1e-99  -0.0534171  -0.0484354
_Ibirthyr_1988  -0.0436099  0.0013144     -33.1785     <1e-99  -0.046186   -0.0410337
_Ibirthyr_1989  -0.0421915  0.00138641    -30.4323     <1e-99  -0.0449088  -0.0394742
_Ibirthyr_1990  -0.039882   0.0014523     -27.4613     <1e-99  -0.0427284  -0.0370355
_Ibirthyr_1991  -0.0316633  0.00158544    -19.9713     <1e-88  -0.0347708  -0.0285559
_Ibirthyr_1992  -0.121891   0.00171674    -71.0015     <1e-99  -0.125256   -0.118527
_Ibirthyr_1993  -0.119903   0.00181338    -66.1214     <1e-99  -0.123457   -0.116349
_Ibirthyr_1994  -0.113538   0.00186542    -60.8645     <1e-99  -0.117194   -0.109882
_Ibirthyr_1995  -0.115299   0.00193508    -59.5836     <1e-99  -0.119092   -0.111507
_Ibirthyr_1996  -0.117205   0.00204327    -57.3618     <1e-99  -0.12121    -0.113201
_Ibirthyr_1997  -0.117898   0.00227367    -51.8537     <1e-99  -0.122354   -0.113442
_Ibirthyr_1998  -0.114736   0.00272781    -42.0616     <1e-99  -0.120082   -0.10939
_Ibirthyr_1999  -0.120351   0.00364711    -32.999      <1e-99  -0.127499   -0.113203
_Ibirthyr_2000  -0.0949544  0.00852195    -11.1423     <1e-28  -0.111657   -0.0782517
(Intercept)      0.286457   0.000805943   355.43       <1e-99   0.284877    0.288036
=====================================================================================

julia> plot(coef(m)[2:end])

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Sorry just to recap, is this a good summary of what you obtain?

using CSV, DataFrames, LinearAlgebra, FixedEffectModels, Plots
df = CSV.read(".../FEbug.csv", DataFrame)
f = @formula(part ~ _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
     _Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
     _Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
     _Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
     _Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
     _Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
     _Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
     _Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
     _Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
     _Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)

X = modelmatrix(f, df) 
y = response(f, df) 
# works everywhere
coef_qr = qr(X) \ y
coef_chol = cholesky!(X' * X) \ (X' * y)
# works correctly with Mac Intel or M2 but fails on Linux
reg(df, f)

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Yes!
Would it make sense for me to try the QR decomposition here ?

FixedEffectModels.jl/src/fit.jl

Line 397 in eda1b08

crossx = cholesky!(Symmetric(Xhat'Xhat))

On Windows...

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Sure that would be helful. Could you also try this (note that I add an explicit intercept in the formula, which matters when using modelmatrix directly)

using CSV, DataFrames, LinearAlgebra, FixedEffectModels, Plots
df = CSV.read(".../FEbug.csv", DataFrame)
f = @formula(part ~ 1 + _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
     _Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
     _Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
     _Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
     _Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
     _Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
     _Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
     _Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
     _Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
     _Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)
X = modelmatrix(f, df) 
y = response(f, df) 
coef_qr = qr(X) \ y
coef_chol = cholesky!(X' * X) \ (X' * y)
coef_chol2 = cholesky!(Symmetric(X' * X)) \ (X' * y)

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Those all look good in Windows:

plot([coef_qr coef_chol.+.01 coef_chol2.+.02][2:end,:])

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Separately, yes, switching to the QR decomposition fixes the problem! I just replaced ldiv!(crossx, Xhat'y) with qr(Xhat) \ y. This confirms my initial guess that this is a numerical issue related to matrix inversion.

Sources I looked at a few days ago were nearly unanimous in saying that software should use the QR, that it's probably what's used in Stata and core R packages. E.g., "QR decomposition is indeed the standard way of solving least square problems effectively with much simplicity"

So I think this package should have an option for QR at least in the post-FE-removal stage, and I would have reghdfejl call it by default.

Relatedly, see this discussion of argument naming. method is getting used different ways in different places.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

I don't understand why coef_chol2 gives you different results than reg — they should literally give you the same thing. I am also not convinced that QR can matter relative to Cholesky for economically meaningful regressions (especially because your coefficients are estimated with small SEs). I think something else may be happening.

Just before https://github.com/FixedEffects/FixedEffectModels.jl/blob/master/src/fit.jl#L397 I have added the lines

    f2 = @formula(part ~ 1 + _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
    _Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
    _Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
    _Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
    _Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
    _Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
    _Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
    _Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
    _Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
    _Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)
   Xhat2 = modelmatrix(f2, df)
   y2 = response(f2, df)
   coef2 = cholesky!(Xhat2' * Xhat2) \ (Xhat2' * y2)

   crossx = cholesky!(Symmetric(Xhat'Xhat))
   coef = ldiv!(crossx, Xhat'y)

   @assert sum(basis_coef) == size(Xhat2, 2)
   @assert Xhat == Xhat2
   @assert y == y2
   @assert coef == coef2
   @show "success"

When I run reg(df, f), the four assertions pass for me and "success" gets printed. They should not pass for you. Could you try to run it? Understanding which one fails would help us pinpoint the issue.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

I get:

ERROR: AssertionError: coef == coef2

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Thanks. Can you try again after replacing the line coef = ldiv!(crossx, Xhat'y) by coef = crossx \ (Xhat' * y)? If that still fails, I'm out of idea on what is driving the issue.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

That still fails.

So I drilled down some with the debugger. It seems to have to do with multiplication by a SentinalArrays.ChainedVector. Look at the items in the Watch pane on the left.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

The upshot here is that replacing y2 in your test code with collect(y2) in order to convert it to Vector{Float64} lets it pass all the @asserts on my machine. I'm not sure what's going on with that. I'm also not sure it's relevant because the actual variable in the code, y, is already Vector{Float64}.

Meanwhile, using collect(y2) doesn't solve my original problem. What does solve the problem is qr().

Here's a complicated screenshot. The last item in the Watch list pulls together coef2 as defined in what is now line 408, involving collect(y2), the computation of coef in the actual code as ldiv!(crossx, Xhat'y), and the QR-based computation qr(Xhat) \ y. The call to show() in the last watch item dumps the full results into the debugger output window below. I copied the resulting 3-column matrix into another Julia session and plotted the three columns, as shown on the right. Only the QR-based result is correct.

In my experience, this is how it is with numerical issues in linear algebra routines. You drill and drill to find where the computation goes off the rails, and it turns out it's not really a logical bug, just imprecision in the matrix inversion/factorization that you previously didn't need to think about. Then you have to confront a speed-precision tradeoff.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Thanks a lot. I am open to using QR in some cases but, if you look at the conditioning number of X, it is not particularly different than the conditioning number of any other matrix (which, I think, is the same as saying that the system is not close to being collinear). So I'm unsure why this happens and how one could predict this issue.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

I think that's what SAS does btw (and Stata is probably similar): https://blogs.sas.com/content/iml/2018/04/18/sweep-operator-sas.html, https://blogs.sas.com/content/iml/2021/07/14/performance-ls-regression.html

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Yes, this other SAS blog post about the QR says "most SAS regression procedures use the sweep operator to construct least-squares solutions." OTOH, it starts out with "A SAS programmer recently mentioned that some open-source software uses the QR algorithm to solve least-squares regression problems" and we can guess what that refers to. And indeed it looks like R uses the QR (this, this).

I don't know what Stata does, but evidently it sometimes uses something more precise than Cholesky. It looks like Numpy uses the SVD, which I read is even slower and more precise than the QR (this, this).

I'm getting a condition number of 346 for Xhat, thus 346^2 for crossx (which is faster to compute especially since crossx will still be computed anyway, right?). Isn't that pretty big?

I can see a few options:

Leave it up to the user, which seems to be where lm() is headed.
Always use the QR, apparently like R.
Use some threshold condition number on crossx to switch between Cholesky and QR. Matlab does something like this (see the flow chart).

from fixedeffectmodels.jl.

ericqu commented on May 27, 2024

Apologies as it is a bit off-topic, I was wondering if the data could be shared or tested with the LinearRegressionKit.jl, I would like to know out of curiosity if this data set is an issue with the sweep operator.

Personally, I am not a big fan of a threshold on the condition number as it only considers the Xs, in my view, this leaves too much uncertainty to leave the Y out.

My preference would be to have a QR by default, that could be changed by the user.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

@ericqu I'd be happy to send you the data if you provide a more private way to do so.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

I am still not convinced that QR is needed in a setup where one obtains small SEs for all coefficients. Relatedly, to compute standard errors, one needs to compute inv(X'X), even though the computer science / numerical methods people would say it's a horrible thing to do. In any case, having QR as a default would be bad as it would slow down everything (this is one reason why R is much slower than Stata for standard OLS regressions with big data).

I have pushed a version that solves for coefficients using sweep operations, using the same algorithm I was already using to test for collinearity. Hopefully that solves your problem. #253

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Thank you @matthieugomez. I upgraded to 1.10.0 but unfortunately, the results have not changed.

If you are getting tired of this, I can just fork the project, tweak it, and point reghdfejl to that.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Well I would prefer to understand the difference, but it's hard to me to solve something that does not happen on my computer (I have double checked and I get the exact same coefficients in Stata and Julia). If you time Stata, you can see that it returns a result faster than what a QR factorization would take, so I really don't think they use QR.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

reg(df,f; nthreads=6) takes 48.640588 seconds. In Stata, after set processors 6, the regression takes 4.85 seconds. (Not a typo.)

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Part of it is JuliaStats/StatsModels.jl#222. You should be able to avoid it by passing factors directly rather than the continuous variables.

In any case, the point is that 4.85 seconds is less than what my computer takes to compute a QR factorization.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Interesting. Yeah on my machine the QR takes ~8.1 seconds.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

That still fails.

So I drilled down some with the debugger. It seems to have to do with multiplication by a SentinalArrays.ChainedVector. Look at the items in the Watch pane on the left.

I am looking at this screenshot. Am I correct in understanding that you are getting (Xhat' * y)[end] = 320.0 but (Xhat[:, end]' * y) = 477.0? If it is the case (which seems wild to me), can you get it with this simpler code (i.e. not involving reg):

f = @formula(part ~ 1 + _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
    _Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
    _Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
    _Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
    _Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
    _Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
    _Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
    _Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
    _Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
    _Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)
Xhat = modelmatrix(f, df)
y_sentinel = response(f, df)
y_vector = convert(Vector{Float64}, y_sentinel)
@show (Xhat' * y_vector)[end]
@show Xhat[:, end]' * y_vector

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Yes you are correct. I don't know if calls for a new issue on that package.

julia> @show (Xhat' * y_vector)[end]
(Xhat' * y_vector)[end] = 320.0
320.0

julia> @show Xhat[:, end]' * y_vector
(Xhat[:, end])' * y_vector = 477.0
477.0

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Well so that explains the problem — there is an issue with your version of Julia or with your underlying BLAS / MKL. Both Xhat and y_vectors are vectors/matrices of 0 and 1 so there should not be any difference between the two numbers. Have you tried with Julia 1.10? Can you try to find a version of this issue with simpler arrays so that you can open an issue on the Julia repository? (maybe Xhat= Xhat[:, end:end] and y = ones(length(y_vector))

from fixedeffectmodels.jl.

grantmcdermott commented on May 27, 2024

A few unsolicited, but hopefully still helpful thoughts.

The potential instability of Cholesky is well documented and, IMO, is good cause for defaulting to QR. Luke Tierney (R Core member) has a nice lecture series on this sort of thing here: http://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023/
Speaking of options, in R you can invoke different decomposition strategies through different packages. The (very fast) collapse package offers a flm convenience frontend to a bunch of these. All of which is to say that, if you can, doing something similar and offering your users options isn't a bad compromise.
Optimised BLAS introduces a whole new can of worms, due to competing parallelism / race conditions, among other things. Relying on MKL, in particular, is asking for trouble. Having been bit by wicked esoteric bugs in the past, I can only pass on the advice that was given to me: Friends don't let friends use MKL! (Didn't MacKinnon have a working paper demonstrating the instability of Stata results under different BLAS regimes, or am I misremembering?)

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

@grantmcdermott I could be convinced to add an option for QR but the first step would be for someone to come up with an example where the difference matters relative to the standard errors of the coefficients. The issue right now is a bug, not of instability of Cholesky versus QR.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

@matthieugomez I'm not sure what you mean by "the difference matters relative to the standard errors." In the motivating example here, the discontinuity is very large relative to the standard errors.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Your example has nothing to do with the instability of Cholesky versus QR. It's about the fact that your computer returns a wrong result for X'y (X and y are just arrays of 0-1, so it is not related to numerical instability). You should make sure you are on the latest version of Julia (1.9.3), simplify the example so that you can share it, and open an issue mentioning your OS and the result of LinearAlgebra.BLAS.get_config() on https://github.com/JuliaLang/julia.

Here is some code that might help you get a reproductible example (it produces a vector y and a matrix X similar to the ones obtained from your data).

N = 10_000_000
X = hcat(ones(N), rand(N, 100) .>= 0.95)
y = Vector{Float64}(rand(N) .>= 0.9)
@assert (X'y)[end] ≈ X[:, end]'y

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Submitted. I had thought that conversion of that Sentinel vector to a regular vector had eliminated the effect on the computations in this package. It did so in the debugging watches, but that's not quite the same thing.

Ironically @grantmcdermott, switching from OpenBLAS to MKL eliminates the problem(!).

What I don't understand is why sticking in a simple coef = qr(X) \ y also did. I guess the ldiv operator is not having this weird math issue.

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Thanks. If you replace y by view(y, :), the issue does not happen anymore right? If so, I will do that until this OpenBlas issue gets fixed.

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

I tried adding y = view(y,:) in fit.jl after the line constructing crossx. That did not help. But maybe that's not what you mean.

Following this comment I added using BLISBLAS in reghdfejl before calling FixedEffectModels. It works!

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Great, but I'd like a solution even for people that don't use reghdfejl. What if you replace X' * y by X' * reshape(y, length(y), 1) in your reproductible example?

from fixedeffectmodels.jl.

droodman commented on May 27, 2024

Yes, that works too!
But now I don't trust OpenBLAS. What if you put using BLISBLAS in FixedEffectModels, at least for x86_64 architecture?

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Well BLISBLAS can probably also create other types of bugs in the future... Out of curiosity, were you able to replicate this bug with other matrices? I'm wondering how common it is in the wild

from fixedeffectmodels.jl.

matthieugomez commented on May 27, 2024

Solved by #255

from fixedeffectmodels.jl.

Wrong results in a large data set with one set of FE about fixedeffectmodels.jl HOT 51 CLOSED

Comments (51)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent