
eqsormo's Issues

Weights

Add support for weights - this will likely be important when performing scenario analysis as it affects how much demand there is for different types of housing units.

ASCs slow to converge with some sample levels

When I run the transportlab model with 25 sampled alternatives in fast fit mode, it screams - but with 10 sampled alts (ostensibly a simpler problem), ASC computation takes many minutes.

Performance

Clearing the market is incredibly slow. There must be some optimizations I can do.

dependency cleanup

Running eqsormo requires dill and numba - are we still even using these?

Make sure type shocks are recovered correctly

The type shocks are the errors from the second stage 2SLS regression. Make sure that they're recovered correctly and don't include the errors from the first stage of the 2SLS estimation.
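As a check, a minimal 2SLS sketch (hypothetical arrays: Z instruments, X actual regressors, beta_2sls the 2SLS coefficients, mean_utility the dependent variable) - the type shock is the structural residual computed from the actual regressors, not from the first-stage fitted values:

    import numpy as np

    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # first-stage fitted values
    xi = mean_utility - X @ beta_2sls          # correct: structural (type) shock
    xi_bad = mean_utility - X_hat @ beta_2sls  # wrong: also carries first-stage error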

Make sampling faster

I think clever use of a numbaized loop with calls to numpy.random.randint will do the trick. Cleaned-up pseudocode:

    # one row per household: the chosen alternative plus nsamp - 1 sampled ones
    output = np.zeros((n_households, nsamp), dtype=np.int64)
    output[:, 0] = chosen                  # set chosen alternative
    for i in range(n_households):          # numbaize this loop
        for j in range(1, nsamp):          # fill in with random draws
            alt = np.random.randint(n_alts - 1)
            output[i, j] = alt + (alt >= chosen[i])  # shift draws past the chosen alt

endogenous attributes

Endogenous attributes should be supported. This doesn't matter in estimation, but in sorting, after each iteration, these endogenous attributes need to be updated.

Should ASCs be updated in Hessian calculation?

ASCs used to get updated in Hessian calculation, because they were updated every time full_utility was called. However, they don't anymore with the new, faster Hessian code. I'm not sure which is better - on the one hand, ASCs wouldn't get updated in a classic discrete choice model where the ASCs were estimated parameters. On the other hand, the ASCs also aren't included in the Hessian like estimated parameters would be.

Normalization of one price

Tra (2007) normalized one price when clearing the market to ensure that the solution was stable. When I did this (by arbitrarily forcing one excess demand to be zero) it prevented convergence. I'm not sure it's needed—as many authors have noted, there may be multiple equilibria in these models—but it could be put back if I could figure out how to implement it.

Prices oscillate rather than converge

Sometimes the prices oscillate rather than converge. This could be due to ignoring off-diagonal elements of the Jacobian. In the example I saw, though, delta_n_excluded was also oscillating, so the exclusion by rent / income ratio could be part of the problem.

A temporary solution is to multiply price changes by 0.9, so that a pure oscillation can't persist and the algorithm will get out of a rut (I think). But the value of that parameter will affect where the prices end up (recall that there are multiple equilibria due to the assumption of no migration or household formation), and, due to the nonlinearity of the budget constraint, will affect the sorting equilibrium.
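As a sketch with hypothetical variable names, the damped update is just:

    damping = 0.9  # < 1 trades convergence speed for stability; value is a tuning choice
    price = price + damping * delta_price  # instead of price = price + delta_price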

Banish .values

Pandas now recommends against using DataFrame.values in favor of DataFrame.to_numpy(), since the behavior of .values depends on the dtypes of the columns. This shouldn't cause a results issue in eqsormo, but could be a perf problem if every access of .values builds a big array and does type conversion behind the scenes. For now the workaround is to convert DataFrames to float64 before calling TraSortingModel.
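For illustration (a sketch, assuming a DataFrame df of model inputs):

    df = df.astype("float64")  # make all columns a uniform dtype up front
    arr = df.to_numpy()        # preferred over df.values in modern pandas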

Second stage standard errors are wrong

At least, I'm pretty sure. When I estimate the second stage using 2SLS, I'm taking the dependent variable (mean indirect utility) as given. But of course it isn't given, it has error associated with it. So I think the currently-reported second stage standard errors are too small.

Other price interactions

I could imagine wanting to interact price with things other than income. Provide a way to do this (just putting price into interactions won't work because the interaction won't get updated as price changes).

Numba DeprecationWarning: type List will no longer be supported

I get this warning when computing ASCs:

/opt/miniconda3/envs/py38/lib/python3.8/site-packages/numba/core/ir_utils.py:2031: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected list' found for argument 'starting_values' of function 'compute_ascs'.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "../code/eqsormo/eqsormo/common/compute_ascs.py", line 21:
@numba.jit(nopython=True)
def compute_ascs (base_utilities, supply, hhidx, choiceidx, starting_values=None, convergence_criterion=1e-6, weights=None):

The page referenced clearly explains why using Python lists in numba causes all kinds of havoc. The parameter causing the issue is the starting values/ASCs. The best solution is to convert the external interface to use tuples, though we will need to use lists internally since we update them. This has the added advantage of not mutating input parameters.
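A sketch of the fix (hypothetical wrapper; _compute_ascs_jit stands in for the jitted kernel): accept a tuple at the public interface and copy it into a numba typed list before calling into numba, so the reflected-list path is never hit and the caller's data is never mutated.

    from numba.typed import List

    def compute_ascs(base_utilities, supply, hhidx, choiceidx,
                     starting_values=None, convergence_criterion=1e-6, weights=None):
        # copy the (tuple) starting values into a numba typed list the kernel
        # can mutate freely; None handling omitted for brevity
        internal = List()
        for v in starting_values:
            internal.append(v)
        return _compute_ascs_jit(base_utilities, supply, hhidx, choiceidx,
                                 internal, convergence_criterion, weights)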

What to do when housing types become unaffordable in sorting?

Right now, when building alternatives, I eliminate any household/alternative combinations where annual rent exceeds annual income. When clearing the market, some housing types may become more expensive and should be eliminated from choice sets, while others may become cheaper and should be added back.
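One way to express the recheck (a sketch with hypothetical arrays) is to recompute the affordability mask from current prices on every market-clearing iteration, rather than fixing it when alternatives are built:

    import numpy as np

    def affordable_mask(annual_income, annual_rent):
        # annual_income: (n_households,), annual_rent: (n_housing_types,)
        # True where a household/housing-type pair stays in the choice set
        return annual_income[:, np.newaxis] >= annual_rent[np.newaxis, :]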

Why does the market share error _increase_ in some iterations?

Margin 0: maxdiff: 76918.98281792714, mindiff: -6481.494345673128 974 ascs
Margin 1: maxdiff: 29188.437191055098, mindiff: -41732.96218125825 4 ascs
Margin 0: maxdiff: 2953.032446451347, mindiff: -1638.9271872431127 974 ascs
Margin 1: maxdiff: 33528.06852119905, mindiff: -37406.7008873173 4 ascs

Is this right or does it indicate a bug?

Reëvaluate use of numba

I removed @numba.jit from compute_ascs to get better error messages when the model was crashing, and it shaved 1.5 hours off the model run time. Since this function is called many times and doesn't run many Python loops, the overhead of calling a numba function (which I believe involves some type translation) probably outweighs the benefit. Figure out whether this is the case in other places we use numba, and remove numba where it is.

Precomputed household-housing interactions?

The current parameterization is inflexible - you can put in housing attributes, household attributes, and those you want multiplied together, but can't put in externally computed interactions that are more complicated - like distance to work.

Why are the pickle results files so big?

For the final model I'm submitting to the St Anselm conference, the Pickle file is 10.1 GB and the numpy file is only 55.4 MB. Now, the pickle file does contain the houshold_housing_attributes, which are big, but the numpy file contains a bunch of arrays that are just as big, although without string indexes. In any case, moving to an npz-only format (#37) may solve this.

Store indices in a mmapped file

There are a bunch of indices into the full alternatives array that we create, which require a lot of memory even though they're infrequently used. Store them in mmapped files instead to reduce memory requirements.
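A minimal sketch with numpy's built-in memory mapping (hypothetical file name and index array):

    import numpy as np

    np.save("alt_index.npy", alt_index)                 # write the index once
    alt_index = np.load("alt_index.npy", mmap_mode="r")
    # pages are faulted in only when the index is actually accessed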

lint errors

We get a lot of lint errors (look at the GitHub actions CI on any commit) - should go through these and fix them.

Pickled models are huge

As in 4+ GB. I think it's because fullAlternatives gets pickled with it. This could be reconstructed on unpickling, but it's probably not worth the trouble at this juncture.
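If it ever becomes worth the trouble, a sketch of the standard pickle hooks (assuming fullAlternatives can be rebuilt from the model's other attributes):

    class TraSortingModel:
        def __getstate__(self):
            state = self.__dict__.copy()
            state.pop("fullAlternatives", None)  # drop the bulky, reconstructible part
            return state

        def __setstate__(self, state):
            self.__dict__.update(state)
            self.fullAlternatives = None  # rebuild lazily on first use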

Chain rule + numba for fast computation of derivatives

We need to compute the derivatives of choice probabilities with respect to price to clear the market. Since we allow infinite variability in functional forms for price/budget, implemented as pure Python functions, naively numbaizing the process does not improve performance. However, since the budget for every property is independent, we could use a two-step process: perturb the price of every property and recompute the budget (which should be numpy-vectorized) to get the derivative of budget with respect to price, then use a numbaized tight loop to get the derivative of probability with respect to budget, and finally use the chain rule to combine them.
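A self-contained sketch of the idea, with a toy log budget function and the standard MNL own-derivative dP/dV = P(1 - P) (all names hypothetical):

    import numpy as np

    def budget(income, price):
        return np.log(income - price)  # stand-in; the real function is user-supplied

    def d_prob_d_price(income, price, prob, beta_budget, eps=1e-4):
        # step 1 (numpy-vectorized): numeric derivative of budget w.r.t. price
        db_dp = (budget(income, price + eps) - budget(income, price - eps)) / (2 * eps)
        # step 2 (a numbaized tight loop in the real code): dP/dbudget for an MNL
        # whose utility includes beta_budget * budget
        dp_db = beta_budget * prob * (1.0 - prob)
        return dp_db * db_dp  # chain rule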

Should the Hessian exactly match statsmodels?

Right now the test for our custom Hessian code checks that the inverse Hessian matches that computed by statsmodels. The inverse Hessian is what we actually use, but the regular Hessian can be quite a bit different between statsmodels and our code, I think due to precision issues. Does the intermediate result matter?

Numba?

Should be able to re-write the MNLFullASC code to use Numba and speed things up a lot.

Parallelize hessian estimation

Hessian estimation is the slowest part of the model fitting process, and it's single-threaded. This seems like something where parallelization should be possible. Investigate.
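For what it's worth, the columns of a finite-difference Hessian are independent, so one sketch is a process pool over columns (hypothetical, and it assumes a picklable, module-level gradient function grad):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def hessian_column(args):
        grad, x, j, eps = args
        e = np.zeros_like(x)
        e[j] = eps
        return (grad(x + e) - grad(x - e)) / (2 * eps)  # column j of the Hessian

    def parallel_hessian(grad, x, eps=1e-5, workers=4):
        jobs = [(grad, x, j, eps) for j in range(len(x))]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            cols = list(pool.map(hessian_column, jobs))
        return np.column_stack(cols)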

Can we use theta full in second stage?

Deborah asks: after the first-stage parameters have been recovered, why not compute the same full theta used to clear the market, and use that as the dependent variable when estimating the second stage? That might improve parameter estimation - but I'm not sure we can do that.

Crazy derivatives when calculating prices

Occasionally I'm seeing really crazy derivatives when clearing the market that lead to price changes up to something like 1.4 million dollars per year - clearly something isn't right.

Faster Hessian

Hessian computation is extremely slow - on a large sample it takes considerably more time than finding the optimal parameters with L-BFGS-B. The Hessian computation code we are using from statsmodels evaluates the function many times. I wonder if there's a faster way to approximate the Hessian - evaluate numdifftools and DiffSharp.

See also #33
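For reference, a minimal numdifftools sketch (neg_log_likelihood and params_hat are hypothetical placeholders):

    import numpy as np
    import numdifftools as nd

    # adaptive finite-difference Hessian at the fitted parameters
    hessian = nd.Hessian(neg_log_likelihood)(params_hat)
    inv_hessian = np.linalg.inv(hessian)  # the inverse is what we actually use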
