
eqsormo's Issues

Weights

Add support for weights - this will likely be important when performing scenario analysis as it affects how much demand there is for different types of housing units.

ASCs slow to converge with some sample levels

When I run the transportlab model with 25 sampled alternatives in fast fit mode, it screams - but with 10 sampled alts (ostensibly a simpler problem), ASC computation takes many minutes.

Performance

Clearing the market is incredibly slow. There must be some optimizations I can do.

dependency cleanup

Running eqsormo requires dill and numba - are we still even using these?

Make sure type shocks are recovered correctly

The type shocks are the errors from the second stage 2SLS regression. Make sure that they're recovered correctly and don't include the errors from the first stage of the 2SLS estimation.
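As a check, a minimal 2SLS sketch (hypothetical arrays: Z instruments, X actual regressors, beta_2sls the 2SLS coefficients, mean_utility the dependent variable) - the type shock is the structural residual computed from the actual regressors, not from the first-stage fitted values:

    import numpy as np

    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # first-stage fitted values
    xi = mean_utility - X @ beta_2sls          # correct: structural (type) shock
    xi_bad = mean_utility - X_hat @ beta_2sls  # wrong: also carries first-stage error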

Make sampling faster

I think clever use of a numbaized loop with calls to numpy.random.randint will do the trick. Cleaned-up pseudocode:

    # one row per household: the chosen alternative plus nsamp - 1 sampled ones
    output = np.zeros((n_households, nsamp), dtype=np.int64)
    output[:, 0] = chosen                  # set chosen alternative
    for i in range(n_households):          # numbaize this loop
        for j in range(1, nsamp):          # fill in with random draws
            alt = np.random.randint(n_alts - 1)
            output[i, j] = alt + (alt >= chosen[i])  # shift draws past the chosen alt

endogenous attributes

Endogenous attributes should be supported. This doesn't matter in estimation, but in sorting, after each iteration, these endogenous attributes need to be updated.

Should ASCs be updated in Hessian calculation?

ASCs used to get updated in Hessian calculation, because they were updated every time full_utility was called. However, they don't anymore with the new, faster Hessian code. I'm not sure which is better - on the one hand, ASCs wouldn't get updated in a classic discrete choice model where the ASCs were estimated parameters. On the other hand, the ASCs also aren't included in the Hessian like estimated parameters would be.

Normalization of one price

Tra (2007) normalized one price when clearing the market to ensure that the solution was stable. When I did this (by arbitrarily forcing one excess demand to be zero) it prevented convergence. I'm not sure it's needed—as many authors have noted, there may be multiple equilibria in these models—but it could be put back if I could figure out how to implement it.

Prices oscillate rather than converge

Sometimes the prices oscillate rather than converge. This could be due to ignoring off-diagonal elements of the Jacobian. In the example I saw, though, delta_n_excluded was also oscillating, so the exclusion by rent / income ratio could be part of the problem.

A temporary solution is to multiply price changes by 0.9, so that a pure oscillation can't persist and the algorithm will get out of a rut (I think). But the value of that parameter will affect where the prices end up (recall that there are multiple equilibria due to the assumption of no migration or household formation), and, due to the nonlinearity of the budget constraint, will affect the sorting equilibrium.
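As a sketch with hypothetical variable names, the damped update is just:

    damping = 0.9  # < 1 trades convergence speed for stability; value is a tuning choice
    price = price + damping * delta_price  # instead of price = price + delta_price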

Banish .values

Pandas now recommends against using DataFrame.values in favor of DataFrame.to_numpy(), since the behavior of .values depends on the dtypes of the columns. This shouldn't cause a results issue in eqsormo, but could be a perf problem if every access of .values builds a big array and does type conversion behind the scenes. For now the workaround is to convert DataFrames to float64 before calling TraSortingModel.
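For illustration (a sketch, assuming a DataFrame df of model inputs):

    df = df.astype("float64")  # make all columns a uniform dtype up front
    arr = df.to_numpy()        # preferred over df.values in modern pandas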

Second stage standard errors are wrong

At least, I'm pretty sure. When I estimate the second stage using 2SLS, I'm taking the dependent variable (mean indirect utility) as given. But of course it isn't given, it has error associated with it. So I think the currently-reported second stage standard errors are too small.

Other price interactions

I could imagine wanting to interact price with things other than income. Provide a way to do this (just putting price into interactions won't work because the interaction won't get updated as price changes).

Numba DeprecationWarning: type List will no longer be supported

I get this warning when computing ASCs:

/opt/miniconda3/envs/py38/lib/python3.8/site-packages/numba/core/ir_utils.py:2031: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected list' found for argument 'starting_values' of function 'compute_ascs'.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "../code/eqsormo/eqsormo/common/compute_ascs.py", line 21:
@numba.jit(nopython=True)
def compute_ascs (base_utilities, supply, hhidx, choiceidx, starting_values=None, convergence_criterion=1e-6, weights=None):

The page referenced clearly explains why using Python lists in numba causes all kinds of havoc. The parameter causing the issue is the starting values/ASCs. The best solution is to convert the external interface to use tuples, though we will need to use lists internally since we update them. This has the added advantage of not mutating input parameters.
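A sketch of the fix (hypothetical wrapper; _compute_ascs_jit stands in for the jitted kernel): accept a tuple at the public interface and copy it into a numba typed list before calling into numba, so the reflected-list path is never hit and the caller's data is never mutated.

    from numba.typed import List

    def compute_ascs(base_utilities, supply, hhidx, choiceidx,
                     starting_values=None, convergence_criterion=1e-6, weights=None):
        # copy the (tuple) starting values into a numba typed list the kernel
        # can mutate freely; None handling omitted for brevity
        internal = List()
        for v in starting_values:
            internal.append(v)
        return _compute_ascs_jit(base_utilities, supply, hhidx, choiceidx,
                                 internal, convergence_criterion, weights)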

What to do when housing types become unaffordable in sorting?

Right now, when building alternatives, I eliminate any household/alternative combinations where annual rent exceeds annual income. When clearing the market, some housing types may become more expensive and should be eliminated from choice sets, while others may become cheaper and should be added back.
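One way to express the recheck (a sketch with hypothetical arrays) is to recompute the affordability mask from current prices on every market-clearing iteration, rather than fixing it when alternatives are built:

    import numpy as np

    def affordable_mask(annual_income, annual_rent):
        # annual_income: (n_households,), annual_rent: (n_housing_types,)
        # True where a household/housing-type pair stays in the choice set
        return annual_income[:, np.newaxis] >= annual_rent[np.newaxis, :]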

Why does the market share error _increase_ in some iterations?

Margin 0: maxdiff: 76918.98281792714, mindiff: -6481.494345673128 974 ascs
Margin 1: maxdiff: 29188.437191055098, mindiff: -41732.96218125825 4 ascs
Margin 0: maxdiff: 2953.032446451347, mindiff: -1638.9271872431127 974 ascs
Margin 1: maxdiff: 33528.06852119905, mindiff: -37406.7008873173 4 ascs

Is this right or does it indicate a bug?

Reëvaluate use of numba

I removed @numba.jit from compute_ascs to get better error messages when the model was crashing, and it shaved 1.5 hours off the model run time. Since this function is called many times and doesn't run many Python loops, the overhead of calling a numba function (which I believe involves some type translation) probably outweighs the benefit. Figure out whether this is the case in other places we use numba, and remove numba where it is.

Precomputed household-housing interactions?

The current parameterization is inflexible - you can put in housing attributes, household attributes, and those you want multiplied together, but can't put in externally computed interactions that are more complicated - like distance to work.

Why are the pickle results files so big?

For the final model I'm submitting to the St Anselm conference, the Pickle file is 10.1 GB and the numpy file is only 55.4 MB. Now, the pickle file does contain the houshold_housing_attributes, which are big, but the numpy file contains a bunch of arrays that are just as big, although without string indexes. In any case, moving to an npz-only format (#37) may solve this.

Store indices in a mmapped file

There are a bunch of indices into the full alternatives array that we create, which require a lot of memory even though they're infrequently used. Store them in mmapped files instead to reduce memory requirements.
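A minimal sketch with numpy's built-in memory mapping (hypothetical file name and index array):

    import numpy as np

    np.save("alt_index.npy", alt_index)                 # write the index once
    alt_index = np.load("alt_index.npy", mmap_mode="r")
    # pages are faulted in only when the index is actually accessed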

lint errors

We get a lot of lint errors (look at the GitHub actions CI on any commit) - should go through these and fix them.

Pickled models are huge

As in 4+ GB. I think it's because fullAlternatives gets pickled with it. This could be reconstructed on unpickling, but it's probably not worth the trouble at this juncture.
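If it ever becomes worth the trouble, a sketch of the standard pickle hooks (assuming fullAlternatives can be rebuilt from the model's other attributes):

    class TraSortingModel:
        def __getstate__(self):
            state = self.__dict__.copy()
            state.pop("fullAlternatives", None)  # drop the bulky, reconstructible part
            return state

        def __setstate__(self, state):
            self.__dict__.update(state)
            self.fullAlternatives = None  # rebuild lazily on first use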

Chain rule + numba for fast computation of derivatives

We need to compute the derivatives of choice probabilities with respect to price to clear the market. Since we allow infinite variability in functional forms for price/budget, implemented as pure Python functions, naively numbaizing the process does not improve performance. However, since the budget for every property is independent, we could use a two-step process: perturb the price of every property and recompute the budget (which should be numpy-vectorized) to get the derivative of budget with respect to price, then use a numbaized tight loop to get the derivative of probability with respect to budget, and finally use the chain rule to combine them.
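A self-contained sketch of the idea, with a toy log budget function and the standard MNL own-derivative dP/dV = P(1 - P) (all names hypothetical):

    import numpy as np

    def budget(income, price):
        return np.log(income - price)  # stand-in; the real function is user-supplied

    def d_prob_d_price(income, price, prob, beta_budget, eps=1e-4):
        # step 1 (numpy-vectorized): numeric derivative of budget w.r.t. price
        db_dp = (budget(income, price + eps) - budget(income, price - eps)) / (2 * eps)
        # step 2 (a numbaized tight loop in the real code): dP/dbudget for an MNL
        # whose utility includes beta_budget * budget
        dp_db = beta_budget * prob * (1.0 - prob)
        return dp_db * db_dp  # chain rule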

Should the Hessian exactly match statsmodels?

Right now the test for our custom Hessian code checks that the inverse Hessian matches that computed by statsmodels. The inverse Hessian is what we actually use, but the regular Hessian can be quite a bit different between statsmodels and our code, I think due to precision issues. Does the intermediate result matter?

Numba?

Should be able to re-write the MNLFullASC code to use Numba and speed things up a lot.

Parallelize hessian estimation

Hessian estimation is the slowest part of the model fitting process, and it's single-threaded. This seems like something where parallelization should be possible. Investigate.
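For what it's worth, the columns of a finite-difference Hessian are independent, so one sketch is a process pool over columns (hypothetical, and it assumes a picklable, module-level gradient function grad):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def hessian_column(args):
        grad, x, j, eps = args
        e = np.zeros_like(x)
        e[j] = eps
        return (grad(x + e) - grad(x - e)) / (2 * eps)  # column j of the Hessian

    def parallel_hessian(grad, x, eps=1e-5, workers=4):
        jobs = [(grad, x, j, eps) for j in range(len(x))]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            cols = list(pool.map(hessian_column, jobs))
        return np.column_stack(cols)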

Can we use theta full in second stage?

Deborah asks: after the first-stage parameters have been recovered, why not compute the same full theta used to clear the market, and use that as the dependent variable when estimating the second stage? That might improve parameter estimation - but I'm not sure we can do that.

Crazy derivatives when calculating prices

Occasionally I'm seeing really crazy derivatives when clearing the market that lead to price changes up to something like 1.4 million dollars per year - clearly something isn't right.

Faster Hessian

Hessian computation is extremely slow - on a large sample it takes considerably more time than finding the optimal parameters with L-BFGS-B. The Hessian computation code we are using from statsmodels evaluates the function many times. I wonder if there's a faster way to approximate the Hessian - evaluate numdifftools and DiffSharp.

See also #33
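For reference, a minimal numdifftools sketch (neg_log_likelihood and params_hat are hypothetical placeholders):

    import numpy as np
    import numdifftools as nd

    # adaptive finite-difference Hessian at the fitted parameters
    hessian = nd.Hessian(neg_log_likelihood)(params_hat)
    inv_hessian = np.linalg.inv(hessian)  # the inverse is what we actually use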
