activitysim / populationsim

An Open Platform for Population Synthesis

Home Page: https://activitysim.github.io/populationsim

License: Other

Python 31.26% Jupyter Notebook 68.74%

python data-science population-synthesis activitysim bsd-3-clause microsimulation

populationsim's Issues

cylp doesn't easily install for 64bit Anaconda on Windows

We're using the GLPK_MI solver from cylp for the simultaneous integerizer since it is the most robust and stable option for cvxpy. Unfortunately, it doesn't appear to be easily installed for 64-bit Anaconda on Windows. @toliwaga and I searched for and tested a few ideas, to no avail. We also tested all the other available cvxpy solvers, but nothing seems to work (see the table below). At this point, we think it makes the most sense to implement the simultaneous integerizer in ortools, since it works on Windows, is fast, and @toliwaga is familiar with it. We're working on a solution.

Solver	LP	Example CALM	Example TEST
CBC	X	No easy install for Anaconda 64bit Windows?	No easy install for Anaconda 64bit Windows?
GLPK	X	No easy install for Anaconda 64bit Windows?	No easy install for Anaconda 64bit Windows?
GLPK_MI	X	No easy install for Anaconda 64bit Windows?	No easy install for Anaconda 64bit Windows?
Elemental	X	No Windows version	No Windows version
ECOS	X	tried and failed	Integerizer works but simul_integerize fails
ECOS_BB	X	tried and failed	Integerizer works but simul_integerize fails
GUROBI	X	Commercial license required	Commercial license required
MOSEK	X	Commercial license required	Commercial license required
XPRESS	X	Commercial license required	Commercial license required
CVXOPT	X	tried and failed	tried and failed
SCS	X	fails, hits max iters even with 100,000 iters set	fails, hits max iters even with 100,000 iters set
LS	X	cannot solve	cannot solve

Recommended way to configure model year?

Hi there - I've been playing with PopulationSim for our use (https://github.com/BayAreaMetro/populationsim) and so far, I'm testing it for 2010 by specifying the 2010 control files (for example, https://github.com/BayAreaMetro/populationsim/blob/master/bay_area/households/configs/settings.yaml#L61)

Our standard practice will be to run it for multiple years, but I don't like the obvious solutions:

having duplicates of the config that look very similar, with just the year changed
having a single config but copying/moving files around at runtime

I'd prefer to have the settings.yaml file contain something like

model_year: 2010

and then

  - tablename: MAZ_control_data
    filename : %model_year%_mazData.csv

How do you recommend folks handle this?
Thank you!
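
One way to get this behaviour without changing PopulationSim would be to expand the tokens in a small wrapper before the run starts. The sketch below is an assumption, not an existing PopulationSim feature: the `%model_year%` token syntax is taken from the question above, `expand_tokens` is a hypothetical helper, and the `input_table_list` key is used only for illustration on settings that have already been loaded into a Python dict.

```python
import re

# Matches tokens of the form %name%
TOKEN = re.compile(r"%(\w+)%")


def expand_tokens(node, values):
    """Recursively replace %name% tokens in every string of a nested settings structure."""
    if isinstance(node, dict):
        return {key: expand_tokens(val, values) for key, val in node.items()}
    if isinstance(node, list):
        return [expand_tokens(item, values) for item in node]
    if isinstance(node, str):
        # Unknown tokens are left untouched so that typos stay visible.
        return TOKEN.sub(lambda m: str(values.get(m.group(1), m.group(0))), node)
    return node


# Hypothetical settings, as they might look after loading settings.yaml
settings = {
    "model_year": 2010,
    "input_table_list": [
        {"tablename": "MAZ_control_data", "filename": "%model_year%_mazData.csv"},
    ],
}

expanded = expand_tokens(settings, {"model_year": settings["model_year"]})
print(expanded["input_table_list"][0]["filename"])
```

With this approach a single config could serve every year, and only `model_year` would change between runs.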

specify minimum weights

It would be useful for survey weighting to allow for user specified minimum weights in addition to user specified maximum weights. For example, a min weight of 1/4 * the initial weight and a max weight of 4 * the initial weight. Currently the minimum weight is hard coded as 0. Our testing shows that some records end up with a weight of 0 (in part due to integerization) and so they are not included in the final data set. This issue is related to #75 as well.
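
To make the request concrete, the bounds could be expressed as factors of the initial weight. This is only a sketch of the arithmetic: `min_factor`, `max_factor`, and both function names are hypothetical, and clipping a weight after the fact is not the same as enforcing the bound inside the balancing algorithm, which is what the feature would actually need.

```python
def weight_bounds(initial_weight, min_factor=0.25, max_factor=4.0):
    """Return the (lower, upper) bounds implied by the factors for one record."""
    return initial_weight * min_factor, initial_weight * max_factor


def clip_weight(weight, initial_weight, min_factor=0.25, max_factor=4.0):
    """Force a weight into the allowed range around its initial weight."""
    lower, upper = weight_bounds(initial_weight, min_factor, max_factor)
    return min(max(weight, lower), upper)


# A record with initial weight 20 may end up anywhere between 5 and 80.
print(weight_bounds(20.0))
print(clip_weight(0.0, 20.0))
```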

Missing Module in ActivitySim PyPi Dependency

The setup depends on the RSGInc ActivitySim fork that includes an 'inject' module in activitysim.core. This module does not exist in the pypi version of ActivitySim.

https://github.com/RSGInc/populationsim/blob/02cd75fdb5c991fd58c5bcb0271020852d452a6d/populationsim/steps/input_pre_processor.py#L10

Fail to integerize seed weights

After moving beyond #104 and preparing our seed data in a better way, we now have an issue with integerizing the seed weights. Here's the stack trace:

INFO - integerize_final_seed_weights seed id 49003
Welcome to the CBC MILP Solver
Version: 2.10.3
Build Date: Oct 11 2019

command line - cbc -solve -quit (default strategy 1)
Presolve 31 (-6) rows, 335 (-7) columns and 3583 (-6) elements
0  Obj -0 Primal inf 393.88521 (16)
31  Obj -5027.9043 Primal inf 1.0552473 (3)
32  Obj -5027.9043
Optimal - objective value -5027.9043
After Postsolve, objective -5027.9043, infeasibilities - dual 0 (0), primal 0 (0)
Optimal objective -5027.904278 - 32 iterations time 0.002, Presolve 0.00
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.00
Traceback (most recent call last):
  File "run_populationsim.py", line 63, in <module>
    pipeline.run(models=steps, resume_after=resume_after)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 594, in run
    run_model(model)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 471, in run_model
    orca.run([step_name])
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 2034, in run
    step()
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 843, in __call__
    return self._func(**kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/integerize_final_seed_weights.py", line 83, in integerize_final_seed_weights
    total_hh_control_col=total_hh_control_col
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 308, in do_integerizing
    status = integerizer.integerize()
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 187, in integerize
    smart_round(int_weights, resid_weights, self.total_hh_control_value)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 38, in smart_round
    assert target_sum == int(target_sum)

My guess is that this is an issue with data typing or configuring the weight field appropriately. We are using the basic WGTP field from the ACS PUMS record, and the person file has both the WGTP and PWGTP fields appended. We filtered households out that had WGTP <= 0, but are unsure if something else needs to happen.
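
The last frame shows `assert target_sum == int(target_sum)` in `smart_round`, which is called with `self.total_hh_control_value`, so the assertion fails whenever that value is not a whole number. Whether that is the cause here is a guess. The sketch below is a hypothetical pre-check (`non_integer_controls` is not part of PopulationSim, and the tolerance is an assumption) that lists control totals that are not whole numbers.

```python
def non_integer_controls(controls, tolerance=1e-9):
    """Return {zone_id: value} for control totals that are not whole numbers."""
    return {
        zone: value
        for zone, value in controls.items()
        if abs(value - round(value)) > tolerance
    }


# Illustrative values only; the zone IDs and totals are made up.
total_hh_controls = {49003: 5027.9, 49005: 4100.0}
print(non_integer_controls(total_hh_controls))
```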

The repository is here. We needed to remove the processed seed data from the repo for GitHub's limits (eventually) but the seed files are here

add mtc popsim example

Consistent with the ActivitySim example, I think it makes sense to add the MTC TM1 PopSim setup as an additional example to this repo. @lmz?

ortools integerization results vary from platform to platform

Regression tests unreliable since edge case results depend on exact ortools/cbc version

Add simultaneous integerizer

Input HDF5 file for repop mode should be in the "data" folder

Currently, the input HDF5 file from the base run for the repop mode is copied to the "output" folder. This should be changed to the "data" folder: the user should copy the HDF5 file to the "data" folder of the repop setup.

Make geographies flexible

We need to support N-number of lower level geographies. For example, in the settings file we have:

geographies: [REGION, PUMA, TRACT, TAZ]
seed_geography: PUMA

#or

geographies: [REGION, PUMA, TRACT, TAZ, MAZ]
seed_geography: PUMA

Only one geography may appear in the geographies list before the seed_geography; the geographies after the seed_geography are lower-level. There can be as many lower-level geographies as the user wants.
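
The rule above can be sketched as a small validation step. `split_geographies` is a hypothetical helper, not the PopulationSim implementation, and it reads the rule as "at most one geography before the seed geography", which is an assumption.

```python
def split_geographies(geographies, seed_geography):
    """Split the list into (meta, seed, lower-level) parts and validate the layout."""
    if seed_geography not in geographies:
        raise ValueError(f"seed_geography '{seed_geography}' is not in geographies")
    seed_index = geographies.index(seed_geography)
    if seed_index > 1:
        raise ValueError("only one geography may precede the seed geography")
    meta = geographies[:seed_index]
    lower = geographies[seed_index + 1:]
    return meta, seed_geography, lower


print(split_geographies(["REGION", "PUMA", "TRACT", "TAZ"], "PUMA"))
print(split_geographies(["REGION", "PUMA", "TRACT", "TAZ", "MAZ"], "PUMA"))
```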

support proportional membership for weighting

We're not planning to do this right now, but we don't want to forget this idea for the future if needed.

In addition to tagging households/persons as T/F membership for attributes, allow for partial membership. For example, instead of is HH size 1, 2, 3 or 4, specify the HH is 80% likely size 1, 15% size 2, 5% size 3, 0% size 4+.
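
A minimal sketch of what partial membership would mean for the weighted totals, using the percentages from the example above. The data layout and the `weighted_totals` name are hypothetical; the point is only that a fractional incidence value contributes that fraction of the household weight to each category.

```python
# Each household carries a weight and a membership share per size category.
households = [
    {"weight": 10.0, "size": {"1": 1.0, "2": 0.0, "3": 0.0, "4+": 0.0}},   # plain T/F membership
    {"weight": 20.0, "size": {"1": 0.8, "2": 0.15, "3": 0.05, "4+": 0.0}},  # partial membership
]


def weighted_totals(households, attribute):
    """Sum weight * membership share for every category of the attribute."""
    totals = {}
    for household in households:
        for category, share in household[attribute].items():
            totals[category] = totals.get(category, 0.0) + household["weight"] * share
    return totals


print(weighted_totals(households, "size"))
```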

Standard license

We are considering integrating this library into our work as an option to synthesize a population (plug and play among others) and in doing so contributing to the codebase. Is it possible to put a standard license on the library like MIT or BSD?

RuntimeError: table 'taz_control_data' never checkpointed. Closing remaining open files:output\pipeline.h5...done

Dear All,

I am trying to run popsim, and everything seems to work until I get this message after run_pop reads the files:

RuntimeError: table 'taz_control_data' never checkpointed.
Closing remaining open files:output\pipeline.h5...done

Any help?

Regards,

Issa

incorporate popsampler functionality

We might want to incorporate popsampler functionality into PopulationSim

create python version of existing R validation script and include as part of a popsim run

https://github.com/ActivitySim/populationsim/blob/master/scripts/validationPopulationSim.R

Cannot convert non-finite values to integer

I'm trying to get an implementation of populationsim up and running. I'm using the populationsim Anaconda environment (on MacOS 10.14.6), with the data files and configuration in this repository. I've been able to grind through many of the errors, but this traceback is something I can't figure out.

Traceback (most recent call last):
  File "run_populationsim.py", line 63, in <module>
    pipeline.run(models=steps, resume_after=resume_after)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 594, in run
    run_model(model)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 471, in run_model
    orca.run([step_name])
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 2034, in run
    step()
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 843, in __call__
    return self._func(**kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/setup_data_structures.py", line 340, in setup_data_structures
    = build_grouped_incidence_table(incidence_table, control_spec, seed_geography)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/setup_data_structures.py", line 230, in build_grouped_incidence_table
    how='left').group_id.astype(int).values
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/generic.py", line 5882, in astype
    dtype=dtype, copy=copy, errors=errors, **kwargs
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 581, in astype
    return self.apply("astype", dtype=dtype, **kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 438, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 559, in astype
    return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 643, in _astype
    values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 700, in astype_nansafe
    "Cannot convert non-finite values (NA or inf) to " "integer"
ValueError: Cannot convert non-finite values (NA or inf) to integer
Closing remaining open files:output/pipeline.h5...done

It's not clear if these NA values are coming from the controls (unlikely) or the seed table (I suppose very likely) or the geographic crosswalk, and if so from which column. Am I on the right track, or is it a different issue entirely?
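
The frame in `setup_data_structures.py` shows a left join followed by `.group_id.astype(int)`, so any row without a match on the right-hand side would carry NA into the cast. One way to narrow down the source, sketched here in plain Python, is to list the key values on the left that have no match on the right. The `PUMA` key and the sample rows are assumptions for illustration; the actual join columns are not shown in the traceback.

```python
def unmatched_keys(left_rows, right_rows, key):
    """Return left-side key values (in first-seen order) with no match on the right."""
    right_keys = {row[key] for row in right_rows}
    missing = []
    for row in left_rows:
        value = row.get(key)
        if value not in right_keys and value not in missing:
            missing.append(value)
    return missing


# Made-up sample: one seed household refers to a PUMA absent from the crosswalk,
# and one has no PUMA at all.
seed_households = [{"hh_id": 1, "PUMA": 49003}, {"hh_id": 2, "PUMA": 49099}, {"hh_id": 3, "PUMA": None}]
crosswalk = [{"PUMA": 49003, "TAZ": 1}, {"PUMA": 49005, "TAZ": 2}]
print(unmatched_keys(seed_households, crosswalk, "PUMA"))
```

Running the same check in both directions (seed against crosswalk, controls against crosswalk) should show which table introduces the missing values.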

improve installation process

The zipped install setup described in the documentation can be tricky to update. I think we should improve our recommended installation procedure. Maybe we revert to simply requiring the user to install Anaconda and then providing an install script that they can run.

export an intermediate, un-expanded, seed table with float weights

@binnympaul - For the survey weighting application, we need to switch off integerization in PopulationSim. My understanding is that if the integerization model steps are not run, the PopulationSim algorithm works only with float weights. The seed sample should not be expanded since the weights have not been integerized, so a final synthetic population will not be produced. How do we configure PopulationSim to export an intermediate, un-expanded seed table with float weights?

summary_COUNTY output table?

The summary output tables that have the control, result and diffs are really helpful to visualize. We have configured controls for MAZs, TAZs and COUNTY. Is there a way to get a summary_COUNTY table output? It looks like it's considered a "meta_geography" rather than a "sub_geography", presumably because it's bigger than the PUMA; but I would think getting this output is still doable? Thank you!

allow user to specify output fields in the expanded hh and person files

the settings for this will look something like:

expand_households fails if zone IDs are floats

In the example below, the zone IDs in the control data file are stored as floats even though they are whole numbers (e.g., 5.0 instead of 5). This is fine for some of the steps, but is a problem for expand_households, specifically this line. We should fix this by checking for it on input and/or casting the zone IDs to int. I confirmed expand_households works if I remove all ".0" from the file.
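
A minimal sketch of the input check suggested above, assuming zone IDs arrive as strings or floats. `to_int_zone_id` is a hypothetical helper: it accepts whole-number values such as 5.0 or "5.0" and rejects anything else instead of silently truncating.

```python
def to_int_zone_id(value):
    """Convert a whole-number zone ID (5, 5.0, "5.0") to int; reject anything else."""
    number = float(value)
    if not number.is_integer():
        # Covers fractional values as well as NaN and infinity.
        raise ValueError(f"zone ID {value!r} is not a whole number")
    return int(number)


print([to_int_zone_id(v) for v in ["5.0", 12.0, 7]])
```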

latest ortools breaks populationsim so we froze the required package version

@binnympaul "Looks like the new release of ortools (7.5.7466, released Jan 28) is breaking populationsim. I’m trying a run after installing the previous version of ortools (7.4.7247), and PopulationSim seems to be running beyond the error point."

YAML note in user guide

We just had a run fail, and we think it's because we put tabs in the YAML settings file.
Can a line or two be added to the user guide on important considerations when editing the YAML settings file? In this case it seems like the user can only use spaces. Is that correct?

Are there any other important aspects of updating the YAML file that the user needs to be aware of?
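
For reference, the YAML specification does not allow tab characters for indentation, so spaces are required there. A quick pre-run check along these lines could catch the problem early; `lines_with_tab_indent` is a hypothetical helper that works on the text of the settings file.

```python
def lines_with_tab_indent(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


# Made-up settings text: the second line is indented with a tab.
sample = "output_tables:\n\taction: include\n  tables:\n    - expanded_household_ids\n"
print(lines_with_tab_indent(sample))
```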

create inputs pre-processor

The inputs pre-processor reads each input table, runs pandas expressions (*_expressions.csv) against the table to create additional required table fields, and saves the tables to the datastore. For example, it processes raw Census tables to create the required fields for population synthesis. The inputs pre-processor exposes all input tables to the expressions calculator so tables can be joined (for example, households to persons). It reads the geographic crosswalk file in order to join meta, mid, and low level zone tables if needed. The format of the expressions file follows ActivitySim, as shown in the example below. The seed_households expressions file below operates on the seed_households input file and processes the NPF field to create the FAMTAG field, which is then used by PopulationSim in later steps.

Description	Target	Expression
HH is a family	FAMTAG	pd.notnull( NPF ) * 1

Python 3 support

Need to make sure PopulationSim works for Python 3 (and update all related materials as well). Updating ActivitySim to work for both 2 and 3 wasn't a big deal, so updating PopulationSim should be relatively straightforward.

update documentation to use conda install instead of pip to be safe

Add major university column to persons file

@DDudich

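# Script adds a major university column to the persons file and fixes null values in the Standard Occupation Classification (SOC) column.

# Read the persons file, keeping strings as character vectors
x <- read.csv("persons.csv", as.is = TRUE)
# Replace "NUL" in the SOC column with 0 and convert the column to numeric
x$soc <- as.numeric(gsub("NUL", 0, x$soc))
# Add the major university column, initialized to 0
x$majoruni <- 0
# Sort rows by columns 30 and 10, and move column 31 to sit after column 23
x <- x[order(x[, 30], x[, 10]), c(1:23, 31, 24:30)]
# Renumber person IDs sequentially
x$PERID <- 1:nrow(x)
write.csv(x, "persons_sorted_uni.csv", row.names = FALSE)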

Better Error Messaging

I think one piece of low-hanging fruit for error messaging would be better reporting of which zone is being worked on at the time of the failure. This could be handled either by consistently writing to the screen which step and zone are being worked on, or by building error messages that report the step and zone in which the process died.
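
The second option could look roughly like the sketch below: wrap the per-zone work and re-raise with the step and zone attached. `run_for_zones` and `process_zone` are hypothetical names, not existing PopulationSim functions.

```python
def run_for_zones(step_name, zone_ids, process_zone):
    """Run process_zone for every zone, adding step and zone context to any failure."""
    results = {}
    for zone_id in zone_ids:
        try:
            results[zone_id] = process_zone(zone_id)
        except Exception as err:
            # Chain the original exception so its traceback is still available.
            raise RuntimeError(
                f"step '{step_name}' failed for zone {zone_id}: {err}"
            ) from err
    return results


print(run_for_zones("demo_step", [1, 2, 3], lambda zone_id: zone_id * 10))
```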

allow for specifying controls only at the seed level

print version of populationsim during run to help diagnose version issues

also maybe plot the variance around the mean instead of around the zero axis since it can be confusing when the mean appears beyond the variance range.

Integerizing warning/error messages during run

Completing a populationsim run throws these errors/warnings before then finishing successfully:
ERROR- Integerizer failed for COUNTY_5_MAZ_418156 status INFEASIBLE. Returning smart-rounded original weights.
WARNING- do_simul_integerizing failed for COUNTY_5 status INFEASIBLE.

These are returned for many of our MAZ/TAZ/COUNTY geographies. Are these simply a limitation of the algorithms that will have populationsim perform in a less sophisticated way, or could it mean something about our controls etc. for those geographies? A couple of our variables aren't coming out great, so wondering if there is a relationship here. Thanks! @lmz

ortools does not install on ubuntu 14, causes Travis to fail

cannot use same target column for two different controls

The control_field column in the controls.csv file must be unique. To use same control value for two expressions, duplicate the column and rename to something else.
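
A small check for this constraint could be run before starting PopulationSim. The sketch below uses only the standard library; `duplicate_control_fields` is a hypothetical helper, and the sample rows and column subset are made up for illustration.

```python
import csv
import io
from collections import Counter


def duplicate_control_fields(controls_csv_text):
    """Return the control_field values that appear on more than one row."""
    reader = csv.DictReader(io.StringIO(controls_csv_text))
    counts = Counter(row["control_field"] for row in reader)
    return sorted(name for name, count in counts.items() if count > 1)


# Made-up controls file: HHSIZE1 is used as the control_field for two targets.
sample_controls = (
    "target,control_field,expression\n"
    "num_hh,HHBASE,households.WGTP > 0\n"
    "hh_size_1,HHSIZE1,households.NP == 1\n"
    "hh_one_person,HHSIZE1,households.NP == 1\n"
)
print(duplicate_control_fields(sample_controls))
```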

Generate summary table for seed-level controls

Currently, summary tables are not produced for implementations with only seed-level controls,
e.g., summary_PUMA.csv

pandas.read_csv sometimes fails with default utf-8 encoding

PopulationSim's input_pre_processor sometimes fails to read CSV files with certain Windows encodings using the default utf-8 decoder.

INFO - Reading csv file data\seed_households.csv
Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1151, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8: invalid start byte
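
Byte 0xa0 is a non-breaking space in the Windows cp1252 code page, which is consistent with a file saved in a Windows encoding. A fallback along the following lines is one possible approach; `decode_with_fallback` is a hypothetical helper, and the choice of cp1252 as the second encoding is an assumption. `pandas.read_csv` also accepts an `encoding` argument, so the detected encoding could be passed through to it.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up header line containing a cp1252 non-breaking space (byte 0xa0).
raw_bytes = b"SERIALNO\xa0,NP\n"
text, used = decode_with_fallback(raw_bytes)
print(used)
```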

runtime arguments causing error

Following call results in error: python run_populationsim.py --config configs

Traceback (most recent call last):
  File "run_populationsim.py", line 63, in <module>
    pipeline.run(models=steps, resume_after=resume_after)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\pipeline.py", line 594, in run
    run_model(model)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\pipeline.py", line 471, in run_model
    orca.run([step_name])
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\orca.py", line 2034, in run
    step()
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\orca.py", line 843, in __call__
    return self._func(**kwargs)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\populationsim\steps\setup_data_structures.py", line 317, in setup_data_structures
    control_spec = read_control_spec(setting('control_file_name', 'controls.csv'), configs_dir)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\populationsim\steps\setup_data_structures.py", line 29, in read_control_spec
    data_file_path = os.path.join(configs_dir, data_filename)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\ntpath.py", line 76, in join
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not list
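
The traceback shows `os.path.join` receiving a list as `configs_dir`, which suggests the command-line argument arrives as a list of directories rather than a single string. A tolerant lookup could accept both forms, as in this sketch. `find_config_file` is a hypothetical helper, and the `exists` parameter is there only so the example can run without touching the file system.

```python
import os


def find_config_file(configs_dir, file_name, exists=os.path.exists):
    """Return the first existing path for file_name in one or several config directories."""
    directories = [configs_dir] if isinstance(configs_dir, str) else list(configs_dir)
    for directory in directories:
        candidate = os.path.join(directory, file_name)
        if exists(candidate):
            return candidate
    raise FileNotFoundError(f"{file_name} not found in {directories}")


# Both a plain string and a list of directories are accepted.
print(find_config_file("configs", "controls.csv", exists=lambda path: True))
print(find_config_file(["configs"], "controls.csv", exists=lambda path: True))
```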

Don't need docker any more for travis

now that they have migrated to Ubuntu Trusty

User should not be required to specify expanded outputs in the output_tables: setting

The user should not be required to explicitly specify the expanded_households or expanded_persons file in the output_table settings in the settings.YAML file. These should be written out regardless.

PopulationSim not compatible with PyYAML version 5

check code against SOABM populationsim data

assert failing in do_integerizing for backstopped controls for TAZ 141 seed 900
assert len(incidence_table.index) > 1

add on-board survey weighting example

We've successfully reweighted a transit on-board survey with the software as well. Let's add that example to the repo and include it in the user documentation.

Zero household sub-zones should be skipped in sub-balancing step

PopulationSim errors out for zero household MAZs within a zero-household TAZ.
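
A minimal sketch of the proposed behaviour, assuming the total household control per sub-zone is available as a mapping. `split_zero_household_zones` is a hypothetical helper; the real sub-balancing step may organise its inputs differently.

```python
def split_zero_household_zones(total_hh_by_zone):
    """Separate zones to balance from zones to skip because they have no households."""
    active = {zone: total for zone, total in total_hh_by_zone.items() if total > 0}
    skipped = sorted(zone for zone, total in total_hh_by_zone.items() if total <= 0)
    return active, skipped


# Made-up MAZ totals inside one TAZ.
active, skipped = split_zero_household_zones({101: 0, 102: 35, 103: 0})
print(active, skipped)
```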

Hard-coded household ID field name in summarize.py

Hard-coded field name for household ID is being used in summarize.py.

This forces users to name the unique household ID as 'hh_id' in seed_households file. The code must get the household_id_col name from the settings.yaml file.
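
The fix could be as small as looking the column name up in the settings instead of using a literal. In this sketch the `household_id_col` key comes from the description above, while the function names, the `hh_id` fallback, and the sample rows are assumptions.

```python
def household_id_column(settings, default="hh_id"):
    """Read the household ID column name from settings, with a fallback."""
    return settings.get("household_id_col", default)


def records_per_household(rows, settings):
    """Count records per household using the configured ID column."""
    id_col = household_id_column(settings)
    counts = {}
    for row in rows:
        counts[row[id_col]] = counts.get(row[id_col], 0) + 1
    return counts


# Made-up expanded records keyed by a user-chosen column name.
rows = [{"SERIALNO": "A1"}, {"SERIALNO": "A1"}, {"SERIALNO": "B7"}]
print(records_per_household(rows, {"household_id_col": "SERIALNO"}))
```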

example_calm needs a little explanation

I'm running a test of the data located here:
https://github.com/RSGInc/populationsim/tree/master/example_calm/data

I've had to guess at a lot of the field names/meanings. Here are the minimum things that I need to know that I can't make up:

Neither the household nor the person seed has a TAZ or Tract column (that I can see). I need that to match to the TAZ and tract controls. I don't see how to make use of the geo equivalency file, either.
I can't determine which seed field gives me the number of household workers. On the off chance that it's the "COW" (for "class of worker"?), then I need to know which categorical values translate to worker before I can summarize by household. To be clear, the tract level controls apply controls by the number of workers in the household, so that's what I'm trying to determine.

Other stuff I made up like the income group and age group ranges. The actual ranges would be nice to have, but are not required. I've also assumed that "NP" in the household table is "number of persons" (or household size).

Thanks!

tidy up, including more code documentation

more docstrings are needed for building the documentation

Parallel processing

PopulationSim is currently running in one process/thread. Eventually we want to multi-process/thread in order to improve runtimes, especially for larger / more complicated setups.

target name for total HH control must be same as control field name

The 'target' field in the controls.csv file must be the same as the 'control_field' value, and both must match the value of the 'total_hh_control' token in the settings.YAML file.

If the 'target' field is set to something else, a KeyError results.
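
Until this is relaxed, a pre-run check could report the mismatch with a clearer message than the KeyError. The sketch below assumes the controls have been read into a list of dicts; `check_total_hh_control` and the sample rows are hypothetical.

```python
def check_total_hh_control(control_rows, total_hh_control):
    """Return a list of problems; an empty list means the naming rule holds."""
    problems = []
    matching = [row for row in control_rows if row["control_field"] == total_hh_control]
    if not matching:
        problems.append(f"no control row has control_field '{total_hh_control}'")
    for row in matching:
        if row["target"] != total_hh_control:
            problems.append(
                f"target '{row['target']}' should be '{total_hh_control}'"
            )
    return problems


# Made-up rows: the target name differs from the control_field.
rows_bad = [{"target": "total_households", "control_field": "num_hh"}]
print(check_total_hh_control(rows_bad, "num_hh"))
```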

maybe add command line tools for easier use

See ActivitySim/activitysim#287

Add update geography feature

Here's the initial design from @jfdman

As far as the single geography feature, here is what I propose to do. The user would specify the controls for a subset of geographies – usually just a few zones, along with a pre-created synthetic population. The controls would have to be specified for the lowest level geographies that the previous synthetic population was created for – the reason is obvious, if you don’t specify the lowest geography of the existing population you can’t use the output in the model. The controls do not have to be consistent with the controls used to generate the original synthetic population. For example, you could use some combination of housing type and number of bathrooms per unit as a control even if neither was used as a control in the original population. There would only be single-level controls – global controls cannot be specified since there is no guarantee that the selected geographies add up to a global geography, and because this simply isn’t a requirement of the use case. The user would also have the ability to specify whether to over-write the existing population in the selected geographies or add to the existing population.

Once the tool is set up, it would read in the existing population. It would synthesize the households in the specified geographies by first weighting the PUMS data to match the controls, then sequentially integerizing the weights for each geography. Simultaneous integerization isn’t necessary since there are no global controls. The households\persons in the existing population outside the selected area would be unaffected. The output file would contain the population for the entire region, with the synthetic population in the selected geography either replacing the existing population in that geography or adding to it.
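
The final merge step described above (replace or add, leaving everything outside the selected area untouched) can be sketched as follows. This covers only the merge, not the weighting or integerization; `update_population`, the `zone_id` column, and the sample records are hypothetical.

```python
def update_population(existing, new, selected_zones, zone_col="zone_id", overwrite=True):
    """Merge newly synthesized households into an existing population."""
    if overwrite:
        # Drop existing households in the selected zones before adding the new ones.
        kept = [hh for hh in existing if hh[zone_col] not in selected_zones]
    else:
        kept = list(existing)
    return kept + list(new)


existing = [{"hh": 1, "zone_id": 10}, {"hh": 2, "zone_id": 20}]
new = [{"hh": 3, "zone_id": 20}]
print(update_population(existing, new, {20}, overwrite=True))
print(update_population(existing, new, {20}, overwrite=False))
```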

floating-point versus integer controls

Integer controls behave better than floating-point controls and so we want the user to be aware of this.

@jfdman - maybe we should have a switch in the control file (roundControls=false), which would be set to false by default and throw an informative error message if the controls are not integers, but if set to true, would just throw a warning and round to the nearest integer before proceeding. We also need to update the wiki.
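
The proposed switch could behave like this sketch. `prepare_controls` and `round_controls` are hypothetical names standing in for the `roundControls` setting; note that Python's built-in `round` resolves exact .5 ties to the nearest even integer, which may or may not be the desired rule.

```python
import warnings


def prepare_controls(values, round_controls=False):
    """Return integer controls, rounding with a warning only if explicitly allowed."""
    non_integer = [value for value in values if not float(value).is_integer()]
    if not non_integer:
        return [int(value) for value in values]
    if not round_controls:
        raise ValueError(
            f"controls must be integers; found {non_integer}. "
            "Set round_controls=True to round them instead."
        )
    warnings.warn(f"rounding {len(non_integer)} non-integer control value(s)")
    return [int(round(value)) for value in values]


print(prepare_controls([10, 20.0]))
```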
