activitysim / populationsim
An Open Platform for Population Synthesis
Home Page: https://activitysim.github.io/populationsim
License: Other
We're using the GLPK_MI solver from cylp for the simultaneous integerizer since it is the most robust and stable option for cvxpy. Unfortunately, it doesn't appear to be easily installed for 64-bit Anaconda on Windows. @toliwaga and I searched for and tested a few ideas, but to no avail. We also tested all the other available cvxpy solvers, but none of them works (see the table below). At this point, we think it makes the most sense to implement the simultaneous integerizer in ortools, since it works on Windows, is fast, and @toliwaga is familiar with it. We're working on a solution.
Solver | LP | Example CALM | Example TEST |
---|---|---|---|
CBC | X | No easy install for Anaconda 64bit Windows? | No easy install for Anaconda 64bit Windows? |
GLPK | X | No easy install for Anaconda 64bit Windows? | No easy install for Anaconda 64bit Windows? |
GLPK_MI | X | No easy install for Anaconda 64bit Windows? | No easy install for Anaconda 64bit Windows? |
Elemental | X | No Windows version | No Windows version |
ECOS | X | tried and failed | Integerizer works but simul_integerize fails |
ECOS_BB | X | tried and failed | Integerizer works but simul_integerize fails |
GUROBI | X | Commercial license required | Commercial license required |
MOSEK | X | Commercial license required | Commercial license required |
XPRESS | X | Commercial license required | Commercial license required |
CVXOPT | X | tried and failed | tried and failed |
SCS | X | fails, hits max iters even with 100,000 iters set | fails, hits max iters even with 100,000 iters set |
LS | X | cannot solve | cannot solve |
Hi there - I've been playing with PopulationSim for our use (https://github.com/BayAreaMetro/populationsim). So far, I'm testing it for 2010 by specifying the 2010 control files (for example, https://github.com/BayAreaMetro/populationsim/blob/master/bay_area/households/configs/settings.yaml#L61).
Our standard practice, though, will be to run it for multiple years, and I don't like the obvious solutions.
I'd prefer to have the settings.yaml file contain something like
model_year: 2010
and then
- tablename: MAZ_control_data
filename : %model_year%_mazData.csv
How do you recommend folks handle this?
Thank you!
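One way to approximate the `%model_year%` idea without changing PopulationSim is to expand the placeholders in the settings before the run starts. The sketch below is only an illustration: `expand_tokens` is a hypothetical helper, not a PopulationSim or ActivitySim feature, and the `input_table_list` layout is assumed from the snippet in the question.

```python
import re


def expand_tokens(value, tokens):
    """Recursively replace %name% placeholders in strings using `tokens`."""
    if isinstance(value, str):
        # An unknown placeholder name raises KeyError rather than passing silently.
        return re.sub(r"%(\w+)%", lambda m: str(tokens[m.group(1)]), value)
    if isinstance(value, list):
        return [expand_tokens(item, tokens) for item in value]
    if isinstance(value, dict):
        return {key: expand_tokens(item, tokens) for key, item in value.items()}
    return value


# Settings as they might look after loading settings.yaml (assumed layout).
settings = {
    "model_year": 2010,
    "input_table_list": [
        {"tablename": "MAZ_control_data", "filename": "%model_year%_mazData.csv"},
    ],
}
expanded = expand_tokens(settings, {"model_year": settings["model_year"]})
```

With this approach, only `model_year` would need to change between runs, and the expanded settings could be written back out to a per-year settings file.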
It would be useful for survey weighting to allow user-specified minimum weights in addition to user-specified maximum weights. For example, a minimum weight of 1/4 of the initial weight and a maximum weight of 4 times the initial weight. Currently the minimum weight is hard-coded as 0. Our testing shows that some records end up with a weight of 0 (in part due to integerization), so they are not included in the final data set. This issue is related to #75 as well.
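To make the request concrete, here is a minimal sketch of per-record bounds expressed as ratios of the initial weight. The function names and the `min_ratio`/`max_ratio` parameters are hypothetical, and simple clipping is used only to illustrate the bounds; it is not how the PopulationSim balancer enforces its limits.

```python
import numpy as np


def weight_bounds(initial_weights, min_ratio=0.25, max_ratio=4.0):
    """Return (lower, upper) bounds as multiples of each initial weight."""
    initial = np.asarray(initial_weights, dtype=float)
    return initial * min_ratio, initial * max_ratio


def clip_weights(weights, initial_weights, min_ratio=0.25, max_ratio=4.0):
    """Force each weight into its own [lower, upper] interval."""
    lower, upper = weight_bounds(initial_weights, min_ratio, max_ratio)
    return np.clip(np.asarray(weights, dtype=float), lower, upper)


# Three records with an initial weight of 8: bounds are [2, 32] for each.
bounded = clip_weights([0.0, 50.0, 10.0], [8.0, 8.0, 8.0])
```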
The setup depends on the RSGInc ActivitySim fork that includes an 'inject' module in activitysim.core. This module does not exist in the pypi version of ActivitySim.
After moving beyond #104 and preparing our seed data in a better way, we now have an issue with integerizing the seed weights. Here's the stack trace:
INFO - integerize_final_seed_weights seed id 49003
Welcome to the CBC MILP Solver
Version: 2.10.3
Build Date: Oct 11 2019
command line - cbc -solve -quit (default strategy 1)
Presolve 31 (-6) rows, 335 (-7) columns and 3583 (-6) elements
0 Obj -0 Primal inf 393.88521 (16)
31 Obj -5027.9043 Primal inf 1.0552473 (3)
32 Obj -5027.9043
Optimal - objective value -5027.9043
After Postsolve, objective -5027.9043, infeasibilities - dual 0 (0), primal 0 (0)
Optimal objective -5027.904278 - 32 iterations time 0.002, Presolve 0.00
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
Traceback (most recent call last):
File "run_populationsim.py", line 63, in <module>
pipeline.run(models=steps, resume_after=resume_after)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 594, in run
run_model(model)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 471, in run_model
orca.run([step_name])
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 2034, in run
step()
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 843, in __call__
return self._func(**kwargs)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/integerize_final_seed_weights.py", line 83, in integerize_final_seed_weights
total_hh_control_col=total_hh_control_col
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 308, in do_integerizing
status = integerizer.integerize()
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 187, in integerize
smart_round(int_weights, resid_weights, self.total_hh_control_value)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 38, in smart_round
assert target_sum == int(target_sum)
My guess is that this is an issue with data typing or configuring the weight field appropriately. We are using the basic WGTP field from the ACS PUMS record, and the person file has both the WGTP and PWGTP fields appended. We filtered out households that had WGTP <= 0, but are unsure if something else needs to happen.
The repository is here. We needed to remove the processed seed data from the repo for GitHub's limits (eventually) but the seed files are here
Regression tests unreliable since edge case results depend on exact ortools/cbc version
Currently, the input HDF5 file from the base run for the repop mode is copied to the "output" folder. This should be changed to the "output" folder; the user should copy the HDF5 file to the "output" folder of the repop setup.
We need to support N-number of lower level geographies. For example, in the settings file we have:
geographies: [REGION, PUMA, TRACT, TAZ]
seed_geography: PUMA
#or
geographies: [REGION, PUMA, TRACT, TAZ, MAZ]
seed_geography: PUMA
There can only be one geography in the geographies list before the seed_geography, and the geographies after the seed_geography are lower level. There can be as many lower level geographies as the user wants.
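The rule above can be sketched as a small validation step. `split_geographies` is a hypothetical helper written for illustration and is not taken from the PopulationSim code base.

```python
def split_geographies(geographies, seed_geography):
    """Return (meta_geography, lower_level_geographies) or raise ValueError."""
    if seed_geography not in geographies:
        raise ValueError(f"seed_geography {seed_geography!r} is not in geographies")
    position = geographies.index(seed_geography)
    if position != 1:
        raise ValueError("exactly one geography must precede the seed_geography")
    # Everything after the seed geography is treated as lower level.
    return geographies[0], geographies[position + 1:]


meta, lower_levels = split_geographies(
    ["REGION", "PUMA", "TRACT", "TAZ", "MAZ"], "PUMA"
)
```

Any number of entries may follow the seed geography, which is what allows N lower level geographies.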
We're not planning to do this right now, but we don't want to forget this idea in case it is needed in the future.
In addition to tagging households/persons with T/F membership for attributes, allow for partial membership. For example, instead of flagging whether the HH is size 1, 2, 3, or 4+, specify that the HH is 80% likely size 1, 15% size 2, 5% size 3, and 0% size 4+.
We are considering integrating this library into our work as an option to synthesize a population (plug and play among others) and in doing so contributing to the codebase. Is it possible to put a standard license on the library like MIT or BSD?
Dear All,
I am trying to run PopulationSim, and everything seems to work until I get this message after run_pop reads the files:
RuntimeError: table 'taz_control_data' never checkpointed.
Closing remaining open files:output\pipeline.h5...done
Any help?
Regards,
Issa
We might want to incorporate popsampler functionality into PopulationSim
I'm trying to get an implementation of populationsim up and running. I'm using the populationsim Anaconda environment (on MacOS 10.14.6), with the data files and configuration in this repository. I've been able to grind through many of the errors, but this traceback is something I can't figure out.
Traceback (most recent call last):
File "run_populationsim.py", line 63, in <module>
pipeline.run(models=steps, resume_after=resume_after)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 594, in run
run_model(model)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 471, in run_model
orca.run([step_name])
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 2034, in run
step()
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 843, in __call__
return self._func(**kwargs)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/setup_data_structures.py", line 340, in setup_data_structures
= build_grouped_incidence_table(incidence_table, control_spec, seed_geography)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/setup_data_structures.py", line 230, in build_grouped_incidence_table
how='left').group_id.astype(int).values
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/generic.py", line 5882, in astype
dtype=dtype, copy=copy, errors=errors, **kwargs
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 581, in astype
return self.apply("astype", dtype=dtype, **kwargs)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 438, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 559, in astype
return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 643, in _astype
values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 700, in astype_nansafe
"Cannot convert non-finite values (NA or inf) to " "integer"
ValueError: Cannot convert non-finite values (NA or inf) to integer
Closing remaining open files:output/pipeline.h5...done
It's not clear if these NA values are coming from the controls (unlikely) or the seed table (I suppose very likely) or the geographic crosswalk, and if so from which column. Am I on the right track, or is it a different issue entirely?
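The traceback shows a left merge followed by `group_id.astype(int)`, so the NA values could either already be present in an input table or be created by rows that find no match in the merge. A minimal pandas sketch for checking both cases follows; the helper names, table names, and column names are hypothetical, chosen only for illustration.

```python
import pandas as pd


def report_missing(tables):
    """Return {table_name: {column: n_missing}} for columns that contain NA."""
    report = {}
    for name, frame in tables.items():
        counts = frame.isna().sum()
        counts = counts[counts > 0]
        if not counts.empty:
            report[name] = {column: int(n) for column, n in counts.items()}
    return report


def unmatched_rows(left, right, on):
    """Return rows of `left` that have no partner in `right` in a left merge."""
    merged = left.merge(right, on=on, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy tables: one seed household lacks a PUMA, and PUMA 200 has no group.
seed = pd.DataFrame({"hh_id": [1, 2, 3], "PUMA": [100, 100, None]})
households = pd.DataFrame({"hh_id": [1, 2], "PUMA": [100, 200]})
groups = pd.DataFrame({"PUMA": [100], "group_id": [0]})

missing = report_missing({"seed": seed})
orphans = unmatched_rows(households, groups, on="PUMA")
```

Running checks like these on the seed, control, and crosswalk tables before the pipeline starts should show which table and column the NA values come from.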
The zipped install setup described in the documentation can be tricky to update. I think we should improve our recommended installation procedure. Maybe we revert to simply requiring the user to install Anaconda and then providing an install script that they can run.
@binnympaul - For the survey weighting application, we need to switch off integerization in PopulationSim. My understanding is that if the integerization model steps are not run, then the PopulationSim algorithm works with float weights only. The seed sample should not be expanded since the weights have not been integerized. Therefore, a final synthetic population will not be produced. So, how do we configure PopulationSim to export an intermediate, un-expanded seed table with float weights?
The summary output tables that have the control, result and diffs are really helpful to visualize. We have configured controls for MAZs, TAZs and COUNTY. Is there a way to get a summary_COUNTY table output? It looks like it's considered a "meta_geography" rather than a "sub_geography", presumably because it's bigger than the PUMA; but I would think getting this output is still doable? Thank you!
In the example below, the zone IDs in the control data file are stored as floats even though they are whole numbers (e.g., 5.0 instead of 5). This is fine for some of the steps, but is a problem for expand_households, specifically this line. We should fix this, perhaps by checking for it on input and/or casting the zone IDs to int. I confirmed that expand_households works if I remove every ".0" from the file.
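A minimal sketch of the proposed input check, assuming the control data is held in a pandas DataFrame. `zone_ids_to_int` is a hypothetical helper and the `TAZ` and `HHBASE` column names are placeholders.

```python
import pandas as pd


def zone_ids_to_int(zone_ids):
    """Cast whole-number float zone IDs (5.0) to integers (5)."""
    if zone_ids.isna().any():
        raise ValueError("zone IDs contain missing values")
    as_float = zone_ids.astype(float)
    # Refuse to truncate silently: 5.5 is an error, not zone 5.
    if not (as_float == as_float.round()).all():
        raise ValueError("zone IDs contain non-whole numbers")
    return as_float.astype("int64")


control_data = pd.DataFrame({"TAZ": [5.0, 12.0, 141.0], "HHBASE": [10, 20, 30]})
control_data["TAZ"] = zone_ids_to_int(control_data["TAZ"])
```

Raising on fractional or missing IDs, rather than casting unconditionally, keeps a malformed file from being quietly mapped to the wrong zones.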
@binnympaul "Looks like the new release of ortools (7.5.7466, released Jan 28) is breaking populationsim. I’m trying a run after installing the previous version of ortools (7.4.7247) and PopulationSim seems to be running beyond the error point."
We just had a run fail, and we think it's because we put tabs in the YAML settings file.
Can a line or two be added to the user guide for the settings file on important considerations when editing YAML files? In this case it seems like the user can only use spaces for indentation - is that correct?
Are there any other important aspects of updating the YAML file that the user needs to be aware of?
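YAML does not allow tab characters for indentation, so a quick pre-run check of the settings file can catch this before the parser fails. The sketch below uses only the standard library; `find_tab_indented_lines` is a hypothetical helper, not part of PopulationSim.

```python
def find_tab_indented_lines(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


settings_text = "input_table_list:\n\t- tablename: households\n  - tablename: persons\n"
tab_lines = find_tab_indented_lines(settings_text)
```

In practice the text would come from reading settings.yaml, and any reported line numbers point at the lines to re-indent with spaces.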
The inputs pre-processor reads each input table, runs pandas expressions (*_expressions.csv) against the table to create additional required table fields, and saves the tables to the datastore. For example, it processes raw Census tables to create the required fields for population synthesis. The inputs pre-processor exposes all input tables to the expressions calculator so tables can be joined (such as households to persons). It reads the geographic crosswalk file in order to join meta, mid, and low level zone tables if needed. The format of the expressions file follows ActivitySim, as shown in the example below. The seed_households expressions file below operates on the seed_households input file and processes the NPF field to create the FAMTAG field, which is then used by PopulationSim in later steps.
Description | Target | Expression |
---|---|---|
HH is a family | FAMTAG | pd.notnull( NPF ) * 1 |
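
To show what the FAMTAG row computes, the sketch below evaluates the expression against a small table with plain pandas. `apply_expressions` is a simplified, hypothetical stand-in written for illustration; it is not the ActivitySim expressions calculator, and the sample NPF values are made up.

```python
import pandas as pd


def apply_expressions(table, expressions):
    """Evaluate (target, expression) pairs with table columns as variables."""
    result = table.copy()
    for target, expression in expressions:
        # Expose each column by name, so later rows can use earlier targets.
        columns = {name: result[name] for name in result.columns}
        result[target] = eval(expression, {"pd": pd}, columns)
    return result


seed_households = pd.DataFrame({"NPF": [3.0, None, 2.0]})
tagged = apply_expressions(seed_households, [("FAMTAG", "pd.notnull( NPF ) * 1")])
```

Households with a missing NPF receive FAMTAG 0 and all others receive 1, because multiplying the boolean result by 1 converts it to an integer flag.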
Need to make sure PopulationSim works for Python 3 (and update all related materials as well). Updating ActivitySim to work for both 2 and 3 wasn't a big deal, so updating PopulationSim should be relatively straightforward.
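# Script adds a major-university column to the persons file and fixes the null value in the Standard Occupation Classification (SOC) column.
x <- read.csv("persons.csv", as.is = TRUE)
# Replace the text "NUL" with 0 so that soc can be converted to numeric.
x$soc <- as.numeric(gsub("NUL", 0, x$soc))
# Add the major-university column, initialized to 0.
x$majoruni <- 0
# Sort by columns 30 and 10; majoruni (column 31, assuming 30 input columns) moves ahead of columns 24-30.
x <- x[order(x[, 30], x[, 10]), c(1:23, 31, 24:30)]
# Renumber person IDs sequentially after sorting.
x$PERID <- 1:nrow(x)
write.csv(x, "persons_sorted_uni.csv", row.names = FALSE)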
I think one piece of low-hanging fruit with regard to error messaging would be better reporting of which zone is being worked on at the time of the failure. This could be handled either by more consistently writing to the screen which step and which zone are being worked on, or by building better error messaging that reports the step and zone in which the process died.
Also, maybe plot the variance around the mean instead of around the zero axis, since it can be confusing when the mean appears beyond the variance range.
Completing a populationsim run throws these errors/warnings before then finishing successfully:
ERROR- Integerizer failed for COUNTY_5_MAZ_418156 status INFEASIBLE. Returning smart-rounded original weights.
WARNING- do_simul_integerizing failed for COUNTY_5 status INFEASIBLE.
These are returned for many of our MAZ/TAZ/COUNTY geographies. Are these simply a limitation of the algorithms that will have populationsim perform in a less sophisticated way, or could it mean something about our controls etc. for those geographies? A couple of our variables aren't coming out great, so wondering if there is a relationship here. Thanks! @lmz
The control_field column in the controls.csv file must be unique. To use the same control value for two expressions, duplicate the column and give the copy a different name.
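A quick way to check this before a run, sketched with pandas. `duplicated_control_fields` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import pandas as pd


def duplicated_control_fields(control_spec):
    """Return the control_field values that appear more than once."""
    counts = control_spec["control_field"].value_counts()
    return sorted(counts[counts > 1].index)


# Two rows reuse HHBASE, which would violate the uniqueness requirement.
control_spec = pd.DataFrame(
    {
        "target": ["num_hh", "hh_size_1", "population"],
        "control_field": ["HHBASE", "HHBASE", "POP"],
    }
)
duplicates = duplicated_control_fields(control_spec)
```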
Currently, summary tables (e.g., summary_PUMA.csv) are not produced for implementations with only seed-level controls.
PopulationSim's input_pre_processor sometimes fails to read CSV files with certain Windows encodings using the default utf-8 decoder.
INFO - Reading csv file data\seed_households.csv
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1151, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8: invalid start byte
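One possible workaround is to retry the read with a Windows code page when UTF-8 decoding fails. The sketch below is an assumption-laden illustration: `read_csv_with_fallback` is a hypothetical helper rather than part of input_pre_processor, and cp1252 is only a guess at the file's true encoding (byte 0xa0 is a non-breaking space in that code page).

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252"), **kwargs):
    """Try each encoding in turn and return the first successful parse."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding, **kwargs)
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next one
    raise last_error


# Example file containing byte 0xa0, which is not valid UTF-8 in this position.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "seed_households.csv")
    with open(csv_path, "wb") as handle:
        handle.write(b"hh_id,label\n1,foo\xa0bar\n")
    households = read_csv_with_fallback(csv_path)
```

Trying UTF-8 first keeps correctly encoded files unchanged, and the fallback only applies when decoding actually fails.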
The following call results in an error: python run_populationsim.py --config configs
Traceback (most recent call last):
File "run_populationsim.py", line 63, in <module>
pipeline.run(models=steps, resume_after=resume_after)
File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\pipeline.py", line 594, in run
run_model(model)
File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\pipeline.py", line 471, in run_model
orca.run([step_name])
File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\orca.py", line 2034, in run
step()
File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\orca.py", line 843, in __call__
return self._func(**kwargs)
File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\populationsim\steps\setup_data_structures.py", line 317, in setup_data_structures
control_spec = read_control_spec(setting('control_file_name', 'controls.csv'), configs_dir)
File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\populationsim\steps\setup_data_structures.py", line 29, in read_control_spec
data_file_path = os.path.join(configs_dir, data_filename)
File "C:\Users\binny.paul\Documents\Anaconda3\lib\ntpath.py", line 76, in join
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not list
now that they have migrated to Ubuntu Trusty
The user should not be required to explicitly specify the expanded_households or expanded_persons file in the output_table settings in the settings.YAML file. These should be written out regardless.
assert failing in do_integerizing for backstopped controls for TAZ 141 seed 900
assert len(incidence_table.index) > 1
We've successfully reweighted a transit on-board survey with the software as well. Let's add that example to the repo and include it in the user documentation.
PopulationSim errors out for zero household MAZs within a zero-household TAZ.
Hard-coded field name for household ID is being used in summarize.py.
This forces users to name the unique household ID as 'hh_id' in seed_households file. The code must get the household_id_col name from the settings.yaml file.
I'm running a test of the data located here:
https://github.com/RSGInc/populationsim/tree/master/example_calm/data
I've had to guess at a lot of the field names/meanings. Here are the minimum things that I need to know that I can't make up:
I made up other things, such as the income group and age group ranges. The actual ranges would be nice to have, but are not required. I've also assumed that "NP" in the household table is "number of persons" (or household size).
Thanks!
More docstrings are needed for building the documentation.
PopulationSim is currently running in one process/thread. Eventually we want to multi-process/thread in order to improve runtimes, especially for larger / more complicated setups.
The 'target' field in the controls.csv file must be the same as the 'control_field' value. Both must match the value of the 'total_hh_control' token in the settings.YAML file.
If the 'target' field is set to something else, a KeyError results.
Here's the initial design from @jfdman
As far as the single geography feature, here is what I propose to do. The user would specify the controls for a subset of geographies – usually just a few zones, along with a pre-created synthetic population. The controls would have to be specified for the lowest level geographies that the previous synthetic population was created for – the reason is obvious, if you don’t specify the lowest geography of the existing population you can’t use the output in the model. The controls do not have to be consistent with the controls used to generate the original synthetic population. For example, you could use some combination of housing type and number of bathrooms per unit as a control even if neither was used as a control in the original population. There would only be single-level controls – global controls cannot be specified since there is no guarantee that the selected geographies add up to a global geography, and because this simply isn’t a requirement of the use case. The user would also have the ability to specify whether to over-write the existing population in the selected geographies or add to the existing population.
Once the tool is set up, it would read in the existing population. It would synthesize the households in the specified geographies by first weighting the PUMS data to match the controls, then sequentially integerizing the weights for each geography. Simultaneous integerization isn’t necessary since there are no global controls. The households/persons in the existing population outside the selected area would be unaffected. The output file would contain the population for the entire region, with the synthetic population in the selected geography either replacing the existing population in that geography or adding to it.
Integer controls behave better than floating-point controls and so we want the user to be aware of this.
@jfdman - maybe we should have a switch in the control file (roundControls=false), which would be set to false by default and throw an informative error message if the controls are not integers, but if set to true, would just throw a warning and round to the nearest integer before proceeding. We also need to update the wiki.
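The proposed switch could behave roughly as sketched below. This is not existing PopulationSim behavior: `check_integer_controls` and its `round_controls` argument are hypothetical names standing in for the suggested roundControls setting, and the sketch assumes the controls contain no missing values.

```python
import warnings

import numpy as np
import pandas as pd


def check_integer_controls(controls, round_controls=False):
    """Return controls as integers, or raise/warn when values are fractional."""
    values = controls.to_numpy(dtype=float)
    fractional = values != np.round(values)
    if fractional.any():
        message = f"{int(fractional.sum())} control value(s) are not integers"
        if not round_controls:
            raise ValueError(message + "; set round_controls=True to round them")
        warnings.warn(message + "; rounding to the nearest integer")
    # DataFrame.round rounds exact halves to the nearest even integer.
    return controls.round().astype("int64")


controls = pd.DataFrame({"HHBASE": [120.0, 80.4], "POP": [300.0, 199.6]})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the example output quiet
    rounded = check_integer_controls(controls, round_controls=True)
```

With the default `round_controls=False`, the same table raises an error that reports how many control values are fractional.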