populationsim's Introduction

PopulationSim

PopulationSim is an open platform for population synthesis. It emerged from Oregon DOT's desire to build a shared, open platform that could be easily adapted for statewide, regional, and urban transportation planning needs. PopulationSim is implemented in the ActivitySim framework.

Documentation

https://activitysim.github.io/populationsim/

populationsim's People

Contributors

bettinardi, binnympaul, bstabler, jamiecook, jfdman, johnklawlor, toliwaga

populationsim's Issues

Integerizing warning/error messages during run

Completing a populationsim run throws these errors/warnings before then finishing successfully:
ERROR- Integerizer failed for COUNTY_5_MAZ_418156 status INFEASIBLE. Returning smart-rounded original weights.
WARNING- do_simul_integerizing failed for COUNTY_5 status INFEASIBLE.

These are returned for many of our MAZ/TAZ/COUNTY geographies. Are these simply a limitation of the algorithms that will have populationsim perform in a less sophisticated way, or could it mean something about our controls etc. for those geographies? A couple of our variables aren't coming out great, so wondering if there is a relationship here. Thanks! @lmz

pandas.read_csv sometimes fails with default utf-8 encoding

PopulationSim's input_pre_processor sometimes fails to read CSV files with certain Windows encodings using the default utf-8 decoder.

INFO - Reading csv file data\seed_households.csv
Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1151, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8: invalid start byte
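
One possible workaround, sketched here under assumptions, is to decode the raw bytes with a fallback encoding before parsing. Byte 0xa0 is a non-breaking space in Windows-1252, so cp1252 is a plausible second choice. The helper name decode_with_fallback and the sample bytes are hypothetical and are not part of PopulationSim.

```python
import csv
import io


def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return raw bytes decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("none of the encodings %r could decode the data" % (encodings,))


# Hypothetical seed file contents with a Windows-1252 non-breaking space (0xa0).
raw_bytes = b"SERIALNO,NP\n2012\xa0,3\n"
text = decode_with_fallback(raw_bytes)
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

When reading with pandas directly, the analogous option is to pass an explicit encoding argument (for example encoding="cp1252") to pandas.read_csv.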

example_calm needs a little explanation

I'm running a test of the data located here:
https://github.com/RSGInc/populationsim/tree/master/example_calm/data

I've had to guess at a lot of the field names/meanings. Here are the minimum things that I need to know that I can't make up:

  1. Neither the household nor person seed has a TAZ or Tract column (that I can see). I need that to match to the TAZ and tract controls. I don't see how to make use of the geo equivalency file, either.
  2. I can't determine which seed field gives me the number of household workers. On the off chance that it's the "COW" (for "class of worker"?), then I need to know which categorical values translate to worker before I can summarize by household. To be clear, the tract level controls apply controls by the number of workers in the household, so that's what I'm trying to determine.

Other things I made up include the income group and age group ranges. The actual ranges would be nice to have, but are not required. I've also assumed that "NP" in the household table is "number of persons" (or household size).

Thanks!

add on-board survey weighting example

We've successfully reweighted a transit on-board survey with the software as well. Let's add that example to the repo and include it in the user documentation.

export an intermediate, un-expanded, seed table with float weights

@binnympaul - For the survey weighting application, we need to switch off integerization in PopulationSim. My understanding is that if the integerization model steps are not run, then the PopulationSim algorithm works only with float weights. The seed sample should not be expanded since the weights have not been integerized, so a final synthetic population will not be produced. How do we configure PopulationSim to export an intermediate, un-expanded seed table with float weights?

expand_households fails if zone IDs are floats

In the example below, the zone IDs in the control data file are stored as floats even though they are whole numbers (i.e., they are 5.0 instead of 5). This is fine for some of the steps, but is a problem for expand_households, specifically this line. We should fix this by checking for it on input and/or casting the zone IDs to int. I confirmed expand_households works if I remove all ".0" from the file.
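
A minimal sketch of the proposed input check, assuming zone IDs arrive as strings or numbers. The helper to_int_zone_id is hypothetical; it only illustrates casting whole-number floats to int while rejecting values that are not whole numbers.

```python
def to_int_zone_id(value):
    """Convert a zone ID such as 5, 5.0 or "5.0" to the integer 5."""
    number = float(value)
    if not number.is_integer():
        # Also catches NaN and infinity, for which is_integer() is False.
        raise ValueError("zone ID %r is not a whole number" % (value,))
    return int(number)


zone_ids = ["5.0", 12.0, 7]
print([to_int_zone_id(z) for z in zone_ids])  # [5, 12, 7]
```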

image

Add update geography feature

Here's the initial design from @jfdman

As far as the single geography feature, here is what I propose to do. The user would specify the controls for a subset of geographies – usually just a few zones, along with a pre-created synthetic population. The controls would have to be specified for the lowest level geographies that the previous synthetic population was created for – the reason is obvious, if you don’t specify the lowest geography of the existing population you can’t use the output in the model. The controls do not have to be consistent with the controls used to generate the original synthetic population. For example, you could use some combination of housing type and number of bathrooms per unit as a control even if neither was used as a control in the original population. There would only be single-level controls – global controls cannot be specified since there is no guarantee that the selected geographies add up to a global geography, and because this simply isn’t a requirement of the use case. The user would also have the ability to specify whether to over-write the existing population in the selected geographies or add to the existing population.

Once the tool is set up, it would read in the existing population. It would synthesize the households in the specified geographies by first weighting the PUMS data to match the controls, then sequentially integerizing the weights for each geography. Simultaneous integerization isn’t necessary since there are no global controls. The households\persons in the existing population outside the selected area would be unaffected. The output file would contain the population for the entire region, with the synthetic population in the selected geography either replacing the existing population in that geography or adding to it.

Cannot convert non-finite values to integer

I'm trying to get an implementation of populationsim up and running. I'm using the populationsim Anaconda environment (on MacOS 10.14.6), with the data files and configuration in this repository. I've been able to grind through many of the errors, but this traceback is something I can't figure out.

Traceback (most recent call last):
  File "run_populationsim.py", line 63, in <module>
    pipeline.run(models=steps, resume_after=resume_after)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 594, in run
    run_model(model)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 471, in run_model
    orca.run([step_name])
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 2034, in run
    step()
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 843, in __call__
    return self._func(**kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/setup_data_structures.py", line 340, in setup_data_structures
    = build_grouped_incidence_table(incidence_table, control_spec, seed_geography)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/setup_data_structures.py", line 230, in build_grouped_incidence_table
    how='left').group_id.astype(int).values
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/generic.py", line 5882, in astype
    dtype=dtype, copy=copy, errors=errors, **kwargs
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 581, in astype
    return self.apply("astype", dtype=dtype, **kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 438, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 559, in astype
    return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 643, in _astype
    values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 700, in astype_nansafe
    "Cannot convert non-finite values (NA or inf) to " "integer"
ValueError: Cannot convert non-finite values (NA or inf) to integer
Closing remaining open files:output/pipeline.h5...done

It's not clear if these NA values are coming from the controls (unlikely) or the seed table (I suppose very likely) or the geographic crosswalk, and if so from which column. Am I on the right track, or is it a different issue entirely?
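
The failing line in the traceback casts group_id to int right after a left merge, and a left join leaves missing values for rows whose key has no match. One way to narrow this down, sketched here with plain dictionaries, is to list the key values present in one table but absent from the other. The function name, the PUMA key, and the sample rows are assumptions for illustration only.

```python
def unmatched_keys(left_rows, right_rows, key):
    """Return the sorted key values in left_rows with no match in right_rows."""
    right_keys = {row[key] for row in right_rows}
    return sorted({row[key] for row in left_rows if row[key] not in right_keys})


# Hypothetical seed households and crosswalk rows keyed on PUMA.
seed_households = [
    {"hh_id": 1, "PUMA": 49001},
    {"hh_id": 2, "PUMA": 49003},
    {"hh_id": 3, "PUMA": 49005},
]
crosswalk = [
    {"PUMA": 49001, "TRACT": 101},
    {"PUMA": 49003, "TRACT": 102},
]
print(unmatched_keys(seed_households, crosswalk, "PUMA"))  # [49005]
```

With pandas, DataFrame.merge accepts indicator=True, which adds a _merge column marking rows that matched on only one side.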

Recommended way to configure model year?

Hi there - I've been playing with PopulationSim for our use (https://github.com/BayAreaMetro/populationsim) and so far, I'm testing it for 2010 by specifying the 2010 control files (for example, https://github.com/BayAreaMetro/populationsim/blob/master/bay_area/households/configs/settings.yaml#L61)

Our standard practice will be to run it for multiple years, but I don't like the obvious solutions:

  1. having duplicates of the config that look very similar with just the year changed
  2. having a single config but copying/moving files around at runtime

I'd prefer having the settings.yaml file contain something like

model_year: 2010

and then

  - tablename: MAZ_control_data
    filename : %model_year%_mazData.csv

How do you recommend folks handle this?
Thank you!

Make geographies flexible

We need to support any number of lower-level geographies. For example, in the settings file we have:

geographies: [REGION, PUMA, TRACT, TAZ]
seed_geography: PUMA

#or

geographies: [REGION, PUMA, TRACT, TAZ, MAZ]
seed_geography: PUMA

Exactly one geography may precede seed_geography in the geographies list, and the geographies after seed_geography are lower-level. There can be as many lower-level geographies as the user wants.
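
The ordering rule can be expressed as a small validation function. This is a sketch under the stated rule only; split_geographies is a hypothetical name, not a function from the codebase.

```python
def split_geographies(geographies, seed_geography):
    """Split the list into (meta, seed, lower-level geographies)."""
    if seed_geography not in geographies:
        raise ValueError("seed_geography %r is not in geographies" % seed_geography)
    seed_index = geographies.index(seed_geography)
    if seed_index != 1:
        raise ValueError("exactly one geography must precede the seed geography")
    # Everything after the seed geography is treated as lower-level.
    return geographies[0], seed_geography, geographies[2:]


print(split_geographies(["REGION", "PUMA", "TRACT", "TAZ", "MAZ"], "PUMA"))
# ('REGION', 'PUMA', ['TRACT', 'TAZ', 'MAZ'])
```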

Python 3 support

Need to make sure PopulationSim works for Python 3 (and update all related materials as well). Updating ActivitySim to work for both 2 and 3 wasn't a big deal, so updating PopulationSim should be relatively straightforward.

create inputs pre-processor

The inputs pre-processor reads each input table, runs pandas expressions (*_expressions.csv) against the table to create additional required table fields, and saves the tables to the datastore. For example, it processes raw Census tables to create the required fields for population synthesis. The inputs pre-processor exposes all input tables to the expressions calculator so tables can be joined (households to persons, for example). It reads the geographic crosswalk file in order to join meta, mid, and low level zone tables if needed. The format of the expressions file follows ActivitySim, as shown in the example below. The seed_households expressions file below operates on the seed_households input file and processes the NPF field to create the FAMTAG field, which is then used by PopulationSim in later steps.

Description    | Target | Expression
HH is a family | FAMTAG | pd.notnull( NPF ) * 1

add mtc popsim example

Consistent with the ActivitySim example, I think it makes sense to add the MTC TM1 PopSim setup as an additional example to this repo. @lmz?

YAML note in user guide

We just had a run fail and we think it's because we put tabs in the YAML settings file.
Can a line or two be added to the user guide for the settings file on important considerations when editing YAML files? In this case it seems like the user can only use spaces. Is that correct?

Are there any other important aspects of updating the YAML file that the user needs to be aware of?
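
The YAML specification does not allow tab characters for indentation, so a quick pre-flight check can flag the offending lines before a run. The sketch below uses only the standard library; tab_indented_lines and the sample settings text are hypothetical.

```python
def tab_indented_lines(text):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


sample = (
    "geographies: [REGION, PUMA]\n"
    "input_table_list:\n"
    "\t- tablename: households\n"
    "  - tablename: persons\n"
)
print(tab_indented_lines(sample))  # [3]
```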

cylp doesn't easily install for 64bit Anaconda on Windows

We're using the GLPK_MI solver from cylp for the simultaneous integerizer since it is the most robust and stable option for cvxpy. Unfortunately, it doesn't appear to be easily installed for 64bit Anaconda on Windows. @toliwaga and I searched for and tested a few ideas, but to no avail. We also tested all the other available cvxpy solvers, but nothing seems to work (see the table below). At this point, we're thinking it might make the most sense to implement the simultaneous integerizer in ortools since it works on Windows, is fast, and @toliwaga is familiar with it. We're working on a solution.

Solver    | LP | Example CALM                                     | Example TEST
CBC       | X  | No easy install for Anaconda 64bit Windows?      | No easy install for Anaconda 64bit Windows?
GLPK      | X  | No easy install for Anaconda 64bit Windows?      | No easy install for Anaconda 64bit Windows?
GLPK_MI   | X  | No easy install for Anaconda 64bit Windows?      | No easy install for Anaconda 64bit Windows?
Elemental | X  | No Windows version                               | No Windows version
ECOS      | X  | tried and failed                                 | Integerizer works but simul_integerize fails
ECOS_BB   | X  | tried and failed                                 | Integerizer works but simul_integerize fails
GUROBI    | X  | Commercial license required                      | Commercial license required
MOSEK     | X  | Commercial license required                      | Commercial license required
XPRESS    | X  | Commercial license required                      | Commercial license required
CVXOPT    | X  | tried and failed                                 | tried and failed
SCS       | X  | fails, hits max iters even with 100,000 iters set | fails, hits max iters even with 100,000 iters set
LS        | X  | cannot solve                                     | cannot solve

improve installation process

The zipped install setup described in the documentation can be tricky to update. I think we should improve our recommended installation procedure. Maybe we revert to simply requiring the user to install Anaconda and then providing an install script that they can run.

Better Error Messaging

I think one piece of low-hanging fruit for error messaging would be better reporting of which zone is being worked on at the time of the failure. This could be handled either by more consistently writing to the screen which step and zone are being worked on, or by building better error messages that report the step and zone in which the process died.
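
The second option could look roughly like the sketch below, which wraps per-zone work and re-raises any failure with the step and zone attached. All names here (run_for_zones, fake_balancer, the step label) are hypothetical stand-ins, not PopulationSim code.

```python
def run_for_zones(step_name, zone_ids, process_zone):
    """Call process_zone for each zone, adding step and zone to any failure."""
    results = {}
    for zone_id in zone_ids:
        try:
            results[zone_id] = process_zone(zone_id)
        except Exception as exc:
            message = "step %r failed for zone %r: %s" % (step_name, zone_id, exc)
            # Chain the original exception so its traceback is preserved.
            raise RuntimeError(message) from exc
    return results


def fake_balancer(zone_id):
    # Hypothetical stand-in for per-zone work; zone 3 fails on purpose.
    if zone_id == 3:
        raise ValueError("controls sum to zero")
    return zone_id * 10


try:
    run_for_zones("sub_balancing", [1, 2, 3], fake_balancer)
except RuntimeError as err:
    print(err)  # step 'sub_balancing' failed for zone 3: controls sum to zero
```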

support proportional membership for weighting

We're not planning to do this right now, but we don't want to forget this idea for the future if needed.

In addition to tagging households/persons with true/false membership for attributes, allow for partial membership. For example, instead of tagging a household as size 1, 2, 3, or 4+, specify that the household is 80% likely size 1, 15% size 2, 5% size 3, and 0% size 4+.
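
To illustrate the arithmetic, the sketch below computes weighted category totals when each household carries membership shares rather than 0/1 flags. It is only an illustration of the idea; weighted_totals, the category names, and the sample weights are assumptions, and nothing here reflects how PopulationSim stores its incidence table.

```python
def weighted_totals(weights, memberships):
    """Sum weight * membership share for each category across households."""
    totals = {}
    for weight, shares in zip(weights, memberships):
        for category, share in shares.items():
            totals[category] = totals.get(category, 0.0) + weight * share
    return totals


# Two hypothetical households: one with partial membership, one with 0/1 membership.
weights = [10.0, 20.0]
memberships = [
    {"size_1": 0.80, "size_2": 0.15, "size_3": 0.05, "size_4plus": 0.0},
    {"size_1": 0.0, "size_2": 1.0, "size_3": 0.0, "size_4plus": 0.0},
]
print(weighted_totals(weights, memberships))
```

Because each household's shares sum to 1, the category totals still add up to the total weight.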


Parallel processing

PopulationSim is currently running in one process/thread. Eventually we want to multi-process/thread in order to improve runtimes, especially for larger / more complicated setups.

floating-point versus integer controls

Integer controls behave better than floating-point controls and so we want the user to be aware of this.

@jfdman - maybe we should have a switch in the control file (roundControls=false), which would be set to false by default and throw an informative error message if the controls are not integers, but if set to true, would just throw a warning and round to the nearest integer before proceeding. We also need to update the wiki.
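
A rough sketch of how such a switch might behave is shown below. The function check_controls and its round_controls argument are hypothetical and only mirror the proposal: error by default, warn and round when enabled.

```python
import warnings


def check_controls(values, round_controls=False):
    """Return integer controls, rounding non-integers only if round_controls is set."""
    non_integer = [v for v in values if not float(v).is_integer()]
    if non_integer and not round_controls:
        raise ValueError("non-integer controls found: %r" % (non_integer,))
    if non_integer:
        warnings.warn("rounding %d non-integer control(s)" % len(non_integer))
    # Note: Python's round() sends exact halves to the nearest even integer.
    return [int(round(float(v))) for v in values]


print(check_controls([100, 250.0]))  # [100, 250]
print(check_controls([10.4, 3.6], round_controls=True))  # [10, 4], with a warning
```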

specify minimum weights

It would be useful for survey weighting to allow for user specified minimum weights in addition to user specified maximum weights. For example, a min weight of 1/4 * the initial weight and a max weight of 4 * the initial weight. Currently the minimum weight is hard coded as 0. Our testing shows that some records end up with a weight of 0 (in part due to integerization) and so they are not included in the final data set. This issue is related to #75 as well.
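
The bounds in the example can be written out as below. This sketch only shows how the limits would be derived from the initial weight; clamp_weight and its factor arguments are hypothetical, and a real implementation would presumably enforce the bounds inside the balancing iterations rather than clamping afterwards, since clamping after the fact would disturb the fitted control totals.

```python
def clamp_weight(weight, initial_weight, min_factor=0.25, max_factor=4.0):
    """Limit weight to [min_factor, max_factor] times the initial weight."""
    lower = initial_weight * min_factor
    upper = initial_weight * max_factor
    return min(max(weight, lower), upper)


initial = 8.0
print(clamp_weight(0.0, initial))   # 2.0, raised to the minimum
print(clamp_weight(50.0, initial))  # 32.0, capped at the maximum
print(clamp_weight(10.0, initial))  # 10.0, already within bounds
```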

Fail to integerize seed weights

After moving beyond #104 and preparing our seed data in a better way, we now have an issue with integerizing the seed weights. Here's the stack trace:

INFO - integerize_final_seed_weights seed id 49003
Welcome to the CBC MILP Solver
Version: 2.10.3
Build Date: Oct 11 2019

command line - cbc -solve -quit (default strategy 1)
Presolve 31 (-6) rows, 335 (-7) columns and 3583 (-6) elements
0  Obj -0 Primal inf 393.88521 (16)
31  Obj -5027.9043 Primal inf 1.0552473 (3)
32  Obj -5027.9043
Optimal - objective value -5027.9043
After Postsolve, objective -5027.9043, infeasibilities - dual 0 (0), primal 0 (0)
Optimal objective -5027.904278 - 32 iterations time 0.002, Presolve 0.00
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.00
Traceback (most recent call last):
  File "run_populationsim.py", line 63, in <module>
    pipeline.run(models=steps, resume_after=resume_after)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 594, in run
    run_model(model)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/pipeline.py", line 471, in run_model
    orca.run([step_name])
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 2034, in run
    step()
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/activitysim/core/orca.py", line 843, in __call__
    return self._func(**kwargs)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/steps/integerize_final_seed_weights.py", line 83, in integerize_final_seed_weights
    total_hh_control_col=total_hh_control_col
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 308, in do_integerizing
    status = integerizer.integerize()
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 187, in integerize
    smart_round(int_weights, resid_weights, self.total_hh_control_value)
  File "/Users/gregmacfarlane/opt/anaconda3/envs/popsim/lib/python3.7/site-packages/populationsim/integerizer.py", line 38, in smart_round
    assert target_sum == int(target_sum)

My guess is that this is an issue with data typing or configuring the weight field appropriately. We are using the basic WGTP field from the ACS PUMS record, and the person file has both the WGTP and PWGTP fields appended. We filtered households out that had WGTP <= 0, but are unsure if something else needs to happen.

The repository is here. We needed to remove the processed seed data from the repo for GitHub's limits (eventually) but the seed files are here

runtime arguments causing error

The following call results in an error: python run_populationsim.py --config configs

Traceback (most recent call last):
  File "run_populationsim.py", line 63, in <module>
    pipeline.run(models=steps, resume_after=resume_after)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\pipeline.py", line 594, in run
    run_model(model)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\pipeline.py", line 471, in run_model
    orca.run([step_name])
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\orca.py", line 2034, in run
    step()
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\activitysim\core\orca.py", line 843, in __call__
    return self._func(**kwargs)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\populationsim\steps\setup_data_structures.py", line 317, in setup_data_structures
    control_spec = read_control_spec(setting('control_file_name', 'controls.csv'), configs_dir)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\site-packages\populationsim\steps\setup_data_structures.py", line 29, in read_control_spec
    data_file_path = os.path.join(configs_dir, data_filename)
  File "C:\Users\binny.paul\Documents\Anaconda3\lib\ntpath.py", line 76, in join
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not list
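
The TypeError shows that os.path.join received a list where it expected a single path, which suggests the --config argument reaches read_control_spec as a list of directories. One defensive approach, sketched below, is to normalize the value before joining. The helper resolve_config_file is hypothetical, and picking the first directory is an assumption about the intended behavior.

```python
import os


def resolve_config_file(configs_dir, filename):
    """Join filename onto configs_dir, accepting a single path or a list of paths."""
    if isinstance(configs_dir, (list, tuple)):
        if not configs_dir:
            raise ValueError("configs_dir is empty")
        configs_dir = configs_dir[0]  # assumption: the first directory wins
    return os.path.join(configs_dir, filename)


print(resolve_config_file(["configs"], "controls.csv"))
print(resolve_config_file("configs", "controls.csv"))
```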

Standard license

We are considering integrating this library into our work as an option to synthesize a population (plug and play among others) and in doing so contributing to the codebase. Is it possible to put a standard license on the library like MIT or BSD?

summary_COUNTY output table?

The summary output tables that have the control, result and diffs are really helpful to visualize. We have configured controls for MAZs, TAZs and COUNTY. Is there a way to get a summary_COUNTY table output? It looks like it's considered a "meta_geography" rather than a "sub_geography", presumably because it's bigger than the PUMA; but I would think getting this output is still doable? Thank you!

Add major university column to persons file

@DDudich

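# Script adds a major university column to the persons file and replaces "NUL" values in the Standard Occupation Classification (SOC) column.

x <- read.csv("persons.csv", as.is = TRUE)
# Replace the "NUL" placeholder with 0 and convert soc to numeric
x$soc <- as.numeric(gsub("NUL", 0, x$soc))
# Add the major university flag, defaulting to 0
x$majoruni <- 0
# Sort rows by columns 30 and 10, then reorder columns so that column 31 follows column 23
x <- x[order(x[, 30], x[, 10]), c(1:23, 31, 24:30)]
# Renumber person IDs sequentially
x$PERID <- 1:nrow(x)
write.csv(x, "persons_sorted_uni.csv", row.names = FALSE)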
