policyengine / openfisca-us-data
Python package to standardise loading input datasets to OpenFisca-US.
I see that openfisca-us-data {dataset} remove {year} works, but some users might want to start entirely fresh.
Should take as args:
Then does a groupby and checks that it's one per ID
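A minimal sketch of that groupby check, assuming a person-level pandas DataFrame. The argument list is not given above, so the function name, its parameters, and the toy columns are all hypothetical:

```python
import pandas as pd


def is_unique_per_id(df: pd.DataFrame, id_col: str, value_col: str) -> bool:
    """Return True if value_col has exactly one distinct value per id_col group.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Count distinct values (treating NaN as a value) within each ID group.
    distinct_counts = df.groupby(id_col)[value_col].nunique(dropna=False)
    return bool((distinct_counts == 1).all())


# Toy person-level table: "state" is constant within a household, "age" is not.
person = pd.DataFrame(
    {
        "household_id": [1, 1, 2, 2],
        "state": ["CA", "CA", "NY", "NY"],
        "age": [30, 5, 40, 41],
    }
)

print(is_unique_per_id(person, "household_id", "state"))
print(is_unique_per_id(person, "household_id", "age"))
```

A variable that passes this check could safely be collapsed to one row per ID; one that fails would need an explicit aggregation rule.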
All remaining e00000-type variables should be input at the tax unit level.
Would be a step nicer than a better error message (#27)
Tests throw this warning:
/home/runner/work/openfisca-us-data/openfisca-us-data/tests/cps/test_aggregates.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
tc = yaml.load(f)
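One way to silence this warning, sketched here with an in-memory file standing in for the test's YAML file (its contents are made up for illustration), is to use yaml.safe_load, or to pass an explicit Loader to yaml.load:

```python
import io

import yaml

# Stand-in for the opened test file; the keys and values are hypothetical.
f = io.StringIO("cps:\n  employment_income: 1.0\n")

# safe_load only constructs plain Python objects (dicts, lists, scalars),
# so it does not trigger the YAMLLoadWarning.
tc = yaml.safe_load(f)

# Equivalent form that keeps the yaml.load call but names the loader explicitly.
tc_explicit = yaml.load("cps:\n  employment_income: 1.0\n", Loader=yaml.SafeLoader)

print(tc)
```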
Per Fremstad & Paul
The variables in cps.csv.gz that we need openfisca-us-data to replicate are listed in the comment below. Each issue in this epic breaks the project down into a smaller group of variables that can probably be implemented together. Each issue (or variable added) will need a PR to the CPS module, with unit tests that check aggregates are close to expected values.
At this point, we should check that all variables are covered and evaluate any that are outstanding.
Ensure that the structures created by the input dataset (person_id, family_id, etc. and all foreign keys) match the structures described in the taxdata CPS file.
Some household, some person
The relevant variables are:
This should be de-prioritised since we're aiming to model many of these anyway.
With an example
As produced in https://github.com/UBICenter/local-child-allowance
This will allow us to map the ACS to local areas, though this will augment the record count substantially, since the mapping is per PUMA-geo.
This ensures that we use all the information (casting to person-level where necessary) from:
* age
e.g. currently openfisca-us-data raw_cps generate 2021 downloads the "2021 ASEC", which captures activity for calendar year 2020 (survey administered in March 2021). I think we should subtract 1 from all years to align with policy years.
openfisca-us-data/openfisca_us_data/datasets/cps/raw_cps.py
Lines 114 to 118 in e16106c
See this notebook.
!openfisca-us-data raw_cps generate 2021
works, but then !openfisca-us-data cps generate 2021
produces:
/usr/local/lib/python3.7/dist-packages/openfisca_core/parameters/config.py:17: LibYAMLWarning: libyaml is not installed in your environment. This can make OpenFisca slower to start. Once you have installed libyaml, run 'pip uninstall pyyaml && pip install pyyaml --no-cache-dir' so that it is used in your Python environment.
warnings.warn(" ".join(message), LibYAMLWarning)
tcmalloc: large alloc 1082007552 bytes == 0x55b0a79ea000 @ 0x7f62828df1e7 0x7f627fd4046e 0x7f627fd90c7b 0x7f627fd9135f 0x7f627fe33103 0x55b0a3936544 0x55b0a3936240 0x55b0a39aa627 0x55b0a3937afa 0x55b0a39a5c0d 0x55b0a3939b6b 0x55b0a397a9c9 0x55b0a397a93c 0x55b0a3a1e409 0x55b0a39a5e7a 0x55b0a39a49ee 0x55b0a3937bda 0x55b0a39a6737 0x55b0a39a49ee 0x55b0a3937bda 0x55b0a39a5c0d 0x55b0a3937afa 0x55b0a39a9d00 0x55b0a3937afa 0x55b0a39a9d00 0x55b0a3939b6b 0x55b0a397a9c9 0x55b0a397a93c 0x55b0a3a1e409 0x55b0a39a5e7a 0x55b0a39a4ced
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/cli.py", line 17, in main
return getattr(datasets[args.dataset], args.action)(*args.args)
File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/utils.py", line 97, in new_generate_func
return generate_func(year, *args)
File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/datasets/cps/cps.py", line 30, in generate
"person",
File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/datasets/cps/cps.py", line 29, in <listcomp>
for entity in (
File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 569, in __getitem__
return self.get(key)
File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 792, in get
return self._read_group(group)
File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 1810, in _read_group
return s.read()
File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 3161, in read
values = self.read_array(f"block{i}_values", start=_start, stop=_stop)
File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 2818, in read_array
ret = node[0][start:stop]
File "/usr/local/lib/python3.7/dist-packages/tables/vlarray.py", line 681, in __getitem__
return self.read(start, stop, step)[0]
File "/usr/local/lib/python3.7/dist-packages/tables/vlarray.py", line 821, in read
listarr = self._read_array(start, stop, step)
File "tables/hdf5extension.pyx", line 2155, in tables.hdf5extension.VLArray._read_array
ValueError: cannot set WRITEABLE flag to True of this array
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/openfisca-us-data", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/cli.py", line 19, in main
print(f"Encountered an error: {e.with_traceback()}")
TypeError: with_traceback() takes exactly one argument (0 given)
Closing remaining open files:/usr/local/lib/python3.7/dist-packages/openfisca_us_data/microdata/external/raw_cps_2021.h5...done
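The second traceback comes from the error handler itself: BaseException.with_traceback() requires a traceback object as its argument, so calling it with none raises a TypeError that hides the original ValueError. A minimal sketch of a handler that reports the original error instead, using the standard traceback module (the run and failing_action names are hypothetical, not the package's actual cli.py code):

```python
import traceback


def run(action):
    """Call action() and report any exception with its full traceback."""
    try:
        return action()
    except Exception as e:
        # format_exc() returns the traceback of the exception being handled
        # as a string, so the original error is reported, not masked.
        message = f"Encountered an error: {e}\n{traceback.format_exc()}"
        print(message)
        return message


def failing_action():
    # Stand-in for the dataset action that failed in the log above.
    raise ValueError("cannot set WRITEABLE flag to True of this array")


message = run(failing_action)
```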
Trying to debug:
from openfisca_us import Microsimulation
sim = Microsimulation()
sim.calc("household_net_income")
(This breaks: summing net income over households returns only 60k households, but there are 90k households.) I dug around, and this seems to be because some household IDs (household.H_SEQ) in the CPS have no people attached. MWE:
from openfisca_us_data import CPS
household = CPS.load(2020, "household")
person = CPS.load(2020, "person")
4 in household.H_SEQ.values and 4 not in person.PH_SEQ.values
>> True
@MaxGhenis @nmrodelo @tolaouk if you've worked with the CPS indexes before and can spot a simple error on my part here, would be much appreciated! Thanks - this should get policyengine-us pretty close to functional.
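To list every such household rather than test a single ID, a sketch along these lines may help. It uses toy tables in place of the CPS.load results; only the H_SEQ and PH_SEQ column names come from the MWE above:

```python
import pandas as pd

# Toy stand-ins for CPS.load(2020, "household") and CPS.load(2020, "person").
household = pd.DataFrame({"H_SEQ": [1, 2, 3, 4]})
person = pd.DataFrame({"PH_SEQ": [1, 1, 2, 3, 3]})

# A household has people if its ID appears in the person table's foreign key.
has_people = household.H_SEQ.isin(person.PH_SEQ)

# IDs of households with no matching person records.
empty_households = household.loc[~has_people, "H_SEQ"].tolist()
print(empty_households)
```

Filtering the household table with has_people before building the dataset would keep the two tables consistent, assuming the empty households are not needed.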
For example, here are the files for 2019:
Each includes two CSV files which need to be concatenated. I don't think the ACS has a family concept. I'm speaking tomorrow with someone who built tax units from the ACS to use with taxcalc, so I'll report back on how we can do that.
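A minimal sketch of that concatenation step, assuming both parts share the same columns. The in-memory CSV text and the column values are made up for illustration; real files would be read from the downloaded archives:

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV parts of one ACS file.
part_a = io.StringIO("SERIALNO,AGEP\n1,34\n2,61\n")
part_b = io.StringIO("SERIALNO,AGEP\n3,12\n")

# Read each part, then stack them into one table with a fresh 0..n-1 index.
person = pd.concat(
    [pd.read_csv(part) for part in (part_a, part_b)],
    ignore_index=True,
)
print(person)
```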
i.e.
openfisca-us-data raw_cps generate 2021
openfisca-us-data cps generate 2021
(This will be 2020 after #25 is fixed)
As I'm working on my own raw class functionality, I'm looking for conventions to follow. However, I'm wondering whether to follow the convention on lines 47, 49 and 51 of openfisca_us_data/datasets/cps/raw_cps.py, which fill in missing data with 0s like so:
storage["person"] = person = pd.read_csv(f).fillna(0)
I assume a processing operation like this would be better suited to the CPS method rather than RawCPS. If not, just let me know the reasoning behind putting it here as I'm making similar decisions for the CE survey.
Edit: Ah I'm seeing the functions below that take sums and probably need those 0s filled in. I guess I'm still struggling with what processing goes in Raw.
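One possible reason the sums need the zeros, sketched with made-up column names (not the actual CPS fields): adding two columns element-wise propagates NaN, so a missing component makes the whole total missing unless it is filled first.

```python
import numpy as np
import pandas as pd

# Hypothetical person table where each row is missing one income component.
person = pd.DataFrame(
    {
        "wages": [100.0, np.nan],
        "interest": [np.nan, 5.0],
    }
)

# Element-wise addition yields NaN wherever either operand is NaN.
raw_total = person["wages"] + person["interest"]

# After filling with 0, missing components simply contribute nothing.
filled = person.fillna(0)
filled_total = filled["wages"] + filled["interest"]

print(raw_total.tolist())
print(filled_total.tolist())
```

An alternative that leaves the raw table untouched would be person["wages"].add(person["interest"], fill_value=0) at the point of summation, which keeps the fill decision in the processing step.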
The status bar was creating errors in the ACS in #49, so I removed it. Once it's made reliable, it'd be good to make it a utility, since it's pretty similar across the CPS and ACS and would also be similar for other datasets. Though maybe it belongs in an openfisca-data package...
Currently it's the SPM research file from ACS
e.g. here's what happens when trying to generate the CPS for a year prior to generating the raw CPS:
(base) mghenis@penguin:~/PolicyEngine/openfisca-us-data$ openfisca-us-data cps generate 2018
Downloaded ASEC: 100%|████████████████████| 13.5k/13.5k [00:00<00:00, 175kiB/s]
Traceback (most recent call last):
File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/datasets/cps/raw_cps.py", line 40, in generate
zipfile = ZipFile(file)
File "/home/mghenis/anaconda3/lib/python3.8/zipfile.py", line 1269, in __init__
self._RealGetContents()
File "/home/mghenis/anaconda3/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/cli.py", line 17, in main
return getattr(datasets[args.dataset], args.action)(*args.args)
File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/utils.py", line 94, in new_generate_func
return generate_func(year, *args)
File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/datasets/cps/cps.py", line 22, in generate
RawCPS.generate(year)
File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/utils.py", line 94, in new_generate_func
return generate_func(year, *args)
File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/datasets/cps/raw_cps.py", line 51, in generate
f"Attempted to extract and save the CSV files, but encountered an error: {e.with_traceback()}"
TypeError: with_traceback() takes exactly one argument (0 given)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mghenis/anaconda3/bin/openfisca-us-data", line 33, in <module>
sys.exit(load_entry_point('openfisca-us-data', 'console_scripts', 'openfisca-us-data')())
File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/cli.py", line 19, in main
print(f"Encountered an error: {e.with_traceback()}")
After moving the Consumer Expenditure Survey code to openfisca-us, and transferring issues there.
We need to make sure the CPS dataset includes the data from:
These should be person-level (one variable in place of two _p and _s variables). In taxdata they are tax-unit level, but only refer to one person within that tax unit.
Due to pandas-dev/pandas#16615:
openfisca-us-data raw_acs generate 2018
throws:
ValueError: Length of values (3061064) does not match length of index (1615763)
This also occurs when doing it from Python:
pd.read_sas("https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_pu_2018.sas7bdat")
or when trying the encoding suggestion from SO:
pd.read_sas("https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_pu_2018.sas7bdat", encoding="iso-8859-1")
This is the remaining step needed to get a basic PolicyEngine US working, and it'll likely show more errors not found in OpenFisca-US unit testing (acceptably so: the thousands of CPS households will provide more edge cases).