policyengine / openfisca-us-data Goto Github PK

  /home/runner/work/openfisca-us-data/openfisca-us-data/tests/cps/test_aggregates.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    tc = yaml.load(f)

Estimate carbon emissions in CE

Per Fremstad & Paul

Impute plastic consumption from CE to CPS

Add installation instructions to README

The variables in cps.csv.gz that we need openfisca-us-data to replicate are listed in the comment below. Each issue in this epic breaks the project down into a smaller group of variables that can probably be implemented together. For each issue (or variable added), we'll need to add them in a PR to the CPS module, with unit tests that check aggregates are close to expected values.

Drop SPM-unit- and tax-unit-level columns from person file in raw

Impute CE carbon emissions onto ACS microdata

Add remaining variables

At this point, we should check that all variables are covered and evaluate any that are outstanding.

Use household structure variables to create primary/foreign keys

Ensure that the structures created by the input dataset (person_id, family_id, etc. and all foreign keys) match the structures described in the taxdata CPS file.

Add CE microdata

Create relevant ACS variables for CE imputation in ACS generation

Some household, some person

Impute plastic consumption from CE to ACS

Create Consumer Expenditure Survey input dataset

Add benefit income variables

The relevant variables are:

mcaid_ben
mcare_ben
ssi_ben
tanf_ben
vet_ben
wic_ben
snap_ben
housing_ben
other_ben

This should be de-prioritised since we're aiming to model many of these anyway.

Add wiki page on translating variable from taxdata

With an example

Impute carbon emissions from the Consumer Expenditure Survey to the Current Population Survey

Add PUMA mappings to legislative district and other geographies

As produced in https://github.com/UBICenter/local-child-allowance

This will allow us to map the ACS to local areas, though this will augment the record count substantially, since the mapping is per PUMA-geo.

Add individual demographic variables

This ensures that we use all the information (casting to person-level where necessary) from:
* age
* blind (from blind_head and blind_spouse)
* fips

Change ASEC year to be survey year

For example, `openfisca-us-data raw_cps generate 2021` currently downloads the "2021 ASEC", which captures activity for calendar year 2020 (the survey was administered in March 2021). I think we should subtract 1 from all years to align with policy years.
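A minimal sketch of the proposed offset. The helper name `asec_to_policy_year` is hypothetical and not part of openfisca-us-data; it only encodes the rule described above, that the ASEC fielded in March of a given year describes the previous calendar year.

```python
def asec_to_policy_year(asec_year: int) -> int:
    """Return the calendar year that a given ASEC release describes."""
    # The ASEC is administered in March and asks about the prior calendar year.
    return asec_year - 1


print(asec_to_policy_year(2021))  # 2020
```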

Estimate carbon emissions in Consumer Expenditure Survey

Assert that values are unique within SPM when selecting first values

openfisca-us-data/openfisca_us_data/datasets/cps/raw_cps.py

Lines 114 to 118 in e16106c

return (
    person[["SPM_" + column for column in SPM_UNIT_COLUMNS]]
    .groupby(person.SPM_ID)
    .first()
)
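One way the check could look, as a sketch rather than the repo's implementation: count distinct values per group before taking the first row, and fail if any column varies within a group. The helper name `first_per_group_checked` and the `SPM_X` column are hypothetical; the toy frame stands in for the person file.

```python
import pandas as pd


def first_per_group_checked(df: pd.DataFrame, key: str, columns: list) -> pd.DataFrame:
    """Take the first row per group after asserting each column is constant within groups."""
    # dropna=False counts NaN as a value, so a mix of NaN and a number is also flagged.
    distinct = df.groupby(key)[columns].nunique(dropna=False)
    offending = [column for column in columns if (distinct[column] > 1).any()]
    assert not offending, f"Columns vary within {key}: {offending}"
    return df.groupby(key)[columns].first()


# Toy person-level data: SPM_X is an illustrative SPM-unit-level column.
person = pd.DataFrame({"SPM_ID": [1, 1, 2], "SPM_X": [100.0, 100.0, 250.0]})
spm_unit = first_per_group_checked(person, "SPM_ID", ["SPM_X"])
print(spm_unit)
```

If a column differs between two people in the same SPM unit, the assertion names the column instead of `.first()` silently picking one value.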

Error generating cps from Colab

See this notebook.

!openfisca-us-data raw_cps generate 2021 works, but then !openfisca-us-data cps generate 2021 produces:

/usr/local/lib/python3.7/dist-packages/openfisca_core/parameters/config.py:17: LibYAMLWarning: libyaml is not installed in your environment. This can make OpenFisca slower to start. Once you have installed libyaml, run 'pip uninstall pyyaml && pip install pyyaml --no-cache-dir' so that it is used in your Python environment.

  warnings.warn(" ".join(message), LibYAMLWarning)
tcmalloc: large alloc 1082007552 bytes == 0x55b0a79ea000 @  0x7f62828df1e7 0x7f627fd4046e 0x7f627fd90c7b 0x7f627fd9135f 0x7f627fe33103 0x55b0a3936544 0x55b0a3936240 0x55b0a39aa627 0x55b0a3937afa 0x55b0a39a5c0d 0x55b0a3939b6b 0x55b0a397a9c9 0x55b0a397a93c 0x55b0a3a1e409 0x55b0a39a5e7a 0x55b0a39a49ee 0x55b0a3937bda 0x55b0a39a6737 0x55b0a39a49ee 0x55b0a3937bda 0x55b0a39a5c0d 0x55b0a3937afa 0x55b0a39a9d00 0x55b0a3937afa 0x55b0a39a9d00 0x55b0a3939b6b 0x55b0a397a9c9 0x55b0a397a93c 0x55b0a3a1e409 0x55b0a39a5e7a 0x55b0a39a4ced
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/cli.py", line 17, in main
    return getattr(datasets[args.dataset], args.action)(*args.args)
  File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/utils.py", line 97, in new_generate_func
    return generate_func(year, *args)
  File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/datasets/cps/cps.py", line 30, in generate
    "person",
  File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/datasets/cps/cps.py", line 29, in <listcomp>
    for entity in (
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 569, in __getitem__
    return self.get(key)
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 792, in get
    return self._read_group(group)
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 1810, in _read_group
    return s.read()
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 3161, in read
    values = self.read_array(f"block{i}_values", start=_start, stop=_stop)
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py", line 2818, in read_array
    ret = node[0][start:stop]
  File "/usr/local/lib/python3.7/dist-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/usr/local/lib/python3.7/dist-packages/tables/vlarray.py", line 821, in read
    listarr = self._read_array(start, stop, step)
  File "tables/hdf5extension.pyx", line 2155, in tables.hdf5extension.VLArray._read_array
ValueError: cannot set WRITEABLE flag to True of this array

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/openfisca-us-data", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/openfisca_us_data/cli.py", line 19, in main
    print(f"Encountered an error: {e.with_traceback()}")
TypeError: with_traceback() takes exactly one argument (0 given)
Closing remaining open files:/usr/local/lib/python3.7/dist-packages/openfisca_us_data/microdata/external/raw_cps_2021.h5...done
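The second traceback comes from the error handler itself: `BaseException.with_traceback` requires a traceback argument and returns the exception, so it cannot be called with no arguments to format a message. A minimal sketch of one alternative, using the standard library's `traceback.format_exc()`; `run_safely` and `failing_action` are hypothetical stand-ins for the CLI's dispatch code.

```python
import traceback


def run_safely(action):
    """Run a callable and return (result, error_report); error_report is None on success."""
    try:
        return action(), None
    except Exception:
        # format_exc() returns the active exception's traceback as a string.
        return None, traceback.format_exc()


def failing_action():
    raise ValueError("cannot set WRITEABLE flag to True of this array")


result, report = run_safely(failing_action)
if report is not None:
    print(f"Encountered an error:\n{report}")
```

This keeps the original error visible instead of masking it with a `TypeError` raised while reporting it.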

Add RawACS for full ACS microdata

Fix age variable in CPS

Per this comment

Empty households

Trying to debug:

from openfisca_us import Microsimulation
sim = Microsimulation()
sim.calc("household_net_income")

This breaks: summing net income over households returns only 60k households, but the household table has 90k. After digging around, the cause seems to be that some household IDs (household.H_SEQ) in the CPS have no people attached. MWE:

from openfisca_us_data import CPS
household = CPS.load(2020, "household")
person = CPS.load(2020, "person")
4 in household.H_SEQ.values and 4 not in person.PH_SEQ.values
>> True

@MaxGhenis @nmrodelo @tolaouk if you've worked with the CPS indexes before and can spot a simple error on my part here, would be much appreciated! Thanks - this should get policyengine-us pretty close to functional.
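A sketch of how the empty households could be listed in one pass. The two toy frames stand in for the `CPS.load` outputs above; whether dropping these records is the right fix is an assumption that depends on why they have no people.

```python
import pandas as pd

# Toy stand-ins for the household and person tables.
household = pd.DataFrame({"H_SEQ": [1, 2, 3, 4]})
person = pd.DataFrame({"PH_SEQ": [1, 1, 2, 3, 3]})

# True for households that have at least one person record.
has_people = household.H_SEQ.isin(person.PH_SEQ)

empty_ids = household.H_SEQ[~has_people].tolist()
print(empty_ids)  # [4]

# Keep only households that can be matched to people.
occupied = household[has_people]
```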

Estimate plastic consumption in Consumer Expenditure Survey

Explore SQL database

Add full ACS

For example, here are the files for 2019:

Each includes two CSV files which need to be concatenated. I don't think the ACS has a family concept. I'm speaking tomorrow with someone who built tax units from the ACS to use with taxcalc, so I'll report back on how we can do that.

Update README to document the standard process for the latest ASEC

i.e.

openfisca-us-data raw_cps generate 2021
openfisca-us-data cps generate 2021

(This will be 2020 after #25 is fixed)

Transfer repo to PolicyEngine GitHub org

Support 2020 ASEC

Just dropped: https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html

.fillna(0) used in RawCPS, should probably be used in CPS

As I'm working on my own raw class functionality, I'm looking for conventions to follow. However, I'm wondering whether to follow the convention on lines 47, 49 and 51 of openfisca_us_data/datasets/cps/raw_cps.py, which fill in missing data with 0s like so:

storage["person"] = person = pd.read_csv(f).fillna(0)

I assume a processing operation like this would be better suited to the CPS method rather than RawCPS. If not, just let me know the reasoning behind putting it here as I'm making similar decisions for the CE survey.

Edit: Ah I'm seeing the functions below that take sums and probably need those 0s filled in. I guess I'm still struggling with what processing goes in Raw.

Utility to download and load file with progress bar

The status bar was creating errors in the ACS in #49, so I removed it. Once it's reliable, it would be good to make it a utility, since it's pretty similar across the CPS and ACS and would be similar for other datasets. It may belong in an openfisca-data package, though.
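A rough, standard-library-only sketch of what a shared utility could look like: a chunked copy that reports progress through a callback, so the CPS and ACS code could plug in whatever progress display they use. The function name `copy_with_progress` and its signature are hypothetical, and the in-memory streams stand in for an HTTP response and a local file.

```python
import io
from typing import BinaryIO, Callable


def copy_with_progress(
    source: BinaryIO,
    target: BinaryIO,
    total_bytes: int,
    report: Callable[[int, int], None],
    chunk_size: int = 1024,
) -> int:
    """Copy source to target in chunks, calling report(done, total) after each chunk."""
    done = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        target.write(chunk)
        done += len(chunk)
        report(done, total_bytes)
    return done


# In-memory stand-ins for a download stream and a destination file.
source = io.BytesIO(b"x" * 2500)
target = io.BytesIO()
updates = []
copied = copy_with_progress(source, target, 2500, lambda done, total: updates.append(done))
print(copied, updates)  # 2500 [1024, 2048, 2500]
```

Separating the copy loop from the display means a failure in the progress bar can be isolated without touching the download logic.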

Impute carbon emissions from the Consumer Expenditure Survey to the American Community Survey

Rename existing RawACS to RawSPMACS

Currently it's the SPM research file from ACS

Throw informative error message when running `cps generate` before `raw_cps generate`

e.g. here's what happens when trying to generate the CPS for a year prior to generating the raw CPS:

(base) mghenis@penguin:~/PolicyEngine/openfisca-us-data$ openfisca-us-data cps generate 2018                                                                    
Downloaded ASEC: 100%|████████████████████| 13.5k/13.5k [00:00<00:00, 175kiB/s]
Traceback (most recent call last):
  File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/datasets/cps/raw_cps.py", line 40, in generate                                           
    zipfile = ZipFile(file)
  File "/home/mghenis/anaconda3/lib/python3.8/zipfile.py", line 1269, in __init__                                                                               
    self._RealGetContents()
  File "/home/mghenis/anaconda3/lib/python3.8/zipfile.py", line 1336, in _RealGetContents                                                                       
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/cli.py", line 17, in main                                                                
    return getattr(datasets[args.dataset], args.action)(*args.args)
  File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/utils.py", line 94, in new_generate_func                                                 
    return generate_func(year, *args)
  File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/datasets/cps/cps.py", line 22, in generate                                               
    RawCPS.generate(year)
  File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/utils.py", line 94, in new_generate_func                                                 
    return generate_func(year, *args)
  File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/datasets/cps/raw_cps.py", line 51, in generate                                           
    f"Attempted to extract and save the CSV files, but encountered an error: {e.with_traceback()}"                                                              
TypeError: with_traceback() takes exactly one argument (0 given)

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/mghenis/anaconda3/bin/openfisca-us-data", line 33, in <module>
    sys.exit(load_entry_point('openfisca-us-data', 'console_scripts', 'openfisca-us-data')())                                                                   
  File "/home/mghenis/PolicyEngine/openfisca-us-data/openfisca_us_data/cli.py", line 19, in main                                                                
    print(f"Encountered an error: {e.with_traceback()}")
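The first traceback shows that the downloaded file was not a zip archive. A sketch of a more informative failure, using `zipfile.is_zipfile` to check before opening; `open_asec_zip` is a hypothetical helper, it assumes a seekable file-like object, and the wording of the message is only a suggestion.

```python
import io
import zipfile


def open_asec_zip(file, year: int) -> zipfile.ZipFile:
    """Open a downloaded archive, failing with a readable message if it is not a zip file."""
    if not zipfile.is_zipfile(file):
        raise ValueError(
            f"The file downloaded for {year} is not a zip archive. "
            "The ASEC may not be available for that year, or the URL may be wrong."
        )
    # is_zipfile moves the read position, so rewind before opening.
    file.seek(0)
    return zipfile.ZipFile(file)


# A small non-zip payload, similar to what an error page would return.
not_a_zip = io.BytesIO(b"<html>Not Found</html>")
try:
    open_asec_zip(not_a_zip, 2018)
except ValueError as error:
    print(error)
```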

Adjust TANF underreporting

Some sources:

ht @kickit123

Archive repo

After moving the Consumer Expenditure Survey code to openfisca-us, and transferring issues there.

Create tax units from ACS data

Can follow @hiebertjames's approach from https://github.com/CityOfPhiladelphia/federal-tax-unit-simulation/blob/master/Tax%20Units%20Analysis.R

Add individual financial variables

We need to make sure the CPS dataset includes the data from:

e00200
e00900
pencon

These should be person-level (one variable in place of two _p and _s variables). In taxdata they are tax-unit level, but only refer to one person within that tax unit.
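A sketch of the cast from tax-unit level to person level, using e00200 as the example. The `e00200_p`/`e00200_s` column names, the `role` column and the ID columns are illustrative assumptions, and the toy data assumes every person is either the head or the spouse of their tax unit.

```python
import pandas as pd

# Toy tax-unit table with one column for the primary filer and one for the spouse.
tax_unit = pd.DataFrame({
    "tax_unit_id": [1, 2],
    "e00200_p": [30_000, 50_000],
    "e00200_s": [20_000, 0],
})
# Toy person table linking each person to a tax unit.
person = pd.DataFrame({
    "person_id": [10, 11, 20],
    "tax_unit_id": [1, 1, 2],
    "role": ["head", "spouse", "head"],
})

# Bring the tax-unit columns onto each person row (left merge keeps person order).
merged = person.merge(tax_unit, on="tax_unit_id", how="left")
is_spouse = merged["role"] == "spouse"

# Spouses take the _s column; everyone else takes the _p column.
person["e00200"] = merged["e00200_s"].where(is_spouse, merged["e00200_p"]).to_numpy()
print(person["e00200"].tolist())  # [30000, 20000, 50000]
```

Real data would also need a rule for dependents, who should not inherit the primary filer's value.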

RawACS generation is failing

Due to pandas-dev/pandas#16615:

openfisca-us-data raw_acs generate 2018

throws:

ValueError: Length of values (3061064) does not match length of index (1615763)

This also occurs when doing it from Python:

pd.read_sas("https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_pu_2018.sas7bdat")

or when trying the encoding suggestion from SO:

pd.read_sas("https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_pu_2018.sas7bdat", encoding="iso-8859-1")

Ensure OpenFisca-US completes a full microsimulation

This is the remaining step needed to get a basic PolicyEngine US working, and it'll likely show more errors not found in OpenFisca-US unit testing (acceptably so: the thousands of CPS households will provide more edge cases).

Add basic income and benefit tuples

taxdata has a set of income tuples here and here. These look like renames to me, so we should be able to implement relatively straightforwardly in the BaseCPS dataset.

