warn-transformer's People

Contributors

anikasikka, chriszs, esagara, palewire, stucka, ydoc5212, zstumgoren

warn-transformer's Issues

create multiple date fields for standardize_fields.py

We need to reduce the incidence of multiple dates getting shoved into the same field. Let's create a new field in STANDARDIZED_FIELDS and use CT as a pilot state to make sure this doesn't happen.

My suspicion is we'll need to split date_effective into two columns: date_layoff and date_closure.
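
A minimal sketch of what that split could look like, assuming hypothetical raw CT column names ("Layoff Date", "Closing Date"); the real source headers and the STANDARDIZED_FIELDS wiring may differ:

    # Hypothetical sketch: route separate raw CT dates into separate
    # standardized columns instead of one shared date field.
    # The raw column names here are assumptions for illustration only.
    def standardize_ct_dates(row: dict) -> dict:
        return {
            "date_layoff": row.get("Layoff Date") or None,
            "date_closure": row.get("Closing Date") or None,
        }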

Consider accepting dates before 1999

WARN was enacted in 1988 and there are some states (OR, IL, soon GA) with notices going back to then. I think it's super cool to have that kind of historical perspective where we can. Therefore, I propose that the first valid year should be 1988.

WI improvement

The WI scraper is picking up data from the website that we either 1) don't want to scrape or 2) don't want in the final analysis data.

This is an example of what is scraped:

[screenshot of scraped rows]

Tables like this on the site are the cause:

[screenshot of the source table]

I'll go ahead and assign myself to this. I think we should drop these rows but integrate the data somehow.

Standardize VA scraper

  • Currently there are two columns, "Closing" and "Layoff", each marked with yes or no. For standardization purposes, it would be helpful for the crawler to have one column (called something like "Layoff Type") with expected outputs of "Closing" or "Layoff" (see the sketch after this list).

  • Do we need to do any standardization with the columns "permanent" or "realignment"? Is this relevant information? I don't think I've noticed it in any other crawlers.

  • Investigate the relationship between the columns "city/town" and "location city". What is our merging strategy? Should we ignore one column, or is it safe to simply merge all empty/non-empty cells? Edit: drop "location city" and go with the "city/town" column. The question remains what the purpose of "location city" is, since it doesn't seem to be tied to the business address.
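
A rough sketch of the single-column approach suggested in the first bullet, assuming the raw VA values are literal "yes"/"no" strings (an assumption, not confirmed against the export):

    from typing import Optional

    def derive_layoff_type(row: dict) -> Optional[str]:
        """Collapse VA's two yes/no columns into one "Layoff Type" value."""
        def is_yes(value):
            return (value or "").strip().lower() in ("yes", "y")

        if is_yes(row.get("Closing")):
            return "Closing"
        if is_yes(row.get("Layoff")):
            return "Layoff"
        return None  # neither flag set; worth logging for review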

Standardize date formats

Datasets from some states contain multiple date formats within the same file (e.g. MO, DC) and possibly different conventions for documenting updates.

A thorough date format standardization should be applied to all of the states at some point.
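
One possible shape for that pass, not the project's actual parser: try a short list of known formats and normalize everything to ISO 8601, leaving anything unparseable for manual review. The format list is illustrative.

    from datetime import datetime
    from typing import Optional

    KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y")

    def normalize_date(value: str) -> Optional[str]:
        """Return an ISO 8601 date string, or None if no known format matches."""
        value = value.strip()
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
        return None  # flag for manual review rather than guessing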

Create a data pipeline to generate a CSV with standardized field names

We should create one or more scripts that can take files exported by our scrapers (i.e. the files in exports/ dir) and produce a single CSV that merges all states. For each state, it should map a subset of fields to some minimal set of standardized fields (e.g. Company Name -> employer, Number of Employees Affected -> number_affected, etc.).

The output of this process should be a single CSV that contains the subset of fields for all states we currently scrape.

Something to consider: we may want to model a wider range of standardized field names on the approach used by https://layoffdata.com/data/
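
A rough sketch of how the merge step could work, assuming a per-state dict that maps raw headers to the standardized names above; the paths, mappings and output columns here are placeholders, not the project's actual configuration.

    import csv
    from pathlib import Path

    # Hypothetical per-state mappings from raw headers to standardized names.
    FIELD_MAPS = {
        "fl": {"Company Name": "employer", "Employees Affected": "number_affected"},
        # ... one mapping per scraped state
    }

    def merge_exports(export_dir: str, out_path: str) -> None:
        """Merge every exports/*.csv into one CSV with standardized fields."""
        fieldnames = ["state", "employer", "number_affected"]
        with open(out_path, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            for path in sorted(Path(export_dir).glob("*.csv")):
                state = path.stem.lower()
                mapping = FIELD_MAPS.get(state)
                if mapping is None:
                    continue  # skip states without a mapping yet
                with open(path, newline="") as f:
                    for row in csv.DictReader(f):
                        out_row = {"state": state.upper()}
                        for raw, std in mapping.items():
                            out_row[std] = row.get(raw)
                        writer.writerow(out_row)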

Error: "No project named WARN Act Notices found" on `make download`

Following the instructions in the README, I set up the project, set the BLN API key, ran `make download`, and got:

2022-02-23 19:54:12,877 - urllib3.connectionpool - https://api.biglocalnews.org:443 "POST /graphql HTTP/1.1" 200 None
Traceback (most recent call last):
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspaces/warn-transformer/warn_transformer/download.py", line 52, in <module>
    run()
  File "/workspaces/warn-transformer/warn_transformer/download.py", line 32, in run
    p = c.get_project_by_name("WARN Act Notices")
  File "/workspaces/warn-transformer/.venv/lib/python3.8/site-packages/bln/client.py", line 435, in get_project_by_name
    raise ValueError(f"No project named {name} found")
ValueError: No project named WARN Act Notices found

make: *** [Makefile:76: download] Error 1

I'm able to view the project in the BLN directory and download a file manually. It says here I'm a viewer, but of course I am not in the list of 15 explicitly added users.

[screenshot of the project's access settings]

Suspiciously large OR job cut located in MN

company | postal_code | jobs | hash_id | location | date
NORTHWEST AIRLINES | OR | 27500 | 50e28021cc053807ecdeaac24862150a3bb0bddc24b25c43d2832a96 | ST. PAUL, MN | 1998-08-11T00:00:00.000Z

VA standardization error: "UnicodeDecodeError"

@zstumgoren I'm getting a "UnicodeDecodeError" when my standardizing program runs on the VA export file. As you can see, the error happens when my program begins to parse the rows out of the state CSV file, which it does successfully so far on AK, CT, DC, FL, IN, ME, MO, and NJ. Any ideas what might be causing this bug? Here's the output from my Python program:

(Cody-DellXPS-Uab08hY7) C:\Users\Cody-DellXPS\warn-analysis>python standardize_field_names.py
Processing state ak.csv...
(...)
Processing state va.csv...
Traceback (most recent call last):
  File "standardize_field_names.py", line 110, in <module>
    main()
  File "standardize_field_names.py", line 54, in main
    for row_idx, row in enumerate(state_csv):
  File "c:\users\cody-dellxps\appdata\local\programs\python\python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 6006: character maps to <undefined>
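
One likely cause is that the file is being opened with the Windows default cp1252 codec. A minimal workaround sketch (an assumption, not a confirmed diagnosis) is to open the export with an explicit encoding and fall back to latin-1, which accepts any byte:

    import csv

    def read_state_csv(path: str) -> list:
        """Read a state export, trying UTF-8 (with BOM handling) before latin-1."""
        for encoding in ("utf-8-sig", "latin-1"):
            try:
                with open(path, newline="", encoding=encoding) as f:
                    return list(csv.DictReader(f))
            except UnicodeDecodeError:
                continue
        return []  # unreachable in practice: latin-1 never raises UnicodeDecodeError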

Check state data for standardize_fields

standardize_fields.csv could use a pass over all the states' data.

  • Run all the scrapers! Copy and paste the following command into a pipenv shell:
    python -m warn.cli -s AK AL AZ CA DE MD NE NY OH OK OR RI SD UT VT WA WI CT DC FL IA KS ME MO MT NJ VA
    Let Cody know if any scrapers fail (it will print a summary of which scrapers failed at the end).
    [screenshot of the failure summary]
  • Run standardize_fields.py
  • Open the output in analysis/standardize_fields.csv (this doesn't have to be super in-depth; just check a few rows really closely and glance over the rest):
    • CA
    • MD
    • MT
    • MO
    • UT
    • WI
    • IA
    • KS
    • NY

If you find something suspicious:

  1. Check it against the state export CSV (e.g. exports/FL.csv) to make sure the problem isn't in my code.
  2. Check the state WARN website to make sure there isn't a problem with the scraper.
  3. Let Cody know and file a ticket for the state if one doesn't already exist. This applies whether it's a problem with the scraper or a problem with the website.

CT standardization

  • The goal is to get standardized layoff vs. closure data from the state (e.g. from closure: "yes" to layoff_type: "closure").
  • A secondary goal is to clean up "revised notice" from the company names. Regex is hard, and we'll probably clean up company names as part of the "convert to canonical company names" project.

Add a load step

  • Get yesterday's file
  • Get today's file
  • Loop through every source and compare today's hashes to yesterday's; when they match exactly, nothing has changed and those rows can be put aside. The remaining rows are either new or amended (see the sketch after this list).
  • Loop through the new-or-amended list and compare each dict against all of yesterday's dicts that didn't have an exact match. In cases where there is a match above a certain fancy threshold to be determined later, these are potential amendments and should be filed separately. In cases where there is no close match, the records are assumed to be new.
  • New records are added to yesterday's list with a current timestamp
  • Amended records get a customized test based on what we know about the fields. In cases where the changed fields suggest it's highly likely the record is amended, we write the amended record with an update timestamp.
  • In cases where the field comparison doesn't suggest the record is a likely amendment, we may ask a human to decide if it ought to be filed as new.
  • Log all new records and all amendments to console, but also maybe Slack, a GitHub Issue, something else.
  • Write out an updated version of our "loaded.csv" file that has the reconciled database with our additional timestamp metadata columns
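
A rough sketch of the first two passes described above (the hash field name, matching threshold, and amendment test are all placeholders to be worked out later):

    from difflib import SequenceMatcher

    def partition_rows(today: list, yesterday: list):
        """Split today's rows into unchanged rows and new-or-amended candidates."""
        yesterday_hashes = {row["hash_id"] for row in yesterday}
        unchanged = [row for row in today if row["hash_id"] in yesterday_hashes]
        candidates = [row for row in today if row["hash_id"] not in yesterday_hashes]
        return unchanged, candidates

    def looks_amended(candidate: dict, old_row: dict, threshold: float = 0.9) -> bool:
        """Crude similarity test standing in for the 'fancy threshold' above."""
        a = " ".join(str(v) for v in candidate.values())
        b = " ".join(str(v) for v in old_row.values())
        return SequenceMatcher(None, a, b).ratio() >= threshold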

Handful of entries with dodgy dates

company | postal_code | jobs | hash_id | location | date | year
Adventist Health St. Helena | CA | 5 | bb9826463869907671fcc3b7b3d94b562af891b7d2994aa4c9fa1658 | Saint Helena | 2008-09-04 | 2008
Dominion/State Line Energy Station | IN | 109 | 4c08e9392e75cdf08e26e4bde78161cb4dfc731ff5e1130aed2ca60a | Hammond | 1202-01-30 | 1202
burlington woods nursing home | NJ | 20 | c2a3baa8fdc69c4af11059e51ed7f1b1ae62b0b6c10c5b1cf2d7b144 | burlington | 3030-08-23 | 3030
Hooper Holmes, Inc. dba Provant Health | RI | 92 | 6410a06432ddae7bc4eee219c914665055867e9f6e00661ee9df60e7 | Warwick | 2108-11-01 | 2108
Nordstrom Providence Place | RI | 181 | 08e170ab6fb5f2e6aaf0186ff83c793a2949476fd7638ae7b1a8e46f | Providence | 2108-10-23 | 2108
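
A simple sanity check could flag these during the transform, treating 1988 (per the earlier issue about pre-1999 dates) through the current year as the plausible range. A minimal sketch, with the row keys assumed from the table above:

    import datetime

    def has_dodgy_year(row: dict) -> bool:
        """Flag rows whose year falls outside WARN's 1988 enactment and today."""
        try:
            year = int(row["year"])
        except (KeyError, ValueError):
            return True
        return year < 1988 or year > datetime.date.today().year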
