warn-transformer's People

Contributors

anikasikka, chriszs, esagara, palewire, stucka, ydoc5212, zstumgoren

warn-transformer's Issues

create multiple date fields for standardize_fields.py

We need to reduce the incidence of multiple dates getting shoved into the same field. Let's create a new field in STANDARDIZED_FIELDS and use CT as a pilot state to make sure this doesn't happen.

My suspicion is we'll need to split date_effective into two columns: date_layoff and date_closure.
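
A minimal sketch of what that split could look like, assuming hypothetical raw CT column names ("Layoff Date", "Closing Date"); the real source headers and the STANDARDIZED_FIELDS wiring may differ:

    # Hypothetical sketch: route separate raw CT dates into separate
    # standardized columns instead of one shared date field.
    # The raw column names here are assumptions for illustration only.
    def standardize_ct_dates(row: dict) -> dict:
        return {
            "date_layoff": row.get("Layoff Date") or None,
            "date_closure": row.get("Closing Date") or None,
        }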

Consider accepting dates before 1999

WARN was enacted in 1988 and there are some states (OR, IL, soon GA) with notices going back to then. I think it's super cool to have that kind of historical perspective where we can. Therefore, I propose that the first valid year should be 1988.

WI improvement

The WI scraper is picking up data from the website that we either 1) don't want to scrape or 2) don't want in the final analysis data.

This is an example of what is scraped:

[screenshot of scraped rows]

Tables like this on the site are the cause:

[screenshot of the source table]

I'll go ahead and assign myself to this. I think we should drop these rows but integrate the data somehow.

Standardize VA scraper

  • Currently there are two columns, "Closing" and "Layoff", each marked with yes or no. For standardization purposes, it would be helpful for the crawler to have one column (called something like "Layoff Type") with expected outputs of "Closing" or "Layoff" (see the sketch after this list).

  • Do we need to do any standardization with the columns "permanent" or "realignment"? Is this relevant information? I don't think I've noticed it in any other crawlers.

  • Investigate the relationship between the columns "city/town" and "location city". What is our merging strategy? Should we ignore one column, or is it safe to simply merge all empty/non-empty cells? Edit: drop "location city" and go with the "city/town" column. The question remains what the purpose of "location city" is, since it doesn't seem to be tied to the business address.
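
A rough sketch of the single-column approach suggested in the first bullet, assuming the raw VA values are literal "yes"/"no" strings (an assumption, not confirmed against the export):

    from typing import Optional

    def derive_layoff_type(row: dict) -> Optional[str]:
        """Collapse VA's two yes/no columns into one "Layoff Type" value."""
        def is_yes(value):
            return (value or "").strip().lower() in ("yes", "y")

        if is_yes(row.get("Closing")):
            return "Closing"
        if is_yes(row.get("Layoff")):
            return "Layoff"
        return None  # neither flag set; worth logging for review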

Standardize date formats

Datasets from some states contain multiple date formats within the same file (e.g. MO, DC) and possibly different conventions for documenting updates.

A thorough date format standardization should be applied to all of the states at some point.
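
One possible shape for that pass, not the project's actual parser: try a short list of known formats and normalize everything to ISO 8601, leaving anything unparseable for manual review. The format list is illustrative.

    from datetime import datetime
    from typing import Optional

    KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y")

    def normalize_date(value: str) -> Optional[str]:
        """Return an ISO 8601 date string, or None if no known format matches."""
        value = value.strip()
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
        return None  # flag for manual review rather than guessing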

Create a data pipeline to generate a CSV with standardized field names

We should create one or more scripts that can take files exported by our scrapers (i.e. the files in exports/ dir) and produce a single CSV that merges all states. For each state, it should map a subset of fields to some minimal set of standardized fields (e.g. Company Name -> employer, Number of Employees Affected -> number_affected, etc.).

The output of this process should be a single CSV that contains the subset of fields for all states we currently scrape.

Something to consider: we may want to model a wider range of standardized field names on the approach used by https://layoffdata.com/data/
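
A rough sketch of how the merge step could work, assuming a per-state dict that maps raw headers to the standardized names above; the paths, mappings and output columns here are placeholders, not the project's actual configuration.

    import csv
    from pathlib import Path

    # Hypothetical per-state mappings from raw headers to standardized names.
    FIELD_MAPS = {
        "fl": {"Company Name": "employer", "Employees Affected": "number_affected"},
        # ... one mapping per scraped state
    }

    def merge_exports(export_dir: str, out_path: str) -> None:
        """Merge every exports/*.csv into one CSV with standardized fields."""
        fieldnames = ["state", "employer", "number_affected"]
        with open(out_path, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            for path in sorted(Path(export_dir).glob("*.csv")):
                state = path.stem.lower()
                mapping = FIELD_MAPS.get(state)
                if mapping is None:
                    continue  # skip states without a mapping yet
                with open(path, newline="") as f:
                    for row in csv.DictReader(f):
                        out_row = {"state": state.upper()}
                        for raw, std in mapping.items():
                            out_row[std] = row.get(raw)
                        writer.writerow(out_row)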

Error: "No project named WARN Act Notices found" on `make download`

Following the instructions in the README, I set up the project, set the BLN API key, ran `make download`, and got:

2022-02-23 19:54:12,877 - urllib3.connectionpool - https://api.biglocalnews.org:443 "POST /graphql HTTP/1.1" 200 None
Traceback (most recent call last):
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspaces/warn-transformer/warn_transformer/download.py", line 52, in <module>
    run()
  File "/workspaces/warn-transformer/warn_transformer/download.py", line 32, in run
    p = c.get_project_by_name("WARN Act Notices")
  File "/workspaces/warn-transformer/.venv/lib/python3.8/site-packages/bln/client.py", line 435, in get_project_by_name
    raise ValueError(f"No project named {name} found")
ValueError: No project named WARN Act Notices found

make: *** [Makefile:76: download] Error 1

I'm able to view the project in the BLN directory and download a file manually. It says here I'm a viewer, but of course I am not in the list of 15 explicitly added users.

[screenshot of the project's access settings]

Suspiciously large OR job cut located in MN

company | postal_code | jobs | hash_id | location | date
NORTHWEST AIRLINES | OR | 27500 | 50e28021cc053807ecdeaac24862150a3bb0bddc24b25c43d2832a96 | ST. PAUL, MN | 1998-08-11T00:00:00.000Z

VA standardization error: "UnicodeDecodeError"

@zstumgoren I'm getting a "UnicodeDecodeError" when my standardizing program runs on the VA export file. As you can see, the error happens when my program begins to parse the rows out of the state CSV file, which it does successfully so far on AK, CT, DC, FL, IN, ME, MO, and NJ. Any ideas what might be causing this bug? Here's the output from my Python program:

(Cody-DellXPS-Uab08hY7) C:\Users\Cody-DellXPS\warn-analysis>python standardize_field_names.py
Processing state ak.csv...
(...)
Processing state va.csv...
Traceback (most recent call last):
  File "standardize_field_names.py", line 110, in <module>
    main()
  File "standardize_field_names.py", line 54, in main
    for row_idx, row in enumerate(state_csv):
  File "c:\users\cody-dellxps\appdata\local\programs\python\python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 6006: character maps to <undefined>
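
One likely cause is that the file is being opened with the Windows default cp1252 codec. A minimal workaround sketch (an assumption, not a confirmed diagnosis) is to open the export with an explicit encoding and fall back to latin-1, which accepts any byte:

    import csv

    def read_state_csv(path: str) -> list:
        """Read a state export, trying UTF-8 (with BOM handling) before latin-1."""
        for encoding in ("utf-8-sig", "latin-1"):
            try:
                with open(path, newline="", encoding=encoding) as f:
                    return list(csv.DictReader(f))
            except UnicodeDecodeError:
                continue
        return []  # unreachable in practice: latin-1 never raises UnicodeDecodeError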

Check state data for standardize_fields

standardize_fields.csv could use a pass over all the states' data.

  • Run all the scrapers! Copy and paste the following command into a pipenv shell:
    python -m warn.cli -s AK AL AZ CA DE MD NE NY OH OK OR RI SD UT VT WA WI CT DC FL IA KS ME MO MT NJ VA
    Let Cody know if any scrapers fail (it will print a summary of which scrapers failed at the end).
    [screenshot of the failure summary]
  • Run standardize_fields.py
  • Open the output in analysis/standardize_fields.csv (this doesn't have to be super in-depth; just check a few rows really closely and glance over the rest):
    • CA
    • MD
    • MT
    • MO
    • UT
    • WI
    • IA
    • KS
    • NY

If you find something suspicious:

  1. Check it against the state export CSV (e.g. exports/FL.csv) to make sure the problem isn't in my code.
  2. Check the state WARN website to make sure there isn't a problem with the scraper.
  3. Let Cody know and file a ticket for the state if one doesn't already exist. This applies whether it's a problem with the scraper or a problem with the website.

CT standardization

  • The goal is to get standardized layoff vs. closure data from the state (e.g. from closure: "yes" to layoff_type: "closure").
  • A secondary goal is to clean up "revised notice" from the company names. Regex is hard, and we'll probably clean up company names as part of the "convert to canonical company names" project.

Add a load step

  • Get yesterday's file
  • Get today's file
  • Loop through every source and compare today's hashes to yesterday's; when they match exactly, nothing has changed and those rows can be put aside. The remaining rows are either new or amended (see the sketch after this list).
  • Loop through the new-or-amended list and compare each dict against all of yesterday's dicts that didn't have an exact match. In cases where there is a match above a certain fancy threshold to be determined later, these are potential amendments and should be filed separately. In cases where there is no close match, the records are assumed to be new.
  • New records are added to yesterday's list with a current timestamp
  • Amended records get a customized test based on what we know about the fields. In cases where the changed fields suggest it's highly likely the record is amended, we write the amended record with an update timestamp.
  • In cases where the field comparison doesn't suggest the record is a likely amendment, we may ask a human to decide if it ought to be filed as new.
  • Log all new records and all amendments to console, but also maybe Slack, a GitHub Issue, something else.
  • Write out an updated version of our "loaded.csv" file that has the reconciled database with our additional timestamp metadata columns
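
A rough sketch of the first two passes described above (the hash field name, matching threshold, and amendment test are all placeholders to be worked out later):

    from difflib import SequenceMatcher

    def partition_rows(today: list, yesterday: list):
        """Split today's rows into unchanged rows and new-or-amended candidates."""
        yesterday_hashes = {row["hash_id"] for row in yesterday}
        unchanged = [row for row in today if row["hash_id"] in yesterday_hashes]
        candidates = [row for row in today if row["hash_id"] not in yesterday_hashes]
        return unchanged, candidates

    def looks_amended(candidate: dict, old_row: dict, threshold: float = 0.9) -> bool:
        """Crude similarity test standing in for the 'fancy threshold' above."""
        a = " ".join(str(v) for v in candidate.values())
        b = " ".join(str(v) for v in old_row.values())
        return SequenceMatcher(None, a, b).ratio() >= threshold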

Handful of entries with dodgy dates

company | postal_code | jobs | hash_id | location | date | year
Adventist Health St. Helena | CA | 5 | bb9826463869907671fcc3b7b3d94b562af891b7d2994aa4c9fa1658 | Saint Helena | 2008-09-04 | 2008
Dominion/State Line Energy Station | IN | 109 | 4c08e9392e75cdf08e26e4bde78161cb4dfc731ff5e1130aed2ca60a | Hammond | 1202-01-30 | 1202
burlington woods nursing home | NJ | 20 | c2a3baa8fdc69c4af11059e51ed7f1b1ae62b0b6c10c5b1cf2d7b144 | burlington | 3030-08-23 | 3030
Hooper Holmes, Inc. dba Provant Health | RI | 92 | 6410a06432ddae7bc4eee219c914665055867e9f6e00661ee9df60e7 | Warwick | 2108-11-01 | 2108
Nordstrom Providence Place | RI | 181 | 08e170ab6fb5f2e6aaf0186ff83c793a2949476fd7638ae7b1a8e46f | Providence | 2108-10-23 | 2108
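
A simple sanity check could flag these during the transform, treating 1988 (per the earlier issue about pre-1999 dates) through the current year as the plausible range. A minimal sketch, with the row keys assumed from the table above:

    import datetime

    def has_dodgy_year(row: dict) -> bool:
        """Flag rows whose year falls outside WARN's 1988 enactment and today."""
        try:
            year = int(row["year"])
        except (KeyError, ValueError):
            return True
        return year < 1988 or year > datetime.date.today().year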
