Comments (9)
The pandas stuff is just to explore the data and document what I find. I think our ultimate solution will be something else, and will likely also involve some upstream patches to our scrapers.
from warn-transformer.
I've identified some here. It looks like our hashing method does seem to condense the detection:
https://github.com/biglocalnews/warn-transformer/blob/main/_notebooks/duplicates.ipynb
Is the notebook/pandas approach for experimentation purposes? I'm wondering if we could just use sets + tuples to perform a basic dedupe in the production pipeline code to keep things lightweight...
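For what it's worth, here is a rough sketch of what a sets + tuples dedupe could look like using only the standard library. The function name `dedupe_rows`, the column names, and the sample rows are made up for illustration and are not taken from the warn-transformer code. It only catches rows that match exactly in every field.

```python
import csv
import io


def dedupe_rows(rows):
    """Yield each row once, keeping the first occurrence and the original order."""
    seen = set()
    for row in rows:
        key = tuple(row)  # tuples are hashable, lists are not
        if key not in seen:
            seen.add(key)
            yield row


# Hypothetical sample data, not real WARN notices
sample = io.StringIO(
    "company,notice_date,jobs\n"
    "Acme,2022-01-05,100\n"
    "Acme,2022-01-05,100\n"
    "Acme,2022-02-01,120\n"
)
reader = csv.reader(sample)
header = next(reader)
unique_rows = list(dedupe_rows(reader))
```

Memory grows with the number of distinct rows, since every unique tuple is held in the set, so this assumes the file fits comfortably in memory.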
The Alabama dupes are clearly caused by the source.
California dupes also appear in the raw source materials. Not sure why, but we're not introducing them.
Many of these files have numerous dupes. Some may not be complete dupes and instead differ slightly in dates or number of employees. From what I understand, if there is any change, the company submits a new WARN notice to make sure they comply, and it gets added. Additionally, if a company has multiple locations, the data may reflect multiple layoffs from the same company, so that's something to watch for.
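To make that distinction concrete, here is a minimal sketch that drops exact dupes and then flags notices that share a company and location but differ elsewhere, which might be amended filings. The function `split_dupes`, the field names, and the sample records are all hypothetical and do not reflect the actual warn-transformer schema.

```python
from collections import defaultdict


def split_dupes(records, key_fields=("company", "location")):
    """Return (unique_records, possible_amendments).

    unique_records drops rows that match an earlier row in every field.
    possible_amendments maps each key to the unique rows sharing it,
    but only for keys that have more than one such row.
    """
    seen = set()
    unique = []
    for rec in records:
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)

    groups = defaultdict(list)
    for rec in unique:
        groups[tuple(rec[f] for f in key_fields)].append(rec)
    possible_amendments = {k: v for k, v in groups.items() if len(v) > 1}
    return unique, possible_amendments


# Hypothetical records, not real WARN notices
records = [
    {"company": "Acme", "location": "Mobile", "notice_date": "2022-01-05", "jobs": "100"},
    {"company": "Acme", "location": "Mobile", "notice_date": "2022-01-05", "jobs": "100"},
    {"company": "Acme", "location": "Mobile", "notice_date": "2022-02-01", "jobs": "120"},
    {"company": "Acme", "location": "Huntsville", "notice_date": "2022-01-05", "jobs": "40"},
]
unique, possible_amendments = split_dupes(records)
```

Rows from different locations land in different groups, so a multi-site layoff is not flagged as an amendment under these assumed key fields.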
Gotcha. Now that I've verified these "exact dupes" come from the source, and aren't introduced as bugs in our code, I think I'm going to take the step of eliminating them from our dataset at this point.
makes sense