Comments (9)
@jbothma Somehow missed there was a pass 1 of the KZN data as well. Looking through it, it looks exactly the same as pass 2. Should it be extracted anyway?
from data-extraction.
Hey! The idea with pass1 and pass2 is that different people extract the same data twice, so that we can compare the results and spot errors. Just do one of them.
Ah! Cool, makes sense
@JasonTame TBH, in retrospect we should probably have called them Pass A and Pass B, since there is no implied sequence here. The intent is merely to have two people do the same task so that we can compare the outputs for correctness.
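For illustration only, here is a minimal sketch of how two independently extracted passes might be compared cell by cell. It is not the project's actual integrity check; the function name `diff_passes`, the column names, and the sample rows are all hypothetical.

```python
import csv
import io
from itertools import zip_longest


def diff_passes(pass_a_text, pass_b_text):
    """Return (row_no, col_no, value_a, value_b) for every cell that differs.

    Row and column numbers are 1-based. Missing rows or cells are treated
    as empty strings so that length mismatches also show up as differences.
    """
    rows_a = list(csv.reader(io.StringIO(pass_a_text)))
    rows_b = list(csv.reader(io.StringIO(pass_b_text)))
    mismatches = []
    for row_no, (row_a, row_b) in enumerate(
        zip_longest(rows_a, rows_b, fillvalue=[]), start=1
    ):
        for col_no, (a, b) in enumerate(
            zip_longest(row_a, row_b, fillvalue=""), start=1
        ):
            # Ignore leading/trailing whitespace, which extractors often disagree on.
            if a.strip() != b.strip():
                mismatches.append((row_no, col_no, a, b))
    return mismatches


# Hypothetical sample data standing in for two people's extractions.
pass_a = "supplier,amount\nAcme Ltd,1500.00\nBeta CC,230.50\n"
pass_b = "supplier,amount\nAcme Ltd,1500.00\nBeta CC,280.50\n"

for mismatch in diff_passes(pass_a, pass_b):
    print(mismatch)
```

Any cell reported here would then be checked against the source PDF, since the comparison only shows that the two passes disagree, not which one is right.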
I'll take this one
Sincerest apologies, and I hope this doesn't become a blocker for anyone else, but something came up and I couldn't dedicate the time required to finish this, which ended up being rather challenging to process. The tabular data needs a lot of work and the OCR is not 100% accurate, so you really have to pay attention and make numerous manual fixes. I did some of the tables and have pushed draft data, along with the outstanding rows in CSV format, to my fork in case anybody feels they can finish this up. I'm unassigning myself in case a night owl can get it done; otherwise I'll be ready to finish up tomorrow. Kudos to @JasonTame for knocking out the first copy of this, though...
KZN PROVINCIAL GOVERNMENT - Procurement Disclosure-OCRd.pdf
May I continue where @zacharlie left off? Looks like it's the only unassigned issue
@Fruitymo It was completed in #257 and should probably be reassigned to me.
But this one is a bit challenging because of the volume of data and the OCR output, which was rife with errors. As described in the PR, I recommend that this particular sheet be manually cross-referenced, row by row, against the original PDF and the latest dataset I submitted.
That should perhaps be a new issue though. Perhaps @jbothma or @schalkventer can advise.
It's a painstaking process, but I don't think the automated integrity checks help much here at all... Unless someone has a more refined OCR processing technique we can validate against.
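As a rough aid to that manual review, a sketch like the following could flag amount cells that do not parse as clean numbers, so the most likely OCR slips get checked first. This is only an assumption about what might help: the accepted number format (optional leading `R`, space or comma thousands separators, two decimals) and the helper name `is_suspicious_amount` are hypothetical and not taken from the dataset.

```python
import re

# Assumed "clean" amount: optional minus sign, digits with optional space or
# comma thousands separators, and an optional two-digit decimal part.
AMOUNT_RE = re.compile(r"^-?\d{1,3}(?:[ ,]?\d{3})*(?:\.\d{2})?$")


def is_suspicious_amount(cell):
    """Return True if a non-empty cell does not look like a clean amount."""
    text = cell.strip()
    # Drop an assumed leading currency marker such as "R 230.50".
    if text.startswith("R"):
        text = text[1:].strip()
    if not text:
        return False  # empty cells are left for a separate completeness check
    return AMOUNT_RE.match(text) is None


# Hypothetical cells: the letter "O" and the letter "l" stand in for digits.
for cell in ["1 500.00", "R 230.50", "1,5O0.00", "l20.00"]:
    print(repr(cell), is_suspicious_amount(cell))
```

A check like this only narrows down where to look. A misread digit that still forms a valid number (for example 3 read as 8) passes unnoticed, so the row-by-row comparison against the PDF is still needed.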