
Comments (9)

JasonTame commented on May 30, 2024

@jbothma Somehow I missed that there was a pass 1 of the KZN data as well. Looking through it, it looks exactly the same as pass 2. Should it be extracted anyway?

from data-extraction.

jbothma commented on May 30, 2024

Hey! The idea with pass1 and pass2 is that two different people extract the same data independently, and then we can compare the results and spot errors. So just do one of them.

JasonTame commented on May 30, 2024

Ah! Cool, makes sense.

schalkventer commented on May 30, 2024

@JasonTame TBH, in retrospect we should probably have called them Pass A and Pass B, since there is no implied sequence here. The intent is merely to have two people do the same task so that we can compare the outputs for correctness.
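
A minimal sketch of how two independent passes might be compared, assuming both are available as CSV text with the same row and column order. The function name `diff_passes` and the sample columns are hypothetical and are not taken from the project's actual tooling.

```python
import csv
import io


def diff_passes(pass_a_text, pass_b_text):
    """Return (row, column, value_a, value_b) for every cell that differs.

    Row and column numbers are 1-based. A cell missing from one pass is
    reported with None on that side.
    """
    rows_a = list(csv.reader(io.StringIO(pass_a_text)))
    rows_b = list(csv.reader(io.StringIO(pass_b_text)))
    mismatches = []
    for i in range(max(len(rows_a), len(rows_b))):
        row_a = rows_a[i] if i < len(rows_a) else []
        row_b = rows_b[i] if i < len(rows_b) else []
        for j in range(max(len(row_a), len(row_b))):
            # Ignore surrounding whitespace so trivial spacing differences
            # are not flagged as extraction errors.
            a = row_a[j].strip() if j < len(row_a) else None
            b = row_b[j].strip() if j < len(row_b) else None
            if a != b:
                mismatches.append((i + 1, j + 1, a, b))
    return mismatches


# Hypothetical sample data: the two passes disagree on one amount.
pass_a = "supplier,amount\nAcme,1000\nBeta,250\n"
pass_b = "supplier,amount\nAcme,1000\nBeta,260\n"

for row, col, a, b in diff_passes(pass_a, pass_b):
    print(f"row {row}, column {col}: pass A has {a!r}, pass B has {b!r}")
```

Each reported cell would still need to be checked by hand against the source PDF, since the comparison only shows where the passes disagree, not which one is right.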

zacharlie commented on May 30, 2024

I'll take this one

zacharlie commented on May 30, 2024

Sincerest apologies, and I hope this doesn't become a blocker for anyone else, but this ended up being rather challenging to process. The tabular data needs a lot of work and the OCR is not 100% accurate, so you really have to pay attention and make numerous manual fixes. Something also came up, so I couldn't dedicate the time required to get it done. I did some of the tables and have pushed draft data, along with the outstanding rows in CSV format, to my fork in case anybody feels they can finish this up. I'm unassigning myself in case a night owl can get it done; otherwise I'll be ready to finish up in the morning. Kudos to @JasonTame for knocking out the first copy of this though...

zacharlie commented on May 30, 2024

KZN PROVINCIAL GOVERNMENT - Procurement Disclosure-OCRd.pdf

Fruitymo commented on May 30, 2024

May I continue where @zacharlie left off? It looks like it's the only unassigned issue.

zacharlie commented on May 30, 2024

@Fruitymo It was completed in #257 and should probably be reassigned to me.

But this one is a bit challenging because of the volume of data and the OCR output, which was rife with errors. As described in the PR, I recommend that this particular sheet be manually cross-referenced, row by row, against both the original PDF and the latest dataset I submitted.

That should perhaps be a new issue though. Perhaps @jbothma or @schalkventer can advise.

It's a painstaking process, but I don't think the automated integrity checks help much here at all, unless someone has a more refined OCR processing technique we can validate against.
