
Comments (22)

rohitsaluja22 commented on July 20, 2024

You need data in the following format:

Word1 feature1 feature2 ..... featureN CorrectLabel
Word2 feature1 feature2 ..... featureN CorrectLabel

  • For the different features, refer to the OpenOCRCorrect publication.
  • They are very simple features, e.g.
    a) the 2-gram frequency of the first two characters of the word in some language data (this may be the training data itself, but use the same training data when computing features for the validation and test sets as well, since we should not access the ground truth during validation and testing)
    b) the 3-gram frequency of the first three characters of the word in the global data
    and so on...
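
A feature of the kind described in (a) and (b) could be computed roughly as follows. This is only a sketch, not the code used in OpenOCRCorrect: the function names are hypothetical, and using the relative frequency (rather than a raw count) is an assumption.

```python
from collections import Counter


def char_ngram_counts(words, n):
    """Count every character n-gram that occurs in the training words."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def prefix_ngram_feature(word, counts, n):
    """Relative frequency of the word's first n characters among all counted n-grams."""
    total = sum(counts.values())
    if len(word) < n or total == 0:
        return 0.0
    return counts[word[:n]] / total


# The counts come from the training data only (assumption: a plain word list),
# and the same counts are reused for validation and test words.
train_words = ["the", "then", "there", "cat"]
bigrams = char_ngram_counts(train_words, 2)
trigrams = char_ngram_counts(train_words, 3)

feature_a = prefix_ngram_feature("this", bigrams, 2)   # frequency of "th"
feature_b = prefix_ngram_feature("this", trigrams, 3)  # frequency of "thi"
```

Building the counts once from the training data and reusing them everywhere keeps the ground truth of the validation and test sets out of the features.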

Then you can train a log-linear classifier. After the classifier is trained, run it over all the validation examples repeatedly with different thresholds for classifying a word as correct or incorrect. Compute the F-score on the complete validation data for each threshold. Then pick the threshold with the best F-score and, using that threshold, compute the F-score on the test data.
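
The threshold search described above can be sketched like this. The labels and probabilities are made-up toy values, and treating label 1 as "incorrect word" (the positive class) is an assumption.

```python
def f_score(labels, probs, threshold):
    """F1 of the positive class when predicting positive for prob >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(labels, probs):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, probs, candidates):
    """Return the candidate threshold with the highest F-score on the given data."""
    return max(candidates, key=lambda t: f_score(labels, probs, t))


# Toy validation and test data (hypothetical values, 1 = positive class).
val_labels = [1, 0, 1, 0, 1]
val_probs = [0.9, 0.4, 0.6, 0.7, 0.3]
test_labels = [1, 0, 0, 1]
test_probs = [0.8, 0.2, 0.6, 0.4]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold = best_threshold(val_labels, val_probs, candidates)
val_f = f_score(val_labels, val_probs, threshold)
test_f = f_score(test_labels, test_probs, threshold)  # reported once, threshold fixed
```

The threshold is chosen on the validation data only; the test data is scored a single time with that fixed threshold.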

from openocrcorrect.

VinayNagalgaonkar1998 commented on July 20, 2024

Sir, the above comment reads more like an overview of what we have to do. Could you please break it down into smaller action items, such as the commands or steps needed to carry out the procedure suggested above?

rohitsaluja22 commented on July 20, 2024

Do you have annotated data?

Step 1: Initially we need two things: the OCR text and the corresponding correct text (say, for a complete book). You may start with the 20 pages I have given here: https://github.com/rohitsaluja22/OpenOCRCorrect/tree/master/data/Book1Sanskrit/Inds and https://github.com/rohitsaluja22/OpenOCRCorrect/tree/master/data/Book1Sanskrit/Correct. Or use the ICDAR post-OCR competition datasets.

Step 2: Then use the RETA aligner to align these and extract the correction pairs. For RETA you would need to register on the website and/or mail the authors to get the aligner.

Once you have the correction pairs ready, I can reply with the next steps.

VinayNagalgaonkar1998 commented on July 20, 2024

The project we are doing is for English only.
Sir, we don't have annotated data. Could you please also send the drive link for the ICDAR post-OCR competition datasets, as we were not able to find it anywhere? We searched for the RETA aligner on the net but could not find any related articles.

Sir, we are focusing on the theoretical aspects of the project, since our final presentation is on Friday (22 November). It will not be possible for us to actually gather the data and work through all the steps, but it is important for us to know them. So, could you please send the material or resources required for the steps mentioned above?

rohitsaluja22 commented on July 20, 2024

This page has the tool, and they have data too: http://ciir.cs.umass.edu/downloads/ocr-evaluation/
Write to them if needed; they reply very fast.

rohitsaluja22 commented on July 20, 2024

For the competition data, I am not sure about the permissions for sharing. Better to write to Christophe Rigaud (refer to https://sites.google.com/view/icdar2019-postcorrectionocr). You can keep me in cc; he knows me, so I'll request him to share it asap if the permissions are there.

VinayNagalgaonkar1998 commented on July 20, 2024

Sir, we have got both the RETA aligner and the ICDAR dataset. Can you please tell us the further steps in detail, from how to annotate the data, to how to train the log-linear classifier, to getting an F-score? We are running out of time, Sir, so could you please reply as soon as possible with all the steps for the tasks mentioned above?

The resources related to RETA that we got after contacting them:
http://ciir.cs.umass.edu/downloads/ocr-evaluation/RecursiveTextAlignmentTool_release_v1_1.zip
http://ciir.cs.umass.edu/downloads/ocr-evaluation/RETAS_dataset_v1.zip

rohitsaluja22 commented on July 20, 2024

Below are the steps to follow for your requirements.

Step 3: Creating the dataset: you can use the C++ code of the function void MainWindow::on_actionPrepareFeatures_triggered() in https://github.com/rohitsaluja22/OpenOCRCorrect/blob/master/FrameWorkCode/mainwindow.cpp
This needs the Dictionary and the Correction Pairs as input.
It calls the function getNgramFeaturesinVect(), which is defined in https://github.com/rohitsaluja22/OpenOCRCorrect/blob/master/FrameWorkCode/slpNPatternDict.h

Step 4: Training the model:

  1. Get the liblinear code https://github.com/cjlin1/liblinear and compile it.
  2. Convert your dataset to the following format: https://github.com/cjlin1/liblinear/blob/master/heart_scale
  3. Run the following command to train:
    train -s 6 -c 0.5 -e 0.0001 data_file
    for L1-regularized logistic regression.
  4. For prediction, run: predict -b 1 test_file data_file.model output_file
    The output file contains probability values.
  5. You can parse the output file for threshold selection.
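
For item 2, each line of the heart_scale file has the form "label index:value index:value ...", with feature indices starting at 1. A minimal sketch of the conversion follows; the function name and the +1/-1 label convention are assumptions, not part of OpenOCRCorrect.

```python
def to_liblinear_line(label, features):
    """Format one example as '<label> 1:<v1> 2:<v2> ...' with 1-based indices."""
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # the format is sparse, so zero-valued features can be left out
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical examples: (label, feature vector), with +1 and -1 as the two classes.
examples = [
    (+1, [0.25, 0.0, 3]),
    (-1, [0.5, 0.125, 0.0]),
]
lines = [to_liblinear_line(label, features) for label, features in examples]
data_file_text = "\n".join(lines) + "\n"  # write this text to data_file
```

The same conversion applies to the validation and test feature files, so that train and predict read a consistent format.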

Parameters you can play with:
-s can take the values 0, 6 and 7, since you want output probabilities,
and the parameter -c.
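
For item 5 of Step 4, the probabilities can be read back as sketched below. The sketch assumes the layout that predict -b 1 is understood to write: a first line "labels ..." listing the class labels, then one line per example with the predicted label followed by one probability per class in that order. Check this against your own output_file; the function name is hypothetical.

```python
def read_probabilities(lines, positive_label="1"):
    """Return P(positive_label) for every example in `predict -b 1` style output."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    if header[0] != "labels":
        raise ValueError("expected a 'labels ...' header line")
    # Column 0 of each row is the predicted label, so class k of the header
    # (position k in the header line) sits at the same position in each row.
    column = header.index(positive_label, 1)
    return [float(row[column]) for row in body]


# Made-up sample content standing in for output_file.
sample_output = [
    "labels 1 -1",
    "1 0.8 0.2",
    "-1 0.3 0.7",
]
probs = read_probabilities(sample_output, positive_label="1")
```

With a real file, pass the open file object (or its lines) instead of sample_output; the resulting list can then be compared against different thresholds.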

Let me know if you have any other doubts.

VinayNagalgaonkar1998 commented on July 20, 2024

Sir, after running the RETA tool on a pair consisting of an OCR text file and its corresponding ground truth text file, we got the following output text file (refer to the attached text file). Is this what you mean by Correction Pairs? Sir, please reply as soon as possible, and also tell us whether this is the correct format to give as input to the C++ code mentioned in Step 3.
reta_tool_output.txt

rohitsaluja22 commented on July 20, 2024

Correction pairs should look like this:
https://github.com/rohitsaluja22/OpenOCRCorrect/blob/master/data/Book1Sanskrit/CPair

The ground truth is already aligned in the POCR competition data. Just take the English data, write a Python script to find the word pairs, and save them in the file CPair. I tried searching for my scripts but could not find them on my lab PC. I will check my laptop and update tomorrow.

VinayNagalgaonkar1998 commented on July 20, 2024

Sir, to figure out the steps you gave, we went ahead with Sanskrit instead of English. We created the dataset in Step 3 and followed up to Step 4.3, but got stuck at this point as we did not have a test_file for Step 4.4. We want to carry out these steps for English, and we require the following things:

  1. Ground truth files for English, i.e. a 'Correct' folder like the one you gave for other languages, OR a different and complete dataset, i.e. the same as all the files in the 'Book' folder of your project.
  2. A script to generate the CPair file and a sample CPair file for English.
  3. The 'test_file' for English as mentioned in Step 4.4.

Sir, please reply as soon as possible with the requested resources, as we are very close to giving our final presentation.

rohitsaluja22 commented on July 20, 2024

Use these (attached):
For the files named *x, which are the OCR files: just do line.split("^")[10].replace(" ","") in Python for every line and add the result as the first element of the CPair.
For the files named *y, which are the GT files: just do line.replace(" ","").replace("^"," ") in Python for every line and add the result as the second element of the CPair.
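
The two expressions above can be wrapped into a small script along these lines. This is a sketch only: it assumes the *x and *y files are aligned line by line, that a tab separates the two elements of a pair (compare with the sample CPair file before relying on this), and it strips the trailing newline, which the expressions above do not do. The function names and sample lines are made up.

```python
def ocr_element(x_line):
    """First CPair element, taken from one line of an OCR (*x) file."""
    return x_line.rstrip("\n").split("^")[10].replace(" ", "")


def gt_element(y_line):
    """Second CPair element, taken from one line of a GT (*y) file."""
    return y_line.rstrip("\n").replace(" ", "").replace("^", " ")


def make_cpair_lines(x_lines, y_lines, separator="\t"):
    """Pair the two files line by line; the separator is an assumption."""
    return [ocr_element(x) + separator + gt_element(y)
            for x, y in zip(x_lines, y_lines)]


# Made-up sample lines: the *x line needs at least 11 "^"-separated fields.
x_lines = ["^".join(f"f{i}" for i in range(10)) + "^t h c\n"]
y_lines = ["t h e\n"]
cpair_lines = make_cpair_lines(x_lines, y_lines)
```

With the real data, x_lines and y_lines would be the opened *x and *y files, and cpair_lines would be written out one pair per line.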

For validation/testing you can create a separate feature file. For Sanskrit you could have split the data, trained on part of it and tested on the rest. When data is scarce, cross-validation is common practice in ML.
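
A plain k-fold split of the kind mentioned above can be sketched as follows, assuming each example is one line of the feature file. The function name, the value of k and the fixed seed are illustrative choices.

```python
import random


def k_fold_splits(examples, k, seed=0):
    """Yield (train, held_out) pairs so that every example is held out exactly once."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, folds[i]


examples = list(range(10))  # stand-ins for lines of a feature file
splits = list(k_fold_splits(examples, k=5))
```

Each train part would be written to its own data_file for training and each held_out part to a test_file for prediction, with the scores averaged over the k runs.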

EngPOCR19DataPairs.tar.gz

VinayNagalgaonkar1998 commented on July 20, 2024

Sir, we wrote the script in Python as instructed (see attachedScript.txt) and got the CPair file (see attachedCPair.txt). Then, when we tried to do Step 3, it gave this error while loading the Dictionary:
(Dictionary 4040 Words Loaded
90712 patterns loaded
MaxElSize 44
Segmentation fault (core dumped))
So we are not able to give the two inputs, the Dictionary and the CPair file. Please suggest what we should do now. Could you also send the test_file required in Step 4.4?

VinayNagalgaonkar1998 commented on July 20, 2024

Sir, please guide us on the above comment and provide the test_file mentioned in Step 4.4 as soon as possible.

rohitsaluja22 commented on July 20, 2024

Use a machine with high RAM; do not run it on a laptop.

VinayNagalgaonkar1998 commented on July 20, 2024

OK, sure, we will run it on a machine with high RAM.

Sir, please send us the required test_file by the end of today so that we can complete Step 4.4. If you don't have one for English, can you send the file for Sanskrit? Sir, it is of utmost importance to us to have an end-to-end understanding of the project.

rohitsaluja22 commented on July 20, 2024

VinayNagalgaonkar1998 commented on July 20, 2024

Sir, could you please specify which file we should use as the 'test_file':
(1) CPair (made from 'EngDataValx' and 'EngDataValy')
OR (2) CPairout (obtained as the output of Step 3)?

rohitsaluja22 commented on July 20, 2024

VinayNagalgaonkar1998 commented on July 20, 2024

We have used the required file as the 'test_file' and got an 'output_file' (see attached) with some output probabilities. Apart from this, we have gone through your code thoroughly and have a high-level understanding of how things work. But we still want a clear understanding of how the model is trained through logistic regression, what parameters are involved, how their values are adjusted, and how the F-score is calculated and improved. Sir, please suggest a time by the end of today when we could call you about this.
output_file.txt

rohitsaluja22 commented on July 20, 2024

Hi, I am sorry, I am a bit occupied with project work at my end too.

rohitsaluja22 commented on July 20, 2024

Glad to hear that you achieved over 85% accuracy on the POCR 19 English data.
