Comments (22)
You need data in the following format:
Word1 feature1 feature2 ..... featureN CorrectLabel
Word2 feature1 feature2 ..... featureN CorrectLabel
- For the different features, refer to the OpenOCRCorrect publication.
- They are very simple features, e.g.:
a) the 2-gram frequency of the first two characters of the word in some language data (this may be the training data itself, but use the same training data when computing features for the validation and test sets as well, since we should not access ground truth during validation and testing)
b) the 3-gram frequency of the first three characters of the word in the same global data
and so on...
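The prefix-frequency features described above could be sketched as follows. This is only an illustration: the function names and the tiny word list are made up here, and the actual feature set in OpenOCRCorrect is larger (see the publication).

```python
from collections import Counter

def build_prefix_counts(words, n):
    """Count how often each n-character prefix occurs in a word list."""
    return Counter(w[:n] for w in words if len(w) >= n)

def prefix_features(word, counts_by_n):
    """Return one frequency per prefix length (0 if unseen or the word is too short)."""
    return [
        counts_by_n[n].get(word[:n], 0) if len(word) >= n else 0
        for n in sorted(counts_by_n)
    ]

# Hypothetical training vocabulary; counts must come from training data only,
# and the same counts are reused for validation and test words.
training_words = ["the", "then", "there", "this", "that", "cat"]
counts_by_n = {n: build_prefix_counts(training_words, n) for n in (2, 3)}

features = prefix_features("thex", counts_by_n)
print(features)
```

Each word then becomes one row of such numbers followed by its label, matching the "Word feature1 ... featureN CorrectLabel" layout above.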
Then you can train a log-linear classifier. After the classifier is trained, run it over all the validation examples repeatedly with different thresholds for classifying a word as correct or incorrect. Compute the FScore on the complete validation data for each threshold. Then pick the threshold with the best FScore and, using that threshold, compute the FScore on the test data.
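The threshold search can be sketched like this. It is a minimal illustration, assuming the classifier outputs a probability for the positive class and that labels are 1 (positive) and 0 (negative); the probabilities and labels below are invented.

```python
def f_score(y_true, y_pred, positive=1):
    """F1 score for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, labels, thresholds):
    """Try each threshold on validation data and keep the one with the best F1."""
    best_t, best_f = None, -1.0
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        f = f_score(labels, preds)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Invented validation probabilities and labels.
val_probs = [0.9, 0.8, 0.4, 0.3, 0.2]
val_labels = [1, 1, 1, 0, 0]
thresholds = [i / 10 for i in range(1, 10)]

threshold, val_f = best_threshold(val_probs, val_labels, thresholds)
print(threshold, val_f)
```

The chosen threshold is then applied once, unchanged, to the test probabilities to report the test FScore.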
from openocrcorrect.
Sir, the above comment reads more like an overview of what we have to do. Could you please break it down into smaller action items, such as which commands or steps should be used to carry out the procedure suggested above?
from openocrcorrect.
Do you have annotated data?
Step 1: We initially need two things: the OCR text and the corresponding correct text (say, for a complete book). You may start with the 20 pages I have provided here: https://github.com/rohitsaluja22/OpenOCRCorrect/tree/master/data/Book1Sanskrit/Inds and https://github.com/rohitsaluja22/OpenOCRCorrect/tree/master/data/Book1Sanskrit/Correct. Or use the ICDAR post-OCR competition datasets.
Step 2: Then use the RETA aligner to align these and extract the correction pairs. For RETA you would need to register on the website and/or mail the authors to get the aligner.
Once you have the correction pairs ready, I can reply with the next steps.
from openocrcorrect.
The project we are doing is for English only.
Sir, we don't have annotated data. Could you please also send the drive link for the ICDAR post-OCR competition datasets, as we were not able to find them anywhere? We searched for the RETA aligner on the net but could not find any related articles.
Sir, we are trying to focus on the theoretical aspects of the project, since we have the final presentation on Friday (22 November). Hence, it will not be possible for us to actually gather the data and work through all the steps, but they are important for us to know. So, could you please send the material or resources required for the steps mentioned above?
from openocrcorrect.
The tool is available here, and they have data too: http://ciir.cs.umass.edu/downloads/ocr-evaluation/
Write to them if needed; they reply very fast.
from openocrcorrect.
For the competition data, I am not sure about permissions for sharing. Better to write to Christophe Rigaud (refer to https://sites.google.com/view/icdar2019-postcorrectionocr). You can keep me in cc; he knows me, so I'll request him to share it asap if the permissions are there.
from openocrcorrect.
Sir, we have got both the RETA aligner and the ICDAR dataset. Can you please tell us the further steps in detail, from how to annotate the data, to how to train the log-linear classifier, to getting an FScore? We are running out of time, Sir, so could you please reply as soon as possible with all the steps for the above tasks?
The resources related to RETA that we got after contacting them:
http://ciir.cs.umass.edu/downloads/ocr-evaluation/RecursiveTextAlignmentTool_release_v1_1.zip
http://ciir.cs.umass.edu/downloads/ocr-evaluation/RETAS_dataset_v1.zip
from openocrcorrect.
Below are the steps to be followed for your requirements.
Step 3: Creating the dataset: you can use the C++ code of the function void MainWindow::on_actionPrepareFeatures_triggered() in https://github.com/rohitsaluja22/OpenOCRCorrect/blob/master/FrameWorkCode/mainwindow.cpp
It needs a dictionary and the correction pairs as input.
It calls the function getNgramFeaturesinVect(), which is defined in https://github.com/rohitsaluja22/OpenOCRCorrect/blob/master/FrameWorkCode/slpNPatternDict.h
Step 4: Training the model:
- Get the liblinear code from https://github.com/cjlin1/liblinear and compile it
- Convert your dataset to the following format: https://github.com/cjlin1/liblinear/blob/master/heart_scale
- Run the following command to train an L1-regularized logistic regression model:
train -s 6 -c 0.5 -e 0.0001 data_file
- For prediction, run:
predict -b 1 test_file data_file.model output_file
The output file contains probability values. You can parse it for threshold selection.
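The two file-handling ends of Step 4 might look roughly like this. It is a sketch, not the project's own script: it assumes the heart_scale layout ("label index:value ..." with 1-based indices) and assumes that the file written by predict -b 1 starts with a "labels ..." header followed by one "predicted_label prob prob" line per example. Please check both against your own files.

```python
def to_liblinear_line(label, features):
    """Format one example as 'label 1:v1 2:v2 ...', skipping zero-valued features."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)

def parse_probability_output(lines, positive_label):
    """Extract the probability of `positive_label` from probability-output lines."""
    header = lines[0].split()
    if not header or header[0] != "labels":
        raise ValueError("expected a 'labels ...' header line")
    # Probability columns follow the label order given in the header.
    column = header[1:].index(str(positive_label)) + 1
    return [float(line.split()[column]) for line in lines[1:] if line.strip()]

# Writing: one training example with label 1 and three feature values.
print(to_liblinear_line(1, [5, 0, 3.5]))

# Reading: an invented output file with two predictions.
sample_output = ["labels 1 -1", "1 0.9 0.1", "-1 0.25 0.75"]
probs = parse_probability_output(sample_output, positive_label=1)
print(probs)
```

The resulting list of probabilities is what gets compared against each candidate threshold.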
Parameters you can play with:
- -s, which can be 0, 6, or 7, since you want output probabilities
- the parameter C, set with -c
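One way to explore these parameters is to generate a small grid of train commands and run each one. The sketch below only builds the command strings; the C values and the model file naming are arbitrary choices made for illustration, and only the -s, -c, and -e flags from the command above are used.

```python
import shlex

def train_commands(data_file, solvers=(0, 6, 7), costs=(0.1, 0.5, 1.0), eps=0.0001):
    """Build one liblinear 'train' command line per (solver, C) combination."""
    commands = []
    for s in solvers:
        for c in costs:
            # Hypothetical naming scheme so that models do not overwrite each other.
            model_file = f"{data_file}.s{s}.c{c}.model"
            args = ["train", "-s", str(s), "-c", str(c), "-e", str(eps),
                    data_file, model_file]
            commands.append(shlex.join(args))
    return commands

commands = train_commands("data_file")
for command in commands:
    print(command)
```

Each resulting model would then be scored on the validation set, and the best (solver, C, threshold) combination kept.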
Let me know if you have any other doubts.
from openocrcorrect.
Sir, after running the RETA tool on an OCR text file and its corresponding ground truth text file, we got the following output text file (refer to the attached text file). Is this what you mean by correction pairs? Sir, please reply as soon as possible, and also tell us whether this is the correct format to give as input to the C++ code mentioned in Step 3.
reta_tool_output.txt
from openocrcorrect.
Correction pairs should look like this:
https://github.com/rohitsaluja22/OpenOCRCorrect/blob/master/data/Book1Sanskrit/CPair
The ground truth is already aligned in the POCR competition data. Just take the English data, write a Python script to find word pairs, and save them in the file CPair. I tried searching for my scripts but could not find them on my lab PC. I will check for them on my laptop and update tomorrow.
from openocrcorrect.
Sir, to figure out the steps you gave, we went ahead with Sanskrit instead of English. We created the dataset in Step 3 and followed the steps up to Step 4.3, but got stuck there because we did not have the test_file needed for Step 4.4. We want to carry out these steps for English, and we require the following:
- Ground Truth files for English i.e. 'Correct' folder like you gave for other languages OR a different and complete dataset i.e. same as all files in 'Book' folder of your project.
- Script to generate CPair file and a sample CPair file for English
- The 'test_file' for English as mentioned in Step 4.4
Sir, please reply back as soon as possible with the requested resources, as we are very close to giving our final presentation.
from openocrcorrect.
Use these (attached):
- Files named *x are OCR files: for every line, do line.split("^")[10].replace(" ","") in Python and add the result as the first element of the CPair entry.
- Files named *y are GT files: for every line, do line.replace(" ","").replace("^"," ") in Python and add the result as the second element of the CPair entry.
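The two one-liners above can be combined into a small script along these lines. The sample lines are invented just to exercise the code, and the tab separator between the two elements is an assumption; compare with the sample CPair file in the repository before using it.

```python
def ocr_word(line):
    """OCR (*x) line: take the 11th '^'-separated field and drop the spaces."""
    # rstrip is added here so a trailing newline does not end up in the word.
    return line.rstrip("\n").split("^")[10].replace(" ", "")

def gt_word(line):
    """GT (*y) line: drop the spaces, then turn '^' into a space."""
    return line.rstrip("\n").replace(" ", "").replace("^", " ")

def build_cpair_lines(x_lines, y_lines, sep="\t"):
    """Pair the i-th OCR line with the i-th GT line (separator is an assumption)."""
    return [f"{ocr_word(x)}{sep}{gt_word(y)}" for x, y in zip(x_lines, y_lines)]

# Invented sample lines; real *x lines need at least 11 '^'-separated fields.
x_lines = ["a^b^c^d^e^f^g^h^i^j^t h c^k\n"]
y_lines = ["t h e\n"]

cpair_lines = build_cpair_lines(x_lines, y_lines)
print(cpair_lines)
```

Writing `"\n".join(cpair_lines)` to a file named CPair would then give the input expected by Step 3, under the assumptions above.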
For validation/testing you can create a separate feature file. For Sanskrit, you could have split the data, trained on part of it, and tested on the rest. With little data, cross-validation is common practice in ML.
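A plain k-fold split over the correction pairs could be sketched as below. This is a generic illustration, not something taken from the OpenOCRCorrect code; in practice the pairs would be shuffled first, and n-gram counts would be recomputed from the training part of each fold only.

```python
def k_fold_splits(items, k):
    """Yield (train, test) lists for each of k folds."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Stand-in for a list of correction pairs.
pairs = list(range(10))
splits = list(k_fold_splits(pairs, 5))

for train, test in splits:
    print(len(train), len(test))
```

The FScore is computed on each held-out fold and averaged over the k folds.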
from openocrcorrect.
Sir, we wrote the script in Python as instructed (see attachedScript.txt) and got the CPair file (see attachedCPair.txt). Then, when we tried to do Step 3, it gave this error while loading the dictionary:
(Dictionary 4040 Words Loaded
90712 patterns loaded
MaxElSize 44
Segmentation fault (core dumped))
So we are not able to provide the two inputs, the dictionary and the CPair file. Please suggest what we should do now. Could you also send the test_file required in Step 4.4?
from openocrcorrect.
Sir, please guide us on the above comment and provide the test_file mentioned in Step 4.4 as soon as possible.
from openocrcorrect.
Use a machine with more RAM; do not run it on a laptop.
from openocrcorrect.
OK, sure, we will run it on a machine with more RAM.
However, Sir, please send us the required test_file by the end of today so that we can complete Step 4.4. If you don't have one for English, can you send the file for Sanskrit? Sir, it is of utmost importance to us to have an end-to-end understanding of the project.
from openocrcorrect.
Sir, could you please specify which file we should use as 'test_file':
(1) CPair (made from 'EngDataValx' and 'EngDataValy')
OR (2) CPairout (got as output from Step 3)
from openocrcorrect.
We have used the required file as 'test_file' and got an 'output_file' (see attached) with some output probabilities. Apart from this, we have gone through your code thoroughly and have a high-level understanding of how things work. But we still want a clear understanding of how the model is trained with logistic regression, which parameters are involved, how their values are adjusted, and how the FScore is calculated and improved. Sir, please suggest a time by the end of today when we could call you about this.
output_file.txt
from openocrcorrect.
Hi, I am sorry, I am a bit occupied with project work at my end too.
from openocrcorrect.
Glad to hear that you achieved over 85% accuracy on the POCR 19 English data.
from openocrcorrect.