sanskrit / sanskrit-ocr-r0

A project to OCR critical Sanskrit texts

sanskrit-ocr-r0's Introduction

Sanskrit

This is a general-purpose package for dealing with Sanskrit data of any kind. It currently operates at and below the word level, with modules like:

query, for accessing linguistic data
sandhi, for applying and undoing sandhi changes
sounds, for testing sounds and getting the meter of a phrase
sanscript, for transliterating Sanskrit from one script to another

Soon the package will move up to the word and sentence levels. Once there, it will provide tools for inflecting, analyzing, tagging, and parsing Sanskrit.

sanskrit-ocr-r0's Issues

Digitize Abhyankar's Sanskrit dictionary

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
- Answer: No and Yes.
What factors should our OCR/other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
- Answer: Very valuable for students of Sanskrit Grammar.
Are you willing to proofread the OCR-ed text? See here to get an idea of what it involves.
- Answer: No.
Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as dropbox.com or sites.google.com.
- Answer: https://archive.org/stream/ADictionaryOfSanskritGrammarByMahamahopadhyayaKashinathVasudevAbhyankar/DictionaryOfSanskritGrammar_abhyankar#page/n0/mode/2up
Is there any other information you want to provide?
- Answer: No.

Ok - thanks for answering the above questions. Subscribe to this thread to stay updated.

Digitize गणरत्नमहोदधिः

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
- Answer: Yes, I have searched. It is not digitized.
If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
- Answer: Yes
Are you OK with the scan quality that we currently offer?
- Answer: No. I will use a private copy of SanskritOCR to OCR it.
What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
- Answer: This is a commentary on गणपाठः
Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://dropbox.com or http://sites.google.com.
- Answer: https://sa.wikisource.org/wiki/%E0%A4%B8%E0%A4%9E%E0%A5%8D%E0%A4%9A%E0%A4%BF%E0%A4%95%E0%A4%BE:Ganaratnamahodadhi.pdf
Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
- Title: गणरत्नमहोदधिः (gaNaratnamahodadhiH)
- Author: वर्धमानकविः (vardhamAnakaviH)
- Commentator: -
Is there any other information you want to provide?
- Answer:

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

OCR ekAgnikANDa

The ekAgnikANDa text is essential for all practitioners of the Apastamba gRhya-sUtra (links here). But alas, it is not digitized!

@shrivathsa - could you please give this a shot? I wonder if SanskritOCR is sufficiently good with accented text.

Upload OCR text for ज्यौतिषवेदाङ्गम् to Wikisource for proofreading.

shrI @shrivathsa has kindly OCR-ed this text. Once we get the OCR-ed text, it should be uploaded to Wikisource for proofreading.

Links:
ज्यौतिषवेदाङ्गम् https://sa.wikisource.org/s/bu7 https://commons.wikimedia.org/wiki/File:Jyautisha_Vedangam.pdf

Expected project outcome and procedures

@vvasuki
From the discussions over email and here, this effort does not seem restricted to OCR-ing a single book; it looks like something larger, aimed at crowdsourcing Sanskript and bilingual texts.
We therefore need to decide on the expected outcomes and processes. Feel free to edit the items below; they are only suggestions.

Expected outcome

A tool that takes a PDF as input and produces page-wise Unicode .txt files as output.
Good dictionaries for the OCR, to improve its output.
An HTML frontend in which people can edit and correct the text.
A user management system and some form of history for the changes made.

Procedures

Which programming language should we adopt?
Should it be a standalone tool, an online tool, or a wrapper around existing tools?

Documentation of steps for using SanskritOCR

If you decide to use SanskritOCR, the following steps may be useful.

Step 1 - Open the folder containing SanskritOCR. Replace the existing dictionary-sa.dict with the present file. If you are uncomfortable modifying the software, first back up the old dictionary-sa.dict somewhere so that you can roll back.

Step 2 - Split the PDF file you want to OCR into separate .tiff files. I use IrfanView; the steps below are for IrfanView and may differ in other software.

Step 2a - Open the .pdf file in IrfanView.

Step 2b - Click on Options->Multipage images->Extract all pages. Select the folder where you want to extract the pages. Select .tiff for output. Run it. This extracts every page into that folder as a separate file.

Step 3 - Open SanskritOCR.exe.

Step 4 - Select File->New images and choose a decently scanned page on which to train the OCR.

Step 5 - Select Image->Remove irregular noise.

Step 6 - Select Recognition->Training set->New training set->Enter a name you will recognize (e.g. ganaratna)->Save and close.

Step 7 - Select the training set from the list and click on 'Select as training set'.

Step 8 - Click on Recognition -> Training mode. This activates training mode: whenever the program does not recognize a letter, it asks you to enter the letter and stores it.

Step 9 - Click on Recognition -> Recognize current page.

Step 10 - Whenever the program asks you for input and you recognize the letter, enter it in the window. A sample case is attached here.

Step 11 - Finish training on the current page.

Step 12 - Repeat the process for 5-10 pages so that the program learns some of the odd patterns in the font's glyphs. After 5-10 pages it has some idea of the structure of the glyphs being recognized, and you are ready for batch conversion.

Step 13 - Click on Recognition->OCR a complete directory.

Step 14 - Select the 'Source directory', set 'Export format' to 'Multiple unicode text files (numbered)', and set 'save in/as' to the directory where you want to store the output. Click on 'Start OCR' to run the batch conversion.

The output is stored as numbered files: 0001.txt, 0002.txt, and so on.
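Before the files are handed on, it can be worth checking that the batch run produced one non-empty file per page. The following is a minimal sketch and not part of SanskritOCR: the function name `check_ocr_output` is hypothetical, and the only thing it relies on is the numbered-file naming described above. It checks file sizes rather than decoding the text, so it makes no assumption about which Unicode encoding SanskritOCR writes.

```python
from pathlib import Path
import re

# Matches purely numeric file names such as 0001.txt
NUMBERED = re.compile(r"^(\d+)\.txt$")


def check_ocr_output(directory):
    """Return (numbers_found, missing_numbers, empty_file_names).

    numbers_found    sorted page-file numbers present in the directory
    missing_numbers  gaps between the lowest and highest number found
    empty_file_names zero-byte files, which usually mean a failed page
    """
    found = {}
    for path in Path(directory).iterdir():
        match = NUMBERED.match(path.name)
        if match and path.is_file():
            found[int(match.group(1))] = path

    numbers = sorted(found)
    if not numbers:
        return [], [], []

    missing = [n for n in range(numbers[0], numbers[-1] + 1) if n not in found]
    empty = [found[n].name for n in numbers if found[n].stat().st_size == 0]
    return numbers, missing, empty
```

Calling `check_ocr_output("path/to/output")` on the export directory gives three lists; if the last two are empty, every page between the first and last file has some output.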

Then @vvasuki's NLP bot can place the data on Wikisource.

Request for Bot for SanskritOCR output

@vvasuki
I have posted the output generated by SanskritOCR for gaNaratnamahodadhi here.
SanskritOCR has an option to produce output page by page.
For example, this PDF file has 330 pages, and the output .txt files are named 0000.txt to 0329.txt.

I have very limited knowledge of Java or of bots in general. Would you be kind enough to devise a bot for uploading these files to index pages 1 to 330?
Once such a bot exists, I will be able to upload SanskritOCR output at a faster pace.
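The mapping such a bot needs is small: file 0000.txt holds page 1, so the page number is the file number plus one. Below is a minimal sketch of that mapping only; it performs no upload. The `Page:<index file>/<n>` title pattern and the index file name `Ganaratnamahodadhi.pdf` (taken from the Wikisource link in the digitization request) are assumptions about how the index is set up on Wikisource, and `page_title_for` and `plan_uploads` are hypothetical names.

```python
from pathlib import Path


def page_title_for(filename, index_file="Ganaratnamahodadhi.pdf", first_file_number=0):
    """Map an OCR output file name such as '0000.txt' to a page title.

    Assumes the first output file (number `first_file_number`) is page 1
    and that page titles look like 'Page:<index file>/<page number>'.
    """
    file_number = int(Path(filename).stem)
    page_number = file_number - first_file_number + 1
    if page_number < 1:
        raise ValueError(f"{filename} comes before the first expected file")
    return f"Page:{index_file}/{page_number}"


def plan_uploads(filenames, index_file="Ganaratnamahodadhi.pdf"):
    """Return (page title, file name) pairs in page order, skipping other files."""
    numbered = [
        name for name in filenames
        if Path(name).suffix == ".txt" and Path(name).stem.isdigit()
    ]
    numbered.sort(key=lambda name: int(Path(name).stem))
    return [(page_title_for(name, index_file), name) for name in numbered]
```

Reviewing the resulting list by hand before any bot writes to Wikisource makes an off-by-one between file numbers and page numbers easy to catch.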
