
sanskrit-ocr-r0's Introduction

Sanskrit

This is a general-purpose package for dealing with Sanskrit data of any kind. It currently operates at and below the word level, with modules like:

  • query, for accessing linguistic data
  • sandhi, for applying and undoing sandhi changes
  • sounds, for testing sounds and getting the meter of a phrase
  • sanscript, for transliterating Sanskrit from one script to another

Soon the package will move up to the word and sentence levels. Once there, it will provide tools for inflecting, analyzing, tagging, and parsing Sanskrit.

sanskrit-ocr-r0's People

Contributors

drdhaval2785, shrivathsa, vvasuki

sanskrit-ocr-r0's Issues

Digitize Abhyankar's Sanskrit dictionary

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

  • Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
    • Answer: No and Yes.
  • What factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
    • Answer: Very valuable for students of Sanskrit Grammar.
  • Are you willing to proofread the OCR-ed text? See here to get an idea of what it involves.
    • Answer: No.
  • Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as <dropbox.com> or <sites.google.com>.
    • Answer: https://archive.org/stream/ADictionaryOfSanskritGrammarByMahamahopadhyayaKashinathVasudevAbhyankar/DictionaryOfSanskritGrammar_abhyankar#page/n0/mode/2up
  • Is there any other information you want to provide?
    • Answer: No.

Ok - thanks for answering the above questions. Subscribe to this thread to stay updated.

Digitize गणरत्नमहोदधिः

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

  • Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
    • Answer: Yes. Not digitized
  • If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
    • Answer: Yes
  • Are you OK with the scan quality that we currently offer?
    • Answer: No. I will use a private SanskritOCR to OCR.
  • What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
    • Answer: This is a commentary on गणपाठः
  • Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://dropbox.com or http://sites.google.com .
  • Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
    • Title: गणरत्नमहोदधिः (gaNaratnamahodadhiH)
    • Author: वर्धमानकविः (VardhamanakaviH)
    • Commentator: -
  • Is there any other information you want to provide?
    • Answer:

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

OCR ekAgnikANDa

The ekAgnikANDa text is essential for all practitioners of the Apastamba gRhya-sUtra (links here). But alas, it is not digitized!

@shrivathsa - could you please give this a shot? I wonder whether SanskritOCR is good enough with accented text.

Expected project outcome and procedures

@vvasuki
As I can see from the discussions on mail and here, this is not restricted to OCRing a single book; it is a larger effort to crowdsource Sanskrit / bilingual texts.
Therefore, we need to decide on expected outcomes and processes. Feel free to edit the points below. They are only suggestions.

Expected outcome

  1. A machine which can take in PDF as input and give pagewise .txt (unicode) as output
  2. The OCR should have good dictionaries for better output.
  3. An HTML frontend on which the people can edit and correct.
  4. A user management system and some kind of history keeping for changes made.

Procedures

  1. Which programming language should we adopt?
  2. Should this be a standalone tool, an online tool, or a wrapper around existing tools?

Documentation of steps for using SanskritOCR

If you decide to use SanskritOCR to OCR a scanned book, the following steps may be useful.

Step 1 - Open the folder containing SanskritOCR. Replace the existing dictionary-sa.dict with the present file. If you are uncomfortable modifying the software, first back up the old dictionary-sa.dict somewhere so that you can roll back.

Step 2 - Split the PDF file you want to scan into separate .tiff files. I use IrfanView, so the steps below are for IrfanView. They may differ in other software.

Step 2a - Open the .pdf file in IrfanView.

Step 2b - Click on Options->Multipage images->Extract all pages. Select the folder where you want to extract the pages. Select .tiff for output. Run it. This extracts every page into that folder as a separate file.

Step 3 - Open SanskritOCR.exe.

Step 4 - Select File->New images and select a decently scanned page on which to train the OCR.

Step 5 - Select Image->Remove irregular noise.

Step 6 - Select Recognition->Training set->New training set->Enter a name you will recognize later (e.g. ganaratna)->Save and close.

Step 7 - Select the training set from list and click on 'Select as training set'.

Step 8 - Click on Recognition -> Training mode. (This activates the training mode.) Whenever the machine doesn't recognize a letter, it will ask us to input the letter and will store it.

Step 9 - Click on Recognition -> Recognize current page.

Step 10 - Whenever the machine asks you for input and you recognize the letter, enter it in the window. A sample case is attached here.
[Screenshot: capture]

Step 11 - Complete training the present page.

Step 12 - Repeat the process for 5-10 pages. This teaches the machine some of the odd patterns in the font glyphs. After 5-10 pages, the machine has some idea of the structure of the glyphs being scanned, and we are ready to do batch conversion.

Step 13 - Click on Recognition->OCR a complete directory.

Step 14 - Select the 'Source directory', set 'Export format' to 'Multiple unicode text files (numbered)', and set 'save in/as' to the directory where you want to store the output. Click on 'Start OCR'. This will do the batch conversion.
[Screenshot: capture]

The output is stored as numbered files: 0001.txt, 0002.txt, etc.

Then @vvasuki's NLP bot may place the data on wikisource.
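
Before handing the batch output to anyone else, it may help to confirm that every page produced a file. The following is only a minimal sketch in Python, not part of SanskritOCR: the function name check_ocr_output, the directory argument, and the page range are all assumptions made for illustration. It looks for four-digit NNNN.txt files and reports page numbers that are missing or whose file is zero bytes long.

```python
import re
from pathlib import Path

# Matches the numbered output files described above, e.g. 0001.txt.
PAGE_FILE = re.compile(r"^(\d{4})\.txt$")


def check_ocr_output(out_dir, first_page, last_page):
    """Return (missing, empty) page numbers for numbered OCR output files.

    out_dir, first_page and last_page are supplied by the caller; nothing
    here is read from SanskritOCR itself.
    """
    found = {}
    for path in Path(out_dir).iterdir():
        match = PAGE_FILE.match(path.name)
        if match:
            found[int(match.group(1))] = path

    expected = range(first_page, last_page + 1)
    missing = [n for n in expected if n not in found]
    # "Empty" here means a zero-byte file; the text encoding is not inspected.
    empty = [n for n in expected if n in found and found[n].stat().st_size == 0]
    return missing, empty
```

For example, check_ocr_output("ocr_output", 1, 330) would return two lists of page numbers to re-run, where "ocr_output" stands for whatever directory was chosen in Step 14.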

Request for Bot for SanskritOCR output

@vvasuki
I have posted the output generated via SanskritOCR for gaNaratnamahodadhi here.
SanskritOCR has an option to give output pagewise.
For example, this PDF file had 330 pages.
The output .txt files are named 0000.txt to 0329.txt.

I have very limited knowledge of Java or of bots in general. Would you be kind enough to devise a bot for uploading these files to index pages 1 to 330?
Once you have developed this kind of bot, I will be able to upload SanskritOCR output at a faster pace.
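
The file-to-page mapping described above (0000.txt goes to index page 1, 0329.txt to index page 330) can be sketched as follows. This is a hedged illustration only: the names index_page_for and upload_plan are hypothetical, the offset of 1 is inferred from the numbers in this request, and the actual upload to wikisource is deliberately left out because it depends on whichever bot framework is chosen.

```python
import re
from pathlib import Path

# Four ASCII digits, as in 0000.txt ... 0329.txt.
NUMBERED = re.compile(r"[0-9]{4}")


def index_page_for(file_name, offset=1):
    """Map an output file name to its index page: '0000.txt' -> 1."""
    return int(Path(file_name).stem) + offset


def upload_plan(out_dir, offset=1):
    """Return sorted (index_page, path) pairs for every NNNN.txt in out_dir."""
    paths = [p for p in Path(out_dir).glob("*.txt") if NUMBERED.fullmatch(p.stem)]
    return sorted((index_page_for(p.name, offset), p) for p in paths)
```

A bot could then iterate over upload_plan(...) and send each file's text to the matching index page. If a book's output starts at 0001.txt instead, passing offset=0 keeps file and page numbers aligned.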
