sanskrit / raw_etexts

Home Page: https://sanskrit.github.io/projects/text/

CSS 0.03% HTML 93.39% TeX 6.58% Shell 0.01%

raw_etexts's Introduction

Intro

A repository containing important raw texts (some not proofread). Processed (e.g. transliterated) texts may be placed in other repos or sites.

"raw" in raw_etexts means "mostly unencoded" text files (not binary files such as doc, pdf or rtf) whose contents can be searched with, say, grep. We therefore prefer files whose formatting (if any) does not interfere much with search.

Motivation

Websites with critical digitized texts have disappeared in the past, to the frustration of users.
It is easier to search multiple files encoded in plain text.
- part of an email sent by shrI vIranArAyaNa pANDurangi - "...problems in searching in wikisourse. it does not search to our wish. Suppose I wish to त्रिपादूर्ध्व wherever it occurs. If I search त्रिपा only it should search त्रिपादूर्ध्व. but it does not search it. it only searches त्रिपा. Hence I need to get all the veda and purana files in my computer. so I can put it in text files. text files search these things very fine."
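
The substring search described in the email can be sketched with grep over a local clone. This is a minimal sketch, assuming a grep that supports -r and --include (GNU grep does); the function name search_texts is made up for illustration, and limiting the search to md and txt files is an assumption.

```shell
# Hypothetical helper: print every line that contains the given substring.
# -r recurses into directories, -n prints line numbers,
# -F treats the pattern as a fixed string rather than a regex.
search_texts() {
  pattern=$1
  root=${2:-.}
  grep -rn -F --include='*.md' --include='*.txt' -- "$pattern" "$root"
}
```

Called as search_texts 'त्रिपा' . it also reports lines that contain त्रिपादूर्ध्व, because grep matches substrings rather than whole words.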

Usage

Clone this repo:
git clone https://github.com/sanskrit/raw_etexts.git --recursive --shallow-submodules
Initialize submodules if needed:
git submodule update --init --recursive --depth 1
Update submodules if needed:
git submodule foreach -q 'git checkout $(git config -f $toplevel/.gitmodules submodule.$name.branch || echo master) || git checkout -b $(git config -f $toplevel/.gitmodules submodule.$name.branch || echo master)'
(If this does not cover all submodules, also try it without the -b option. Also consider adding the --recursive option.)
Check status:
git submodule foreach -q 'git status'

Generating a catalog

On Linux: ./make_catalog.sh

Contribution

Fork this repo and send pull requests.
Raise issues.
Add a submodule, for example: git submodule add --depth 1 -- https://github.com/indic-dict/something.git some/path

If you maintain a lot of submodules, it may make sense to group them together (e.g. mixed/vishvAsa) so that they can easily be excluded from your local multi-file searches to avoid duplicate hits.
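
Excluding such a grouped directory from a local search might look like the following. This is a sketch assuming GNU grep's --exclude-dir option; search_excluding is a hypothetical name, not something the repo provides.

```shell
# Hypothetical helper: list files containing a string, skipping one directory name.
# --exclude-dir matches a directory's base name, so "vishvAsa" skips mixed/vishvAsa.
search_excluding() {
  pattern=$1
  skip_dir=$2
  root=${3:-.}
  grep -rl -F --exclude-dir="$skip_dir" -- "$pattern" "$root"
}
```

For example, search_excluding 'rAma' vishvAsa . lists matching files everywhere except under any directory named vishvAsa.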

Conventions

Allowed formats in the order of decreasing preference: md (possibly with frontmatter), txt, itx, tex, html.

Preferred naming conventions

OCR files are named xyz_ocr.md or xyz_ocr.txt.
While they are being proofread, they are renamed xyz_proofreading_.md or xyz_proofreading_.txt.
When fully proofread, they become xyz.md or xyz.txt.
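
The naming convention can be checked mechanically. The sketch below maps a file name to its stage; stage_of is a hypothetical helper, and it assumes the suffixes are spelled exactly as above. Any other md or txt file (a README.md, say) is reported as proofread, so treat the result as a hint only.

```shell
# Hypothetical helper: report the proofreading stage implied by a file name.
# Order matters: the more specific suffixes are tested before the plain extensions.
stage_of() {
  case "$1" in
    *_ocr.md|*_ocr.txt) echo ocr ;;
    *_proofreading_.md|*_proofreading_.txt) echo proofreading ;;
    *.md|*.txt) echo proofread ;;
    *) echo other ;;
  esac
}
```

Combined with find, this gives a rough progress report, e.g. find . -name '*.md' | while read -r f; do stage_of "$f"; done | sort | uniq -c.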

raw_etexts's Issues

OCR Request: Vṛddha Sūryāruṇa Karma Vipāka | वृद्ध सूर्यारुण कर्म विपाक (1937)

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
- Answer: No, it hasn't been OCR-ed. It doesn't exist in any of these repositories.
If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
- Answer: I can surely give it a try but do not promise anything since I have serious health issues.
Are you OK with the scan quality that we currently offer? Or are you able to provide your own OCR text?
- Answer: Yes, I'm OK with it. No, I don't have any means to OCR devanāgarī texts. If you have working software for this, please let me have it. I have tried a few open-source programs in the past without much success.
What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
- Answer: It is a very important text in Āyurveda and Jyotiṣa which lists remedies, in the form of mantras, for many problems (ailments) in life.
Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://archive.org (← strongly recommended), http://dropbox.com or http://sites.google.com.
- Answer: http://www118.zippyshare.com/v/KHWqXDoH/file.html
Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
- Title: Vṛddha Sūryāruṇa Karma Vipāka | वृद्ध सूर्यारुण कर्म विपाक
- Author: unknown
- Commentator: Gaṅgāviṣṇu | गङ्गाविष्णु
Is there any other information you want to provide?
- Answer: This text comes down in the line of an ancient sage called Sūrya and is most likely linked to the Yāmala Bhāskara stream, i.e. those who worshipped a Tantric form of the Sun and to whose sampradāya Maya (author of the Sūrya Siddhānta) belonged. It is a very important text. It would be appreciated if someone could be found to translate it later, for the sake of humanity.

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

Nepal-German Manuscript Catalog Scraping

Can be made into a dict file.
https://catalogue.ngmcp.uni-hamburg.de/servlets/solr/select?q=%2BobjectType%3A%22ngmcpdocument%22+%2BallNGMCP%3A*&fl=*%2Cscore&version=4.5&mask=content%2Fsearch%2Fsimple.xed&start=0&rows=10

There are probably many valuable works hiding in this catalog.
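
The catalog URL above ends in start=0&rows=10, which suggests the results are paged. A scraper could generate one URL per page as sketched below. This assumes the server accepts other start values with the same query, which has not been verified here; catalog_page_urls and the two variable names are made up for illustration.

```shell
# Base URL and query string copied from the link above (minus start and rows).
NGMCP_BASE='https://catalogue.ngmcp.uni-hamburg.de/servlets/solr/select'
NGMCP_QUERY='q=%2BobjectType%3A%22ngmcpdocument%22+%2BallNGMCP%3A*&fl=*%2Cscore&version=4.5&mask=content%2Fsearch%2Fsimple.xed'

# Hypothetical helper: print the URLs of the first N result pages.
# $1 = number of pages, $2 = rows per page (defaults to 10, as in the link).
catalog_page_urls() {
  pages=$1
  rows=${2:-10}
  i=0
  while [ "$i" -lt "$pages" ]; do
    # The URL parts are passed as arguments so their % signs are not read as printf directives.
    printf '%s?%s&start=%s&rows=%s\n' "$NGMCP_BASE" "$NGMCP_QUERY" "$((i * rows))" "$rows"
    i=$((i + 1))
  done
}
```

The printed URLs could then be fetched one at a time with a download tool of your choice, with a pause between requests to stay polite to the server.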

नव्यन्यायभाषाप्रदीप - महेशचन्द्र न्यायरत्न

https://raw.githubusercontent.com/sanskrit/raw_etexts/master/nyAya-shAstram/navyanyayabhashapradipa.md

Book is already proof-read.
I found this book a very good entry point into the language of Navya Nyāya.

Add texts from e-bhAratI sampat repos

http://ebharatisampat.in/FinalKaryashala/ shown by @suhasm

Imbibe skt library texts

URL: https://sanskritlibrary.org/textsList.html

Where do proof read items go?

It is good to have a raw etext repository.

But where will we put the files after proofreading?

OCR Request for 3 Prahasanas.

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

Yes. The Major Prahasanas: Bhagavadajjukīya, Hāsyacūḍāmaṇi, laTakamelaka (~150 pages) (Mattavilāsa, hAsyArNava, dhUrtasamagama are already done).

If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
- Answer: Yes. 15 days for 3 Prahasanas (~150 pages). Can do only after May 1.
Are you OK with the scan quality that we currently offer? Or are you able to provide your own OCR text?
- Answer: I will provide the pdfs to be OCRed.
What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
- Answer: Prahasanas are some of the most enjoyable texts in Sanskrit literature. Being both short and interesting, they can serve as bridges between Samskrita Bharati books and real Kāvya literature like Kālidasa. If the texts are made accessible, it is likely that they will provide material for stage productions in Sanskrit universities or Samskrita Bharati Programmes.
Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://archive.org (← strongly recommended), http://dropbox.com or http://sites.google.com.
- Answer:
  भगवदज्जुकीयम्— https://archive.org/details/BhagavadajjukamBodhayanaVPrabhakaraSastri1986
  लटकमेलकम्—https://archive.org/details/LatakamelakaPrahasanaShankadhara
  हास्यचूडामणिः— A section in https://archive.org/details/CollectionOfSixDramasOfVatsaraja
Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
- Title: भगवदज्जुकीयम्, लटकमेलकम्, हास्यचूडामणिः
- Author: बोधायनः, शङ्खधरः, वत्सराजः
- Commentator: NA, NA, NA
Is there any other information you want to provide?
- Answer: No.

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

Add Kuṭṭanīmata etc.

Gather texts from Digital Buddhist Sanskrit Canon

http://www.dsbcproject.org/canon-text/browse-by-list/1

OCR Request: prākṛta-śabda-mahārṇava and prākṛta-lakṣmī-nāma-mālā

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
- Answer: The texts have not been OCRed.
If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
- Answer: No. The texts are large and I do not have enough knowledge of Prakrit to proofread.
Are you OK with the scan quality that we currently offer? Or are you able to provide your own OCR text?
- Answer: Yes, happy.
What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
- Answer:
  There are no online Prakrit dictionaries available. It poses a serious difficulty to those trying to read Prakrit literature. Making a Prakrit dictionary available will probably encourage more people to learn Prakrit and give up dependence on the chāyā. It will also help Jains, Buddhists, linguists and historians.
Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://archive.org (← strongly recommended), http://dropbox.com or http://sites.google.com.
- Answer:
  prākṛta-śabda-mahārṇava: http://www.dli.ernet.in/handle/2015/348701 (This is the best quality pdf available AFAIK)
  prākṛta-lakṣmī-nāmamālā: https://archive.org/details/PaiaLacchiNammala001708HR
Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
- Title: pāia-sadda-mahaṇṇāvo (पाइअ-सद्द-महण्णवो)
- Author: Pandit Hargovind Das T. Sheth
- Commentator: NA
- Title: pāia-lacchī-nāma-mālā (पाइअ-लच्छी-नाम-माला)
- Author: Dhanapāla (धनपालः)
- Commentator: NA
Is there any other information you want to provide?
- Answer: We should upload OCRed pdfs of the texts to archive, even if not proofread. It will help many.

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

Add https://ebharatisampat.in/

Sanskrit Bharati appears to have a collection of proofread and non-proofread Unicode texts at https://ebharatisampat.in/. You may want to add them.
I have a script to do that. The problem is that git never works for me as intended: my forks become messy and I cannot keep them in sync with the original repositories, so I can't add the texts to my fork and push.
I hope you'll be able to write a script to scrape the texts.

Collecting the texts from आगमान्वेषणी (Āgamānveṣaṇī)

Someone has built a piece of software called Āgamānveṣaṇī. The author is unknown. Pradīpa Siṃha told us about it.
AgamanveShaNI.tar.gz

These files are not in our collection:

महाभारत-माध्वपाठः-१-तः १८
ईशावास्य-टीकागीताभाष्य-प्रमेयदीपिका-१-१८
तन्त्रदीपिका-१-1-l_ to 4-4
युक्तिमल्लिका
सङ्ग्रहरामायणम्
ब्रह्मलूत्रार्थसङ्ग्रह-अविदितव्याख्यानम्
स्तोत्रभाग- कन्दुकस्तुति
कृष्णचारित्र्यमञ्जरी
कृष्णाष्टकम्
गुरुस्तोत्रम्
चतुर्विंशतिमूर्तिस्तुतिः
जयतीर्थस्तुतिः
टीकाकृत्पादाष्टकम्
तारतम्यस्तोत्रम्
दशावतारस्तुतिः
नदीतारतम्यस्तोत्रम्
नवग्रहस्तोत्रम्
प्रातःसङ्कल्पगद्यम्
प्रार्थनादशकस्तोत्रम्
मध्वगुरुस्तुतिपरम्परा
महाभारततात्पर्यनिर्णयभावसङ्ग्रहः
यन्त्रोद्धारकहनुमत्र स्तोत्रम्
रघूत्तममङ्गलाष्टकम्
राघवेन्द्रस्तोत्रम्
वादिराजस्तोत्रम्
राघवेन्द्रगाथास्तवः
राघवेन्द्रस्तोत्रम्
वादिराजस्तोत्रम्
विष्णुस्तुतिः -सत्यसन्धरु
वेदव्यासकरावलम्बन
व्यासगद्य
व्यासराजस्तोत्र
शिवस्तुति
श्रीपादरत्नपञ्चमालिका
श्रीशगुणदर्पणम्
सत्यप्रियाष्टक
सत्यबोधस्तोत्रम्
सर्वमूलाद्यन्तश्लोकाः
श्रीरामवर्णनम्
हयग्रीवसम्पदास्तोत्रम्
हरिवायुस्तुतिः
स्रीधर्म-गुरुमन्त्रम

The texts in it are fetched from some website. Once we know which one, obtaining those texts becomes easy.

Error Detection

Vishvas, are you just storing the texts, or are you planning to clean them as well? Please advise.

Fix windows-incompatible filenames

Log attached
raw etext clone errors.txt
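
The attached log is not reproduced here, so the exact failing names are unknown. As a rough check, the sketch below lists names that Windows is known to reject: those containing any of < > : " \ | ? * and those ending in a space or a dot. It assumes a find that supports -mindepth (GNU and BSD find do); find_windows_unsafe is a hypothetical name, and other Windows limits such as path length and reserved device names are not covered.

```shell
# Hypothetical helper: list files and directories whose names Windows cannot create.
# -mindepth 1 skips the starting directory itself; .git is pruned.
find_windows_unsafe() {
  root=${1:-.}
  find "$root" -mindepth 1 -name .git -prune -o \
    \( -name '*[<>:"\\|?*]*' -o -name '*[ .]' \) -print
}
```

Running find_windows_unsafe . from the top of a clone made on Linux gives a list of candidates for renaming.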

Complete text of the Śabarabhāṣya (शबरभाष्य)

Commit 6e8d738 contributed the digitized Śabarabhāṣya. It is very useful for my study. However, some pages are missing from it; the first chapter, for example, has only about 99 pages. The page-number ordering also needs to change: instead of 1990020102811 - Mimamsa Darshana, Jaimini Muni, 1130p, (मीमांसादर्शनम् १ - १२ अध्यायाः CSS)_12_1.jpg.txt
it should be 1990020102811 - Mimamsa Darshana, Jaimini Muni, 1130p, (मीमांसादर्शनम् १ - १२ अध्यायाः CSS)_12_001.jpg.txt.

So a cleanup is needed. Would it be possible to collect the pages into one file per chapter and send them, @lalitaalaalitah?
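
The zero-padding asked for in this issue (_12_1.jpg.txt becoming _12_001.jpg.txt) could be sketched as follows. The helper name pad_page_name is hypothetical, a width of three digits is assumed from the example, and sed -E is assumed to be available (it is in GNU and BSD sed).

```shell
# Hypothetical helper: zero-pad the final page number of a "..._N.jpg.txt" name to 3 digits.
# Step 1 prepends two zeros; step 2 strips surplus leading zeros, keeping at least 3 digits.
pad_page_name() {
  printf '%s\n' "$1" | sed -E \
    -e 's/_([0-9]+)\.jpg\.txt$/_00\1.jpg.txt/' \
    -e 's/_0*([0-9]{3,})\.jpg\.txt$/_\1.jpg.txt/'
}

# Possible use inside a directory of page files:
# for f in *.jpg.txt; do
#   new=$(pad_page_name "$f")
#   [ "$f" = "$new" ] || mv -- "$f" "$new"
# done
```

Only the last number in the name is touched, so the chapter number before it stays as it is, and names that are already padded are left unchanged. With padded names, a plain sorted listing puts the pages of a chapter in order, which also makes it straightforward to concatenate them into one file per chapter.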

Two missing pages in Haravijaya edition scan

The scan of the Haravijaya edition used here for OCR lacks (at least) two pages. I have located a scan of a reproduction of the same edition and extracted the two missing pages from it. They would need to be inserted before page 694 of the pdf, which corresponds to line 22477 of the e-text. I could just type them in, but perhaps you would prefer to OCR and include them?

Dvaita.net texts

http://www.dvaita.net/releases.html
They may provide unicode versions if asked

Sansknet Data

http://rsvidyapeetha.ac.in/sansknet/index-1.html

Scrape Kannada dicts from Bharatavani.in and Baraha

Bharatavani.in, being a government website, is always in danger of disappearing.

~~1. ಪಂಪನ ನುಡಿಗಣಿ https://bharatavani.in/dictionary-surf/?did=70&language=Kannada~~

~~2. ಹಳಗನ್ನಡ ಪದಸಂಪದ (KaSaPa) https://bharatavani.in/dictionary-surf/?did=64&language=Kannada~~

ಚಂಪೂ ನುಡಿಗನ್ನಡಿ (Bharatavani.in)
https://bharatavani.in/dictionary-surf/?did=62&language=Kannada
Bhāratīya Bhāṣā Kośa
https://bharatavani.in/dictionary-surf/?did=193&language=Hindi
ಕನ್ನಡ ಜಾನಪದ ನಿಘಂಟು Vol 1 (Bharatavani.in)
https://bharatavani.in/dictionary-surf/?did=140&language=kannada
ಜಾನಪದ ವಸ್ತು ಕೋಶ (Bharatavani.in)
https://bharatavani.in/dictionary-surf/?did=100&language=Kannada
ಸಂಕ್ಷಿಪ್ತ ಕನ್ನಡ ನಿಘಂಟು (KaSaPa) (Bharatavani.in)
https://bharatavani.in/dictionary-surf/?did=46&language=Kannada
Seven dictionaries on Baraha.com
http://baraha.com/v10/kannada/browse.php#

Imbibe some un-publicized digitized texts with western indologists

Eg. Śṛṅgāraprakāśa of Bhoja. (See mixed/cu-sarit )

Andrey (आण्ड्रेयः) has been asked.

Scrape Śābdikābharaṇam Dhātuvṛtti

http://www.sanskrit.nic.in/DigitalBook/S/sabdikabharanam/index.html
In danger of vanishing. Could be made into a dictionary.

Save and apply common regex-fixes before handing over the digitized text for proof-reading

Some OCR tools result in predictable errors, which can be fixed by regex replacement, as pointed out in this thread. These regexes should be saved in a bash script, which may be of the following form:

perl -pi -e 's/FINDTEXT/REPLACETEXT/' someFileName

This script should then be run on the OCR output before it is passed on to the proofreader.

अभिधानराजेन्द्र OCR

This is a large dictionary of Prakrit with copious references from various Jain and non-Jain works. Much of the explanation is in Sanskrit.
I will not be able to work on this kosha because of limited resources and my lack of knowledge of Prakrit.

So I am putting the OCR of all seven volumes here, in case someone would like to work on it some day.

https://github.com/sanskrit/raw_etexts/tree/master/koshaH/abhidhanarajendra

Rashtriya Sanskrita Sansthan, JNU, UHyd Texts

Pali Moggallāna Vyākaraṇa.
http://www.sanskrit.nic.in/moggallana/p1.html
This can be scraped into a dictionary. It is likely this already exists in roman script on some website. Need to check.
If there is some way to extract data from the pdfs...
http://www.sanskrit.nic.in/ebooks.php
Particularly noteworthy are Abhinavabharati and Rasagangadhara with commentaries.
Some modern Sanskrit in unicode here. Low priority.
http://www.sanskrit.nic.in/ASSP/index.html#

There was more material on the Sansthan website about two years ago, which now seems to be gone.

JNU Bhāvaprakāśa Database
http://sanskrit.jnu.ac.in/bpnighantu/search.jsp
If not possible to scrape, then maybe by asking Prof. Girish Nath Jha nicely...
UHyd: Large number of texts in unknown font. Most of the texts are modern. Low priority.
http://sanskrit.uohyd.ac.in/Corpus/
These should be saved, even if they cannot be converted immediately.

haravijaya proofreading

I would like to correct a few cantos of the Haravijaya OCR text, if that is welcome. I would start with canto 49, which is the last one missing in the incomplete e-text produced many years ago by Diwakar and Rabi Acharya. If it goes well, I would in due course correct a few more cantos. I then plan to convert them to IAST and integrate them into my digital critical edition of the Haravijaya (a work in progress, of course).

I have started with the first few verses, but am not sure how I should encode the footnote markers, which, I am afraid, usually break in OCR. Should I just leave them out, so that whoever is interested in the variant readings has to consult the scan of the edition, or later the digital critical edition? Adding the numbers in the running text would make these specific words more or less ungreppable. Or do you have a convention for that?

वक्रारविन्दनिमितं करगाढसैद्ध-
पार्श्वद्वयं युधि बभार स पाञ्चजन्यम् ।
वैरिञ्चमण्डमिव निःश्वसितानिलोल-
पर्यस्यमानमुदरान्तरतो विनिर्यत् ॥ २ ॥

Here the ru together with footnote marker 3 was OCRed as sai.
