
raw_etexts's Introduction

Intro

A repository containing important raw texts (some not proofread). Processed (e.g. transliterated) texts may be placed in other repos or sites.

"raw" in raw_etexts means "mostly unencoded" text files (and not binary files like doc or pdf or rtf), whose contents may be searched by (say) grep. This means that we prefer files whose formatting (if any) does not interfere much with search.

Motivation

  • Websites with critical digitized texts have disappeared in the past, to the frustration of users.
  • It is easier to search multiple files encoded in plain text.
    • part of an email sent by shrI vIranArAyaNa pANDurangi - "...problems in searching in wikisourse. it does not search to our wish. Suppose I wish to त्रिपादूर्ध्व wherever it occurs. If I search त्रिपा only it should search त्रिपादूर्ध्व. but it does not search it. it only searches त्रिपा. Hence I need to get all the veda and purana files in my computer. so I can put it in text files. text files search these things very fine."
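The substring search described in the email can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the function name `find_substring` and the choice to look only at `.md` and `.txt` files are assumptions. Both the query and each line are normalized to NFC so that differently encoded but visually identical strings still match.

```python
import unicodedata
from pathlib import Path


def find_substring(root, needle, suffixes=(".md", ".txt")):
    """Yield (path, line_number, line) for each line that contains needle."""
    # Normalize the query once; each line is normalized the same way below.
    needle = unicodedata.normalize("NFC", needle)
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and anything that is not a plain-text format.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in unicodedata.normalize("NFC", line):
                    yield path, number, line.rstrip("\n")
```

Calling `find_substring(".", "त्रिपा")` from the repository root would report every line containing त्रिपादूर्ध्व as well, since the match is a plain substring test, much like `grep -r`.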

Usage

  • Clone this repo: git clone https://github.com/sanskrit/raw_etexts.git --recursive --shallow-submodules
  • Initialize submodules if needed: git submodule update --init --recursive --depth 1
  • Update submodules if needed:
    git submodule foreach -q 'git checkout $(git config -f $toplevel/.gitmodules submodule.$name.branch || echo master) || git checkout -b $(git config -f $toplevel/.gitmodules submodule.$name.branch || echo master)' (also without the -b option, if all submodules are not covered. Also consider adding the --recursive option.)
  • Check status: git submodule foreach -q 'git status'

Generating a catalog

  • On Linux: ./make_catalog.sh

Contribution

  • Fork this repo and send pull requests.
  • Raise issues.
  • Add submodule: For example - git submodule add --depth 1 -- https://github.com/indic-dict/something.git some/path

If you're maintaining a lot of submodules, it may make sense to group them together (e.g. mixed/vishvAsa) so that they can easily be excluded from your local multi-file searches, avoiding duplicate hits.

Conventions

Allowed formats in the order of decreasing preference: md (possibly with frontmatter), txt, itx, tex, html.

Preferred naming conventions

  • OCR files will be named: xyz_ocr.md or xyz_ocr.txt.
  • When they are being proofread, they will be renamed: xyz_proofreading_.md or txt.
  • When fully proofread, they will be: xyz.md or xyz.txt
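As a rough sketch of how a script might read these naming conventions, the following hypothetical helper (`proofreading_stage` is not part of the repository) maps a file name to its stage. It assumes only `.md` and `.txt` files follow the convention, and it accepts the proofreading suffix with or without the trailing underscore, since the exact form is not certain from the list above.

```python
from pathlib import PurePath


def proofreading_stage(filename):
    """Return 'ocr', 'proofreading', 'proofread', or None for other formats."""
    path = PurePath(filename)
    # Assumption: the convention applies only to md and txt files.
    if path.suffix not in (".md", ".txt"):
        return None
    stem = path.stem
    if stem.endswith("_ocr"):
        return "ocr"
    # Accept both "xyz_proofreading" and "xyz_proofreading_".
    if stem.rstrip("_").endswith("_proofreading"):
        return "proofreading"
    return "proofread"
```

For example, `proofreading_stage("xyz_ocr.md")` returns `"ocr"` and `proofreading_stage("xyz.txt")` returns `"proofread"`.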

raw_etexts's People

Contributors

chakrabortydeepro, drdhaval2785, karthikraman, veerasimha, vvasuki

raw_etexts's Issues

OCR Request: Vṛddha Sūryāruṇa Karma Vipāka | वृद्ध सूर्यारुण कर्म विपाक (1937)

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

  • Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
    • Answer: No, it hasn't been OCR-ed. It doesn't exist in any of these repositories.
  • If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
    • Answer: I can surely give it a try but do not promise anything since I have serious health issues.
  • Are you OK with the scan quality that we currently offer? Or are you able to provide your own OCR text?
    • Answer: Yes, I'm OK with it. No, I don't have any means to OCR devanāgarī texts. If you have working software for this, please let me have it. I have tried a few open-source programs in the past but without much success.
  • What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
    • Answer: It is a very important text in Āyurveda and Jyotiṣa which lists the remedies in forms of mantras to many problems (ailments) in life.
  • Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://archive.org (← strongly recommended), http://dropbox.com or http://sites.google.com.
  • Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
    • Title: Vṛddha Sūryāruṇa Karma Vipāka | वृद्ध सूर्यारुण कर्म विपाक
    • Author: unknown
    • Commentator: Gaṅgāviṣṇu | गङ्गाविष्णु
  • Is there any other information you want to provide?
    • Answer: This text comes in the line of an ancient sage called Sūrya and is most likely linked to the Yāmala Bhāskara stream, i.e., those who worshipped the Tantrik form of the Sun and to whose sampradāya Maya (author of Sūrya Siddhānta) belonged. Very important text. It would be appreciated if we could find someone to translate it later for the sake of humanity.

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

OCR Request for 3 Prahasanas.

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

Yes. The Major Prahasanas: Bhagavadajjukīya, Hāsyacūḍāmaṇi, laTakamelaka (~150 pages) (Mattavilāsa, hAsyArNava, dhUrtasamagama are already done).

  • If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
    • Answer: Yes. 15 days for 3 Prahasanas (~150 pages). Can do only after May 1.
  • Are you OK with the scan quality that we currently offer? Or are you able to provide your own OCR text?
    • Answer: I will provide the pdfs to be OCRed.
  • What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
    • Answer: Prahasanas are some of the most enjoyable texts in Sanskrit literature. Being both short and interesting, they can serve as bridges between Samskrita Bharati books and real Kāvya literature like Kālidasa. If the texts are made accessible, it is likely that they will provide material for stage productions in Sanskrit universities or Samskrita Bharati Programmes.
  • Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://archive.org (← strongly recommended), http://dropbox.com or http://sites.google.com.
  • Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
    • Title: भगवदज्जुकीयम्, लटकमेलकम्, हास्यचूडामणिः
    • Author: बोधायनः, शङ्खधरः, वत्सराजः
    • Commentator: NA, NA, NA
  • Is there any other information you want to provide?
    • Answer: No.

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

OCR Request: prākṛta-śabda-mahārṇava and prākṛta-lakṣmī-nāma-mālā

Welcome! Answer all these questions.

Do you want a text OCR-ed? If yes continue below. If not, clear all this text and type your issue.

  • Has the text already been OCR-ed? Have you searched online (using both devanAgarI and latin transliterations)? For example in the large repositories of digitized texts listed here.
    • Answer: The texts have not been OCRed.
  • If you commit to proof-read the result within a reasonable timeframe (say 1 month for 300 pages), we will be specially excited to oblige you. Are you willing to make such a commitment? See here to get an idea of what it involves.
    • Answer: No. The texts are large and I do not have enough knowledge of Prakrit to proofread.
  • Are you OK with the scan quality that we currently offer? Or are you able to provide your own OCR text?
    • Answer: Yes, happy.
  • What other factors should our OCR/ other volunteers consider when deciding whether to take up this request? In other words, what's so important about this text?
    • Answer:
      There are no online Prakrit dictionaries available. It poses a serious difficulty to those trying to read Prakrit literature. Making a Prakrit dictionary available will probably encourage more people to learn Prakrit and give up dependence on the chāyā. It will also help Jains, Buddhists, linguists and historians.
  • Provide a link to the pdf of the printed book which needs to be OCR-ed. You can host the pdf online on several sites such as http://archive.org (← strongly recommended), http://dropbox.com or http://sites.google.com.
  • Now for some details: please enter the following metadata (use both devanAgarI and latin alphabet forms if you can).
    • Title: pāia-sadda-mahaṇṇāvo (पाइअ-सद्द-महण्णवो)
    • Author: Pandit Hargovind Das T. Sheth
    • Commentator: NA
    • Title: pāia-lacchī-nāma-mālā (पाइअ-लच्छी-नाम-माला)
    • Author: Dhanapāla (धनपालः)
    • Commentator: NA
  • Is there any other information you want to provide?
    • Answer: We should upload OCRed pdfs of the texts to archive, even if not proofread. It will help many.

Ok - thanks for answering the above questions 🙏. Subscribe to this thread to stay updated.

Add https://ebharatisampat.in/

Sanskrit Bharati appears to have a collection of proofread and non-proofread Unicode texts at https://ebharatisampat.in/. You may add that.
I have a script to do that. The problem is that git never works for me as intended: my forks become messy and I cannot keep them in sync with the original repositories, so I can't add the texts to my fork and push them.
I hope you'll be able to write a script to scrape the texts.

Collecting the texts of Āgamānveṣaṇī (आगमान्वेषणी)

Someone has built a piece of software called Āgamānveṣaṇī. The creator is unknown. Pradīpasiṃha informed us about it.
AgamanveShaNI.tar.gz

These files are not in our collection:

  • महाभारत-माध्वपाठः-१-तः १८
  • ईशावास्य-टीकागीताभाष्य-प्रमेयदीपिका-१-१८
  • तन्त्रदीपिका-१-1-l_ to 4-4
  • युक्तिमल्लिका
  • सङ्ग्रहरामायणम्
  • ब्रह्मलूत्रार्थसङ्ग्रह-अविदितव्याख्यानम्
  • स्तोत्रभाग- कन्दुकस्तुति
  • कृष्णचारित्र्यमञ्जरी
  • कृष्णाष्टकम्
  • गुरुस्तोत्रम्
  • चतुर्विंशतिमूर्तिस्तुतिः
  • जयतीर्थस्तुतिः
  • टीकाकृत्पादाष्टकम्
  • तारतम्यस्तोत्रम्
  • दशावतारस्तुतिः
  • नदीतारतम्यस्तोत्रम्
  • नवग्रहस्तोत्रम्
  • प्रातःसङ्कल्पगद्यम्
  • प्रार्थनादशकस्तोत्रम्
  • मध्वगुरुस्तुतिपरम्परा
  • महाभारततात्पर्यनिर्णयभावसङ्ग्रहः
  • यन्त्रोद्धारकहनुमत्र स्तोत्रम्
  • रघूत्तममङ्गलाष्टकम्
  • राघवेन्द्रस्तोत्रम्
  • वादिराजस्तोत्रम्
  • राघवेन्द्रगाथास्तवः
  • राघवेन्द्रस्तोत्रम्
  • वादिराजस्तोत्रम्
  • विष्णुस्तुतिः -सत्यसन्धरु
  • वेदव्यासकरावलम्बन
  • व्यासगद्य
  • व्यासराजस्तोत्र
  • शिवस्तुति
  • श्रीपादरत्नपञ्चमालिका
  • श्रीशगुणदर्पणम्
  • सत्यप्रियाष्टक
  • सत्यबोधस्तोत्रम्
  • सर्वमूलाद्यन्तश्लोकाः
  • श्रीरामवर्णनम्
  • हयग्रीवसम्पदास्तोत्रम्
  • हरिवायुस्तुतिः
  • स्रीधर्म-गुरुमन्त्रम

The texts in it are obtained from some website. Once we know which one, obtaining those texts becomes easy.

Error Detection

Vishvas, are you only storing the texts, or do you plan to clean them as well? Please advise.

Complete text of the Śabara-bhāṣya (शबरभाष्यम्)

The digitized Śabara-bhāṣya was committed in 6e8d738. It is very useful for my study. However, some pages are missing: the first chapter, for example, has only about 99 pages. The page numbering in the file names also needs to change. Instead of
1990020102811 - Mimamsa Darshana, Jaimini Muni, 1130p, (मीमांसादर्शनम् १ - १२ अध्यायाः CSS)_12_1.jpg.txt
the name should be
1990020102811 - Mimamsa Darshana, Jaimini Muni, 1130p, (मीमांसादर्शनम् १ - १२ अध्यायाः CSS)_12_001.jpg.txt

So a cleanup is needed. Would it be possible to collect the pages into one file per chapter and send that, @lalitaalaalitah ?
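The renaming requested in this issue (a page suffix such as `_12_1.jpg.txt` becoming `_12_001.jpg.txt`) could be sketched as below. This is a minimal illustration under stated assumptions: the helper name `pad_page_number`, the default width of 3, and the rule that the page number is the last run of digits before `.jpg.txt` are all hypothetical, inferred only from the two file names quoted above.

```python
import re

# Assumption: the page number is the final "_<digits>" before ".jpg.txt".
_PAGE = re.compile(r"_(\d+)(\.jpg\.txt)$")


def pad_page_number(filename, width=3):
    """Zero-pad the trailing page number so names sort in page order."""
    return _PAGE.sub(
        lambda m: "_" + m.group(1).zfill(width) + m.group(2),
        filename,
    )
```

With padded numbers, a plain lexicographic sort of the file names matches page order, which makes it straightforward to concatenate the pages of a chapter into a single file. Names that do not match the pattern are returned unchanged.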

Scrape Kannada dicts from Bharatavani.in and Baraha

Bharatavani.in, being a government website, is always in danger of disappearing.

  1. ಪಂಪನ ನುಡಿಗಣಿ
    https://bharatavani.in/dictionary-surf/?did=70&language=Kannada

  2. ಹಳಗನ್ನಡ ಪದಸಂಪದ (KaSaPa)
    https://bharatavani.in/dictionary-surf/?did=64&language=Kannada

  3. ಚಂಪೂ ನುಡಿಗನ್ನಡಿ (Bharatavani.in)
    https://bharatavani.in/dictionary-surf/?did=62&language=Kannada

  4. Bhāratīya Bhāṣā Kośa
    https://bharatavani.in/dictionary-surf/?did=193&language=Hindi

  5. ಕನ್ನಡ ಜಾನಪದ ನಿಘಂಟು Vol 1 (Bharatavani.in)
    https://bharatavani.in/dictionary-surf/?did=140&language=kannada

  6. ಜಾನಪದ ವಸ್ತು ಕೋಶ (Bharatavani.in)
    https://bharatavani.in/dictionary-surf/?did=100&language=Kannada

  7. ಸಂಕ್ಷಿಪ್ತ ಕನ್ನಡ ನಿಘಂಟು (KaSaPa) (Bharatavani.in)
    https://bharatavani.in/dictionary-surf/?did=46&language=Kannada

  8. Seven dictionaries on Baraha.com
    http://baraha.com/v10/kannada/browse.php#

Rashtriya Sanskrita Sansthan, JNU, UHyd Texts

  1. Pali Moggallāna Vyākaraṇa.
    http://www.sanskrit.nic.in/moggallana/p1.html
    This can be scraped into a dictionary. It is likely this already exists in roman script on some website. Need to check.

  2. If there is some way to extract data from the pdfs...
    http://www.sanskrit.nic.in/ebooks.php
    Particularly noteworthy are Abhinavabharati and Rasagangadhara with commentaries.

  3. Some modern Sanskrit in unicode here. Low priority.
    http://www.sanskrit.nic.in/ASSP/index.html#

There was more material on the Sansthan website about two years ago, which now seems to be gone.

  1. JNU Bhāvaprakāśa Database
    http://sanskrit.jnu.ac.in/bpnighantu/search.jsp
    If not possible to scrape, then maybe by asking Prof. Girish Nath Jha nicely...

  2. UHyd: Large number of texts in unknown font. Most of the texts are modern. Low priority.
    http://sanskrit.uohyd.ac.in/Corpus/
    These should be saved, even if they cannot be converted immediately.

haravijaya proofreading

I would like to correct a few cantos of the Haravijaya OCR text, if that is welcome. I would start with canto 49, which is the last one missing in the incomplete e-text produced many years ago by Diwakar and Rabi Acharya. If it goes well, I would in due course correct a few more cantos. I then plan to convert them to IAST and integrate them into my digital critical edition of the Haravijaya (a work in progress, of course). I have started with the first few verses, but am not sure how I should encode the footnote markers, which, I am afraid, usually break in OCR. Should I just leave them out, so that whoever is interested in the variant readings has to consult the scan of the edition, or later the digital critical edition? Adding the numbers in the running text would make the affected words more or less ungreppable. Or do you have a convention for that?

वक्रारविन्दनिमितं करगाढसैद्ध-
पार्श्वद्वयं युधि बभार स पाञ्चजन्यम् ।
वैरिञ्चमण्डमिव निःश्वसितानिलोल-
पर्यस्यमानमुदरान्तरतो विनिर्यत् ॥ २ ॥

Here the ru together with footnote marker 3 was OCRed as sai.
