drdhaval2785 / inriaxmlwrapper

A Python library that uses the XML database of http://sanskrit.inria.fr for conjugation and analysis of Sanskrit word forms.

Python 68.46% HTML 15.07% CSS 9.00% JavaScript 7.47%

inriaxmlwrapper's Introduction

What to expect

This code analyses a given Sanskrit text and reports the possible analyses of each word form. For example, किन्तु is analysed as किन्तु(किन्तु-अव्ययम्-क्रियाविशेषणम्), शृगालः as शृगालः(शृगाल-प्रथमाविभक्तिः-एकवचनम्-पुंल्लिङ्गम्), and चिन्तयेत् as चिन्तयेत्(चिन्त्-प्राथमिकः-विधिलिङ्-कर्तरि-एकवचनम्-प्रथमपुरुषः).
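The output shape shown in these examples can be split into its parts with a few lines of Python. This is only a sketch: the pattern form(lemma-tag-tag-...) is inferred from the three examples above, and the names Analysis and parse_analysis are hypothetical helpers, not part of sanskritmark.py.

```python
import re
from typing import List, NamedTuple

# Assumed shape, inferred only from the README examples:
#   surface_form(lemma-tag1-tag2-...)
ANALYSIS_RE = re.compile(r"^(?P<form>[^()]+)\((?P<body>[^()]+)\)$")


class Analysis(NamedTuple):
    form: str        # the word as it appears in the text
    lemma: str       # first hyphen-separated field inside the parentheses
    tags: List[str]  # remaining hyphen-separated grammatical tags


def parse_analysis(token: str) -> Analysis:
    """Split one 'form(lemma-tag-...)' string into its components."""
    match = ANALYSIS_RE.match(token.strip())
    if match is None:
        raise ValueError(f"not in form(lemma-tag-...) shape: {token!r}")
    lemma, *tags = match.group("body").split("-")
    return Analysis(match.group("form"), lemma, tags)
```

Calling parse_analysis on the शृगालः example yields the lemma शृगाल and three tags. The sketch does not handle a lemma that itself contains a hyphen, or words that carry several alternative analyses; how the script writes those cases is not described here.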

Requirements

Installation and Usage

Download ZIP from the repository.
Extract the content to your favourite folder.
Put the text you want to analyse in sanskritinput.txt.
Open Terminal / cmd.exe.
cd to your folder.
Type python sanskritmark.py and press enter.
After the execution is over, check analysedoutput.txt for analysis of the text.
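The same steps can be driven from another Python program. The sketch below writes the input file, runs the script in its folder, and reads the result back. The function name analyse_text is hypothetical, UTF-8 file encoding is an assumption, and the interpreter is a parameter because the README does not say which Python version the script needs.

```python
import subprocess
import sys
from pathlib import Path


def analyse_text(text, workdir, script="sanskritmark.py", python=sys.executable):
    """Run the analysis script on `text` and return the contents of the output file.

    Assumes the script reads sanskritinput.txt and writes analysedoutput.txt
    in its own folder, as the steps above describe.
    """
    workdir = Path(workdir)
    (workdir / "sanskritinput.txt").write_text(text, encoding="utf-8")
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run([python, script], cwd=workdir, check=True)
    return (workdir / "analysedoutput.txt").read_text(encoding="utf-8")
```

Here workdir is the folder the ZIP was extracted to.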

Limitations

Support for sandhi and samAsas is currently quite primitive. We are working on improving it.

Keeping updated

To update your database (for example, after Gerard updates his list), download the data from http://sanskrit.inria.fr/DATA/XML/SL_morph.tar.gz

Put the extracted XML files into the code directory, overwriting any XML files that are already there. (SL_parts.xml is larger than what GitHub allows.)
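A rough sketch of this update step in Python follows. The URL is the one given above; the function names are hypothetical, and the internal layout of the archive is not documented here, so the code assumes that copying every .xml and .dtd member into the code directory by base name is what is wanted.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "http://sanskrit.inria.fr/DATA/XML/SL_morph.tar.gz"  # from the README


def download_archive(url=DATA_URL, target="SL_morph.tar.gz"):
    """Download the archive to `target` and return its path."""
    path, _headers = urllib.request.urlretrieve(url, target)
    return Path(path)


def extract_data_files(archive, code_dir, suffixes=(".xml", ".dtd")):
    """Copy data files from the archive into code_dir, overwriting old copies.

    Only the base name of each member is used, so nothing is written
    outside code_dir even if the archive contains directory prefixes.
    """
    code_dir = Path(code_dir)
    written = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            name = Path(member.name).name
            if member.isfile() and name.endswith(suffixes):
                with tar.extractfile(member) as source, open(code_dir / name, "wb") as out:
                    shutil.copyfileobj(source, out)  # streams large files such as SL_parts.xml
                written.append(name)
    return sorted(written)
```

A typical call would be extract_data_files(download_archive(), "."), run from the code directory.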

Programs in repository

sanskritmark.py is the current code under development.
SL_adverbs.xml, SL_final.xml, SL_morph.dtd, SL_nouns.xml, SL_parts.xml, SL_preverbs.txt, SL_pronouns.xml and SL_roots.xml are files taken from Gerard's database.
suffixentryfile.py generates a data-entry template for various parameters used in sanskritmark.py.
inriaxmlparser.py was the primitive version of the code. It is now abandoned.

inriaxmlwrapper's Issues

test for nripa

Hi nripa,
I am writing a dummy issue.

findrootword hangs

The function findrootword takes exorbitantly long; in fact, it nearly kills the PC.
Please have another look at it.
It is not working properly.
It used to work reasonably well earlier.

Error in Building

Traceback (most recent call last):
  File "sanskritmark.py", line 26, in <module>
    upasargas = etree.parse('SL_upasargas.xml')
  File "src/lxml/lxml.etree.pyx", line 3442, in lxml.etree.parse (src/lxml/lxml.etree.c:81701)
  File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:118620)
  File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118967)
  File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117879)
  File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:112425)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105881)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107589)
  File "src/lxml/parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:106400)
IOError: Error reading file 'SL_upasargas.xml': failed to load external entity "SL_upasargas.xml"
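The traceback shows lxml failing because the file it was asked to parse is not on disk. A small pre-flight check can turn that into a clearer message. This is a sketch only: the default file list is copied from the README's list of XML data files, which files sanskritmark.py actually loads is an assumption, and missing_data_files is a hypothetical helper.

```python
from pathlib import Path

# XML data files named in the README; adjust to whatever the script really loads.
DATA_FILES = (
    "SL_adverbs.xml",
    "SL_final.xml",
    "SL_nouns.xml",
    "SL_parts.xml",
    "SL_pronouns.xml",
    "SL_roots.xml",
)


def missing_data_files(directory=".", names=DATA_FILES):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]
```

Calling missing_data_files(".", ["SL_upasargas.xml"]) before parsing would report the missing file by name instead of ending in a parser traceback.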

Reusing this for kmadathil/sanksrit_parser

@drdhaval2785

Is this under active development? I would like to reuse if possible.

The following issues are seen:

upasargas = etree.parse('SL_upasargas.xml')
throws an error because no such file exists. There is an SL_preverbs.txt instead.
The script cannot be imported directly as a module. A simple modification would be to change the last line to

if __name__ == "__main__":
    convertfromfile('sanskritinput.txt','analysedoutput.txt')

If you're not intending to support this, I'll reimplement based on your logic.

Reorganize code and publish to pip

As stated here:
"If only this were also available on pip, I would have known about it! This is simple. Then anyone could easily install it and use the software, as in "s = sanskritmark.analyser(ot,split=False)". Anyway, I will follow up in the issues section."

I can collaborate on this as needed. Please share your thoughts.

Running sanskritmark raises error

Running the script causes an error
Line 518, which is currently:
if i % 2 == 0 and i != len(dat): # Even members of datum are the words and odd members are word boundaries. Therefore, processing only even members.
should instead have:
... i != (len(dat)-1)
to skip the last element, which is empty. As written, i never reaches len(dat), so the comparison is always true and excludes nothing.

(I could have forked, fixed and created a pull request, but it seems too much of an overhead for such a small fix. My apologies).
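The off-by-one can be reproduced in isolation. How sanskritmark.py builds dat is not shown in this issue; the sketch assumes a capturing re.split on whitespace, which leaves words at even indices, separators at odd indices, and an empty string at the end when the text ends with a separator. The function name words_only is hypothetical.

```python
import re


def words_only(text):
    """Collect the even-indexed items of a capturing split, skipping the last one."""
    # Assumption: the capturing group keeps the separators in the result.
    dat = re.split(r"(\s+)", text)
    words = []
    for i in range(len(dat)):
        # i only runs up to len(dat) - 1, so `i != len(dat)` would always be
        # true; comparing against the last index is what skips the final item.
        if i % 2 == 0 and i != len(dat) - 1:
            words.append(dat[i])
    return words
```

Note that this check drops the final item unconditionally, so a text that does not end with a separator loses its last word. Testing dat[i] != "" instead of the index would avoid that.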

speedup - check leakage

The script is painfully slow.
The results are good, but it is too slow to be of practical use.
Check for time or memory leaks and fix them.
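One way to start on this is to measure a single run with the standard-library profilers. The sketch below wraps any callable with cProfile for timing and tracemalloc for peak memory. The name profile_call is hypothetical, and nothing here is specific to sanskritmark.py.

```python
import cProfile
import io
import pstats
import tracemalloc


def profile_call(func, *args, top=10, **kwargs):
    """Run func once; return (result, timing report, peak traced bytes)."""
    profiler = cProfile.Profile()
    tracemalloc.start()
    try:
        result = profiler.runcall(func, *args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    stream = io.StringIO()
    # Sort by cumulative time so the slowest call chains appear first.
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue(), peak
```

If the script can be imported as a module, something like profile_call(sanskritmark.convertfromfile, 'sanskritinput.txt', 'analysedoutput.txt') would show where the time goes. Both profilers add overhead, so the absolute numbers are only a rough guide.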