
inriaxmlwrapper's Introduction

What to expect

This code analyses a given Sanskrit text and reports the possible morphological analyses of each word form. For example, किन्तु is analysed as किन्तु(किन्तु-अव्ययम्-क्रियाविशेषणम्), शृगालः as शृगालः(शृगाल-प्रथमाविभक्तिः-एकवचनम्-पुंल्लिङ्गम्) and चिन्तयेत् as चिन्तयेत्(चिन्त्-प्राथमिकः-विधिलिङ्-कर्तरि-एकवचनम्-प्रथमपुरुषः).

Requirements

  1. Python 2.7

  2. lxml

Installation and Usage

  1. Download ZIP from the repository.

  2. Extract the content to your favourite folder.

  3. Put the text you want to analyse in sanskritinput.txt.

  4. Open Terminal / cmd.exe.

  5. cd to your folder.

  6. Type python sanskritmark.py and press Enter.

  7. When execution finishes, check analysedoutput.txt for the analysis of the text.

Limitations

Currently, support for sandhi and samAsas is quite primitive. We are working on improving it.

Keeping updated

If you want to update your database (for example, when Gerard updates his list), download the data from http://sanskrit.inria.fr/DATA/XML/SL_morph.tar.gz

Put the extracted XML data into the code directory, overwriting any pre-existing XML files. (SL_parts.xml is bigger than what GitHub allows.)
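The update step can be sketched in Python as follows. This is only an illustration, not part of the repository: the helper name extract_xml_data is hypothetical, and it assumes the archive has already been downloaded by hand. Because the internal layout of the archive is not described here, the sketch copies every regular file into the code directory under its base name, overwriting older copies.

```python
import os
import tarfile


def extract_xml_data(archive_path, code_dir):
    """Copy every regular file in the .tar.gz archive into code_dir.

    Hypothetical helper: any sub-directories inside the archive are
    flattened, and existing files with the same name are overwritten.
    """
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            name = os.path.basename(member.name)
            if not member.isfile() or not name:
                continue  # skip directories, links and other special entries
            source = archive.extractfile(member)
            with open(os.path.join(code_dir, name), "wb") as target:
                target.write(source.read())
            extracted.append(name)
    return sorted(extracted)


# Example call, assuming SL_morph.tar.gz sits in the current directory:
#     extract_xml_data("SL_morph.tar.gz", ".")
```

Using only the base name of each member also keeps the archive from writing outside the code directory.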

Programs in repository

  1. sanskritmark.py is the current code under development.

  2. SL_adverbs.xml, SL_final.xml, SL_morph.dtd, SL_nouns.xml, SL_parts.xml, SL_preverbs.txt, SL_pronouns.xml and SL_roots.xml are files taken from Gerard's database.

  3. suffixentryfile.py generates a data entry template for various parameters in sanskritmark.py.

  4. inriaxmlparser.py was the primitive version of the code. It is now abandoned.

inriaxmlwrapper's People

Contributors

drdhaval2785, kmadathil


inriaxmlwrapper's Issues

findrootword hangs

The function findrootword takes exorbitantly long; it effectively freezes the PC.
It is not working properly, although it used to work reasonably well earlier.
Please have another look at it.

Error in Building

Traceback (most recent call last):
  File "sanskritmark.py", line 26, in <module>
    upasargas = etree.parse('SL_upasargas.xml')
  File "src/lxml/lxml.etree.pyx", line 3442, in lxml.etree.parse (src/lxml/lxml.etree.c:81701)
  File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:118620)
  File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118967)
  File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117879)
  File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:112425)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105881)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107589)
  File "src/lxml/parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:106400)
IOError: Error reading file 'SL_upasargas.xml': failed to load external entity "SL_upasargas.xml"
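The traceback above comes from lxml being handed a file name that is not present on disk. A pre-flight check along the following lines could report missing files with a clearer message before any parsing starts. This is a sketch only: missing_data_files is a hypothetical helper, not a function in sanskritmark.py, and the default list simply repeats the data files named in the README.

```python
import os

# Data files as listed under "Programs in repository" in the README.
DATA_FILES = [
    "SL_adverbs.xml", "SL_final.xml", "SL_morph.dtd", "SL_nouns.xml",
    "SL_parts.xml", "SL_preverbs.txt", "SL_pronouns.xml", "SL_roots.xml",
]


def missing_data_files(directory, names=DATA_FILES):
    """Return the names from `names` that are not regular files in `directory`."""
    return [name for name in names
            if not os.path.isfile(os.path.join(directory, name))]


# Example: check for the file the script tried to open.
#     missing_data_files(".", ["SL_upasargas.xml"])
```

An empty list means every named file was found; anything else names the files to fetch or rename.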

Reusing this for kmadathil/sanksrit_parser

@drdhaval2785

Is this under active development? I would like to reuse if possible.

I see the following issues:

  1. upasargas = etree.parse('SL_upasargas.xml')
    throws an error because no such file exists. There is an SL_preverbs.txt.
  2. The script cannot be imported directly as a module. A simple modification would be to change the last line to:
if __name__ == "__main__":
    convertfromfile('sanskritinput.txt','analysedoutput.txt')

If you're not intending to support this, I'll reimplement based on your logic.

Reorganize code and publish to pip

As stated here:
"If only this too had been on pip, I would have known about it! This is simple. Then anyone could easily install it and use the software, as in "s = sanskritmark.analyser(ot,split=False)". Anyway, I will follow up in the issues area."

I can help with this as needed. Please let me know your thoughts.

Running sanskritmark raises error

Running the script raises an error.
Line 518 currently reads:
if i % 2 == 0 and i != len(dat): # Even members of datum are the words and odd members are word boundaries. Therefore, processing only even members.
The condition should instead be:
... i != (len(dat)-1)
so that the last element, which is empty, is skipped. As written, the check has no effect because i never reaches len(dat).

(I could have forked, fixed and created a pull request, but it seems too much of an overhead for such a small fix. My apologies).
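The off-by-one can be reproduced in isolation. The sketch below assumes that dat is built by splitting the text on a captured delimiter, which is one common way to get words at even indices and boundaries at odd indices; the actual construction in sanskritmark.py may differ, and all function names here are hypothetical.

```python
import re


def split_with_boundaries(text):
    # The capturing group makes re.split keep the delimiters, so words land
    # on even indices and boundaries on odd indices (assumed layout).
    return re.split(r"([ .])", text)


def words_original(dat):
    # Condition as reported in the issue: i never equals len(dat),
    # so the second test filters nothing.
    return [dat[i] for i in range(len(dat)) if i % 2 == 0 and i != len(dat)]


def words_fixed(dat):
    # Proposed condition: skip the final element.
    return [dat[i] for i in range(len(dat)) if i % 2 == 0 and i != len(dat) - 1]


dat = split_with_boundaries("rAmaH vanaM gacCati.")
# dat == ['rAmaH', ' ', 'vanaM', ' ', 'gacCati', '.', '']
# words_original(dat) == ['rAmaH', 'vanaM', 'gacCati', '']
# words_fixed(dat)    == ['rAmaH', 'vanaM', 'gacCati']
```

Note that the proposed condition drops the last element unconditionally, so under this assumed layout it is only safe when the input always ends with a boundary character.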

speedup - check leakage

The script is painfully slow.
The results are good, but it is too slow to be of practical use.
Check for time or memory leaks and fix them.
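One way to locate the slow spots is to run the analysis under the standard library profiler. The sketch below is a generic wrapper, not code from the repository: profile_call is a hypothetical name, and pointing it at convertfromfile (the function mentioned in another issue) is an assumption about how the script would be invoked. It measures time only; memory use would need a separate tool.

```python
import cProfile
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, stats).

    The stats are sorted by cumulative time, so the most expensive
    call chains appear first when printed.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    return result, stats


# Assumed usage against the script's entry point:
#     import sanskritmark
#     _, stats = profile_call(sanskritmark.convertfromfile,
#                             'sanskritinput.txt', 'analysedoutput.txt')
#     stats.print_stats(10)   # show the ten most expensive entries
```

If one function such as findrootword dominates the cumulative column, that is the place to look first.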
