kmadathil / sanskrit_parser Goto Github PK

View Code? Open in Web Editor NEW

68.0 68.0 21.0 113.44 MB

Parsers for Sanskrit / संस्कृतम्

License: MIT License

Python 99.95% Shell 0.05%

sanskrit_parser's People

Contributors

Stargazers

Watchers

sanskrit_parser's Issues

Language support

It would be cool if the final model produced is easily usable from the following languages in that order of preference:

java/ scala
python
javascript

Module contracts

Opening a separate issue to talk about this, since it was being lost in the overgeneration discussion:

घटकानां मध्ये व्यवस्था एवं‌ भवेत् इति मम मतम्

L0:
Given an input string, return possible sandhi splits at each location
Given two input strings, return sandhi output(s) - Valid sandhis only.
(We will deal with overgeneration on a case basis for now)
L1:
Given a pada, return all possible lexical tags
L2:
Given a string with or without spaces, return a graph where each pada boundary is a legitimate split as per L0, as well as each pada being lexically valid as per L1
L3:
Given a lexical graph from L2, output paths that have valid morphologies, ordered (optionally) by DCS frequencies(?)

केचन अपवादा सन्ति येषां‌ चिन्ता करणीया |‌ अत्रैव कुर्वः

यत्रसन्धिविच्छेदं कर्तुम् lexical वा morphological विज्ञाप्तिः आवश्यका (उद: पुरस्करोति)
षत्वं णत्वं च

Sort has to be according to least splits

# Sort by descending order longest string in split
ps.sort(key=lambda x:max(map(len,x)))
ps.reverse()

has to be replaced by

# Sort by descending order of items in split
ps.sort(key=lambda x:len(x))

[[u'pArvatI', u'maha', u'indrayos'], [u'pArvatI', u'mahA', u'indrayos'],...........

is much better than

[[u'pArvatI', u'imas', u'hA', u'indrayos'], [u'pArvatI', u'imas', u'ha', u'indrayos'], [u'pArvati', u'imas', u'hA', u'indrayos'], [u'pArvati', u'imas', u'ha', u'indrayos'], [u'pArvatI', u'mahA', u'indrayos'], [u'pArvatI', u'maha', u'indrayos'],....

arparse import too slow

import argparse takes nearly three seconds on my computer.

We need only one module from it.

Maybe

from argparse import ArgumentParser

Would be economical and improve speed.

No module named requests

Steps to reproduce:

> pip install sanskrit_parser
> python -m sanskrit_parser.lexical_analyzer.sandhi --split taeva 1
Traceback (most recent call last):                                                                                                                             
  File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main                                                                                       
    "__main__", fname, loader, pkg_name)                                                                                                                       
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code                                                                                                  
    exec code in run_globals                                                                                                                                   
  File "/home/arun/work/hadoop-cluster/projects/geeta/.venv/lib/python2.7/site-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 16, in <module>         
    import requests                                                                                                                                            
ImportError: No module named requests

I think requests needs to be explicitly declared in setup dependencies.

Wrong split for SrIrapi

Sandhi module

(integ)*$ python SanskritLexicalAnalyzer.py --split SrIrapi --input-encoding SLP1
Parsing of XMLs started at 2017-07-16 11:50:23.374043
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 11:50:28.310012
Input String: SrIrapi
Input String in SLP1: SrIrapi
Start Split: 2017-07-16 11:50:34.029952
End DAG generation: 2017-07-16 11:50:34.032363
End pathfinding: 2017-07-16 11:50:34.033762
Splits:
[u'Sri', u'Ira', u'pi']
[u'SrI', u'Ira', u'pi']

Internal splitter:

(integ)*$ python SanskritLexicalAnalyzer.py --split SrIrapi --input-encoding SLP1 --use-internal-sandhi-splitter
Parsing of XMLs started at 2017-07-16 11:50:45.124203
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 11:50:50.126418
Input String: SrIrapi
Input String in SLP1: SrIrapi
Start Split: 2017-07-16 11:50:55.797431
End DAG generation: 2017-07-16 11:50:55.799320
End pathfinding: 2017-07-16 11:50:55.803311
Splits:
[u'SrIs', u'api']
[u'SrI', u'ras', u'pi']
[u'Sri', u'Iras', u'pi']
[u'SrI', u'Iras', u'pi']
[u'Sri', u'Ira', u'pi']
[u'SrI', u'iras', u'pi']
[u'Sri', u'iras', u'pi']
[u'SrI', u'Ira', u'pi']

Reuse potential

Another candidate for reuse - https://github.com/sanskrit/sanskrit

Haven't investigated too much, but appears to have an API for adding sandhi rules that can later be used to split words.

package setup and module import fail: util/data does not exist

>>> from sanskrit_parser.lexical_analyzer.SanskritLexicalAnalyzer import SanskritLexicalAnalyzer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py", line 114, in <module>
    class SanskritLexicalAnalyzer(object):
  File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py", line 120, in SanskritLexicalAnalyzer
    forms  = inriaxmlwrapper.InriaXMLWrapper()
  File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 54, in __init__
    self._load_forms()
  File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 103, in _load_forms
    self._generate_dict()
  File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 75, in _generate_dict
    self._get_files()
  File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 60, in _get_files
    os.mkdir(self.data_cache)
OSError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/data'

@avinashvarna any ideas?

Python 3 support

The readme docs mention support for python 3 as work in progress. Is there an existing branch or fork with this work? Would love to try it out.

शश्छोऽटि - ८.४.६३

Interaction with scunA scu is not correctly implemented. Perhaps a fix like you did for झयो होऽन्यतरस्याम्
where the interaction is captured as well is called for?

(integ)*$ python SanskritLexicalAnalyzer.py --split --input-encoding SLP1 'visfjecCivam'
Parsing of XMLs started at 2017-07-27 12:51:51.187147
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-27 12:51:56.268704
Input String: visfjecCivam

Input String in SLP1: visfjecCivam

Start Split: 2017-07-27 12:52:02.085158
End DAG generation: 2017-07-27 12:52:02.090917
No Valid Splits Found

Add some tests for Morphological Analyzer

Take this up when you feel comfortable with the Morpholgical Analyzer itself. As of now, there are too many ifs and buts with the way it works.

Speed up flattening code

Flattening takes too long for large splits. One possibility is to explore memoization of the flattening code to speed it it up.

Eg: with --no-flatten
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start split: 2017-07-07 11:16:33.724848
End split: 2017-07-07 11:16:33.762777

With flattening, this takes forever.

Sandhi error with ech

akaH savarNe dIrghaH doesn't apply to ech. (aiuN RLk eo~N aiauch ... remember)

So therefore, echo'yavAyAvaH applies even in savarNe echi pare. Sandhi code doesn't handle this right.

$ python SanskritLexicalAnalyzer.py jIvikopaniSadAvaupamye --split
Parsing of XMLs started at 2017-07-07 21:47:29.716387
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-07 21:47:34.759100
Input String: jIvikopaniSadAvaupamye
Input String in SLP1: jIvikopanizadAvOpamye
Start split: 2017-07-07 21:47:36.786932
End split: 2017-07-07 21:47:36.793918
[[u'jIvikA', u'upanizat', u'Ava', u'Opamye'], [u'jIvikA', u'upanizadA', u'vA', u'Opamye'], [u'jIvikA', u'upanizadA', u'avO', u'Opamye'], [u'jIvikA', u'upanizadA', u'ava', u'Opamye'],

८ . ४ . ६२ झयो होऽन्यतरस्याम् not implemented in sandhi?

(integ)*$ python SanskritLexicalAnalyzer.py --split --input-encoding SLP1 aBavadDaraH
Parsing of XMLs started at 2017-07-26 14:37:56.127633
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-26 14:38:01.052023
Input String: aBavadDaraH
Input String in SLP1: aBavadDaraH
Start Split: 2017-07-26 14:38:06.801279
End DAG generation: 2017-07-26 14:38:06.803579
End pathfinding: 2017-07-26 14:38:06.804081
Splits:
[u'aBavat', u'Daras']

Update setup.py to indicate python 3 comaptibility

python docs + sphynx + readthedocs.io + github pages

#29 (comment) इत्यनेन प्रेरितमिदम्।

सर्वदैव टिप्पणयो यान्त्रान्त्रस्थाः स्युः - अन्यता यन्त्र-टिप्पण्योर् भेदो जायेत वर्धेत च। तदर्थम् अस्ति सरलो मार्गः कश्चन।

python docstring-इत्येतेषाम् स्थापनम्। विशिष्य, https://github.com/kmadathil/sanskrit_parser/blob/integ/README.md इत्यत्र ये सूचनाः दत्ताः, ताः तत्तद्यन्त्रादेशान्त्रभागेषु सन्निविष्टाः स्युः।
ततः sphynx-यन्त्रस्योपयोगेन विवरण-पत्त्र-जननम्। https://docs.readthedocs.io/en/latest/getting_started.html#in-rst इति तत्परिचयो दत्तः।
ततः readthedocs.io इत्यस्य https://github.com/kmadathil/sanskrit_parser इत्यनेन सम्पर्कस्य स्थापनम्।
तथा विकल्पेन github pages इत्यस्य व्यवस्था, येन readthedocs.io इत्यस्याभावे शक्नुयाम टिप्पणिप्रकाशनम् साधयितुम्।

ऊर्ध्वं प्रथम सोपानम् साधयत। ततः सोपानान्तरारोहणे निर्विघ्नतासिद्ध्यर्थं स्वसम्पादितानुभव उपकुर्यात्।

Use SLP1 for internal representation

Slp1 offers one to one correspondence, which avoids many complications later on. One example is प्रउग. HK will have prauga. It is ambiguous. Can be प्रौग / प्रउग.

Better L1 candidate/reuse ?

Came across this project which seems to have identical goals - https://github.com/sanskrit/sanskrit. (Hasn't been updated in 2 years though. I seem to recall the author saying on the sanskrit-programmers list that he has moved on to meditation, etc.)

Still going through the code, but it seems to have features such as trying to see if a stem in the db could have produced the given word by looking at the sup form, etc. We could test it to see if it recognizes more forms compared to the INRIA db, since it combines words from MW, INRIA and Learnsanskrit.org db.

Remove old references/branches from repo

Looking at the network graph of our repo it appears that the following old refs/branches are not needed as they have been merged with master:
origin/dag
dag-nx
integ
If there are no objections, I will remove these branches from the repo, just to reduce clutter.

Implement a better Sandhi class

Assigning to @avinashvarna

Things I'd like to see

A class rather than a function
Automated scheme for defining sandhi patterns and generating methods to do the work
Automated scheme for deriving sandhi reversal from forward sandhi

We have basic sandhi reversal working, but we should do this properly.

Separate out self-contained modules for reuse

-- From vvasuki

Something to keep in mind - Something we learned from past experience is that it is best to separate out self contained modules and publish them in pip (which is very simple - you've got the indic transliteration module as an example). This will encourage reuse like nothing else.

Speedup lexical lookup using an O(1) datastructure

Currently, the inriaxmlwrapper code (which we use for lexical lookup of forms), reads in XML, and does XPATH search each time when a form is queried.

To speed this up, we could use a suitable datastructure to store the data read from XML. It would speed up the search pretty significantly compared to the current XPATH search. We could convert from XML to TRIE in python, pickle it, and load the pickled version to save conversion time.

Wrong split for SvaSrUrBUtvA

(integ)*$ python SanskritLexicalAnalyzer.py --split SvaSrUrBUtvA --input-encoding SLP1
Parsing of XMLs started at 2017-07-16 12:04:46.552213
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 12:04:51.483170
Input String: SvaSrUrBUtvA
Input String in SLP1: SvaSrUrBUtvA
Start Split: 2017-07-16 12:04:57.070620
End DAG generation: 2017-07-16 12:04:57.074318
No Valid Splits Found

(integ)*$ python SanskritLexicalAnalyzer.py --split SvaSrUrBUtvA --input-encoding SLP1 --use-internal
Parsing of XMLs started at 2017-07-16 12:05:02.730832
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 12:05:07.637170
Input String: SvaSrUrBUtvA
Input String in SLP1: SvaSrUrBUtvA
Start Split: 2017-07-16 12:05:13.299885
End DAG generation: 2017-07-16 12:05:13.302790
End pathfinding: 2017-07-16 12:05:13.307413
Splits:
[u'SvaSrUs', u'BUtvA']
[u'SvaSrUs', u'BU', u'tvA']
[u'SvaSrUs', u'Bu', u'UtvA']

constraint module error

$ python SanskritMorphologicalAnalyzer.py 'astyuttarasyAMdishi'
Traceback (most recent call last):
  File "SanskritMorphologicalAnalyzer.py", line 11, in <module>
    import constraint
ImportError: No module named constraint

Maybe this is needed in the required modules list.

More canonical DAG implementation

Please see the dag branch.

Opening an issue to discuss this implementation, which I will merge to master soon.

getSandhiSplits now returns a SanskritLexicalGraph object, with the splits represented in graph form. Calling findAllPaths on this object returns a flat list of splits.

This is faster than the earlier implementation if you include flattening (now called pathfinding). We can split and find all paths/flatten in astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH in about 30s. I'd like to get it down to less than a second if possible

$ python SanskritLexicalAnalyzer.py astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH --split --print-max 100
Parsing of XMLs started at 2017-07-10 12:20:07.216416
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 12:20:12.179545
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start split: 2017-07-10 12:20:14.445834
End split: 2017-07-10 12:20:14.480862
End pathfinding: 2017-07-10 12:20:52.277015

Reorganize code, publish on pip

put everything in a sanskrit_parser module. Then follow https://github.com/sanskrit-coders/indic_transliteration template to publish to pip. I can do the needful if this sounds good..

Utility: This will let others start using your work.

Test enhancements for sandhi and lexical analyzer

Some of the UoHD tests for these aren't passing. We've seen issues with the files themselves , which could be causing many of the failures. And of course, we're far from having all our issues ironed out.

I've not given enough thought to this, but this is what I have in mind

Create a simple script to run all the UoHD tests, automatically extract the passing ones, and add them to a pass set that's automatically run
Extract the currently failing tests, and move them to a separate fail list so they can be run individually and checked.
The default regression should run all the passing tests from 1 plus the manually created reference tests (which all pass)

good collection of ANN papers for NLP

https://github.com/dennybritz/deeplearning-papernotes/blob/master/README.md

Some of the papers may prove useful to task at hand. Not seen yet.

Use word frequencies for trimming split graphs?

DCS word frequencies have been publicly available for a while now - here and also on couchdb . You might find it useful to pare down possible sandhi-splits etc..

Path reduction ideas

I found it difficult to represent on screen. So scribbled a note by hand.

Sandhi overgeneration

Noted some sandhi overgenerations while working on MorphologicalAnalyzer

Note the cmdline below.

The following incorrect splits are seen:
Split: [asti, uttas, asyAm, diSi]
Split: [asti, uttaras, yAm, diSi]
Split: [asti, ut, taras, yAm, diSi]

Incidentally, Morphology is able to reject all the other lexically valid splits other than the correct one, which is also morphologically valid: [asti, uttarasyAm, diSi]

(morpho)$ python SanskritMorphologicalAnalyzer.py 'astyuttarasyAMdishi'
Input String: astyuttarasyAMdishi
Input String in SLP1: astyuttarasyAMdiSi
Start Split: 2017-08-09 15:04:59.190939
End DAG generation: 2017-08-09 15:04:59.205163
End pathfinding: 2017-08-09 15:04:59.208782
Splits:
Split: [asti, uttarasyAm, diSi]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#2', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#2', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#1', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#1', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
Split: [asti, uttas, asyAm, diSi]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('ayam', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('ayam', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('asi', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('asi', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
Split: [asti, uttara, syAm, diSi]
Split: [asti, uttaras, yAm, diSi]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
Split: [asti, ut, tara, syAm, diSi]
Split: [asti, ut, taras, yAm, diSi]

inriaxmlwrapper.py breaking logging for other modules

@avinashvarna
अन्यस्मिन् शाखायां शोधीकृतमिदं मया |‌ b21df2a इति commit दृष्ट्वा तद् सम्यग् अस्ति चेद् cherry-pick कुरुतात्

Check for reuse potential

--- From Avinash Varna

Some of these sub routines (such as sandhi splitting and lexical validation routines mentioned in Level 1) are available in Dr. Dhaval's projects (e.g. https://github.com/drdhaval2785/inriaxmlwrapper) and Michael Bykov's projects (https://github.com/mbykov). Could leverage them to bootstrap the setup.

Need to handle Natva, Shatva etc

SanskritLexicalAnalyzer currently reports no splits for praNamati, vizIdati etc, because of the Natva, Shatva, etc. whereas it can correctly handle pranamati (even though this is not the correct form). Need to add a layer to undo such retroflexion.

Distinguishing between akarmaka and sakarmaka dhAtus

Nothing in the lexical tags seems to help for this purpose.

The morphological analyzer needs to constrain dvitIyA/karma being picked for an akarmaka dhAtu. For example, this :

Lexical Split: [asti, uttarasyAm, diSi, de, avatA, AtmA]
Valid Morphologies
[(asti, ('as#1', set([kartari, law, ekavacanam, prATamikaH, praTamapuruzaH]))), (uttarasyAm, ('uttara#2', set([strIliNgam, saptamIviBaktiH, ekavacanam]))), (diSi, ('diS#2', set([strIliNgam, saptamIviBaktiH, ekavacanam]))), (de, ('da', set([dvivacanam, dvitIyAviBaktiH, strIliNgam]))), (avatA, ('avat', set([puMlliNgam, tftIyAviBaktiH, kartarivartamAnakfdanta-parasmEpadI, kartari, ekavacanam, prATamikaH, kfdantaH]))), (AtmA, ('Atman', set([puMlliNgam, praTamAviBaktiH, ekavacanam])))]

(karaka/vibhakti are not bijective, that is a different issue for which the morphological analyzer needs extension)

Testing SanskritLexicalAnalyzer

I have added a few manual tests for py.test (in the dag branch, pending merge to master).

How do we get more automated tests? Can we generate them from the DCS scraper output?

https://github.com/sanskrit-coders/dcs-scraper

जश्त्वं +‌ चुत्वम्

python -m sanskrit_parser.lexical_analyzer.sandhi --split kasyacijjantoH 7
Splitting {0} at {1} kasyacijjantoH 7
set([(u'kasyaciC', u'jantoH'), (u'kasyacij', u'jantoH'), (u'kasyacid', u'jantoH'), (u'kasyacic', u'jantoH'), (u'kasyacij', u'dantoH'), (u'kasyaciJ', u'jantoH')])

कस्यचित् +‌ जन्तोः इत्यप्यपेक्ष्यते split

त् - >‌ द् ( झलां ज्शोन्ते / जशि )
द् - >‌ ज् (श्चुना श्चु )
कस्यचिज्जन्तोः

readthedocs - module file docs not being generated.

http://sanskrit-parser.readthedocs.io/en/latest/sanskrit_parser_lexical_analyzer_sandhi.html does not show any info from the docstring. Build seems fine. However local build: cd docs; make html seems to work fine.

Thoughts on L3: morphological constraints

Some thoughts - not aiming to handle all possibilities right away

Input - Top n paths from Lexical Graph output from L2.

Step 1 - Karaka assignments

Find verb / Lakara / vacana / puruSha from syntactic tags. Note kartari/karmaNi/Nic etc.
Locate words with appropriate vibhaktis and assign karakas. Multiple possibilities
apply gender/number constraints on all items tagged with the same karaka
apply constraints on samasa constituents (sequence of >=1 samasa constitutents must end in a subanta, not an avyaya, or tiGanta)
apply constraints on upasargas (must precede dhatu form, can only have certain sequences)

Step 2 - Use a constraint solver to pick paths (with tags) that satisfy all constraints

If this is possible to do using the Lexical Graph output of L2 directly, that might help. That would be sort of a constrained path search. I fear, though that that would make things worse in comparison to working on paths. Perhaps this can be decomposed into a set of graph problems that will simplify things?

Fix module names to match convention

MaheshvaraSutras DhatuWrapper etc.. should be renamed maheshvara_sutras dhatu_wrapper etc. Dont forget the rst files in https://github.com/kmadathil/sanskrit_parser/tree/master/docs ..

Uniform conventions are good - avoid confusions while using code, pares down warning lines in linters so that one can focus on genuine problems etc..

Sandhi quality.

Sandhi is currently backtracked using this technique

         self.sandhi_map = dict([
            ('A',('A_','a_a','a_A','A_a','A_A','As_')),
            ('I',('I_','i_i','i_I','I_i','I_I')),
            ('U',('U_','u_u','u_U','U_u','U_U')),
            ('F',('F_','f_f','f_x','x_f','F_x','x_F','F_F')),
            ('e',('e_','e_a','a_i','a_I','A_i','A_I')),
            ('o',('o_','o_a','a_u','a_U','A_u','A_U','aH_','aH_a','a_s')),
            ('E',('E_','a_e','A_e','a_E','A_E')),
            ('O',('O_','a_o','A_o','a_O','A_O')),
            ('ar',('af','ar')),# FIXME Why is this
            ('d',('t_','d_')),
            ('H',('H_','s_')),
            ('S',('S_','s_','H_')),
            ('M',('m_','M_')),
            ('y',('y_','i_','I_')),
            ('N',('N_','M_')),
            ('Y',('Y_','M_')),
            ('R',('R_','M_')),
            ('n',('n_','M_')),
            ('m',('m_','M_')),
            ('v',('v_','u_','U_')),
            ('r',('r_','s_','H_'))])

if a character in a string matches a key, it's optionally replaced by each of the replacements in the map, split with _ to yield a string to replace the match char, and prepend to the right context, respectively.

This does a good bit, but does not fully and accurately capture all sandhi backtracking. We need a better method.

सहकारित्वे निवेशः

मित्रवर, लघुदोषे कस्मिंश्चिद् दृष्टे सहकर्तुं शक्नुयाम्। तदर्थं क उचितो मार्गो मतः? यद्युचितम् उपारोपणम् अनुमन्यस्व।

A readme is now in order

I saw your impressive speed up.
So just wanted to run SanskritLexicalAnalyzer.py as shown in the comment.
But it gave me error No module named base.SanskritBase

Maybe a line or two of setup would be in order.

Maybe I am not able to setup the directory structure as it should have been.

Double vacanam entry in tags

@drdhaval2785 commented in issue #37

One observation.
There are two ekavacanam in each entry. Some error?

Yes. There were two entries in the mapper mapping np-sg and sg to ekavacanam

DAG generation code creates duplicate nodes

I merged the dag-nx changes into my sandhi branch and while testing, discovered that the DAG generation code is creating duplicate nodes. E.g.
$ python SanskritLexicalAnalyzer.py --split astyuttarasyAmdiSi --input-encoding SLP1
Parsing of XMLs started at 2017-07-11 23:51:34.063000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-11 23:51:39.569000
Input String: astyuttarasyAmdiSi
Input String in SLP1: astyuttarasyAmdiSi
Start Split: 2017-07-11 23:51:46.504000
End DAG generation: 2017-07-11 23:51:46.520000
End pathfinding: 2017-07-11 23:51:46.524000
Found 10 splits
[[u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi']]
and all the paths are really the same.

I think it is the following portion of the code in _possible_splits:
if rdag:
# Make sure we got a graph back
assert isinstance(rdag,SanskritLexicalGraph)
# if there are valid splits of the right side
# Extend splits list with s_c_left appended with possible splits of s_c_right
t = SanskritLexicalGraph(s_c_left,end=False)
The last line will create a new object with the same s_c_left, even if multiple splits have the same left. It is well past midnight my time, so will try to follow it up tomorrow.

Can be reproduced by running the avinashvarna branch with the above inputs. Will write out the graph structure to a graphviz dot file (requires pydotplus) which clearly shows the duplicate nodes. (I've also checked in the .dot file as a sample)

Sort splits in descending order

Usually we want to retain the longest known word e.g.
asti+uttarasyAm+diSi
is much better than
asti+ut+tara+syAm+diSi.

A rough idea would be to arrange in descending order of length of flattened list.

This will ensure that more probable splits show up in ranking.

Issues link doesn't work on PyPI

... because README.md uses a relative link, which only works on github.

Mahesvarasutras returns extra "a" with vyanjanas

$ python MaheshvaraSutras.py --pratyahara hal --varna a
aiuR fxk eoN EOc hayavaraw laR YamaNaRanam JaBaY GaQaDaz jabagaqadaS KaPaCaWaTacawatav kapay Sazasar hal
हल्
हयवरलञमङणनझभघढधजबगडदखफछठथचटतकपशषसह
Is अ in हल्?
True

Sandhi module not returning the right split?

Can't find the correct split below:
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'AtmA']

(branch dag-nx - I have integrated sandhi.py, please use the --use-sandhi-module switch)

python SanskritLexicalAnalyzer.py --split astyuttarasyAmdiSidevatAtmA --input-encoding SLP1 --use-sandhi-module --max-paths 100
Parsing of XMLs started at 2017-07-12 15:15:11.121289
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-12 15:15:16.045961
Input String: astyuttarasyAmdiSidevatAtmA
Input String in SLP1: astyuttarasyAmdiSidevatAtmA
Start Split: 2017-07-12 15:15:21.480729
End DAG generation: 2017-07-12 15:15:21.534660
End pathfinding: 2017-07-12 15:15:21.625139
[[u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devatA', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avata', u'AtmA']]

error in debug mode

$ python lexical_analyzer/SanskritLexicalAnalyzer.py astyuttarasyAmdishi --spli
t --debug
Parsing of XMLs started at 2017-07-06 21:23:52.769000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-06 21:23:58.202000
Input String: astyuttarasyAmdishi
Input String in SLP1: astyuttarasyAmdiSi
Start split: 2017-07-06 21:24:01.603000
Splitting  astyuttarasyAmdiSi
Left, Right substrings = a styuttarasyAmdiSi
s_c_list: [[u'a', u'styuttarasyAmdiSi']]
Invalid left word:  a
Left, Right substrings = as tyuttarasyAmdiSi
Context Sandhi match: (None, 's', '[tTkKpP]') as tyuttarasyAmdiSi
Trying: s_
Trying: r_
s_c_list: [[u'as', u'tyuttarasyAmdiSi'], [u'ar', u'tyuttarasyAmdiSi']]
Invalid left word:  as
Invalid left word:  ar
Left, Right substrings = ast yuttarasyAmdiSi
s_c_list: [[u'ast', u'yuttarasyAmdiSi']]
Invalid left word:  ast
Left, Right substrings = asty uttarasyAmdiSi
Context Sandhi match: (None, 'y', '[aAuUeEoO]') asty uttarasyAmdiSi
Trying: i_
Trying: I_
s_c_list: [[u'asty', u'uttarasyAmdiSi'], [u'asti', u'uttarasyAmdiSi'], [u'astI', u'uttarasyAmdiSi']]
Invalid left word:  asty
Valid left split:  asti
Traceback (most recent call last):
  File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 406, in <module>
    main()
  File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 402, in main
    splits=s.getSandhiSplits(i,sort=not args.no_sort,debug=args.debug)
  File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 246, in getSandhiSplits
    ps = self._possible_splits(s,debug)
  File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 338, in _possible_splits
    print "Valid left split: ", s_c_left, self.tag_cache[s_c_left]
KeyError: u'asti'

The normal mode functions as expected.

kmadathil / sanskrit_parser Goto Github PK

sanskrit_parser's People

Contributors

Stargazers

Watchers

Forkers

sanskrit_parser's Issues

Recommend Projects

Recommend Topics

Recommend Org