kmadathil / sanskrit_parser
Parsers for Sanskrit / संस्कृतम्
License: MIT License
It would be cool if the final model produced were easily usable from the following languages, in that order of preference:
Opening a separate issue to talk about this, since it was being lost in the overgeneration discussion:
My opinion is that the arrangement among the components should be as follows:
L0:
Given an input string, return possible sandhi splits at each location
Given two input strings, return sandhi output(s) - Valid sandhis only.
(We will deal with overgeneration on a case basis for now)
L1:
Given a pada, return all possible lexical tags
L2:
Given a string with or without spaces, return a graph in which each pada boundary is a legitimate split as per L0, and each pada is lexically valid as per L1
L3:
Given a lexical graph from L2, output paths that have valid morphologies, ordered (optionally) by DCS frequencies(?)
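The layering above could be sketched as a stack of functions, one per level. This is a hypothetical illustration only; the function names, the stub lexicon, and the flat-list stand-in for the L2 graph are all invented here, not the project's actual interface.

```python
# Hypothetical sketch of the L0-L2 layering; names and stubs are illustrative only.

def l0_sandhi_splits(s, i):
    """L0: possible (left, right) sandhi splits of s at position i (trivial stub)."""
    return [(s[:i], s[i:])]  # a real implementation would apply sandhi rules here

def l1_lexical_tags(pada):
    """L1: all possible lexical tags for a pada (stub lookup in a toy lexicon)."""
    demo_lexicon = {"asti": ["verb"], "uttarasyAm": ["noun"], "diSi": ["noun"]}
    return demo_lexicon.get(pada, [])

def l2_lexical_graph(s):
    """L2: splits where every pada is valid per L1 (flat list instead of a graph)."""
    results = []
    for i in range(1, len(s)):
        for left, right in l0_sandhi_splits(s, i):
            if l1_lexical_tags(left) and l1_lexical_tags(right):
                results.append([left, right])
    return results
```

L3 would then walk the L2 graph and keep only morphologically valid paths.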
There are some exceptions that need to be thought about. Let us handle them right here.
# Sort by descending order of the longest string in the split
ps.sort(key=lambda x: max(map(len, x)))
ps.reverse()
has to be replaced by
# Sort by ascending number of items in the split (fewer, longer padas first)
ps.sort(key=lambda x: len(x))
[[u'pArvatI', u'maha', u'indrayos'], [u'pArvatI', u'mahA', u'indrayos'],...........
is much better than
[[u'pArvatI', u'imas', u'hA', u'indrayos'], [u'pArvatI', u'imas', u'ha', u'indrayos'], [u'pArvati', u'imas', u'hA', u'indrayos'], [u'pArvati', u'imas', u'ha', u'indrayos'], [u'pArvatI', u'mahA', u'indrayos'], [u'pArvatI', u'maha', u'indrayos'],....
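The preference can be demonstrated directly: sorting by the number of items in each split ranks splits with fewer (and hence longer) padas first, which are usually the more plausible ones. A minimal, self-contained demonstration:

```python
# Demonstration: sorting splits by item count puts the
# fewer/longer-pada splits (usually more plausible) first.
splits = [
    [u'pArvatI', u'imas', u'hA', u'indrayos'],
    [u'pArvatI', u'mahA', u'indrayos'],
    [u'pArvatI', u'maha', u'indrayos'],
]
splits.sort(key=lambda x: len(x))  # ascending number of padas
print(splits[0])  # a 3-pada split now comes first
```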
import argparse takes nearly three seconds on my computer.
We need only one class from it.
Maybe
from argparse import ArgumentParser
would be more economical and improve speed.
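For what it's worth, `from argparse import ArgumentParser` still executes the full module import the first time it runs, so if import time is the real concern, another option is deferring the import into the CLI entry point so that importing the package as a library never pays the cost. A minimal sketch (the `--split` flag is just a placeholder):

```python
# Sketch: defer the argparse import into main() so that importing this
# module as a library does not trigger the argparse import at all.
def main():
    from argparse import ArgumentParser  # imported only when the CLI actually runs
    parser = ArgumentParser(description="demo")
    parser.add_argument("--split", action="store_true")
    return parser.parse_args([])  # empty argv here, for demonstration

if __name__ == "__main__":
    print(main())
```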
Steps to reproduce:
> pip install sanskrit_parser
> python -m sanskrit_parser.lexical_analyzer.sandhi --split taeva 1
Traceback (most recent call last):
File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/arun/work/hadoop-cluster/projects/geeta/.venv/lib/python2.7/site-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 16, in <module>
import requests
ImportError: No module named requests
I think requests needs to be explicitly declared in the setup dependencies.
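A sketch of what that declaration might look like in setup.py; the package metadata below is illustrative, only the `install_requires` entry is the point:

```python
# Sketch: declare the runtime dependency in setup.py so that
# "pip install sanskrit_parser" pulls in requests automatically.
from setuptools import setup, find_packages

setup(
    name="sanskrit_parser",
    packages=find_packages(),
    install_requires=[
        "requests",  # imported by util/inriaxmlwrapper.py
    ],
)
```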
Sandhi module:
(integ)*$ python SanskritLexicalAnalyzer.py --split SrIrapi --input-encoding SLP1
Parsing of XMLs started at 2017-07-16 11:50:23.374043
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 11:50:28.310012
Input String: SrIrapi
Input String in SLP1: SrIrapi
Start Split: 2017-07-16 11:50:34.029952
End DAG generation: 2017-07-16 11:50:34.032363
End pathfinding: 2017-07-16 11:50:34.033762
Splits:
[u'Sri', u'Ira', u'pi']
[u'SrI', u'Ira', u'pi']
Internal splitter:
(integ)*$ python SanskritLexicalAnalyzer.py --split SrIrapi --input-encoding SLP1 --use-internal-sandhi-splitter
Parsing of XMLs started at 2017-07-16 11:50:45.124203
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 11:50:50.126418
Input String: SrIrapi
Input String in SLP1: SrIrapi
Start Split: 2017-07-16 11:50:55.797431
End DAG generation: 2017-07-16 11:50:55.799320
End pathfinding: 2017-07-16 11:50:55.803311
Splits:
[u'SrIs', u'api']
[u'SrI', u'ras', u'pi']
[u'Sri', u'Iras', u'pi']
[u'SrI', u'Iras', u'pi']
[u'Sri', u'Ira', u'pi']
[u'SrI', u'iras', u'pi']
[u'Sri', u'iras', u'pi']
[u'SrI', u'Ira', u'pi']
Another candidate for reuse - https://github.com/sanskrit/sanskrit
Haven't investigated too much, but appears to have an API for adding sandhi rules that can later be used to split words.
>>> from sanskrit_parser.lexical_analyzer.SanskritLexicalAnalyzer import SanskritLexicalAnalyzer
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py", line 114, in <module>
class SanskritLexicalAnalyzer(object):
File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py", line 120, in SanskritLexicalAnalyzer
forms = inriaxmlwrapper.InriaXMLWrapper()
File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 54, in __init__
self._load_forms()
File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 103, in _load_forms
self._generate_dict()
File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 75, in _generate_dict
self._get_files()
File "/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/inriaxmlwrapper.py", line 60, in _get_files
os.mkdir(self.data_cache)
OSError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/sanskrit_parser/util/data'
@avinashvarna any ideas?
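One possible direction (a sketch, not the project's actual code): write the data cache to a per-user directory such as `~/.cache` instead of inside site-packages, which is often read-only for non-root users. The function name and the `XDG_CACHE_HOME` fallback logic here are assumptions for illustration:

```python
import os

def get_cache_dir(app="sanskrit_parser"):
    """Return a writable per-user cache directory, creating it if needed.

    Sketch only: honors XDG_CACHE_HOME, falling back to ~/.cache,
    rather than writing next to the installed package.
    """
    base = os.environ.get("XDG_CACHE_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache")
    path = os.path.join(base, app)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path
```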
The readme docs mention support for python 3 as work in progress. Is there an existing branch or fork with this work? Would love to try it out.
The interaction with ScunA Scu (श्चुना श्चुः) is not correctly implemented. Perhaps a fix like the one for झयो होऽन्यतरस्याम्, where the interaction is captured as well, is called for?
(integ)*$ python SanskritLexicalAnalyzer.py --split --input-encoding SLP1 'visfjecCivam'
Parsing of XMLs started at 2017-07-27 12:51:51.187147
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-27 12:51:56.268704
Input String: visfjecCivam
Input String in SLP1: visfjecCivam
Start Split: 2017-07-27 12:52:02.085158
End DAG generation: 2017-07-27 12:52:02.090917
No Valid Splits Found
Take this up when you feel comfortable with the Morphological Analyzer itself. As of now, there are too many ifs and buts in the way it works.
Flattening takes too long for large splits. One possibility is to explore memoization of the flattening code to speed it up.
E.g., with --no-flatten:
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start split: 2017-07-07 11:16:33.724848
End split: 2017-07-07 11:16:33.762777
With flattening, this takes forever.
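The memoization idea can be sketched as follows: since many nodes in the split DAG are shared between paths, caching the set of paths out of each node means each node is expanded only once. The node names and adjacency dict below are invented for illustration, not the analyzer's real structures:

```python
from functools import lru_cache

# Illustrative split DAG: each node maps to the nodes that can follow it.
GRAPH = {
    "start": ("asti",),
    "asti": ("uttarasyAm", "ut"),
    "ut": ("tarasyAm",),
    "uttarasyAm": ("diSi",),
    "tarasyAm": ("diSi",),
    "diSi": (),  # sink
}

@lru_cache(maxsize=None)
def paths_from(node):
    """All paths from node to a sink, computed once per node (memoized)."""
    if not GRAPH[node]:
        return ((node,),)
    return tuple((node,) + rest
                 for nxt in GRAPH[node]
                 for rest in paths_from(nxt))
```

With memoization, flattening cost becomes proportional to the number of DAG edges plus output size, instead of re-walking shared suffixes for every path.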
akaH savarNe dIrghaH doesn't apply to ech. (aiuN RLk eo~N aiauch ... remember)
Therefore, echo'yavAyAvaH applies even in savarNe echi pare. The sandhi code doesn't handle this correctly.
$ python SanskritLexicalAnalyzer.py jIvikopaniSadAvaupamye --split
Parsing of XMLs started at 2017-07-07 21:47:29.716387
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-07 21:47:34.759100
Input String: jIvikopaniSadAvaupamye
Input String in SLP1: jIvikopanizadAvOpamye
Start split: 2017-07-07 21:47:36.786932
End split: 2017-07-07 21:47:36.793918
[[u'jIvikA', u'upanizat', u'Ava', u'Opamye'], [u'jIvikA', u'upanizadA', u'vA', u'Opamye'], [u'jIvikA', u'upanizadA', u'avO', u'Opamye'], [u'jIvikA', u'upanizadA', u'ava', u'Opamye'],
(integ)*$ python SanskritLexicalAnalyzer.py --split --input-encoding SLP1 aBavadDaraH
Parsing of XMLs started at 2017-07-26 14:37:56.127633
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-26 14:38:01.052023
Input String: aBavadDaraH
Input String in SLP1: aBavadDaraH
Start Split: 2017-07-26 14:38:06.801279
End DAG generation: 2017-07-26 14:38:06.803579
End pathfinding: 2017-07-26 14:38:06.804081
Splits:
[u'aBavat', u'Daras']
This was inspired by #29 (comment).
Annotations should always reside within the machinery itself; otherwise the machinery and the annotations will diverge, and the divergence will grow. There is a simple way to achieve this.
First accomplish the first step above. Then the experience gained should help make climbing the subsequent steps go smoothly.
SLP1 offers a one-to-one correspondence, which avoids many complications later on. One example is प्रउग: HK renders it prauga, which is ambiguous; it could be प्रौग or प्रउग.
Came across this project which seems to have identical goals - https://github.com/sanskrit/sanskrit. (Hasn't been updated in 2 years though. I seem to recall the author saying on the sanskrit-programmers list that he has moved on to meditation, etc.)
Still going through the code, but it seems to have features such as trying to see if a stem in the db could have produced the given word by looking at the sup form, etc. We could test it to see if it recognizes more forms compared to the INRIA db, since it combines words from MW, INRIA and Learnsanskrit.org db.
Looking at the network graph of our repo it appears that the following old refs/branches are not needed as they have been merged with master:
origin/dag
dag-nx
integ
If there are no objections, I will remove these branches from the repo, just to reduce clutter.
Assigning to @avinashvarna
Things I'd like to see
We have basic sandhi reversal working, but we should do this properly.
-- From vvasuki
Something to keep in mind, which we learned from past experience: it is best to separate out self-contained modules and publish them on pip (which is very simple; the indic transliteration module is an example). This will encourage reuse like nothing else.
Currently, the inriaxmlwrapper code (which we use for lexical lookup of forms) reads in the XML and does an XPath search each time a form is queried.
To speed this up, we could use a suitable data structure to store the data read from the XML; it would speed up searches significantly compared to the current XPath approach. We could convert the XML to a trie in Python, pickle it, and load the pickled version to save conversion time.
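A minimal sketch of the trie-plus-pickle idea, assuming the forms arrive as (form, tags) pairs; the nested-dict trie and the "$" end-of-word marker are illustrative choices, and `pickle.dumps`/`loads` stands in for saving to and loading from a cache file on disk:

```python
import pickle

# Sketch: nested-dict trie for form lookup, pickled once and reloaded,
# instead of an XPath search over the XML per query.
def build_trie(forms):
    trie = {}
    for form, tags in forms:
        node = trie
        for ch in form:
            node = node.setdefault(ch, {})
        node["$"] = tags  # "$" marks end-of-word and holds the tag payload
    return trie

def lookup(trie, form):
    node = trie
    for ch in form:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

trie = build_trie([("asti", ["verb"]), ("astra", ["noun"])])
cached = pickle.loads(pickle.dumps(trie))  # stand-in for save/load via a file
```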
(integ)*$ python SanskritLexicalAnalyzer.py --split SvaSrUrBUtvA --input-encoding SLP1
Parsing of XMLs started at 2017-07-16 12:04:46.552213
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 12:04:51.483170
Input String: SvaSrUrBUtvA
Input String in SLP1: SvaSrUrBUtvA
Start Split: 2017-07-16 12:04:57.070620
End DAG generation: 2017-07-16 12:04:57.074318
No Valid Splits Found
(integ)*$ python SanskritLexicalAnalyzer.py --split SvaSrUrBUtvA --input-encoding SLP1 --use-internal
Parsing of XMLs started at 2017-07-16 12:05:02.730832
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-16 12:05:07.637170
Input String: SvaSrUrBUtvA
Input String in SLP1: SvaSrUrBUtvA
Start Split: 2017-07-16 12:05:13.299885
End DAG generation: 2017-07-16 12:05:13.302790
End pathfinding: 2017-07-16 12:05:13.307413
Splits:
[u'SvaSrUs', u'BUtvA']
[u'SvaSrUs', u'BU', u'tvA']
[u'SvaSrUs', u'Bu', u'UtvA']
$ python SanskritMorphologicalAnalyzer.py 'astyuttarasyAMdishi'
Traceback (most recent call last):
File "SanskritMorphologicalAnalyzer.py", line 11, in <module>
import constraint
ImportError: No module named constraint
Maybe constraint needs to be added to the required modules list.
Please see the dag branch.
Opening an issue to discuss this implementation, which I will merge to master soon.
getSandhiSplits now returns a SanskritLexicalGraph object, with the splits represented in graph form. Calling findAllPaths on this object returns a flat list of splits.
This is faster than the earlier implementation if you include flattening (now called pathfinding). We can split and find all paths/flatten in astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH in about 30s. I'd like to get it down to less than a second if possible
$ python SanskritLexicalAnalyzer.py astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH --split --print-max 100
Parsing of XMLs started at 2017-07-10 12:20:07.216416
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 12:20:12.179545
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start split: 2017-07-10 12:20:14.445834
End split: 2017-07-10 12:20:14.480862
End pathfinding: 2017-07-10 12:20:52.277015
Put everything in a sanskrit_parser module. Then follow the https://github.com/sanskrit-coders/indic_transliteration template to publish to pip. I can do the needful if this sounds good.
Utility: This will let others start using your work.
Some of the UoHD tests for these aren't passing. We've seen issues with the files themselves, which could be causing many of the failures. And of course, we're far from having all our issues ironed out.
I've not given enough thought to this, but this is what I have in mind
https://github.com/dennybritz/deeplearning-papernotes/blob/master/README.md
Some of the papers may prove useful to the task at hand. Haven't gone through them yet.
Noted some sandhi overgenerations while working on MorphologicalAnalyzer
Note the cmdline below.
The following incorrect splits are seen:
Split: [asti, uttas, asyAm, diSi]
Split: [asti, uttaras, yAm, diSi]
Split: [asti, ut, taras, yAm, diSi]
Incidentally, the morphology is able to reject all the lexically valid splits other than the correct one, which is also morphologically valid: [asti, uttarasyAm, diSi]
(morpho)$ python SanskritMorphologicalAnalyzer.py 'astyuttarasyAMdishi'
Input String: astyuttarasyAMdishi
Input String in SLP1: astyuttarasyAMdiSi
Start Split: 2017-08-09 15:04:59.190939
End DAG generation: 2017-08-09 15:04:59.205163
End pathfinding: 2017-08-09 15:04:59.208782
Splits:
Split: [asti, uttarasyAm, diSi]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#2', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#2', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#1', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttarasyAm, ('uttara#1', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
Split: [asti, uttas, asyAm, diSi]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('ayam', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('ayam', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('asi', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([dvitIyAviBaktiH, napuMsakaliNgam, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttas, ('utta', set([prATamikaH, ekavacanam, karmaRiBUtakfdantaH, kfdantaH, praTamAviBaktiH, puMlliNgam]))), (asyAm, ('asi', set([saptamIviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
Split: [asti, uttara, syAm, diSi]
Split: [asti, uttaras, yAm, diSi]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#2', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('yad', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, napuMsakaliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, puMlliNgam, ekavacanam])))]
[(asti, ('as#1', set([prATamikaH, kartari, ekavacanam, ekavacanam, praTamapuruzaH, law]))), (uttaras, ('uttara#1', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (yAm, ('ya#2', set([dvitIyAviBaktiH, strIliNgam, ekavacanam]))), (diSi, ('diS#2', set([saptamIviBaktiH, strIliNgam, ekavacanam])))]
Split: [asti, ut, tara, syAm, diSi]
Split: [asti, ut, taras, yAm, diSi]
@avinashvarna
I have fixed this in another branch. Please look at commit b21df2a and, if it looks right, cherry-pick it.
--- From Avinash Varna
Some of these sub routines (such as sandhi splitting and lexical validation routines mentioned in Level 1) are available in Dr. Dhaval's projects (e.g. https://github.com/drdhaval2785/inriaxmlwrapper) and Michael Bykov's projects (https://github.com/mbykov). Could leverage them to bootstrap the setup.
SanskritLexicalAnalyzer currently reports no splits for praNamati, vizIdati, etc., because of the Natva, Shatva, etc., whereas it can correctly handle pranamati (even though this is not the correct form). We need to add a layer to undo such retroflexion.
Nothing in the lexical tags seems to help for this purpose.
The morphological analyzer needs to rule out dvitIyA/karma being picked for an akarmaka dhAtu. For example:
Lexical Split: [asti, uttarasyAm, diSi, de, avatA, AtmA]
Valid Morphologies
[(asti, ('as#1', set([kartari, law, ekavacanam, prATamikaH, praTamapuruzaH]))), (uttarasyAm, ('uttara#2', set([strIliNgam, saptamIviBaktiH, ekavacanam]))), (diSi, ('diS#2', set([strIliNgam, saptamIviBaktiH, ekavacanam]))), (de, ('da', set([dvivacanam, dvitIyAviBaktiH, strIliNgam]))), (avatA, ('avat', set([puMlliNgam, tftIyAviBaktiH, kartarivartamAnakfdanta-parasmEpadI, kartari, ekavacanam, prATamikaH, kfdantaH]))), (AtmA, ('Atman', set([puMlliNgam, praTamAviBaktiH, ekavacanam])))]
(The karaka/vibhakti mapping is not bijective; that is a different issue, for which the morphological analyzer needs extension.)
I have added a few manual tests for py.test (in the dag branch, pending merge to master).
How do we get more automated tests? Can we generate them from the DCS scraper output?
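One possible shape for this (a sketch; the tab-separated "sentence, expected padas" format here is invented, since the DCS scraper's actual output format isn't shown): parse the scraped lines into pairs and parametrize py.test over them.

```python
# Sketch: turn scraped "sentence<TAB>pada1 pada2 ..." lines into test cases.
# The input format is an assumption for illustration.
def load_cases(lines):
    cases = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        sentence, split = line.split("\t")
        cases.append((sentence, split.split()))
    return cases

SAMPLE = [
    "astyuttarasyAMdiSi\tasti uttarasyAm diSi",
    "# comment lines are skipped",
]

# With py.test this would become something like:
# @pytest.mark.parametrize("sentence,expected", load_cases(SAMPLE))
# def test_split(sentence, expected):
#     assert expected in analyzer.getSandhiSplits(sentence)
```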
python -m sanskrit_parser.lexical_analyzer.sandhi --split kasyacijjantoH 7
Splitting {0} at {1} kasyacijjantoH 7
set([(u'kasyaciC', u'jantoH'), (u'kasyacij', u'jantoH'), (u'kasyacid', u'jantoH'), (u'kasyacic', u'jantoH'), (u'kasyacij', u'dantoH'), (u'kasyaciJ', u'jantoH')])
The split कस्यचित् + जन्तोः is also expected:
त् -> द् (झलां जशोऽन्ते / जशि)
द् -> ज् (श्चुना श्चुः)
giving कस्यचिज्जन्तोः
http://sanskrit-parser.readthedocs.io/en/latest/sanskrit_parser_lexical_analyzer_sandhi.html does not show any info from the docstring, even though the readthedocs build seems fine. However, a local build (cd docs; make html) works fine.
Some thoughts - not aiming to handle all possibilities right away
Input - Top n paths from Lexical Graph output from L2.
Step 1 - Karaka assignments
Step 2 - Use a constraint solver to pick paths (with tags) that satisfy all constraints
If this is possible to do using the Lexical Graph output of L2 directly, that might help; it would be a sort of constrained path search. I fear, though, that it would make things worse compared to working on paths. Perhaps this can be decomposed into a set of graph problems that will simplify things?
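Step 2 could be prototyped without a dedicated solver by treating each pada's tag alternatives as a variable domain and filtering the cross product. This is a brute-force sketch only: the domains and the single agreement constraint below are toy illustrations, and real karaka constraints would be much richer.

```python
from itertools import product

# Per-pada tag alternatives (illustrative domains, not real analyzer output).
domains = [
    [("asti", "law")],                                   # the verb
    [("uttarasyAm", "saptamI"), ("uttarasyAm", "dvitIyA")],
    [("diSi", "saptamI"), ("diSi", "dvitIyA")],
]

def consistent(assignment):
    # Toy constraint: the two nominals must agree in vibhakti.
    return assignment[1][1] == assignment[2][1]

# Keep only the tag combinations that satisfy every constraint.
solutions = [a for a in product(*domains) if consistent(a)]
```

A real implementation could swap the cross-product filter for a proper constraint solver (the code elsewhere in this project already imports the python-constraint package) once the constraint set grows.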
MaheshvaraSutras, DhatuWrapper, etc. should be renamed maheshvara_sutras, dhatu_wrapper, etc. Don't forget the rst files in https://github.com/kmadathil/sanskrit_parser/tree/master/docs.
Uniform conventions are good: they avoid confusion while using the code, and they pare down warning lines in linters so that one can focus on genuine problems.
Sandhi is currently backtracked using this technique
self.sandhi_map = dict([
('A',('A_','a_a','a_A','A_a','A_A','As_')),
('I',('I_','i_i','i_I','I_i','I_I')),
('U',('U_','u_u','u_U','U_u','U_U')),
('F',('F_','f_f','f_x','x_f','F_x','x_F','F_F')),
('e',('e_','e_a','a_i','a_I','A_i','A_I')),
('o',('o_','o_a','a_u','a_U','A_u','A_U','aH_','aH_a','a_s')),
('E',('E_','a_e','A_e','a_E','A_E')),
('O',('O_','a_o','A_o','a_O','A_O')),
('ar',('af','ar')),# FIXME Why is this
('d',('t_','d_')),
('H',('H_','s_')),
('S',('S_','s_','H_')),
('M',('m_','M_')),
('y',('y_','i_','I_')),
('N',('N_','M_')),
('Y',('Y_','M_')),
('R',('R_','M_')),
('n',('n_','M_')),
('m',('m_','M_')),
('v',('v_','u_','U_')),
('r',('r_','s_','H_'))])
If a character in the string matches a key, it is optionally replaced by each of the replacements in the map; each replacement is split on _ to yield a string that replaces the matched character on the left, and a string that is prepended to the right context, respectively.
This does a good bit, but does not fully and accurately capture all sandhi backtracking. We need a better method.
Dear friend, when I see some small bug, I would like to help fix it. What is the preferred way to do so? If it is acceptable, please permit me to push changes.
I saw your impressive speedup.
So I just wanted to run SanskritLexicalAnalyzer.py as shown in the comment.
But it gave me the error: No module named base.SanskritBase
Maybe a line or two of setup instructions would be in order.
Perhaps I have not set up the directory structure as it should be.
@drdhaval2785 commented in issue #37
One observation.
There are two ekavacanam in each entry. Some error?
Yes. There were two entries in the mapper mapping np-sg and sg to ekavacanam
I merged the dag-nx changes into my sandhi branch and while testing, discovered that the DAG generation code is creating duplicate nodes. E.g.
$ python SanskritLexicalAnalyzer.py --split astyuttarasyAmdiSi --input-encoding SLP1
Parsing of XMLs started at 2017-07-11 23:51:34.063000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-11 23:51:39.569000
Input String: astyuttarasyAmdiSi
Input String in SLP1: astyuttarasyAmdiSi
Start Split: 2017-07-11 23:51:46.504000
End DAG generation: 2017-07-11 23:51:46.520000
End pathfinding: 2017-07-11 23:51:46.524000
Found 10 splits
[[u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttarasyAm', u'diSi']]
and all the paths are really the same.
I think it is the following portion of the code in _possible_splits:
if rdag:
# Make sure we got a graph back
assert isinstance(rdag,SanskritLexicalGraph)
# if there are valid splits of the right side
# Extend splits list with s_c_left appended with possible splits of s_c_right
t = SanskritLexicalGraph(s_c_left,end=False)
The last line will create a new object with the same s_c_left, even if multiple splits have the same left. It is well past midnight my time, so will try to follow it up tomorrow.
Can be reproduced by running the avinashvarna branch with the above inputs. Will write out the graph structure to a graphviz dot file (requires pydotplus) which clearly shows the duplicate nodes. (I've also checked in the .dot file as a sample)
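A common fix for this kind of duplication (a sketch, not the actual patch) is to intern nodes in a cache keyed by the substring, so that every occurrence of the same left string maps to one shared node object instead of constructing a fresh one each time:

```python
# Sketch: a node cache so identical substrings share one graph node.
class Node(object):
    def __init__(self, pada):
        self.pada = pada

_node_cache = {}

def get_node(pada):
    """Return the unique Node for this pada, creating it on first use."""
    if pada not in _node_cache:
        _node_cache[pada] = Node(pada)
    return _node_cache[pada]
```

In the analyzer this cache would live per split invocation, so repeated `s_c_left` values reuse the same node and pathfinding no longer sees duplicate parallel paths.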
Usually we want to retain the longest known word e.g.
asti+uttarasyAm+diSi
is much better than
asti+ut+tara+syAm+diSi.
A rough idea would be to rank splits in descending order of the longest word in the flattened list.
This will ensure that more probable splits show up in ranking.
... because README.md uses a relative link, which only works on GitHub.
$ python MaheshvaraSutras.py --pratyahara hal --varna a
aiuR fxk eoN EOc hayavaraw laR YamaNaRanam JaBaY GaQaDaz jabagaqadaS KaPaCaWaTacawatav kapay Sazasar hal
हल्
हयवरलञमङणनझभघढधजबगडदखफछठथचटतकपशषसह
Is अ in हल्?
True
Can't find the correct split below:
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'AtmA']
(branch dag-nx - I have integrated sandhi.py, please use the --use-sandhi-module switch)
python SanskritLexicalAnalyzer.py --split astyuttarasyAmdiSidevatAtmA --input-encoding SLP1 --use-sandhi-module --max-paths 100
Parsing of XMLs started at 2017-07-12 15:15:11.121289
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-12 15:15:16.045961
Input String: astyuttarasyAmdiSidevatAtmA
Input String in SLP1: astyuttarasyAmdiSidevatAtmA
Start Split: 2017-07-12 15:15:21.480729
End DAG generation: 2017-07-12 15:15:21.534660
End pathfinding: 2017-07-12 15:15:21.625139
[[u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', 
u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', 
u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttas', u'asyAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttaras', u'yAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devat', u'AtmA'], [u'asti', u'uttara', u'syAm', u'diSi', u'devata', u'AtmA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'devatA', u'at', u'mA'], [u'asti', u'uttarasyAm', u'diSi', u'de', u'avata', u'AtmA']]
Error in debug mode:
$ python lexical_analyzer/SanskritLexicalAnalyzer.py astyuttarasyAmdishi --split --debug
Parsing of XMLs started at 2017-07-06 21:23:52.769000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-06 21:23:58.202000
Input String: astyuttarasyAmdishi
Input String in SLP1: astyuttarasyAmdiSi
Start split: 2017-07-06 21:24:01.603000
Splitting astyuttarasyAmdiSi
Left, Right substrings = a styuttarasyAmdiSi
s_c_list: [[u'a', u'styuttarasyAmdiSi']]
Invalid left word: a
Left, Right substrings = as tyuttarasyAmdiSi
Context Sandhi match: (None, 's', '[tTkKpP]') as tyuttarasyAmdiSi
Trying: s_
Trying: r_
s_c_list: [[u'as', u'tyuttarasyAmdiSi'], [u'ar', u'tyuttarasyAmdiSi']]
Invalid left word: as
Invalid left word: ar
Left, Right substrings = ast yuttarasyAmdiSi
s_c_list: [[u'ast', u'yuttarasyAmdiSi']]
Invalid left word: ast
Left, Right substrings = asty uttarasyAmdiSi
Context Sandhi match: (None, 'y', '[aAuUeEoO]') asty uttarasyAmdiSi
Trying: i_
Trying: I_
s_c_list: [[u'asty', u'uttarasyAmdiSi'], [u'asti', u'uttarasyAmdiSi'], [u'astI', u'uttarasyAmdiSi']]
Invalid left word: asty
Valid left split: asti
Traceback (most recent call last):
File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 406, in <module>
main()
File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 402, in main
splits=s.getSandhiSplits(i,sort=not args.no_sort,debug=args.debug)
File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 246, in getSandhiSplits
ps = self._possible_splits(s,debug)
File "lexical_analyzer/SanskritLexicalAnalyzer.py", line 338, in _possible_splits
print "Valid left split: ", s_c_left, self.tag_cache[s_c_left]
KeyError: u'asti'
The normal mode functions as expected.