Noted some sandhi overgenerations while working on MorphologicalAnalyzer


Sandhi overgeneration about sanskrit_parser HOT 11 CLOSED

kmadathil commented on August 15, 2024

Sandhi overgeneration

from sanskrit_parser.

Comments (11)

drdhaval2785 commented on August 15, 2024

One observation.

There are two ekavacanam in each entry. Some error?


avinashvarna commented on August 15, 2024

I think we should clarify the assumptions/contract between these various modules.
Let's say we are trying to split string '012345' at 3. Since we are assuming that we cannot rely on having spaces in our input (as in the example above), the sandhi module is currently implemented to also return '0123' + '45', so as to enable us to correctly recognize words with intervening spaces removed. Even in the case of samAsas, upasarga+dhAtu combinations where there is no explicit sandhi, this would be necessary. E.g. rAmaputraH for samAsa, upagacChati/uttarati for upasarga + dhAtu. This is the reason why we see the following splits:
Split: [asti, uttaras, yAm, diSi]
Split: [asti, ut, taras, yAm, diSi]
(splitting uttarasyAm returns uttaras + yAm as well, which in turn when split returns ut + taras + yAm)
uttaras/taras may not really be valid words, but since the INRIA db stores visarga as 's', these are recognized as valid words and get propagated to higher layers.
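To make the behaviour above concrete, here is a minimal sketch of a splitter that always returns the unmodified left + right pair in addition to any rule-based splits. The function name `candidate_splits` and the rule table `TOY_RULES` are hypothetical illustrations, not the actual sanskrit_parser API or rule set.

```python
# Hypothetical toy rule table (an assumption for this sketch):
# joined substring -> list of (left_ending, right_beginning) pairs it may come from.
TOY_RULES = {
    "yu": [("i", "u")],  # e.g. asti + uttarasyAm -> astyuttarasyAm
}


def candidate_splits(text, pos, rules):
    """Return every (left, right) pair considered when splitting `text` at `pos`."""
    # The "no-sandhi" split is always included, so that words written
    # without intervening spaces (or samAsa members) can still be recovered.
    candidates = {(text[:pos], text[pos:])}
    for joined, pairs in rules.items():
        if text.startswith(joined, pos):
            for left_end, right_start in pairs:
                left = text[:pos] + left_end
                right = right_start + text[pos + len(joined):]
                candidates.add((left, right))
    return candidates
```

With this sketch, splitting `uttarasyAm` at index 7 yields the pair `(uttaras, yAm)` purely from the no-change path, which is the source of the over-generation discussed here.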

Regarding uttas, asyAm, I think there are cases where a similar split is valid:
nis + agacChat -> niragacChat
As discussed before in #7, the sandhi module does not have morphological information, and so cannot distinguish between these cases. This is also compounded, in the case of visargas, by the fact that we need to start/end with words ending in s/r in forward/backward direction (as you mentioned in #16 (comment))

I can't think of any easy ways to prevent this type of over-generation given the database constraints without causing problems with under-generation, or augmenting the sandhi module with morphological information (which is a big task). Right now, the module errs on the side of over-generation to minimize (hopefully eliminate) under-generation. If you have any ideas, I'm all ears.


kmadathil commented on August 15, 2024

> uttaras/taras may not really be valid words, but since the INRIA db stores visarga as 's', these are recognized as valid words and get propagated to higher layers.

Nit. They are valid words. The visargaH turns up as a result of sandhi rules. That's a Paninian quirk that's probably glossed over everywhere :-)

> nis + agacChat -> niragacChat

This can be dealt with by looking at the left context in the split rules. There's no real case where as+a gives you (as,a). It can give you (ar,a) though. is+a, es+a etc. are different. No morphological information is needed here. Am I missing something?

> Split: [asti, uttaras, yAm, diSi]
> Split: [asti, ut, taras, yAm, diSi]
> (splitting uttarasyAm returns uttaras + yAm as well, which in turn when split returns ut + taras + yAm)

I see your point about having to do this to avoid undergeneration.

One possible way to solve this is to pass the tentative outputs of the splitter through the joiner and confirm that they do result in the original sentence as one of the options. The joiner can afford to be stricter than the splitter. However, I notice that it is loose too.

python -m sanskrit_parser.lexical_analyzer.sandhi uttaras yAm --join
Joining uttaras yAm
set([u'uttarasyAm', u'uttararyAm', u'uttaroyAm', u'uttaraHyAm', u'uttaro yAm'])

The first two are incorrect, only the last three are correct. u'uttaraH yAm' is missing.
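The round-trip check proposed above (pass tentative splits through the joiner and keep only those that can reproduce the original string) could be sketched as follows. `strict_toy_join` is a hypothetical stand-in for a stricter joiner and handles only one case, final `as` before a voiced consonant; it is an assumption for illustration, not the module's real join function.

```python
# SLP1 voiced consonants (a toy list assumed for this sketch).
VOICED = set("gGNjJYqQRdDnbBmyrlvh")


def strict_toy_join(left, right):
    """Hypothetical strict joiner covering only 'as' + voiced consonant."""
    if left.endswith("as") and right and right[0] in VOICED:
        stem = left[:-2]
        # Final 'as' becomes 'o' here; both spaced and unspaced forms are returned.
        return {stem + "o" + right, stem + "o " + right}
    # Everything else is simply concatenated in this toy version.
    return {left + right}


def filter_by_rejoin(original, splits, join):
    """Keep only the splits whose re-join can reproduce `original`."""
    return [(left, right) for left, right in splits if original in join(left, right)]


splits = [("uttaras", "yAm"), ("uttara", "syAm")]
kept = filter_by_rejoin("uttarasyAm", splits, strict_toy_join)
```

Here `(uttaras, yAm)` is dropped because a strict join cannot give back `uttarasyAm`, while `(uttara, syAm)` survives. The check is only as good as the joiner is strict, which is the concern raised about the current loose join output.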


kmadathil commented on August 15, 2024

In my opinion, the division of responsibilities among the modules should be as follows:

L0:
Given an input string, return possible sandhi splits at each location
Given two input strings, return sandhi output(s) - only correct sandhi should be produced!
L1:
Given a pada, return all possible lexical tags
L2:
Given a string with or without spaces, return a graph where each pada boundary is a legitimate split as per L0, as well as each pada being lexically valid as per L1
L3:
Given a lexical graph from L2, output paths that have valid morphologies, ordered (optionally) by DCS frequencies(?)
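As a rough illustration of how L0, L1 and L2 could fit together, the sketch below uses a toy L0 that returns only the no-change split, a toy L1 backed by a two-word lexicon, and an L2 that enumerates segmentations as lists rather than building a graph. All names (`l0_splits`, `l1_tags`, `l2_segmentations`, `TOY_LEXICON`) and the tag strings are hypothetical, and L3 is omitted.

```python
# Toy lexicon (an assumption): pada -> list of placeholder lexical tags.
TOY_LEXICON = {
    "rAma": ["noun-stem"],
    "putraH": ["noun"],
}


def l0_splits(text, pos):
    """Toy L0: return possible splits at `pos` (only the no-change split here)."""
    return [(text[:pos], text[pos:])]


def l1_tags(pada):
    """Toy L1: return all lexical tags for a pada, or an empty list if unknown."""
    return TOY_LEXICON.get(pada, [])


def l2_segmentations(text):
    """Toy L2: every segmentation whose boundaries come from L0 and whose
    padas are all lexically valid according to L1."""
    if not text:
        return [[]]
    results = []
    for pos in range(1, len(text) + 1):
        for left, right in l0_splits(text, pos):
            if l1_tags(left):
                for rest in l2_segmentations(right):
                    results.append([left] + rest)
    return results
```

Calling `l2_segmentations("rAmaputraH")` returns the single segmentation `rAma` + `putraH`. A graph, as described for L2 above, would share common sub-paths instead of enumerating each path separately.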

There are some exceptions that need attention. Let us handle them right here:

Cases where lexical or morphological information is needed to perform the sandhi split (e.g. पुरस्करोति)
षत्व and णत्व


avinashvarna commented on August 15, 2024

> Nit. They are valid words. The visargaH turns up as a result of sandhi rules. That's a Paninian quirk that's probably glossed over everywhere :-)

They have "पदसंज्ञा" but I am not sure that means that they are "valid words". Correct me if I'm wrong, but wouldn't "ससजुषो रुः" apply immediately and convert the "स्" into a "रेफ"? They are only intermediaries on a path to the final result. I am not sure that stopping the process at some intermediate step without applying subsequent सूत्रs is Paninian. If we use that argument, wouldn't things like "दिश् + स्" be "valid words" too?

> There's no real case where as+a gives you (as,a). It can give you (ar,a) though. is+a, es+a etc. are different. No morphological information is needed here. Am I missing something?

Sorry, निरगच्छत् was a bad example in this context. The result we were discussing is "uttarasyAm" --> "uttas asyAm", which as you said can happen in some contexts.

> The joiner can afford to be stricter than the splitter. However, I notice that it is loose too.
> Only correct sandhi should be produced!

Currently, the joiner and the splitter use the same rules, since we only wanted to write the forward rules, and be able to automatically infer the backward rules. If we want to be able to split something and arrive at a result, we have to allow the join to do the same even though it will be incorrect in some cases. If we want to change this behavior, it would require some changes to the architecture of the sandhi module.
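The coupling described above can be sketched with a single forward rule table from which the backward (split) table is derived mechanically. The table contents and the names `FORWARD_RULES`, `invert` and `join` are hypothetical and do not reflect the module's actual data structures.

```python
from collections import defaultdict

# Hypothetical forward rules: (left_ending, right_beginning) -> joined forms.
# The loose "asy" form is included on purpose to show how it leaks into joins.
FORWARD_RULES = {
    ("i", "u"): ["yu"],
    ("as", "y"): ["oy", "asy"],
}


def invert(forward):
    """Derive backward (split) rules: joined form -> list of (left, right) endings."""
    backward = defaultdict(list)
    for (left_end, right_start), joined_forms in forward.items():
        for joined in joined_forms:
            backward[joined].append((left_end, right_start))
    return dict(backward)


def join(left, right, forward):
    """Apply every matching forward rule at the boundary of `left` and `right`."""
    results = set()
    for (left_end, right_start), joined_forms in forward.items():
        if left.endswith(left_end) and right.startswith(right_start):
            for joined in joined_forms:
                results.add(left[:len(left) - len(left_end)] + joined + right[len(right_start):])
    return results
```

Because the split direction needs `asy` -> (`as`, `y`), the same entry makes `join("uttaras", "yAm", FORWARD_RULES)` return `uttarasyAm` alongside `uttaroyAm`. Making the joiner stricter would mean keeping two separate tables, which is the architectural change mentioned above.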

> The first two are incorrect, only the last three are correct. u'uttaraH yAm' is missing.

Will add the option with the space after visarga. If we are sure we wouldn't need "ary" --> "as, y" and "asy" --> "as, y" we can remove them in the forward direction as well (Rules currently apply in both directions as mentioned above).

I acknowledge that I was lazy in writing up the rules on where it is OK to just return left + right (without any modifications), partly because I don't completely understand where the INRIA db stores "s" at the end and where it stores "r" at the end. If we can come up with a list of where it is ok to do this "no-sandhi" split (or equivalently, where it is NOT ok), we can modify the rules and see if that mitigates the over-generation problems.


kmadathil commented on August 15, 2024

> I acknowledge that I was lazy in writing up the rules on where it is OK to just return left + right

Actually, I'd disagree there. You absolutely did the right thing there. That gave us the headstart we needed! :-)

> They have "पदसंज्ञा" but I am not sure that means that they are "valid words". Correct me if I'm wrong, but wouldn't "ससजुषो रुः" apply immediately and convert the "स्" into a "रेफ"? They are only intermediaries on a path to the final result.

You have a point. I'm being a bit loose here. It somehow seems "right" to me to store rAmas (after it-deletion, but before any of the sandhi and padAnta rules), but not diSs. Thus, the sa-lopa and na-lopa rules are applied to "pada"s stored in the db, but not rutva or any of the post-rutva changes.
It just "feels right to me", but is there a logical explanation for why that is? Possibly that sa-lopa and na-lopa are pada-internal changes, and post rutva changes depend on external context. However, rutva itself doesn't depend on external context, but if we store padas post-rutva, we will not distinguish between the ru and the r (ahar, prAtar). Maybe it'd have been more logical to store with an r, but distinguish the ru from a pre-existing r some other way. But, let's live with what we have for now.

> I don't completely understand where the INRIA db stores "s" at the end and where it stores "r" at the end.

I think that's simple enough - wherever the r is inherent to the prAtipadika, it's stored. Wherever it comes from rutva, it is not. So, we see ahar/prAtar, but not rAmar. Am I missing something?

> If we can come up with a list of where it is ok to do this "no-sandhi" split (or equivalently, where it is NOT ok), we can modify the rules and see if that mitigates the over-generation problems.

I agree. We should take the path of least effort right now. This is too early for any rearchitecture.

Following on from the previous paragraph, let's go with the assumption that the padAnta rules have all been applied in the db (as it is), barring the rutva rule. That means we could get rid of a lot of overgeneration by rejecting the no-change split in the case where rutva will happen (strings ending in s and sajuz, not followed by त/थ/ट/ठ).
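A minimal sketch of that rejection test, written against SLP1 strings, might look like this. The function name `allow_no_change_split` is hypothetical, and the set of exempt following letters is taken literally from the comment above (त/थ/ट/ठ, i.e. `t`, `T`, `w`, `W` in SLP1) rather than from a complete statement of the grammar.

```python
# SLP1 letters for त, थ, ट, ठ: the followers listed above as not triggering rejection.
RUTVA_EXEMPT_FOLLOWERS = set("tTwW")


def allow_no_change_split(left, right):
    """Return False when the plain left + right split should be rejected
    because rutva would have changed the end of `left`."""
    if not left or not right:
        return True
    # Assumption for this sketch: rutva applies to a final 's' and to 'sajuz'.
    if left.endswith(("s", "sajuz")):
        return right[0] in RUTVA_EXEMPT_FOLLOWERS
    return True
```

Under this check the no-change split `uttaras` + `yAm` is rejected, while `asti` + `uttarasyAm` and an `s` followed by `t` remain allowed.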

Other places where we can reject the no-change split that I can think of are where श्चुत्व / ष्टुत्व apply.

At this point, the best thing seems to be to look at least-effort paths to reduce overgeneration.


drdhaval2785 commented on August 15, 2024

Regarding सान्त , रेफान्त words

It seems to me that they are stored as such, and not as विसर्गान्त words, so that some sandhi rules can be applied unambiguously.
e.g.
पुनर् + रमते -> पुन + रमते (रो रि) -> पुना रमते (ढ्रलोपे पूर्वस्य दीर्घोऽणः)

Had it been stored as पुनः in DB, there would be a perpetual issue to find out whether the word was actually रेफान्त or सान्त.


avinashvarna commented on August 15, 2024

Please forgive me for the delayed reply. I was busy with other Sanskrit-related work.
I have made an attempt to change this. Now:
Input String: astyuttarasyAMdishi
Input String in SLP1: astyuttarasyAMdiSi
Start Split: 2017-08-16 00:02:39.920000
End DAG generation: 2017-08-16 00:02:39.930000
End pathfinding: 2017-08-16 00:02:39.931000
Splits:
Lexical Split: [asti, uttarasyAm, diSi]
Valid Morphologies
[(asti, ('as#1', set([prATamikaH, law, ekavacanam, kartari, praTamapuruzaH]))), (uttarasyAm, ('uttara#2', set([ekavacanam, saptamIviBaktiH, strIliNgam]))), (diSi, ('diS#2', set([napuMsakaliNgam, dvitIyAviBaktiH, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, law, ekavacanam, kartari, praTamapuruzaH]))), (uttarasyAm, ('uttara#2', set([ekavacanam, saptamIviBaktiH, strIliNgam]))), (diSi, ('diS#2', set([ekavacanam, saptamIviBaktiH, strIliNgam])))]
[(asti, ('as#1', set([prATamikaH, law, ekavacanam, kartari, praTamapuruzaH]))), (uttarasyAm, ('uttara#1', set([ekavacanam, saptamIviBaktiH, strIliNgam]))), (diSi, ('diS#2', set([napuMsakaliNgam, dvitIyAviBaktiH, bahuvacanam])))]
[(asti, ('as#1', set([prATamikaH, law, ekavacanam, kartari, praTamapuruzaH]))), (uttarasyAm, ('uttara#1', set([ekavacanam, saptamIviBaktiH, strIliNgam]))), (diSi, ('diS#2', set([ekavacanam, saptamIviBaktiH, strIliNgam])))]
Lexical Split: [asti, uttara, syAm, diSi]
No valid morphologies for this split
Lexical Split: [asti, ut, tara, syAm, diSi]
No valid morphologies for this split

So only one alternative is shown. This change may have broken something else, so careful testing is needed. If any other problem turns up, further changes may be required.

For join, the option with the space still needs to be added. I will do that soon.


kmadathil commented on August 15, 2024

I am closing this.
If needed, we will open another issue.


gasyoun commented on August 15, 2024

@kmadathil have you seen Huet's or Scharf's Sandhi code?


kmadathil commented on August 15, 2024

@gasyoun - No I have not. Can you point me to the code?
