
Comments (7)

avinashvarna commented on August 15, 2024

It does get parsed correctly for one of the two options:

from sanskrit_parser import Parser
from indic_transliteration import sanscript
parser = Parser(output_encoding=sanscript.SLP1)
splits = parser.split('Darmo rakzati rakzitaH', limit=2)
for split in splits:
    parses = split.parse(limit=2)
    if parses is not None:
        for parse in parses:
            print(str(parse))

produces:

Partition 2: rakzitar went to zero length!
['DarmaH', 'rakzati', 'rakzitaH']
['DarmaH', 'rakzati', 'rakzitaH']
DarmaH => (Darma, ['praTamAviBaktiH', 'puMlliNgam', 'ekavacanam']) : kartA of rakzati
rakzati => (rakz, ['praTamapuruzaH', 'parasmEpadam', 'prATamikaH', 'law', 'kartari', 'ekavacanam'])
rakzitaH => (rakzita, ['karmaRi', 'praTamAviBaktiH', 'puMlliNgam', 'prATamikaH', 'kfdantaH', 'ekavacanam', 'kta']) : viSezaRam of DarmaH
DarmaH => (Darma, ['praTamAviBaktiH', 'puMlliNgam', 'ekavacanam']) : viSezaRam of rakzitaH
rakzati => (rakz, ['praTamapuruzaH', 'parasmEpadam', 'prATamikaH', 'law', 'kartari', 'ekavacanam'])
rakzitaH => (rakzita, ['karmaRi', 'praTamAviBaktiH', 'puMlliNgam', 'prATamikaH', 'kfdantaH', 'ekavacanam', 'kta']) : kartA of rakzati

In the first split, where rakzitaH was parsed as rakzitar, no valid parse is found. In the second one, where rakzitaH is parsed as rakzitas, the parse output is correct.

However, when we pass pre_segmented=True, we force replace_ending_visarga='r' here:

o = SanskritObject(seg,
                   encoding=self.input_encoding,
                   strict_io=self.strict_io,
                   replace_ending_visarga='r')

This then results in no valid parse being found.

After splitting, the UI uses pre-segmented mode, which results in this error in the UI.

I am not sure of the best way to fix this. Visarga handling has always been a pain point for me.

from sanskrit_parser.

kmadathil commented on August 15, 2024

rakzitaH can be

  1. रक्षितर् - रक्षितृ | एकवचनम्,पुंल्लिङ्गम्,संबोधनविभक्तिः or
  2. रक्षितस् - रक्षित | कृदन्तः,प्रथमाविभक्तिः,एकवचनम्,पुंल्लिङ्गम्,कर्मणि,प्राथमिकः,क्त

In

o = SanskritObject(seg,
                   encoding=self.input_encoding,
                   strict_io=self.strict_io,
                   replace_ending_visarga='r')
ts = self.sandhi_analyzer.getMorphologicalTags(o, tmap=True)
if ts is None:
    # Possible sakaranta
    # Try by replacing end visarga with 's' instead
    o = SanskritObject(seg,
                       encoding=self.input_encoding,
                       strict_io=self.strict_io,
                       replace_ending_visarga='s')
    ts = self.sandhi_analyzer.getMorphologicalTags(o, tmap=True)

we prioritize the r form over the s form, if it exists, during a pre-segmented split (used by the UI). Ideally, we should supply both downstream. This case shows that prioritizing either form does not work.


kmadathil commented on August 15, 2024

This is my summary of a potential fix:

  • If both an r-anta and an s-anta form exist, we supply both downstream.
  • As a result, _maybe_pre_segment returns a list of strings.
  • split downstream handles a list of strings as well as a single string. In the list case, the splits of each member are concatenated to give the result.
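
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the sanskrit_parser implementation: the names expand_visarga, pre_segment, split_any and TOY_LEXICON are hypothetical, input is assumed to be SLP1 with a final visarga written as H, and the toy lexicon stands in for the real morphological tag lookup.

```python
from typing import Callable, Dict, List, Optional, Union

# Hypothetical toy lexicon standing in for the real morphological lookup.
TOY_LEXICON: Dict[str, str] = {
    "Darmas": "Darma",
    "rakzati": "rakz",
    "rakzitar": "rakzitf",
    "rakzitas": "rakzita",
}


def expand_visarga(word: str,
                   lookup: Callable[[str], Optional[str]]) -> List[str]:
    """Return every r/s replacement of a final visarga that the lookup accepts."""
    if not word.endswith("H"):
        return [word]
    stem = word[:-1]
    return [form for form in (stem + "r", stem + "s")
            if lookup(form) is not None]


def pre_segment(words: List[str],
                lookup: Callable[[str], Optional[str]]) -> List[List[str]]:
    """Expand a pre-segmented sentence into one word list per r/s combination."""
    results: List[List[str]] = [[]]
    for word in words:
        forms = expand_visarga(word, lookup)
        results = [prefix + [form] for prefix in results for form in forms]
    return results


def split_any(text: Union[str, List[str]],
              split_one: Callable[[str], List[List[str]]]) -> List[List[str]]:
    """Accept a string or a list of strings; concatenate each member's splits."""
    members = [text] if isinstance(text, str) else text
    combined: List[List[str]] = []
    for member in members:
        combined.extend(split_one(member))
    return combined


# Both the r-anta and the s-anta reading of rakzitaH are passed downstream.
candidates = pre_segment(["DarmaH", "rakzati", "rakzitaH"], TOY_LEXICON.get)
sentences = [" ".join(words) for words in candidates]
# A whitespace split stands in for the real splitter here.
splits = split_any(sentences, lambda s: [s.split()])
```

With this shape, neither the r form nor the s form is prioritized: a word that has both readings simply contributes two candidate sentences, and the caller sees the union of their splits.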


avinashvarna commented on August 15, 2024

Indeed, I should have said that we prioritize the 'r' replacement over the 's'. The proposed solution looks fine to me to try out. We should also think about how the results are presented to the user. Currently the user sees ['DarmaH', 'rakzati', 'rakzitaH'] twice, and it is confusing why there are two seemingly identical splits (it was to me, at least, when I first started looking into this issue).


kmadathil commented on August 15, 2024

> We should also think about how the results are presented to the user. Currently the user sees ['DarmaH', 'rakzati', 'rakzitaH'] twice

Strict mode (--strict-io) disambiguates this. I've always preferred this for my own use :-)

That suggests a better fix. We should expect pre-segmented input in strict mode (since it's used mostly for testing and for the UI, both of which can do that). The current s/r tests can be retained as a fallback (or removed, if you prefer).

I've implemented it in the karthik branch and tested locally, and it seems fine. Please review PR #180


kmadathil commented on August 15, 2024

I have merged the PR.
However, we need to further discuss two concerns about that fix (which is limited to the Web UI, not the command line script):

This doesn't address the problem of the user seeing two identical splits with no explanation?

How do we handle this in the command line script? Should we turn strict_io on, and use a display filter, such as we (now) have for the Web UI?

I am also concerned that this spreads the visarga handling/normalization to another location (the JavaScript), which could make maintenance a challenge.

This raises the larger question of what should be the proper domain of the parser vs what should be the proper domain of a UI (either text or graphical). Especially with an application such as ambuda, with a capable UI, shouldn't we be leaving display decisions to them?

OTOH, those who would like to use the command line script shouldn't be left high-and-dry either. Earlier, #56 had opened the question of handling visargas, anuswaras etc. I would like to suggest that we refactor the functionality to split the core sandhi/parse functionality and the anusvara/visarga handling.


avinashvarna commented on August 15, 2024

IMHO, the majority of users who would want to use a parser are probably not looking to understand the nuances of a visarga that arose from sasajuSho ruH vs something else. I am in favor of just displaying one split, by collapsing the two options in this case. Later, for parsing, we could try replacing the visarga with both 'r' and 's' and return the combined results as you suggested. We can do that in a different layer, or at the API entry points.
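
To make the "collapse the two options" idea concrete, here is a rough sketch. It assumes splits arrive in strict form (final r or s instead of visarga) and that a simple display filter turning a final r or s back into SLP1 H is acceptable. Both to_display and collapse_splits are hypothetical names, not part of sanskrit_parser, and the filter is deliberately naive.

```python
from typing import Dict, List, Tuple


def to_display(word: str) -> str:
    """Hypothetical display filter: show a final r or s as visarga (SLP1 'H')."""
    return word[:-1] + "H" if word.endswith(("r", "s")) else word


def collapse_splits(
    splits: List[List[str]],
) -> List[Tuple[List[str], List[List[str]]]]:
    """Group splits that look identical after display filtering.

    Returns (display_form, underlying_variants) pairs in first-seen order,
    so the user sees one split while every variant is kept for parsing.
    """
    grouped: Dict[Tuple[str, ...], List[List[str]]] = {}
    for split in splits:
        key = tuple(to_display(word) for word in split)
        grouped.setdefault(key, []).append(split)
    return [(list(key), variants) for key, variants in grouped.items()]


strict_splits = [
    ["Darmas", "rakzati", "rakzitar"],
    ["Darmas", "rakzati", "rakzitas"],
]
collapsed = collapse_splits(strict_splits)
```

Here the two strict splits collapse into a single displayed split, and the parse step can still try each underlying variant and return the combined results.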

> Earlier, #56 had opened the question of handling visargas, anuswaras etc. I would like to suggest that we refactor the functionality to split the core sandhi/parse functionality and the anusvara/visarga handling.

Could you please elaborate on what you have in mind? I thought that as a result of #56, the normalization is only done at the entry points anyway, and the core parser/sandhi don't deal with it.

