Giter Club home page Giter Club logo

Comments (25)

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

Necessary data is available in json at https://github.com/drdhaval2785/samasasplitter/tree/master/dicts/str_counter

It is a dict with startingString as key and its occurrence as value.

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

This is good, but it will help us reduce the complexity of graph creation rather than graph traversal / path search, which is s^n.
While we will waste time searching for astyu, astyut, ... etc, that adds to the graph creation complexity, which doesn't seem to be a problem right now. Even for our longer strings, we finish DAG creation nearly instantaneously.
But this is a good thing to keep in the back of our minds once that becomes limiting

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on August 15, 2024

@drdhaval2785 This is a neat idea. Could be combined with a trie structure (the return of trie? :)) to basically determine when a prefix is no longer in the db. While it is not the bottle-neck right now, once we integrate with all the sandhi rules coded up, we might generate too many splits at a given location. It might not harm to implement it. Can we tag this issue as revisit later?

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

There is one hidden possibility here.
As you can see, most of the paths end before u, the probability of a split is high there. In a way we will have weighted graph.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

@kmadathil

Are graph creation and traversal not related?
If we reduce the split places, would traversal time also not reduce?

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

@drdhaval2785

What you're optimizing can be called the phonological graph.

The graph we're traversing is the lexical graph - the one that's created after all legal splits have been decided. For example, for astyuttarasyAmdiSi, this is the graph

                           asti
             +------------+-------------+--------+
             |                 |                |           |
       uttarasyAm   uttas          uttara     ut
            |                  |                |           |
            |                 rasyAm     syAm     tara
            |                     |              |            |
            +-------------- diSi-------  +---------+
                                  |
                                 End

(edit: The graph comes out horribly, I'll figure out a way to get it looking better)

This graph is created from the phonological graph by the getSandhiSplit method - using the process of figuring out each possible legal left split, then recursively splitting the right part to see if there's a legal split (with memoization to speed things up). Your idea will speed the creation of this graph by allowing us to consider fewer possibilities while creating the graph. Once created, every node in it is legal. Currently, this is not the complexity bottleneck. Creation of the lexical graph happens very fast, though we're not perfectly optimal in the implementation.

Traversing (using the findAllPaths method) the lexical graphs creates this set of paths. This is the part that has exponential complexity, and therefore the one that limits our performance.

[[u'asti', u'uttarasyAm', u'diSi'], [u'asti', u'uttas', u'rasyAm', u'diSi'], [u'asti', u'uttara', u'syAm', u'diSi'], [u'asti', u'ut', u'tara', u'syAm', u'diSi']]

If we can extend your idea to find was to simplify the lexical graph during or after creation, but before traversal, that would help us. If we could find constraints to put on the graph which would result in deletion of nodes or edges (of the lexical graph, not the phonological one), that would help us.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

Can you post nodes and edges (maybe adjacency matrix) for small samasa? This will help understand what kind of constraints can be put from grammatical or corpus linguistics point of view?

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on August 15, 2024

graph

Does that help?
(created using graphviz)

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on August 15, 2024

Here's another one if it helps. Ignore the weird behavior of the A on the right. When the graph is exported to the .dot format, it only uses node names, and so the A gets a lot of edges, even though there are probably different A nodes at different points in the splits.

I think this does show that if we can avoid the weird splits like A at some points, it might help simplify the graph and reduce the number of paths. I had noticed something similar earlier and modified the sandhi rules to account for this. Will test what happens with that and report back.

graph

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

There is something more worriesome about the A.
Can we keep a constraint that a split may not be of one letter?

The words with 'A' prefix are anyhow separately enumerated in the dataset e.g. AgacCati, Agama etc. There is no need to keep A as a separate node. Node should be created onlycif string length is more than 1. This will reduce too many spurious splits

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

Second idea is to use the the following as weight of edges
1/(length_of_string_of_destination_node).

E.g. edge (asti,uttarasyAm) will have weight of 1/11
and edge (asti,ut) will have weight of 1/2.

This will ensure that larger splits are facilitated and shorter splits are discouraged in considering paths.

We can then use the various algorithms for weighted graphs. This will give us graded splits i.e. paths with least resistance would be most preferrred.

What are uour takes @avinashvarna and @kmadathil?

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

I guess I understand where is it going haywire.
https://networkx.readthedocs.io/en/stable/tutorial/tutorial.html mentions

we add new nodes/edges and NetworkX quietly ignores any that are already present.

Therefore when we added 'A' node again, networkx ignored that and kept only one 'A'.
A simple workaround maybe to name the nodes with numbers and storing the string as node attribute. Please keep it a try.

After weightage and number nodes, let us see how it fares.

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on August 15, 2024

Changing the weight of the edges may help us sort the paths according to some criterion, but as far as I am aware, finding the paths according to lowest weights is at least as complex as finding the shortest path, so it doesn't really solve the problem.

Regarding the A node, this is an artifact of the exporting to a different format, as I mentioned earlier. This does not happen in the actual implementation. Moreover, with @kmadathil's fix for #21, there should not even be duplicate nodes.

I do agree that improving the quality of our sandhi splits is really the right way to reduce the graph traversal time. If we have to retrieve 10K paths to find our split, our retrieval precision is 10^-4, which is pretty bad. I think we need to focus on improving the splits by adding constraints if that is indeed the case. Your suggestion of preventing single letter splits for example is a good one (though I hope that the sandhi module already addresses this. I will check).

I am actually more interested in the 'diSi' node in the above graph. This is an articulation point (https://www.hackerearth.com/practice/algorithms/graphs/articulation-points-and-bridges/tutorial/). In other words, every path from asti to the end HAS to go through diSi. So we could actually split the problem of finding paths in the graph to two problems: asti --- diSi, and diSi --- end. Since the number of paths is exponential in the number of nodes, finding the paths in the smaller graphs will be faster, and then we can recombine the results to obtain the full paths.

Unfortunately, I don't think all the results we generate will have such articulation points. Might be something worth exploring.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

I think nodes are still letter based.

def addNode(self,node,root=False):
    """ Extend dag with node inserted at root """
    assert node not in self.G
    self.G.add_node(node)
    if root:
        self.roots.append(node)

This shows that nodes are lettet based

 if s_c_left not in node_cache:
                            # Extend splits list with s_c_left appended with possible splits of s_c_right
                            t = SanskritBase.SanskritObject(s_c_left,encoding=SanskritBase.SLP1)
                            node_cache[s_c_left] = t
                        if not splits:
                            splits = SanskritLexicalGraph()
                        if not splits.hasNode(t):
                            splits.addNode(t,root=True)

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on August 15, 2024

t = SanskritBase.SanskritObject(s_c_left,encoding=SanskritBase.SLP1)
It's an object with that name. Two objects with the same name will not be identical.

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

@avinashvarna is right. The reason I created SanskritObjects for nodes is to solve the precise problem of a word appearing twice in a sentence having to be two different nodes.

Articulation points are good ideas. Every space is an articulation point, and that could make things easier.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

I still have my reservations. We are doing cacheing.
So when I search for 'A' in cache, I will always retrieve the old cached node. Our new object creation happens only when that node is not in cache.

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

@drdhaval2785 - We will eventually need weights, like you suggested. Right now, as @avinashvarna says, shortest path finding is equivalent to weighted search.
The node cache is keyed by string, but cleared at each Sandhi split point. Nodes with the same string, at the same split point, are identical.

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

Proof. If both 'tvam's are the same node, there'll be a single word path 'tvam'-end and loops. We get a single valid path here.

(dag-nx)*$ python SanskritLexicalAnalyzer.py --split --input-encoding SLP1 tvaMcatvam --max 20
Parsing of XMLs started at 2017-07-13 09:24:28.927559
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-13 09:24:33.883752
Input String: tvaMcatvam
Input String in SLP1: tvaMcatvam
Start Split: 2017-07-13 09:24:39.341010
End DAG generation: 2017-07-13 09:24:39.342314
End pathfinding: 2017-07-13 09:24:39.343026
[[u'tvam', u'ca', u'tvam']]

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

@drdhaval2785
AgacCAmi is not in Prof. Huet's database.

$python SanskritLexicalAnalyzer.py AgacCAmi
Parsing of XMLs started at 2017-07-13 09:26:43.422954
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-13 09:26:48.320092
Input String: AgacCAmi
Input String in SLP1: AgacCAmi
None

$ python SanskritLexicalAnalyzer.py AgacCAmi --split
Parsing of XMLs started at 2017-07-13 09:26:59.823000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-13 09:27:04.781413
Input String: AgacCAmi
Input String in SLP1: AgacCAmi
Start Split: 2017-07-13 09:27:10.275910
End DAG generation: 2017-07-13 09:27:10.277353
End pathfinding: 2017-07-13 09:27:10.277907
[[u'A', u'gacCAmi']]

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

@drdhaval2785
Your misgivings about node_cache were not misplaced. There were issues in the cacheing, which I have fixed.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

Any speedups after bug fix?

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

Speed is similar, some incorrect funtionality is now corrected.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on August 15, 2024

Anything further to be done here? Or time to close?

from sanskrit_parser.

kmadathil avatar kmadathil commented on August 15, 2024

Time to close, I think.

from sanskrit_parser.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.