Giter Club home page Giter Club logo

Comments (20)

avinashvarna avatar avinashvarna commented on September 17, 2024

Would it be better to use a standard graph library instead of rolling our own? I've used networkx (https://networkx.readthedocs.io) before, and it comes with a lot of algorithms, including some for what we need. E.g. https://networkx.readthedocs.io/en/stable/reference/algorithms.simple_paths.html (See shortest simple paths - this should correspond to fewest splits) and some DAG specific ones (https://networkx.readthedocs.io/en/stable/reference/algorithms.dag.html)
I believe these algorithms are written using DFS/BFS to avoid the overhead of recursion in Python, and should have better performance than the simple recursive implementation.

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on September 17, 2024

I hacked together a networkx based implementation as an example. Needs some cleaning, but the speedup is worth it. From a minute on my laptop to just about 0.1 seconds:

Parsing of XMLs started at 2017-07-10 20:33:07.631000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 20:33:13.823000
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start Split: 2017-07-10 20:33:18.199000
End DAG generation: 2017-07-10 20:33:18.254000
End pathfinding: 2017-07-10 20:34:13.241000
Finding k shortest paths using networkx: 2017-07-10 20:34:13.242000
End pathfinding: 2017-07-10 20:34:13.332000

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on September 17, 2024

P.S. Let me know if you want me to clean it up. Didn't add too many comments since I wanted to get it implemented and share the results.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on September 17, 2024

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

@avinashvarna
My plan was to use a simple implementation now and move to a better one as needed. We will definitely need a graph library for Level 3, so we might as well integrate it now. Let me look at your branch.

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on September 17, 2024

Sorry, I actually added it into your dag branch, since I had some in-progress work on my branch. Hope that's ok. You can revert the commit if you want.

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

No you didn't - you pushed it to a different branch called origin/dag by mistake. All's well :-)

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on September 17, 2024

First time running git worktree add and I mess it up. Well, no harm done I guess :).

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on September 17, 2024

I hacked together a networkx based implementation as an example. Needs some cleaning, but the speedup is worth it. From a minute on my laptop to just about 0.1 seconds:

Where is it? How to use?

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

@avinashvarna
What's the maximum number of paths youve tried for? I couldn't get it to terminate in a reasonable time with --print-max 10000

@drdhaval2785
please pull the branch called "origin/dag" (not dag, but origin/dag)

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

Comparison of runtimes (rough) - 1000 paths on astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH

BFS - dag-bfs branch: 10 seconds
DFS/Memoized - dag branch: 40 seconds
Networkx - dag-nx branch: 4 seconds
- origin/dag branch: 3 seconds

If you want to try this out, please note some branches use --print-max and some use --max-paths for the same thing. I'll eventually settle on --max-paths

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

Note that the DFS/Memoized dag branch takes 40 seconds for the entire set of paths. No other implementation seems to be able to go to 10000 paths.

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

@drdhaval2785 Please try the dag-nx branch.

from sanskrit_parser.

drdhaval2785 avatar drdhaval2785 commented on September 17, 2024

No other implementation seems to be able to go to 10000 paths.

Excuse my lack of technical knowledge about paths and graphs.
I am just curious why so many paths?
There is some idea in my mind by which we can definitely come to a conclusion that the sandhi or samasa is not split at certain letters. In 42 letter, there were only 18 which qualified to be called a split point.

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on September 17, 2024

Maybe it's just to understand the limits right now. We definitely have some work ahead of us to minimize the number of false splits. I've been modifying the way we encode the sandhi rules to minimize these splits.

@kmadathil It appears to be some problem with the different algorithm they use in shortest_simple_paths. Just as a test, I replaced shortest_simple_paths with all_simple_paths in dag-nx branch, and it finishes with the same number of paths, taking almost the same time as the dag branch on my laptop:
End DAG generation: 2017-07-11 09:06:28.385000
End pathfinding: 2017-07-11 09:07:27.163000
Found 8478668 paths
So we could switch to using all_simple_paths for now and sorting them ourselves, which is what dag branch does anyway.

I will post a question on networkx mailing list/stackoverflow to see if this is a bug or some other issue.

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

@drdhaval2785 Graphs are simple, you can figure out the basics in a day or less. :-)
https://www.codechef.com/wiki/tutorial-graph-theory-part-1

Why so many paths? Let's say we have a graph with an average of s possible splits each at n positions. The complexity of building the lexical graph is of the order of s.n, while the complexity of number of paths through it is s^n. This is why graph building is quick, while calculating the number of paths blows up with longer sentences. It's good practice when we're sitting on a problem with exponential complexity to know our limits, and possible fallback to a different method if we detect that we've exceeded them. That's why I'm worried about paths.
Right now, for astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH, we need approx 1000 shortest paths to ensure that we include the correct split. If I add the second half of the verse (haven't had the courage to yet), we'll probably need 10000.

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

@avinashvarna
That's a good catch. We could have both versions, and fallback to the all_simple_paths implementation depending on the number of paths requested.

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

dag-nx looks like the one that will be taken forward. I will close this after integrating with master. Awaiting the first round of integration issues with sandhi.py to be ironed out.

from sanskrit_parser.

avinashvarna avatar avinashvarna commented on September 17, 2024

I just happened to pick dag-nx to start because you mentioned it used networkx. I haven't actually yet seen the bfs branch. If you think that dag-nx is indeed the right one to go forward, then we are good to go. Else, let's use a different one.

from sanskrit_parser.

kmadathil avatar kmadathil commented on September 17, 2024

Merged dag-nx (with shortest_simple_paths and fallback all_simple_paths implementation) has been merged with master. Closing.
Any sandhi integration issues can be dealt with elsewhere

from sanskrit_parser.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.