Currently, the inriaxmlwrapper code (which we use for lexical lookup of forms), reads

Speedup lexical lookup using an O(1) datastructure (sanskrit_parser issue, closed)

kmadathil commented on August 15, 2024

Speedup lexical lookup using an O(1) datastructure

from sanskrit_parser.

Comments (29)

avinashvarna commented on August 15, 2024

Yes. https://wiki.python.org/moin/TimeComplexity states that implementation of set and dict are intentionally similar.


drdhaval2785 commented on August 15, 2024

Flood relief.
Added just now.


drdhaval2785 commented on August 15, 2024

Seems a promising suggestion.


avinashvarna commented on August 15, 2024

I am not very familiar with tries, but based on https://stackoverflow.com/questions/245878/how-do-i-choose-between-a-hash-table-and-a-trie-prefix-tree, their main advantage appears to be with prefix-based searches. If we are always going to be looking up an exact form, we could use a standard python dict that has an O(1) average cost for lookups. Since our keys are standard unicode strings, we probably don't need to worry about the O(n) worst case complexity. dicts are simpler to use, and everyone's familiar with it (as also indicated in the SO answer) and that also counts for a lot.
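To make the trade-off concrete, here is a minimal sketch contrasting the two structures. The forms and tags are made-up placeholders, and `Trie` is a toy class written only for illustration; neither reflects the actual inriaxmlwrapper code.

```python
# Placeholder data: SLP1-style forms mapped to made-up tag lists.
forms = {"rAmaH": ["tag1"], "rAmO": ["tag2"], "gacCati": ["tag3"]}


def exact_lookup(form):
    """Exact-match lookup in a dict: average O(1) per query."""
    return forms.get(form)


class Trie:
    """Minimal prefix tree; its strength is prefix queries, not exact lookup."""

    def __init__(self):
        self.children = {}
        self.is_end = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def with_prefix(self, prefix):
        # Walk down to the node for the prefix, then collect every word below it.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.is_end:
                found.append(word)
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return sorted(found)


trie = Trie()
for form in forms:
    trie.insert(form)
```

If the parser only ever asks "is this exact form in the lexicon?", `exact_lookup` is all that is needed; `with_prefix` only pays off when queries such as "all forms starting with rAm" matter.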


kmadathil commented on August 15, 2024

Another TRIE (or DAWG) advantage is that shorter keys get looked up faster. That seemed attractive to me at first sight. However, if we're prioritizing longer splits (as is being discussed elsewhere), that might be pointless.

But yes - our keys are going to be plain ASCII strings (since they're SLP1 encoded). And in any case, this is not really a limiting factor now. Anything that doesn't do an XPath search would probably be practically the same, so I agree that just using a dict is smartest.


drdhaval2785 commented on August 15, 2024

I feel
for member in set(x)
is equally fast. Its average case complexity is O(1) only.
So why try tries?
We can pickle the set itself and store, and unpickle and use directly.


kmadathil commented on August 15, 2024

Please see my comment above, dict is now the preferred solution.

We need to store keys (forms) as well as values (tagsets). Using sets, we can only store the keys, as we do today for the quicklookup function. After loading the dict, we could do .keys() on it, and convert that to a set for quicklookup.


drdhaval2785 commented on August 15, 2024

OK.


avinashvarna commented on August 15, 2024

There is no need to do .keys() on a dict and convert it into a set. x in dict is effectively the same as x in set(dict.keys()).
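A small sketch of this point, using placeholder forms and tags: both membership tests give the same answer, because a dict's keys are already stored in a hash table.

```python
tags_by_form = {"rAmaH": ["tag1"], "gacCati": ["tag2"]}  # placeholder data

# Membership on the dict itself hashes the key once: average O(1).
in_dict = "rAmaH" in tags_by_form

# Building a set of the keys first gives the same answer, but costs an
# extra O(n) pass up front and a second hash table held in memory.
key_set = set(tags_by_form.keys())
in_set = "rAmaH" in key_set
```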


kmadathil commented on August 15, 2024

Does x in dict give the same performance as storing set(dict.keys()) and doing x in set?


kmadathil commented on August 15, 2024

Ok dict it is then. We do x in dict for quick lookup and dict[x] to extract the lexical tags.
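The agreed approach could look roughly like the following. The function names `has_form` and `get_tags` and the sample entries are hypothetical, chosen only for this sketch.

```python
# Hypothetical in-memory lexicon: form -> list of lexical tags (placeholder values).
lexicon = {
    "rAmaH": ["tag1", "tag2"],
    "gacCati": ["tag3"],
}


def has_form(form):
    """Quick lookup: is this form known at all?"""
    return form in lexicon


def get_tags(form):
    """Return the lexical tags for a form, or an empty list if it is unknown."""
    return lexicon.get(form, [])
```

Using `lexicon.get(form, [])` rather than `lexicon[form]` avoids a `KeyError` for unknown forms; whether a missing form should raise instead is a design choice the thread leaves open.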


drdhaval2785 commented on August 15, 2024

Now that we have decided on dicts, time to close this issue?


kmadathil commented on August 15, 2024

Was hoping to close this after we implement it :-)


avinashvarna commented on August 15, 2024

I actually have a preliminary implementation locally that includes pickling. @drdhaval2785 would you mind adding me as a collaborator for inriaxmlwrapper? I will create a branch there and add it in.


kmadathil commented on August 15, 2024

@drdhaval2785 @avinashvarna
Did this implementation get merged?


avinashvarna commented on August 15, 2024

Waiting for @drdhaval2785 to add me as a collaborator. In the meantime, also see #30


kmadathil commented on August 15, 2024

@avinashvarna
He might be busy. I have a fork https://github.com/kmadathil/inriaxmlwrapper which I use as a git remote for the same repo. I've added you as a collaborator on that. You can push to that. We'll generate a pull request after validating it locally.


vvasuki commented on August 15, 2024

Gentlemen, since you're talking about pickling a dict - my suggestion is: please just write a JSON file. I vaguely remember reading http://www.benfrederickson.com/dont-pickle-your-data/, but the real reason is that the JSON will be easily usable from any other programming language, and it can be read in as a dict with json.loads().
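A minimal sketch of the JSON round trip, assuming the lexicon is a dict from form to a list of tags. The data and the file name `lexicon.json` are placeholders.

```python
import json
import os
import tempfile

lexicon = {"rAmaH": ["tag1", "tag2"], "gacCati": ["tag3"]}  # placeholder data

# Hypothetical file location; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "lexicon.json")

# Write the dict out as JSON.
with open(path, "w", encoding="utf-8") as f:
    json.dump(lexicon, f, ensure_ascii=False)

# Read it back; json.load on a file is the counterpart of json.loads on a string.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
```

One caveat: JSON has no set or tuple type, so tag sets would have to be stored as lists and converted back after loading.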


vvasuki commented on August 15, 2024

Also, you might consider just using a database, if you suspect future memory constraints.


avinashvarna commented on August 15, 2024

@drdhaval2785 Our thoughts are with you. Your dedication is both impressive and inspiring!

@vvasuki I was thinking about using a database as well, since it would be good not to load the entire db into memory just to look up a small subset of words. Will look into a suitable library.


vvasuki commented on August 15, 2024

If you make the database URL a parameter to your program, you can just use couchdb - it comes with a REST API, it's easy to replicate (master to master), and there are wrappers around the REST API. Optionally, you can be database-neutral by using an interface like the skeleton I've started in sanskrit_data.


avinashvarna commented on August 15, 2024

This would require the user to install couchdb server separately at the least, or rely on having an internet connection at the time of usage. I understand the motivation, but I think it makes the package less self-contained. On the other hand, using an SQLite type of database would allow everything to be self-contained. (Is there a self-contained couchdb python module similar to sqlite3?)

Thoughts?


vvasuki commented on August 15, 2024

This would require the user to install couchdb server separately at the least, or rely on having an internet connection at the time of usage.

That's not too big a disadvantage - for well over 10 years I've had almost-always-on internet. So, consider whether that use case is really something to bother about now.

On the other hand, using an SQLite type of database would allow everything to be self-contained. (Is there a self-contained couchdb python module similar to sqlite3?)

Yes - couchbase-lite. I've used it quite a bit with Scala and Java. It can be synced with couchdb (but indexes don't sync). Alas, I don't see a Python API for it.

(just in case) Please, don't even consider going relational - nosql dbs are the way to go :-)


avinashvarna commented on August 15, 2024

It's perhaps not as simple as you imagine. When you sent out the DCS database release email a couple of months ago, I wanted to play around with the data. After installing couchdb in my Ubuntu VM running on my windows laptop, I tried syncing with the vedavaapi.org database. Since the network interface bridging to VMs on Hyper-V is not the most stable in the world, the network would occasionally disconnect, and I just could not get my local couchdb database to fully sync with vedavaapi.org. (It might have synced eventually, but I wanted to explore the data right away, and did not have the patience to wait for eventual consistency). The release included a couchbase lite dump of the db, but as you noted, that is incompatible with couchdb. Eventually I used the couchdb backup scripts from https://github.com/danielebailo/couchdb-dump to dump the database from vedavaapi.org onto my windows host machine, scp'ed that into my VM, and then used the scripts again to restore the dump into the couchdb server on the VM. This is certainly not a smooth experience.

Consider also the following use case - As being discussed in #28, suppose I want to train a deep neural network (several stacked LSTM/CNN layers) using an ML platform as a service such as floydhub, I don't have control over the docker images they run, and so cannot install couchdb server. Given that I am paying by the second, I don't want my training script to go over an internet connection every time it needs to lookup tags for a single word. It makes sense to have a self-contained database in this case.

Regarding the nosql comment, I agree that they may be more natural when dealing with real-world documents that do not fall nicely into rows and columns, but we need to think whether they make sense in this particular application given the constraints. We are using a dict today for essentially key -> value mappings, and want to put this in a db. In the absence of a self-contained nosql python implementation, and given that sqlite3 is part of the python standard library and available on all platforms, I am not convinced that nosql has to be the way to go. I am open to being convinced otherwise.
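As a rough illustration of the self-contained option, a key -> value mapping fits in a single SQLite table. The table name `forms`, the JSON encoding of the tags, and the sample data are all assumptions made for this sketch.

```python
import json
import sqlite3

# ":memory:" keeps the sketch self-contained; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forms (form TEXT PRIMARY KEY, tags TEXT NOT NULL)")

entries = {"rAmaH": ["tag1", "tag2"], "gacCati": ["tag3"]}  # placeholder data
conn.executemany(
    "INSERT INTO forms (form, tags) VALUES (?, ?)",
    [(form, json.dumps(tags)) for form, tags in entries.items()],
)
conn.commit()


def get_tags(form):
    """Fetch the tags for one form without loading the whole table into memory."""
    row = conn.execute(
        "SELECT tags FROM forms WHERE form = ?", (form,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

The primary key gives an indexed lookup per form, and SQLite's default text comparison is case-sensitive, which matters for SLP1 where case distinguishes letters.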


vvasuki commented on August 15, 2024

DCS database experience ...

There is a difference in scale. Small databases reach consistency quite fast. Since you've got couchdb set up, just try replicating some small db.

Given that I am paying by the second, I don't want my training script to go over an internet connection every time it needs to lookup tags for a single word. It makes sense to have a self-contained database in this case.

Definitely agree. In such cases, the db should be replicated. But this is not what an average user would necessarily care for. In a realistic deployment, this package will sit in a web server.

I agree that they may be more natural when dealing with real-world documents that do not fall nicely into rows and columns, but we need to think whether they make sense in this particular application given the constraints. We are using a dict today for essentially key -> value mappings, and want to put this in a db.

That seems like a classic trap - these are your constraints today. Tomorrow, maybe you want to index with different "keys" as well as part of your program. If it takes little or no extra effort, stay flexible.

In the absence of a self-contained nosql python implementation, and given that sqlite3 is part of the python standard library and available on all platforms, I am not convinced that nosql has to be the way to go.

The premise is false. I only said that the one self-contained nosql database I know does not support python, not that they don't exist. A casual search throws up many possibilities - https://pypi.python.org/pypi/tinydb, https://sqlite.org/json1.html, Berkeley DB.


avinashvarna commented on August 15, 2024

Ok. I stand corrected :). My apologies for misinterpreting your statement about no python support.
TinyDB definitely seems to be a valid option, and as you say, given the choice, it is better to stay flexible. Will try to update the inriaxmlwrapper class to use tinydb.


vvasuki commented on August 15, 2024

TinyDB definitely seems to be a valid option, and as you say, given the choice, it is better to stay flexible.

Taking the flexibility preference to a slightly higher level - it is a good idea not to be "married" to any database technology. Access it via an interface (such as DbInterface and ClientInterface here - PS: you don't have to implement every method). Switching to a different database tool should be as simple as calling a different class's constructor - one shouldn't have to go messing about anywhere else.
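The idea can be sketched as follows. `LexiconStore`, `DictStore` and `SqliteStore` are hypothetical names invented for this example; they are not the DbInterface or ClientInterface classes mentioned above, whose actual methods are not shown in this thread.

```python
import json
import sqlite3
from abc import ABC, abstractmethod


class LexiconStore(ABC):
    """Hypothetical interface: the only thing callers are allowed to depend on."""

    @abstractmethod
    def get_tags(self, form):
        """Return the tags stored for a form, or None if the form is unknown."""

    def has_form(self, form):
        return self.get_tags(form) is not None


class DictStore(LexiconStore):
    """Backend that keeps everything in memory."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def get_tags(self, form):
        return self._entries.get(form)


class SqliteStore(LexiconStore):
    """Backend that keeps the mapping in an SQLite table, tags encoded as JSON."""

    def __init__(self, entries, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS forms (form TEXT PRIMARY KEY, tags TEXT)"
        )
        self._conn.executemany(
            "INSERT OR REPLACE INTO forms VALUES (?, ?)",
            [(form, json.dumps(tags)) for form, tags in entries.items()],
        )
        self._conn.commit()

    def get_tags(self, form):
        row = self._conn.execute(
            "SELECT tags FROM forms WHERE form = ?", (form,)
        ).fetchone()
        return json.loads(row[0]) if row else None


entries = {"rAmaH": ["tag1"], "gacCati": ["tag2"]}  # placeholder data
# Swapping backends is just a matter of calling a different constructor.
stores = [DictStore(entries), SqliteStore(entries)]
```

Code that only calls `get_tags` and `has_form` works unchanged with either backend, so a TinyDB or couchdb backend could be added later as another subclass.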


kmadathil commented on August 15, 2024

Is there any reason to keep this issue open? Should we create another to use standard DB interfaces and close this one?


avinashvarna commented on August 15, 2024

I am ok with closing this

