
samasasplitter's Introduction

Execution

python split.py <wordInSLP1> [<dictname>]

e.g.

python split.py DavaleSvarapriya

or

python split.py DavaleSvarapriya MD

Dictnames

dictionaryname = ["ACC","CAE","AE","AP90","AP","BEN","BHS","BOP","BOR","BUR","CCS","GRA","GST","IEG","INM","KRM","MCI","MD","MW72","MW","MWE","PD","PE","PGN","PUI","PWG","PW","SCH","SHS","SKD","SNP","STC","VCP","VEI","WIL","YAT","ALL","mwb"]

Notes - ALL stands for all Cologne dictionaries combined. mwb stands for "MW bricks", i.e. key2 of MW separated by hyphens, i.e. the split headwords of MW.

output

['Davala+ISvara+priya', 'Dava+lA+ISvara+priya']

Only the first 5 outputs are shown by default, in decreasing order of probability.

If you want all results, replace the print output[:5] line with print output.

Dictionary

  1. As speed decreases with too many headwords, we currently use only 'MD' as our base dictionary. To change the dictionary, alter the dictionary name in createhwlist('MD').

  2. Only words of length greater than 1 are taken.

  3. The dictionary (hwsorted.txt) is sorted according to the following logic.

    3.1. In decreasing order of the number of dictionaries in sanhw2.txt in which the headword occurs, e.g. headwords occurring in 29 dictionaries (and therefore more common) are sorted first, then 28, 27, ..., down to 1.

    3.2. In decreasing order of word length: the longest words are sorted first, the shortest at the end.

    3.3. If the above two values are equal, the words are sorted alphabetically.
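The three-level ordering above can be sketched in Python as a composite sort key. The sample headwords and dictionary counts below are illustrative, not taken from sanhw2.txt:

```python
# Hypothetical sketch of the hwsorted.txt ordering: dictionary count
# descending, then word length descending, then alphabetic ascending.
headwords = [
    ("deva", 29),    # (headword, number of dictionaries it occurs in)
    ("ISvara", 29),
    ("priya", 28),
    ("gaja", 28),
]

# Python compares tuples element by element, so negating the numeric
# keys gives descending order while the alphabetic tie-break stays ascending.
hwsorted = sorted(headwords, key=lambda hw: (-hw[1], -len(hw[0]), hw[0]))
```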

Further projects

Analyse Sanskrit headwords from Cologne dictionaries and separate the headwords which have compounds. See compoundstudy.py and compoundstudy/compoundhw.py

samasasplitter's People

Contributors: drdhaval2785

samasasplitter's Issues

Huet's lessons

First of all I wanted to salute you for https://github.com/drdhaval2785/samasasplitter. How quickly it came together, and even now it can handle sandhi! Huet and @mbykov have done a lot in the field lately and I hope they will comment. @funderburkjim is out of the game, but that is no reason not to listen to what he thinks of it.

An 80k word frequency list is included in https://github.com/gasyoun/SanskritLexicography/blob/0fb80a8de652e80eb5514d930289c0cc0588d85b/DCS_statistical_evaluation.htm
It is parsed from http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=corpus and contains part of MW. Of less interest, but still possibly useful, is https://github.com/gasyoun/SanskritLexicography/blob/0fb80a8de652e80eb5514d930289c0cc0588d85b/DCS-Moniers-roots-w-references.html Please also see
https://docs.google.com/document/d/11Z1snnew9a0eY96W5o-ZQ71Zve1WRjcOqfOFgagndy4/edit#heading=h.k0dxemsx30hk - questions I had after reading Gérard's emails:

Gérard Huet 08.02.14:

I have currently two kinds of suffix entries in my lexicon. 
Some are phonemic affixes used to indicate morphology, such as -na (even when it undergoes retroflexion when affixing).
Others are Paninian technical terms referring to generative morphology parameters, themselves often having little overlap with the final phonemic increment,
such as -cvi for inchoative compounds, -ktva for the -tva taddhita suffix under context condition k for constructing neuter abstracts of quality.
My goal is to replace progressively the approximate suffixes by more precise etymology indication, stating unambiguously the affixing operation.
I did this for k.rdanta constructions, at least completely for participles. This allowed me to replace e.g.
for \word{samucita} the approximate \desf{samuc}{-ita} by the precise \ppde{samuc} where my keyword ppde means (in French!) 
"passive past participle of". Thus I could give a unique scheme for all pp's, in -ta, -ita, -na, or whatever.
I also want to separate k.rt and taddhita suffixes. The latter are very numerous, and their productivity is unclear and non uniform.
I have worked out lately how to extend my machinery for automatic recognition of certain taddhita forms in order to parse long navya-nyaaya compounds.
Actually, an hour ago Arjuna, a student at UoH, just presented at SALA the result of joint research on this experiment, so I am very much into suffixes these days. 
If you are interested, you may play with the new "experimental" mode in my just released new V2.80 engine. 
In the reader page, set "Experiment" for Parser strength and "Word" for text, and you'll be able to parse compounds such as
hewuwAvacCexakAvacCinnahewvaXikaraNawAprawiyogikahewuwAvacCexakasambanXAvacCinnAXeyawAnirUpiwaviSeRaNawAviSeRasambanXena (in WX input).

Gérard Huet 21.03.14:

What should the entries of a dictionary be ?
In French, it is easy. Entries are all bare stems of words, which are assumed to be in finite number, plus a few exotic inflected forms, such as "yeux", the plural of "œil". For verbs, conjugated forms are not listed. There is a special book for conjugation, the Bescherelle, that lists all conjugation schemes and the verbs that belong to each class. Thus such a notion as "the longest word in the French language" makes sense; amazingly enough, it is "anticonstitutionnellement".
You have to work out for yourself that it is the adverb in -ment corresponding to the adjective "anticonstitutionnel", obtained by prefixing the opposing prefix anti-
to the adjective "constitutionnel", itself the adjective in -el giving the quality of the substantive "constitution", itself the verbal action in -tion corresponding to the
verb "constituer" (itself obtained from pre-verb con- in front of an ancient "stituer" coming from Latin).
This point of view is just ignoring the productive nature of morphology. For instance, a few years ago, a political figure used the word "bravitude" instead of
correct "bravoure" (braveness), and she was mocked as ignorant, even though "bravitude" is morphologically correct.
Now in Sanskrit we have to take care of productive morphology. For compounds, of course, but also for simple words, obtainable by complex morphological processes. This makes sense, since the grammar is very explicit about morphological formation, albeit in a specially complex way, using phonological
processes such as gu.na/v.rddhi and sandhi. The problem is how to reflect this information in a lexicon. Should we list pratyaayas? Does it make sense
to list such pratyaayas in lexicographic order, in reverse order, in frequency order, in whatever order? Look at "kvasu", which is discussed in the
Harkare-PratyayaKosha Issues that you pointed out to me. It is not an aadeza, at least not of the kind that makes the bhuu/as roots alternate.
It is a k.rt pratyaaya. It is used for forming the stem of the perfect participle, such as vidvas from root vid. This is stated here. :-)
Now if you look at the etymology of my entry vidvas, you see:  विद्वस् vidvas [ppft. vid_1] 
and not [k.rd(vid_1,kvasu)]. Note that here you need the whole grammar to tell you that ultimately vid-kvasu will compute into vidvas.
The "k" is not phonetic material, as part of some hypothetical morpheme "kvasu". It is a control argument to the computational process. 
Thus I indicate entry "kvasu" just as help for someone who wants to understand what this notion stands for in Paa.nini's grammar, but I keep implicit
in my etymological indication that ppft stem computation corresponds to k.rt kvasu, this is only needed for grammar specialists. Indeed even in my computer code
I do not use "kvasu", and the ppft stems are computed by cascades of morpho-phonetical processes which are not easily encodable into a simple notation.
Indeed often my participles are an abstraction over several pratyaaya affixes. You may look at the appendix of my COLING paper that tells in painful detail
how my computation of the future participle stem accounts for the set of suffixes {yat,kyap,.nyat}. This issue is complex. Paa.nini's grammar is a whole,
it is not a simply-connected set of modules. And it cannot be used stand-alone, you need the appropriate dhatupatha, and the ga.napatha as well. 
I have not studied Harkare's book, but it appears to me as a specialised Vyaakara.na work, assuming fine knowledge of all this grammar material,
and I do not see how to simply interleave it with a lexicon.
Take for instance: काठकः indicated as mysterious in your Harkare issues pages. Harkare mentions suutra IV-2-46. If you look at this suutra, you find:
"After names of Vedic schools, (the suffixes that are valid to designate a collection of objects) are the same as the ones that denote a rule (relative to the relevant school, as an extension to suutras IV,3,126 ff.)" where I put in parens what is implicit from the anuv.rtti. 
Now it should be obvious that this is simply an example, corresponding to the school Ka.tha and stipulating that ka.thaka, denoting a rule of this school,
denotes also the adepts of the school of sage Ka.tha, author of Kaṭhopaniṣad. 
Personally, I would not venture in this Harkare book without the help of a pandit or at least of a scholar who has completely mastered Paninian processes and their nomenclature. It is like trying to understand a contemporary mathematics article without the appropriate training. 

@drdhaval2785 I would go for:
3.1 Frequency
3.2 sanhw2 occurrence
3.3 Word length (DEC)
3.4 Alphabetic order

Extracting split words (from MW)

A preliminary version of compoundhw.txt and mwb.txt was shown to @gasyoun,
and he gave the following remarks:

  1. karuṇā-vipralambha [p= 255,3] is split in MW book. Why is it karuRAvipralamBa in your list?
  2. can you remember which word can be only 1st part, which 2nd or 3rd? Which in the beginning ? Which at the end only ?
  3. And there is no pratimanyUyamAna, but only apratimanyUyamAna:GST,MW,MW72,PW,PWG - so where do the wrong, impossible words come from?
  4. pari-puṭana splittable, but not in your list.

prefix and suffix data in MW key2

Per #2 (comment), @gasyoun wants a list of prefixes and suffixes in MW for his purposes.
Try to make a small script for this.
It may also prove useful for the splitter.

e.g.
a or A would be prefixoids in most cases.
Right now we are ignoring single-letter parts,
but I guess we can allow them as prefixes.

Code modification is not that easy.
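As a starting point, the prefix/suffix counting could be sketched like this. The split headwords below are illustrative samples, not the real contents of mwb.txt:

```python
from collections import Counter

# Hypothetical sketch: count which components appear first and last in
# hyphen-split MW headwords (the mwb "bricks"). Sample data only.
split_headwords = ["karuNA-vipralamBa", "a-pratima", "a-gaja", "pari-puwana"]

prefix_counts = Counter()
suffix_counts = Counter()
for hw in split_headwords:
    parts = hw.split("-")
    if len(parts) > 1:
        prefix_counts[parts[0]] += 1    # first component = prefix candidate
        suffix_counts[parts[-1]] += 1   # last component = suffix candidate
```

Ranking the counters by frequency would then surface prefixoids such as 'a' automatically.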

Use Oliver's statistics to give weightage

Oliver's DCS database provides data on the usage frequency of word components.
It would be a good idea to assign weights based on those frequencies, making the word breaks more corpus-driven rather than purely headword based.
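One standard way to turn corpus frequencies into weights is a negative-log-probability cost, so that a split's total cost is the sum of its components' costs. The frequencies below are made up for illustration, not taken from the DCS:

```python
import math

# Hypothetical sketch: frequency-weighted split costs.
# Lower cost = more plausible component.
dcs_frequency = {"deva": 5000, "ISvara": 3000, "priya": 4000, "la": 10}
total = sum(dcs_frequency.values())

def cost(word):
    # Rare or unseen components get a high cost (0.5 is a smoothing value).
    freq = dcs_frequency.get(word, 0.5)
    return -math.log(freq / total)

def split_cost(parts):
    # A candidate split's cost is the sum of its components' costs.
    return sum(cost(p) for p in parts)

# Ranking by split_cost prefers splits built from common words.
candidates = [["deva", "la"], ["deva"]]
best = min(candidates, key=split_cost)
```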

Speed

@drdhaval2785

Thanks a lot for the code and spending time to help. Would appreciate some more of your help:

$ python split.py DavaleSvarapriya MD
Reading knownpairs 2017-07-05 14:27:48.709972
Calculating costs of dictionary headwords 2017-07-05 14:27:48.863055
Calculated costs of dictionary headwords 2017-07-05 14:27:48.874079
Calculated maxword 2017-07-05 14:27:48.876109
valid permutations are 4
2017-07-05 14:27:48.933371
['Davala+ISvara+priya'] 5
2017-07-05 14:27:48.934428
$ python split.py astyuttarasyAmdiSidevatAtmA MD
Reading knownpairs 2017-07-05 14:31:07.751634
Calculating costs of dictionary headwords 2017-07-05 14:31:07.915072
Calculated costs of dictionary headwords 2017-07-05 14:31:07.928610
Calculated maxword 2017-07-05 14:31:07.931826
valid permutations are 1
2017-07-05 14:31:54.897486
astyuttarasyAmdiSidevatAtmA 4
2017-07-05 14:31:54.898249

The first invocation is impressively fast!

The second does not split at all. Any idea why?

False Positives tUzRIMdaRqa vs. t U z R I M d a R q a

Around 10% of samāsas analysed with MD return a rather poor result:

niHsADvasa | ni | H | s | A | D | v | a | s | a
nizpattraka | ni | z | p | a | t | t | r | a | k
nizpraBAva | ni | z | p | r | a | B | A | v | a

pratti

Any idea why these words have been split to letters, @drdhaval2785 ?

Sandhi Method.

@drdhaval2785

Thanks a lot for the code and spending time to help. Would appreciate some more of your help:

Please clarify if I've understood your sandhi technique correctly

lstrep = [
    ('A', ('A', 'aa', 'aA', 'Aa', 'AA', 'As')),
    ('I', ('I', 'ii', 'iI', 'Ii', 'II')),
    ('U', ('U', 'uu', 'uU', 'Uu', 'UU')),
    ('F', ('F', 'ff', 'fx', 'xf', 'Fx', 'xF', 'FF')),
    ('e', ('e', 'ea', 'ai', 'aI', 'Ai', 'AI')),
    ('o', ('o', 'oa', 'au', 'aU', 'Au', 'AU', 'aH', 'aHa', 'as')),
    ('E', ('E', 'ae', 'Ae', 'aE', 'AE')),
    ('O', ('O', 'ao', 'Ao', 'aO', 'AO')),
    ('ar', ('af', 'ar')),
    ('d', ('t', 'd')),
    ('H', ('H', 's')),
    ('S', ('S', 's', 'H')),
    ('M', ('m', 'M')),
    ('y', ('y', 'i', 'I')),
    ('N', ('N', 'M')),
    ('Y', ('Y', 'M')),
    ('R', ('R', 'M')),
    ('n', ('n', 'M')),
    ('m', ('m', 'M')),
    ('v', ('v', 'u', 'U')),
    ('r', ('r', 's', 'H')),
]

This is converted to a dict, and each letter that matches a key is (optionally) replaced by each of the replacements,
e.g. rAmeha, split at e, becomes rAmeha, rAmeaha, rAmaiha, rAmaIha, rAmAiha, rAmAIha.

Then you take the outer product of all such replacements using itertools.product, and split each of them.
Did I get that right?

If so, a couple of questions:

  1. Right context is not used at all, as I could gather. So
    ramogacCati = ramasgacCati (one of the options)
    ramokathayati = ramaskathayati (one of the options)
    the latter shouldn't happen. (Did I get this wrong?)
  2. Neither is left context, so
    ---ara--- could be split as ---as a--- which will not happen in reality (would be ---o---)
  3. I don't see things like jhalaM jaSo'nte. How does vAgarthau get split into vAk + arthau?

I would like to reuse as much as possible, so If I can pick this directly, I would like to. Maybe I'm missing an extra step somewhere?
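For reference, the mechanism described above (expand each matched letter to its possible pre-sandhi sources, then take the outer product) can be sketched as follows, assuming only a tiny subset of the lstrep table:

```python
import itertools

# Minimal sketch of the replacement-table technique: each matched letter
# expands to all of its possible pre-sandhi sources, and itertools.product
# enumerates every combination. Subset of the real lstrep table.
lstrep = {"o": ("o", "as", "aH"), "e": ("e", "ea", "ai")}

def desandhi_candidates(word):
    # Letters not in the table map only to themselves.
    options = [lstrep.get(ch, (ch,)) for ch in word]
    return ["".join(combo) for combo in itertools.product(*options)]

# "rAmo" expands to rAmo, rAmas, rAmaH
```

Note that, exactly as observed in questions 1 and 2 above, this expansion looks at neither the left nor the right context of the replaced letter.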

2016

Happy and productive new year 2016!

'a'/'A' as prefixoids

Per @gasyoun at #2 (comment), we should treat the words in MW where he has put a break after 'a' as genuine 'a-' breaks.

It seems easily doable to enter these words in a list and give a break after 'a'.
