wordninja's Introduction

Word Ninja

Slice your munged-together words! Seriously, take anything, 'imateapot' for example, and it becomes ['im', 'a', 'teapot']. Useful for humanizing stuff (like database table names when people don't like underscores).

This project repackages the excellent work from here: http://stackoverflow.com/a/11642687/2449774

Usage

$ python
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
>>> wordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Performance

It's super fast!

>>> import timeit
>>> def f():
...   wordninja.split('imateapot')
... 
>>> timeit.timeit(f, number=10000)
0.40885152100236155

It can handle long strings:

>>> wordninja.split('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']

And it scales well. (This string takes ~7 ms to compute.)

How to Install

pip3 install wordninja

Custom Language Models

The #1 most requested feature! If you want to use a language other than English (or want to supply your own model of English), this is how you do it.

>>> lm = wordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der', 'ek']

Language files must be gzipped text files with one word per line, in decreasing order of probability.

If you want to make your model the default, set:

wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')
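
For reference, here is a minimal sketch (not part of the wordninja API) of how such a file could be produced from a raw corpus; my_corpus.txt and the whitespace tokenization are assumptions for the example:

import gzip
from collections import Counter

# Count word frequencies in a (hypothetical) corpus file and write the words,
# one per line and most frequent first, to a gzipped text file.
with open('my_corpus.txt', encoding='utf-8') as f:
    counts = Counter(f.read().lower().split())

with gzip.open('my_lang.txt.gz', 'wt', encoding='utf-8') as f:
    for word, _ in counts.most_common():
        f.write(word + '\n')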

wordninja's People

Contributors

garfunkel, keredson, kolanich, srandal, sunzewei2715

wordninja's Issues

Add optional encoding and errors parameters to LanguageModel constructor

Currently, the LanguageModel constructor in the wordninja.py file opens the word file using gzip.open() without any option to specify the file encoding. This means that users who have word files with non-UTF-8 encoding may encounter decoding errors when using the wordninja package.

To address this issue, I propose modifying the __init__ function in the wordninja.py file to include an optional encoding parameter that can be used to specify the encoding of the word file. Additionally, I suggest adding an optional errors parameter to allow users to customize how decoding errors are handled.

Here's an example of what the modified function could look like:

def __init__(self, word_file, encoding='utf-8', errors='strict'):
    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
    with gzip.open(word_file) as f:
        words = f.read().decode(encoding=encoding, errors=errors).split()
    self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    self._maxword = max(len(x) for x in words)

By adding these optional parameters, users can specify the encoding and error handling behavior of the word file when they create a LanguageModel instance, allowing them to use files in different encodings without having to modify the source code.
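
With the proposed parameters, constructing a model from, say, a Latin-1 encoded word file (the file name here is just a hypothetical example) would look like:

lm = wordninja.LanguageModel('my_lang_latin1.txt.gz', encoding='latin-1', errors='replace')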

I plan to submit a pull request with these changes. Please let me know if there are any concerns or suggestions for improvement.

Thank you.

return list of all options sorted by likelihood

Rather than returning only the most likely list, I would like to get the X most likely options.

E.g. for 'beanally' it should return

[
  ["bean", "ally"],
  ["be", "an", "ally"],
  ["be", "anally"],
  ["be", "anal", "ly"],
]
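
Not supported today, but as a rough sketch of how it could work on top of the existing cost table (it reads the model's private _wordcost and _maxword attributes, so treat it as illustrative only), a k-best version of the dynamic program might look like:

import heapq

# Keep the k lowest-cost segmentations of every prefix of s instead of only
# the single best one. wordcost/maxword come from a LanguageModel instance,
# e.g. lm._wordcost and lm._maxword.
def split_topk(s, wordcost, maxword, k=4):
    s = s.lower()
    best = [[(0.0, [])]] + [[] for _ in range(len(s))]
    for i in range(1, len(s) + 1):
        candidates = []
        for j in range(max(0, i - maxword), i):
            piece = s[j:i]
            cost = wordcost.get(piece, 9e999)
            for prev_cost, prev_words in best[j]:
                candidates.append((prev_cost + cost, prev_words + [piece]))
        best[i] = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    return [words for _, words in best[-1]]

# split_topk('beanally', lm._wordcost, lm._maxword) would then return up to
# four candidate segmentations, cheapest first.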

Larger corpus?

(This is a suggested improvement)

The corpus currently used is very small and seems to have just been thrown together by the original author (who called it "quick and dirty"). A larger corpus would be much appreciated, since the main problem with this library (which I've been using on and off for the past year, with mixed results) seems to be the small number of words it can detect (e.g. it couldn't even detect contractions properly before those were added to the corpus).

Something like the following might be good:
https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html
or
https://www.corpusdata.org/formats.asp

Split on hyphen (-)

Is it possible to not split up words that are hyphenated?
For example, "Post-Punk" is split into "Post" and "Punk" even though I have already added "Post-Punk" to my custom dictionary.
Thank you.
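
One workaround (not a wordninja feature) is to keep a small set of protected tokens and only pass everything else through wordninja; the protected set below is a hypothetical example:

import wordninja

# Leave whitelisted hyphenated tokens alone and split everything else.
def split_keep_protected(text, protected=('post-punk',)):
    out = []
    for token in text.split():
        if token.lower() in protected:
            out.append(token)
        else:
            out.extend(wordninja.split(token))
    return out

# split_keep_protected('the Post-Punk revival') should yield
# ['the', 'Post-Punk', 'revival']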

The word split is kinda too aggressive

Thanks for this great work. I tried it out and found that the split is sometimes too aggressive for me: for example, 'occupational' is split into 'occ', 'u', 'p', 'a', 't', 'ional', and 'particulate' into 'part', 'icu', 'late'. Strangely it's not always like this; sometimes I get 'occupational' and 'particulate' back correctly. Any thoughts about why this happens?

Read other language dictionaries

Hello,

I would like to suggest an enhancement to allow other language dictionaries to be fed into wordninja.
Right now the parameter is configured to use wordninja_words.txt.gz.

Thank you!

Splitting iam

So "Iam" doesn't work (splits into ['I', 'a', 'm'] - that seems to be the case for any capitalised word).

And "iam", stays "iam".

Make wordninja caps aware?

Is there a way to make wordninja aware of capital letters and able to cut at caps boundaries when the dictionary doesn't cover them? Or is there a way to build a caps-aware dictionary?

For example, rCBVmeanSD should be cut as rCBV mean SD. But of course my domain-specific language model must define both mean and means, and the resulting cut is 'rCBV', 'meanS', 'D'.

Thoughts?
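
One possible workaround (a rough sketch, not a wordninja feature) is to cut at case boundaries first and only hand the all-lowercase runs to the model; note that 'rCBV' still comes apart at the r/CBV boundary with this heuristic:

import re
import wordninja  # or a domain-specific LanguageModel

# Split into runs of capitals, runs of lowercase letters, and runs of digits,
# then let the language model segment only the all-lowercase runs.
def caps_aware_split(text):
    out = []
    for run in re.findall(r'[A-Z]+|[a-z]+|\d+', text):
        out.extend(wordninja.split(run) if run.islower() else [run])
    return out

# caps_aware_split('rCBVmeanSD') -> ['r', 'CBV', 'mean', 'SD']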

LanguageModel split fails when there is unrecognized characters

Hi

I am using the LanguageModel split with a wordlist for Mandarin chinese using these lists: https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists (3rd column with accents removed, file attached)

pinyin.txt.gz

I have noticed this behaviour (xxx is a sequence of unrecognized characters):

>>> lm = wordninja.LanguageModel('pinyin.txt.gz')
>>> lm.split('beijingdaibiaochu')
['beijing', 'daibiao', 'chu']
>>> lm.split('xxxbeijingdaibiaochu')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('beijingxxxdaibiaochu')
['beijing', 'x', 'x', 'x', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']

Expected output should be:

['xxx', 'beijing', 'daibiao', 'chu']
['beijing', 'xxx', 'daibiao', 'chu']

append words to default wordlist

I understand you can have your own custom wordlist and even make it the default. It would be great if one could also append to the current default. The default one works great for me except for a few words that are missing.
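
One possible workaround (a sketch, not a built-in feature; the location of the packaged word list is version-dependent, so adjust the path as needed) is to copy the bundled list, append your extra words, and load the result as the default model:

import gzip
import os
import wordninja

# Recent releases keep the packaged list under a 'wordninja' data directory
# next to the module; older releases kept it right next to wordninja.py.
bundled = os.path.join(os.path.dirname(wordninja.__file__),
                       'wordninja', 'wordninja_words.txt.gz')
with gzip.open(bundled, 'rt') as f:
    words = f.read().split()

# Hypothetical extra words; appending puts them at the lowest probability.
words += ['myfirstextraword', 'mysecondextraword']

with gzip.open('my_lang.txt.gz', 'wt') as f:
    f.write('\n'.join(words))

wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')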

Split on accented characters

Hello,

I ran into a problem splitting words that contain accented characters. I built a large dictionary file and this is the behaviour I get:

lm.split('Jenesaispasquoienpenserdecemachinlàmaisçasemblefonctionner')
['Je', 'ne', 'sais', 'pas', 'quoi', 'en', 'penser', 'de', 'ce', 'machin', 'l', 'mais', 'as', 'emble', 'fonctionner']

Expected output would be:
['Je', 'ne', 'sais', 'pas', 'quoi', 'en', 'penser', 'de', 'ce', 'machin', 'là', 'mais', 'ça', 'semble', 'fonctionner']

The à and ç characters seem to be causing the problems here.
Any pointers for me?

Unable to package this library using pyinstaller

Traceback (most recent call last):
  File "Tool.py", line 8, in <module>
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "PyInstaller\loader\pyimod02_importers.py", line 419, in exec_module
  File "wordninja.py", line 80, in <module>
  File "wordninja.py", line 31, in __init__
  File "gzip.py", line 58, in open
  File "gzip.py", line 173, in __init__
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\rock\Documents\JSON_Proto_Creation\dist\Tool\_internal\wordninja\wordninja_words.txt.gz'

The tool is packaged, but the above issue appears when I try to launch the exe.
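
A likely cause is that PyInstaller does not collect wordninja's gzipped word list automatically. One workaround (untested here, and the path below is a placeholder) is to bundle the data file explicitly so it ends up under wordninja\ inside the build, for example:

pyinstaller --add-data "<path-to-site-packages>\wordninja\wordninja_words.txt.gz;wordninja" Tool.py

On recent PyInstaller versions, --collect-data wordninja may achieve the same thing.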

[Feature Request] Exception List

Hello,

I have an idea. If an exception list or whitelist were implemented so that the listed word sequences are not split, it would be very useful. For example, "tensorflow" is split into "tensor" and "flow". So we could provide entities that we don't want to be split, and after the algorithm runs they remain intact.
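
As a rough illustration of the requested behaviour (not something wordninja provides), one could cut the whitelisted terms out first and only segment what remains; the whitelist below is a hypothetical example:

import re
import wordninja

# Protect whitelisted substrings and run wordninja on everything around them.
def split_with_whitelist(text, whitelist=('tensorflow',)):
    pattern = re.compile('|'.join(map(re.escape, whitelist)), re.IGNORECASE)
    out, pos = [], 0
    for m in pattern.finditer(text):
        out.extend(wordninja.split(text[pos:m.start()]))
        out.append(m.group(0))
        pos = m.end()
    out.extend(wordninja.split(text[pos:]))
    return out

# split_with_whitelist('iusetensorflowdaily') should yield something like
# ['i', 'use', 'tensorflow', 'daily']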

Italian language

Hi, I don't understand how I could add an Italian dictionary so that words are processed in the right way. Could you explain how, please? Thank you for your support and your patience, and I apologise for my basic question.

Incorrect Zipf exponent

I added your Zipf cost to a classroom demo I had (https://github.com/christophsk/segment-string) and found that the string "iamnotanumberiamaperson" segments as "iam not a number iam a person" instead of "i am not a number i am a person". The latter is found using word probabilities from English Wikipedia.

The cause is the Zipf exponent log(len(words)) in

self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))

This value is too large. The exponent is a constant, independent of the size of the language model. Measurements suggest a value of about 2.5, i.e. frequency is proportional to 1 / rank^2.5. Using this value produces a correct result.

I suggest line 33 in wordninja.py be changed to:

self._wordcost = dict((k, log((i+1) * 2.5)) for i, k in enumerate(words))

Installation trouble

Hey, found this package via the SO answer.

It doesn't seem like the .tar.gz file is pip-installed into the right location:

import wordninja

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-3-b786e3290074> in <module>()
----> 1 import wordninja

/home/rainer/.virtualenvs/ml/local/lib/python2.7/site-packages/wordninja.py in <module>()
     14 
     15 # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
---> 16 with gzip.open(os.path.join(os.path.dirname(os.path.abspath(__file__)),'wordninja_words.txt.gz')) as f:
     17   words = f.read().decode().split()
     18 _wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))

/usr/lib/python2.7/gzip.pyc in open(filename, mode, compresslevel)
     32 
     33     """
---> 34     return GzipFile(filename, mode, compresslevel)
     35 
     36 class GzipFile(io.BufferedIOBase):

/usr/lib/python2.7/gzip.pyc in __init__(self, filename, mode, compresslevel, fileobj, mtime)
     92             mode += 'b'
     93         if fileobj is None:
---> 94             fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
     95         if filename is None:
     96             # Issue #13781: os.fdopen() creates a fileobj with a bogus name

IOError: [Errno 2] No such file or directory: '/home/rainer/.virtualenvs/ml/local/lib/python2.7/site-packages/wordninja_words.txt.gz'

Instead it seems to be in the root of my virtualenv:

rainer@rainer-Galago-Pro ~/.v/ml> ls
bin/  include/  lib/  local/  pip-selfcheck.json  share/  wordninja_words.txt.gz
rainer@rainer-Galago-Pro ~/.v/ml> pwd
/home/rainer/.virtualenvs/ml
