wordninja's Introduction

Word Ninja

Slice your munged-together words! Seriously, take anything, 'imateapot' for example, and it becomes ['im', 'a', 'teapot']. Useful for humanizing stuff (like database table names when people don't like underscores).

This project repackages the excellent work from here: http://stackoverflow.com/a/11642687/2449774

Usage

$ python
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
>>> wordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Performance

It's super fast!

>>> import timeit
>>> def f():
...   wordninja.split('imateapot')
... 
>>> timeit.timeit(f, number=10000)
0.40885152100236155

It can handle long strings:

>>> wordninja.split('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']

And it scales well. (This string takes ~7 ms to compute.)

How to Install

pip3 install wordninja

Custom Language Models

The #1 most requested feature! If you want to use a language other than English (or want to supply your own model of English), this is how you do it.

>>> lm = wordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der', 'ek']

Language files must be gzipped text files with one word per line, in decreasing order of probability.

If you want to make your model the default, set:

wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')
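
For reference, here is a minimal sketch (not part of the wordninja API) of how such a file could be produced from a raw corpus; my_corpus.txt and the whitespace tokenization are assumptions for the example:

import gzip
from collections import Counter

# Count word frequencies in a (hypothetical) corpus file and write the words,
# one per line and most frequent first, to a gzipped text file.
with open('my_corpus.txt', encoding='utf-8') as f:
    counts = Counter(f.read().lower().split())

with gzip.open('my_lang.txt.gz', 'wt', encoding='utf-8') as f:
    for word, _ in counts.most_common():
        f.write(word + '\n')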

wordninja's People

Contributors

garfunkel, keredson, kolanich, srandal, sunzewei2715

wordninja's Issues

Add optional encoding and errors parameters to LanguageModel constructor

Currently, the LanguageModel constructor in the wordninja.py file opens the word file using gzip.open() without any option to specify the file encoding. This means that users who have word files with non-UTF-8 encoding may encounter decoding errors when using the wordninja package.

To address this issue, I propose modifying the __init__ function in the wordninja.py file to include an optional encoding parameter that can be used to specify the encoding of the word file. Additionally, I suggest adding an optional errors parameter to allow users to customize how decoding errors are handled.

Here's an example of what the modified function could look like:

def __init__(self, word_file, encoding='utf-8', errors='strict'):
    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
    with gzip.open(word_file) as f:
        words = f.read().decode(encoding=encoding, errors=errors).split()
    self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    self._maxword = max(len(x) for x in words)

By adding these optional parameters, users can specify the encoding and error handling behavior of the word file when they create a LanguageModel instance, allowing them to use files in different encodings without having to modify the source code.
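
With the proposed parameters, constructing a model from, say, a Latin-1 encoded word file (the file name here is just a hypothetical example) would look like:

lm = wordninja.LanguageModel('my_lang_latin1.txt.gz', encoding='latin-1', errors='replace')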

I plan to submit a pull request with these changes. Please let me know if there are any concerns or suggestions for improvement.

Thank you.

return list of all options sorted by likelihood

Rather than returning only the most likely list, I would like to get the X most likely options.

E.g. for 'beanally' it should return

[
  ["bean", "ally"],
  ["be", "an", "ally"],
  ["be", "anally"],
  ["be", "anal", "ly"],
]
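
Not supported today, but as a rough sketch of how it could work on top of the existing cost table (it reads the model's private _wordcost and _maxword attributes, so treat it as illustrative only), a k-best version of the dynamic program might look like:

import heapq

# Keep the k lowest-cost segmentations of every prefix of s instead of only
# the single best one. wordcost/maxword come from a LanguageModel instance,
# e.g. lm._wordcost and lm._maxword.
def split_topk(s, wordcost, maxword, k=4):
    s = s.lower()
    best = [[(0.0, [])]] + [[] for _ in range(len(s))]
    for i in range(1, len(s) + 1):
        candidates = []
        for j in range(max(0, i - maxword), i):
            piece = s[j:i]
            cost = wordcost.get(piece, 9e999)
            for prev_cost, prev_words in best[j]:
                candidates.append((prev_cost + cost, prev_words + [piece]))
        best[i] = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    return [words for _, words in best[-1]]

# split_topk('beanally', lm._wordcost, lm._maxword) would then return up to
# four candidate segmentations, cheapest first.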

Larger corpus?

(This is a suggested improvement)

The corpus currently used is very small and seems to have just been thrown together by the original author (who called it "quick and dirty"). A larger corpus would be much appreciated, since the main problem with this library (which I've been using on and off for the past year, with mixed results) seems to be the small number of words it can detect (e.g. it couldn't even detect contractions properly before those were added to the corpus).

Something like the following might be good:
https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html
or
https://www.corpusdata.org/formats.asp

Split on hyphen (-)

Is it possible to not split up words that are hyphenated?
For example, "Post-Punk" is split into "Post" and "Punk" even though I have already added "Post-Punk" to my custom dictionary.
Thank you.
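
One workaround (not a wordninja feature) is to keep a small set of protected tokens and only pass everything else through wordninja; the protected set below is a hypothetical example:

import wordninja

# Leave whitelisted hyphenated tokens alone and split everything else.
def split_keep_protected(text, protected=('post-punk',)):
    out = []
    for token in text.split():
        if token.lower() in protected:
            out.append(token)
        else:
            out.extend(wordninja.split(token))
    return out

# split_keep_protected('the Post-Punk revival') should yield
# ['the', 'Post-Punk', 'revival']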

The word split is kinda too aggressive

Thanks for this great work. I tried it out and found that the split is sometimes too aggressive for me: for example, 'occupational' is split into 'occ', 'u', 'p', 'a', 't', 'ional', and 'particulate' into 'part', 'icu', 'late'. Strangely it's not always like this; sometimes I get 'occupational' and 'particulate' back correctly. Any thoughts about why this happens?

Read other language dictionaries

Hello,

I would like to suggest an enhancement to allow other language dictionaries to be fed into wordninja.
Right now the parameter is configured to use wordninja_words.txt.gz.

Thank you!

Splitting iam

So "Iam" doesn't work (splits into ['I', 'a', 'm'] - that seems to be the case for any capitalised word).

And "iam", stays "iam".

Make wordninja caps aware?

Is there a way to make wordninja aware of capital letters and able to cut at caps boundaries when the dictionary doesn't cover them? Or is there a way to build a caps-aware dictionary?

For example, rCBVmeanSD should be cut as rCBV mean SD. But of course my domain-specific language model must define both mean and means, and the resulting cut is 'rCBV', 'meanS', 'D'.

Thoughts?
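
One possible workaround (a rough sketch, not a wordninja feature) is to cut at case boundaries first and only hand the all-lowercase runs to the model; note that 'rCBV' still comes apart at the r/CBV boundary with this heuristic:

import re
import wordninja  # or a domain-specific LanguageModel

# Split into runs of capitals, runs of lowercase letters, and runs of digits,
# then let the language model segment only the all-lowercase runs.
def caps_aware_split(text):
    out = []
    for run in re.findall(r'[A-Z]+|[a-z]+|\d+', text):
        out.extend(wordninja.split(run) if run.islower() else [run])
    return out

# caps_aware_split('rCBVmeanSD') -> ['r', 'CBV', 'mean', 'SD']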

LanguageModel split fails when there is unrecognized characters

Hi

I am using the LanguageModel split with a wordlist for Mandarin chinese using these lists: https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists (3rd column with accents removed, file attached)

pinyin.txt.gz

I have noticed this behaviour (xxx is a sequence of unrecognized characters):

>>> lm = wordninja.LanguageModel('pinyin.txt.gz')
>>> lm.split('beijingdaibiaochu')
['beijing', 'daibiao', 'chu']
>>> lm.split('xxxbeijingdaibiaochu')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('beijingxxxdaibiaochu')
['beijing', 'x', 'x', 'x', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']

Expected output should be:

['xxx', 'beijing', 'daibiao', 'chu']
['beijing', 'xxx', 'daibiao', 'chu']

append words to default wordlist

I understand you can have your own custom wordlist and even make it the default. It would be great if one could also append to the current default. The default one works great for me except for a few words that are missing.
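
One possible workaround (a sketch, not a built-in feature; the location of the packaged word list is version-dependent, so adjust the path as needed) is to copy the bundled list, append your extra words, and load the result as the default model:

import gzip
import os
import wordninja

# Recent releases keep the packaged list under a 'wordninja' data directory
# next to the module; older releases kept it right next to wordninja.py.
bundled = os.path.join(os.path.dirname(wordninja.__file__),
                       'wordninja', 'wordninja_words.txt.gz')
with gzip.open(bundled, 'rt') as f:
    words = f.read().split()

# Hypothetical extra words; appending puts them at the lowest probability.
words += ['myfirstextraword', 'mysecondextraword']

with gzip.open('my_lang.txt.gz', 'wt') as f:
    f.write('\n'.join(words))

wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')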

Split on accented characters

Hello,

I ran into a problem splitting words that contain accented characters. I built a large dictionary file and this is the behaviour I get:

lm.split('Jenesaispasquoienpenserdecemachinlàmaisçasemblefonctionner')
['Je', 'ne', 'sais', 'pas', 'quoi', 'en', 'penser', 'de', 'ce', 'machin', 'l', 'mais', 'as', 'emble', 'fonctionner']

Expected output would be:
['Je', 'ne', 'sais', 'pas', 'quoi', 'en', 'penser', 'de', 'ce', 'machin', 'là', 'mais', 'ça', 'semble', 'fonctionner']

The à and ç characters seem to be causing the problems here.
Any pointers for me?

Unable to package this library using pyinstaller

Traceback (most recent call last):
  File "Tool.py", line 8, in <module>
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "PyInstaller\loader\pyimod02_importers.py", line 419, in exec_module
  File "wordninja.py", line 80, in <module>
  File "wordninja.py", line 31, in __init__
  File "gzip.py", line 58, in open
  File "gzip.py", line 173, in __init__
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\rock\Documents\JSON_Proto_Creation\dist\Tool\_internal\wordninja\wordninja_words.txt.gz'

The tool is packaged, but the above issue appears when I try to launch the exe.
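
A likely cause is that PyInstaller does not collect wordninja's gzipped word list automatically. One workaround (untested here, and the path below is a placeholder) is to bundle the data file explicitly so it ends up under wordninja\ inside the build, for example:

pyinstaller --add-data "<path-to-site-packages>\wordninja\wordninja_words.txt.gz;wordninja" Tool.py

On recent PyInstaller versions, --collect-data wordninja may achieve the same thing.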

[Feature Request] Exception List

Hello,

I have an idea. If an exception list or whitelist were implemented so that the listed word sequences are not split, it would be very useful. For example, "tensorflow" is split into "tensor" and "flow". So we could provide entities that we don't want to be split, and after the algorithm runs they remain intact.
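
As a rough illustration of the requested behaviour (not something wordninja provides), one could cut the whitelisted terms out first and only segment what remains; the whitelist below is a hypothetical example:

import re
import wordninja

# Protect whitelisted substrings and run wordninja on everything around them.
def split_with_whitelist(text, whitelist=('tensorflow',)):
    pattern = re.compile('|'.join(map(re.escape, whitelist)), re.IGNORECASE)
    out, pos = [], 0
    for m in pattern.finditer(text):
        out.extend(wordninja.split(text[pos:m.start()]))
        out.append(m.group(0))
        pos = m.end()
    out.extend(wordninja.split(text[pos:]))
    return out

# split_with_whitelist('iusetensorflowdaily') should yield something like
# ['i', 'use', 'tensorflow', 'daily']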

Italian language

Hi, I don't understand how I could add an Italian dictionary so that words are processed in the right way. Could you explain how, please? Thank you for your support and your patience, and I apologise for my basic question.

Incorrect Zipf exponent

I added your Zipf cost to a classroom demo I had (https://github.com/christophsk/segment-string) and found that the string "iamnotanumberiamaperson" segments as "iam not a number iam a person" instead of "i am not a number i am a person". The latter is found using word probabilities from English Wikipedia.

The cause is the Zipf exponent log(len(words)) in

self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))

This value is too large. The exponent is a constant, independent of the size of the language model. Measurements suggest a value of about 2.5, i.e. frequency is proportional to 1 / rank^2.5. Using this value produces a correct result.

I suggest line 33 in wordninja.py be changed to:

self._wordcost = dict((k, log((i+1) * 2.5)) for i, k in enumerate(words))

Installation trouble

Hey, found this package via the SO answer.

It doesn't seem like the .tar.gz file is pip-installed into the right location:

import wordninja

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-3-b786e3290074> in <module>()
----> 1 import wordninja

/home/rainer/.virtualenvs/ml/local/lib/python2.7/site-packages/wordninja.py in <module>()
     14 
     15 # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
---> 16 with gzip.open(os.path.join(os.path.dirname(os.path.abspath(__file__)),'wordninja_words.txt.gz')) as f:
     17   words = f.read().decode().split()
     18 _wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))

/usr/lib/python2.7/gzip.pyc in open(filename, mode, compresslevel)
     32 
     33     """
---> 34     return GzipFile(filename, mode, compresslevel)
     35 
     36 class GzipFile(io.BufferedIOBase):

/usr/lib/python2.7/gzip.pyc in __init__(self, filename, mode, compresslevel, fileobj, mtime)
     92             mode += 'b'
     93         if fileobj is None:
---> 94             fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
     95         if filename is None:
     96             # Issue #13781: os.fdopen() creates a fileobj with a bogus name

IOError: [Errno 2] No such file or directory: '/home/rainer/.virtualenvs/ml/local/lib/python2.7/site-packages/wordninja_words.txt.gz'

Instead it seems to be in the root of my virtualenv:

rainer@rainer-Galago-Pro ~/.v/ml> ls
bin/  include/  lib/  local/  pip-selfcheck.json  share/  wordninja_words.txt.gz
rainer@rainer-Galago-Pro ~/.v/ml> pwd
/home/rainer/.virtualenvs/ml
