
natural's People

Contributors

adamb0mb, adrianmcli, alexlangberg, chrisumbel, danielruf, dav009, dependabot[bot], gmarty, hugo-ter-doest, johnmarkos, joscha, kbrabrand, kkoch986, kostasx, lfilho, liwenzhu, mbc1990, mdantas, mdmower, mef, moos, mullr, nishant8bits, rawpixel-vincent, reblws, saidelimam, seejohnrun, thomashuet, tj, vladimir-polyakov

natural's Issues

find if a word is misspelled

Hello & congrats on your work on this library. I have begun to use natural and I want to ask: is there any way to find out whether a word is misspelled, or whether a word is an English word? Any suggestions?

make test

what test framework is this? we should add a test target so it's easier to run

Multiple download of WordNet DB files

When a WordNet API function is called for the first time, the program tries to download the DB files. If multiple API calls are made in quick succession, multiple download requests get queued. This resulted in truncated files and hanging API calls for me.

WordNet 3.0 raw data files are now available as an npm module, WNdb. It can be installed with 'npm install WNdb' or added to the dependency list.

I think it would be more robust to fetch the files at install time, or have the user install them manually if size (about 10 MB compressed) is an issue.
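
For what it's worth, a sketch of the install-time approach, assuming WNdb exposes the directory of its data files as a path property (check the WNdb README for the exact name in your version):

var natural = require('natural');
var WNdb = require('WNdb'); // npm install WNdb

// point WordNet at the locally installed data files so nothing is downloaded at runtime
var wordnet = new natural.WordNet(WNdb.path);

wordnet.lookup('node', function(results) {
    console.log(results.length);
});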

stemming and classification problems

Here is the whole code:

var natural = require('natural'),
classifier = new natural.BayesClassifier(natural.PorterStemmer);
classifier.addDocument("Master Chief returns in Halo 4, part of a new trilogy in the colossal Halo universe.Set almost five years after the events of Halo 3, Halo 4 takes the series in a new direction and sets the stage for an epic new sci-fi saga, in which the Master Chief returns to confront his destiny and face an ancient evil that threatens the fate of the entire universe. Halo 4 also introduces a new multiplayer offering, called Halo Infinity Multiplayer, that builds off of the Halo franchise's rich multiplayer history. The hub of the Halo 4 multiplayer experience is the UNSC Infinity – the largest starship in the UNSC fleet that serves as the center of your Spartan career. Here you’ll build your custom Spartan-IV supersoldier, and progress your multiplayer career across all Halo 4 competitive and cooperative game modes.", 'halo');
classifier.addDocument("The hunted becomes the hunter in the CryEngine-powered open-world shooter Crysis 3! Players take on the role of 'Prophet' as he returns to New York in the year in 2047, only to discover that the city has been encased in a Nanodome created by the corrupt Cell Corporation. The New York City Liberty Dome is a veritable urban rainforest teeming with overgrown trees, dense swamplands and raging rivers. Within the Liberty Dome, seven distinct and treacherous environments become known as the Seven Wonders. This dangerous new world demands advanced weapons and tactics. Prophet will utilize a lethal composite bow, an enhanced Nanosuit and devastating alien tech to become the deadliest hunter on the planet.Prophet is on a revenge mission after uncovering the truth behind Cell Corporation's motives for building the quarantined Nanodomes. The citizens were told that the giant citywide structures were resurrected to protect the population and to cleanse these metropolises of the remnants of Ceph forces. In reality, the Nanodomes are CELL's covert attempt at a land and technology grab in their quest for global domination. With Alien Ceph lurking around every corner and human enemies on the attack, nobody is safe in the path of vengeance. Everyone is a target in Prophet's quest for retribution.", 'crysis');
classifier.train();


console.log(classifier.classify('nano'));
console.log(classifier.classify('evil'));

Both logs print "halo"; the first one should print "crysis". How can I fix it?

unable to open .../WNdb/dict/index.adv

When I perform the following a couple of million times, I get the error "unable to open .../WNdb/dict/index.adv":

function isWord(text, cb) {
    wordnet.lookup(text, function(results) {
        cb(Array.isArray(results) && results.length > 0);
    });
}

Is there anything I can do to resolve this?
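
One possible workaround (just a sketch, not part of natural): memoize the results so repeated words don't reopen the index files millions of times; how much this helps depends on how skewed the word distribution is.

var cache = {};

function isWord(text, cb) {
    if (cache.hasOwnProperty(text)) {
        return process.nextTick(function() { cb(cache[text]); });
    }
    wordnet.lookup(text, function(results) {
        cache[text] = Array.isArray(results) && results.length > 0;
        cb(cache[text]);
    });
}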

Tokenization with punctuation removed.

Hi Gents,

I've just started looking at this package for Node and it looks interesting. I am testing the tokenizer and it works fine, but it doesn't seem to strip out commas and other punctuation. I'm using it like so:

var tokenizer = new natural.TreebankWordTokenizer();
tokenizer.tokenize(someString);

Looking at the source code, it looks like it should work. Are you aware of any current issues with this?
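
In case it helps, a minimal sketch with natural's WordTokenizer, which splits on non-alphanumeric characters and therefore drops punctuation (the Treebank tokenizer deliberately keeps punctuation as separate tokens):

var natural = require('natural');
var tokenizer = new natural.WordTokenizer();

console.log(tokenizer.tokenize('Hello, world. How are you?'));
// expected: [ 'Hello', 'world', 'How', 'are', 'you' ]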

Cheers.

Normalizer?

I'd like to contribute a normalizer for Japanese. This is basically a set of replacement functions to normalize Japanese input before further processing. It can also be used to do some conversion (such as full-width <-> half-width characters).

But I'm not sure where to put it; there's no such tool for any language yet in natural. Should it go under a new folder at /lib/natural/normalizer?

Please let me know if there's a more appropriate place.
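
For illustration, a minimal sketch of one such replacement (not tied to any existing natural API): converting full-width ASCII variants and the ideographic space to their half-width equivalents.

function toHalfWidth(input) {
    return input
        .replace(/[\uFF01-\uFF5E]/g, function(ch) {
            // full-width ASCII variants are offset from ASCII by 0xFEE0
            return String.fromCharCode(ch.charCodeAt(0) - 0xFEE0);
        })
        .replace(/\u3000/g, ' '); // ideographic space
}

console.log(toHalfWidth('ＡＢＣ　１２３')); // "ABC 123"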

Example of Sentence Analyzer Use

Great project! I saw that there is a sentence analyzer, with some notes as follows:

Take a POS input and analyse it for

  • Type of Sentence
    • Interrogative

      - Tag Questions

    • Declarative
    • Exclamatory
    • Imperative
  • Parts of a Sentence
    • Subject
    • Predicate
  • Show Preposition Phrases

However, I didn't see any examples of using the sentence analyzer. In particular, I am unclear how to generate the necessary POS input. Are there some examples knocking around?

Many thanks in advance
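
Not an official example, but in more recent versions of natural a POS tag sequence can be produced roughly as below; the constructor arguments and the shape of the result vary by version, so treat this as a sketch rather than the exact API.

var natural = require('natural');

var lexicon = new natural.Lexicon('EN', 'N'); // 'N' is the default tag
var ruleSet = new natural.RuleSet('EN');
var tagger = new natural.BrillPOSTagger(lexicon, ruleSet);

var tokens = new natural.WordTokenizer().tokenize('The dog chased the cat');
console.log(tagger.tag(tokens)); // tagged tokens; the exact shape depends on the version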

0.0.58 Classifiers: Longer training time.

I just updated my code to use the 0.0.58 classifiers. Using both Bayes and LogisticRegression, I called addDocument() a number of times, then called .train(). With the new version, this last train step takes a lot longer to complete than with the old version (I let it run for 20 minutes before killing it).

I've also tried calling train() after each addDocument() call. That approach takes 18 minutes for 251 items.

The older version could train about 5k items in under 10 minutes on the same hardware. Is there something else I should be doing to get the same performance as the older version?

Options for Levenshtein

There is no way to set any of insertion_cost/deletion_cost/substitution_cost to a zero value.

Could you fix it please? :)
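
A guess at the cause (the actual implementation may differ): if the defaults are applied with ||, a legitimate cost of 0 is treated as unset. A zero-safe default would fix it, e.g.:

var options = { insertion_cost: 0, deletion_cost: 1, substitution_cost: 1 };

// defaulting with || silently turns 0 into 1
var insertionCost = options.insertion_cost || 1;

// a zero-safe default keeps 0 as 0
insertionCost = (typeof options.insertion_cost === 'number') ? options.insertion_cost : 1;

console.log(insertionCost); // 0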

TF-IDF calculation

In the example, "node" in document 3 should be unadjusted log(5 / 2) or adjusted log(5 / 3). However, it currently comes out as log(6 / 3).

Just removing the +1 on line 71 works.
Or use an unadjusted version:

 67     var docsWithTerm = this.documents.reduce(function(count, document) {
 68         return count + (documentHasTerm(term, document) ? 1 : 0);
 69     }, 0);
 70 
 71     if(docsWithTerm) 
 72         return Math.log(this.documents.length / docsWithTerm);
 73     else
 74         return 0;
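
For concreteness: ln(5/2) ≈ 0.916 and ln(5/3) ≈ 0.511, whereas the current behaviour gives ln(6/3) = ln 2 ≈ 0.693.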

Thanks.

classify with bayes doesn't output score

I tried the Bayes example, and the classify method only returns the "best match" but not the whole score list. Is this an error, or do I have the wrong version? (0.1.20)
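
For reference, the full score list should be available via getClassifications(), which returns an array of {label, value} pairs; classify() only returns the label of the top entry:

var classifications = classifier.getClassifications('some text to classify');
classifications.forEach(function(c) {
    console.log(c.label, c.value);
});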

Retrieving keywords from domains (or text without delimiters)

I have a list of domain names and would like to extract the keywords that exist in these names. For example:

expertsexchange.com

  • experts exchange

penisland.com

  • pen island

choosespain.com

  • choose spain

kidsexpress.com

  • kids express

childrenswear.com

  • childrens wear

dicksonweb

  • dickson web

Much like:

  1. http://stackoverflow.com/questions/1315373/programmatically-extract-keywords-from-domain-names
  2. http://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words

Would natural be able to help me with this?
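
As far as I know natural has no word-segmentation tool, but a dictionary-based "word break" search is a common approach. A minimal sketch with a placeholder dictionary (in practice you would load a real word list, e.g. from WordNet or a corpus):

var dictionary = { experts: 1, exchange: 1, pen: 1, island: 1, choose: 1,
                   spain: 1, kids: 1, express: 1, childrens: 1, wear: 1 };

function segment(name) {
    var best = [[]]; // best[i] = tokens covering name.slice(0, i), or null
    for (var i = 1; i <= name.length; i++) {
        best[i] = null;
        for (var j = 0; j < i; j++) {
            var piece = name.slice(j, i);
            if (best[j] && dictionary[piece]) {
                // prefer segmentations made of fewer (i.e. longer) words
                if (!best[i] || best[j].length + 1 < best[i].length) {
                    best[i] = best[j].concat(piece);
                }
            }
        }
    }
    return best[name.length]; // null if the name cannot be fully segmented
}

console.log(segment('expertsexchange')); // [ 'experts', 'exchange' ]
console.log(segment('penisland'));       // [ 'pen', 'island' ]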

Broken requires in tfidf_spec.js

Ran into an issue where natural breaks my application's test suite because it behaves differently when NODE_ENV=test.

This is bad, breaking behavior. If natural needs to act differently when it's being tested, it should only do so when natural itself is the target of the tests. Otherwise it breaks applications above it. In this case I'm using kue, which relies on reds, which relies on natural.

Named entity recognition

Do you have any plans for named entity recognition? I have seen that it would require a sequential classifier, with the ability to train it on your own data set (a JSON document) of POS tags and other key attributes.

"Victorys" instead of "Victories"

I tested natural's NounInflector pluralization (pluralizeNoun()) with "Victory", "Party" and "Repository", and the output was always wrong.

Example

var natural = require('natural'),
    nounInflector = new natural.NounInflector();
nounInflector.attach();

console.log("Victory".pluralizeNoun());

Jaro-Winkler Distance Algorithm Match Value Question

Firstly, a massive thank-you for putting this library together. It's been really helpful implementing some text similarity behaviour in an application I'm currently writing.

Secondly, I just wanted to ask a question regarding the match value that is returned by the JaroWinklerDistance algorithm. On reading the Wikipedia article, I definitely got the impression that a comparison between two strings that are exactly the same should return a result of 1. I'm finding, however, that this isn't the case.

In the application I'm currently writing, I'm using the functionality to order place names in order of their match strength against an original request. In the case of sydney, this works as expected:

> natural.JaroWinklerDistance('sydney', 'sydney');
1

However, when I did a comparison for seddon against seddon, it returns less than 1:

> natural.JaroWinklerDistance('seddon', 'seddon');
0.8933333333333334

I went on to do some other smaller tests in the node console to see if I could have the function produce a higher score for two different strings than two that were exactly the same, as realistically, this is the case that I'm worried might occur. After a little bit of playing around I found that I could:

> natural.JaroWinklerDistance('abc', 'abc');
0.8666666666666666
> natural.JaroWinklerDistance('abcd', 'abcd');
1
> natural.JaroWinklerDistance('abcd', 'abc');
0.9416666666666667

Once I grok the algorithm, I'll have a look at forking the code and seeing if I can work out where this is happening, but I thought I'd raise an issue first to see if it was something that someone else knew how to fix quickly and simply.

Thanks again for your efforts on the library.

Cheers,
Damon.
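
For reference, the Jaro similarity of two strings s1 and s2 is (1/3) * (m/|s1| + m/|s2| + (m - t)/m), where m is the number of matching characters and t is half the number of transpositions. For identical strings m = |s1| = |s2| and t = 0, so the score should always be exactly 1; the results above suggest the match-counting step is where things go wrong.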

Irregular inflections don't inherit case

If I add an irregular inflection to the NounInflector, it does not carry over the case of the inflected word. For example:

var inflector = new NounInflector();
inflector.addIrregular('survey', 'surveys');

var plural = inflector.pluralize('Survey');
// `plural` == 'surveys'

plural = inflector.pluralize('SURVEY');
// `plural` == 'surveys'

Indeed, there's a case which specifies this as being expected behaviour in noun_inflector_spec.js:

expect(inflector.pluralize('ox')).toBe('oxen');
expect(inflector.pluralize('OX')).toBe('oxen');


Should it not attempt to preserve the capitalization and uppercasing of the input?

Better error messages when .train() hasn't been called on BayesClassifier

If one forgets to call .train() on their classifier, here is the error you get from .classify():

> b.classify('dogs');
TypeError: Cannot read property 'label' of undefined
at [object Object].classify (/home/tlack/apps/classifier/node_modules/natural/node_modules/apparatus/lib/apparatus/classifier/classifier.js:37:51)
at [object Object].classify     (/home/tlack/apps/classifier/node_modules/natural/lib/natural/classifiers/classifier.js:84:28)

This is a pretty common mistake; it might be nice to throw a descriptive exception in this case instead of just bombing out.

Glossary

Is a glossary-like feature implemented in natural? I couldn't find it in the readme.

Ref: Glossary

Cheers!

Any plan on support for pure javascript on client side?

I did not dig too deep to figure out whether the current version can be used on the client side as it is, but given the training data needed I guess not.

Is there any plan to make it usable in the browser rather than through Node?

Thanks for your efforts. This will be a useful resource in the future.

Error when using BayesClassifier.load

Hey,

When trying to do this:

natural.BayesClassifier.load('classifier.json', null, function(err, classifier) {
console.log(classifier.classify("this is a test"));
});

(copied very closely from the examples -- same result with or without the null argument)

I get this error:

/Users/username/Sites/natural/node_modules/apparatus/lib/apparatus/classifier/bayes_classifier.js:95
classifier.__proto__ = BayesClassifier.prototype;
^
TypeError: Cannot set property '__proto__' of undefined
at Function.restore (/Users/username/Sites/natural/node_modules/apparatus/lib/apparatus/classifier/bayes_classifier.js:95:27)
at restore (/Users/username/Sites/natural/lib/natural/classifiers/bayes_classifier.js:37:54)
at /Users/username/Sites/natural/lib/natural/classifiers/bayes_classifier.js:44:23
at /Users/username/Sites/natural/lib/natural/classifiers/classifier.js:104:13

I cloned the repo locally so I could try to debug it, but I haven't had any luck figuring out why it thinks classifier is undefined there. If I log what classifier is before the restore function is called (in the load function of classifier.js), it seems to have the right object. However, inside the restore function it does think classifier is undefined all the way through.

This did work before the switch to apparatus. Any help would be greatly appreciated!

Tokenizer documentation

I am looking for a tokenizer for my short-text categorization application. Natural.js seems to contain a lot of useful tokenizers; however, I find it hard to understand what they do.

For example: what is the "TreebankWordTokenizer"? What is the "Aggressive tokenizer"?

What tokenizer is the most commonly used in text-categorization applications?

Error in isVowel() in double_metaphone.js

Hi, I came across an error in isVowel(c) where c was undefined, so c.match threw an error:

function isVowel(c) {
    if(!c) return false; // i added this as a workaround
    return c.match(/[aeiouy]/i);
}

This happened when isVowel gets called from handleH on line 223. I think the problem is that token[pos+1] is undefined.

    function handleH() {
        // keep if starts a word or is surrounded by vowels
        if((pos == 0 || isVowel(token[pos - 1])) && isVowel(token[pos + 1])) {
            add('H');
            pos++;
        }
    }

Jaro-Winkler Infinite Loop

Running...

natural.JaroWinklerDistance('aaa', 'abcd');

...which gives matches arrays of...

matches1= [ true, true, ]
matches2= [ true, , , ]

...causes an infinite while loop in the "count transpositions" section.

I don't have a solid enough understanding of the algorithm to know if this is a simple boundary check error or a sign of a more fundamental problem.

Missing letters?

var natural = require('natural');
natural.PorterStemmer.attach();
console.log('President Constitution'.tokenizeAndStem());

returns ['presid', 'constitut']. Is this the expected result?

Global leak (variable `term`) in TfIdf.listTerms

Using version 0.1.19 of natural, mocha detects a global "leak" whenever TfIdf.listTerms is invoked.

The same issue seems to exist in the most recent version of lib/natural/tfidf/tfidf.js as well.

Specifically, line 101 of tfidf.js, reads:

    for(term in this.documents[d]) {

I believe this should be:

    for(var term in this.documents[d]) {

(adding var between for( and term.)

For your convenience, the full context for this line is below:

TfIdf.prototype.listTerms = function(d) {
    var terms = [];

    for(term in this.documents[d]) {
        terms.push({term: term, tfidf: this.tfidf(term, d)})
    }

    return terms.sort(function(x, y) { return y.tfidf - x.tfidf });
}

(These are lines 98-106 of tfidf.js, both in the most recent version of tfidf.js currently in the GitHub repo, as well as the 0.1.19 version distributed thru npm.)

(Thanks for an all-around awesome library, by the way.)

.load + .restore

Following the example in the README, it throws an error:

var natural = require('natural');
var classifier = new natural.BayesClassifier();
classifier.addDocument(['sell', 'gold'], 'sell');
classifier.addDocument(['buy', 'silver'], 'buy');

// serialize
var raw = JSON.stringify(classifier);

// deserialize

var restoredClassifier = natural.BayesClassifier.restore(JSON.parse(raw));
console.log(restoredClassifier.classify('sell'));

It throws the following error

    return this.getClassifications(observation)[0].label; 
                                                  ^
TypeError: Cannot read property 'label' of undefined

Bayesian values... too small

Me again (one more time), but this time I checked 20 times whether I was doing something wrong :)

Ok, I'm trying to auto-detect spam from my DB. I have a lot of spam (maybe 800 items) and about 300 good comments.

I wrote this script:

var natural = require('natural');
var classifier = new natural.BayesClassifier();

//my own data getter
var db  = require('./libs/db');

var tickets = [];
var trained = false;
var count = 0;
var min = 10; // minimum number of manual spam/not-spam insertions


// this function gets a comment from a ticket and goes back to "question" afterward.
// This is where you can see my troubles.
function ask(i, ticket, callback) {

    // remove some chars...
    var comment = ticket.comments[i].content;
    comment = (ticket.comments[i].authorlogin || '') + ' ' + (ticket.comments[i].authoremail || '') + ' ' + (ticket.comments[i].authorsite || '') + '\n' + comment;
    comment = comment.toLowerCase();
    comment = comment.replace(/[<>:]/g,' ')
    comment = comment.replace(/[àäâ]+/g,'a');
    comment = comment.replace(/[éêè]+/g,'e');
    comment = comment.replace(/[îï]+/g,'e');
    comment = comment.replace(/[ôö]+/g,'o');
    comment = comment.replace(/[ûüù]+/g,'u');
    comment = comment.replace(/[ŷÿ]+/g,'y');

    // false when bayesian calc is > 0.5
    var shouldtrain = true;
    var m = -1;
    count++;

    // if we have enough data, try to auto-classify
    if (trained) {
        var cl = classifier.getClassifications(comment);
        for (var c in cl) {
            m = Math.max(m, cl[c].value)
        }
        console.log(m)
        // auto classification is considered a success when m > 0.5
        if(m > 0.5) {
            // never reached... m is always very small, e.g. 4.0686547471403846e-39,
            // meaning 0.0000000000000000000000000000000000000004068....
            console.log("=== AUTO OK ===");
            var cla = classifier.classify(comment);
            console.log(comment + " :: " + cla + ' -> '+cl[0].value)
            shouldtrain = false;
            classifier.addDocument(comment, cla);
        }
    }

    //train "min" times, then call "train" and save
    if (count > min){ 
        classifier.train();
        trained = true;
        classifier.save('classifier.json', function (){
            console.log('saved')
        })
    }



    // auto classification was not a success, or we don't have enough data yet
    if (shouldtrain) {

        process.stdin.resume();
        process.stdin.write(comment+'\n');
        process.stdin.write("[s]pam, [n]ot-spam:");

        process.stdin.once('data', function(d){
            var d = d.toString().trim();

            switch(d) {
                case "s":
                case "spam":
                    console.log('Set it to spam')
                    ticket.comments[i].spam="spam";
                    classifier.addDocument(comment, "spam")
                    break;

                case "n":
                case "not-spam":
                default:
                    console.log('Set it to non-spam')
                    ticket.comments[i].spam="not-spam";
                    classifier.addDocument(comment, "not-spam")
            }
            callback(ticket,i)
        });
    } else {
        callback(ticket, i)
    }

}

db.query('metal3d','blog',{}, function (doc) { tickets.push(doc)},
function (){
    var i = 0;
    var ticket = null;
    var question = function (ticket, idx){
        // we've got ticket, we are back from "ask" function
        if (ticket != null) {
            // if next comment on ticket
            if (ticket.comments && ticket.comments.length > idx+1 ) {
                ask(idx+1, ticket, question)
            }
            // seek ticket with comment
            else {
                i++;
                while (!tickets[i].comments || tickets[i].comments.length < 1 ) {
                    i++;
                }
                ticket = tickets[i];
                ask(0, ticket, question);
            }
        }
        // no ticket, it's the first call to "ask" function
        else {
            i=0;
            while (!tickets[i].comments.length) {
                i++;
            }
            ticket = tickets[i];
            ask(0, ticket, question)
        }
    }
    // let's start
    question();
});

To be precise: I loop over my tickets, and for each ticket and each of its comments I do this:

  • if fewer than "min" items have been classified manually, ask whether the comment is spam or not
    • append it to the right classification
    • go to the next one
  • else
    • try to auto-classify
      • if the max result is > 0.5 (50%)
        • add the document to that classification
        • train()
      • else
        • ask which classification to use
        • append it to the given classification
        • train()

I watch "max" value (and I tried to see the entire getClassifications result) I always have a very little number (4.55566678789e-39).

I'm sure that a lot of spam has repeated values (tramadol, xanax, and multiple time the same sender email)

I'm sure my datas are good (they are displayed) and as you can see, I removed accentuated chars, "<" ans ">" etc... I really tried a lot of possibilities...

If you need my datas to check, I can give you a mongo export and my db.js file.

PS: my db module use last function as callback that is lauched AFTER getting the entire database. There is no asynch calculation is this script.

Best regards
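
A note on the magnitude (an educated guess, not a confirmed diagnosis): the values returned by getClassifications() are products of many per-token likelihoods rather than normalized probabilities, so for long comments they shrink toward zero and a fixed threshold such as 0.5 is never reached. Comparing the classes against each other is more robust, for example:

var cl = classifier.getClassifications(comment);
var best = cl.reduce(function(a, b) { return b.value > a.value ? b : a; });
var total = cl.reduce(function(sum, c) { return sum + c.value; }, 0);
var relative = total > 0 ? best.value / total : 0; // share of the winning class

if (relative > 0.9) {
    // confident enough to auto-classify
    classifier.addDocument(comment, best.label);
}

(If the values underflow all the way to 0, even this won't help and the comparison would have to be done in log space.)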

Wordnet via SQLite

Are you aware that WordNet is readily available in various database formats? (http://wnsql.sourceforge.net/) SQLite seems an ideal one for this application, and may make the job much easier. Would you be open to a pull request that replaces the current implementation with an SQLite-based one?

  • Russell

Naive Bayes getClassifications wrong sorting

I trained the classifier with over 3000 documents. When I called getClassifications I noticed the values were not sorted, so the first value was not the highest probability. I changed the sign to sort from lowest probability to highest and then returned the last value. That did the trick. Here is the code:

function getClassifications(observation) {
    var classifier = this;
    var labels = [];

    for(var className in this.classFeatures) {
        labels.push({label: className,
            value: classifier.probabilityOfClass(observation, className)});
    }

    return labels.sort(function(x, y) { return y.value - x.value });
}

and instead of calling classify I did the following:

result = classifier.getClassifications(test[i]);
response.push([test[i], result[result.length - 1].label])

offsets on wordnet

hey, the wordnet integration is awesome! that dataset is really hard to work with. awesome stuff dude!
anyway, I found a bug in lookupSynonyms()

fs.js:248
 binding.read(fd, buffer, offset, length, position, wrapper);
      ^
Error: Offset is out of bounds

here's the code:

var natural = require('natural');
var wordnet = new natural.WordNet('.');

wordnet.lookupSynonyms('hot', function(results) {
    results.forEach(function(result) {
        console.log(result.lemma + '  -  ' + result.pos);
    });
});

cheers man, i'm working on a fork of that jspos tagger that i can integrate if you'd like

n-grams should support start and end symbols

In text categorization, when using n-grams as features, it is common practice to add "start" and "end" symbols to the strings. In my experience, this has a significant effect on performance.

For example, when finding bigrams in the sentence "I went home", the result should be:

["[start] I", "I went", "went home", "home [end]"]

International Support

Any plans for international support? I am trying to use the tokenizer to parse Arabic words.
