nlp's Introduction

NLP

Practice and examples of using the NLTK library for NLP.

Corpus
A large body of natural language text used to gather statistics about language. The plural is corpora.
Lexicon
A collection of information about the words of a language, including the lexical categories to which they belong. A lexical entry also records the roles or senses the word can take.
Example: in English, "bull" names an animal; in finance, it also describes a rising market or an optimistic investor.
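As a rough illustration of the idea, a lexicon can be sketched as a mapping from a word to its entries. The field names `category` and `sense` and the helper `lookup` are made up for this example; they are not part of NLTK.

```python
# Toy lexicon: each word maps to a list of lexical entries.
# The "category" and "sense" keys are illustrative assumptions.
lexicon = {
    "bull": [
        {"category": "noun", "sense": "a male bovine animal"},
        {"category": "noun", "sense": "an investor who expects prices to rise"},
    ],
}


def lookup(word):
    """Return all known entries for a word, or an empty list."""
    return lexicon.get(word.lower(), [])


entries = lookup("BULL")
for entry in entries:
    print(entry["category"], "-", entry["sense"])
```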
Tokenization
Splitting a body of text into sentences (sentence tokenizer) and into words (word tokenizer).
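A minimal sketch of both kinds of tokenizer, using only the standard library so it runs without any downloads. NLTK's own `sent_tokenize` and `word_tokenize` handle many more cases (abbreviations, contractions) but need the tokenizer data installed. The helper names `split_sentences` and `split_words` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence tokenizer: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Naive word tokenizer: words (with an optional apostrophe part) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


text = "NLTK is a Python library. It splits text into tokens!"
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]

print(sentences)
print(words)
```

A rule this simple will wrongly split on abbreviations such as "Dr.", which is one reason a trained tokenizer is usually preferred.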
StopWords
Very common words (such as "the", "is", "and") that carry little meaning for the analysis at hand, so they are usually removed from the text before further processing.
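Stop word removal can be sketched as a simple filter. The tiny `STOP_WORDS` set below is a hand-written assumption for illustration. In practice NLTK ships a much longer list in `nltk.corpus.stopwords`, which needs its data package downloaded first.

```python
# Hand-picked stop words for illustration only; not NLTK's list.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "quick", "fox", "is", "in", "the", "garden"]
filtered = remove_stop_words(tokens)
print(filtered)
```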
Stemming
Normalizing words by stripping affixes so that related forms reduce to a common stem.
Example: riding → ride, normalization by removing the -ing affix.
NLTK provides several stemmers, including PorterStemmer, LancasterStemmer, and SnowballStemmer.
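A short sketch with NLTK's `PorterStemmer`, assuming the `nltk` package is installed (this stemmer is rule-based and should not need any extra data download). The word list is just an example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# "riding" reduces to a real word, while "ponies" reduces to a
# stem that is not a dictionary word.
words = ["riding", "ponies"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

The other stemmers have a similar `stem` method, so they can be swapped in to compare how aggressively each one trims words.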
Lemmatizing
Similar to stemming, but stemming can often produce non-existent words, whereas lemmas are actual words.
A stemmed word may not be something you can look up in a dictionary, but a lemma is.
POS - Part Of Speech tagging
Labeling the words in a sentence as nouns, adjectives, verbs, etc., along with their tense forms.
For the complete list of POS tags, refer to nlpfile.py above.
NER - Named Entity Recognition
Extracting named entities such as people, places, things, and monetary figures from text.
Chunking
Grouping the words of a text into phrases based on their parts of speech (generally nouns, but also verbs and others) to get an idea of what the sentence is about.
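A minimal chunking sketch with NLTK's `RegexpParser`, assuming `nltk` is installed. To avoid depending on a downloaded tagger model, the sentence is tagged by hand here; normally the tags would come from a POS tagger. The grammar defines one possible noun-phrase pattern, not the only one.

```python
from nltk.chunk import RegexpParser

# Hand-tagged input (an assumption for this example).
tagged = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"),
    ("the", "DT"), ("cat", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the text of every NP chunk found in the tree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```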
Chinking
Removing a chink from a chunk, i.e. grouping words while excluding certain parts of speech.
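Chinking can be sketched with the same `RegexpParser`, again assuming `nltk` is installed and using a hand-tagged sentence. The grammar first chunks everything, then uses the inverted braces `}...{` to cut verbs and prepositions back out. Which tags to chink is a choice made for this example.

```python
from nltk.chunk import RegexpParser

# Hand-tagged input (an assumption for this example).
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"),
    ("the", "DT"), ("cat", "NN"),
]

grammar = r"""
NP:
    {<.*>+}        # chunk every token
    }<VBD|IN>+{    # chink runs of past-tense verbs and prepositions
"""
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# What remains inside NP chunks after chinking.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(chunks)
```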
