Practice and examples of using the NLTK library for NLP.
- Corpus
A large body of natural language text, used for accumulating statistics about the language. The plural is corpora.
- Lexicon
A lexicon is a collection of information about the words of a language and the lexical categories to which they belong. A lexical entry also records the roles the word plays.
Example: "bull" means an animal in everyday English, but to an investor it signals rising prices or a positive outlook.
- Tokenization
Splitting a body of text into sentences (sentence tokenizer) or into words (word tokenizer).
- StopWords
Words that carry little meaning for the analysis, such as "the" or "is", and are therefore removed from the text.
- Stemming
Normalizing words by stripping their affixes.
Example: riding → ride, normalizing away the -ing affix.
Stemming algorithms available in NLTK include PorterStemmer, LancasterStemmer and SnowballStemmer.
- Lemmatizing
Similar to stemming, but stemming can often create non-existent words, whereas lemmas are actual words.
A stemmed word may not be something you can look up in a dictionary, but a lemma is.
- POS - Part Of Speech tagging
Labeling the words in a sentence as nouns, adjectives, verbs, etc., along with their tense forms.
For the complete list of POS tags, refer to nlpfile.py above.
- NER - Named Entity Recognition
Pulling out entities such as people, places, things, locations and monetary figures.
- Chunking
Grouping the words in a text into phrases based on nouns (generally), verbs, etc., to get an idea of what the sentence is about.
- Chinking
Chunking minus the chink, i.e. grouping everything except certain parts of speech.
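
A minimal sketch of several of the terms above (tokenization, stop word removal, stemming, chunking and chinking), assuming NLTK is installed. The sample sentence, the tiny `STOP_WORDS` set and the hand-assigned POS tags are illustrative assumptions, not taken from nlpfile.py. They are used because `word_tokenize`, `nltk.corpus.stopwords`, `pos_tag`, `WordNetLemmatizer` and `ne_chunk` all rely on resources fetched with `nltk.download()`, whereas everything below should run without downloaded data.

```python
from nltk.chunk import RegexpParser
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer
from nltk.tokenize import wordpunct_tokenize

text = "The investors were riding a bull market."  # illustrative sentence

# Tokenization: wordpunct_tokenize is regex based and needs no downloaded data.
tokens = wordpunct_tokenize(text)

# Stop word removal with a hand-written set (an assumption for this sketch;
# NLTK's full list lives in nltk.corpus.stopwords and must be downloaded).
STOP_WORDS = {"the", "a", "were"}
content_words = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]

# Stemming: run the three stemmers named above on the same words to compare them.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}
stems = {name: [s.stem(w) for w in content_words] for name, s in stemmers.items()}

# POS tags assigned by hand (assumption); nltk.pos_tag would normally supply them.
tagged = [
    ("The", "DT"), ("investors", "NNS"), ("were", "VBD"), ("riding", "VBG"),
    ("a", "DT"), ("bull", "NN"), ("market", "NN"),
]


def phrases(tree, label):
    """Return the words of every subtree carrying the given label."""
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == label
    ]


# Chunking: group an optional determiner, any adjectives, and one or more nouns.
chunker = RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
noun_phrases = phrases(chunker.parse(tagged), "NP")

# Chinking: chunk everything first, then cut the verbs back out of the chunk.
chinker = RegexpParser(r"""
  NP:
    {<.*>+}          # chunk every token
    }<VBD|VBG>+{     # chink sequences of VBD and VBG
""")
chinked_phrases = phrases(chinker.parse(tagged), "NP")

if __name__ == "__main__":
    print(tokens)
    print(content_words)
    for name, words in stems.items():
        print(name, words)
    print(noun_phrases)
    print(chinked_phrases)
```

For this sentence both grammars end up with the same two groups, "The investors" and "a bull market". The chunk rule names what to keep, while the chink rule names what to throw away.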