imsanjoykb / natural-language-processing

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

License: MIT License

natural-language-processing nlp-machine-learning sentiment-analysis word2vec machine-learning deep-learning nltk python word2vec-algorithm langdetect text-analysis text-classification nlp-tutorial nlp-course gensim pos-tagging bigrams-search-algorithm

natural-language-processing's Introduction

Author : Sanjoy Biswas

Data Scientist

Natural Language Processing For Beginners

Downloading and installing libraries
Word Bigrams
Word2Vec
Driving NWords
Finding Unusual Words
Finding Potential Words
Classifying News Documents
Classifying Parts of Speech
Classifying Bag of Words
Classifying Websites
Twitter Search
NLTK Library
Word Data Analysis
Creating a POS Tagger
Parts of Speech
N-Grams
Bigrams and Stemming
Sentiment Analysis with NLTK
Langdetect and Langid Libraries
Text Manipulation

Sanjoy Biswas


natural-language-processing's Issues

TF-IDF Scheme

Term frequency refers to the number of times a word appears in a document and can be calculated as:

Term frequency = (Number of occurrences of a word) / (Total words in the document)

For instance, if we look at sentence S1 from the Bag of Words example, i.e. "I love rain", every word in the sentence occurs once and therefore has a frequency of 1. In contrast, for S2, i.e. "rain rain go away", the frequency of "rain" is 2 while for the rest of the words it is 1.
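
As a quick, hand-rolled illustration (plain Python, assuming simple whitespace tokenization), the function below applies this formula to the two example sentences:

def term_frequency(document):
    words = document.lower().split()
    total = len(words)
    # term frequency = (number of occurrences of a word) / (total words in the document)
    return {word: words.count(word) / total for word in set(words)}

print(term_frequency("I love rain"))        # each word occurs once -> 1/3 each
print(term_frequency("rain rain go away"))  # 'rain' occurs twice -> 0.5, the rest -> 0.25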

Word2Vec for Survey Responses

Word2Vec can be used to get actionable metrics from thousands of customer reviews. Businesses often lack the time and tools to analyze survey responses and act on them, which leads to a loss of ROI and brand value.

Word embeddings prove invaluable in such cases. Vector representations of words trained on (or adapted to) survey datasets can help capture the complex relationships between the responses being reviewed and the specific context in which each response was made. Machine learning algorithms can leverage this information to identify actionable insights for your business or product.

Bag of Words

The bag of words approach is one of the simplest word embedding approaches: each document is represented by the counts of the words it contains.

We will see the word embeddings generated by the bag of words approach with the help of an example. Suppose you have a corpus with three sentences:

S1 = I love rain
S2 = rain rain go away
S3 = I am away
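
To make the idea concrete, here is a minimal hand-rolled sketch (assuming lowercased, whitespace tokenization) that builds the vocabulary from these three sentences and encodes each one as a vector of word counts:

corpus = ["I love rain", "rain rain go away", "I am away"]

# step 1: build the vocabulary - every unique word gets a fixed position
vocab = sorted({word for sentence in corpus for word in sentence.lower().split()})
print(vocab)  # ['am', 'away', 'go', 'i', 'love', 'rain']

# step 2: encode each sentence as a fixed-length vector of word counts
for sentence in corpus:
    words = sentence.lower().split()
    print(sentence, '->', [words.count(word) for word in vocab])
# "rain rain go away" -> [0, 1, 1, 0, 0, 2]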

Word Embedding Approaches

One of the reasons that Natural Language Processing is a difficult problem to solve is that, unlike human beings, computers can only understand numbers. We have to represent words in a numeric format that computers can work with. Word embedding refers to these numeric representations of words.

Several word embedding approaches currently exist, and all of them have their pros and cons. We will discuss three of them here:

Bag of Words
TF-IDF Scheme
Word2Vec

Word Counts with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

Bag-of-Words Model

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

Expert Systems Limitations

No technology can offer an easy and complete solution. Large systems are costly and require significant development time and computing resources. Expert systems have their limitations, which include −

Limitations of the technology
Difficult knowledge acquisition
ES are difficult to maintain
High development costs

The softmax Word2Vec method in TensorFlow

As with any machine learning problem, there are two components – the first is getting all the data into a usable format, and the next is actually performing the training, validation and testing. First I'll go through how the data can be gathered into a usable format, then we'll talk about the TensorFlow graph of the model. Note that the code that I will be going through can be found in its entirety at this site's Github repository. In this case, the code is mostly based on the TensorFlow Word2Vec tutorial here with some personal changes.

How does Natural Language Processing Work?

NLP entails applying algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand.
When the text has been provided, the computer will utilize algorithms to extract meaning associated with every sentence and collect the essential data from them.
Sometimes, the computer may fail to understand the meaning of a sentence well, leading to obscure results.

Technical Aspect of Word Embeddings

vec (“Berlin”) – vec (“Germany”) = x – vec (“France”)

That is, the offsets between the two pairs of word vectors should be (approximately) equal. Therefore,

x = vec (“Berlin”) – vec (“Germany”) + vec (“France”)
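
In gensim, this kind of analogy can be queried directly with most_similar. The snippet below is a minimal sketch; it assumes the pretrained GloVe vectors available through gensim's downloader (the 'glove-wiki-gigaword-100' package, which stores lowercase tokens):

import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-100')  # pretrained word vectors (assumed available)

# x = vec("berlin") - vec("germany") + vec("france"):
# positive terms are added, negative terms are subtracted
print(wv.most_similar(positive=['berlin', 'france'], negative=['germany'], topn=3))
# the top result is expected to be 'paris'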

Creating the TensorFlow model

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a context.
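
These hyperparameters feed into the graph itself. As a rough sketch (not the tutorial's exact code), the embedding matrix, the embedding lookup, and a sampled noise-contrastive (NCE) loss, a cheaper stand-in for the full softmax over the vocabulary, are typically wired up as follows in TensorFlow 1.x-style code; vocabulary_size is assumed to have been set while preparing the data:

import math
import tensorflow as tf  # assumes TensorFlow 1.x (or tf.compat.v1 behaviour)

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_context = tf.placeholder(tf.int32, shape=[batch_size, 1])

# embedding matrix: one embedding_size-dimensional vector per vocabulary word
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# output-layer weights/biases used by the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_context, inputs=embed,
                   num_sampled=64, num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)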

Lemmatization

Lemmatization is very similar to stemming, where we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word, not the root stem. The difference is that the root word is always a lexicographically correct word (present in the dictionary), whereas the root stem may not be. Thus the root word, also known as the lemma, will always be present in the dictionary. Both nltk and spacy have excellent lemmatizers. We will be using spacy here.

import spacy
nlp = spacy.load('en_core_web_sm')  # the spaCy English model is assumed to be installed

def lemmatize_text(text):
    text = nlp(text)
    # spaCy 2.x lemmatizes pronouns to '-PRON-'; keep the original token in that case
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")

Removing Special Characters

import re

def remove_special_characters(text, remove_digits=False):
    # keep letters, digits (optionally) and whitespace; everything else is removed
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

remove_special_characters("Well this was fun! What do you think? 123#@!", 
                          remove_digits=True)

Embedding

There are more ways to train word vectors in Gensim than just Word2Vec. See also Doc2Vec, FastText and wrappers for VarEmbed and WordRank.

The training algorithms were originally ported from the C package https://code.google.com/p/word2vec/ and extended with additional functionality and optimizations over the years.

Tokenizing Words and Sentences with NLTK

Tokenization is one of the first terms you will hear upon entering the Natural Language Processing (NLP) space, but there are many more that we will be covering in time. With that, let's show an example of how one might actually tokenize something into tokens with the NLTK module.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))

Removing HTML tags

from bs4 import BeautifulSoup

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

strip_html_tags('<html><h2>Some important text</h2></html>')

Preparing the text data

import os
import urllib.request

def maybe_download(filename, url, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

Expanding Contractions

import re

# CONTRACTION_MAP is assumed to be a dict mapping contractions to their expansions,
# e.g. {"can't": "cannot", "i'd": "i would", ...}
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):

    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

expand_contractions("Y'all can't expand contractions I'd think")

RETROFITTING

While general-purpose datasets often benefit from the use of these pre-trained word embeddings, the representations may not always transfer well to specialized domains. This is because the embeddings have been trained on massive text corpora built from Wikipedia and similar sources.

Word Frequencies with TfidfVectorizer code

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode document
vector = vectorizer.transform([text[0]])

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

Pros and Cons of TF-IDF

Though TF-IDF is an improvement over the simple bag of words approach and yields better results for common NLP tasks, the overall pros and cons remain the same. We still need to create a huge sparse matrix, which also takes a lot more computation than the simple bag of words approach.

Hashing with HashingVectorizer

from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = HashingVectorizer(n_features=20)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

Removing accented characters


import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

remove_accented_chars('Sómě Áccěntěd těxt')

8 Best NLP Libraries

  1. Natural Language Toolkit (NLTK)
  2. TextBlob
  3. CoreNLP
  4. Gensim
  5. spaCy
  6. polyglot
  7. scikit-learn
  8. Pattern

Inference Engine

Use of efficient procedures and rules by the Inference Engine is essential in deducing a correct, flawless solution.

In the case of a knowledge-based ES, the Inference Engine acquires and manipulates the knowledge from the knowledge base to arrive at a particular solution.

In the case of a rule-based ES, it −

Applies rules repeatedly to the facts, which are obtained from earlier rule applications.

Adds new knowledge into the knowledge base if required.

Resolves rule conflicts when multiple rules are applicable to a particular case.

To recommend a solution, the Inference Engine uses the following strategies (a minimal forward-chaining sketch follows the list below) −

Forward Chaining
Backward Chaining
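
As a rough, hypothetical illustration of forward chaining, the loop below applies rules to a working set of facts until no new facts can be derived; the rules and facts are invented for the example:

# Hypothetical rules: (set of required facts, fact that can be concluded)
rules = [
    ({"has_fever", "has_cough"}, "has_flu"),   # IF fever AND cough THEN flu
    ({"has_flu"}, "needs_rest"),               # IF flu THEN rest
]
facts = {"has_fever", "has_cough"}             # initial knowledge base

derived_new_fact = True
while derived_new_fact:                        # apply rules repeatedly to the facts
    derived_new_fact = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)              # add new knowledge to the knowledge base
            derived_new_fact = True

print(facts)  # {'has_fever', 'has_cough', 'has_flu', 'needs_rest'}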

Creating Word2Vec Model

from gensim.models import Word2Vec

# all_words is assumed to be a list of tokenized sentences (a list of lists of strings)
word2vec = Word2Vec(all_words, min_count=2)

# in gensim < 4.0 the vocabulary is exposed as word2vec.wv.vocab;
# in gensim >= 4.0 use word2vec.wv.key_to_index instead
vocabulary = word2vec.wv.vocab
print(vocabulary)

Capabilities of Expert Systems

Expert systems are capable of −

Advising
Instructing and assisting humans in decision making
Demonstrating
Deriving a solution
Diagnosing
Explaining
Interpreting input
Predicting results
Justifying the conclusion
Suggesting alternative options to a problem

Usage examples of Embedding

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
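
The saved model can be reloaded and queried. This follows the gensim documentation's usage example ('computer' is one of the tokens in common_texts):

model = Word2Vec.load("word2vec.model")
vector = model.wv['computer']                     # the 100-dimensional vector for 'computer'
sims = model.wv.most_similar('computer', topn=5)  # nearest words by cosine similarity
print(sims)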

Word Frequencies with TfidfVectorizer

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for "Term Frequency – Inverse Document Frequency", the two components of the resulting scores assigned to each word.

Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This downscales words that appear a lot across documents.
Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

Sentence Tokenization

from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)

output:
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]

The code below fills out the batch and context variables:

# Assumes: batch and context are pre-allocated numpy arrays, buffer is a deque of
# word indices covering the current window, span = 2 * skip_window + 1, random has
# been imported, and data / data_index track the position in the corpus.
for i in range(batch_size // num_skips):
    target = skip_window  # input word at the center of the buffer
    targets_to_avoid = [skip_window]
    for j in range(num_skips):
        while target in targets_to_avoid:
            target = random.randint(0, span - 1)
        targets_to_avoid.append(target)
        batch[i * num_skips + j] = buffer[skip_window]  # this is the input word
        context[i * num_skips + j, 0] = buffer[target]  # these are the context words
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)

Why Fuzzy Logic?

Fuzzy logic is useful for commercial and practical purposes.

It can control machines and consumer products.
It may not give accurate reasoning, but acceptable reasoning.
Fuzzy logic helps to deal with the uncertainty in engineering.
