
Analysis and Visualizations for COVID-19 Bing search engine queries + Classifier pipeline for predicting country based on search query.

Home Page: https://souravsdlboy.pythonanywhere.com/bing_search

Python 8.95% HTML 91.05%
text-classification text-processing natural-language-processing machine-learning bing-search covid-19 pandemic coronavirus deep-learning word-embeddings word2vec text-visualization text-classifier text-clustering text-classification-python world-health-organization


MicrosoftBing-search-query-prediction

Problem statement:

The world is going through the COVID-19 pandemic caused by the novel coronavirus. People across the world are mostly at home due to lockdowns, and they are searching for coronavirus-related information on the internet using popular search engines such as Google, Bing, Yahoo, DuckDuckGo, etc.

The task is to explore people's intent using only their search queries from the last four months (Feb to May 2020), and to build a predictive model that predicts the country from which a search query was issued. Search queries can help in understanding what is on people's minds during this pandemic. Exploring the queries at global as well as state-level granularity will enable state and central authorities to take appropriate action, ultimately benefiting citizens.

Most people and authorities (government or private) were unprepared and underestimated the risk when the outbreak emerged, because we had not seen this type of event before, and so people panicked. Many became stressed or depressed due to job and salary cuts, losing family members, etc.
Others were worried about masks, sanitizers and food items (which led people to empty the shelves of malls and shops). Rumours started circulating about various vaccines and cures, which needed to be checked, and many people did not speak up due to depression or panic.
So, analysing search data at this scale can be an effective tool for determining area-based, country-based, etc. measures that can be taken for the problems described above.

Motivation:

As a Computer Engineering student with a deep interest in machine learning and deep learning, including NLP tasks, I am passionate about using my knowledge to provide technological solutions that help the researchers and scientists working day and night for us. I have always believed AI can bring revolutionary ideas to the field of healthcare.

Dataset Collection:

After searching on various platforms, I found a dataset relevant to this problem; currently it is the only dataset available for it. The Microsoft Bing team has been very generous in providing the dataset on their GitHub.
Dataset Link: https://github.com/microsoft/BingCoronavirusQuerySet
Some info about dataset:

  • Data range: 2020-01-01 to 2020-04-30
  • All the private data has already been removed by the Bing Team.
  • Only searches that were issued many times by multiple users were included.
  • Dataset includes queries from all over the world that had an intent related to the Coronavirus or Covid-19.

Methods Explored :

  • Bag of Words model
  • N-grams model
  • tf-idf model
  • Continuous bag of words (CBOW) model
  • Skip-gram model
  • Word2Vec model
    The CBOW and skip-gram architectures above are used to construct the training samples, which form the basis for both the embedding matrix and the context matrix (neighbouring words in the window).
    Supervised learning is then performed on (input sample, target) pairs, and an error function is minimized to learn the embeddings for the input samples.

Here, the window size is an important hyperparameter that controls how many neighbouring words count as context words.
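To make the role of the window size concrete, here is a minimal sketch of how skip-gram and CBOW training pairs could be built from a tokenized query. The function names `skipgram_pairs` and `cbow_pairs` and the sample query are illustrative assumptions, not code from this project.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs: each center word predicts one neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """Return (context_words, center) pairs: the neighbours predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


# Hypothetical example query, split on whitespace.
query = "coronavirus symptoms in children".split()
sg = skipgram_pairs(query, window=1)
cb = cbow_pairs(query, window=1)
print(sg)
print(cb)
```

A larger window produces more pairs per word and captures broader topical context, while a smaller window keeps only the immediate neighbours.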

The embedding matrix and the context matrix are used in the neural network architecture to learn the word vectors (embeddings).

The input words are one-hot encoded binary vectors of vocabulary size, and so are the output vectors.
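As a rough illustration of one-hot encoding, the sketch below builds a vocabulary from a few tokens and encodes one word as a binary vector of vocabulary size. The helper names `build_vocab` and `one_hot` and the sample tokens are assumptions made for this example.

```python
def build_vocab(tokens):
    """Map each distinct word to an index (sorted so the result is reproducible)."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}


def one_hot(word, vocab):
    """Return a binary vector of vocabulary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


# Hypothetical tokens; "covid" appears twice but gets a single vocabulary slot.
tokens = "covid symptoms covid vaccine".split()
vocab = build_vocab(tokens)
encoded = one_hot("symptoms", vocab)
print(vocab)
print(encoded)
```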

Backpropagation is then used during neural network training to update the parameters from which the embeddings are learned.


The embedding matrix holds the final learned embeddings after cycling through the training process a number of times.
NOTE : The training samples constructed above show only positive samples, i.e. pairs with correct context words, so the target is always "1". In practice, negative sampling is also applied: words are sampled at random from the vocabulary and added with target "0" as wrong context words. Otherwise the model might simply predict "1" for any context word; with negative sampling, it actually has to learn the semantic relationships embedded in the data.
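The negative sampling idea in the note can be sketched as follows. This is a simplified illustration, assuming uniform random sampling from the vocabulary and skipping words already seen as true context; the function name `with_negative_samples`, the parameter `k` and the sample data are hypothetical.

```python
import random


def with_negative_samples(pairs, vocab, k, seed=0):
    """Label each (center, context) pair 1 and add up to k random pairs labelled 0."""
    rng = random.Random(seed)
    # Collect the words actually observed as context for each center word.
    true_context = {}
    for center, context in pairs:
        true_context.setdefault(center, set()).add(context)

    samples = []
    for center, context in pairs:
        samples.append((center, context, 1))  # positive sample
        candidates = [
            w for w in vocab
            if w != center and w not in true_context[center]
        ]
        for word in rng.sample(candidates, min(k, len(candidates))):
            samples.append((center, word, 0))  # negative sample
    return samples


# Hypothetical positive pairs and vocabulary.
pairs = [("covid", "symptoms"), ("symptoms", "covid")]
vocab = ["covid", "mask", "symptoms", "vaccine"]
samples = with_negative_samples(pairs, vocab, k=2)
for sample in samples:
    print(sample)
```

With both labels present, a model trained on these triples can no longer score well by predicting "1" for every pair.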

[Figure: using skip-gram]
[Figure: using CBOW]

  • Clustering techniques
  • Classification techniques
  • WordCloud and more...

Bag of Words model :

Characterizes the frequency of terms within a document and across documents, without regard to word order.
(This is also the input to the models used in the classification pipeline.)

[Figure: CountVectorizer]

[Figure: TfidfVectorizer]
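The counting and weighting ideas behind these vectorizers can be sketched with the standard library alone. This is not scikit-learn's implementation: its vectorizers apply additional smoothing and normalization, so the numbers will differ. The function names and sample queries below are assumptions for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs, n=1):
    """Count the n-grams in each document (n=1 gives plain word counts)."""
    counts = []
    for doc in docs:
        tokens = doc.lower().split()
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        counts.append(Counter(grams))
    return counts


def tf_idf(counts):
    """Weight each term by term frequency times log(N / document frequency)."""
    n_docs = len(counts)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc_counts in counts for term in doc_counts)
    weights = []
    for doc_counts in counts:
        total = sum(doc_counts.values())
        weights.append({
            term: (freq / total) * math.log(n_docs / df[term])
            for term, freq in doc_counts.items()
        })
    return weights


# Hypothetical search queries treated as documents.
queries = ["covid symptoms", "covid vaccine news", "covid symptoms in children"]
unigrams = bag_of_words(queries)
bigrams = bag_of_words(queries, n=2)
weights = tf_idf(unigrams)
print(unigrams[0])
print(bigrams[2])
print(weights[1])
```

A term such as "covid" that appears in every query gets a weight of zero, which is the point of tf-idf: terms common to all documents carry no discriminating information.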

Preprocessing steps :

  • Tokenization

  • Cleaning : Removing punctuation, stopwords, HTML tags, XML tags, etc.

  • Normalization : Converting dates, other symbols, acronyms, and abbreviations to text.

  • Stemming

  • Lemmatization

  • See reference links at the end for more details.
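Some of the preprocessing steps listed above could be chained roughly as in the sketch below. The stopword list and the suffix-stripping stemmer are deliberately tiny stand-ins chosen for this example (a real pipeline would more likely use a library stemmer), and lemmatization is left out because it needs a dictionary. All names here are hypothetical.

```python
import html
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "in", "of", "for", "is", "to", "and"}
# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def clean(text):
    """Decode HTML entities, drop HTML/XML tags and punctuation, lowercase."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return text.lower()


def tokenize(text):
    """Split cleaned text on whitespace."""
    return text.split()


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Clean, tokenize, remove stopwords, then stem."""
    tokens = tokenize(clean(text))
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


result = preprocess("<b>Symptoms</b> of COVID-19 in children?")
print(result)
```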

Modelling :

Models :

[Figure: models]

Deployment and Hosting:

Live WebApp : Server Link

Exploring Data with Sample visualizations :

[Figures: sample visualizations of the data]

NOTE : For more visualizations, check out the live webapp.

Few links for references :

⭐️ this Project if you liked it !

NOTE : A few images are taken from Google Images; copyrights are reserved by their respective owners.
