TASK 2: Getting the Top Keywords.

Approach:

First, the text is normalized through the following steps:

Converting to lowercase.
Converting Unicode characters to ASCII.
Removing the stopwords from the text.
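
The three normalization steps above can be sketched as follows. This is a minimal illustration, not the code from the notebook: the `normalize` function, the regex tokenizer, and the tiny `STOPWORDS` set are all assumptions made for the example, and a real run would use a fuller stopword list.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list used only for this example.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "for", "is", "on"}


def normalize(text):
    """Lowercase, fold Unicode to ASCII, tokenize, and drop stopwords."""
    text = text.lower()
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Assumed tokenizer: runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]


print(normalize("Café startup raises a Million for the AI"))
# ['cafe', 'startup', 'raises', 'million', 'ai']
```

Note that characters with no ASCII decomposition are simply dropped by this approach, which may or may not be acceptable depending on the corpus.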

The next step is to find the hapaxes (words that occur only once in the corpus) and remove them from the text. Removing hapaxes reduces the time needed to compute the ranking, because scores no longer have to be calculated for extremely rare words.
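
A minimal sketch of hapax removal, assuming the corpus is already tokenized into a list of token lists. The `remove_hapaxes` name and the toy documents are made up for illustration.

```python
from collections import Counter


def remove_hapaxes(docs):
    """Drop tokens that occur exactly once across the whole corpus."""
    counts = Counter(token for doc in docs for token in doc)
    hapaxes = {token for token, n in counts.items() if n == 1}
    filtered = [[t for t in doc if t not in hapaxes] for doc in docs]
    return filtered, hapaxes


# Toy corpus, invented for the example.
docs = [["google", "launches", "vr", "game"], ["microsoft", "launches", "game"]]
filtered, hapaxes = remove_hapaxes(docs)
print(filtered)         # [['launches', 'game'], ['launches', 'game']]
print(sorted(hapaxes))  # ['google', 'microsoft', 'vr']
```

On a corpus this small the filter is far too aggressive; it only pays off on larger corpora where hapaxes make up a long tail of the vocabulary.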

NOTE: Additionally, stemming or lemmatization of words could be applied, but for this task I decided to skip this step.

Then different algorithms or metrics are applied to the corpus to rank the words.

Definition of the Top.

The definition of the Top depends heavily on the task, but some initial assumptions can be made.

Statistical metrics: even the simple frequency of words in the corpus says a lot about them. However, the top words are neither the most frequent nor the rarest ones, but those somewhere in the middle. An example is the TF-IDF method, which gives higher weights to words that occur in few documents, because such words can be crucial for classification.
RAKE (Rapid Automatic Keyword Extraction): also a statistical method based on word frequency. However, it additionally takes the context of words into account, so words that co-occur more often with other words are considered more important.
Co-occurrence matrix between words and classes: this can be informative for finding words that help differentiate between classes.
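
To make the TF-IDF idea concrete, here is a small pure-Python sketch. It is not the notebook's implementation: the `tfidf_top` function is hypothetical, the unsmoothed `log(N / df)` weighting is only one of several common variants, and summing per-document scores to get one corpus-level ranking is an assumption made for this example.

```python
import math
from collections import Counter


def tfidf_top(docs, k=10):
    """Rank words by their TF-IDF score summed over all documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            tf = count / len(doc)              # relative term frequency
            idf = math.log(n_docs / df[word])  # 0 for words found in every document
            scores[word] += tf * idf
    # Sort by score, breaking ties alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy corpus, invented for the example.
docs = [["ai", "new", "game"], ["ai", "google", "game"], ["ai", "vr", "vr"]]
print(tfidf_top(docs, k=3))  # ['vr', 'google', 'new']
```

In this toy corpus "ai" appears in every document, so its IDF is zero and it drops to the bottom of the ranking even though it is the most frequent word.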

Top 10 words by TF-IDF.

million
ai
new
game
google
raises
vr
launches
microsoft
games.

The plot is in the Jupyter notebook file.

What else could be tried:

Personally, I would try more ranking and scoring algorithms, such as:

TextRank.
Conditional Random Fields (CRF).
MAUI (Multipurpose automatic topic indexing).
Clustering the word vector representations from models like GloVe or BERT, and taking the top x words with the smallest distance to the cluster centroids.
KEA (Keyphrase Extraction Algorithm), developed by Frank et al.
KPSpotter.
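
As an illustration of the first suggestion, TextRank can be sketched in a few lines: build an undirected co-occurrence graph over the tokens and run a PageRank-style iteration on it. This is a simplified, unweighted variant written for this README, not something that was run on the task data; the `textrank` function, its default parameters, and the toy sentence are assumptions.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Undirected graph: words at most `window` positions apart are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=lambda w: (-scores[w], w))


# Toy token stream, invented for the example.
tokens = "google launches game microsoft launches vr sony launches console".split()
print(textrank(tokens, window=1)[:3])
```

Here "launches" is linked to six different words, so it comes out on top; in practice the token stream would first be normalized and filtered, for example to nouns and adjectives, as in the original TextRank setup.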
