Comments (2)
Investigation plan and details:
- Topic Clustering (K-means….?)
- Given a set of article titles, group them into clusters, ideally clusters of topics/titles related to each other
- Will have to focus on algorithms that can cluster without requiring the number of clusters up front
- Feasibility assessment: do we even need something like this, or can we just use the predefined categories that most news websites should already have?
Summarizer
- (GPT2/From-Scratch Network) GPT2 go brrrr
- How to use/deploy/personalize GPT2 for our project
from geopoliticsdashboard.
Clustering
- Clustering algorithms that don't need a pre-defined number of clusters do exist and have many variants: https://stats.stackexchange.com/questions/241381/clustering-methods-that-do-not-require-pre-specifying-the-number-of-clusters
- Interesting post about clustering news articles we can look at but may be unnecessarily complex for our purposes: https://towardsdatascience.com/all-the-news-17fa34b52b9d
- A feasible approach could be to use text classification and classify news articles into various classes/categories. However, this would likely require us to predefine the set of categories we want to classify into, or maybe we generate the categories from clustering first? Interesting/useful articles about this:
- VERY INTERESTING BLOG POST doing something sort of similar to us: https://towardsdatascience.com/clustering-news-articles-based-on-named-entities-306a23d368e1
- Overall, there seems to be a lot of active research on classifying news articles based on their content text that we can look into. Classifying news based on titles alone doesn't seem sound, as there is too little data per title to be reliable
- It doesn't seem like many news websites actually have useful/handy/consistent classifications on their articles (some only have categories for relevant/trending topics), so this may not be as trivial as we thought, and ML could prove useful
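To make the "no predefined number of clusters" idea concrete, here is a minimal sketch using only the standard library. It is not the method from any of the linked posts: it uses plain word counts with cosine similarity and single-linkage grouping, and the `threshold` value, function names, and sample texts are all assumptions made up for illustration. The number of clusters falls out of the threshold instead of being chosen in advance.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_threshold(texts, threshold=0.3):
    """Single-linkage clustering: link every pair whose similarity reaches
    the threshold, then return connected components as lists of indices."""
    vectors = [bag_of_words(t) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up sample texts, for illustration only.
articles = [
    "Central bank raises interest rates to fight inflation",
    "Inflation worries push the central bank to raise interest rates again",
    "Local football team wins the championship final",
    "Fans celebrate after the football team wins the final",
]
print(cluster_by_threshold(articles))  # → [[0, 1], [2, 3]]
```

The trade-off is that the threshold replaces the cluster count as the thing to tune, and word counts on short texts are noisy, which matches the concern above about titles carrying too little data.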
Summarizer
- Reference Links:
  - https://blog.paperspace.com/generating-text-summaries-gpt-2/
  - https://github.com/SKRohit/Generating_Text_Summary_With_GPT2
  - https://medium.com/analytics-vidhya/text-summarization-using-bert-gpt2-xlnet-5ee80608e961
  - Very useful source for model APIs (GPT, GPT-2, BERT, XLNet) and other useful information: https://huggingface.co/transformers/quicktour.html
- Text summarization is a commonly tackled problem in the NLP space; state-of-the-art models include GPT-2, BERT, and XLNet
- Different approaches to text summarization:
  - Abstractive Summarization: aims to rephrase the information. This approach is most often used, as it tends to produce more readable output text
  - Extractive Summarization: aims to present the most important parts of the given content verbatim. Its main limitation is that the selected sentences often do not fit together in a natural, readable way, which can defeat the purpose of presenting the important information in the first place
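As a rough illustration of the extractive approach, the sketch below scores each sentence by the average frequency of its non-stop-words and keeps the top few in their original order. This is a classic frequency heuristic, not one of the deep-learning models above; the stop-word list, function names, and sample article are assumptions made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
              "on", "for", "it", "that", "was", "with"}


def content_words(text):
    """Lowercased words with stop-words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, max_sentences=2):
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best ones,
    # then restore document order so the output reads more naturally.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


# Made-up sample text, for illustration only.
article = (
    "The summit on trade ended on Friday. "
    "Leaders signed a trade agreement after the trade summit. "
    "The weather was sunny. "
    "Critics say the trade agreement ignores workers."
)
print(extractive_summary(article, max_sentences=2))
```

Restoring document order helps a little with the readability problem, but the selected sentences are still copied verbatim, so dangling references between them remain possible.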
- The publicly available CNN/Daily Mail dataset (https://github.com/abisee/cnn-dailymail, https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail) can give us baseline training data for tuning our GPT2/BERT model on news articles; we can keep retraining on our own sources if needed
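To get (article, summary) pairs out of the dataset, each raw story has to be split into its body and its reference highlights. The sketch below assumes the raw `.story` layout that the abisee preprocessing script reads: article text first, then `@highlight` marker lines, each followed by one summary sentence. The function name and the sample story are made up for illustration, and the real files should be checked before relying on this.

```python
def parse_story(raw):
    """Split one raw .story document into (article_text, highlights)."""
    article_lines, highlights = [], []
    in_highlights = False
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith("@highlight"):
            in_highlights = True  # everything from here on is summary text
        elif in_highlights:
            highlights.append(line)
        else:
            article_lines.append(line)
    return " ".join(article_lines), highlights


# Made-up story in the assumed layout, for illustration only.
sample_story = """Officials met on Monday to discuss the border dispute.

Talks are expected to continue next week.

@highlight

Officials met to discuss the border dispute

@highlight

Talks continue next week
"""

body, highlights = parse_story(sample_story)
print(body)
print(highlights)
```

Joining the highlights gives the target summary for fine-tuning, with `body` as the input text.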
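- GPT models have a restriction on context size, measured in tokens after putting an article through a GPT tokenizer. GPT-2's limit is 1024 tokens. Tokens are subword pieces (byte-pair encoding), not whole words, so an article usually has more tokens than words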
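One way to live with the context limit is to split long articles into chunks that each fit the budget. The sketch below greedily packs whole sentences into chunks. `chunk_by_token_budget` and `rough_token_count` are hypothetical names, and the word-count stand-in is only there so the example runs without a model download; a real GPT-2 tokenizer would report higher counts.

```python
import re


def chunk_by_token_budget(text, count_tokens, max_tokens=1024):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    count_tokens is any callable that returns the token count of a string.
    A single sentence longer than the budget becomes its own oversized chunk
    and would still need truncating separately.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and used + n > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def rough_token_count(s):
    # Hypothetical stand-in: counts whitespace-separated words, not real tokens.
    return len(s.split())


text = "One two three. Four five six seven. Eight nine. Ten."
print(chunk_by_token_budget(text, rough_token_count, max_tokens=5))
# → ['One two three.', 'Four five six seven.', 'Eight nine. Ten.']
```

With a Hugging Face tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))`. Per-sentence counts may not add up exactly to the count of the joined chunk, so it is safer to leave some headroom below 1024.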
- Good summary of many other text summarization approaches we can pick from if needed: https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/ (not all of them use deep learning; some are classical, non-neural algorithms)
- Overall, text summarization is a very well documented and researched problem with many sources online for our use. We just need to decide on what sort of approach/model/algorithm we want to use, how to tune it to our liking, and how to deploy it efficiently.