A News Clustering Crawler

Contributors:

Xiaoxiao Liu ([email protected])
Da Chen ([email protected])

Domain:

Our project builds an intelligent web news crawler whose goal is to find the latest, hottest technology news. The crawler visits major news authorities (such as the New York Times, CNN, and the BBC) and groups related stories together based on their similarity.

Run our code:

We wrote a bash script that runs all the scripts in our project in order. Just run it with $ bash run.sh. Easy and efficient!

Motivation:

Nowadays, with the explosion of information, reading all the news from different news websites is extremely time-consuming, especially since many of them report the same story under different titles with almost identical content. To save people’s time and energy, we collect news from different websites and compare their similarity, so as to generate a report of a given date’s news with less redundancy, and we provide links in case people want to read further.

Sample Result:

Highlights:

Filtering: In our crawler, we apply the following rules to make sure the news fetched is the most accurate, latest, hottest, and most relevant technology news from the major news authorities on the internet.

only news posted within the last 7 days
only within a certain crawl depth
only within a certain news authority's domain
only related to technology topics
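
The four rules above can be sketched as a single predicate. This is only an illustration: the depth limit, domain list, keyword set, and all function names are assumptions made for the example, not the project's actual settings or code.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Illustrative values only; the project's real limits and lists are not given.
MAX_AGE = timedelta(days=7)
MAX_DEPTH = 3
ALLOWED_DOMAINS = {"nytimes.com", "cnn.com", "bbc.com"}
TECH_KEYWORDS = {"technology", "tech", "software", "gadget"}


def in_allowed_domain(url):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def is_tech_related(text):
    """Crude topic check: does the text share a word with the keyword set?"""
    return bool(set(text.lower().split()) & TECH_KEYWORDS)


def should_keep(url, published, depth, text, now=None):
    """Apply all four filtering rules to one candidate article."""
    now = now or datetime.now()
    age = now - published
    return (
        timedelta(0) <= age <= MAX_AGE   # posted within the last 7 days
        and depth <= MAX_DEPTH           # within the crawl depth limit
        and in_allowed_domain(url)       # inside a news authority's domain
        and is_tech_related(text)        # related to technology topics
    )
```

A real topic filter would likely rely on section URLs or page metadata rather than a keyword match; the keyword check is used here only to keep the sketch short.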

Cached data: We use a database (sqlite3) to store crawled URLs and data, so that we avoid requesting the same web page again and again (a matter of politeness). Each time a new URL is about to be crawled, it is first looked up in the cache database to determine whether it is new or already seen.
Clustering by similarity: We calculate the similarity of each pair of doc vectors and rank in two passes. First, we rank all doc vectors by the sum of each vector's similarities (ranking the rows); then, for the current vector, we rank all other vectors by their similarity to it. In the process, zero similarities are excluded, and each vector can belong to only one final cluster, so that the result is free of duplicates (which, in effect, “saves our readers time”).
Data visualization: We render our result as a well-formatted PDF file with clickable links, images, and text.
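
As a rough sketch of the cache lookup described under "Cached data", the snippet below checks a sqlite3 table before downloading a page. The table layout and the names open_cache, fetch_cached, and download are hypothetical and chosen for the example; the project's actual schema is not shown here.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def fetch_cached(conn, url, download):
    """Return the page body, calling download(url) only on a cache miss."""
    row = conn.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return row[0]  # already crawled: no new request is sent
    body = download(url)
    conn.execute("INSERT INTO pages (url, body) VALUES (?, ?)", (url, body))
    conn.commit()
    return body
```

Passing the downloader in as a function keeps the cache logic independent of whichever HTTP client the crawler uses.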

Complexity:

Our crawler can automatically extract and normalize certain data from the complex HTML structures of different websites.
As the second major part of our project, we use our own tools (such as a tokenizer) to first generate the materials required for computing similarity, which include raw, tokenized doc, hist, and title files. With this source material, we then calculate the similarity, cluster similar documents together, and perform ranking.
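
The similarity, clustering, and ranking steps can be sketched as follows. The sketch makes two assumptions that the text does not confirm: documents are tokenized by whitespace (the project uses its own tokenizer), and similarity is the cosine of term-count histograms. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-count histograms (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs):
    """Group documents into disjoint clusters using two ranking passes."""
    hists = [Counter(d.lower().split()) for d in docs]
    n = len(hists)
    # Pairwise similarity matrix; the diagonal is zeroed so a document
    # does not count its similarity to itself.
    sim = [
        [cosine(hists[i], hists[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Pass 1: rank documents by the sum of their similarities (row sums).
    order = sorted(range(n), key=lambda i: sum(sim[i]), reverse=True)
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        assigned.add(i)
        # Pass 2: rank the remaining documents by similarity to document i,
        # excluding zero similarities and documents already clustered.
        members = sorted(
            (j for j in range(n) if j not in assigned and sim[i][j] > 0),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        assigned.update(members)
        clusters.append([i] + members)
    return clusters
```

Each cluster is a list of document indices, led by the document that seeded it and followed by its neighbours in decreasing order of similarity. Because every index is marked as assigned once it is used, no document appears in more than one cluster.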
