HtmlContentScrapper extracts article text from news websites.
Most of the information on web pages is noise: menus, ads, and so on. An HTML document does not distinguish the article text from the markup that presents it, so the core content has to be extracted with software like this one. This project is a test task and my first Python project.
```
usage: launcher.py [-h] url

Extract article from website.

positional arguments:
  url         a website URL

optional arguments:
  -h, --help  show this help message and exit
```
Launcher creates a directory based on the URL and saves a txt file with the extracted text there:

```
chizganov.com/article/index.html -> [CUR_DIR]/chizganov_com/article/index.txt
```
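The URL-to-path mapping above could be implemented along these lines (a minimal sketch; the function name `url_to_output_path` is hypothetical and the actual code in launcher.py may differ):

```python
import os
from urllib.parse import urlparse


def url_to_output_path(url: str, cur_dir: str = ".") -> str:
    """Map an article URL to a txt output path, e.g.
    chizganov.com/article/index.html -> ./chizganov_com/article/index.txt
    """
    # Make sure urlparse sees a netloc even when the scheme is omitted
    if "//" not in url:
        url = "//" + url
    parsed = urlparse(url)
    host = parsed.netloc.replace(".", "_")   # chizganov.com -> chizganov_com
    path = parsed.path.lstrip("/")           # article/index.html
    base, _ = os.path.splitext(path)         # article/index
    return os.path.join(cur_dir, host, base + ".txt")
```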
HtmlContentScraper is based on the ECON algorithm, but extends it with some changes during backtracking:
- Clears tags whose classes differ from those of the snippet-node branch, since most websites use the same classes for article paragraphs
- Allows backtracking to continue when the text length changes only slightly between levels, since some websites contain small noise that would otherwise interrupt backtracking unnecessarily

I made these changes because the ECON algorithm shows good results on Chinese websites but has some problems on Russian ones.
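The second modification might look roughly like this (a hypothetical sketch, not the project's actual code; the function name and the `REL_TOLERANCE` threshold are assumptions):

```python
# Assumed threshold: treat text-length changes under 5% as noise
REL_TOLERANCE = 0.05


def lengths_close(parent_len: int, child_len: int,
                  tol: float = REL_TOLERANCE) -> bool:
    """Return True if the text-length change between two DOM levels is
    small enough to ignore, so backtracking can continue instead of
    stopping at the first fluctuation."""
    if parent_len == 0:
        return child_len == 0
    return abs(parent_len - child_len) / parent_len <= tol
```

With such a check, small noise nodes (share buttons, photo captions) no longer terminate the backtracking step prematurely.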
- Finish the current algorithm; fix handling of different paragraph styles
- Add another algorithm, e.g. CoreEx or a visual one
- Add a template scraper
- Add a template HTML formatter
- Encapsulate the parsing mechanism
| Test urls | Results |
|---|---|
| Lenta | link |
| Life | link |
| Meduza | link |
| НГС | link |
| TVZvezda | link |
| Express | link |
| ГазетаРу | link |
| Новая газета | link |
| NYTimes | link |
| Свобода | link |
| Washington Post | link |
| Wylsa | link |
All results are in the res folder.