HtmlContentScrapper extracts article text from news websites.
Most of the information on web pages is noise: menus, ads, and so on. An HTML document does not distinguish the article text from the markup that presents it, so the core content has to be extracted with software like this one. This project is a test task and my first Python project.
```
usage: launcher.py [-h] url

Extract article from website.

positional arguments:
  url         a website URL

optional arguments:
  -h, --help  show this help message and exit
```
Launcher creates a directory based on the URL and saves a txt file with the extracted text there:

```
chizganov.com/article/index.html -> [CUR_DIR]/chizganov_com/article/index.txt
```
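The URL-to-path mapping above could be implemented along these lines (a minimal sketch; the function name `url_to_output_path` is hypothetical and the actual code in launcher.py may differ):

```python
import os
from urllib.parse import urlparse


def url_to_output_path(url: str, cur_dir: str = ".") -> str:
    """Map an article URL to a txt output path, e.g.
    chizganov.com/article/index.html -> ./chizganov_com/article/index.txt
    """
    # Make sure urlparse sees a netloc even when the scheme is omitted
    if "//" not in url:
        url = "//" + url
    parsed = urlparse(url)
    host = parsed.netloc.replace(".", "_")   # chizganov.com -> chizganov_com
    path = parsed.path.lstrip("/")           # article/index.html
    base, _ = os.path.splitext(path)         # article/index
    return os.path.join(cur_dir, host, base + ".txt")
```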
HtmlContentScraper is based on the ECON algorithm, but extends it with some changes during backtracking:
- Clears tags whose classes differ from those of the snippet-node branch, since most websites use the same classes for article paragraphs
- Allows backtracking to continue when the text length changes only slightly between levels, since some websites contain small noise that would otherwise interrupt backtracking unnecessarily

I made these changes because the ECON algorithm shows good results on Chinese websites but has some problems on Russian ones.
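The second modification might look roughly like this (a hypothetical sketch, not the project's actual code; the function name and the `REL_TOLERANCE` threshold are assumptions):

```python
# Assumed threshold: treat text-length changes under 5% as noise
REL_TOLERANCE = 0.05


def lengths_close(parent_len: int, child_len: int,
                  tol: float = REL_TOLERANCE) -> bool:
    """Return True if the text-length change between two DOM levels is
    small enough to ignore, so backtracking can continue instead of
    stopping at the first fluctuation."""
    if parent_len == 0:
        return child_len == 0
    return abs(parent_len - child_len) / parent_len <= tol
```

With such a check, small noise nodes (share buttons, photo captions) no longer terminate the backtracking step prematurely.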
- Finish the current algorithm; fix handling of different paragraph styles
- Add another algorithm, e.g. CoreEx or a visual one
- Add a template scraper
- Add a template HTML formatter
- Encapsulate the parsing mechanism
| Test urls | Results |
|---|---|
| Lenta | link |
| Life | link |
| Meduza | link |
| НГС | link |
| TVZvezda | link |
| Express | link |
| ГазетаРу | link |
| Новая газета | link |
| NYTimes | link |
| Свобода | link |
| Washington Post | link |
| Wylsa | link |
All results are in the res folder.