HtmlContentScraper

HtmlContentScraper extracts article text from news websites.

Web pages usually mix their content with noise such as menus and ads. An HTML document does not distinguish between the text and the markup that presents it, so the core content has to be extracted with software like this. This project is a test task and my first project in Python.

Startup

usage: launcher.py [-h] url

Extract article from website.

positional arguments:
  url         an website url

optional arguments:
  -h, --help  show this help message and exit
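
A parser that yields a usage message like the one above could be declared with argparse roughly as follows. This is a minimal sketch, not the project's actual launcher.py: the function name build_parser is hypothetical, and only the program name, description, and help text are taken from the output above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # prog, description, and help text are copied from the usage message;
    # build_parser itself is a hypothetical name used for illustration.
    parser = argparse.ArgumentParser(
        prog="launcher.py",
        description="Extract article from website.",
    )
    # A single required positional argument: the page to scrape.
    parser.add_argument("url", help="an website url")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["chizganov.com/article/index.html"])
print(args.url)
```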

The launcher creates a directory based on the URL and saves a txt file with the extracted text there:

chizganov.com/article/index.html -> [CUR_DIR]/chizganov_com/article/index.txt
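
The mapping above can be sketched as a small function. This is an assumption about how the path might be derived, not the project's code: url_to_output_path is a hypothetical name, and falling back to index.txt for URLs without a file name is a guess that the example does not cover.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def url_to_output_path(url: str, base_dir: Path) -> Path:
    """Map a page URL to a .txt path under base_dir (hypothetical helper)."""
    # urlsplit only recognises the host when a scheme or "//" prefix is present.
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url)

    # Dots in the host name become underscores: chizganov.com -> chizganov_com
    host_dir = parts.netloc.replace(".", "_")

    page = PurePosixPath(parts.path.lstrip("/"))
    # Assumption: a URL that ends in "/" or has no path is saved as "index".
    if not page.name or parts.path.endswith("/"):
        page = page / "index"

    # Swap the page extension (for example .html) for .txt.
    page = page.with_suffix(".txt")
    return base_dir.joinpath(host_dir, *page.parts)


print(url_to_output_path("chizganov.com/article/index.html", Path.cwd()))
```

Passing the base directory explicitly keeps the function independent of the current working directory, which makes it easier to test.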

Algorithm

HtmlContentScraper is based on the ECON algorithm, but changes how backtracking works:

  • Clear tags whose classes differ from those on the snippet-node branch. On most websites, article paragraphs share the same classes.
  • Allow backtracking to continue when the text length changes only slightly between levels. Some websites contain small noise that interrupts backtracking unnecessarily.
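
The first change can be illustrated with a short sketch. It is one possible reading of the rule, not the project's implementation: prune_by_class is a hypothetical name, classes are compared as whole attribute strings, and xml.etree.ElementTree stands in for whatever HTML parser the project uses, so the sample markup has to be well-formed.

```python
import xml.etree.ElementTree as ET


def prune_by_class(parent: ET.Element, branch_child: ET.Element) -> None:
    """Drop children of parent whose class differs from branch_child's.

    branch_child is the child that lies on the path to the snippet-node.
    """
    wanted = branch_child.get("class")
    # Iterate over a copy because the child list is modified in the loop.
    for child in list(parent):
        if child is not branch_child and child.get("class") != wanted:
            parent.remove(child)


html = (
    '<div>'
    '<p class="text">First paragraph. </p>'
    '<div class="ad">Buy now!</div>'
    '<p class="text">Second paragraph.</p>'
    '</div>'
)
root = ET.fromstring(html)
prune_by_class(root, root[0])   # root[0] plays the role of the snippet branch
print("".join(root.itertext())) # First paragraph. Second paragraph.
```

In this sketch the ad block is removed because its class differs from the paragraph on the snippet branch, while the second paragraph survives because it shares that class.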

I made these changes because the ECON algorithm shows good results on Chinese websites but has some problems on Russian ones.

Future improvements

  • Finish the current algorithm: handle different paragraph styles
  • Add another algorithm, such as CoreEx or a visual algorithm
  • Add a template scraper
  • Add a template HTML formatter
  • Encapsulate the parsing mechanism

Tests

Test URLs Results
Lenta link
Life link
Meduza link
НГС link
TVZvezda link
Express link
ГазетаРу link
Новая газета link
NYTimes link
Свобода link
Washington Post link
Wylsa link

All results are in the res folder.