sky's Introduction

sky is a web scraping framework, implemented with recent Python versions in mind (3.4+). It uses the asynchronous asyncio framework, as well as many popular modules and extensions.

Most importantly, it aims for next-generation web crawling, where machine intelligence is used to speed up development and to make crawlers easier to maintain and more reliable.

It does this mainly by assuming the user is interested in content from whole domains, not just a collection of single pages (the templating approach).

See it live in action with a news website YOU propose:

  • Locally (view demo)
  • Remotely (needs online hosting)

Demo

Note that the following is only meant as a demo of some kind of app that could be built upon the scraping framework.

Make no mistake: the goal is to provide a smart-scraper, not some ugly UI.

Run:

  • Install using pip: pip3 install -U sky
  • Run sky view at the command line (use -port PORT to change port)
  • Visit localhost:7900
  • Enter a Domain/URL and see the result after clicking [>>>].

The demo uses a standard configuration that can easily be improved on when setting up a project.



Similar data (title, body, publish_date, images etc) will be very easy to use in your own applications.

Features/Goals

These are the features/goals of sky. Items marked with a checkmark have been accomplished:

  • Really fast, thanks to the asyncio/aiohttp libraries available for Python 3.4+; based on 500lines/crawler
  • Smart, due to considering crawling of websites instead of single pages
  • Boilerplate FREE, removes crappy content (images, text, etc) that does not belong on pages
  • Nice API, carefully crafted, easily extensible
  • Open-source, democracy driven, with actual support
  • Free, versus enormous costs for even medium scale projects using (worse) online services
  • Link-graph analysis, find out what a domain "looks" like
  • Batteries included, crawl any news website without any configuration
  • Automatic Natural Language Processing, detecting keywords in text automatically
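
To illustrate why asyncio helps with crawl speed, here is a minimal sketch of bounded concurrent fetching. It is not sky's actual code: `fetch`, `crawl` and `max_concurrency` are hypothetical names, and the network request is simulated with a short sleep so the example runs offline. It uses `async`/`await` and `asyncio.run`, which need a newer Python (3.7+) than the 3.4 baseline mentioned above.

```python
import asyncio


async def fetch(url, delay=0.01):
    # Hypothetical stand-in for an HTTP request; a real crawler would
    # await an HTTP client (such as aiohttp) here instead of sleeping.
    await asyncio.sleep(delay)
    return url, "<html>content of {}</html>".format(url)


async def crawl(urls, max_concurrency=5):
    # The semaphore caps how many fetches are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url)

    # gather() runs all coroutines concurrently and keeps input order.
    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


if __name__ == "__main__":
    urls = ["http://example.com/page/{}".format(i) for i in range(20)]
    pages = asyncio.run(crawl(urls))
    print(len(pages), "pages fetched")
```

Because the waits overlap, the 20 simulated requests finish in roughly four batches of five rather than one after another; the same pattern applies when the sleep is replaced by real I/O.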

Installation

Use pip to install sky:

pip3 install -U sky

This will install only the required packages. Storing data on the local system does not require any other packages.

To store data, the following optional backends are currently available: elasticsearch, cloudant and ZODB.

Using the package

To set up a project/crawling service, see the "Getting started" readme.

Contribute

Contributions are very much appreciated in one or more of the following areas:

  • More Backends
  • Documentation/tests
  • Improvement of detection
  • NLP

Templating approach

By considering crawl content to originate from a domain, rather than from individual pages, the following will be possible:

  • ✓ Drop duplicate content (menus, texts, images)
  • ✓ Provide error checking tools (making sure no bad documents slip by)
  • Detect whether a website changed the layout (causing non-sky scrapers to fail)
  • Understand sections of a website, such as comments, forum posts, related links etc
  • Consider which pages are linked to which (star graph)
  • Figure out the content pages by just pointing at the domain
  • Relate pages (page A is related by content to page B)
  • Consider an optimal re-crawling path
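
The first item in the list, dropping duplicate content, can be sketched roughly as follows. This is an illustration of the domain-level idea only, not sky's implementation: the function name `drop_shared_blocks`, the `max_share` threshold and the sample pages are all assumptions made for this example. A text block that shows up on most pages of a domain (a menu, a footer) is treated as template material and removed.

```python
from collections import Counter


def drop_shared_blocks(pages, max_share=0.5):
    """Remove text blocks that appear on more than `max_share` of pages.

    `pages` maps a URL to the list of text blocks extracted from it.
    """
    # Count on how many pages each block occurs (at most once per page).
    page_counts = Counter()
    for blocks in pages.values():
        page_counts.update(set(blocks))

    limit = max_share * len(pages)
    return {
        url: [b for b in blocks if page_counts[b] <= limit]
        for url, blocks in pages.items()
    }


if __name__ == "__main__":
    pages = {
        "/a": ["Home | News | About", "Article A body", "Copyright 2015"],
        "/b": ["Home | News | About", "Article B body", "Copyright 2015"],
        "/c": ["Home | News | About", "Article C body", "Copyright 2015"],
    }
    for url, blocks in drop_shared_blocks(pages).items():
        print(url, blocks)
```

A single-page scraper cannot make this distinction, because whether a block is boilerplate only becomes visible once several pages from the same domain are compared.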

sky's People

Contributors

kootenpv
