
Keep-Current - The Web Crawler


After studying a topic, keeping current with the news, published papers, and emerging technologies proves to be hard work. One must attend conventions, subscribe to different websites and newsletters, and go through emails and alerts, all while filtering the relevant information out of these sources.

In this project, we aspire to create a platform for students, researchers, professionals, and enthusiasts to discover news on relevant topics. Users are encouraged to continuously give feedback on the suggestions, so that future results can be adapted and personalized.

The goal is to create an automated system that scans the web through a list of trusted sources, classifies and categorizes the documents it finds, and matches them to the different users according to their interests. The results are then presented to the user as a timely, summarized digest, whether by email or within a site.

Web Crawler

This repository contains a web crawler that, given a specific set of sources (URLs), locates new documents (web pages) and saves them in the DB for future processing. Where a website offers one, an API can be used instead, as with arxiv.org (see the sketch below).
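As a rough illustration, here is a minimal sketch of fetching new documents from the arXiv API. It assumes the feedparser package and placeholder query and field names; the actual query and storage layer are not part of this repository.

import feedparser  # pip install feedparser

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_arxiv(query="cat:cs.CL", max_results=10):
    # The arXiv API returns an Atom feed, which feedparser can read directly from the URL.
    url = f"{ARXIV_API}?search_query={query}&start=0&max_results={max_results}"
    feed = feedparser.parse(url)
    for entry in feed.entries:
        # These are the kinds of fields one would persist in the DB for later processing.
        yield {
            "id": entry.id,
            "title": entry.title,
            "summary": entry.summary,
            "link": entry.link,
        }

if __name__ == "__main__":
    for doc in fetch_arxiv():
        print(doc["title"])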

Recommended tools:

We lean heavily on existing tools as well as developing our own methods. Among the existing tools, we use scrapy, which we later hope to host on scrapy-cloud. Another tool worth considering is scrapy-splash, which can render JS-based pages before they are stored, and textract can be used to extract the textual content to be saved. A minimal spider sketch follows.
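The sketch below shows what a scrapy spider for this task could look like. The source URL, item fields, and the pipeline that would write items to the DB are assumptions for illustration, not the project's actual code.

import scrapy

class SourceSpider(scrapy.Spider):
    name = "sources"
    # Placeholder for the project's list of trusted sources.
    start_urls = ["https://arxiv.org/list/cs.CL/recent"]

    def parse(self, response):
        # Yield one item per discovered link; a scrapy item pipeline can persist it,
        # and textract can later extract the plain text from downloaded files.
        for href in response.css("a::attr(href)").getall():
            yield {"source": response.url, "link": response.urljoin(href)}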

Getting started

To run this project locally, please install pipenv:

pip install pipenv

Then run:

pipenv install

After all the dependencies are installed, run:

pipenv run python manage.py 

Who are we?

This project is intended to be a shared work of the Vienna Data Science Cafe Meet-Up members. Besides the obvious result, its purpose is to serve as a learning platform while advancing the Natural Language Processing / Machine Learning field by exploring, comparing, and hacking different models.

Please feel free to contribute.

The project board is on Trello, and we use Slack as our communication channel.

I want to help

We welcome anyone who would like to join and contribute. We meet every month in Vienna at the Data Science Cafe meetup of the VDSG to show our progress and discuss the next steps.

The repository

This repository is the web crawler / spider. If you wish to assist with different aspects (Data Engineering / Web Development / DevOps), we have divided the project into several additional repositories focusing on those topics:

  • The machine-learning engine can be found in our Main repository
  • Web Development & UI/UX experiments can be found in our App repository
  • Data Engineering tasks are more than welcome in our Data Engineering repository
  • DevOps tasks span the whole project. We are trying to develop this project with a serverless architecture and are currently looking into Docker and Kubernetes, as well as different hosting providers and plans. Feel free to join the discussion and provide your input!
