leopardslab / crawlerx

CrawlerX - Develop Extensible, Distributed, Scalable Crawler System: a web platform that can be used to crawl URLs over different kinds of protocols in a distributed way.

License: Apache License 2.0

JavaScript 2.84% HTML 0.21% Vue 34.47% Python 8.51% SCSS 52.31% Dockerfile 0.15% Shell 0.14% Mustache 0.17% CSS 1.20%
django-backend web-crawling mongodb-server vuejs elasticsearch message-broker firebase-auth

crawlerx's Introduction

CrawlerX - Develop Extensible, Distributed, Scalable Crawler System

CrawlerX is a platform that you can use to crawl web URLs over different kinds of protocols in a distributed way. Web crawling, often called web scraping, is a method of programmatically going over a collection of web pages and extracting data that is useful for analysis. With a web scraper, you can mine data about a set of products, gather a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own curiosity.

Architecture Diagram

CrawlerX includes the following runtimes to do the crawling jobs for you.

  • VueJS Frontend - the dashboard that users interact with
  • Firebase - user authorization & authentication
  • Django Backend Server - exposes the API endpoints used by the frontend
  • RabbitMQ Server - message broker
  • Celery Beat and Workers - job scheduler and executors
  • Scrapy Server - extracts the data you need from websites
  • MongoDB Server - stores crawled data
  • ElasticSearch - job/query searching mechanisms
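
At a high level, the Django backend hands a crawl request to Celery over RabbitMQ; a Celery worker then asks the Scrapy daemon (scrapyd) to run a spider and records the job in MongoDB so the dashboard can track it. The following is only a minimal sketch of that flow, not CrawlerX's actual code; the broker URL, project, spider, database, and collection names are placeholders.

import requests
from celery import Celery
from pymongo import MongoClient

# Placeholder broker URL; the real settings live in crawlerx_server's configuration.
app = Celery("crawler", broker="amqp://guest:guest@localhost:5672//")

@app.task
def schedule_crawl(project, spider, url):
    # Ask scrapyd (default port 6800) to start a spider run for the given URL.
    resp = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": project, "spider": spider, "url": url},
    )
    job = resp.json()  # e.g. {"status": "ok", "jobid": "..."}

    # Record the job so its status can be shown on the dashboard.
    mongo = MongoClient("mongodb://localhost:27017/")
    mongo["crawlerx"]["jobs"].insert_one({"url": url, "jobid": job.get("jobid")})
    return job.get("jobid")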

CrawlerX Dashboard

In the CrawlerX dashboard, you can get an overview of completed and in-progress crawl projects and jobs, along with their status.

Dashboard

Crawl Job Scheduling

In CrawlerX, you can schedule crawl jobs in three ways.

  • Instant Scheduler - the crawl job runs immediately
  • Interval Scheduler - the crawl job runs at a fixed interval
  • Cron Scheduler - the crawl job runs on a cron schedule

Job Scheduler
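
Under the hood, the interval and cron modes are driven by Celery Beat. If the backend stores its schedules with django-celery-beat (an assumption; the README only names Celery Beat), the three modes map onto it roughly as in the sketch below, where the task path and arguments are hypothetical:

import json
from django_celery_beat.models import CrontabSchedule, IntervalSchedule, PeriodicTask

# Instant: enqueue the task once, right away.
# schedule_crawl.delay("my_project", "url_spider", "https://example.com")

# Interval: run the crawl every 30 minutes.
every_30, _ = IntervalSchedule.objects.get_or_create(
    every=30, period=IntervalSchedule.MINUTES
)
PeriodicTask.objects.create(
    name="crawl-example-every-30m",
    task="crawler.tasks.schedule_crawl",  # hypothetical task path
    interval=every_30,
    args=json.dumps(["my_project", "url_spider", "https://example.com"]),
)

# Cron: run the crawl at 02:00 every day.
nightly, _ = CrontabSchedule.objects.get_or_create(minute="0", hour="2")
PeriodicTask.objects.create(
    name="crawl-example-nightly",
    task="crawler.tasks.schedule_crawl",
    crontab=nightly,
    args=json.dumps(["my_project", "url_spider", "https://example.com"]),
)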

Prerequisites

First, you need to edit the .env file in the crawlerx_app root directory with your web app's Firebase configuration details.

VUE_APP_FIREBASE_API_KEY = "<your-api-key>"
VUE_APP_FIREBASE_AUTH_DOMAIN = "<your-auth-domain>"
VUE_APP_FIREBASE_DB_DOMAIN = "<your-db-domain>"
VUE_APP_FIREBASE_PROJECT_ID = "<your-project-id>"
VUE_APP_FIREBASE_STORAGE_BUCKET = "<your-storage-bucket>"
VUE_APP_FIREBASE_MESSAGING_SENDER_ID = "<your-messaging-sender-id>"
VUE_APP_FIREBASE_APP_ID = "<your-app-id>"
VUE_APP_FIREBASE_MEASURMENT_ID = "<your-measurement-id>"

Setup on Container-based Environments

Kubernetes Helm Deployment

See the Helm deployment documentation

Docker Compose

Please follow the steps below to set up CrawlerX in a container environment.

docker-compose up --build

Open http://localhost:8080 to view the CrawlerX web UI in the browser.

Setup on a VM-based Environment

Please follow the steps below to set up CrawlerX in your VM-based environment.

Start RabbitMQ broker

$ docker run -d --hostname my-rabbit --name some-rabbit -p 5672:5672 -p 8080:15672 rabbitmq:3-management
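
Celery talks to RabbitMQ over AMQP on port 5672, so that port has to be published in addition to the management UI. A quick connectivity check from Python, assuming RabbitMQ's default guest/guest account:

import pika  # pip install pika

# Uses RabbitMQ's default guest/guest credentials on localhost:5672.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", port=5672))
print("RabbitMQ reachable:", connection.is_open)
connection.close()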

Start MongoDB Server

$ docker run -d -p 27017:27017 --name some-mongo \
    -e MONGO_INITDB_ROOT_USERNAME=<username> \
    -e MONGO_INITDB_ROOT_PASSWORD=<password> \
    mongo
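
To confirm that the server is up and the credentials work, you can run a short pymongo check, substituting the same username and password passed to the container:

from pymongo import MongoClient  # pip install pymongo

# <username> and <password> are the values used in the docker run command above.
client = MongoClient("mongodb://<username>:<password>@localhost:27017/")
print(client.admin.command("ping"))  # prints {'ok': 1.0} when the server is reachable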

Start the Scrapy daemon (after installing scrapyd)

$ cd scrapy_app
$ scrapyd
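
scrapyd listens on port 6800 by default and exposes a small JSON API, which is also what the backend calls to schedule spider runs. A quick check that the daemon is up and which projects are deployed:

import requests

# scrapyd's default HTTP API on port 6800.
print(requests.get("http://localhost:6800/daemonstatus.json").json())
print(requests.get("http://localhost:6800/listprojects.json").json())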

Start ElasticSearch

$ docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.8.1
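
Once the container is running, the cluster health endpoint should report a green or yellow status for this single-node setup:

import requests

# Elasticsearch REST API on the default port 9200.
health = requests.get("http://localhost:9200/_cluster/health").json()
print(health["status"])  # "green" or "yellow" on a healthy single-node cluster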

Start Celery Beat

$ cd crawlerx_server
$ celery -A crawlerx_server beat -l INFO

Start Celery Worker

$ cd crawlerx_server
$ celery -A crawlerx_server worker -l INFO

Start the Django backend:

$ pip install django
$ cd crawlerx_server
$ python3 manage.py runserver

Start the frontend:

$ cd crawlerx_app
$ npm install
$ npm start

Todos

  • Tor URL crawler

License

Apache License 2.0

crawlerx's People

Contributors

beshiniii, charithccmc, dependabot[bot], dizzysilva, drifterkaru, ffalpha, poornimarangoda, sajithaliyanage, sandagomipieris


crawlerx's Issues

Feature: Improve Schedule new Job popup view

We can update the Schedule new Job section with the following improvements.

  • New section for scheduled jobs
  • New section for file upload URLs
  • New design for cron schedule jobs

Exposing web app's firebase configuration

Describe the bug
Even though Firebase is only used for authentication in CrawlerX, it is not nice to put its configuration values in a public repo. Instead, we can add a .env file to the project and put all the environment variables there.

[GSoC 2021] Integrate user based data management module

Is your feature request related to a problem? Please describe.

Currently, CrawlerX supports a basic data-saving mechanism for each user. Since this is a data scraping server, it should be able to manage data in a more user-friendly manner.

[GSoC 2021] Data export module for CrawlerX projects

Is your feature request related to a problem? Please describe.
Currently, it is only possible to view data via the embedded JSON viewer. It would be great if we could export this data as a JSON or CSV file for each project.

.gitignore problem

Describe the bug
.gitignore doesn't contain the necessary line to ignore the node_modules directory.

[GSoC 2021] Integrate Apache Airflow to manage Crawler jobs

Is your feature request related to a problem? Please describe.

Airflow is a platform created by the community to programmatically author, schedule, and monitor workflows. It provides the following advantages.

  • Easy to manage workflows
  • Guaranteed delivery
  • Easy to monitor workflows
  • Easy to schedule crawler jobs
  • Supports many message broker implementations
  • Container-native support
  • Dashboard support

Cannot build docker - Failed to build Twisted

Hi.
I run docker-compose up --build and get an error:
(screenshot of the failed Twisted build)

I tried installing Twisted's dependencies in the Dockerfile and changing the Twisted version in requirements.txt, but it didn't solve the problem.
RUN apt-get update && apt-get install -y gcc libc6-dev

Can you help me?
Thank you!

Console warnings in Crawlerx App

Describe the bug
There are a few console warnings related to Firebase and Vue Router.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'crawlerx_app'
  2. Run the project by typing 'npm run serve'
  3. Open Console
  4. See warnings

Screenshots
(screenshot of the console warnings)

Desktop (please complete the following information):

  • OS: Windows
  • Browser: Chrome
