leopardslab / crawlerx

CrawlerX - Develop Extensible, Distributed, Scalable Crawler System: a web platform that can be used to crawl URLs over different kinds of protocols in a distributed way.

License: Apache License 2.0

JavaScript 2.84% HTML 0.21% Vue 34.47% Python 8.51% SCSS 52.31% Dockerfile 0.15% Shell 0.14% Mustache 0.17% CSS 1.20%
django-backend web-crawling mongodb-server vuejs elasticsearch message-broker firebase-auth

crawlerx's Introduction

CrawlerX - Develop Extensible, Distributed, Scalable Crawler System

CrawlerX is a platform that you can use to crawl web URLs over different kinds of protocols in a distributed way. Web crawling, often called web scraping, is a method of programmatically going over a collection of web pages and extracting data that is useful for analysis. With a web scraper, you can mine data about a set of products, gather a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own curiosity.

Architecture Diagram

CrawlerX includes the following runtimes to do the crawling jobs for you.

  • VueJS Frontend - the dashboard that users interact with
  • Firebase - user authorization & authentication
  • Django Backend Server - exposes the API endpoints used by the frontend
  • RabbitMQ Server - message broker
  • Celery Beat and Workers - job scheduler and executors
  • Scrapy Server - extracts the data you need from websites
  • MongoDB Server - stores crawled data
  • ElasticSearch - job/query searching mechanisms
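
At a high level, the Django backend hands a crawl request to Celery over RabbitMQ; a Celery worker then asks the Scrapy daemon (scrapyd) to run a spider and records the job in MongoDB so the dashboard can track it. The following is only a minimal sketch of that flow, not CrawlerX's actual code; the broker URL, project, spider, database, and collection names are placeholders.

import requests
from celery import Celery
from pymongo import MongoClient

# Placeholder broker URL; the real settings live in crawlerx_server's configuration.
app = Celery("crawler", broker="amqp://guest:guest@localhost:5672//")

@app.task
def schedule_crawl(project, spider, url):
    # Ask scrapyd (default port 6800) to start a spider run for the given URL.
    resp = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": project, "spider": spider, "url": url},
    )
    job = resp.json()  # e.g. {"status": "ok", "jobid": "..."}

    # Record the job so its status can be shown on the dashboard.
    mongo = MongoClient("mongodb://localhost:27017/")
    mongo["crawlerx"]["jobs"].insert_one({"url": url, "jobid": job.get("jobid")})
    return job.get("jobid")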

CrawlerX Dashboard

In the CrawlerX dashboard, you can get an overview of completed and in-progress crawl projects and jobs, along with their status.

Dashboard

Crawl Job Scheduling

In CrawlerX, you can schedule crawl jobs in three ways.

  • Instant Scheduler - the crawl job runs immediately
  • Interval Scheduler - the crawl job runs at a fixed interval
  • Cron Scheduler - the crawl job runs on a cron schedule

Job Scheduler
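
Under the hood, the interval and cron modes are driven by Celery Beat. If the backend stores its schedules with django-celery-beat (an assumption; the README only names Celery Beat), the three modes map onto it roughly as in the sketch below, where the task path and arguments are hypothetical:

import json
from django_celery_beat.models import CrontabSchedule, IntervalSchedule, PeriodicTask

# Instant: enqueue the task once, right away.
# schedule_crawl.delay("my_project", "url_spider", "https://example.com")

# Interval: run the crawl every 30 minutes.
every_30, _ = IntervalSchedule.objects.get_or_create(
    every=30, period=IntervalSchedule.MINUTES
)
PeriodicTask.objects.create(
    name="crawl-example-every-30m",
    task="crawler.tasks.schedule_crawl",  # hypothetical task path
    interval=every_30,
    args=json.dumps(["my_project", "url_spider", "https://example.com"]),
)

# Cron: run the crawl at 02:00 every day.
nightly, _ = CrontabSchedule.objects.get_or_create(minute="0", hour="2")
PeriodicTask.objects.create(
    name="crawl-example-nightly",
    task="crawler.tasks.schedule_crawl",
    crontab=nightly,
    args=json.dumps(["my_project", "url_spider", "https://example.com"]),
)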

Prerequisites

First, you need to edit the .env file in the crawlerx_app root directory with your web app's Firebase configuration details.

VUE_APP_FIREBASE_API_KEY = "<your-api-key>"
VUE_APP_FIREBASE_AUTH_DOMAIN = "<your-auth-domain>"
VUE_APP_FIREBASE_DB_DOMAIN = "<your-db-domain>"
VUE_APP_FIREBASE_PROJECT_ID = "<your-project-id>"
VUE_APP_FIREBASE_STORAGE_BUCKET = "<your-storage-bucket>"
VUE_APP_FIREBASE_MESSAGING_SENDER_ID = "<your-messaging-sender-id>"
VUE_APP_FIREBASE_APP_ID = "<your-app-id>"
VUE_APP_FIREBASE_MEASURMENT_ID = "<your-measurement-id>"

Setup on Container-based Environments

Kubernetes Helm Deployment

See the Helm deployment documentation

Docker Compose

Please follow the steps below to set up CrawlerX in a container environment.

docker-compose up --build

Open http://localhost:8080 to view the CrawlerX web UI in the browser.

Setup on a VM-based Environment

Please follow the steps below to set up CrawlerX in your VM-based environment.

Start RabbitMQ broker

$ docker run -d --hostname my-rabbit --name some-rabbit -p 5672:5672 -p 8080:15672 rabbitmq:3-management
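
Celery talks to RabbitMQ over AMQP on port 5672, so that port has to be published in addition to the management UI. A quick connectivity check from Python, assuming RabbitMQ's default guest/guest account:

import pika  # pip install pika

# Uses RabbitMQ's default guest/guest credentials on localhost:5672.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", port=5672))
print("RabbitMQ reachable:", connection.is_open)
connection.close()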

Start MongoDB Server

$ docker run -d -p 27017:27017 --name some-mongo \
    -e MONGO_INITDB_ROOT_USERNAME=<username> \
    -e MONGO_INITDB_ROOT_PASSWORD=<password> \
    mongo
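
To confirm that the server is up and the credentials work, you can run a short pymongo check, substituting the same username and password passed to the container:

from pymongo import MongoClient  # pip install pymongo

# <username> and <password> are the values used in the docker run command above.
client = MongoClient("mongodb://<username>:<password>@localhost:27017/")
print(client.admin.command("ping"))  # prints {'ok': 1.0} when the server is reachable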

Start the Scrapy daemon (after installing scrapyd)

$ cd scrapy_app
$ scrapyd
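
scrapyd listens on port 6800 by default and exposes a small JSON API, which is also what the backend calls to schedule spider runs. A quick check that the daemon is up and which projects are deployed:

import requests

# scrapyd's default HTTP API on port 6800.
print(requests.get("http://localhost:6800/daemonstatus.json").json())
print(requests.get("http://localhost:6800/listprojects.json").json())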

Start ElasticSearch

$ docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.8.1
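
Once the container is running, the cluster health endpoint should report a green or yellow status for this single-node setup:

import requests

# Elasticsearch REST API on the default port 9200.
health = requests.get("http://localhost:9200/_cluster/health").json()
print(health["status"])  # "green" or "yellow" on a healthy single-node cluster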

Start Celery Beat

$ cd crawlerx_server
$ celery -A crawlerx_server beat -l INFO

Start Celery Worker

$ cd crawlerx_server
$ celery -A crawlerx_server worker -l INFO

Start the Django backend:

$ pip install django
$ cd crawlerx_server
$ python3 manage.py runserver

Start the frontend:

$ cd crawlerx_app
$ npm install
$ npm start

Todos

  • Tor URL crawler

License

Apache License 2.0

crawlerx's People

Contributors

beshiniii, charithccmc, dependabot[bot], dizzysilva, drifterkaru, ffalpha, poornimarangoda, sajithaliyanage, sandagomipieris


crawlerx's Issues

Feature: Improve Schedule new Job popup view

We can update the Schedule new Job section with the following improvements.

  • New section for scheduled jobs
  • New section for file upload URLs
  • New design for cron schedule jobs

Exposing web app's firebase configuration

Describe the bug
Even though Firebase is only used for authentication in CrawlerX, it is not nice to put its configuration values in a public repo. Instead, we can add a .env file to the project and put all the environment variables there.

[GSoC 2021] Integrate user based data management module

Is your feature request related to a problem? Please describe.

Currently, CrawlerX supports a basic data-saving mechanism for each user. Since this is a data scraping server, it should be able to manage data in a more user-friendly manner.

[GSoC 2021] Data export module for CrawlerX projects

Is your feature request related to a problem? Please describe.
Currently, it is only possible to view data via the embedded JSON viewer. It would be great if we could export this data as a JSON or CSV file for each project.

.gitignore problem

Describe the bug
.gitignore doesn't contain the necessary line to ignore the node_modules directory.

[GSoC 2021] Integrate Apache Airflow to manage Crawler jobs

Is your feature request related to a problem? Please describe.

Airflow is a platform created by the community to programmatically author, schedule, and monitor workflows. It provides the following advantages.

  • Easy to manage workflows
  • Guaranteed delivery
  • Easy to monitor workflows
  • Easy to schedule crawler jobs
  • Supports many message broker implementations
  • Container-native support
  • Dashboard support

Cannot build docker - Failed to build Twisted

Hi.
I run docker-compose up --build and get an error:
(screenshot of the failed Twisted build)

I tried installing Twisted's dependencies in the Dockerfile and changing the Twisted version in requirements.txt, but it didn't solve the problem.
RUN apt-get update && apt-get install -y gcc libc6-dev

Can you help me?
Thank you!

Console warnings in Crawlerx App

Describe the bug
There are a few console warnings related to Firebase and Vue Router.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'crawlerx_app'
  2. Run the project by typing 'npm run serve'
  3. Open Console
  4. See warnings

Screenshots
(screenshot of the console warnings)

Desktop (please complete the following information):

  • OS: Windows
  • Browser: Chrome
