JobsCrawling

This project was created to parse job data (title, description, etc.) from the following sites:

  • https://www.hrforecast.de/company/career/
  • https://www.gazpromvacancy.ru/jobs/

The parsed data is saved to a Postgres database.

Prerequisites

  1. Install python3
  • Follow this link and download the latest python3 OS X package.
  • Run the package and follow the steps to install python3 on your computer.
  • Once the installation is done, run python3 --version in your Terminal to verify it.
  2. Install pip
  • Securely download the get-pip.py file from this link.
  • From the directory where the file was downloaded, run the following command in the Terminal:
python3 get-pip.py
  3. Install virtualenv using pip:
pip3 install virtualenv
  4. Install the database. Follow this link, or run the command below. Note that the postgres package on PyPI is a Python client library, not the PostgreSQL server itself, so a running PostgreSQL server is still required.
pip install postgres
  5. Set up the virtual environment for the project. The directory name is case-sensitive on most systems, so the activate path must match the name used at creation:
virtualenv CrawlEnv
source CrawlEnv/bin/activate

Dependencies

  • Install the Scrapy framework: pip install scrapy
  • Install psycopg2: pip install psycopg2-binary

Set Up the Database

Run psql in the shell, then create a database and a user:

CREATE DATABASE jobs_hrforecast;
CREATE USER manager WITH ENCRYPTED PASSWORD 'hrforecast';
GRANT ALL PRIVILEGES ON DATABASE jobs_hrforecast TO manager;

Connect to the new database as manager:

psql jobs_hrforecast manager

Create a table jobs_data:

CREATE TABLE jobs_data (
id SERIAL PRIMARY KEY NOT NULL,
job_url VARCHAR(512),
job_title VARCHAR(255),
job_description TEXT,
company_name VARCHAR(255),
crawled_date TIMESTAMP DEFAULT NOW(),
posted_date VARCHAR(255)
);
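
For orientation, here is a minimal sketch of how rows might be written to jobs_data from a Scrapy item pipeline. It is not taken from the project: the names COLUMNS, INSERT_SQL, item_to_row and PostgresPipeline are hypothetical, host="localhost" is an assumption, and the item field names are assumed to match the table columns. The id and crawled_date columns are left to Postgres, which fills them from SERIAL and DEFAULT NOW().

```python
# Columns the crawler is assumed to supply; id and crawled_date are
# filled in by Postgres (SERIAL and DEFAULT NOW()).
COLUMNS = ("job_url", "job_title", "job_description", "company_name", "posted_date")

# Parameterized statement using psycopg2's %s placeholders.
INSERT_SQL = "INSERT INTO jobs_data ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS))
)


def item_to_row(item):
    """Return the item's values in column order; missing fields become None."""
    return tuple(item.get(col) for col in COLUMNS)


class PostgresPipeline:
    """Hypothetical Scrapy item pipeline (assumption, not the project's code)."""

    def open_spider(self, spider):
        # Imported lazily so the helpers above can be used without psycopg2.
        import psycopg2

        self.conn = psycopg2.connect(
            dbname="jobs_hrforecast",
            user="manager",
            password="hrforecast",
            host="localhost",  # assumption: the database runs locally
        )

    def process_item(self, item, spider):
        # Insert one row per scraped item and commit immediately.
        with self.conn.cursor() as cur:
            cur.execute(INSERT_SQL, item_to_row(item))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

Passing the values as parameters instead of formatting them into the SQL string lets psycopg2 handle quoting, which matters for free-text fields such as job_description.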

Running Spiders

To run the spiders, execute the following commands in the project directory /DataCrawler/spiders/:

scrapy runspider hr_spider.py
scrapy runspider gazprom_spider.py

Alternatively, run the bash script crawling.sh.
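
The contents of crawling.sh are not shown here. As a rough Python equivalent of running both spiders in sequence, a sketch might look like the following. The names SPIDERS, build_commands and run_all are hypothetical, and the relative path DataCrawler/spiders is an assumption about where the script is started from.

```python
import subprocess

# Spider files named in the commands above.
SPIDERS = ["hr_spider.py", "gazprom_spider.py"]


def build_commands(spiders=SPIDERS):
    """Return one 'scrapy runspider <file>' argument list per spider."""
    return [["scrapy", "runspider", name] for name in spiders]


def run_all(cwd="DataCrawler/spiders"):
    """Run each spider in turn; stop at the first one that fails."""
    for cmd in build_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling run_all() from the repository root, with the virtual environment active, would then run hr_spider.py followed by gazprom_spider.py.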

Output Data

The crawled data is stored in the database. Connect to it as manager:

psql jobs_hrforecast manager

Then check for tables:

\dt

Select all data from the table:

SELECT * FROM jobs_data;
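
The same data can also be read from Python. The sketch below is an illustration only: fetch_recent_jobs is a hypothetical helper, and it assumes conn is an open psycopg2 connection to jobs_hrforecast.

```python
def fetch_recent_jobs(conn, limit=10):
    """Return the most recently crawled rows as a list of dicts."""
    query = (
        "SELECT job_url, job_title, company_name, crawled_date "
        "FROM jobs_data ORDER BY crawled_date DESC LIMIT %s"
    )
    with conn.cursor() as cur:
        cur.execute(query, (limit,))
        # cursor.description holds one entry per result column; index 0 is its name.
        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
```

A connection for this could be opened with psycopg2.connect(dbname="jobs_hrforecast", user="manager", password="hrforecast"), using the credentials created in the database setup.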

Contributors

  • devdjan
