indeed-jobs-scraper's Introduction

Indeed Jobs Scraper

A Python web scraper to automate job search.

Requirements

  • Google Chrome version 84
  • pip3: pip3 install --upgrade pip
  • pipenv: pip3 install pipenv
  • Python 3.8 or higher: will be automatically installed in the local virtual environment

Indeed Jobs Scraper setup

  1. Clone this repository to your machine.
  2. Navigate to the project directory and create a virtual environment: pipenv install
  3. Run pipenv run python indeed_crawler.py to check that it works.
    • It should display a prompt asking: "Do you wish to scrape Indeed with the default search config?(yes/no)"
    • Enter "yes". It should look for Spanish teacher jobs in New York.

How to use Indeed Jobs Scraper

If you don't want to watch the bot interact with the site in the browser, open config/driver_window.json, change the false value to true, and save the file. This will minimise the browser.

To launch the bot, run pipenv run python indeed_crawler.py

Upon initiation, you will be prompted to either use the default search configuration or to create a new search.

If "yes" is entered, the bot will search with the parameters given in the previous search. To make a new search, enter "no".

You will be able to specify the following:

  • Country of search (even if the job is remote, it will search for companies located wherever you specify)
  • The job title
  • Specific location: You can look for remote jobs or jobs located somewhere specifically
  • Your base salary: This will not filter out jobs that don't offer this salary (many posts don't show one), but it will be added as a search parameter
  • Job post recency
  • Matching terms: The bot will use your input to select or discard job posts containing specific terms. You can specify:
    • Words you want in the job title
    • Words you don't want in the job title
    • Words you want in the job description
    • Words you don't want in the description

You can also skip entering matching terms, in which case the bot will yield everything it finds with your search parameters.

Your search parameters and matching terms will be saved as the default configuration. In your next search, you will only have to enter "yes" when prompted to use the default configuration.

Warning:

The term-matching logic returns jobs whose title or description contains wanted terms, provided that either the title or the description is free of unwanted terms.

This means you might get jobs with unwanted terms in the title or in the description, because one of those elements contains wanted terms. You might also see jobs whose title or description lacks wanted terms, because only one of the two contains them.

This keeps the crawling process more open so you don't miss out on potentially interesting results. Nevertheless, every result should contain wanted terms in at least the title or the description.
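
The matching rule described in this warning can be sketched roughly as follows. This is an illustrative reading of the text, not the project's actual code: the function names keep_job and contains_any, the parameter names, and the case-insensitive substring matching are all assumptions.

```python
def contains_any(text, terms):
    """Return True if any term occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def keep_job(title, description,
             wanted_title=(), unwanted_title=(),
             wanted_description=(), unwanted_description=()):
    """Hypothetical filter mirroring the rule described in the warning.

    A job is kept when the title OR the description contains wanted terms,
    and the title OR the description is free of unwanted terms.
    """
    # With no wanted terms at all, nothing is required to match.
    no_wanted_terms = not wanted_title and not wanted_description
    has_wanted = (
        no_wanted_terms
        or contains_any(title, wanted_title)
        or contains_any(description, wanted_description)
    )
    # Only one of the two elements needs to be free of unwanted terms.
    is_clean = (
        not contains_any(title, unwanted_title)
        or not contains_any(description, unwanted_description)
    )
    return has_wanted and is_clean
```

Under this reading, a post titled "Senior Spanish Teacher" is still kept when "senior" is an unwanted title term, as long as its description contains no unwanted description terms. That is the permissive behaviour the warning describes.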

What do you get at the end of the process?

You will get a CSV file in the 'results' folder with:

  • Job titles
  • Job locations
  • Job salaries (if shown)
  • Company names
  • Job ratings (if shown)
  • Post recency
  • Job descriptions
  • URL to apply
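
Once a run has finished, the CSV can be inspected with Python's standard csv module. The sketch below is a minimal example that makes no assumption about the exact column headers the bot writes: it reads whatever header row the file has. The helper names load_results and jobs_with_value are hypothetical, and so are the file name and column name in the usage comment.

```python
import csv
from pathlib import Path


def load_results(csv_path):
    """Read a results CSV into a list of dicts keyed by the file's header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def jobs_with_value(rows, column):
    """Keep only the rows where the given column is present and non-empty."""
    return [row for row in rows if (row.get(column) or "").strip()]


# Example usage (file name and column name are assumptions, adjust to your output):
# rows = load_results("results/my_search.csv")
# with_salary = jobs_with_value(rows, "salary")
```

Filtering on a column such as the salary is useful because, as noted above, many posts do not show one.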

indeed-jobs-scraper's People

Contributors

jlgamez

indeed-jobs-scraper's Issues

Need to update click_on_job_and_add_description to account for the new iframe

    def click_on_job_and_add_description(job_card):
        # Open the job posting; its description now loads inside an iframe
        job_card.click()
        iframe = driver.find_element(By.XPATH, "//iframe[@id='vjs-container-iframe']")
        driver.switch_to.frame(iframe)
        # Wait for the description container before reading its text
        wait.until(EC.presence_of_element_located((By.ID, 'viewJobSSRRoot')))
        scraped_descriptions.append(driver.find_element(By.ID, 'viewJobSSRRoot').text)
        # Return to the main page so later lookups are not scoped to the iframe
        driver.switch_to.default_content()

How to deal with Captcha?

Hi!
I'm trying to use your bot to gather job descriptions from job postings. It worked for the first search, but it stopped during the process.
Since then, I have gotten a Captcha every single session. I can manually solve the first Captcha, but then the bot opens new instances of Chrome, which produce errors because each new window asks for a new Captcha to be solved.
How do you deal with them?

Edit:
The bot is being recognised after some pages. Probably the bot crawls faster than expected, and that's why Captchas are triggered (?)

Thank you!

Always times out on any query.

I'm testing to see if the tool works by running the test search. Once I get to "GETTING JOB DESCRIPTIONS..." the code always times out with this error message:

Traceback (most recent call last):
  File "indeed_crawler.py", line 98, in <module>
    scraped_descriptions = scrape.get_job_description(url, descriptions_list)
  File "/home/ashley/Documents/GitHub/indeed-jobs-scraper/indeed_jobs_crawler/info_scraper.py", line 143, in get_job_description
    wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'jobsearch-SerpJobCard')))
  File "/home/ashley/.local/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

I've tried making the wait longer, to 60 seconds, but that didn't help. I don't know what else I can modify to make this work.
