JavaWebCrawler

A multi-threaded Java web crawler project to extract data from government websites and analyze the overall trend on climate change and national security.

Yanglin Tao, Oct 29, 2023

Project Setup

macOS:

  1. Install chromedriver with Homebrew by running: brew install chromedriver
  2. If macOS reports that “chromedriver” can’t be opened because Apple cannot check it for malicious software, go to System Settings > Privacy & Security (Security & Privacy on older macOS versions), find the message saying chromedriver was blocked, and click Open Anyway to allow it to run.
  3. Run which chromedriver to find its path, then copy that path into SeleniumCrawler.
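
As an alternative to pasting the path by hand, the lookup in step 3 could be done in code. The following is a minimal sketch, not the project's actual SeleniumCrawler code: the class ChromeDriverLocator and its method findOnPath are hypothetical names introduced here. It searches the PATH for chromedriver, much as which does, and sets the webdriver.chrome.driver system property that Selenium's ChromeDriver reads.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper; not part of the project's SeleniumCrawler class.
class ChromeDriverLocator {

    // Searches each directory of a PATH-style string for an executable file
    // with the given name, similar to what `which` does.
    static Optional<Path> findOnPath(String pathVariable, String executableName) {
        if (pathVariable == null || pathVariable.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathVariable.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, executableName);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> driver = findOnPath(System.getenv("PATH"), "chromedriver");
        if (driver.isPresent()) {
            // System property read by Selenium's ChromeDriver to locate the binary.
            System.setProperty("webdriver.chrome.driver", driver.get().toString());
            System.out.println("Using chromedriver at " + driver.get());
        } else {
            System.out.println("chromedriver not found on PATH");
        }
    }
}
```

On Windows the executable name would be chromedriver.exe rather than chromedriver.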

WINDOWS:

  1. Make sure you are using the latest version of Google Chrome. If not, update it before installing chromedriver.
  2. Install chromedriver with Chocolatey by running the following command in PowerShell (run as administrator): choco install chromedriver
  3. Run chromedriver --version to verify that chromedriver has been installed successfully.

Database Setup

  1. Download and install PostgreSQL and the pgAdmin tool.
  2. Create a database named 'webCrawler_db' and use the password 'root'.
  3. Open the Query Tool in pgAdmin and run the queries in WebCrawlerDatabase.sql to initialize the database.
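
For reference, a connection to this database from Java might look like the sketch below. It is an illustration under assumptions, not the project's own configuration: the class DatabaseConfig is hypothetical, and the host (localhost), the port (5432, PostgreSQL's default), and the user name (postgres) are assumed, since the steps above specify only the database name and password. Running main also requires the PostgreSQL JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical configuration holder; names and defaults are assumptions.
class DatabaseConfig {
    static final String HOST = "localhost";      // assumption: local server
    static final int PORT = 5432;                // PostgreSQL's default port
    static final String DATABASE = "webCrawler_db";
    static final String USER = "postgres";       // assumption: default superuser
    static final String PASSWORD = "root";

    // Builds a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    // Opens a connection; fails if the JDBC driver is missing or the server is down.
    static Connection open() throws SQLException {
        return DriverManager.getConnection(jdbcUrl(HOST, PORT, DATABASE), USER, PASSWORD);
    }

    public static void main(String[] args) {
        try (Connection conn = open()) {
            System.out.println("Connected to " + conn.getCatalog());
        } catch (SQLException e) {
            System.out.println("Could not connect: " + e.getMessage());
        }
    }
}
```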

Troubleshooting

If you encounter an error like org.openqa.selenium.SessionNotCreatedException: session not created, it's likely that your ChromeDriver version is not compatible with your current Chrome browser version. In that case, run brew install chromedriver again and update the Chrome browser to the latest version.
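
A quick way to spot such a mismatch is to compare the major version numbers reported by Chrome and ChromeDriver, since the two generally need to match. The sketch below assumes the version strings contain a four-part dotted version; the class VersionCheck and the sample strings in main are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing Chrome and ChromeDriver versions.
class VersionCheck {
    // Matches a four-part version such as 119.0.6045.123 and captures the major part.
    private static final Pattern VERSION = Pattern.compile("(\\d+)\\.\\d+\\.\\d+\\.\\d+");

    // Returns the major version found in the text, or -1 if none is present.
    static int majorVersion(String versionOutput) {
        Matcher m = VERSION.matcher(versionOutput);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // True only when both strings contain a version and the major parts are equal.
    static boolean sameMajor(String chromeVersion, String driverVersion) {
        int chrome = majorVersion(chromeVersion);
        return chrome != -1 && chrome == majorVersion(driverVersion);
    }

    public static void main(String[] args) {
        // Sample strings for illustration only.
        String chrome = "Google Chrome 119.0.6045.123";
        String driver = "ChromeDriver 118.0.5993.70";
        System.out.println(sameMajor(chrome, driver) ? "compatible" : "version mismatch");
    }
}
```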

Guidelines on finding more websites to crawl

Finding URLs to parse

It's important to choose a suitable base URL to parse. Use URLs like https://www.gov.uk/search/news-and-communications, which list news titles together with metadata about their update dates.

Finding the metric elements

The crawler counts the number of articles whose titles or teasers contain the keywords. To find these elements, right-click the webpage and choose Inspect to open the browser's developer tools, then locate the HTML elements that hold the title and the metadata.
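
The counting step can be pictured with the following sketch. It is not the project's implementation: the class KeywordCounter is hypothetical, the matching rule (case-insensitive substring match, each article counted once) is an assumption, and the sample titles are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of counting articles that mention any keyword.
class KeywordCounter {

    // Counts titles containing at least one keyword, ignoring case.
    // A title that matches several keywords is still counted once.
    static long countMatching(List<String> titles, List<String> keywords) {
        return titles.stream()
                .map(title -> title.toLowerCase(Locale.ROOT))
                .filter(title -> keywords.stream()
                        .anyMatch(k -> title.contains(k.toLowerCase(Locale.ROOT))))
                .count();
    }

    public static void main(String[] args) {
        // Invented sample data.
        List<String> titles = Arrays.asList(
                "Climate change strategy published",
                "New rail timetable announced",
                "National security review highlights climate change risks");
        List<String> keywords = Arrays.asList("climate change", "national security");
        System.out.println(countMatching(titles, keywords)); // prints 2
    }
}
```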

Scope of the search

The number of threads used for each country varies with the number of pages to be parsed. When adding a new configuration to the Country table, we recommend 150 threads for larger websites and 50 threads for smaller ones. The number of pages to parse can be set through the interface, and it should be greater than the number of threads.
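
One plausible reason for keeping the page count above the thread count is that a thread with no page to fetch sits idle. The sketch below illustrates this with a fixed thread pool that receives one task per page. It is an assumption about how the work could be divided, not the project's scheduler: the class PageScheduler and the placeholder fetchPage are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of spreading pages across a fixed thread pool.
class PageScheduler {

    // With one task per page, at most min(pages, threads) threads have work at once.
    static int activeWorkers(int pageCount, int threadCount) {
        return Math.min(pageCount, threadCount);
    }

    // Placeholder for fetching and parsing one page; returns the page number.
    static int fetchPage(int pageNumber) {
        return pageNumber;
    }

    // Submits one task per page and returns how many pages completed.
    static int crawlPages(int pageCount, int threadCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int page = 1; page <= pageCount; page++) {
                final int pageNumber = page;
                results.add(pool.submit(() -> fetchPage(pageNumber)));
            }
            int completed = 0;
            for (Future<Integer> result : results) {
                result.get(); // wait for this page to finish
                completed++;
            }
            return completed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while crawling", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A page task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(crawlPages(200, 50));    // prints 200
        System.out.println(activeWorkers(30, 50));  // prints 30: 20 threads would be idle
    }
}
```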
