goodreadsscraper's Introduction

GOODREADSSCRAPER

A very simple scraper for Goodreads edition details.

Motivation

I wrote this scraper to help my sister with her bachelor's thesis, which requires her to analyze a lot of data obtained from the Goodreads website. Obtaining the data by hand is possible but very tedious (we are talking about reading a couple hundred webpages). Hence this little scraper was created.

Distribution

This program is only available from the GitHub releases.

Prerequisites

You need to have Java installed and available on the command line.

What it does

Goodreads has a page for every book where all the editions of that book are listed, together with some metadata (see this example). The scraper expects a list of such addresses as input. It visits each of the provided pages and reads out the following information:

Field           Description
Title           The title of the book as described in the title of the webpage
Type            The type of each edition according to user configuration
Language        The language of every edition on that page
Ratings         The number of ratings for each edition
Average rating  The average rating for each edition

For ways to run and configure the program, take a look at the Usage section.

Usage

To run the program, simply put the jar in a folder of your choice. Then open a command prompt (Start -> Run -> cmd on Windows) and navigate to the folder where you put the jar.

You can now run the program with java -jar goodreadsscraper.jar.

To function, the program needs two additional files:

input.txt

This file contains all the websites to look at. Simply put one URL per line into this file.

https://goodreads.com/book1
https://goodreads.com/book2
...
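As a rough illustration of the one-URL-per-line format, the following sketch reads such a file in Java. The class name InputUrls and its parse method are hypothetical and not taken from the scraper's source. Trimming whitespace and skipping blank lines are assumptions; this README does not say how the program handles them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the scraper's actual code.
class InputUrls {

    // Returns one URL per non-empty line.
    // Assumption: surrounding whitespace is ignored and blank lines are skipped.
    static List<String> parse(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Reads input.txt from the current working directory and prints each URL.
        List<String> lines = Files.readAllLines(Paths.get("input.txt"));
        for (String url : parse(lines)) {
            System.out.println(url);
        }
    }
}
```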

types.txt

This file contains the filter strings for the edition types you are looking for. The number of different types is practically endless, the types vary between books, and depending on your use case you might not be interested in some of them. In this file you define keywords (again, one per line). The program searches for these keywords on the page. If a keyword is found, the line is split at , and the part that contains the keyword is used for the output.

# assuming you have Paperback in your types.txt
Paperback, 343 pages -> Paperback
Unedited Paperback, 200 pages -> Unedited Paperback
Original, Paperback, 1 page -> Paperback
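The matching rule described above can be sketched in a few lines of Java. This is a minimal sketch, not the scraper's actual implementation: the class name TypeFilter and the method extractType are hypothetical, and it assumes case-sensitive matching and that surrounding whitespace is trimmed from each part (which the example outputs suggest).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not the scraper's actual code.
class TypeFilter {

    // Splits the line at "," and returns the first part that contains any keyword.
    // Assumptions: matching is case-sensitive and each part is trimmed.
    static Optional<String> extractType(String line, List<String> keywords) {
        for (String part : line.split(",")) {
            String candidate = part.trim();
            for (String keyword : keywords) {
                if (candidate.contains(keyword)) {
                    return Optional.of(candidate);
                }
            }
        }
        // No keyword from types.txt occurs in this line.
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("Paperback");
        String[] lines = {
            "Paperback, 343 pages",
            "Unedited Paperback, 200 pages",
            "Original, Paperback, 1 page"
        };
        for (String line : lines) {
            System.out.println(line + " -> " + extractType(line, keywords).orElse("(no match)"));
        }
    }
}
```

Returning an Optional keeps the "no keyword found" case explicit. What the real program writes in that case is not stated here.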

output

The output is written to output.csv in the same folder. This CSV file contains the following columns (currently without a header row):

Title, Type, Ratings, Average rating, Language
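To make the column order concrete, here is a small Java sketch that formats one row in that order. It is an illustration under assumptions, not the program's real writer: the class CsvRow is hypothetical, the comma delimiter and the quoting rules (RFC 4180 style) are assumptions, and the values in main are made up.

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not the scraper's actual code.
class CsvRow {

    // Quotes a field if it contains a comma, a double quote, or a line break.
    // Assumption: RFC 4180 style quoting with doubled quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Builds one line in the column order: title, type, ratings, average rating, language.
    static String toLine(String title, String type, int ratings, double avgRating, String language) {
        return Stream.of(title, type, String.valueOf(ratings), String.valueOf(avgRating), language)
                .map(CsvRow::escape)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Made-up example values, not real Goodreads data.
        System.out.println(toLine("Example Book", "Paperback", 1200, 4.25, "English"));
    }
}
```

Quoting matters because book titles often contain commas, which would otherwise shift the columns.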

Disclaimer

This program was hacked together in a couple of hours to help my sister. No real care has gone into code quality, extensibility, or maintainability. I will add some tests for my own good night's rest, but not more. So... you have been warned.

(Source: xkcd)

goodreadsscraper's People

Contributors

jonasjurczok

goodreadsscraper's Issues

Pagination support

Sometimes a book has so many editions that they are spread across multiple pages.
It is tedious to add all of these URLs to input.txt by hand.

Therefore, the scraper should support pages and page sizes.
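One possible shape for this feature, sketched in Java: expand a single base URL into one URL per page. Everything here is hypothetical. The class PageUrls does not exist in the scraper, and the query parameter names page and per_page are assumptions that would need to be checked against the actual Goodreads URLs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper sketching the requested feature; not part of the scraper.
class PageUrls {

    // Builds one URL per page, numbered from 1 to pageCount.
    // Assumption: pages are selected via "page" and "per_page" query parameters.
    static List<String> expand(String baseUrl, int pageCount, int pageSize) {
        List<String> urls = new ArrayList<>();
        // Append to an existing query string if the base URL already has one.
        String separator = baseUrl.contains("?") ? "&" : "?";
        for (int page = 1; page <= pageCount; page++) {
            urls.add(baseUrl + separator + "page=" + page + "&per_page=" + pageSize);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : expand("https://goodreads.com/book1", 3, 100)) {
            System.out.println(url);
        }
    }
}
```

With something like this, input.txt would only need one line per book, plus the page count and page size as configuration.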
