A Booknode scraper and analyser (my playground for trying things out).
    git clone https://github.com/MiguelLaura/shelf_control.git
    make deps
To scrape the top 1000:

    python -m shelf_control.scraping
To use the dashboard:

    python -m shelf_control.dashboard

Then open localhost:8050 in a web browser.
Make changes in README.template.md, then generate the updated README.md with:

    make readme
To check if there are any unused imports:

    make lint
To format the code using black:

    make format
Once the changes are done, to lint, format, and generate the README.md all at once:

    make
- Scraping data from Booknode
  - Top 1000 ✓
    - Time: 17min51s
    - Memory: 1.8M
  - Specific book (✓)
  - Editor
  - Person
  - Author
  - Top 1000 ✓
- Analysing the data
  - Build a dashboard (work in progress)
  - Use machine learning to determine the themes of each book summary (both unsupervised and supervised learning are possible)
  - Build a recommendation system
  - Output a graph of the books using the themes
- Use the progress bar from minet
- Build an interface to use the scraping commands
- Add tests where useful
- Check how to properly stop Dash
- Dataframe transformations
  - Move them into functions (more generally, simplify the code by making smaller steps)
  - Add the corresponding tests + CI
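The "graph of the books using the themes" item above could start from something as simple as linking any two books that share a theme. A minimal sketch (the `theme_graph` function and the sample data are hypothetical, not part of the project):

```python
from itertools import combinations


def theme_graph(books):
    """Build an undirected graph linking books that share at least one theme.

    `books` maps a book title to its set of themes; the result maps each
    pair of linked titles to the themes they have in common.
    """
    edges = {}
    for (title_a, themes_a), (title_b, themes_b) in combinations(books.items(), 2):
        shared = themes_a & themes_b
        if shared:
            edges[(title_a, title_b)] = shared
    return edges


books = {
    "Book A": {"fantasy", "magic"},
    "Book B": {"fantasy", "romance"},
    "Book C": {"science-fiction"},
}
edges = theme_graph(books)
print(edges)  # {('Book A', 'Book B'): {'fantasy'}}
```

The resulting edge dict can be fed to a graph library (e.g. networkx) for layout and rendering.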
Function to scrape the info for a specific book on Booknode.

Arguments
- url_book *str* - the URL of the book on Booknode.

Returns
- *dict* - book data
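The extraction behind this function can be sketched with the standard-library `HTMLParser`; the markup below is hypothetical and the real scraper's selectors for Booknode pages will differ:

```python
from html.parser import HTMLParser


class BookParser(HTMLParser):
    """Collect the text of the first <h1> tag as the book title.

    The page structure assumed here is illustrative only.
    """

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start capturing for the first <h1> encountered.
        if tag == "h1" and self.title is None:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = data.strip()


def scrape_book_title(html):
    """Parse one book page's HTML and return a partial book-data dict."""
    parser = BookParser()
    parser.feed(html)
    return {"title": parser.title}


sample = "<html><body><h1>Some Book</h1></body></html>"
print(scrape_book_title(sample))  # {'title': 'Some Book'}
```

A real implementation would fetch the page from `url_book` first and extract many more fields into the returned dict.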
Generator yielding the info for each book in the top 1000 most-liked books on Booknode.

Arguments
- page_nb *int*, optional - the page number to start from.

Yields
- *dict* - data for one book
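The paginated generator can be sketched as follows; `fetch_page` stands in for the real HTTP request so the loop can be exercised without hitting Booknode (both names are illustrative, not the project's actual API):

```python
def iter_top_books(page_nb=1, fetch_page=None):
    """Yield book dicts page by page, starting from `page_nb`.

    `fetch_page` returns the list of books on a page, or an empty list
    when there are no more pages; it is injected here so the generator
    can be tested without network access.
    """
    while True:
        books = fetch_page(page_nb)
        if not books:
            return  # no more pages: stop the generator
        yield from books
        page_nb += 1


# Fake two-page result set standing in for the real HTTP fetch.
pages = {1: [{"title": "A"}, {"title": "B"}], 2: [{"title": "C"}]}
titles = [book["title"] for book in iter_top_books(fetch_page=lambda n: pages.get(n, []))]
print(titles)  # ['A', 'B', 'C']
```

Injecting the fetch function also makes it easy to plug in caching or a progress bar later.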