ScrapBookin

Project 2 - OCR

This project scrapes information from the website http://books.toscrape.com/
The following information is scraped:

  • Url of the page : product_page_url
  • ID of the article : universal_product_code
  • Title : title
  • Price including tax : price_including_tax
  • Price excluding tax : price_excluding_tax
  • Quantity available : number_available
  • Description of the product : product_description
  • Category : category
  • Evaluation of the product : review_rating
  • Url of the picture : image_url
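
The field names above could serve directly as CSV column headers. The following is a minimal sketch, assuming the standard csv module is used for output (this README does not say how the files are actually written). The function write_rows and the sample record are hypothetical and only illustrate the shape of one row.

```python
import csv
import io

# Column names taken from the field list above.
FIELDNAMES = [
    "product_page_url",
    "universal_product_code",
    "title",
    "price_including_tax",
    "price_excluding_tax",
    "number_available",
    "product_description",
    "category",
    "review_rating",
    "image_url",
]


def write_rows(rows, handle):
    """Write a header line followed by one CSV line per scraped book."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical record, for illustration only (not real scraped data).
example_row = {name: "" for name in FIELDNAMES}
example_row.update(
    {
        "universal_product_code": "example-upc",
        "title": "Example title",
        "price_including_tax": "10.00",
        "category": "Example category",
    }
)

# Write to an in-memory buffer so the sketch touches no real file.
buffer = io.StringIO()
write_rows([example_row], buffer)
csv_text = buffer.getvalue()
```

Using DictWriter keeps the column order fixed even if a record is built in a different order.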

Potential additional features :

  • Eraser : offer to erase the data if it might be corrupted
  • Verification : check whether the data already exists in the .csv file
  • Edit the path where files are written
  • Function that gives the price for a given UPC
  • Update function that does not erase data already present in the .csv
  • Optimization : asyncio
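
Two of these ideas, the price lookup by UPC and the non-destructive update, could look roughly like the sketch below. None of this exists in the project yet: price_for_upc and append_new_rows are hypothetical names, and the sketch assumes the .csv files have a header row using the field names listed earlier.

```python
import csv
import tempfile
from pathlib import Path


def price_for_upc(csv_path, upc):
    """Return price_including_tax for the given UPC, or None if absent."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["universal_product_code"] == upc:
                return row["price_including_tax"]
    return None


def append_new_rows(csv_path, rows, fieldnames):
    """Append only rows whose UPC is not in the file yet; return how many."""
    path = Path(csv_path)
    known = set()
    if path.exists():
        with open(path, newline="", encoding="utf-8") as handle:
            known = {r["universal_product_code"] for r in csv.DictReader(handle)}
    fresh = []
    for row in rows:
        upc = row["universal_product_code"]
        if upc not in known:
            known.add(upc)
            fresh.append(row)
    # Only write the header when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(fresh)
    return len(fresh)


# Hypothetical data in a temporary folder, for illustration only.
fields = ["universal_product_code", "title", "price_including_tax"]
book_a = {"universal_product_code": "upc-1", "title": "A", "price_including_tax": "10.00"}
book_b = {"universal_product_code": "upc-2", "title": "B", "price_including_tax": "12.50"}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example_category.csv"
    first_run = append_new_rows(target, [book_a], fields)
    second_run = append_new_rows(target, [book_a, book_b], fields)
    found = price_for_upc(target, "upc-2")
    missing = price_for_upc(target, "upc-3")
```

The second call adds only the book that was not already present, so running the update twice does not duplicate rows.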

Install

Before using the program, you must set up your environment.

1. Clone

Clone this project to your computer, for example with Git Bash.
Type : git clone https://github.com/Emericdefay/OCR_P2.1.git from the folder where you want the project.

2. Virtual Environment

Create and activate a virtual environment at the root of the project. I personally use virtualenv.
Type : virtualenv env at the root, from a terminal.
To activate it, type : source env/scripts/activate (Git Bash on Windows; on Linux or macOS the script is at env/bin/activate).

3. Libraries

This program requires the following libraries: bs4, requests, lxml.
Type : pip install -r requirements.txt.

Usage

You can now use the scraper.
Stay in your terminal.
Type : python -u ScrapBookin.py.
The -u option makes Python write its output unbuffered, which is useful if you want to see the progress as it happens.
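
For context, a similar effect can be obtained from inside a script by flushing each progress message explicitly. This is only a sketch of that idea, not how ScrapBookin.py is known to print. The function report_progress, its message format, and the numbers are hypothetical.

```python
import io
import sys


def report_progress(done, total, stream=sys.stdout):
    """Print one progress line and flush it immediately."""
    print(f"{done}/{total} pages scraped", file=stream, flush=True)


# Capture the message in memory to show what would be printed.
buffer = io.StringIO()
report_progress(3, 50, stream=buffer)
progress_line = buffer.getvalue()
```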

Good to know

This program creates and organizes the data in two folders:

  • /datas : Contains the .csv files. Each one represents one category.
  • /pictures : Contains the .jpg files. They are organized in folders named after the categories.
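
The layout described above can be expressed with pathlib. This sketch assumes each .csv file is named after its category, which the README implies but does not state. The helper names are hypothetical, and the demonstration runs in a temporary folder.

```python
import tempfile
from pathlib import Path


def category_csv_path(root, category):
    """Path of the .csv file for one category, under /datas."""
    return Path(root) / "datas" / f"{category}.csv"


def category_picture_dir(root, category):
    """Folder holding the .jpg files of one category, under /pictures."""
    return Path(root) / "pictures" / category


def prepare_output(root, category):
    """Create the folders if needed and return both locations."""
    csv_file = category_csv_path(root, category)
    picture_dir = category_picture_dir(root, category)
    csv_file.parent.mkdir(parents=True, exist_ok=True)
    picture_dir.mkdir(parents=True, exist_ok=True)
    return csv_file, picture_dir


with tempfile.TemporaryDirectory() as tmp:
    csv_file, picture_dir = prepare_output(tmp, "example_category")
    layout = (
        csv_file.relative_to(tmp).as_posix(),
        picture_dir.relative_to(tmp).as_posix(),
    )
    picture_dir_created = picture_dir.is_dir()
```

Passing the root as a parameter would also make it easy to change where files are written.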

Warning :
Currently, if you run this program more than once, the .csv files will be corrupted.
Please cut and paste /datas and /pictures into another folder before scraping again.
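
The manual cut and paste could be automated before each run. The sketch below is one possible approach and is not part of the project: archive_previous_run and the archive_ folder naming are hypothetical, and the demonstration uses a temporary folder rather than real output.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def archive_previous_run(root, folders=("datas", "pictures")):
    """Move existing output folders into a timestamped archive folder.

    Returns the archive path, or None when there was nothing to move.
    """
    root = Path(root)
    existing = [root / name for name in folders if (root / name).exists()]
    if not existing:
        return None
    archive = root / ("archive_" + datetime.now().strftime("%Y%m%d_%H%M%S"))
    archive.mkdir()
    for folder in existing:
        shutil.move(str(folder), str(archive / folder.name))
    return archive


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "datas").mkdir()
    (root / "datas" / "example.csv").write_text("title\n", encoding="utf-8")
    archive = archive_previous_run(root)
    moved_ok = (archive / "datas" / "example.csv").exists()
    originals_gone = not (root / "datas").exists()
    nothing_left = archive_previous_run(root) is None
```

Calling such a helper at the start of a run would leave the scraper with empty output folders, which avoids the corruption described above.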
