This project scrapes information from the website http://books.toscrape.com/
The scraped information is:
- URL of the page : product_page_url
- ID of the article : universal_product_code
- Title : title
- Price including tax : price_including_tax
- Price excluding tax : price_excluding_tax
- Quantity available : number_available
- Description of the product : product_description
- Category : category
- Evaluation of the product : review_rating
- URL of the picture : image_url
- Eraser : Offers to erase the data if it might be corrupted
- Verification : Checks whether the data already exists in the .csv file
- Edit the path where files are written
- Function that gives the price for a given UPC
- Update function, which does not erase data that already exists in the .csv
- Optimization : Asyncio
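As an illustration of how the fields listed above could map onto one row of a category .csv file, here is a minimal sketch using only the standard library. The column names come from the list above. The column order, the `FIELDNAMES` constant, the `write_rows` and `find_price` helpers, and the sample values are all assumptions made for this example and are not taken from ScrapBookin.py.

```python
import csv
import io

# Column names taken from the list above; the order is an assumption.
FIELDNAMES = [
    "product_page_url",
    "universal_product_code",
    "title",
    "price_including_tax",
    "price_excluding_tax",
    "number_available",
    "product_description",
    "category",
    "review_rating",
    "image_url",
]


def write_rows(handle, rows):
    """Write a header line followed by one line per book (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def find_price(handle, upc):
    """Return price_including_tax for the given UPC, or None if it is absent."""
    for row in csv.DictReader(handle):
        if row["universal_product_code"] == upc:
            return row["price_including_tax"]
    return None


# Made-up sample record, for illustration only.
sample = dict.fromkeys(FIELDNAMES, "")
sample.update(
    {
        "universal_product_code": "abc123",
        "title": "Example Book",
        "price_including_tax": "10.00",
    }
)

# An in-memory buffer stands in for a real .csv file.
buffer = io.StringIO()
write_rows(buffer, [sample])
buffer.seek(0)
print(find_price(buffer, "abc123"))
```

With a real file, the same helpers could be called on a handle opened with `open(path, newline="", encoding="utf-8")`.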
Before using the program, you must set up your environment.
First, clone this project onto your computer using Git Bash.
From the folder where you want the project, type : git clone https://github.com/Emericdefay/OCR_P2.1.git
Create a virtual environment at the root of the project. I personally use virtualenv.
From a terminal at the root, type : virtualenv env
To activate it, type : source env/scripts/activate
This program requires the following libraries: bs4, requests, lxml.
To install them, type : pip install -r requirements.txt
You are now able to use the scraper.
To do so, stay in your terminal.
Type : python -u ScrapBookin.py
The -u option disables output buffering, which is useful if you want to see the progress as it happens.
This program creates and organizes the data in two folders:
- /datas : Contains the .csv files. Each one represents one category.
- /pictures : Contains the .jpg files. They are organized in folders named after the categories.
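The folder layout described above could be expressed with pathlib roughly as follows. This is only a sketch: the `csv_path` and `picture_path` helpers, the `root` parameter, and the example file name are hypothetical, and the way ScrapBookin.py actually names picture files is not documented here.

```python
from pathlib import Path


def csv_path(root, category):
    """Path of the .csv file for one category, e.g. <root>/datas/Travel.csv."""
    return Path(root) / "datas" / f"{category}.csv"


def picture_path(root, category, filename):
    """Path of one picture, stored in a folder named after its category."""
    return Path(root) / "pictures" / category / filename


# Example with made-up names.
print(csv_path("output", "Travel").as_posix())
print(picture_path("output", "Travel", "example.jpg").as_posix())
```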
Warning :
Currently, if you run this program more than once, the .csv files will be corrupted.
Please cut and paste /datas and /pictures into another folder before scraping again.
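The manual cut and paste could also be scripted. The sketch below moves both folders into a timestamped backup folder before a new run. It is not part of ScrapBookin.py: the `archive_outputs` function, its parameters, and the backup folder naming are assumptions made for this example.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_outputs(root, backup_root):
    """Move datas/ and pictures/ from root into a timestamped backup folder.

    Returns the list of new locations. Folders that do not exist are skipped.
    """
    root = Path(root)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / stamp
    moved = []
    for name in ("datas", "pictures"):
        source = root / name
        if source.is_dir():
            destination.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(destination / name))
            moved.append(destination / name)
    return moved


# Example call (commented out so that nothing is moved by accident):
# archive_outputs(".", "backup")
```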