
mm86133 / scraping-amazon-for-mobile-details-with-scrapy


This project forked from chinmayrane16/scraping-amazon-for-mobile-details-with-scrapy


Scraping Amazon website using Proxies for extracting Mobile details


Scraping-Amazon-for-Mobile-details-with-Scrapy

In this repository I have built an Amazon scraper that extracts mobile phone details and pricing using Scrapy, a Python web-scraping framework. I used the PyCharm IDE and extracted the following details from 5 pages of the Amazon site.

  1. Mobile Name
  2. Mobile Reviews
  3. Mobile Prices
  4. Mobile Image Links

After extracting the information, I saved it to a JSON file.

Requirements

Implementation

After installing the PyCharm IDE, create a new project: File -> New Project.
Install the requirements: File -> Settings -> Project -> Project Interpreter -> (click the + symbol) -> (package name) -> Install.

After installing the packages, open the terminal at the bottom left corner and type scrapy startproject projectname. The following files are added to your project (the tree below uses amazonscrap as the project name).

.
--- amazonscrap
|   --- __init__.py
|   --- items.py
|   --- middlewares.py
|   --- pipelines.py
|   --- settings.py
|   --- spiders
|     --- __init__.py
--- scrapy.cfg

Go to items.py and add the fields you want to extract.
Go to your spider file (projectspider.py) and extract the details from the webpage: inspect the page, locate the CSS class of each element you need, and use it in a selector in your code.
Pass the variables containing the details to the item, which behaves like a dictionary.

Once you are done with the coding part, you are ready to run the script. To keep Amazon from blocking you, you could use the following tricks to get past its security measures.

  1. GoogleBot - Confuse the site by setting your user agent to that of Google's crawler. Amazon allows Google to crawl its website. Add this to your settings.py -> USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

  2. Rotating User-Agents and Spoofing - Spoof the user agent by creating a list of user agents and picking a random one for each request. Websites do not want to block genuine users, so you should try to look like one. With the scrapy-user-agents package installed, add this to your settings.py -> DOWNLOADER_MIDDLEWARES = { 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400, }

  3. Rotating IPs and Proxy Services - Send requests from different IP addresses so that detection becomes harder. Create a pool of IPs and pick a random one for each request. VPNs or shared proxies can serve this purpose.
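Tricks 2 and 3 can also be done by hand with a small downloader middleware. The sketch below is an assumption-laden illustration, not the code used in this repository: the class name, the user-agent strings, and the proxy addresses (taken from a reserved documentation range) are all hypothetical. It relies on two Scrapy conventions: a downloader middleware is a plain class with a `process_request(request, spider)` method, and the proxy for a request is read from `request.meta["proxy"]`.

```python
import random

# Hypothetical pools; replace them with values you are allowed to use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]


class RandomUserAgentAndProxyMiddleware:
    """Pick a random User-Agent and a random proxy for every request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Overwrite the User-Agent header for this request only.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Scrapy routes the request through the proxy named in meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request.
        return None


# To enable it, settings.py would need an entry along these lines
# (the module path is an assumption based on the amazonscrap project tree):
# DOWNLOADER_MIDDLEWARES = {
#     "amazonscrap.middlewares.RandomUserAgentAndProxyMiddleware": 400,
# }
```

Choosing per request rather than once per crawl means consecutive requests rarely share the same user agent and IP pair, which is the point of both tricks.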

Finally, run the project from the terminal with the command scrapy crawl spidername; the responses are printed to the terminal. To write the results to a JSON file, run scrapy crawl spidername -o items.json. This creates a file named "items.json".
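Once the crawl has finished, the export can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and it assumes the file holds a JSON array with one object per scraped item, which is what a single run with -o items.json produces.

```python
import json
from pathlib import Path


def load_items(path="items.json"):
    """Load the JSON array of items written by the crawl."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_missing(items, field):
    """Count items where the given field is absent or empty."""
    return sum(1 for item in items if not item.get(field))
```

For example, `count_missing(load_items(), "price")` reports how many items came back without a price, assuming the item has a field named price. A high count usually means the selector for that field no longer matches the page.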

Contributors

chinmayrane16
