sushil-rgb / amazonme

Introducing the AmazonMe web scraper: a powerful tool for extracting data from Amazon.com using the Requests and BeautifulSoup libraries in Python. This scraper allows users to easily navigate and extract information from Amazon's website.

License: GNU General Public License v3.0

Python 100.00%
amazon amazon-scraper python scraper web-automation beautifulsoup4 data-scraping discord-bot discord-py web-scraping

amazonme's Introduction

AmazonMe

Welcome to AmazonMe, a web scraper designed to extract information from the Amazon website and store it in a MongoDB database. This repository contains the code for the scraper, which utilizes the Requests and BeautifulSoup libraries to automate the scraping process. The scraper also leverages asyncio concurrency to efficiently extract thousands of data points from the website.

Install necessary requirements:

It's always good practice to create a virtual environment before installing the requirements:

python -m venv environmentname
environmentname\Scripts\activate

Install necessary requirements:

  pip install -r requirements.txt

Usage

  async def main():
      base_url = ""
      # Set to True if you want to route requests through a proxy:
      proxy = False
      if proxy:
          mongo_to_db = await export_to_mong(base_url, f"http://{rand_proxies()}")
      else:
          mongo_to_db = await export_to_mong(base_url, None)
      # sheet_name = "Dinner Plates"  # Use the name of the MongoDB collection you intend to export as the spreadsheet name.
      # sheets = await mongo_to_sheet(sheet_name)  # Uncomment this to export the collection to an Excel spreadsheet.
      return mongo_to_db
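
In main.py this coroutine is driven by asyncio; a minimal entry point looks like the sketch below (assuming export_to_mong, rand_proxies, and mongo_to_sheet are already imported from the project's modules):

  import asyncio

  if __name__ == '__main__':
      # Run the asynchronous scraper and collect the exported results.
      results = asyncio.run(main())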

To run the script, open a terminal and type:

  python main.py

Demo of the scraper scraping the content from Amazon

Discord bot

Features

Upon executing the program, the scraper extracts the following fields and stores the product information in a MongoDB database:

  • Product
  • Asin
  • Description
  • Breakdown
  • Price
  • Deal Price
  • You Saved
  • Rating
  • Rating count
  • Availability
  • Hyperlink
  • Image url
  • Image lists
  • Store
  • Store link

Supported domains:

  • ".com" (US)
  • ".co.uk" (UK)
  • ".com.mx" (Mexico)
  • ".com.br" (Brazil)
  • ".com.au" (Australia)
  • ".com.jp" (Japan)
  • ".com.be" (Belgium)
  • ".in" (India)
  • ".fr" (France)
  • ".se" (Sweden)
  • ".de" (Germany)
  • ".it" (Italy)

MongoDB Integration

Newly added to AmazonMe is the integration with MongoDB, allowing you to store the scraped data in a database for further analysis or usage. The scraper can now save the scraped data directly to a MongoDB database.

To enable MongoDB integration, you need to follow these steps:

  1. Make sure you have MongoDB installed and running on your machine or a remote server.
  2. Install the pymongo package by running the following command:

    pip install pymongo

  3. In the script or module where you handle the scraping and data extraction, import the pymongo package and connect to your MongoDB instance. With the MongoDB integration, you can easily query and retrieve the scraped data from the database, perform analytics, or use it for other purposes.
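
As a rough illustration (the database and collection names below are hypothetical, not taken from the project), querying the scraped data with pymongo might look like this:

  from pymongo import MongoClient

  # Connect to a local MongoDB instance (adjust the URI for a remote server).
  client = MongoClient("mongodb://localhost:27017")
  db = client["amazonme"]            # hypothetical database name
  collection = db["dinner_plates"]   # hypothetical collection name

  # Example query: products rated 4.0 or higher, assuming Rating is stored numerically.
  for product in collection.find({"Rating": {"$gte": 4.0}}):
      print(product.get("Product"), product.get("Price"))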

Note

Please note that the script is designed to work with Amazon and may not work with other types of websites. Additionally, the script may be blocked by the website if it detects excessive scraping activity, so please use this tool responsibly and in compliance with Amazon's terms of service.

If you have any issues or suggestions for improvements, please feel free to open an issue on the repository or submit a pull request.

License

This project is licensed under the GPL-3.0 license. This scraper is provided as-is and for educational purposes only. The author is not responsible for any damages or legal issues that may result from its use. Use it at your own risk. Thank you for using AmazonMe!

amazonme's People

Contributors

sushil-rgb


amazonme's Issues

Amazon.ae

Hi,

Is it possible to support Amazon.ae?

Thanks a lot!

Request new functions and clarification of doubts

Good evening! I came across this repository as I was interested in the topic of Amazon scraping. I haven't had the chance to try and test what your program is capable of doing yet, so I apologize if I ask obvious questions. I read a bit of the description of what this project can potentially do.

At this point, however, a doubt arises regarding the MongoDB implementation. From what I understand, it is essentially a database in which the data scraped from Amazon is stored. My question is: once the information has been extracted from Amazon, does subsequent scraping take place against the database, or does it continue to hit the official website? I would like to understand whether the IP address could be banned (even though you have implemented user-agent rotation).

Next, I wanted to ask whether you have ever considered implementing the Telegram API to build a bot through which scraped offers can be posted to a channel or in private. Maybe it's time-consuming and laborious to implement, but I just wanted to know if you've ever considered this as an idea. Thank you in advance, and I wish you a good evening!

Fetch offers in categories

Hi,
Is it possible to fetch offers by category?
For example, fetch the "Electronics" category offers of the day.

feature request

Hi,
The code works flawlessly.
Can you please modify it to get the results from https://www.amazon.com/gp/goldbox too?

Thanks.

Suggestion for Enhancing Project Functionality

Hello, I recently came across your project and found it to be quite impressive!

Upon analyzing packets from the Amazon iOS app today, I discovered that it utilizes an endpoint to extract valuable information such as:

  • Product availability
  • Product price
  • Prime availability
  • Non-discounted product price (if applicable)

What's particularly intriguing is that each request allows the submission of up to 100 ASINs, making it seemingly resistant to bans (fingers crossed). The endpoint for this functionality is: "https://www.amazon.it/gp/twister/dimension?isDimensionSlotsAjax=1&asinList=B0BG8F7PCX&vs=1"

While the app employs a few other parameters, I have yet to find any of them particularly interesting.

To include ASINs, utilize the "asinList" parameter and separate the ASINs with a comma, as demonstrated here: "https://www.amazon.it/gp/twister/dimension?isDimensionSlotsAjax=1&asinList=B0BG8F7PCX,B0CG7JG7N3&vs=1"

It's worth noting that the other parameters, apart from "asinList," are not optional, and any alterations to their values result in empty returns (I'm still trying to figure out why).

Although the provided endpoint is for Amazon.it, I believe it could potentially work for other Amazon countries as well.

Here is an example output:

{
    "ASIN": "B0BG8F7PCX",
    "Type": "JSON",
    "sortOfferInfo": "",
    "isPrimeEligible": "false",
    "Value": {
        "content": {
            "twisterSlotJson": {"price": "49.49"},
            "twisterSlotDiv": "<span id=\"_price\" class=\"a-color-secondary twister_swatch_price unified-price\"><span class=\"a-size-mini twisterSwatchPrice\"> 49,49 € </span></span>"
        }
    }
}
&&&
{
    "ASIN": "B0CG7JG7N3",
    "Type": "JSON",
    "sortOfferInfo": "",
    "isPrimeEligible": "false",
    "Value": {
        "content": {
            "twisterSlotJson": {"price": "69.99"},
            "twisterSlotDiv": "<span id=\"_price\" class=\"a-color-secondary twister_swatch_price unified-price\"><span class=\"a-size-mini twisterSwatchPrice\"> 69,99 € </span></span>"
        }
    }
}

If the non-discounted price is present, it will be embedded in the "content" HTML.
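
A rough sketch of how such a batch lookup could be done (the headers and parsing here are assumptions, not part of the original report; per the example output above, the response body is a series of JSON objects separated by "&&&"):

  import json
  import requests

  def fetch_twister_data(asins, domain="www.amazon.it"):
      # Query the twister/dimension endpoint described above for a batch of ASINs.
      url = (
          f"https://{domain}/gp/twister/dimension"
          f"?isDimensionSlotsAjax=1&asinList={','.join(asins)}&vs=1"
      )
      # A browser-like User-Agent is assumed to be required; adjust as needed.
      response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
      response.raise_for_status()
      # One JSON object per ASIN, separated by "&&&".
      return [json.loads(chunk) for chunk in response.text.split("&&&") if chunk.strip()]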

New Endpoint for Reviews

Issue Description:

While exploring ways to enhance the efficiency of retrieving Amazon product reviews, I stumbled upon a potential backup solution that involves parsing HTML responses. I believe this could serve as a valuable alternative in scenarios where the program encounters difficulties fetching reviews in the standard way.

URL:
https://www.amazon.it/gp/customer-reviews/widgets/average-customer-review/popover?contextId=dpx&asin=B0C4PTCPXQ

HTML Response Example:

<div class="a-fixed-left-grid"><div class="a-fixed-left-grid-inner" style="padding-left:300px"><div class="a-fixed-left-grid-col a-col-left" style="width:300px;margin-left:-300px;float:left;"><div class="a-icon-row a-spacing-small a-padding-none"><i data-hook="average-stars-rating-anywhere" class="a-icon a-icon-star a-star-5"><span class="a-icon-alt">5 su 5</span></i><span data-hook="acr-average-stars-rating-text" class="a-size-medium a-color-base a-text-beside-button a-text-bold">5 su 5</span></div><div class="a-row a-spacing-medium"><span data-hook="total-review-count" class="a-size-base a-color-secondary totalRatingCount">5 valutazioni globali</span></div>
<table id="histogramTable" class="a-normal a-align-center a-spacing-base">
    <tr data-reftag="" data-reviews-state-param="{&quot;filterByStar&quot;:&quot;five_star&quot;, &quot;pageNumber&quot;:&quot;1&quot;}" aria-label="100% di recensioni hanno 5 stelle" class="a-histogram-row a-align-center">
      <td class="aok-nowrap">
        <span class="a-size-base">
          <a aria-disabled="true" class="a-link-normal 5star" title="100% di recensioni hanno 5 stelle" href="/product-reviews/B0C4PTCPXQ/ref=acr_dpx_hist_5?ie=UTF8&amp;filterByStar=five_star&amp;reviewerType=all_reviews#reviews-filter-bar">
            5 stelle
          </a>
        </span>
        <span class="a-letter-space"></span>
      </td>
      <td class="a-span10">
        <a aria-disabled="true" aria-hidden="true" class="a-link-normal" title="100% di recensioni hanno 5 stelle" href="/product-reviews/B0C4PTCPXQ/ref=acr_dpx_hist_5?ie=UTF8&amp;filterByStar=five_star&amp;reviewerType=all_reviews#reviews-filter-bar">
          <div class="a-meter" role="progressbar" aria-valuenow="100%"><div class="a-meter-bar" style="width: 100%;"></div></div>
        </a>
      </td>
      <td class="a-text-right a-nowrap">
        <span class="a-letter-space"></span>
        <span class="a-size-base">
          <a aria-disabled="true" aria-hidden="true" class="a-link-normal" title="100% di recensioni hanno 5 stelle" href="/product-reviews/B0C4PTCPXQ/ref=acr_dpx_hist_5?ie=UTF8&amp;filterByStar=five_star&amp;reviewerType=all_reviews#reviews-filter-bar">
              100%
          </a>
        </span>
      </td>
    </tr>

Proposed Solution:

Considering the concise nature of the HTML file, parsing it should be a relatively quick and effective method for obtaining reviews. Implementing this as a backup solution can potentially improve the program's robustness.
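
A minimal sketch of such a parser (assuming BeautifulSoup and the data-hook attributes visible in the example response above):

  from bs4 import BeautifulSoup

  def parse_review_popover(html: str) -> dict:
      # Parse the popover HTML returned by the customer-reviews widget endpoint.
      soup = BeautifulSoup(html, "html.parser")
      rating = soup.select_one('[data-hook="acr-average-stars-rating-text"]')
      count = soup.select_one('[data-hook="total-review-count"]')
      return {
          "rating": rating.get_text(strip=True) if rating else None,
          "review_count": count.get_text(strip=True) if count else None,
      }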

Additional Notes:

It's worth noting that the HTML structure appears straightforward, making it suitable for parsing. This alternative method could prove particularly useful in situations where retrieving reviews through the standard HTML presents challenges.

I would appreciate your consideration of this suggestion and would be happy to provide further details or assistance as needed.

Another important note is that ASINs can't be concatenated here; you'll need to make a separate query for each ASIN.

Thank you.

Export to .CSV

I am trying to use the script, but Mongo is a bit too advanced for me. Is it possible to modify the main.py file to output the data to a .CSV file that I can upload to Google Sheets for ease of viewing? Would any other file need to be modified?
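
For reference, a minimal sketch of exporting a MongoDB collection to a .CSV file (the database and collection names are hypothetical, and this is not part of the project's code):

  import csv
  from pymongo import MongoClient

  client = MongoClient("mongodb://localhost:27017")
  collection = client["amazonme"]["dinner_plates"]   # hypothetical names

  docs = list(collection.find({}, {"_id": 0}))        # drop MongoDB's internal _id field
  if docs:
      # Assumes all documents share the same fields.
      with open("export.csv", "w", newline="", encoding="utf-8") as f:
          writer = csv.DictWriter(f, fieldnames=docs[0].keys())
          writer.writeheader()
          writer.writerows(docs)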

[FEATURE REQUEST] implement fake-useragent library

Description:
I would like to suggest implementing the use of the "fake-useragent" library (https://pypi.org/project/fake-useragent/) instead of the current approach, which involves managing a large .txt file of user agents and selecting one randomly.

Motivation:
The "fake-useragent" library provides a more streamlined and efficient solution for handling user agents. Instead of reading from a hefty 844KB .txt file, the library generates user agents dynamically, eliminating the need for file I/O operations and enhancing performance.

Benefits:

  1. Reduced File I/O Overhead: The current method involves opening and reading a large .txt file, which can be time-consuming. By using "fake-useragent," we can eliminate the need for file operations altogether, resulting in faster execution times.

  2. Dynamic User Agent Generation: "fake-useragent" generates user agents on-the-fly, ensuring a diverse set of user agents without the need for maintaining an extensive list in a text file.

  3. Ease of Integration: Implementing "fake-useragent" is straightforward and can potentially lead to cleaner and more maintainable code.
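
As a quick illustration (a sketch, not project code; fake-useragent exposes a UserAgent class whose .random property returns a freshly picked User-Agent string):

  from fake_useragent import UserAgent

  ua = UserAgent()

  # Pick a random User-Agent per request instead of reading one
  # from a large .txt file of pre-collected strings.
  headers = {"User-Agent": ua.random}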

References:

Keyword search rank

As the title says, is it possible to find the organic rank of a keyword? Basically, search for a keyword and find our own ASIN to see its position.

Bro, I got this error, please help

I got this error while trying to set up the scraper:

C:\Users\HP\Desktop\internship>python -u "c:\Users\HP\Desktop\internship\AmazonMe-master\AmazonMe-master\main.py"
Traceback (most recent call last):
File "c:\Users\HP\Desktop\internship\AmazonMe-master\AmazonMe-master\main.py", line 39, in
results = asyncio.run(main())
^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\asyncio\runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "c:\Users\HP\Desktop\internship\AmazonMe-master\AmazonMe-master\main.py", line 13, in main
status = await Amazon(base_url, None).status()
^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\HP\Desktop\internship\AmazonMe-master\AmazonMe-master\scrapers\scraper.py", line 58, in init
self.scrape = yaml_load('selector')
^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\HP\Desktop\internship\AmazonMe-master\AmazonMe-master\tools\tool.py", line 335, in yaml_load
with open(f"scrapers//{selectors}.yaml") as file:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'scrapers//selector.yaml'
