scraper's Introduction

Intro

With this program you can easily scrape and track the prices of products on multiple websites.
This program can also visualize the price of a tracked product over time, which can be helpful if you want to buy a product in the future and want to know whether a discount might be around the corner.

Requires Python 3.10+


Contributing

Feel free to fork the project and create a pull request with new features or refactoring of the code. Also feel free to open issues with problems or suggestions for new features.


UPDATE TO HOW DATA IS STORED IN V1.1

In version v1.1, I have changed how data is stored in records.json: the dates key under each product has been renamed to datapoints and is now a list of dictionaries with date and price keys.
If you want to update your data to be compatible with version v1.1, open an interactive Python session where this repository is located and run the following commands:

>>> from scraper.format_to_new import Format
>>> Format.format_old_records_to_new()
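
For reference, this is roughly what the change looks like for a single product (a minimal sketch; the exact pre-v1.1 layout may differ slightly):

Before v1.1:
    "dates": { "2022-09-12": { "price": 999.0 } }

From v1.1:
    "datapoints": [ { "date": "2022-09-12", "price": 999.0 } ]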

UPDATE TO PRODUCTS.CSV IN V2.3.0

In version v2.3.0, I have added the column short_url to products.csv. If you added products before v2.3.0, run the following commands in an interactive Python session to add the new column:

>>> from scraper.format_to_new import Format
>>> Format.add_short_urls_to_products_csv()
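
For reference, a minimal sketch of the resulting products.csv layout (the exact column order is an assumption):

    category,url,short_url
    <category>,<url>,<short_url>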

UPDATE TO HOW DATA IS STORED IN V3.0.0

In version v3.0.0, I have changed where data is stored: from a JSON file to a SQLite database. If you have data from before v3.0.0, run the following commands in an interactive Python session to add the data from records.json to the database (OBS: Pandas is required):

>>> from scraper.format_to_new import Format
>>> Format.from_json_to_db()

NOTE: This will replace the content of the database with what is in records.json. That means any products and/or datapoints that are in the database but not in records.json will be deleted.


OBS: If you don't have Pandas installed, run this command:

pip3 install pandas


Installation

Requires Python 3.10+

Clone this repository and move into the repository:

git clone https://github.com/Crinibus/scraper.git
cd scraper

Then make sure you have the required modules installed by running this in the terminal:

pip3 install -r requirements.txt

Add products

To add a single product, use the following command, where you replace <category> and <url> with your category and url:

python3 main.py -a -c <category> -u <url>

e.g.

python3 main.py -a -c vr -u https://www.komplett.dk/product/1168594/gaming/spiludstyr/vr/vr-briller/oculus-quest-2-vr-briller

This adds the category (if new) and the product to the records.json file, and appends a line to the products.csv file so the script can scrape the price of the new product.


To add multiple products at once, just specify another category and url with -c <category> and -u <url>. E.g. with the following command I add two products:

python3 main.py -a -c <category> -u <url> -c <category2> -u <url2>

This is equivalent to the above:

python3 main.py -a -c <category> <category2> -u <url> <url2>

OBS: The url must have a scheme like https:// or http://.
OBS: If an error occurs when adding a product, it might be because the url contains a &. In that case, just put quotation marks around the url. This should solve the problem. If it doesn't, then submit an issue.


Websites to scrape from

This scraper can (so far) scrape prices on products from:

*OBS: these Amazon domains should work: .com, .ca, .es, .fr, .de and .it.
The listed Amazon domains are from my quick testing with one or two products from each domain.
If you find that other Amazon domains work, or that some of the listed ones don't, please create an issue.


Scrape products

To scrape prices of products, run this in the terminal:

python3 main.py -s

To scrape with threads, run the same command but with the --threads argument:

python3 main.py -s --threads

Activating and deactivating products

When you add a new product, it is activated for scraping. If you no longer wish to scrape a product, you can deactivate it with the following command:

python3 main.py --deactivate --id <id>

You can activate a product again with the following command:

python3 main.py --activate --id <id>

Delete data

If you want to start from scratch with no data in the records.json and products.csv files, run the following command:

python3 main.py --delete --all

You can also just delete some products or some categories:

python3 main.py --delete --id <id>
python3 main.py --delete --name <name>
python3 main.py --delete --category <category>

Then just add products as described here.


If you just want to delete all datapoints for every product, then run this command:

python3 main.py --reset --all

You can also just delete datapoints for some products:

python3 main.py --reset --id <id>
python3 main.py --reset --name <name>
python3 main.py --reset --category <category>

User settings

User settings can be added and changed in the file settings.ini.

ChangeName

Under the section ChangeName you can change how the script changes product names, so similar products will be placed under the same product in the records.json file.

When adding a new setting under the ChangeName section in settings.ini, there must be a line with key<n> and a line with value<n>, where <n> is the "link" between the keywords and the value. E.g. value3 is the value for key3.

In key<n> you set the keywords (separated by commas) that a product name must contain for it to be changed to what value<n> is equal to. For example, if the user settings are the following:

[ChangeName]
key1 = asus,3080,rog,strix,oc
value1 = asus geforce rtx 3080 rog strix oc

Then, if a product name has all of the words in key1, it gets changed to what value1 is.
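
You can add more rules by incrementing <n>. A minimal sketch with two rules (the second rule is only an illustration):

[ChangeName]
key1 = asus,3080,rog,strix,oc
value1 = asus geforce rtx 3080 rog strix oc
key2 = logitech,z533
value2 = logitech z533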

Scraping

You can change the delay between each url request by changing the field request_delay in the file scraper/settings.ini under the Scraping section.

The default is 0 seconds, but to avoid the websites you scrape from thinking you are DDoS-attacking them, or temporarily restricting you from scraping their websites, set request_delay in settings.ini to a higher number of seconds, e.g. 5 seconds.
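
For example, a 5 second delay would look like this in settings.ini:

[Scraping]
request_delay = 5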


Clean up data

If you want to clean up your data, meaning you want to remove unnecessary datapoints (datapoints that have the same price as the datapoints immediately before and after them), run the following command:

python3 main.py --clean-data
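
For illustration, the cleaning rule described above corresponds to something like this minimal Python sketch (an assumption for clarity, not the project's actual implementation):

def clean_datapoints(datapoints: list[dict]) -> list[dict]:
    # Keep a datapoint only if its price differs from the datapoint before or after it
    cleaned = []
    for i, point in enumerate(datapoints):
        prev_price = datapoints[i - 1]["price"] if i > 0 else None
        next_price = datapoints[i + 1]["price"] if i < len(datapoints) - 1 else None
        if point["price"] != prev_price or point["price"] != next_price:
            cleaned.append(point)
    return cleaned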

Search products and categories

You can search the product names and categories you have in your records.json by using the argument --search [<word> ...]. The search is a keyword search, so e.g. if you enter --search logitech, all product names and categories that contain the word "logitech" are found.

You can search with multiple keywords; just separate them with a space: --search logitech corsair. Here all product names and categories that contain the word "logitech" or "corsair" are found.


View the latest datapoint of product(s)

If you want to view the latest datapoint of a product, you can use the argument --latest-datapoint with --id and/or --name.

Example:

python3 main.py --name "logitech z533" --latest-datapoint

The above command will show the latest datapoint from every website the specified product, in this case "logitech z533", has been scraped from, and will show something like this:

LOGITECH Z533
> Komplett - 849816
  - DKK 999.0
  - 2022-09-12
> Proshop - 2511000
  - DKK 669.0
  - 2022-09-12
> Avxperten - 25630
  - DKK 699.0
  - 2022-09-12

View all products

To view all the products you have scraped, you can use the argument --list-products.

Example:

python3 main.py --list-products

This will list all the products in the following format:

CATEGORY
  > PRODUCT NAME
    - WEBSITE NAME - PRODUCT ID
    - ✓ WEBSITE NAME - PRODUCT ID

The check mark (✓) shows that the product is activated.


Visualize data

To visualize your data, just run main.py with the -v or --visualize argument and specify which products you want visualized. These are your options:

  • --all to visualize all your products
  • -c [<category> [<category> ...]] or --category [<category> [<category> ...]] to visualize all products in one or more categories
  • --id [<id> [<id> ...]] to visualize one or more products with the specified id(s)
  • -n [<name> [<name> ...]] or --name [<name> [<name> ...]] to visualize one or more products with the specified name(s)
  • --compare to compare two or more products with the specified id(s), name(s) and/or category(s), or all products, on one graph. Use with --id, --name, --category and/or --all

Example graph

Command examples

Show graphs for all products

To show graphs for all products, run the following command:

python3 main.py -v --all

Show graph(s) for specific products

To show a graph for only one product, run the following command where <id> is the id of the product you want a graph for:

python3 main.py -v --id <id>

For multiple products, just add another id, like so:

python3 main.py -v --id <id> <id>

Show graphs for products in one or more categories

To show graphs for all products in one category, run the following command where <category> is the category you want graphs for:

python3 main.py -v -c <category>

For multiple categories, just add another category, like so:

python3 main.py -v -c <category> <category>

Show graphs for products with a specific name

To show graphs for product(s) with a specific name, run the following command where <name> is the name of the product(s) you want graphs for:

python3 main.py -v --name <name>

For multiple products with different names, just add another name, like so:

python3 main.py -v --name <name> <name2>

If the name of a product has multiple words in it, then just add quotation marks around the name.


Only show graph for products that are up to date

To only show graphs for the products that are up to date, use the flag --up-to-date or -utd, like so:

python3 main.py -v --all -utd

The -utd flag is only implemented when visualizing all products, as in the example above, or when visualizing all products in a category:

python3 main.py -v -c <category> -utd

Compare two products

To compare two or more products on one graph, use the flag --compare together with --id, --name, --category and/or --all, like so:

python3 main.py -v --compare --id <id>
python3 main.py -v --compare --name <name>
python3 main.py -v --compare --category <category>
python3 main.py -v --compare --id <id> --name <name> --category <category>
python3 main.py -v --compare --all

OBS: when using --name or --category, multiple products can be visualized.


scraper's Issues

Show highest and lowest price when viewing graph

Tech scraper
visualize_data.py:

  • When viewing a graph, show the highest and lowest price for each domain somewhere on the graph, maybe under the title.
  • Maybe also show the median or related "stats".
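
For illustration, the requested stats could be computed from a product's datapoints roughly like this (a hypothetical sketch with made-up values):

import statistics

datapoints = [
    {"date": "2022-09-10", "price": 999.0},
    {"date": "2022-09-11", "price": 949.0},
    {"date": "2022-09-12", "price": 999.0},
]
prices = [point["price"] for point in datapoints]
print(f"highest: {max(prices)}, lowest: {min(prices)}, median: {statistics.median(prices)}")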

Error with adding Elgiganten links after update and new feature request

Thank you so much for the new update! But now I'm having a problem with adding Elgiganten links from .dk and .se, so I just commented out the .dk branch in domains.py and it works for me now :)

    def _get_json_api_data(self) -> dict:
        id_number = self._get_product_id()

        # API link to get price and currency
        # if "elgiganten.dk" in self.url:
        #     api_link = f"https://www.elgiganten.dk/cxorchestrator/dk/api?appMode=b2c&user=anonymous&operationName=getProductWithDynamicDetails&variables=%7B%22articleNumber%22%3A%22{id_number}%22%2C%22withCustomerSpecificPrices%22%3Afalse%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%229bfbc062032a2a6b924883b81508af5c77bbfc5f66cc41c7ffd7d519885ac5e4%22%7D%7D"  # noqa E501
        # elif "elgiganten.se" in self.url:
        api_link = f"https://www.elgiganten.se/cxorchestrator/se/api?getProductWithDynamicDetails&appMode=b2c&user=anonymous&operationName=getProductWithDynamicDetails&variables=%7B%22articleNumber%22%3A%22{id_number}%22%2C%22withCustomerSpecificPrices%22%3Afalse%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%22229bbb14ee6f93449967eb326f5bfb87619a37e7ee6c4555b94496313c139ee1%22%7D%7D"  # noqa E501
        # else:
        #     raise WebsiteVersionNotSupported(get_website_name(self.url, keep_tld=True))
        response = request_url(api_link)
        return response.json()

I also changed short_url to use .se instead of .dk:

    def get_short_url(self) -> str:
        id = self._get_product_id()
        return f"https://www.elgiganten.se/product/{id}"

One more question: is it possible to fetch the outlet prices from e.g. https://www.elgiganten.dk/product/mobil-tablet-smartwatch/mobiltelefon/google-pixel-7-pro-smartphone-12128-gb-obsidian/525075 or add a parameter to the database from e.g. https://www.elgiganten.dk/product/outlet/google-pixel-7-pro-smartphone-12128-gb-obsidian/582585 to see when it is back in the outlet stock? Maybe hard to code it.. :)

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

When I execute the code:
python main.py -a -c Magnifiers -u "https://www.amazon.com/Eye-Candy-Magnifier-Anti-Glare-Brightness/dp/B0BHM5YSR6"
the output is:
Product info is not valid - see logs for more info
And the info in the logfile is:

2024-02-03 23:09:32,680 : INFO : scraper.add_product : Adding product with category 'Magnifiers' and url 'https://www.amazon.com/Eye-Candy-Magnifier-Anti-Glare-Brightness/dp/B0BHM5YSR6'
2024-02-03 23:09:36,323 : ERROR : scraper.domains : Could not get all the data needed from url: https://www.amazon.com/Eye-Candy-Magnifier-Anti-Glare-Brightness/dp/B0BHM5YSR6
Traceback (most recent call last):
  File "D:\github\scraper\scraper\domains.py", line 38, in get_product_info
    currency = self._get_product_currency()
  File "D:\github\scraper\scraper\domains.py", line 284, in _get_product_currency
    json.loads(parsed_url.replace("/af/sp-detail/feedback-form?pl=", ""))
  File "C:\Program Files\Python310\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Program Files\Python310\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Program Files\Python310\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I pulled the code today; I don't know how to resolve this problem.

Add argument to delete or reset specific (or all) products or categories

Add arguments like --delete, where you then specify the category or product to delete with --category, --name or --id.
Make it so you can do something similar with --reset to reset specific products or categories, or all products.

Could also replace the argument --hard-reset with --delete --all.

Anti Bot and Captcha Support

Thank you for posting your scraper. I would like to write to you in more detail, but I am unable to message you through GitHub. When scraping I almost always encounter CAPTCHAs, especially if I am running a scrape bot on a regular schedule from a web server. Scraping the data is the easy part; getting the data consistently is, I find, the difficult part. If you are willing to provide me with some contact information, I would be more than happy to contribute to this project, as currently I am only able to post an issue. Thanks.

Need web scraping job done (for hire)

Hey! Super impressed by your work, Crinibus. I'm in the market to build a large-scale e-commerce scraper for personal usage, and I think what you've built is a great start. Please shoot me a DM on Twitter @krishmoran or reply here to discuss if you're interested in helping build this out.

Add so a user can add multiple links at once

E.g. the user searches for a product term like "3080" and then provides the url of that search to the program, and the program adds the first 4 (or x) products from that search.

Split the methods "get_part_num" and "shorten_url" into the domain classes

Instead of being methods with lots of if-else statements, it would just be a line in each domain class.

For example, for "get_part_num":
Instead of this:

class Scraper:
    def get_part_num(self):
        if self.URL_domain == domains['komplett']:
            self.part_num = self.URL.split('/')[4]
            ...

The self.part_num should be moved into its domain class, like this:

class Komplett(Scraper):
    def get_info(self):
        ...
        self.part_num = self.URL.split('/')[4]

Add a dictionary with all the domains

E.g. instead of individual variables in "add_product.py", and instead of hardcoding the domains in both "add_product.py" and "scraping.py".

This makes for better reading and makes it much easier to change a link (currently you have to change it in multiple places in the code).
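
For illustration, a minimal sketch of such a dictionary (the domain values here are placeholders):

domains = {
    "komplett": "www.komplett.dk",
    "proshop": "www.proshop.dk",
    "avxperten": "www.avxperten.dk",
}

Both "add_product.py" and "scraping.py" could then import this one mapping instead of hardcoding the links.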

Api link elgiganten.se

Is it possible for you to get the API link for elgiganten.se? I tried to change it manually to .se, but it seems to not be the same API link.

Add OBS to both READMEs about how to change an eBay link to a "valid" one
