dynamohuang / amazon-scrapy
Scrape the details and lowest price of Amazon best-seller products with a Python spider.
Hi @dynamohuang, I am new to Python, Scrapy, and MySQL. I have finished setting up the working environment for running your code (i.e. a MySQL backend, and the scripts under amazon/db to create the tables accordingly).
However, how can I start scraping the best sellers' ASINs? What I do now is run main.py, but it results in 0 items fetched. Could you please walk me through the workflow?
At the moment I rotate the User-Agent dynamically and add a delay, but a captcha still pops up.
Does anyone have a better approach?
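The User-Agent rotation described above can be sketched as a Scrapy downloader middleware. This is a minimal illustration, not the repo's actual code; the UA strings are examples, and rotation alone usually needs to be combined with delays and/or proxies to avoid the captcha.

```python
# A minimal sketch of random User-Agent rotation as a Scrapy downloader
# middleware.  The UA strings below are examples.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

class RandomUserAgentMiddleware:
    """Pick a random User-Agent for every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy keep processing the request
```

It would be enabled via DOWNLOADER_MIDDLEWARES in settings.py, e.g. {"amazon.middlewares.RandomUserAgentMiddleware": 400} (the module path here is an assumption).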
If the Amazon location address is **, more than half of the products will not appear on the product search page.
Hello, there is missing information about how exactly the search keywords work. There are a lot of tables but no data to put into them, and because of this the code doesn't run.
I'd appreciate any help or advice on how to run it: the code expects to read a keyword and an ASIN from the database, but no seed data exists in the repo.
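Since the spiders read their seeds from the database, one would need to insert keyword/ASIN rows before crawling. The sketch below is hypothetical: the table and column names are guesses (check the SQL files under amazon/db for the real schema), the seed values are placeholders, and sqlite3 stands in for MySQL only so the sketch is self-contained.

```python
# Hypothetical seeding script.  Table/column names are assumptions --
# check the SQL under amazon/db for the real schema.  sqlite3 is a
# stand-in here; the project itself uses MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword (id INTEGER PRIMARY KEY, keyword TEXT)")
conn.execute("CREATE TABLE asin (id INTEGER PRIMARY KEY, asin TEXT)")

# Placeholder seed values, not real product identifiers.
conn.executemany("INSERT INTO keyword (keyword) VALUES (?)",
                 [("wireless mouse",), ("usb c cable",)])
conn.execute("INSERT INTO asin (asin) VALUES (?)", ("B000000000",))
conn.commit()
```

With rows like these in place, the keyword and asin spiders would have something to pick up on their first run.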
I don't know if it's because Amazon has detected our bot and blocked the IP?
But https://www.amazon.com/best-sellers-video-games/zgbs/videogames/?ajax=1&pg=3 indeed doesn't exist; there is no page 3 there.
https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/?ajax=1&pg=2 is correct; I can open it in the Chrome browser.
How can I set up the proxy?
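On the proxy question: one common approach (a sketch, not this repo's actual middleware) is to set request.meta["proxy"] in a downloader middleware. The proxy URL below is a placeholder.

```python
# Sketch: route requests through a proxy by setting request.meta["proxy"]
# in a downloader middleware.  The endpoint below is a placeholder.
class ProxyMiddleware:
    PROXY = "http://127.0.0.1:8080"  # replace, e.g. http://user:pass@host:port

    def process_request(self, request, spider):
        request.meta["proxy"] = self.PROXY
```

It would be enabled via DOWNLOADER_MIDDLEWARES in settings.py. Note that for https:// targets Scrapy opens a CONNECT tunnel through the HTTP proxy, which is why an http://ip:port proxy URL still works for https pages.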
Because of this error, will it lose all the data, even though it already got some data from previous pages?
twisted.internet.error.TimeoutError: User timeout caused connection failure.
2018-11-19 23:40:32 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.amazon.com/best-sellers-video-games/zgbs/videogames/?ajax=1&pg=3>
Traceback (most recent call last):
  File "/home/john/anaconda2/envs/amazon-scrapy/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 46.38.52.36:8081 [{'status': 400, 'reason': b'Bad Request'}]
2018-11-19 23:40:36 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/?ajax=1&pg=2>
Traceback (most recent call last):
  File "/home/john/anaconda2/envs/amazon-scrapy/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/john/anaconda2/envs/amazon-scrapy/lib/python3.6/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/john/anaconda2/envs/amazon-scrapy/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/home/john/anaconda2/envs/amazon-scrapy/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/john/anaconda2/envs/amazon-scrapy/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 320, in _cb_timeout
    raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/?ajax=1&pg=2 took longer than 30.0 seconds..
2018-11-19 23:41:51 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.amazon.com/best-sellers-software/zgbs/software/?ajax=1&pg=2>
Traceback (most recent call last):
  File "/home/john/anaconda2/envs/amazon-scrapy/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TimeoutError: User timeout caused connection failure.
(1030, 'Got error 168 from storage engine')
total spent: 0:52:23.652052
done
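On losing data to timeouts: items already written by the pipeline from earlier pages are not rolled back; only the page whose request failed is lost unless it is retried. Scrapy's built-in retry middleware can re-queue such requests. A sketch of the relevant settings.py knobs (the values are illustrative, not the repo's):

```python
# Illustrative retry/timeout settings for settings.py.  A request that
# times out is retried up to RETRY_TIMES before being given up on,
# rather than silently dropped.
RETRY_ENABLED = True
RETRY_TIMES = 5                                    # Scrapy's default is 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]  # server errors + throttling
DOWNLOAD_TIMEOUT = 60                              # the log above shows a 30s limit
```

Raising DOWNLOAD_TIMEOUT helps with slow proxies; dropping dead proxies (like the one returning 400 on CONNECT above) helps more.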
How can using http + ip_port as the proxy take effect when visiting https sites?
I'm using Scrapy 1.5. There were some errors, mainly UTF-8 encoding and other issues; it now runs normally, but I still have a few questions:
1. product has no code or SQL for saving items. I plan to write it myself; will a later update add it, or is it left for users to complete as needed?
2. The asin, cate, and detail spiders can each run independently. How should keyword, review, etc. be executed? Running them independently reports insufficient arguments.
Hi, I would like to use this repo to get some info on Amazon products. I'm not very familiar with Scrapy (yet); here's what I did:
- git clone your project
- install the requirements
- cd amazon-scrapy/amazon
- scrapy crawl asin
I get the following error:
Traceback (most recent call last):
File "/home/user/miniconda3/envs/scrap/bin/scrapy", line 10, in <module>
sys.exit(execute())
File "/home/user/miniconda3/envs/scrap/lib/python3.7/site-packages/scrapy/cmdline.py", line 109, in execute
settings = get_project_settings()
File "/home/user/miniconda3/envs/scrap/lib/python3.7/site-packages/scrapy/utils/project.py", line 68, in get_project_settings
settings.setmodule(settings_module_path, priority='project')
File "/home/user/miniconda3/envs/scrap/lib/python3.7/site-packages/scrapy/settings/__init__.py", line 292, in setmodule
module = import_module(module)
File "/home/user/miniconda3/envs/scrap/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'amazon.settings'
Any idea how to fix that?
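One likely cause (a guess, not confirmed against this exact layout): `scrapy crawl` resolves `amazon.settings` through scrapy.cfg, so it has to be run from the directory that contains scrapy.cfg, not from a subdirectory. With the real repo that would be `cd amazon-scrapy && scrapy crawl asin`; the lines below only demonstrate the cfg-lookup idea with a throwaway skeleton.

```shell
# Scrapy reads scrapy.cfg to find the settings module, so the crawl must
# be launched from the directory where scrapy.cfg lives.  This skeleton
# mimics the repo layout as an assumption.
proj=$(mktemp -d)
printf '[settings]\ndefault = amazon.settings\n' > "$proj/scrapy.cfg"
cd "$proj"
grep -q 'amazon.settings' scrapy.cfg && echo "scrapy.cfg found: crawl from here"
```

If scrapy.cfg is actually inside amazon-scrapy/amazon, run the crawl from there instead; the key point is matching the working directory to scrapy.cfg.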
Hello.
I'm an e-commerce beginner; recently I've been working on scraping the keyword-to-ASIN rankings from Amazon pages.
I'm stuck at the very first step: I can't find the cookie in Amazon's network tab, only a User-Agent and some Accept headers.
Do you crawl via cookies, or through some other means? I'd be grateful for some pointers when you have time.
Hi, thanks for open-sourcing the project.
When will the Amazon review spider be available?
I would like to get some data about Amazon product reviews.
Thanks.
John