
scraper's Introduction

This is a firmware scraper that aims to download firmware images and associated metadata from supported device vendor websites.

Dependencies
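The tracebacks in the issues below show the project running on Scrapy (with Twisted underneath) and, for the optional SQL support, psycopg2 against PostgreSQL; treat this as an inference from those logs rather than an authoritative dependency list.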

Usage

  1. Configure the firmware/settings.py file. Comment out SQL_SERVER if metadata about downloaded firmware should not be inserted into a SQL server (a settings sketch follows this section).

  2. To run a specific scraper, e.g. dlink:

scrapy crawl dlink

To run all scrapers with a maximum of 4 in parallel, using GNU Parallel:

parallel -j 4 scrapy crawl ::: `for i in ./firmware/spiders/*.py; do basename ${i%.*}; done`
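
For reference, a minimal sketch of the relevant firmware/settings.py lines. BOT_NAME and SQL_SERVER appear in the issue logs below; the FILES_STORE value is an assumption based on the "output" directory mentioned in the issues, not a confirmed default:

# firmware/settings.py (sketch)
BOT_NAME = 'firmware'
FILES_STORE = './output'       # assumed download directory
# SQL_SERVER = '127.0.0.1'     # leave commented out to skip SQL metadata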

scraper's People

Contributors

anderson-liu, brianpow, ddcc, double-q1015, irwincong, mikimotoh, misterch0c, salmonx, seihtam


scraper's Issues

Successful crawl using Python 3.5, but no images downloaded

Hi, everyone

I successfully ran this code with Python 3.5, with the following output:
 2019-10-06 20:06:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://s.ssl.qhres.com/static/3bef0d53c1b56012/common_info.js>

{'description': 'ROM\u6b63\u5f0f\u7248\u3010360\u5bb6\u5ead\u9632\u706b\u5899\u00b7\u8def\u7531\u5668 '
'5S\u3011',
'file_urls': ['http://luyou.dl.qihucdn.com/luyou/360F5S/360-F5S-V3.1.2.64131.bin'],
'files': [],
'product': 'f5s',
'url': 'http://luyou.dl.qihucdn.com/luyou/360F5S/360-F5S-V3.1.2.64131.bin',
'vendor': '360',
'version': 'V3.1.2.64131'}

(The escaped description above decodes to 'ROM 正式版【360家庭防火墙·路由器 5S】', roughly 'ROM official release [360 Home Firewall · Router 5S]'.)

I found that no images are actually downloaded and saved to the output folder.
I did not install the SQL server. My question is: does this code first crawl the image URLs and save them to SQL, and then download the images by iterating over the SQL records?

Thanks for any reply.
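
A hedged note, assuming this project follows Scrapy's standard FilesPipeline conventions (which the file_urls and files fields in the log above suggest): SQL is not part of the download path; judging by the tracebacks elsewhere on this page, the SQL insert happens in item_completed, i.e. after downloads. The pipeline schedules a download for each URL in an item's file_urls field and records the result in files, so an empty files list means the download itself failed or the pipeline never ran. The destination directory comes from the FILES_STORE setting, roughly:

# settings.py (sketch; the ITEM_PIPELINES path is taken from the tracebacks below,
# the FILES_STORE value is an assumption)
ITEM_PIPELINES = {'firmware.pipelines.FirmwarePipeline': 1}
FILES_STORE = './output'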

Which version of scrapy to use?

I use Python 2.7.6 and Scrapy 1.1.0 on Ubuntu 14.04.
After I issue scrapy crawl netgear, this exception happens:

['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.throttle.AutoThrottle']
2016-06-03 09:56:57 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-03 09:56:57 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2016-06-03 09:56:57 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 163, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 167, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 72, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 97, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/media.py", line 35, in from_crawler
    pipe = cls()
exceptions.TypeError: __init__() takes at least 2 arguments (1 given)
2016-06-03 09:56:57 [twisted] CRITICAL:
(the same traceback is logged a second time, ending in the same TypeError)

I think maybe my scrapy version is wrong. Which version is correct?
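
A hedged reading of the traceback: Scrapy's media pipeline is instantiating the project's pipeline with no arguments (pipe = cls()) while the pipeline's __init__ expects at least one, which points to a mismatch between the installed Scrapy and the Scrapy API this code was written against. A later report on this page shows Scrapy 1.0.3 at least getting past startup, so pinning an older 1.0.x release is a plausible first thing to try:

pip install 'scrapy==1.0.3'

Alternatively, one possible shape of a workaround is to give the subclass its own construction hook so the bare cls() call is never reached. A sketch only, assuming FirmwarePipeline subclasses Scrapy's FilesPipeline (FILES_STORE and the from_settings hook are real Scrapy conventions; whether this helps depends on which Scrapy is installed):

# firmware/pipelines.py (sketch)
from scrapy.pipelines.files import FilesPipeline

class FirmwarePipeline(FilesPipeline):
    @classmethod
    def from_settings(cls, settings):
        # supply the store URI that the bare cls() call omits
        return cls(settings['FILES_STORE'])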

[boto] ERROR: Unable to read instance data, giving up

Hi,

I installed the dependencies and removed the SQL line in settings, and I get this (and then it just stalls) when running scrapy crawl dlink:

2017-02-07 15:02:03 [boto] DEBUG: Retrieving credentials from metadata server.
2017-02-07 15:02:04 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-02-07 15:02:04 [boto] ERROR: Unable to read instance data, giving up
2017-02-07 15:02:04 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-02-07 15:02:04 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware

What am I doing wrong?

Thanks!
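
A hedged aside: this boto error is known noise in Scrapy 1.0/1.1, where the S3 download handler probes the EC2 metadata service for credentials and times out on non-EC2 machines; it is usually harmless and unrelated to any stall. A common way to silence it is to disable the S3 handler in settings.py:

# settings.py (sketch)
DOWNLOAD_HANDLERS = {'s3': None}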

cannot crawl, cannot walk, cannot run

I'm working on a side project with some friends and I stumbled across your scrapy repo. It looks really cool and exactly what I need to scrape the web for a bunch of firmware! I'm unfamiliar with the scrapy package, and your readme's recommendation of:

scrapy crawl dlink

has yielded nothing but errors and hanging for me. I'm still reading through the Scrapy docs and your code, but I'm struggling to get any of the spiders to crawl. I was hoping you might be able to give me some guidance on how to go about using your tool. If there are any config files, scrapy command line options, unlisted dependencies, or other things I might be missing, a bit of direction on how to start web-crawling would help me a lot.

[screenshot: scrapy error]

Regardless, thank you for all your work on what looks like an amazing open source tool!

Error when running scrapy

When I try to run scrapy crawl dlink (out of the scraper directory), I'm getting the following error:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 109, in execute
    settings = get_project_settings()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/project.py", line 60, in get_project_settings
    settings.setmodule(settings_module_path, priority='project')
  File "/usr/local/lib/python2.7/dist-packages/scrapy/settings/__init__.py", line 108, in setmodule
    module = import_module(module)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named firmware.settings

Do you know what could cause this? Thanks!
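
A hedged note: this ImportError usually means Scrapy located a scrapy.cfg but could not import the firmware package it points at, which typically happens when the command is run from the wrong directory. Running from the repository root, where scrapy.cfg and the firmware/ package sit side by side, is the usual fix:

cd scraper
scrapy crawl dlink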

new update possibly not working

Not sure if it has to do with the new update to this software or not. This is my first time trying scraper.

I'm using Python 2.7.6 and Scrapy 1.1.0.

When trying scrapy crawl dlink, here is the output:

2016-06-04 20:35:02 [scrapy] ERROR: Error processing {'build': u'A',
'date': datetime.datetime(2013, 7, 16, 0, 0),
'description': u'Firmware (4.10B15)',
'mib': u'ftp://FTP2.DLINK.COM/PRODUCTS/DXS-3326GSR/REVA/DXS-3326GSR_MIBS_4.40B02.ZIP',
'product': u'DXS-3326GSR',
'url': u'ftp://FTP2.DLINK.COM/PRODUCTS/DXS-3326GSR/REVA/DXS-3326GSR_FIRMWARE_4.10B15.EXE',
'vendor': 'dlink',
'version': u'4.10B15'}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/media.py", line 44, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/root/firmadyne/sources/scraper/firmware/pipelines.py", line 61, in get_media_requests
    for x in ["mib", "url"] if x in item]
AttributeError: 'FirmwarePipeline' object has no attribute 'FILES_URLS_FIELD'

(identical errors and tracebacks follow for the 4.20B11 and 4.30B11 firmware items of the same product)

This AttributeError is actually coming up on all models that I tried to scrape. Any idea why?
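
A hedged observation: FILES_URLS_FIELD and FILES_RESULT_FIELD are attributes that Scrapy's FilesPipeline defines (defaulting to 'file_urls' and 'files'), and their names moved around between Scrapy releases (current versions expose DEFAULT_FILES_URLS_FIELD), which is exactly the kind of drift that produces this AttributeError. So this again looks like a mismatch between the installed Scrapy and the Scrapy this code was written for. Pinning Scrapy as suggested in the earlier issue, or defining the attributes explicitly on the subclass, should sidestep it; a sketch:

# firmware/pipelines.py (sketch; the attribute values are Scrapy's own defaults)
from scrapy.pipelines.files import FilesPipeline

class FirmwarePipeline(FilesPipeline):
    FILES_URLS_FIELD = 'file_urls'      # field the pipeline reads download URLs from
    FILES_RESULT_FIELD = 'files'        # field the pipeline writes results to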

Can't find the images after I run scraper

Hi, after I run "scrapy crawl dlink", the "output" directory is created, but there are no downloaded image files in it and I can't find them anywhere. The output of this command looks like this:

2017-11-18 21:46:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zyxel.com/us/en/support/SearchResultTab.shtml?c=us&l=en&t=dl&md=NWD-370N&mt=Firmware&mt=MIBFile> (referer: http://www.zyxel.com/us/en/support/download_landing.shtml)
2017-11-18 21:46:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zyxel.com/us/en/support/SearchResultTab.shtml?c=us&l=en&t=dl&md=ZyWALL%201050&mt=Firmware&mt=MIBFile> (referer: http://www.zyxel.com/us/en/support/download_landing.shtml)
2017-11-18 21:46:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zyxel.com/us/en/support/SearchResultTab.shtml?c=us&l=en&t=dl&md=PLA-470&mt=Firmware&mt=MIBFile> (referer: http://www.zyxel.com/us/en/support/download_landing.shtml)
2017-11-18 21:46:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zyxel.com/us/en/support/SearchResultTab.shtml?c=us&l=en&t=dl&md=SFP-1000T&mt=Firmware&mt=MIBFile> (referer: http://www.zyxel.com/us/en/support/download_landing.shtml)
2017-11-18 21:46:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zyxel.com/us/en/support/SearchResultTab.shtml?c=us&l=en&t=dl&md=XGS3600-28F&mt=Firmware&mt=MIBFile> (referer: http://www.zyxel.com/us/en/support/download_landing.shtml)
2017-11-18 21:46:27 [scrapy.core.scraper] ERROR: Error processing {'date': datetime.datetime(2016, 3, 3, 0, 0),
'mib': u'ftp://ftp2.zyxel.com/XGS3600-28F/mib_file/XGS3600-28F_1.mib',
'product': u'XGS3600-28F',
'url': u'ftp://ftp2.zyxel.com/XGS3600-28F/firmware/XGS3600-28F_V1.00(AAFM.2)C0.zip',
'vendor': 'zyxel',
'version': u'V1.00(AAFM.2)C0'}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/root/firmadyne/sources/scraper/firmware/pipelines.py", line 78, in get_media_requests
    for x in ["mib", "url"] if x in item]
  File "/usr/local/lib/python2.7/dist-packages/scrapy/item.py", line 66, in __setitem__
    (self.__class__.__name__, key))
KeyError: 'FirmwareImage does not support field: None'

(the identical KeyError traceback repeats for every processed item, including more XGS3600-28F, P-660HW-61, NAS520, and NBG5715 firmware, interleaved with further DEBUG: Crawled (200) lines)

2017-11-18 21:46:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.zyxel.com/us/en/support/SearchResultTab.shtml?c=us&l=en&t=dl&md=NWD2705&mt=Firmware&mt=MIBFile> (referer: http://www.zyxel.com/us/en/support/download_landing.shtml)
2017-11-18 21:46:30 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-18 21:46:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 340742,
'downloader/request_count': 878,
'downloader/request_method_count/GET': 878,
'downloader/response_bytes': 2485303,
'downloader/response_count': 878,
'downloader/response_status_count/200': 878,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 19, 5, 46, 30, 279406),
'item_dropped_count': 2,
'item_dropped_reasons_count/DropItem': 2,
'log_count/DEBUG': 879,
'log_count/ERROR': 976,
'log_count/INFO': 16,
'log_count/WARNING': 2,
'memusage/max': 37892096,
'memusage/startup': 29310976,
'request_depth_max': 1,
'response_received_count': 878,
'scheduler/dequeued': 878,
'scheduler/dequeued/memory': 878,
'scheduler/enqueued': 878,
'scheduler/enqueued/memory': 878,
'start_time': datetime.datetime(2017, 11, 19, 5, 37, 6, 979585)}
2017-11-18 21:46:30 [scrapy.core.engine] INFO: Spider closed (finished)

I don't know what the problem is.
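
A hedged reading: the KeyError comes from the pipeline assigning to item[None], i.e. the field-name attribute it looks up resolves to None under the installed Scrapy, which looks like the same pipeline/Scrapy API mismatch as the FILES_URLS_FIELD issue above. Whatever the fix, the FirmwareImage item has to declare every field the files pipeline writes; a sketch, with field names taken from the item dicts in these logs (scrapy.Item and scrapy.Field are the real API):

# firmware/items.py (sketch)
import scrapy

class FirmwareImage(scrapy.Item):
    file_urls = scrapy.Field()    # consumed by FilesPipeline
    files = scrapy.Field()        # populated by FilesPipeline
    vendor = scrapy.Field()
    product = scrapy.Field()
    version = scrapy.Field()
    date = scrapy.Field()
    mib = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
    build = scrapy.Field()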

Scrapy Hangs

After the following scrapy command and errors, nothing else seems to happen; the scraper just hangs forever. I'm not sure what I'm missing to make this work. This is being run from an Ubuntu 16.04 VM.

scrapy crawl dlink
2017-03-21 17:11:51 [scrapy] INFO: Scrapy 1.0.3 started (bot: firmware)
2017-03-21 17:11:51 [scrapy] INFO: Optional features available: ssl, http11, boto
2017-03-21 17:11:51 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scraper.spiders', 'DOWNLOAD_TIMEOUT': 1200, 'CONCURRENT_REQUESTS': 8, 'DOWNLOAD_WARNSIZE': 0, 'SPIDER_MODULES': ['firmware.spiders'], 'BOT_NAME': 'firmware', 'DOWNLOAD_MAXSIZE': 0, 'USER_AGENT': 'FirmwareBot/1.0 (+https://github.com/firmadyne/scraper)'}
2017-03-21 17:11:51 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState, AutoThrottle
2017-03-21 17:11:51 [boto] DEBUG: Retrieving credentials from metadata server.
2017-03-21 17:11:52 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-03-21 17:11:52 [boto] ERROR: Unable to read instance data, giving up
2017-03-21 17:11:52 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-03-21 17:11:52 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
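
(This is the same boto/EC2-metadata noise as in the earlier boto issue above; it is typically harmless, and the DOWNLOAD_HANDLERS = {'s3': None} workaround sketched there applies here too. The hang itself is likely a separate problem.)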

Database schema missing

After creating a database "firmware" with user "firmadyne", setting SQL_SERVER = "127.0.0.1" in settings.py, and running, the following errors occur:

Traceback (most recent call last):
  File "/home/xx/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/xx/xx/scraper/firmware/pipelines.py", line 99, in item_completed
    cur.execute("SELECT id FROM image WHERE hash=%s",
psycopg2.errors.UndefinedTable: relation "image" does not exist
LINE 1: SELECT id FROM image WHERE hash='efbe8afe54100787cc53157d698...
                       ^

2022-02-11 23:18:58 [firmware.pipelines] WARNING: Database connection exception: relation "image" does not exist
LINE 1: SELECT id FROM image WHERE hash='a3bc15e95e865e08a9d0ad6be62...
                       ^

This is probably caused by my database not having any tables. Unfortunately, I cannot find how to initialize the database. If someone has a working database, the schema can be exported using:

sudo su - postgres
pg_dump -s firmware
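
A hedged stopgap until the real schema turns up (assumption: the full schema ships with the firmadyne project itself rather than this scraper): the query in the traceback only touches an image table with id and hash columns, so a minimal table unblocks that statement. psycopg2's connection context manager commits the transaction on success:

# create_schema.py (sketch; column list inferred only from the query above)
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", dbname="firmware", user="firmadyne")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS image (
            id   SERIAL PRIMARY KEY,
            hash VARCHAR UNIQUE
        )
    """)
conn.close()

The pipeline presumably inserts more columns than these, so expect follow-on errors until the real firmadyne schema is loaded.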
