scrapy's Introduction

Scrapy

Overview

Scrapy is a fast, high-level, BSD-licensed web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.
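
As a quick taste of the framework, here is a minimal spider sketch (the spider name, target site and CSS selectors are illustrative placeholders; see the tutorial in the documentation for a full walkthrough):

import scrapy


class QuotesSpider(scrapy.Spider):
    """A tiny example spider; the site and selectors are illustrative only."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block found on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

A spider like this can be run without creating a project, e.g. with scrapy runspider quotes_spider.py -o quotes.json (using whatever filename you saved it under).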

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, Twitter, mailing list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

scrapy's People

Contributors

adityaa30, alexcepoi, alexpdev, anubhavp28, aspidites, broodingkangaroo, curita, dangra, digenis, elacuesta, eliasdorneles, gallaecio, georgea92, jdemaeyer, jxlil, kmike, laerte, lopuhin, maramsumanth, nramirezuy, nyov, pablohoffman, pawelmhm, redapple, rmax, stummjr, victor-torres, void, whalebot-helmsman, wrar


scrapy's Issues

xml iternodes bug with nested nodes

Originally reported by Damian Canabal on Trac: http://dev.scrapy.org/ticket/275

If itertag = 'product' then returned nodes will be incomplete:

<?xml version="1.0" encoding="UTF-8"?>
<merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd">
  <product product_id="95" name="Pure Shetland Wool Throw" sku_number="10095" manufacturer_name="Biome Lifestyle" part_number="">
    <URL>
      <product>http://click.linksynergy.com/fs-bin/click?id=bNgl5*KPhYY&amp;offerid=211619.95&amp;type=15&amp;subid=0</product>
      <productImage>http://assets1.notonthehighstreet.com/system/product_images/images/000/000/368/normal_95_pure_shetland_wool_throw_main.jpg</productImage>
    </URL>
    <description>
Made from the purest and softest Shetland wool, this throw remains undyed to stay as eco-friendly as possible. She
    </description>
  </product>
</merchandiser>

returned node:

<product product_id="95" name="Pure Shetland Wool Throw" sku_number="10095" manufacturer_name="Biome Lifestyle" part_number="">
    <URL>
      <product>
http://click.linksynergy.com/fs-bin/click?id=bNgl5*KPhYY&amp;offerid=211619.95&amp;type=15&amp;subid=0
      </product>
    </URL>
</product>
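
For context, here is a minimal sketch of a spider that would hit this code path, written with today's class names and a placeholder feed URL. With iterator='iternodes' and itertag='product', the nested <product> element inside <URL> is what truncates the outer node:

from scrapy.spiders import XMLFeedSpider


class MerchandiserSpider(XMLFeedSpider):
    """Sketch only: the feed URL is a placeholder."""
    name = "merchandiser"
    start_urls = ["http://example.com/merchandiser.xml"]
    iterator = "iternodes"   # the iterator affected by this bug
    itertag = "product"      # matches both the outer and the nested <product>

    def parse_node(self, response, node):
        # With the bug, the selector for the outer <product> node stops at the
        # nested <product> element, so <description> is missing from it.
        yield {"description": node.xpath("description/text()").get()}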

Add Proxy CONNECT support (fixes bug with https urls and proxies)

Proxy CONNECT support is required for https urls to work.

When proxy support was added to Scrapy, urllib still didn't support CONNECT, but that has since been fixed:
http://bugs.python.org/issue1424152

Now we need to add support to Scrapy (which doesn't use urllib at all).

This link contains useful tips on how to add CONNECT support with Twisted:
http://twistedmatrix.com/pipermail/twisted-web/2008-August/003878.html

See originally reported ticket on Trac for more comments: http://dev.scrapy.org/ticket/159
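
For reference, in current Scrapy the built-in HttpProxyMiddleware picks up a proxy from the request meta (or from the standard proxy environment variables), and HTTPS requests through such a proxy are exactly what needs CONNECT support. A minimal sketch with a placeholder proxy address and target URL:

import scrapy


class ProxiedSpider(scrapy.Spider):
    """Sketch only: the proxy endpoint and target URL are placeholders."""
    name = "proxied"

    def start_requests(self):
        # HttpProxyMiddleware reads the 'proxy' key from request.meta.
        yield scrapy.Request(
            "https://example.com/",
            meta={"proxy": "http://127.0.0.1:8080"},
        )

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)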

Interrupted system call (OSError) on "scrapy deploy"

The following problem happens when running "scrapy deploy" on macOS:

DraixBook:ibcrest martin$ python2.6 $(which scrapy) deploy
Building egg of 51-1311257580
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.13.0', 'scrapy')
  File "/Library/Python/2.6/site-packages/setuptools-0.6c9-py2.6.egg/pkg_resources.py", line 448, in run_script

  File "/Library/Python/2.6/site-packages/setuptools-0.6c9-py2.6.egg/pkg_resources.py", line 1166, in run_script
    script_code = compile(script_text,script_filename,'exec')
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/commands/deploy.py", line 98, in run
    egg, tmpdir = _build_egg()
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/commands/deploy.py", line 208, in _build_egg
    check_call([sys.executable, 'setup.py', 'clean', '-a', 'bdist_egg', '-d', d], stdout=f)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 457, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 444, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 1137, in wait
    pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 4] Interrupted system call

The problem was solved by upgrading to Python 2.7.2.

exceptions.AssertionError: No free spider slots when opening 'default'

In scrapy shell, when fetch() is run for the second time, it crashes with the following exception:

exceptions.AssertionError: No free spider slots when opening 'default'

Traceback:

2012-01-29 23:45:01-0600 [-] ERROR: Unhandled error in Deferred:
2012-01-29 23:45:01-0600 [-] Unhandled Error
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/threads.py", line 113, in _callFromThread
result = defer.maybeDeferred(f, _a, *_kw)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 133, in maybeDeferred
result = f(_args, *_kw)
File "/usr/lib/pymodules/python2.7/scrapy/shell.py", line 64, in _schedule
self.crawler.engine.open_spider(spider, close_if_idle=False)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1141, in unwindGenerator
return _inlineCallbacks(None, f(_args, *_kwargs), Deferred())
--- ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1020, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/pymodules/python2.7/scrapy/core/engine.py", line 214, in open_spider
spider.name
exceptions.AssertionError: No free spider slots when opening 'default'

Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/pymodules/python2.7/scrapy/shell.py", line 80, in fetch
self._schedule, request, spider)
File "/usr/lib/python2.7/dist-packages/twisted/internet/threads.py", line 118, in blockingCallFromThread
result.raiseException()
File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 338, in raiseException
raise self.type, self.value, self.tb
AssertionError: Spider 'default' not opened when crawling: <GET http://www.economist.com/blogs/freeexchange/2011/10/generational-warfare>

Offsite middleware doesn't filter redirected responses

Reported by fencer on Trac: http://dev.scrapy.org/ticket/100

I am using a BaseSpider to harvest links. The spider evaluates every anchor link on the page, processes them, and applies an algorithm to each one. The spider's parse function returns both items for output and requests with the harvested links for further crawling.

Extra domains were not specified; only the spider's domain_name value was set, to "agd.org". While testing the spider, I noticed it was crawling URLs outside the domain_name.

In examining the log file, I noticed that there were 302 redirects from a URL inside the domain to a URL outside the domain. All domains crawled outside of the original domain_name correlated with a 302 redirect.

2009-09-01 12:44:25-0700 [agd.org] DEBUG: Redirecting (302) to 
   <http://www.goarmy.com/amedd/dental/index.jsp?iom=9618-ITBP-MCDE-07012009-16-09021-180AD1> 
   from <http://www.agd.org/adtracking/a.aspx?ZoneID=18&Task=Click&Mode=HTML&SiteID=1&PageID=28659>

I have not examined the spider middleware in detail, but I am guessing that the 302 redirect is somehow circumventing scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware.

I am not sure whether this is a bug or the way it was intentionally designed to handle 302 redirects.

Crawl spider crawls previously visited URL on redirect

Previously reported by michaelvmata on Trac http://dev.scrapy.org/ticket/299

If a crawl spider is redirected to an already visited page, it will still crawl it.

From the mailing list http://groups.google.com/group/scrapy-users/browse_thread/thread/ee9ad68f5dbacc6d:

"...the dupe filter only catches requests after they leave the spider, so redirected pages are ignored by the dupe filter.

Since the dupefilter and the redirect middleware components are decoupled now, it would be awkward to implement what you suggest, but nevertheless I think it would be useful"

RegEx gets too big when allowed_domains long

lines 49 and 50 of scrapy.contrib.spidermiddleware.offsite:

    regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
    return re.compile(regex)

Tested with a list of 3000 allowed domains, this raises:

Caught OverflowError while rendering: regular expression code size limit exceeded

Fixed by subclassing OffsiteMiddleware and using url_is_from_any_domain:

import re

from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware
from scrapy.utils.url import url_is_from_any_domain


class MyOffsiteMiddleware(OffsiteMiddleware):

    def should_follow(self, request, spider):
        allowed_domains = getattr(spider, 'allowed_domains', None)
        return url_is_from_any_domain(request.url, allowed_domains)

    def spider_opened(self, spider):
        self.host_regexes[spider] = re.compile('')
        self.domains_seen[spider] = set()
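
If you use a workaround like this, you also need to swap it in for the stock middleware in your settings. A sketch, assuming the class lives in a hypothetical myproject.middlewares module:

# settings.py (sketch; the module path is an assumption)
SPIDER_MIDDLEWARES = {
    # Disable the built-in offsite middleware...
    "scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware": None,
    # ...and enable the regex-free subclass in its place.
    "myproject.middlewares.MyOffsiteMiddleware": 500,
}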

start_requests() and parse() should always be generators

start_requests() and parse() in BaseSpider are expected to return iterables. I suggest modifying this behaviour and forcing them to be treated as generators. If start_requests() wants to generate requests for URLs following a pattern, it may eat a lot of memory:

from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "test_spider"
    start_urls = ['http://www.amazon.com/dp/B005890G8Y/']

    def parse(self, response):
        print 'parse'
        for i in xrange(100000000):
            url = 'http://www.amazon.com/dp/%i/' % i
            print i,
            yield self.make_requests_from_url(url)


crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = TestSpider()
crawler.queue.append_spider(spider)
crawler.start()

or

from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "test_spider"
    #start_urls = ['http://www.amazon.com/dp/B005890G8Y/']

    def start_requests(self):
        for i in xrange(100000000):
            url = 'http://www.amazon.com/dp/%i/' % i
            print 'yielding a start url: %s' % url
            yield self.make_requests_from_url(url)

    def parse(self, response):
        '''does nothing'''
        print 'parse' 


crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = TestSpider()
crawler.queue.append_spider(spider)
crawler.start()

scrapy shell raising an exception in IPython

I'm running a standard scrapy shell command and receiving the following output, which occurs after it has finished crawling and enters the shell:

scrapy shell http://doc.scrapy.org/_static/selectors-sample1.html
2011-12-28 19:37:34+0100 [scrapy] INFO: Scrapy 0.12.0.2543 started (bot: botname)
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-12-28 19:37:35+0100 [scrapy] DEBUG: Enabled item pipelines: DuvelPipeline
2011-12-28 19:37:35+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-12-28 19:37:35+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-12-28 19:37:35+0100 [default] INFO: Spider opened
2011-12-28 19:37:35+0100 [default] DEBUG: Crawled (404) <GET http://doc.scrapy.org/_static/selectors-sample1.html> (referer: None)
2011-12-28 19:37:35+0100 [default] INFO: Closing spider (finished)
2011-12-28 19:37:35+0100 [default] INFO: Spider closed (finished)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<script type="text'>
[s] item DuvelItem()
[s] request <GET http://doc.scrapy.org/_static/selectors-sample1.html>
[s] response <404 http://doc.scrapy.org/_static/selectors-sample1.html>
[s] settings <CrawlerSettings module=<module 'spidertree.settings' from '/Volumes/HDD/home/duvel/spidertree/settings.pyc'>>
[s] spider <BaseSpider 'default' at 0x10fcdda10>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2011-12-28 19:37:35+0100 [scrapy] ERROR: Shell error
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 484, in run
self.__target(_self.__args, *_self.__kwargs)
--- ---
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker
result = context.call(ctx, function, _args, *_kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, _args, *_kw)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/twisted/python/context.py", line 81, in callWithContext
return func(args,*kw)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/scrapy/shell.py", line 56, in _start
start_python_console(self.vars)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/scrapy/utils/console.py", line 14, in start_python_console
shell = IPython.Shell.IPShellEmbed(argv=[], user_ns=namespace)
exceptions.AttributeError: 'module' object has no attribute 'Shell'

Evidently there's a problem with IPython's shell.
I'm running IPython 0.12 and Scrapy 0.12.0.2543.

Probe command

Some pages depend on certain HTTP request headers being sent in order to render the expected result, and finding out which headers those are is a manual and tedious job.

So, here's an idea for automating this probing mechanism: create a new scrapy command, probe, which takes a URL and a text to look for as arguments.

Scrapy then tries several combinations of HTTP headers (User-Agent, Accept, etc.) and returns a set that works (where "works" means that the given text is found in the response).

Here's a real world example to illustrate:

http://www.storage-cabinets-online.com/IVG2/N/ProductID-118021.htm

The page should contain a string 'var sFeatures', but that string is not returned with the default Scrapy HTTP request headers. So we run scrapy probe on it:

$ scrapy probe http://www.storage-cabinets-online.com/IVG2/N/ProductID-118021.htm 'var sFeatures'
Found set of working headers:
{'Host': 'www.storage-cabinets-online.com', 'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.10) Gecko/20100915 Ubuntu/10.04 (lucid)
Firefox/3.6.10', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-us,en;q=0.5', 'Accept-Encoding': 'gzip,deflate',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7', 'Keep-Alive': '115'}

The scrapy probe command would try a list of different well-known user agents, along with Accept headers, as sketched below.
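
Nothing like this exists in Scrapy today; as a rough illustration of the idea, here is a standalone sketch that tries a few User-Agent/Accept combinations with the standard library and reports the first combination for which the target text appears. The candidate header values are assumptions, not a recommended list:

import itertools
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Scrapy-probe-sketch/0.1",
]
ACCEPT_HEADERS = [
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "*/*",
]


def probe(url, text):
    """Return the first header combination whose response body contains `text`."""
    for ua, accept in itertools.product(USER_AGENTS, ACCEPT_HEADERS):
        headers = {"User-Agent": ua, "Accept": accept}
        req = urllib.request.Request(url, headers=headers)
        try:
            body = urllib.request.urlopen(req, timeout=10).read()
        except OSError:
            continue  # treat network/HTTP errors as "this combination doesn't work"
        if text.encode() in body:
            return headers
    return None


if __name__ == "__main__":
    import sys
    print(probe(sys.argv[1], sys.argv[2]))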

Real support for returning iterators on parse() method

New features and settings:

...
Real support for returning iterators on start_requests() method. The iterator is now consumed during the crawl when the spider is getting idle (r2704)
...

This works! Thanks!

But the issue with iterators on parse() methods still exists:

class AmazonSpider(BaseSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/dp/B005890G8Y/']

    def parse(self, response):
        print 'parse'
        for i in xrange(100000000):
            url = 'http://www.amazon.com/dp/%i/' % i
            print i,
            yield self.make_requests_from_url(url)

This causes memory consumption to grow very fast.

Also, I guess because of this, when the scrapyd started by 'scrapy server' is interrupted, one of my 'scrapy crawl' subprocesses cannot stop for a long time:

2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 102, model 978, years 2007-2007
2011-11-18 14:14:32+0200 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force unclean shutdown
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 622, model 3404, years 2002-2002
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 108, model 1451, years 2003-2003
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 82, model 2293, years 2007-2007
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 42, model 805, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 25, model 3345, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 33, model 665, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 130, model 2139, years 2009-2009
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 92, model 1975, years 2009-2009
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 134, model 3453, years 2010-2010
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 21, model 375, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 8, model 720, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 95, model 644, years 2009-2009
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 72, model 2117, years 2007-2007
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 32, model 636, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 70, model 654, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 106, model 2843, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Found item link: http://www.carbusiness.it/527781/auto-nuova/MICROCAR_MC2.ashx
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Redirecting (302) to <GET http://www.carbusiness.it/ricerca/risultati.aspx?idM=106&idMM=&PDa=0&PA=0&idTCb=0&idTCz=0&CC=0&CDa=0&CA=0&idTC=0&EA=0&nr=50&ADa=2011&Aa=2011&KmDa=-1&KmA=-1&G=0> from <POST http://www.carbusiness.it/Default.aspx>
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Crawled (200) <GET http://www.carbusiness.it/ricerca/risultati.aspx?idM=82&idMM=&PDa=0&PA=0&idTCb=0&idTCz=0&CC=0&CDa=0&CA=0&idTC=0&EA=0&nr=50&ADa=2011&Aa=2011&KmDa=-1&KmA=-1&G=0> (referer: http://www.carbusiness.it/jp/jp.aspx?action=load_modelli&id_marca=82)
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Crawled (200) <GET http://www.carbusiness.it/ricerca/risultati.aspx?idM=25&idMM=&PDa=0&PA=0&idTCb=0&idTCz=0&CC=0&CDa=0&CA=0&idTC=0&EA=0&nr=50&ADa=2011&Aa=2011&KmDa=-1&KmA=-1&G=0> (referer: http://www.carbusiness.it/jp/jp.aspx?action=load_modelli&id_marca=25)
2011-11-18 14:14:32+0200 [carbusiness_it] INFO: Closing spider (shutdown)
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 131, model 2152, years 2010-2010
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 79, model 2164, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 102, model 2464, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 622, model 3403, years 2002-2002
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 108, model 2853, years 2002-2002
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 82, model 266, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 42, model 2432, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 25, model 413, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 33, model 658, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 130, model 2137, years 2009-2009
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 92, model 1977, years 2008-2008
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 134, model 3452, years 2010-2010
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 21, model 383, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 8, model 2404, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 95, model 3214, years 2009-2009
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 72, model 3477, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 32, model 2388, years 2011-2011

This is because there is a really big loop in the parse method:

def parse2(self, response):
    '''Request search my brands, models and year.'''

    hxs = HtmlXPathSelector(response)
    models = hxs.select("//option/@value").extract()[1:]
    random.shuffle(models)

    brandId = response.meta['brand_id']
    self.log('parse2, brandId=%s, models=%s' % (brandId, models))

    searchPageResponse = response.meta['prev_response']
    hxs = HtmlXPathSelector(searchPageResponse)
    years = hxs.select("//*[@id='AnnoDa']/option/@value").extract()[1:]
    years = list(map(int, years))
    years.sort(reverse= True) # be sure its sorted desc

    for i in range(len(years) - 1): #
        yearTo = years[i]
        yearFrom = years[i + 1]
        if yearFrom == yearTo - 1:
            yearFrom = yearTo # search is inclusive. so do not search two consecutive years

        for modelId in models:
            formdata = {'ddlMarca': brandId, 'ddlModello': modelId,
                        'AnnoDa': yearFrom, 'Annoa': yearTo, 'ddlRisultatiPerPagina': 50}
            formRequest = FormRequest.from_response(searchPageResponse, 'ctl00', formdata= formdata, # we specify which submit button to click
                                callback= self.parseBrand, clickdata= {'name': 'btnRicerca'}, priority= -i)
            self.log('Requesting search: brand %s, model %s, years %d-%d' %
                     (brandId, modelId, yearFrom, yearTo), log.DEBUG)
            yield formRequest

Failing test case for get_meta_refresh function

Originally reported in Trac by Daniel: http://dev.scrapy.org/ticket/111

diff --git a/scrapy/tests/test_utils_response.py b/scrapy/tests/test_utils_response.py
--- a/scrapy/tests/test_utils_response.py
+++ b/scrapy/tests/test_utils_response.py
@@ -62,11 +62,16 @@
         response = Response(url='http://example.org', body=body)
         self.assertEqual(get_meta_refresh(response), (1, 'http://example.org/newpage'))

         # entities in the redirect url
         body = """<meta http-equiv="refresh" content="3; url=&#39;http://www.example.com/other&#39;">"""
         response = Response(url='http://example.com', body=body)
         self.assertEqual(get_meta_refresh(response), (3, 'http://www.example.com/other'))

+        # entities in the redirect url with single quotes
+        body = """<meta http-equiv="refresh" content='3; url=&#39;http://www.example.com/other&#39;'>"""
+        response = Response(url='http://example.com', body=body)
+        self.assertEqual(get_meta_refresh(response), (3, 'http://www.example.com/other'))
+
         # relative redirects
         body = """<meta http-equiv="refresh" content="3; url=other.html">"""
         response = Response(url='http://example.com/page/this.html', body=body)

_disconnectedDeferred error using twisted 11.1.0

Seen with Scrapy 0.14.0, Twisted 11.1.0, and Python 2.6/2.7. After downgrading to Twisted 11.0.0 the error does not show up.

In my case, it happened in a long-running spider which doesn't do anything unusual, and after the error the crawler hangs.

See the log below:

2011-11-23 09:32:40-0600 [projects] DEBUG: Crawled (200) <GET http://www.example.net/p/foo> (referer: http://www.example.net/p?page=51903&sort=users)
2011-11-23 09:32:40-0600 [projects] DEBUG: Crawled (200) <GET http://www.example.net/p/bar> (referer: http://www.example.net/p?page=51903&sort=users)
2011-11-23 09:38:10-0600 [projects] INFO: Crawled 89600 pages (at 42 pages/min), scraped 31075 items (at 16 items/min)
2011-11-23 09:38:11-0600 [-] Unhandled Error
    Traceback (most recent call last):
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 45, in run
        self.crawler.start()
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/scrapy/crawler.py", line 76, in start
        reactor.run(installSignalHandlers=False) # blocking call
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 1169, in run
        self.mainLoop()
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/tcp.py", line 337, in failIfNotConnected
        self.connector.connectionFailed(failure.Failure(err))
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 1055, in connectionFailed
        self.factory.clientConnectionFailed(self, reason)
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/web/client.py", line 413, in clientConnectionFailed
        self._disconnectedDeferred.callback(None)
    exceptions.AttributeError: ScrapyHTTPClientFactory instance has no attribute '_disconnectedDeferred'

2011-11-23 09:38:13-0600 [-] Unhandled Error
    (the same '_disconnectedDeferred' traceback is repeated five more times at 09:38:13)

2011-11-23 09:38:58-0600 [projects] INFO: Crawled 89600 pages (at 0 pages/min), scraped 31075 items (at 0 items/min)
2011-11-23 09:39:58-0600 [projects] INFO: Crawled 89600 pages (at 0 pages/min), scraped 31075 items (at 0 items/min)
2011-11-23 09:40:58-0600 [projects] INFO: Crawled 89600 pages (at 0 pages/min), scraped 31075 items (at 0 items/min)

scrapy shell raising an exception with IPython 0.11

2011-10-04 11:33:47+0200 [scrapy] ERROR: Shell error
        Traceback (most recent call last):
          File "/usr/lib/python2.6/threading.py", line 504, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.6/threading.py", line 484, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/local/lib/python2.6/dist-packages/twisted/python/threadpool.py", line 207, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/local/lib/python2.6/dist-packages/twisted/python/context.py", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/local/lib/python2.6/dist-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/shell.py", line 56, in _start
            start_python_console(self.vars)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/console.py", line 14, in start_python_console
            shell = IPython.Shell.IPShellEmbed(argv=[], user_ns=namespace)
        exceptions.AttributeError: 'module' object has no attribute 'Shell'

However, it works fine with IPython 0.10.1.

SSL error 'sslv3 alert illegal parameter' is generated on certain URLs

I previously reported this issue on Trac: http://dev.scrapy.org/ticket/315

For example:

$ scrapy fetch "https://ui2web1.apps.uillinois.edu/BANPROD1/bwskfcls.P_GetCrse"
...
2011-03-24 10:58:03+0000 [default] ERROR: Error downloading <https://ui2web1.apps.uillinois.edu/BANPROD1/bwskfcls.P_GetCrse>: [Failure instance: Traceback (failure with no frames): <class 'OpenSSL.SSL.Error'>: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert illegal parameter')]

This issue is discussed here http://bugs.python.org/issue11220

It would be nice to be able to specify the SSL method and options on requests, or use Scrapy defaults, instead of those hardcoded in twisted.internet.ssl.ClientContextFactory.

Another option might be to try SSLv3 when an error is encountered with SSLv23.

Defect in FormRequest constructor

Reported by cdeyoung on Trac http://dev.scrapy.org/ticket/323

In the constructor of the FormRequest class in scrapy/http/request/form.py, in the if statement that handles the "formdata" argument, there is a line that says:

    self.method = 'POST'

That line either shouldn't be there, falling back to the base class's method variable, or it should be handled differently if you want FormRequest to default to POST rather than GET, as the Request base class does. Currently, FormRequest objects are hard-coded to be submitted via POST, and that isn't always valid. You should still allow the developer to specify method='GET' when using a FormRequest object, I think.

If you want FormRequest to default to submitting forms via POST, then I would recommend the following change, or something like it:

class FormRequest(Request):
    __slots__ = ()
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        method = kwargs.pop('method', 'POST')

        super(FormRequest, self).__init__(*args, **kwargs)

        if formdata:
            self.method = method
            ...
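
With a change along those lines, a GET form submission could be expressed directly. A minimal sketch of the intended usage (the URL and form fields are placeholders):

from scrapy.http import FormRequest

# With the proposed constructor, method is honoured instead of being forced to
# POST, so the form data ends up encoded in the query string of a GET request.
request = FormRequest(
    "http://www.example.com/search",
    formdata={"q": "shetland wool throw"},
    method="GET",
)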

Pluggable Link Extractor backends

We want to have pluggable link extractor backends, maybe via a LINKEXTRACTOR_CLASS setting.

Some backends that come to mind: pure-regex, scrapely, libxml2, lxml, sgml

The sgml backend is not working very well, as there have been several issues reported about it.

The scrapely backend works quite well and it's pure Python, so it's a good choice, but we'd have to add a dependency on scrapely.

Cookies middleware is slow when crawling many domains from the same spider

When crawling many domains from the same spider, cookies middleware slows down.

The patch is attached.

The main difference is that it only checks for cookies belonging to relevant domains (the potential_domain_matches) instead of all domains, but doing this required a bit of refactoring. It also only calls clear_expired_cookies periodically instead of on every request.

The performance problems are very large and noticeable pretty quickly if a single spider has to manage many cookies. This happened when we had a spider that crawled many sites with the cookies middleware enabled.

It would be great if you could review it, and then maybe we can try it on more websites.

see http://dev.scrapy.org/ticket/333 for more info

Allow misc.load_object() to take a reference to an actual object

It would be nice to have the flexibility of passing a reference to the actual object in settings, as well as specifying dotted paths to objects.

Maybe something as simple as:

if not isinstance(path, basestring):
    return path

...at the beginning of misc.load_object(path)?
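
A complete sketch of what the modified helper might look like (this spells out the reporter's idea; it is not the actual Scrapy implementation):

def load_object(path):
    """Load an object given its dotted path, or return it unchanged if it is
    already an object rather than a string (sketch of the proposed behaviour)."""
    if not isinstance(path, str):  # 'basestring' in the Python 2 original
        return path
    module_path, _, name = path.rpartition('.')
    module = __import__(module_path, fromlist=[name])
    return getattr(module, name)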

scrapy.item.Item - allow item.url along with item['url']

scrapy.item.Item:

def __getattr__(self, name):
    if name in self.fields:
        raise AttributeError("Use item[%r] to get field value" % name)
    raise AttributeError(name)

def __setattr__(self, name, value):
    if not name.startswith('_'):
        raise AttributeError("Use item[%r] = %r to set field value" % \
            (name, value))
    super(DictItem, self).__setattr__(name, value)

I suggest allowing item.url = response.url along with item['url'] - this lets IDEs autocomplete field names:

def __getattr__(self, name):
    if name in self.fields:
        return self.__getitem__(name)
    raise AttributeError(name)

def __setattr__(self, name, value):
    if not name.startswith('_'):
        return self.__setitem__(name, value)
    super(DictItem, self).__setattr__(name, value)
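
With that change, a spider could populate items via attribute access. A small sketch, assuming a hypothetical item class with a url field (the attribute-style line is commented out because stock Scrapy currently raises AttributeError for it):

import scrapy


class PageItem(scrapy.Item):
    url = scrapy.Field()


item = PageItem()
item["url"] = "http://example.com/"   # current API
# item.url = "http://example.com/"    # proposed attribute-style access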

canonicalize_url() function breaks some urls

Current behavior of Scrapy when finding links like:
/fclick.php?variable

is to canonicalize them to:
/fclick.php?variable=

This, however, makes Scrapy follow an incorrect link and causes an error page to load. This is really the fault of the web developers who use query variables without values, but for the sake of robustness Scrapy should follow the links as given.

I made a small patch for this. All it really does is crop out the = when it encounters a variable with a zero-length value.

see http://dev.scrapy.org/ticket/133 for more info
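
To illustrate the reported behaviour, here is a small sketch of the kind of normalisation involved, using only the standard library (this is not Scrapy's actual implementation):

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def naive_canonicalize(url, keep_blank_values=True):
    """Sort query arguments; with keep_blank_values=True a valueless
    'variable' comes back as 'variable=', which is the reported breakage."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    args = sorted(parse_qsl(query, keep_blank_values=keep_blank_values))
    return urlunsplit((scheme, netloc, path, urlencode(args), fragment))


print(naive_canonicalize("http://example.com/fclick.php?variable"))
# -> http://example.com/fclick.php?variable=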

SqlitePriorityQueue.pop() return None may crash Poller.poll()

Previously reported by mrkschan on Trac http://dev.scrapy.org/ticket/313

I have a scrapy project that has several spiders in it. Those spiders are scheduled to execute on an hourly-basis.

scrapyd eventually reports an unhandled error (as shown below). Digging through the Scrapy source, I suspect the concurrency control around sqlite3 access is not well guarded, as the error below is caused by popping from an empty queue.

The case can be explained simply by a scenario with two spiders, A and B (also refer to the source: http://is.gd/ht4HSs):

Spider A's poller poll() gets to line 21 of scrapyd/poller.py.
Meanwhile, Spider B's poller also gets to line 21.
Spider A's poller poll() gets to line 23; the queue in sqlite becomes empty and sqlite3 is locked (according to the Python docs: http://is.gd/67PKpn).
Spider B's poller poll() also gets to line 23 and waits for the lock to be released.
Spider A commits and releases the sqlite lock.
Spider B pop()s from an empty queue and raises an error.

Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/task.py", line 194, in __call__
    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 117, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 944, in unwindGenerator
    return _inlineCallbacks(None, f(*args, **kwargs), Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 823, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/pymodules/python2.6/scrapyd/poller.py", line 24, in poll
    returnValue(self.dq.put(self._message(msg, p)))
  File "/usr/lib/pymodules/python2.6/scrapyd/poller.py", line 33, in _message
    d = queue_msg.copy()
exceptions.AttributeError: 'NoneType' object has no attribute 'copy'

changed scrapyd data dir to be ".scrapy/scrapyd" instead of ".scrapy/.scrapy/scrapyd"

--- a/scrapyd/script.py
+++ b/scrapyd/script.py
@@ -14,7 +14,7 @@ from scrapyd import get_application
from scrapyd.config import Config

def _get_config():
-    datadir = os.path.join(project_data_dir(), '.scrapy', 'scrapyd')
+    datadir = os.path.join(project_data_dir(), 'scrapyd')
    conf = {
        'eggs_dir': os.path.join(datadir, 'eggs'),
        'logs_dir': os.path.join(datadir, 'logs'),

scrapy doesn't respect CLOSESPIDER_ITEMCOUNT

scrapy crawl example --set CLOSESPIDER_ITEMCOUNT=1
<...>
2011-09-19 19:06:40+0400 [example] INFO: Dumping spider stats:
        {'downloader/request_bytes': 337,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 715698,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'closespider_itemcount',
         'finish_time': datetime.datetime(2011, 9, 19, 15, 6, 40, 574142),
         'item_scraped_count': 892,
         'scheduler/memory_enqueued': 2,
         'start_time': datetime.datetime(2011, 9, 19, 15, 6, 38, 560994)}
2011-09-19 19:06:40+0400 [example] INFO: Spider closed (closespider_itemcount)
<...>

Note 'item_scraped_count': 892 in the stats above.
'INFO: Closing spider (closespider_itemcount)' appears in the log right after the first scraped item (the CLOSESPIDER_ITEMCOUNT value in this example), yet hundreds of items are scraped before the spider actually closes.

exception handling

scrapy/spidermanager.py

def create(self, spider_name, **spider_kwargs):
    try:
        return self._spiders[spider_name](**spider_kwargs)
    except KeyError:
        raise KeyError("Spider not found: %s" % spider_name)

change to

def create(self, spider_name, **spider_kwargs):
    try:
        spcls = self._spiders[spider_name]
    except KeyError:
        raise KeyError("Spider not found: %s" % spider_name)
    return spcls(**spider_kwargs)

I had a case where spider.__init__ raised a KeyError, causing a misleading 'Spider not found' exception.

Offsite middleware ignoring port

In my spider I have the following:

class MySpider(BaseSpider):

    allowed_domains = ['192.169.0.15:8080']

and in the parse method I do something like:

    yield Request('http://192.169.0.15:8080/mypage.html', self.my_callback_function)

the result when I run the code is that Scrapy reports:

DEBUG: Filtered offsite request to '192.168.0.15': <GET http://192.168.0.15:8080/mypage.html>

This is wrong - it seems to be ignoring the port. If I change allowed_domains to:

    allowed_domains = ['192.169.0.15:8080', '192.16.0.15']

Then it works as you would expect. No big deal - I can work around it - but I think it is a bug. The problem is located in the should_follow method of the OffsiteMiddleware class in contrib/spidermiddleware/offsite.py.

scrapyd should allow multiple jobs to be scheduled with one URL

Previously reported by agtilden on Trac

scrapyd only allows one job to be scheduled per URL invocation, which makes scheduling lots of jobs needlessly time-consuming.

I propose adding a file upload option that would contain a JSON string with the following structure:

[{"project" : {"spider": {"spider_arg_name": "spider_arg_value"}}},
    {"another_project" : {"another_spider": {"spider_arg_name": "spider_arg_value"}}}]

The return value would be a JSON list of the same length. Each element would be either the job ID assigned by scrapyd or null if the scheduler encountered an exception.

This can be implemented so that the existing parameter passing continues to work for scheduling one job at a time.
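
For comparison, scheduling a single job with the existing schedule.json endpoint looks roughly like this (a sketch using only the standard library; the host, project and spider names are placeholders):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# One HTTP call per job with the current schedule.json endpoint.
data = urlencode({"project": "myproject", "spider": "somespider"}).encode()
with urlopen("http://localhost:6800/schedule.json", data=data) as resp:
    print(json.load(resp))  # e.g. {"status": "ok", "jobid": "..."}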

Add support for FTP downloads

We should add support for following FTP links like:
ftp://www.example.com/somedir/somefile.xml

I suppose Requests will only use the URL attribute (and perhaps some data in meta, if it's needed).

As for Responses, they will contain the file contents in the body, as one would expect.
There should be a flag to enable/disable passive FTP, perhaps even per spider.
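
For reference, later Scrapy versions did add an FTP download handler driven by the URL scheme, with ftp_user/ftp_password request meta keys for credentials (worth double-checking against the docs for your version). A minimal sketch with placeholder host and credentials:

import scrapy


class FtpFileSpider(scrapy.Spider):
    """Sketch only: the FTP host, path and credentials are placeholders."""
    name = "ftp_file"

    def start_requests(self):
        yield scrapy.Request(
            "ftp://ftp.example.com/somedir/somefile.xml",
            meta={"ftp_user": "anonymous", "ftp_password": "guest"},
        )

    def parse(self, response):
        # The response body holds the raw file contents, as the issue expects.
        self.logger.info("Downloaded %d bytes", len(response.body))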

Scrapy hangs if an exception raises in start_requests

When the start_requests iterator throws an exception, it makes engine._next_request fail with an UnhandledError and prevents Scrapy from stopping the engine correctly; it hangs forever.

Ctrl-C is required to stop it.


2012-01-27 17:10:09-0200 [scrapy] INFO: Scrapy 0.15.1 started (bot: testbot)
2012-01-27 17:10:09-0200 [spidername.com] INFO: Spider opened
2012-01-27 17:10:09-0200 [spidername.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-01-27 17:10:09-0200 [-] Unhandled Error
    Traceback (most recent call last):
      File "/home/daniel/src/scrapy/scrapy/commands/crawl.py", line 45, in run
        self.crawler.start()
      File "/home/daniel/src/scrapy/scrapy/crawler.py", line 76, in start
        reactor.run(installSignalHandlers=False) # blocking call
      File "/home/daniel/envs/mytestenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1169, in run
        self.mainLoop()
      File "/home/daniel/envs/mytestenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/home/daniel/envs/mytestenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/daniel/src/scrapy/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "/home/daniel/src/scrapy/scrapy/core/engine.py", line 108, in _next_request
        request = slot.start_requests.next()
      File "/home/daniel/src/testbot/testbot/spiders_dev/myspider.py", line 32, in start_requests
        'spidername.com does not support url mapping'
    exceptions.AssertionError: spidername.com does not support url mapping

^C2012-01-27 17:10:11-0200 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force unclean shutdown
2012-01-27 17:10:11-0200 [spidername.com] INFO: Closing spider (shutdown)
2012-01-27 17:10:11-0200 [spidername.com] INFO: Dumping spider stats:
    {'finish_reason': 'shutdown',
     'finish_time': datetime.datetime(2012, 1, 27, 19, 10, 11, 757102),
     'start_time': datetime.datetime(2012, 1, 27, 19, 10, 9, 487178)}
2012-01-27 17:10:11-0200 [spidername.com] INFO: Spider closed (shutdown)
2012-01-27 17:10:11-0200 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 111865856, 'memusage/startup': 111865856}

Documentation speaks of re.match while meaning re.search

Reported by Vasily Alexeev on Trac http://dev.scrapy.org/ticket/328

In link extractor reference we see passages like

"allow (str or list) โ€“ a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links."

There are two quite different methods for working with regexps: matching and searching. A quick look at the sources reveals that in this case we deal with searching, not matching:

_matches = lambda url, regexs: any((r.search(url) for r in regexs))

So the documentation is clearly misleading and should be corrected.
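
The difference matters for the allow patterns; a quick illustration:

import re

url = "http://example.com/category/books/page/2"
pattern = re.compile(r"/category/books/")

print(bool(pattern.match(url)))   # False: match() anchors at the start of the URL
print(bool(pattern.search(url)))  # True: search() finds the pattern anywhere,
                                  # which is what the link extractor actually does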

SSL compatibility issues with some servers

Attempting to fetch this page over HTTPS resulted in the not very enlightening error "Connection was closed cleanly".

ERROR: Error downloading <GET https://rn.ftc.gov/pls/textilern/wrnquery$.startup>: Connection was closed cleanly.

Some very helpful people on #twisted managed to work out that it was an issue with the server not liking empty fragments in the SSL communication. This could be fixed by specifying the OP_DONT_INSERT_EMPTY_FRAGMENTS option when making the OpenSSL context.

Currently it seems that in order to be able to scrape a website that has any SSL compatibility issues, you have to subclass ClientContextFactory in order to specify the compatibility options, and then subclass HTTPDownloadHandler and tell it to use that context factory.

Could this be made easier? Would it even be a good idea to set some compatibility options by default? Maybe even SSL.OP_ALL?
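
For anyone hitting this today, here is a sketch of the subclassing workaround mentioned above; the exact options (and how to wire the class in) are assumptions worth checking against your Scrapy and Twisted versions:

from OpenSSL import SSL
from twisted.internet.ssl import ClientContextFactory


class TolerantClientContextFactory(ClientContextFactory):
    """Sketch: enable OpenSSL's bug-compatibility options for picky servers."""

    def getContext(self):
        ctx = ClientContextFactory.getContext(self)
        # SSL.OP_ALL bundles various workarounds, including the
        # OP_DONT_INSERT_EMPTY_FRAGMENTS option mentioned in the report.
        ctx.set_options(SSL.OP_ALL)
        return ctx

In more recent Scrapy versions, the DOWNLOADER_CLIENTCONTEXTFACTORY setting can point at such a class, which avoids having to subclass the download handler as well.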

Add dont_cache flag

Reported by binarybug on Trac http://dev.scrapy.org/ticket/325

If a website uses a session to maintain the client's state, then resuming a crawl doesn't work when the cache is enabled. If we could instruct Scrapy not to cache some requests, then resuming a crawl would create a session when those requests are encountered and subsequent requests wouldn't fail, e.g.:

yield Request('http://www.example.com', meta={'dont_cache': True})

Make Request class configurable setting

Previously reported by wecacuee on Trac http://dev.scrapy.org/ticket/301

Rationale

Currently we use scrapy.http.Request as the default class for Request throughout scrapy. This makes scrapy quite bound to HTTP protocol requests. I understand that we have download handlers for "http", "ftp" and "s3", but this enforces request differentiation only by "URI" scheme.

We should have a common, protocol-agnostic request class, scrapy.Request, and the request class should be configurable via settings:

DEFAULT_REQUEST_CLASS = 'scrapy.http.Request'

Support for binding interface to another ip

I have a suggestion for an improvement, which I've added to my local Scrapy installation. I'm sure it can be done more elegantly, but it's a start :)

The address to bind to the socket needs to be passed to reactor.connectTCP in core.downloader.handlers.http._connect as bindAddress.

See the attached diff for an example.
Just add ip_bind = (ip-address, port-number) to your spider if you want to override the default.

see http://dev.scrapy.org/ticket/153 for more info
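
For what it's worth, later Scrapy versions expose this through the bindaddress request meta key rather than a spider attribute. A minimal sketch with placeholder addresses; the exact value format (plain IP string vs. an (ip, port) tuple) is an assumption here, so check the Request.meta documentation for your version:

import scrapy


class BoundSpider(scrapy.Spider):
    """Sketch only: the outgoing IP and target URL are placeholders."""
    name = "bound"

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/",
            # Assumed (outgoing IP, port) tuple, matching Twisted's bindAddress;
            # port 0 lets the OS pick an ephemeral port.
            meta={"bindaddress": ("192.0.2.10", 0)},
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)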
