scrapy's Introduction

Scrapy

Overview

Scrapy is a fast, high-level, BSD-licensed web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.
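
As a quick taste of the framework, here is a minimal spider sketch (the spider name, target site and CSS selectors are illustrative placeholders; see the tutorial in the documentation for a full walkthrough):

import scrapy


class QuotesSpider(scrapy.Spider):
    """A tiny example spider; the site and selectors are illustrative only."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block found on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

A spider like this can be run without creating a project, e.g. with scrapy runspider quotes_spider.py -o quotes.json (using whatever filename you saved it under).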

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, Twitter, mailing list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

scrapy's People

Contributors

adityaa30, alexcepoi, alexpdev, anubhavp28, aspidites, broodingkangaroo, curita, dangra, digenis, elacuesta, eliasdorneles, gallaecio, georgea92, jdemaeyer, jxlil, kmike, laerte, lopuhin, maramsumanth, nramirezuy, nyov, pablohoffman, pawelmhm, redapple, rmax, stummjr, victor-torres, void, whalebot-helmsman, wrar


scrapy's Issues

xml iternodes bug with nested nodes

Originally reported by Damian Canabal on Trac: http://dev.scrapy.org/ticket/275

If itertag = 'product' then returned nodes will be incomplete:

<?xml version="1.0" encoding="UTF-8"?>
<merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd">
  <product product_id="95" name="Pure Shetland Wool Throw" sku_number="10095" manufacturer_name="Biome Lifestyle" part_number="">
    <URL>
      <product>http://click.linksynergy.com/fs-bin/click?id=bNgl5*KPhYY&amp;offerid=211619.95&amp;type=15&amp;subid=0</product>
      <productImage>http://assets1.notonthehighstreet.com/system/product_images/images/000/000/368/normal_95_pure_shetland_wool_throw_main.jpg</productImage>
    </URL>
    <description>
Made from the purest and softest Shetland wool, this throw remains undyed to stay as eco-friendly as possible. She
    </description>
  </product>
</merchandiser>

returned node:

<product product_id="95" name="Pure Shetland Wool Throw" sku_number="10095" manufacturer_name="Biome Lifestyle" part_number="">
    <URL>
      <product>
http://click.linksynergy.com/fs-bin/click?id=bNgl5*KPhYY&amp;offerid=211619.95&amp;type=15&amp;subid=0
      </product>
    </URL>
</product>
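
For context, here is a minimal sketch of a spider that would hit this code path, written with today's class names and a placeholder feed URL. With iterator='iternodes' and itertag='product', the nested <product> element inside <URL> is what truncates the outer node:

from scrapy.spiders import XMLFeedSpider


class MerchandiserSpider(XMLFeedSpider):
    """Sketch only: the feed URL is a placeholder."""
    name = "merchandiser"
    start_urls = ["http://example.com/merchandiser.xml"]
    iterator = "iternodes"   # the iterator affected by this bug
    itertag = "product"      # matches both the outer and the nested <product>

    def parse_node(self, response, node):
        # With the bug, the selector for the outer <product> node stops at the
        # nested <product> element, so <description> is missing from it.
        yield {"description": node.xpath("description/text()").get()}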

Add Proxy CONNECT support (fixes bug with https urls and proxies)

Proxy CONNECT support is required for https urls to work.

When proxy support was added to Scrapy, urllib still didn't support CONNECT, but that has since been fixed:
http://bugs.python.org/issue1424152

Now we need to add support to Scrapy (which doesn't use urllib at all).

This link contains useful tips on how to add CONNECT support with Twisted:
http://twistedmatrix.com/pipermail/twisted-web/2008-August/003878.html

See originally reported ticket on Trac for more comments: http://dev.scrapy.org/ticket/159
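
For reference, in current Scrapy the built-in HttpProxyMiddleware picks up a proxy from the request meta (or from the standard proxy environment variables), and HTTPS requests through such a proxy are exactly what needs CONNECT support. A minimal sketch with a placeholder proxy address and target URL:

import scrapy


class ProxiedSpider(scrapy.Spider):
    """Sketch only: the proxy endpoint and target URL are placeholders."""
    name = "proxied"

    def start_requests(self):
        # HttpProxyMiddleware reads the 'proxy' key from request.meta.
        yield scrapy.Request(
            "https://example.com/",
            meta={"proxy": "http://127.0.0.1:8080"},
        )

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)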

Interrupted system call (OSError) on "scrapy deploy"

The following problem happens when running "scrapy deploy" on macOS:

DraixBook:ibcrest martin$ python2.6 $(which scrapy) deploy
Building egg of 51-1311257580
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.13.0', 'scrapy')
  File "/Library/Python/2.6/site-packages/setuptools-0.6c9-py2.6.egg/pkg_resources.py", line 448, in run_script

  File "/Library/Python/2.6/site-packages/setuptools-0.6c9-py2.6.egg/pkg_resources.py", line 1166, in run_script
    script_code = compile(script_text,script_filename,'exec')
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/commands/deploy.py", line 98, in run
    egg, tmpdir = _build_egg()
  File "/Library/Python/2.6/site-packages/Scrapy-0.13.0-py2.6.egg/scrapy/commands/deploy.py", line 208, in _build_egg
    check_call([sys.executable, 'setup.py', 'clean', '-a', 'bdist_egg', '-d', d], stdout=f)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 457, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 444, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 1137, in wait
    pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 4] Interrupted system call

The problem was solved by upgrading to Python 2.7.2.

exceptions.AssertionError: No free spider slots when opening 'default'

In scrapy shell, when fetch() is run for the second time, it crashes with the following exception:

exceptions.AssertionError: No free spider slots when opening 'default'

Traceback:

2012-01-29 23:45:01-0600 [-] ERROR: Unhandled error in Deferred:
2012-01-29 23:45:01-0600 [-] Unhandled Error
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/threads.py", line 113, in _callFromThread
result = defer.maybeDeferred(f, _a, *_kw)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 133, in maybeDeferred
result = f(_args, *_kw)
File "/usr/lib/pymodules/python2.7/scrapy/shell.py", line 64, in _schedule
self.crawler.engine.open_spider(spider, close_if_idle=False)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1141, in unwindGenerator
return _inlineCallbacks(None, f(_args, *_kwargs), Deferred())
--- ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1020, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/pymodules/python2.7/scrapy/core/engine.py", line 214, in open_spider
spider.name
exceptions.AssertionError: No free spider slots when opening 'default'

Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/pymodules/python2.7/scrapy/shell.py", line 80, in fetch
self._schedule, request, spider)
File "/usr/lib/python2.7/dist-packages/twisted/internet/threads.py", line 118, in blockingCallFromThread
result.raiseException()
File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 338, in raiseException
raise self.type, self.value, self.tb
AssertionError: Spider 'default' not opened when crawling: <GET http://www.economist.com/blogs/freeexchange/2011/10/generational-warfare>

Offsite middleware doesn't filter redirected responses

Reported by fencer on Trac: http://dev.scrapy.org/ticket/100

I am using a BaseSpider to harvest links. The spider evaluates every anchor link on the page, processes them, and applies an algorithm to each one. The spider's parse function returns both items for output and requests with the harvested links for further crawling.

Extra domains were not specified; only the spider's domain_name value was set, to "agd.org". While testing the spider, I noticed it was crawling URLs outside the domain_name.

In examining the log file, I noticed that there were 302 redirects from a URL inside the domain to a URL outside the domain. All domains crawled outside of the original domain_name correlated with a 302 redirect.

2009-09-01 12:44:25-0700 [agd.org] DEBUG: Redirecting (302) to 
   <http://www.goarmy.com/amedd/dental/index.jsp?iom=9618-ITBP-MCDE-07012009-16-09021-180AD1> 
   from <http://www.agd.org/adtracking/a.aspx?ZoneID=18&Task=Click&Mode=HTML&SiteID=1&PageID=28659>

I have not examined the spider middleware in detail, but I am guessing that the 302 redirect is somehow circumventing scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware.

I am not sure whether this is a bug or the way it was intentionally designed to handle 302 redirects.

Crawl spider crawls previously visited URL on redirect

Previously reported by michaelvmata on Trac http://dev.scrapy.org/ticket/299

If a crawl spider is redirected to an already visited page, it will still crawl it.

From the mailing list http://groups.google.com/group/scrapy-users/browse_thread/thread/ee9ad68f5dbacc6d:

"...the dupe filter only catches requests after they leave the spider, so redirected pages are ignored by the dupe filter.

Since the dupefilter and the redirect middleware components are decoupled now, it would be awkward to implement what you suggest, but nevertheless I think it would be useful"

RegEx gets too big when allowed_domains long

lines 49 and 50 of scrapy.contrib.spidermiddleware.offsite:

    regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
    return re.compile(regex)

Tested with a list of 3000 allowed domains, this raises:

Caught OverflowError while rendering: regular expression code size limit exceeded

Fixed by subclassing OffsiteMiddleware and using url_is_from_any_domain:

import re

from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware
from scrapy.utils.url import url_is_from_any_domain


class MyOffsiteMiddleware(OffsiteMiddleware):

    def should_follow(self, request, spider):
        allowed_domains = getattr(spider, 'allowed_domains', None)
        return url_is_from_any_domain(request.url, allowed_domains)

    def spider_opened(self, spider):
        self.host_regexes[spider] = re.compile('')
        self.domains_seen[spider] = set()
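
If you use a workaround like this, you also need to swap it in for the stock middleware in your settings. A sketch, assuming the class lives in a hypothetical myproject.middlewares module:

# settings.py (sketch; the module path is an assumption)
SPIDER_MIDDLEWARES = {
    # Disable the built-in offsite middleware...
    "scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware": None,
    # ...and enable the regex-free subclass in its place.
    "myproject.middlewares.MyOffsiteMiddleware": 500,
}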

start_requests() and parse() should always be generators

start_requests() and parse() in BaseSpider are expected to return iterables. I suggest modifying this behaviour and forcing them to be treated as generators. If start_requests() wants to generate requests for URLs following a pattern, it may eat a lot of memory:

from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "test_spider"
    start_urls = ['http://www.amazon.com/dp/B005890G8Y/']

    def parse(self, response):
        print 'parse'
        for i in xrange(100000000):
            url = 'http://www.amazon.com/dp/%i/' % i
            print i,
            yield self.make_requests_from_url(url)


crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = TestSpider()
crawler.queue.append_spider(spider)
crawler.start()

or

from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "test_spider"
    #start_urls = ['http://www.amazon.com/dp/B005890G8Y/']

    def start_requests(self):
        for i in xrange(100000000):
            url = 'http://www.amazon.com/dp/%i/' % i
            print 'yielding a start url: %s' % url
            yield self.make_requests_from_url(url)

    def parse(self, response):
        '''does nothing'''
        print 'parse' 


crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = TestSpider()
crawler.queue.append_spider(spider)
crawler.start()

scrapy shell raising an exception in IPython

I'm running a standard scrapy shell command and receiving the following output, which occurs after it has finished crawling and enters the shell:

scrapy shell http://doc.scrapy.org/_static/selectors-sample1.html
2011-12-28 19:37:34+0100 [scrapy] INFO: Scrapy 0.12.0.2543 started (bot: botname)
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-12-28 19:37:34+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-12-28 19:37:35+0100 [scrapy] DEBUG: Enabled item pipelines: DuvelPipeline
2011-12-28 19:37:35+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-12-28 19:37:35+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-12-28 19:37:35+0100 [default] INFO: Spider opened
2011-12-28 19:37:35+0100 [default] DEBUG: Crawled (404) <GET http://doc.scrapy.org/_static/selectors-sample1.html> (referer: None)
2011-12-28 19:37:35+0100 [default] INFO: Closing spider (finished)
2011-12-28 19:37:35+0100 [default] INFO: Spider closed (finished)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<script type="text'>
[s] item DuvelItem()
[s] request <GET http://doc.scrapy.org/_static/selectors-sample1.html>
[s] response <404 http://doc.scrapy.org/_static/selectors-sample1.html>
[s] settings <CrawlerSettings module=<module 'spidertree.settings' from '/Volumes/HDD/home/duvel/spidertree/settings.pyc'>>
[s] spider <BaseSpider 'default' at 0x10fcdda10>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2011-12-28 19:37:35+0100 [scrapy] ERROR: Shell error
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 484, in run
self.__target(_self.__args, *_self.__kwargs)
--- ---
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker
result = context.call(ctx, function, _args, *_kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, _args, *_kw)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/twisted/python/context.py", line 81, in callWithContext
return func(args,*kw)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/scrapy/shell.py", line 56, in _start
start_python_console(self.vars)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/scrapy/utils/console.py", line 14, in start_python_console
shell = IPython.Shell.IPShellEmbed(argv=[], user_ns=namespace)
exceptions.AttributeError: 'module' object has no attribute 'Shell'

Evidently there's a problem with IPython's shell.
I'm running IPython 0.12 and Scrapy 0.12.0.2543.

Probe command

Some pages depend on certain HTTP request headers being sent in order to render the expected result, and finding out which headers those are is a manual and tedious job.

So, here's an idea for automating this probing mechanism: create a new scrapy command, probe, which takes a URL and a text to look for as arguments.

Scrapy then tries several combinations of HTTP headers (User-Agent, Accept, etc.) and returns a set that works (where "works" means that the given text is found in the response).

Here's a real world example to illustrate:

http://www.storage-cabinets-online.com/IVG2/N/ProductID-118021.htm

The page should contain a string 'var sFeatures', but that string is not returned with the default Scrapy HTTP request headers. So we run scrapy probe on it:

$ scrapy probe http://www.storage-cabinets-online.com/IVG2/N/ProductID-118021.htm 'var sFeatures'
Found set of working headers:
{'Host': 'www.storage-cabinets-online.com', 'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.10) Gecko/20100915 Ubuntu/10.04 (lucid)
Firefox/3.6.10', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-us,en;q=0.5', 'Accept-Encoding': 'gzip,deflate',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7', 'Keep-Alive': '115'}

The scrapy probe command would try a list of different well-known user agents, along with Accept headers, as sketched below.
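
Nothing like this exists in Scrapy today; as a rough illustration of the idea, here is a standalone sketch that tries a few User-Agent/Accept combinations with the standard library and reports the first combination for which the target text appears. The candidate header values are assumptions, not a recommended list:

import itertools
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Scrapy-probe-sketch/0.1",
]
ACCEPT_HEADERS = [
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "*/*",
]


def probe(url, text):
    """Return the first header combination whose response body contains `text`."""
    for ua, accept in itertools.product(USER_AGENTS, ACCEPT_HEADERS):
        headers = {"User-Agent": ua, "Accept": accept}
        req = urllib.request.Request(url, headers=headers)
        try:
            body = urllib.request.urlopen(req, timeout=10).read()
        except OSError:
            continue  # treat network/HTTP errors as "this combination doesn't work"
        if text.encode() in body:
            return headers
    return None


if __name__ == "__main__":
    import sys
    print(probe(sys.argv[1], sys.argv[2]))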

Real support for returning iterators on parse() method

New features and settings:

...
Real support for returning iterators on start_requests() method. The iterator is now consumed during the crawl when the spider is getting idle (r2704)
...

This works! Thanks!

But the issue with iterators on parse() methods still exists:

class AmazonSpider(BaseSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/dp/B005890G8Y/']

    def parse(self, response):
        print 'parse'
        for i in xrange(100000000):
            url = 'http://www.amazon.com/dp/%i/' % i
            print i,
            yield self.make_requests_from_url(url)

This causes memory consumption to grow very fast.

Also, I guess because of this, when the scrapyd started by 'scrapy server' is interrupted, one of my 'scrapy crawl' subprocesses cannot stop for a long time:

2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 102, model 978, years 2007-2007
2011-11-18 14:14:32+0200 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force unclean shutdown
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 622, model 3404, years 2002-2002
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 108, model 1451, years 2003-2003
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 82, model 2293, years 2007-2007
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 42, model 805, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 25, model 3345, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 33, model 665, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 130, model 2139, years 2009-2009
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 92, model 1975, years 2009-2009
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 134, model 3453, years 2010-2010
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 21, model 375, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 8, model 720, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 95, model 644, years 2009-2009
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 72, model 2117, years 2007-2007
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 32, model 636, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 70, model 654, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 106, model 2843, years 2011-2011
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Found item link: http://www.carbusiness.it/527781/auto-nuova/MICROCAR_MC2.ashx
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Redirecting (302) to <GET http://www.carbusiness.it/ricerca/risultati.aspx?idM=106&idMM=&PDa=0&PA=0&idTCb=0&idTCz=0&CC=0&CDa=0&CA=0&idTC=0&EA=0&nr=50&ADa=2011&Aa=2011&KmDa=-1&KmA=-1&G=0> from <POST http://www.carbusiness.it/Default.aspx>
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Crawled (200) <GET http://www.carbusiness.it/ricerca/risultati.aspx?idM=82&idMM=&PDa=0&PA=0&idTCb=0&idTCz=0&CC=0&CDa=0&CA=0&idTC=0&EA=0&nr=50&ADa=2011&Aa=2011&KmDa=-1&KmA=-1&G=0> (referer: http://www.carbusiness.it/jp/jp.aspx?action=load_modelli&id_marca=82)
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Crawled (200) <GET http://www.carbusiness.it/ricerca/risultati.aspx?idM=25&idMM=&PDa=0&PA=0&idTCb=0&idTCz=0&CC=0&CDa=0&CA=0&idTC=0&EA=0&nr=50&ADa=2011&Aa=2011&KmDa=-1&KmA=-1&G=0> (referer: http://www.carbusiness.it/jp/jp.aspx?action=load_modelli&id_marca=25)
2011-11-18 14:14:32+0200 [carbusiness_it] INFO: Closing spider (shutdown)
2011-11-18 14:14:32+0200 [carbusiness_it] DEBUG: Requesting search: brand 131, model 2152, years 2010-2010
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 79, model 2164, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 102, model 2464, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 622, model 3403, years 2002-2002
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 108, model 2853, years 2002-2002
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 82, model 266, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 42, model 2432, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 25, model 413, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 33, model 658, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 130, model 2137, years 2009-2009
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 92, model 1977, years 2008-2008
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 134, model 3452, years 2010-2010
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 21, model 383, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 8, model 2404, years 2011-2011
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 95, model 3214, years 2009-2009
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 72, model 3477, years 2007-2007
2011-11-18 14:14:33+0200 [carbusiness_it] DEBUG: Requesting search: brand 32, model 2388, years 2011-2011

This is because there is a really big loop in the parse method:

def parse2(self, response):
    '''Request search my brands, models and year.'''

    hxs = HtmlXPathSelector(response)
    models = hxs.select("//option/@value").extract()[1:]
    random.shuffle(models)

    brandId = response.meta['brand_id']
    self.log('parse2, brandId=%s, models=%s' % (brandId, models))

    searchPageResponse = response.meta['prev_response']
    hxs = HtmlXPathSelector(searchPageResponse)
    years = hxs.select("//*[@id='AnnoDa']/option/@value").extract()[1:]
    years = list(map(int, years))
    years.sort(reverse= True) # be sure its sorted desc

    for i in range(len(years) - 1): #
        yearTo = years[i]
        yearFrom = years[i + 1]
        if yearFrom == yearTo - 1:
            yearFrom = yearTo # search is inclusive. so do not search two consecutive years

        for modelId in models:
            formdata = {'ddlMarca': brandId, 'ddlModello': modelId,
                        'AnnoDa': yearFrom, 'Annoa': yearTo, 'ddlRisultatiPerPagina': 50}
            formRequest = FormRequest.from_response(searchPageResponse, 'ctl00', formdata= formdata, # we specify which submit button to click
                                callback= self.parseBrand, clickdata= {'name': 'btnRicerca'}, priority= -i)
            self.log('Requesting search: brand %s, model %s, years %d-%d' %
                     (brandId, modelId, yearFrom, yearTo), log.DEBUG)
            yield formRequest

Failing test case for get_meta_refresh function

Originally reported in Trac by Daniel: http://dev.scrapy.org/ticket/111

diff --git a/scrapy/tests/test_utils_response.py b/scrapy/tests/test_utils_response.py
--- a/scrapy/tests/test_utils_response.py
+++ b/scrapy/tests/test_utils_response.py
@@ -62,11 +62,16 @@
         response = Response(url='http://example.org', body=body)
         self.assertEqual(get_meta_refresh(response), (1, 'http://example.org/newpage'))

         # entities in the redirect url
         body = """<meta http-equiv="refresh" content="3; url=&#39;http://www.example.com/other&#39;">"""
         response = Response(url='http://example.com', body=body)
         self.assertEqual(get_meta_refresh(response), (3, 'http://www.example.com/other'))

+        # entities in the redirect url with single quotes
+        body = """<meta http-equiv="refresh" content='3; url=&#39;http://www.example.com/other&#39;'>"""
+        response = Response(url='http://example.com', body=body)
+        self.assertEqual(get_meta_refresh(response), (3, 'http://www.example.com/other'))
+
         # relative redirects
         body = """<meta http-equiv="refresh" content="3; url=other.html">"""
         response = Response(url='http://example.com/page/this.html', body=body)

_disconnectedDeferred error using twisted 11.1.0

Seen with Scrapy 0.14.0, Twisted 11.1.0, and Python 2.6/2.7. After downgrading to Twisted 11.0.0 the error does not show up.

In my case, it happened in a long-running spider which doesn't do anything unusual, and after the error the crawler hangs.

See the log below:

2011-11-23 09:32:40-0600 [projects] DEBUG: Crawled (200) <GET http://www.example.net/p/foo> (referer: http://www.example.net/p?page=51903&sort=users)
2011-11-23 09:32:40-0600 [projects] DEBUG: Crawled (200) <GET http://www.example.net/p/bar> (referer: http://www.example.net/p?page=51903&sort=users)
2011-11-23 09:38:10-0600 [projects] INFO: Crawled 89600 pages (at 42 pages/min), scraped 31075 items (at 16 items/min)
2011-11-23 09:38:11-0600 [-] Unhandled Error
    Traceback (most recent call last):
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 45, in run
        self.crawler.start()
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/scrapy/crawler.py", line 76, in start
        reactor.run(installSignalHandlers=False) # blocking call
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 1169, in run
        self.mainLoop()
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/tcp.py", line 337, in failIfNotConnected
        self.connector.connectionFailed(failure.Failure(err))
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/internet/base.py", line 1055, in connectionFailed
        self.factory.clientConnectionFailed(self, reason)
      File "/home/josemanuel/envs/global/lib/python2.7/site-packages/twisted/web/client.py", line 413, in clientConnectionFailed
        self._disconnectedDeferred.callback(None)
    exceptions.AttributeError: ScrapyHTTPClientFactory instance has no attribute '_disconnectedDeferred'

2011-11-23 09:38:13-0600 [-] Unhandled Error
    (the same '_disconnectedDeferred' traceback is repeated five more times at 09:38:13)

2011-11-23 09:38:58-0600 [projects] INFO: Crawled 89600 pages (at 0 pages/min), scraped 31075 items (at 0 items/min)
2011-11-23 09:39:58-0600 [projects] INFO: Crawled 89600 pages (at 0 pages/min), scraped 31075 items (at 0 items/min)
2011-11-23 09:40:58-0600 [projects] INFO: Crawled 89600 pages (at 0 pages/min), scraped 31075 items (at 0 items/min)

scrapy shell raising an exception with IPython 0.11

2011-10-04 11:33:47+0200 [scrapy] ERROR: Shell error
        Traceback (most recent call last):
          File "/usr/lib/python2.6/threading.py", line 504, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.6/threading.py", line 484, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/local/lib/python2.6/dist-packages/twisted/python/threadpool.py", line 207, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/local/lib/python2.6/dist-packages/twisted/python/context.py", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/local/lib/python2.6/dist-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/shell.py", line 56, in _start
            start_python_console(self.vars)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/console.py", line 14, in start_python_console
            shell = IPython.Shell.IPShellEmbed(argv=[], user_ns=namespace)
        exceptions.AttributeError: 'module' object has no attribute 'Shell'

However, it works fine with IPython 0.10.1.

SSL error 'sslv3 alert illegal parameter' is generated on certain URLs

I previously reported this issue on Trac: http://dev.scrapy.org/ticket/315

For example:

$ scrapy fetch "https://ui2web1.apps.uillinois.edu/BANPROD1/bwskfcls.P_GetCrse"
...
2011-03-24 10:58:03+0000 [default] ERROR: Error downloading <https://ui2web1.apps.uillinois.edu/BANPROD1/bwskfcls.P_GetCrse>: [Failure instance: Traceback (failure with no frames): <class 'OpenSSL.SSL.Error'>: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'sslv3 alert illegal parameter')]

This issue is discussed here http://bugs.python.org/issue11220

It would be nice to be able to specify the SSL method and options on requests, or use Scrapy defaults, instead of those hardcoded in twisted.internet.ssl.ClientContextFactory.

Another option might be to try SSLv3 when an error is encountered with SSLv23.

Defect in FormRequest constructor

Reported by cdeyoung on Trac http://dev.scrapy.org/ticket/323

In the constructor of the FormRequest class in scrapy/http/request/form.py, in the if statement that handles the "formdata" argument, there is a line that says:

    self.method = 'POST'

That line either shouldn't be there, falling back to the base class's method variable, or it should be handled differently if you want FormRequest to default to POST rather than GET, as the Request base class does. Currently, FormRequest objects are hard-coded to be submitted via POST, and that isn't always valid. You should still allow the developer to specify method='GET' when using a FormRequest object, I think.

If you want FormRequest to default to submitting forms via POST, then I would recommend the following change, or something like it:

class FormRequest(Request):
    __slots__ = ()
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        method = kwargs.pop('method', 'POST')

        super(FormRequest, self).__init__(*args, **kwargs)

        if formdata:
            self.method = method
            ...
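
With a change along those lines, a GET form submission could be expressed directly. A minimal sketch of the intended usage (the URL and form fields are placeholders):

from scrapy.http import FormRequest

# With the proposed constructor, method is honoured instead of being forced to
# POST, so the form data ends up encoded in the query string of a GET request.
request = FormRequest(
    "http://www.example.com/search",
    formdata={"q": "shetland wool throw"},
    method="GET",
)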

Pluggable Link Extractor backends

We want to have pluggable link extractor backends, maybe via a LINKEXTRACTOR_CLASS setting.

Some backends that come to mind: pure-regex, scrapely, libxml2, lxml, sgml

The sgml backend is not working very well, as there have been several issues reported about it.

The scrapely backend works quite well and it's pure Python, so it's a good choice, but we'd have to add a dependency on scrapely.

Cookies middleware is slow when crawling many domains from the same spider

When crawling many domains from the same spider, cookies middleware slows down.

The patch is attached.

The main difference is that it only checks for cookies belonging to relevant domains (the potential_domain_matches) instead of all domains, but doing this required a bit of refactoring. It also only calls clear_expired_cookies periodically instead of on every request.

The performance problems are very large and noticeable pretty quickly if a single spider has to manage many cookies. This happened when we had a spider that crawled many sites with the cookies middleware enabled.

It would be great if you could review it, and then maybe we can try it on more websites.

see http://dev.scrapy.org/ticket/333 for more info

Allow misc.load_object() to take a reference to an actual object

It would be nice to have the flexibility of passing a reference to the actual object in settings, as well as specifying dotted paths to objects.

Maybe something as simple as:

if not isinstance(path, basestring):
    return path

...at the beginning of misc.load_object(path)?
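
A complete sketch of what the modified helper might look like (this spells out the reporter's idea; it is not the actual Scrapy implementation):

def load_object(path):
    """Load an object given its dotted path, or return it unchanged if it is
    already an object rather than a string (sketch of the proposed behaviour)."""
    if not isinstance(path, str):  # 'basestring' in the Python 2 original
        return path
    module_path, _, name = path.rpartition('.')
    module = __import__(module_path, fromlist=[name])
    return getattr(module, name)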

scrapy.item.Item - allow item.url along with item['url']

scrapy.item.Item:

def __getattr__(self, name):
    if name in self.fields:
        raise AttributeError("Use item[%r] to get field value" % name)
    raise AttributeError(name)

def __setattr__(self, name, value):
    if not name.startswith('_'):
        raise AttributeError("Use item[%r] = %r to set field value" % \
            (name, value))
    super(DictItem, self).__setattr__(name, value)

I suggest allowing item.url = response.url along with item['url'] - this lets IDEs autocomplete field names:

def __getattr__(self, name):
    if name in self.fields:
        return self.__getitem__(name)
    raise AttributeError(name)

def __setattr__(self, name, value):
    if not name.startswith('_'):
        return self.__setitem__(name, value)
    super(DictItem, self).__setattr__(name, value)
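
With that change, a spider could populate items via attribute access. A small sketch, assuming a hypothetical item class with a url field (the attribute-style line is commented out because stock Scrapy currently raises AttributeError for it):

import scrapy


class PageItem(scrapy.Item):
    url = scrapy.Field()


item = PageItem()
item["url"] = "http://example.com/"   # current API
# item.url = "http://example.com/"    # proposed attribute-style access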

canonicalize_url() function breaks some urls

Current behavior of Scrapy when finding links like:
/fclick.php?variable

is to canonicalize them to:
/fclick.php?variable=

This, however, makes Scrapy follow an incorrect link and causes an error page to load. This is really the fault of the web developers who use query variables without values, but for the sake of robustness Scrapy should follow the links as given.

I made a small patch for this. All it really does is crop out the = when it encounters a variable with a zero-length value.

see http://dev.scrapy.org/ticket/133 for more info
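
To illustrate the reported behaviour, here is a small sketch of the kind of normalisation involved, using only the standard library (this is not Scrapy's actual implementation):

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def naive_canonicalize(url, keep_blank_values=True):
    """Sort query arguments; with keep_blank_values=True a valueless
    'variable' comes back as 'variable=', which is the reported breakage."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    args = sorted(parse_qsl(query, keep_blank_values=keep_blank_values))
    return urlunsplit((scheme, netloc, path, urlencode(args), fragment))


print(naive_canonicalize("http://example.com/fclick.php?variable"))
# -> http://example.com/fclick.php?variable=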

SqlitePriorityQueue.pop() return None may crash Poller.poll()

Previously reported by mrkschan on Trac http://dev.scrapy.org/ticket/313

I have a scrapy project that has several spiders in it. Those spiders are scheduled to execute on an hourly-basis.

scrapyd eventually reports an unhandled error (as shown below). Digging through the Scrapy source, I suspect the concurrency control around sqlite3 access is not well guarded, as the error below is caused by popping from an empty queue.

The case can be explained simply by a scenario with two spiders, A and B (also refer to the source: http://is.gd/ht4HSs):

Spider A's poller poll() gets to line 21 of scrapyd/poller.py.
Meanwhile, Spider B's poller also gets to line 21.
Spider A's poller poll() gets to line 23; the queue in sqlite becomes empty and sqlite3 is locked (according to the Python docs: http://is.gd/67PKpn).
Spider B's poller poll() also gets to line 23 and waits for the lock to be released.
Spider A commits and releases the sqlite lock.
Spider B pop()s from an empty queue and raises an error.

Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/task.py", line 194, in __call__
    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 117, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 944, in unwindGenerator
    return _inlineCallbacks(None, f(*args, **kwargs), Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 823, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/pymodules/python2.6/scrapyd/poller.py", line 24, in poll
    returnValue(self.dq.put(self._message(msg, p)))
  File "/usr/lib/pymodules/python2.6/scrapyd/poller.py", line 33, in _message
    d = queue_msg.copy()
exceptions.AttributeError: 'NoneType' object has no attribute 'copy'

changed scrapyd data dir to be ".scrapy/scrapyd" instead of ".scrapy/.scrapy/scrapyd"

--- a/scrapyd/script.py
+++ b/scrapyd/script.py
@@ -14,7 +14,7 @@ from scrapyd import get_application
from scrapyd.config import Config

def _get_config():
-    datadir = os.path.join(project_data_dir(), '.scrapy', 'scrapyd')
+    datadir = os.path.join(project_data_dir(), 'scrapyd')
    conf = {
        'eggs_dir': os.path.join(datadir, 'eggs'),
        'logs_dir': os.path.join(datadir, 'logs'),

scrapy doesn't respect CLOSESPIDER_ITEMCOUNT

scrapy crawl example --set CLOSESPIDER_ITEMCOUNT=1
<...>
2011-09-19 19:06:40+0400 [example] INFO: Dumping spider stats:
        {'downloader/request_bytes': 337,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 715698,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'closespider_itemcount',
         'finish_time': datetime.datetime(2011, 9, 19, 15, 6, 40, 574142),
         'item_scraped_count': 892,
         'scheduler/memory_enqueued': 2,
         'start_time': datetime.datetime(2011, 9, 19, 15, 6, 38, 560994)}
2011-09-19 19:06:40+0400 [example] INFO: Spider closed (closespider_itemcount)
<...>

Note 'item_scraped_count': 892 in the stats above.
'INFO: Closing spider (closespider_itemcount)' appears in the log right after the first scraped item (the CLOSESPIDER_ITEMCOUNT value in this example), yet hundreds of items are scraped before the spider actually closes.

exception handling

scrapy/spidermanager.py

def create(self, spider_name, **spider_kwargs):
    try:
        return self._spiders[spider_name](**spider_kwargs)
    except KeyError:
        raise KeyError("Spider not found: %s" % spider_name)

change to

def create(self, spider_name, **spider_kwargs):
    try:
        spcls = self._spiders[spider_name]
    except KeyError:
        raise KeyError("Spider not found: %s" % spider_name)
    return spcls(**spider_kwargs)

I had a case where spider.__init__ raised a KeyError, causing a misleading 'Spider not found' exception.

Offsite middleware ignoring port

In my spider I have the following:

class MySpider(BaseSpider):

    allowed_domains = ['192.169.0.15:8080']

and in the parse method I do something like:

    yield Request('http://192.169.0.15:8080/mypage.html', self.my_callback_function)

the result when I run the code is that Scrapy reports:

DEBUG: Filtered offsite request to '192.168.0.15': <GET http://192.168.0.15:8080/mypage.html>

This is wrong - it seems to be ignoring the port. If I change allowed_domains to:

    allowed_domains = ['192.169.0.15:8080', '192.16.0.15']

Then it works as you would expect. No big deal - I can work around it - but I think it is a bug. The problem is located in the should_follow method of the OffsiteMiddleware class in contrib/spidermiddleware/offsite.py.

scrapyd should allow multiple jobs to be scheduled with one URL

Previously reported by agtilden on Trac

scrapyd only allows one job to be scheduled per URL invocation, which makes scheduling lots of jobs needlessly time-consuming.

I propose adding a file upload option that would contain a JSON string with the following structure:

[{"project" : {"spider": {"spider_arg_name": "spider_arg_value"}}},
    {"another_project" : {"another_spider": {"spider_arg_name": "spider_arg_value"}}}]

The return value would be a JSON list of the same length. Each element would be either the job ID assigned by scrapyd or null if the scheduler encountered an exception.

This can be implemented so that the existing parameter passing continues to work for scheduling one job at a time.
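
For comparison, scheduling a single job with the existing schedule.json endpoint looks roughly like this (a sketch using only the standard library; the host, project and spider names are placeholders):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# One HTTP call per job with the current schedule.json endpoint.
data = urlencode({"project": "myproject", "spider": "somespider"}).encode()
with urlopen("http://localhost:6800/schedule.json", data=data) as resp:
    print(json.load(resp))  # e.g. {"status": "ok", "jobid": "..."}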

Add support for FTP downloads

We should add support for following FTP links like:
ftp://www.example.com/somedir/somefile.xml

I suppose Requests will only use the URL attribute (and perhaps some data in meta, if it's needed).

As for Responses, they will contain the file contents in the body, as one would expect.
There should be a flag to enable/disable passive FTP, perhaps even per spider.
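
For reference, later Scrapy versions did add an FTP download handler driven by the URL scheme, with ftp_user/ftp_password request meta keys for credentials (worth double-checking against the docs for your version). A minimal sketch with placeholder host and credentials:

import scrapy


class FtpFileSpider(scrapy.Spider):
    """Sketch only: the FTP host, path and credentials are placeholders."""
    name = "ftp_file"

    def start_requests(self):
        yield scrapy.Request(
            "ftp://ftp.example.com/somedir/somefile.xml",
            meta={"ftp_user": "anonymous", "ftp_password": "guest"},
        )

    def parse(self, response):
        # The response body holds the raw file contents, as the issue expects.
        self.logger.info("Downloaded %d bytes", len(response.body))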

Scrapy hangs if an exception raises in start_requests

When the start_requests iterator throws an exception, it makes engine._next_request fail with an UnhandledError and prevents Scrapy from stopping the engine correctly; it hangs forever.

Ctrl-C is required to stop it.


2012-01-27 17:10:09-0200 [scrapy] INFO: Scrapy 0.15.1 started (bot: testbot)
2012-01-27 17:10:09-0200 [spidername.com] INFO: Spider opened
2012-01-27 17:10:09-0200 [spidername.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-01-27 17:10:09-0200 [-] Unhandled Error
    Traceback (most recent call last):
      File "/home/daniel/src/scrapy/scrapy/commands/crawl.py", line 45, in run
        self.crawler.start()
      File "/home/daniel/src/scrapy/scrapy/crawler.py", line 76, in start
        reactor.run(installSignalHandlers=False) # blocking call
      File "/home/daniel/envs/mytestenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1169, in run
        self.mainLoop()
      File "/home/daniel/envs/mytestenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/home/daniel/envs/mytestenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/daniel/src/scrapy/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "/home/daniel/src/scrapy/scrapy/core/engine.py", line 108, in _next_request
        request = slot.start_requests.next()
      File "/home/daniel/src/testbot/testbot/spiders_dev/myspider.py", line 32, in start_requests
        'spidername.com does not support url mapping'
    exceptions.AssertionError: spidername.com does not support url mapping

^C2012-01-27 17:10:11-0200 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force unclean shutdown
2012-01-27 17:10:11-0200 [spidername.com] INFO: Closing spider (shutdown)
2012-01-27 17:10:11-0200 [spidername.com] INFO: Dumping spider stats:
    {'finish_reason': 'shutdown',
     'finish_time': datetime.datetime(2012, 1, 27, 19, 10, 11, 757102),
     'start_time': datetime.datetime(2012, 1, 27, 19, 10, 9, 487178)}
2012-01-27 17:10:11-0200 [spidername.com] INFO: Spider closed (shutdown)
2012-01-27 17:10:11-0200 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 111865856, 'memusage/startup': 111865856}

Documentation speaks of re.match while meaning re.search

Reported by Vasily Alexeev on Trac http://dev.scrapy.org/ticket/328

In link extractor reference we see passages like

"allow (str or list) โ€“ a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links."

There are two quite different methods for working with regexps: matching and searching. A quick look at the sources reveals that in this case we deal with searching, not matching:

_matches = lambda url, regexs: any((r.search(url) for r in regexs))

So the documentation is clearly misleading and should be corrected.
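
The difference matters for the allow patterns; a quick illustration:

import re

url = "http://example.com/category/books/page/2"
pattern = re.compile(r"/category/books/")

print(bool(pattern.match(url)))   # False: match() anchors at the start of the URL
print(bool(pattern.search(url)))  # True: search() finds the pattern anywhere,
                                  # which is what the link extractor actually does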

SSL compatibility issues with some servers

Attempting to fetch this page over HTTPS resulted in the not very enlightening error "Connection was closed cleanly".

ERROR: Error downloading <GET https://rn.ftc.gov/pls/textilern/wrnquery$.startup>: Connection was closed cleanly.

Some very helpful people on #twisted managed to work out that it was an issue with the server not liking empty fragments in the SSL communication. This could be fixed by specifying the OP_DONT_INSERT_EMPTY_FRAGMENTS option when making the OpenSSL context.

Currently it seems that in order to be able to scrape a website that has any SSL compatibility issues, you have to subclass ClientContextFactory in order to specify the compatibility options, and then subclass HTTPDownloadHandler and tell it to use that context factory.

Could this be made easier? Would it even be a good idea to set some compatibility options by default? Maybe even SSL.OP_ALL?
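
For anyone hitting this today, here is a sketch of the subclassing workaround mentioned above; the exact options (and how to wire the class in) are assumptions worth checking against your Scrapy and Twisted versions:

from OpenSSL import SSL
from twisted.internet.ssl import ClientContextFactory


class TolerantClientContextFactory(ClientContextFactory):
    """Sketch: enable OpenSSL's bug-compatibility options for picky servers."""

    def getContext(self):
        ctx = ClientContextFactory.getContext(self)
        # SSL.OP_ALL bundles various workarounds, including the
        # OP_DONT_INSERT_EMPTY_FRAGMENTS option mentioned in the report.
        ctx.set_options(SSL.OP_ALL)
        return ctx

In more recent Scrapy versions, the DOWNLOADER_CLIENTCONTEXTFACTORY setting can point at such a class, which avoids having to subclass the download handler as well.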

Add dont_cache flag

Reported by binarybug on Trac http://dev.scrapy.org/ticket/325

If a website uses a session to maintain the client's state, then resuming a crawl doesn't work when the cache is enabled. If we could instruct Scrapy not to cache some requests, then resuming a crawl would create a session when those requests are encountered and subsequent requests wouldn't fail, e.g.:

yield Request('http://www.example.com', meta={'dont_cache': True})

Make Request class configurable setting

Previously reported by wecacuee on Trac http://dev.scrapy.org/ticket/301

Rationale

Currently we use scrapy.http.Request as the default class for Request throughout scrapy. This makes scrapy quite bound to HTTP protocol requests. I understand that we have download handlers for "http", "ftp" and "s3", but this enforces request differentiation only by "URI" scheme.

We should have a common, protocol-agnostic request class, scrapy.Request, and the request class should be configurable via settings:

DEFAULT_REQUEST_CLASS = 'scrapy.http.Request'

Support for binding interface to another ip

I have a suggestion for an improvement, which I've added to my local Scrapy installation. I'm sure it can be done more elegantly, but it's a start :)

The address to bind to the socket needs to be passed to reactor.connectTCP in core.downloader.handlers.http._connect as bindAddress.

See the attached diff for an example.
Just add ip_bind = (ip-address, port-number) to your spider if you want to override the default.

see http://dev.scrapy.org/ticket/153 for more info
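
For what it's worth, later Scrapy versions expose this through the bindaddress request meta key rather than a spider attribute. A minimal sketch with placeholder addresses; the exact value format (plain IP string vs. an (ip, port) tuple) is an assumption here, so check the Request.meta documentation for your version:

import scrapy


class BoundSpider(scrapy.Spider):
    """Sketch only: the outgoing IP and target URL are placeholders."""
    name = "bound"

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/",
            # Assumed (outgoing IP, port) tuple, matching Twisted's bindAddress;
            # port 0 lets the OS pick an ephemeral port.
            meta={"bindaddress": ("192.0.2.10", 0)},
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)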
