maxliaops / scrapy-itzhaopin


95.0 16.0 121.0 168 KB

Python 100.00%

scrapy-itzhaopin's Introduction

scrapy-itzhaopin


scrapy-itzhaopin's Issues

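I've been learning the Scrapy framework recently and thought this example of yours was a good one, so I wrote a spider in the simplest way I could that also crawls Tencent's job listings. The spider runs fine, but it never picks up the data on the first page. I then cloned your program to try it and found that it has the same problem, so I'd like to ask what is going wrong, so we can both improve.
Here is the main part of the code. When run it also scrapes 2000+ records, but none from the first page: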
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# TencenthrItem is the project's Item class; its import is omitted here.


class TencentSpider(CrawlSpider):
    name = "tencenthr"
    # download_delay = 1
    allowed_domains = ["tencent.com"]
    start_urls = ["http://hr.tencent.com/position.php"]

    # Follow only the "next page" link; every followed page goes to parse_item.
    rules = [
        Rule(LinkExtractor(allow=(r'/position\.php\?&start=\d*#a',), restrict_xpaths=('//*[@id="next"]')), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        self.logger.info('Now is spidering in this page:   %s', response.url)
        # One table row per job posting.
        base = response.xpath('//div[@id="position"]/div[1]/table/tr[@class="even" or @class="odd"]')
        # Text of the currently highlighted page number.
        pages = response.xpath('//a[@class="active"]/text()').extract()
        for sel in base:
            item = TencenthrItem()
            item['work'] = sel.xpath('td[1]/a/text()').extract()
            item['worktype'] = sel.xpath('td[2]/text()').extract()
            item['number'] = sel.xpath('td[3]/text()').extract()
            item['location'] = sel.xpath('td[4]/text()').extract()
            item['date'] = sel.xpath('td[5]/text()').extract()
            item['page'] = pages
            yield item

The first page's data is not being crawled; let's discuss a fix



Another issue in this repository: "No tencent.json file is generated after a successful run"
