cuanboy / scrapyproject

Hands-on Scrapy projects: saving to a database, downloading files, crawling JD.com and Taobao, anti-anti-spider techniques, and more.

Home Page: http://www.scrapyd.cn

Python 100.00%

scrapyproject's Introduction

A collection of hands-on Scrapy projects

AoiSolas (a small Scrapy crawler for a photo site)

This crawler scrapes an entire photo site: more than 4,000 models and over 100,000 images, more than 10 GB of data in total. Project details

scrapyMysql

How do you store data scraped by Scrapy in MySQL? How do you write a Scrapy Pipeline component? This project explains both in detail. Project details

InputMongodb

This project demonstrates how to store data scraped by Scrapy in MongoDB. Project details

ImageSpider

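This project demonstrates how to download images with Scrapy. Once you have learned this, you can go collecting images yourself. Project details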

ImagesRename

Download images with Scrapy, rename them, and save them into different directories. Project details
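As a rough illustration of the rename-and-sort idea, the helper below builds a storage path of the form `<folder>/<file name>` relative to `IMAGES_STORE`. It is a sketch under assumptions, not the project's code: the function name `build_image_path` and the idea of passing a folder name taken from an item field are hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse


def build_image_path(image_url, folder_name):
    """Return '<folder>/<file name>', a path relative to IMAGES_STORE.

    Hypothetical helper: folder_name would typically come from an item
    field such as a gallery title.
    """
    # Remove characters that Windows does not allow in directory names.
    safe_folder = re.sub(r'[\\/:*?"<>|]', "", folder_name).strip() or "unnamed"
    # Use the last segment of the URL path as the file name (query string ignored).
    file_name = posixpath.basename(urlparse(image_url).path)
    return f"{safe_folder}/{file_name}"


if __name__ == "__main__":
    print(build_image_path("http://example.com/pic/2019/01.jpg", "gallery one"))
```

An `ImagesPipeline` subclass could return this value from its `file_path` override, so each image lands in its own directory under a readable name instead of the default hash.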

scrapyproject's Issues

Overriding def item_completed(self, results, item, info) in a subclass makes it possible to rename the downloaded files

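from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request
from ImageSpider.settings import IMAGES_STORE as images_store
import os


class ImagespiderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Request every image URL; if item['imgurl'] is a single URL rather
        # than a collection, yield one Request without looping.
        for image_url in item['imgurl']:
            yield Request(image_url)

    # Alternative: override file_path to choose the name at download time.
    # Without an override, images are saved under a hash-based name.
    # def file_path(self, request, response=None, info=None):
    #     image_guid = request.url.split('/')[-1]  # last URL segment as the file name
    #     return image_guid

    def item_completed(self, results, item, info):
        # Rename each downloaded file, moving it from the default location
        # D:\ImageSpider\full\<hash>.jpg to D:\ImageSpider\<name>.jpg, where
        # <name> is the last segment of the matching URL in item['imgurl'].
        # The effect is similar to overriding file_path.
        # results keeps the order of the requests, so zip pairs each URL
        # with its own result even when some downloads fail.
        for image_url, (ok, result) in zip(item['imgurl'], results):
            if ok:
                os.rename(os.path.join(images_store, result['path']),
                          os.path.join(images_store, image_url.split('/')[-1]))
        # Return the item so that later pipelines still receive it.
        return item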

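The original site has changed its anti-crawler mechanism, so the code needs matching changes (as of September 4, 2019, changing two places is enough to get around it).
① In AoiSolaSpider.py, change allowed_domains = ["www.mm131.com"] to allowed_domains = ["www.mm131.com", "www.mm131.net"] (this fixes the content requests being filtered out).
② In middlewares.py, replace the entire body of process_request in the AoisolasSpiderMiddleware class with: request.headers['referer'] = "http://www.mm131.com/?zzaqkey=4087969942"
(this gets around the hotlink protection).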
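Step ② can be sketched as a standalone downloader middleware. This is a minimal sketch rather than the project's actual file: the class name `RefererMiddleware` is hypothetical, and the Referer value is copied from the issue text, so it may no longer work against the site.

```python
class RefererMiddleware:
    """Downloader middleware sketch: attach a fixed Referer header to every request."""

    # Value quoted in the issue; treat it as an example, not a guaranteed fix.
    REFERER = "http://www.mm131.com/?zzaqkey=4087969942"

    def process_request(self, request, spider):
        # Overwrite the Referer so the image host sees the request as
        # coming from its own site (the usual hotlink-protection check).
        request.headers["referer"] = self.REFERER
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

For Scrapy to call it, the class would need to be listed in the `DOWNLOADER_MIDDLEWARES` setting under the project's own module path.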

After running the code, my IP was blocked by the site. Congrats.

ModuleNotFoundError: No module named 'AoiSolas' error

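I copied the code exactly, but I always get this error:
Traceback (most recent call last):
  File "D:/zart/Aoisolas/Aoisolas/spiders/AoiSolaSpider.py", line 13, in <module>
    from AoiSolas.items import AoisolasItem
ModuleNotFoundError: No module named 'AoiSolas'
The crawler will not run.
Python version: 3.7.
I tried the earlier projects and they all work; only this one fails with this error. My skills are limited and I cannot find the cause, so I would appreciate any guidance.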
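Two possible causes can be read off the traceback, though neither is confirmed in the issue. The directory on disk is `Aoisolas` while the import says `AoiSolas`, and Python matches package names case-sensitively even on Windows. The spider file may also have been run directly instead of through `scrapy crawl` from the project root, which leaves the project root off `sys.path`. The sketch below, with hypothetical helper names, checks both.

```python
import importlib.util
import os


def is_importable(package_name):
    """Return True if Python can locate a top-level package with this exact name."""
    return importlib.util.find_spec(package_name) is not None


def case_mismatches(package_name, search_dir):
    """List directories in search_dir whose names equal package_name except for letter case."""
    return sorted(
        entry
        for entry in os.listdir(search_dir)
        if entry != package_name
        and entry.lower() == package_name.lower()
        and os.path.isdir(os.path.join(search_dir, entry))
    )


if __name__ == "__main__":
    # Run this from the project root (the directory that holds scrapy.cfg).
    name = "AoiSolas"
    print("importable:", is_importable(name))
    print("same name, different case:", case_mismatches(name, os.getcwd()))
```

If the second line lists `Aoisolas`, either renaming the package directory or changing the import to match it should resolve the mismatch.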