
scrapyproject's Introduction

A collection of hands-on Scrapy projects

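AoiSolas (a small Scrapy "flower thief" crawler)

This crawler scrapes an entire photo site: more than 4,000 models, more than 100,000 images, over 10 GB of data in total. Project details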


scrapyMysql

How do you store data scraped with Scrapy in MySQL? How do you write a Scrapy Pipeline component? This project explains both in detail. Project details

InputMongodb

This project demonstrates how to store data scraped with Scrapy in MongoDB. Project details

ImageSpider

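This project demonstrates how to download images with Scrapy. Once you have learned this, you can go collect images yourself. Project details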

ImagesRename

Download images with Scrapy, rename them, and store them in different directories. Project details

scrapyproject's People

Contributors

cuanboy


scrapyproject's Issues

A subclass can override def item_completed(self, results, item, info) to rename the downloaded files

import os

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from ImageSpider.settings import IMAGES_STORE as images_store


class ImagespiderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Request each image URL in turn. If the item carries a single URL
        # rather than a collection, yield it directly without looping.
        for image_url in item['imgurl']:
            yield Request(image_url)

    # def file_path(self, request, response=None, info=None):
    #     # Rename the file. Without this override the image name is a hash,
    #     # i.e. an unreadable string of characters.
    #     image_guid = request.url.split('/')[-1]  # last URL segment becomes the file name
    #     return image_guid

    # def item_completed(self, results, item, info):
    #     # Rename the files, moving them from the default location
    #     # D:\ImageSpider\full\* to D:\ImageSpider\*.jpg, and use the last
    #     # segment of each URL in item['imgurl'] as the file name.
    #     # Functionally similar to file_path.
    #     image_path = [x["path"] for ok, x in results if ok]
    #     for i in range(len(image_path)):
    #         os.rename(images_store + '/' + image_path[i],
    #                   images_store + '/' + item['imgurl'][i].split('/')[-1])
    #     return item
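
The index-based pairing in the commented-out item_completed assumes that every download succeeded. If one fails, image_path and item['imgurl'] fall out of step. Below is a minimal sketch of a pairing that avoids this, assuming each successful entry in results is a dict carrying 'url' and 'path', as in Scrapy's documented results format. The helper name plan_renames and the sample data are hypothetical, and the helper only computes the rename pairs without touching the file system.

```python
import posixpath


def plan_renames(results, store):
    """Return (source, destination) path pairs for successful downloads.

    `results` is the list of (success, info) tuples passed to
    item_completed; for successes, info is assumed to hold 'url' and 'path'.
    """
    pairs = []
    for ok, info in results:
        if not ok:
            continue  # failed downloads carry an error object, not a dict
        filename = info["url"].split("/")[-1]  # last URL segment as file name
        src = posixpath.join(store, info["path"])
        dst = posixpath.join(store, filename)
        pairs.append((src, dst))
    return pairs


# Hypothetical sample data: the second download failed.
sample_results = [
    (True, {"url": "http://example.com/a/1.jpg", "path": "full/abc.jpg"}),
    (False, Exception("403")),
    (True, {"url": "http://example.com/a/3.jpg", "path": "full/def.jpg"}),
]
renames = plan_renames(sample_results, "D:/ImageSpider")
```

Inside item_completed, each pair could then be passed to os.rename before returning item.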

[September 4, 2019] The original site updated its anti-scraping measures; the code changes are as follows

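The original site changed its anti-scraping mechanism, so the code needs matching changes (as of September 4, 2019, changing two places is enough to get past it):
① In AoiSolaSpider.py, change allowed_domains = ["www.mm131.com"] to allowed_domains = ["www.mm131.com", "www.mm131.net"] (this fixes the problem of the content requests being filtered).
② In middlewares.py, replace the entire body of the process_request function in the AoisolasSpiderMiddleware class with: request.headers['referer'] = "http://www.mm131.com/?zzaqkey=4087969942" (this gets around the hotlink protection).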
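
The second change can be sketched as follows. This is a minimal illustration rather than the project's actual file: the class name RefererMiddleware and the FakeRequest stand-in are hypothetical, and only the header assignment comes from the issue. In Scrapy, process_request is a downloader middleware hook, so in a real project the class would presumably need to be enabled under DOWNLOADER_MIDDLEWARES and the request would be a scrapy.Request.

```python
REFERER = "http://www.mm131.com/?zzaqkey=4087969942"  # value quoted in the issue


class RefererMiddleware:
    """Hypothetical downloader middleware that sets a fixed Referer header."""

    def process_request(self, request, spider):
        request.headers["referer"] = REFERER
        return None  # None lets Scrapy continue processing the request


class FakeRequest:
    """Stand-in for scrapy.Request, used only for this illustration."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


req = FakeRequest("http://example.com/pic/1.jpg")
result = RefererMiddleware().process_request(req, spider=None)
```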

ModuleNotFoundError: No module named 'AoiSolas' error

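I copied the code exactly, but I always get this error:
Traceback (most recent call last):
  File "D:/zart/Aoisolas/Aoisolas/spiders/AoiSolaSpider.py", line 13, in <module>
    from AoiSolas.items import AoisolasItem
ModuleNotFoundError: No module named 'AoiSolas'
and the crawler will not run.
Python version: 3.7.
I tried the earlier projects and they all work; only this one raises the error. My skills are limited and I cannot find the cause, so any pointers would be appreciated.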

Download failed

I keep getting WARNING: File (code: 403): Error downloading file from *** referred in ***
WARNING: Dropped: Item contains no images. It seems not a single image has ever downloaded successfully. Environment: Win10 + Python 3.6 + Scrapy 1.5

No module named 'PIL'

The following error is displayed:
    from scrapy.pipelines.images import ImagesPipeline
  File "D:\Anaconda3\envs\my_env\lib\site-packages\scrapy\pipelines\images.py", line 15, in <module>
    from PIL import Image
ModuleNotFoundError: No module named 'PIL'

pip install pillow
