
picture_scrapy

Project URL: https://github.com/HeLiangHIT/picture_scrapy

.=======================================================================================================.
||           _          _                                                                              ||
||   _ __   (_)   ___  | |_   _   _   _ __    ___           ___    ___   _ __    __ _   _ __    _   _  ||
||  | '_ \  | |  / __| | __| | | | | | '__|  / _ \  _____  / __|  / __| | '__|  / _` | | '_ \  | | | | ||
||  | |_) | | | | (__  | |_  | |_| | | |    |  __/ |_____| \__ \ | (__  | |    | (_| | | |_) | | |_| | ||
||  | .__/  |_|  \___|  \__|  \__,_| |_|     \___|         |___/  \___| |_|     \__,_| | .__/   \__, | ||
||  |_|                                                                                |_|      |___/  ||
|'-----------------------------------------------------------------------------------------------------'|
||                                                                                   -- 美女图片爬取框架。||
|'====================================================================================================='|
||                                                  .::::.                                             ||
||                                                .::::::::.                                           ||
||                                                :::::::::::                                          ||
||                                                ':::::::::::..                                       ||
||                                                .:::::::::::::::'                                    ||
||                                                  '::::::::::::::.`                                  ||
||                                                    .::::::::::::::::.'                              ||
||                                                  .::::::::::::..                                    ||
||                                                .::::::::::::::''                                    ||
||                                     .:::.       '::::::::''::::                                     ||
||                                   .::::::::.      ':::::'  '::::                                    ||
||                                  .::::':::::::.    :::::    '::::.                                  ||
||                                .:::::' ':::::::::. :::::.     ':::.                                 ||
||                              .:::::'     ':::::::::.::::::.      '::.                               ||
||                            .::::''         ':::::::::::::::'       '::.                             ||
||                           .::''              '::::::::::::::'        ::..                           ||
||                        ..::::                  ':::::::::::'         :'''`                          ||
||                     ..''''':'                    '::::::.'                                          ||
|'====================================================================================================='|
||                                                                              [email protected] ||
||                                                                       https://github.com/HeLiangHIT ||
'======================================================================================================='

Project Introduction

A picture-crawling framework implemented with scrapy. It combines the UserAgentMiddleware/ChromeDownloaderMiddleware middlewares with a RedisSetPipeline pipeline that stores the crawled picture records in a redis set; a separate concurrent asynchronous downloader then pops the picture addresses from redis one by one and batch-downloads and saves them.
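For orientation, here is a minimal sketch of what such a redis-set pipeline can look like, assuming the redis-py client; the key and field names are illustrative, not necessarily the project's actual schema:

# Minimal sketch of a redis-set pipeline (assumes the redis-py client).
# Key and field names are illustrative, not the project's actual schema.
import json
import redis

class RedisSetPipeline:
    def open_spider(self, spider):
        # Host/port would come from settings.py (REDIS_IP / REDIS_PORT).
        self.client = redis.Redis(host='127.0.0.1', port=6379)
        self.key = 'picture:%s' % spider.name

    def process_item(self, item, spider):
        # SADD into a set, so duplicate picture records are dropped for free.
        self.client.sadd(self.key, json.dumps(dict(item)))
        return item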

Sample results:

[screenshots: crawl in progress / crawl in progress / downloaded pictures]

PS. The crawl looks a bit slow but is actually astonishingly productive: after one night of crawling I had more pictures 👀 than I could look through in a lifetime... so many that I'm already suffering from aesthetic fatigue.

Software Architecture

Four picture spiders are implemented: jiandan, meizitu, mzitu, and mmjpg (the names used with scrapy crawl below).

The advantage of this design is that it supports "distributed crawling + distributed downloading". For example, I crawl picture addresses on a mac while a windows machine with an external drive attached downloads the pictures; win and mac split the work nicely. With more machines the load can be spread even further.

Installation && Usage

  1. On one machine, start redis-server path/to/redis.conf. In the config, comment out bind 127.0.0.1 ::1 and set protected-mode no so that other machines can connect.
  2. On each machine, git clone this repository, then in the project directory run pip install -r requirement.txt (or use pipenv shell).
  3. In settings.py, set the REDIS_IP and REDIS_PORT parameters to point at that redis instance (see the snippet after this list).
  4. On each machine, run scrapy crawl xxx to crawl the chosen site.
  5. On each machine, run python picture_downloader.py --key='xxx' --dir='xxx' to download that site's pictures; see python picture_downloader.py --help for more options.
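Step 3 amounts to two variables in settings.py; the values below are examples for a LAN setup, not the project's defaults:

# settings.py -- point every machine at the shared redis instance (example values)
REDIS_IP = '192.168.1.100'  # the machine running redis-server
REDIS_PORT = 6379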
Asynchronous coroutine downloader: it continuously reads picture JSON records from redis and downloads them with coroutines into the given folder. A valid record looks like:
`{"url": "http://www.a.com/a/a.jpg", "name": "a.jpg", "folder": "a", "page":"www.a.com"}`

Usage:
  picture_downloader.py [--dir=dir] [--ip=ip] [--port=port] [--key=key] [--empty_exit=empty_exit] [--concurrency=concurrency]
  picture_downloader.py --version
Options:
  --dir=dir                    select picture save dir. * default: '$HOME/Pictures/scrapy/'
  --ip=ip                      select redis ip. [default: 127.0.0.1]
  --port=port                  select redis port. [default: 6379]
  --key=key                    select redis key. [default: picture:jiandan]
  --empty_exit=empty_exit      select whether to exit when the redis set is empty. [default: true]
  --concurrency=concurrency    select the concurrency number of the downloader. [default: 20]
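The downloader's core cycle (pop a JSON record from the set, fetch its url, save it under folder/name) can be sketched as follows, assuming redis-py and aiohttp. This illustrates the mechanism only, not the project's actual code; the real downloader runs many such fetches concurrently:

# Sketch of the redis-set -> fetch -> save cycle (assumes redis-py and aiohttp).
import asyncio
import json
import os

import aiohttp
import redis

async def download_one(session, record, base_dir):
    folder = os.path.join(base_dir, record.get("folder", ""))
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, record["name"])
    if os.path.exists(path):     # skip files that already exist locally
        return
    async with session.get(record["url"]) as resp:
        with open(path, "wb") as f:
            f.write(await resp.read())

async def main(key="picture:jiandan", base_dir="pics"):
    client = redis.Redis(host="127.0.0.1", port=6379)
    async with aiohttp.ClientSession() as session:
        while True:
            raw = client.spop(key)   # pop one JSON record from the redis set
            if raw is None:          # set drained: the --empty_exit behaviour
                break
            await download_one(session, json.loads(raw), base_dir)

if __name__ == "__main__":
    asyncio.run(main())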

An example from my own setup:

function start_crawl(){
    name=$1
    rm -f log/${name}.log
    scrapy crawl ${name} &
    sleep 2 && python picture_downloader.py --key=picture:${name} --dir=/Users/heliang/Pictures/scrapy/${name} --empty_exit=0 --concurrency=20
}
function stop_crawl(){
    name=$1
    while [ $(ps -ef | grep "scrapy crawl ${name}" | grep -v grep | wc -l) -ge 1 ]; do
        ps -ef | grep "scrapy crawl ${name}" | grep -v grep | awk '{print $2}' | xargs kill # stop the spider
        sleep 1
    done
}
function clear_all(){
    while [ $(ps -ef | grep "scrapy crawl" | grep -v grep | wc -l) -ge 1 ]; do
        ps -ef | grep 'scrapy crawl' | grep -v grep | awk '{print $2}' | xargs kill # stop all spiders
    done
    while [ $(ps -ef | grep "chromedriver" | grep -v grep | wc -l) -ge 1 ]; do
        ps -ef | grep chromedriver | grep -v grep | awk '{print $2}' | xargs kill -9 # clean up any leftover chromedriver processes
    done
    rm -f log/*.log
}

start_crawl jiandan # meizitu mzitu mmjpg

TODO

  1. Proxy IPs: no IP bans encountered so far, so no proxy pool is implemented; one can be added later if needed.
  2. Skip downloads of files that already exist locally -- done
  3. Skip pages that have already been crawled (certain index pages excepted), even across machine/spider restarts -- use RedisCrawlSpider (see the sketch after this list)
  4. Extend to video crawling, with a video_downloader.py modeled on picture_downloader.py
  5. Add more convenient pipelines backed by mysql/sqlite3 etc. to store the data, easing later feature work
  6. Drop dependence on some of the system default middlewares to speed up processing
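For item 3, the gist is to keep the set of visited page URLs in redis rather than in process memory, so it survives restarts; scrapy-redis's RedisCrawlSpider automates this. A minimal sketch of the underlying check (the key name is illustrative):

# Restart-proof "seen page" check backed by redis (key name illustrative).
import redis

client = redis.Redis(host="127.0.0.1", port=6379)

def is_new_page(url, key="picture:seen_pages"):
    # SADD returns 1 only the first time a member is added, so a single
    # round trip both records the URL and says whether it still needs crawling.
    return client.sadd(key, url) == 1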

Contributing

  1. Fork this repository
  2. Create a Feat_xxx branch
  3. Commit your code
  4. Open a Pull Request

Scan the QR code below to follow the author for more updates. And if this project helps you, a star helps support continued development.

[QR code: follow the author]


picture_scrapy's Issues

Suggestion: a viewing-centered experience

Suggested enhancements toward a viewing-centered experience:

  1. Move crawling into the background. It should run silently; the user should not need to know that crawling exists.
  2. A picture/video browsing shell. Support opening a random theme/album, and remember viewing history so the same one is not opened again next time.
  3. Automatic maintenance of local storage. Cap total size, auto-refresh popular content, delete stale items, and so on.
  4. (plus) Simple self-/social-entertainment features such as star/like.
