kong36088 / baiduimagespider

Stars: 845 · Watchers: 25 · Forks: 383 · Size: 54 KB

A super-lightweight Baidu image crawler

License: MIT License

Python 100.00%

crawler spider baidu python3

baiduimagespider's Introduction

BaiduImageSpider

A Baidu image crawler, based on python3.

For personal learning and development.

Crawls Baidu Images in a single thread.

Requirements

Requires Python >= 3.6

Usage

$ python crawling.py -h
usage: crawling.py [-h] -w WORD -tp TOTAL_PAGE -sp START_PAGE
                   [-pp [{10,20,30,40,50,60,70,80,90,100}]] [-d DELAY]

optional arguments:
  -h, --help            show this help message and exit
  -w WORD, --word WORD  search keyword to crawl
  -tp TOTAL_PAGE, --total_page TOTAL_PAGE
                        total number of pages to crawl
  -sp START_PAGE, --start_page START_PAGE
                        starting page
  -pp [{10,20,30,40,50,60,70,80,90,100}], --per_page [{10,20,30,40,50,60,70,80,90,100}]
                        page size (images per page)
  -d DELAY, --delay DELAY
                        crawl delay (interval)

Start crawling images

python crawling.py --word "美女" --total_page 10 --start_page 1 --per_page 30

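Alternatively, edit the search keyword on the last line of crawling.py. Images are saved in the project directory by default. Then run the crawler: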

python crawling.py

Blog

Crawler summary

Screenshot:

Donate

Your support is my greatest encouragement! Thank you for buying me candy.

baiduimagespider's Issues

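Only 150 images are downloaded every time

Hello, and thank you for the script. I changed the middle two arguments in crawler.start('消防车', 10, 1, 30), but however I change them the number of downloaded images always seems to be 150. Could you give me some pointers? Thanks!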
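
For reference, the `start` method in the modified script quoted later on this page turns those arguments into `pn` offsets. The arithmetic can be sketched as below; `page_offsets` is a hypothetical helper, not part of the project, and it does not explain why the results would stop at 150 images.

```python
def page_offsets(total_page, start_page=1, per_page=30):
    """Return the pn offsets that start(word, total_page, start_page, per_page) would request.

    Hypothetical helper that mirrors the arithmetic in Crawler.start and get_images.
    """
    start_amount = (start_page - 1) * per_page      # offset of the first image
    amount = total_page * per_page + start_amount   # offset at which the loop stops
    return list(range(start_amount, amount, per_page))


# crawler.start('消防车', 10, 1, 30) asks for 10 pages of 30 images each
offsets = page_offsets(10, 1, 30)
print(offsets)
print(len(offsets) * 30)
```

If the arguments are being picked up, the requested offsets change with them, so a fixed total of 150 would have to come from somewhere else, for example from what the server returns.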

My account was compromised, as a result many spam issues got created across multiple repos. I am deleting all such issues. Please check my tweet: https://x.com/arghyac35/status/1729721954909684064?s=20

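Duplicate images

I want to crawl more images. Why do they start repeating at around 60?

Dear author, could you upload some image resources? I want good-looking ones. Um... as study material.

The author is great

Just some encouragement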

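Error while crawling

I ran the author's code directly from cmd,
and it keeps showing the following error:

The read operation timed out
Unknown error occurred, giving up on saving

Banned after crawling 200 images

I just want to crawl 10,000 images from Baidu. What did I do wrong? (doge)

Runtime error

Error message: UnboundLocalError: local variable 'page' referenced before assignment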

<urlopen error unknown url type: socks5>
-----urlErrorurl: http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=%E5%9C%9F%E5%A3%A4&cg=girl&pn=0&rn=60&itg=0&z=0&fr=&width=&height=&lm=-1&ic=0&s=0&st=-1&gsm=1e0000001e
Traceback (most recent call last):
  File "index.py", line 135, in <module>
    crawler.start('土壤', 1, 1)  # crawl the keyword “二次元 美女”, 10 pages in total (i.e. 10*60=600 images), starting from page 1
  File "index.py", line 128, in start
    self.get_images(word)
  File "index.py", line 114, in get_images
    page.close()
UnboundLocalError: local variable 'page' referenced before assignment
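
The traceback shows `page.close()` being reached even though `urlopen` had already failed, so `page` was never assigned. The source of index.py is not shown here, so the following is only a sketch of one way to avoid that pattern: open the response in a `with` block so no separate `close()` call is needed. `fetch_bytes` is a hypothetical helper name.

```python
import socket
import urllib.error
import urllib.request


def fetch_bytes(url, headers=None, timeout=5):
    """Return the response body, or None if the request fails.

    Hypothetical helper: the response object only exists inside the
    with block, so close() can never run on an unassigned name.
    """
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as page:
            return page.read()
    except (urllib.error.URLError, socket.timeout) as e:
        print("request failed:", e)
        return None


# An unsupported scheme fails inside urlopen without touching the network
print(fetch_bytes("socks5://127.0.0.1/"))
```

The `unknown url type: socks5` message itself suggests a SOCKS proxy setting that urllib cannot handle; that part is a guess, since the reporter's environment is not described.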

Still works, great

With the Baidu crawler, when the page count is very high (e.g. 2000 pages), are all the images duplicates??
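
The thread does not say why results repeat. One workaround, offered as an assumption rather than a fix that exists in the project, is to delete byte-identical files after a crawl. The sketch below uses a hypothetical `remove_duplicate_files` helper; it only catches exact copies, not resized or re-encoded versions of the same picture.

```python
import hashlib
import os
import tempfile


def remove_duplicate_files(folder):
    """Delete files whose bytes match an earlier file; return the removed names."""
    seen = set()
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.unlink(path)          # same content as a file seen earlier
            removed.append(name)
        else:
            seen.add(digest)
    return removed


# Small demonstration on a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name, data in [("1.jpeg", b"aaa"), ("2.jpeg", b"bbb"), ("3.jpeg", b"aaa")]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)
    removed = remove_duplicate_files(tmp)
    remaining = sorted(os.listdir(tmp))
    print(removed, remaining)
```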

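Can it fetch high-resolution images?

As the title says. Many thanks.

Question, looking for help

Downloaded an empty file, skipping!
Downloaded an empty file, skipping
Downloaded an empty file, skipping!
Remote end closed connection without response
Unknown error occurred, giving up on saving

That is the problem: it always downloads empty files. It works on a PC, but running it in Termux (no root) produces this problem. How can I fix it?

Hello, the crawler used to work fine, but it seems to have stopped working recently. The error is shown in the screenshot below.

Could you take a look? Thanks for your trouble.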

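Using it on a general website

Hello

How should I modify it
to download images from a general website?

Thanks

The last block of code won't run? The problem is as follows: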

usage: ipykernel_launcher.py [-h] -w WORD -tp TOTAL_PAGE -sp START_PAGE
[-pp [{10,20,30,40,50,60,70,80,90,100}]]
[-d DELAY]
ipykernel_launcher.py: error: the following arguments are required: -w/--word, -tp/--total_page, -sp/--start_page
An exception has occurred, use %tb to see the full traceback.

SystemExit: 2
Why is this?
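
The usage line names ipykernel_launcher.py, which suggests the script was run inside a Jupyter notebook. There, `sys.argv` already holds the kernel's own arguments, so a `len(sys.argv) > 1` check takes the argparse branch and the required options are reported missing. A minimal sketch of one workaround is to hand `parse_args` an explicit list; the parser below mirrors the options in the usage text above, and the keyword "cat" is a placeholder.

```python
import argparse


def build_parser():
    # Mirrors the options shown in the usage text above
    parser = argparse.ArgumentParser()
    parser.add_argument("-w", "--word", type=str, required=True)
    parser.add_argument("-tp", "--total_page", type=int, required=True)
    parser.add_argument("-sp", "--start_page", type=int, required=True)
    parser.add_argument("-pp", "--per_page", type=int, nargs="?", default=30,
                        choices=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    parser.add_argument("-d", "--delay", type=float, default=0.05)
    return parser


# Passing a list makes argparse ignore sys.argv, which in a notebook holds
# the kernel's arguments rather than the crawler's.
args = build_parser().parse_args(["-w", "cat", "-tp", "2", "-sp", "1"])
print(args.word, args.total_page, args.start_page, args.per_page, args.delay)
```

In a notebook it may be simpler still to skip argparse and call `Crawler(0.05).start(word, total_page, start_page, per_page)` directly.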

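No longer works

Could you fix it again?

Stops working after crawling a bit over 1000 images

About Baidu image URLs

Why do I get a 403 error when I open, in a browser, an image URL taken from the JSON data that Baidu Images returns?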
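
A comment in the modified script quoted later on this page mentions setting the UA and referrer to reduce 403 responses, which suggests the image host checks request headers. Assuming that is the cause, a request carrying browser-like headers can be built as follows. The header values and the `build_image_request` name are assumptions for illustration, and nothing is sent over the network.

```python
import urllib.request

# Assumed values: a browser-style UA string and the Baidu Images home page
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "Referer": "https://image.baidu.com/",
}


def build_image_request(image_url):
    """Build a request that carries browser-like headers (not sent here)."""
    return urllib.request.Request(image_url, headers=HEADERS)


req = build_image_request("https://example.com/picture.jpg")  # placeholder URL
print(req.header_items())
```

The resulting object can be passed to `urllib.request.urlopen`; whether the server accepts it depends on checks that are not documented here.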

Nice

Learned a lot

After "Downloading the next page" is triggered, UnicodeDecodeErrorurl is thrown

-----UnicodeDecodeErrorurl: http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=%E8%A1%80%E5%B8%B8%E8%A7%84&cg=girl&pn=180&rn=60&itg=0&z=0&fr=&width=&height=&lm=-1&ic=0&s=0&st=-1&gsm=1e0000001e

pn suddenly jumped from 0 to 180

It can no longer crawl; how do I update it?

Traceback (most recent call last):
  File "index.py", line 140, in <module>
    crawler.start('树',100,1)
  File "index.py", line 131, in start
    self.get_images(word)
  File "index.py", line 111, in get_images
    rsp_data = json.loads(rsp)
  File "/home/ch/anaconda3/envs/py3/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/ch/anaconda3/envs/py3/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ch/anaconda3/envs/py3/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
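
"Expecting value: line 1 column 1 (char 0)" means the body did not start with JSON at all; it may have been empty or an HTML page. What the server actually returned is not shown in the issue. A defensive parse, sketched here with a hypothetical `parse_json_or_none` helper, lets the caller skip such a page instead of crashing and prints the start of the body for inspection.

```python
import json


def parse_json_or_none(raw):
    """Return the decoded JSON, or None if raw is empty or not JSON."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")
    if not raw.strip():
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        print("not JSON:", e, "| body starts with:", raw[:80])
        return None


print(parse_json_or_none(b""))
print(parse_json_or_none(b"<html>blocked</html>"))
print(parse_json_or_none(b'{"data": []}'))
```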

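How to get around anti-crawling?

I can only crawl two pages; the anti-crawling mechanism kicks in on the third page. What should I do?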

json.decoder.JSONDecodeError: Invalid \escape: line 34 column 151 (char 60496)

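I hit the error in the title when crawling the 120th image. My code is slightly modified to name files after the image's original name rather than an incrementing number: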

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import argparse
import os
import re
import sys
import urllib
import json
import socket
import urllib.request
import urllib.parse
import urllib.error
import time

# Set the default socket timeout, in seconds
timeout = 5
socket.setdefaulttimeout(timeout)


class Crawler:
    # Delay between requests, in seconds
    __time_sleep = 0.1
    __amount = 0
    __start_amount = 0
    __counter = 0
    # More User-Agent strings: http://tools.jb51.net/table/useragent
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    __per_page = 30

    # Fetch image URLs, content, etc.
    # t: delay between image downloads
    def __init__(self, t=0.1):
        self.time_sleep = t

    # Get the file extension
    @staticmethod
    def get_suffix(name):
        m = re.search(r'\.[^\.]*$', name)
        # m is None when the name contains no dot; fall back to .jpeg
        if m and len(m.group(0)) <= 5:
            return m.group(0)
        else:
            return '.jpeg'

    # Save images
    def save_image(self, rsp_data, word):
        if not os.path.exists("./" + word):
            os.mkdir("./" + word)
        # Avoid name clashes: count the images already in the folder
        self.__counter = len(os.listdir('./' + word)) + 1
        for image_info in rsp_data['data']:
            try:
                if 'replaceUrl' not in image_info or len(image_info['replaceUrl']) < 1:
                    continue
                obj_url = image_info['replaceUrl'][0]['ObjUrl']
                thumb_url = image_info['thumbURL']
                url = 'https://image.baidu.com/search/down?tn=download&ipn=dwnl&word=download&ie=utf8&fr=result&url=%s&thumburl=%s' % (urllib.parse.quote(obj_url), urllib.parse.quote(thumb_url))
                time.sleep(self.time_sleep)
                suffix = self.get_suffix(obj_url)
                # Set the User-Agent to reduce 403 errors (the original comment also mentions a referrer, but none is set here)
                opener = urllib.request.build_opener()
                opener.addheaders = [
                    ('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'),
                ]
                urllib.request.install_opener(opener)
                # Save the image; word is the keyword to crawl from Baidu Images, and the program creates a folder named after it
                image_name = url.split('/')[-1]
                # print(f"image_name is {image_name}")
                filepath = './%s/%s' % (word, image_name)
                if os.path.exists(filepath):
                    print(f'Image already exists at {filepath}, skipping download')
                    continue
                else:
                    urllib.request.urlretrieve(url, filepath)  # save the image locally
                if os.path.getsize(filepath) < 5:
                    print("Downloaded an empty file, skipping!")
                    os.unlink(filepath)
                    continue
                print("The folder now contains " + str(self.__counter) + " images")
                self.__counter += 1
            except urllib.error.HTTPError as urllib_err:
                print(urllib_err)
                continue
            except Exception as err:
                time.sleep(1)
                print(err)
                print("Unknown error occurred, giving up on saving")
                continue
            # else:
            #     print("The folder now contains " + str(self.__counter) + " images")
            #     self.__counter += 1
        return

    # Start fetching
    def get_images(self, word):
        search = urllib.parse.quote(word)
        # pn (int): image offset
        pn = self.__start_amount
        while pn < self.__amount:

            url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%s&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&word=%s&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=%s&rn=%d&gsm=1e&1594447993172=' % (search, search, str(pn), self.__per_page)
            # Set headers to avoid 403 errors
            try:
                time.sleep(self.time_sleep)
                req = urllib.request.Request(url=url, headers=self.headers)
                page = urllib.request.urlopen(req)
                rsp = page.read()
                page.close()
            except UnicodeDecodeError as e:
                print(e)
                print('-----UnicodeDecodeErrorurl:', url)
            except urllib.error.URLError as e:
                print(e)
                print("-----urlErrorurl:", url)
            except socket.timeout as e:
                print(e)
                print("-----socket timeout:", url)
            else:
                # Parse the JSON
                # rsp = rsp.decode().replace('\\', '\\\\')
                rsp_data = json.loads(rsp)
                self.save_image(rsp_data, word)
                # Move on to the next page
                print("Downloading the next page")
                pn += self.__per_page
        print("Download task finished")
        return

    def start(self, word, total_page=1, start_page=1, per_page=30):
        """
        Crawler entry point
        :param word: keyword to crawl
        :param total_page: number of pages to fetch; total images = pages x per_page
        :param start_page: starting page number
        :param per_page: number of images per page
        :return:
        """
        self.__per_page = per_page
        self.__start_amount = (start_page - 1) * self.__per_page
        self.__amount = total_page * self.__per_page + self.__start_amount
        self.get_images(word)


if __name__ == '__main__':
    if len(sys.argv) > 1:
        parser = argparse.ArgumentParser()
        parser.add_argument("-w", "--word", type=str, help="search keyword to crawl", required=True)
        parser.add_argument("-tp", "--total_page", type=int, help="total number of pages to crawl", required=True)
        parser.add_argument("-sp", "--start_page", type=int, help="starting page", required=True)
        parser.add_argument("-pp", "--per_page", type=int, help="page size (images per page)", choices=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100], default=30, nargs='?')
        parser.add_argument("-d", "--delay", type=float, help="crawl delay (interval)", default=0.05)
        args = parser.parse_args()

        crawler = Crawler(args.delay)
        crawler.start(args.word, args.total_page, args.start_page, args.per_page)  # crawl using the command-line arguments
    else:
        # With no arguments, the program runs the defaults below
        crawler = Crawler(0.05)  # crawl delay of 0.05

        crawler.start('警车', 10, 1, 30)  # keyword “警车”, 10 pages starting from page 1, 30 images per page (10*30=300 images in total)
        # crawler.start('二次元 美女', 10, 1)  # keyword “二次元 美女”
        # crawler.start('帅哥', 5)  # keyword “帅哥”
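
"Invalid \escape" means the response contains a backslash followed by a character that JSON does not allow after one. The issue does not show which sequence was involved; `\'` is used below purely as an assumed example. The commented-out `replace('\\', '\\\\')` line in the script doubles every backslash, which would also break valid escapes such as `\"`. A narrower approach is to double only the backslashes that start an undefined escape; `escape_invalid_backslashes` is a hypothetical helper name.

```python
import json
import re

# Characters that may legally follow a backslash in a JSON string
_VALID_ESCAPES = set('"\\/bfnrtu')


def escape_invalid_backslashes(text):
    """Double the backslash of any escape sequence JSON does not define."""
    def fix(match):
        ch = match.group(1)
        return match.group(0) if ch in _VALID_ESCAPES else "\\\\" + ch
    # Consume backslash pairs left to right so valid escapes stay intact
    return re.sub(r"\\(.)", fix, text, flags=re.DOTALL)


bad = r'{"title": "it\'s", "path": "a\\b"}'
data = json.loads(escape_invalid_backslashes(bad))
print(data)
```

With bytes from `page.read()`, decode first, for example `escape_invalid_backslashes(rsp.decode('utf-8'))`. A `\u` that is not followed by four hex digits would still fail to parse; this sketch does not cover that case.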

Seems to be broken; can't get it to work

json.decoder.JSONDecodeError: Invalid \escape: line 12 column 145 (char 30889)

Hi, great job.
When I check the issue history, it seems you have resolved this issue, but when I try to run it, I hit the same issue.
Could you check it? Thanks