bing_images's People

Contributors

catchzeng, luanjunyi

bing_images's Issues

Error on macOS Big Sur

Followed the install using anaconda. Script fires up Chrome and starts loading up images, but then runs into the following error and fails without any images being downloaded.

I did download the Chrome 95 driver (matching Chrome itself) and added its location to my PATH in .bash_profile.

Any ideas?

(BINGIMG) Ember:bing_images me$ python download.py
Save path: /Users/me/Documents/_testImages/train_images/bing/socks/dl_001
Traceback (most recent call last):
  File "download.py", line 3, in <module>
    bing.download_images("socks",
  File "/Users/me/Documents/bing_images/bing_images/bing.py", line 58, in download_images
    urls = fetch_image_urls(query, max_number, file_type, filters)
  File "/Users/me/Documents/bing_images/bing_images/bing.py", line 28, in fetch_image_urls
    urls = crawl_image_urls(keywords, filters, limit)
  File "/Users/me/Documents/bing_images/bing_images/crawler.py", line 58, in crawl_image_urls
    image_urls = image_url_from_webpage(driver, max_number)
  File "/Users/me/Documents/bing_images/bing_images/crawler.py", line 35, in image_url_from_webpage
    smb[0].click()
  File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 693, in _execute
    return self._parent.execute(command, params)
  File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 418, in execute
    self.error_handler.check_response(response)
  File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <a class="btn_seemore cbtn mBtn" role="button" href="javascript:void(0);" h="ID=images,6761.1">...</a> is not clickable at point (960, 26). Other element would receive the click: <div class="it">...</div>
  (Session info: chrome=95.0.4638.69)
Stacktrace:
0   chromedriver                        0x0000000109b79bb9 chromedriver + 2747321
1   chromedriver                        0x000000010a22fe03 chromedriver + 9784835
2   chromedriver                        0x0000000109906118 chromedriver + 176408
3   chromedriver                        0x0000000109941e21 chromedriver + 421409
4   chromedriver                        0x000000010993fa7e chromedriver + 412286
5   chromedriver                        0x000000010993d25a chromedriver + 402010
6   chromedriver                        0x000000010993bea7 chromedriver + 396967
7   chromedriver                        0x000000010992fe49 chromedriver + 347721
8   chromedriver                        0x0000000109957da2 chromedriver + 511394
9   chromedriver                        0x000000010992fbd5 chromedriver + 347093
10  chromedriver                        0x000000010995801e chromedriver + 512030
11  chromedriver                        0x000000010996a2fb chromedriver + 586491
12  chromedriver                        0x0000000109957fc3 chromedriver + 511939
13  chromedriver                        0x000000010992e40e chromedriver + 341006
14  chromedriver                        0x000000010992f735 chromedriver + 345909
15  chromedriver                        0x0000000109b405df chromedriver + 2512351
16  chromedriver                        0x0000000109b5326f chromedriver + 2589295
17  chromedriver                        0x0000000109b24cbb chromedriver + 2399419
18  chromedriver                        0x0000000109b546ea chromedriver + 2594538
19  chromedriver                        0x0000000109b35c8c chromedriver + 2469004
20  chromedriver                        0x0000000109b6df58 chromedriver + 2699096
21  chromedriver                        0x0000000109b6e0e1 chromedriver + 2699489
22  chromedriver                        0x0000000109b7ebc8 chromedriver + 2767816
23  libsystem_pthread.dylib             0x00007fff2051f8fc _pthread_start + 224
24  libsystem_pthread.dylib             0x00007fff2051b443 thread_start + 15
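The click on the "See more" button fails because another element overlaps it at the click point. A common workaround (a sketch, not tested against this repo) is to scroll the element into view and click it via JavaScript in crawler.py's image_url_from_webpage, since a JS click is not subject to Selenium's overlap check; safe_click is a hypothetical helper name:

```python
# Hypothetical patch for crawler.py: replace the bare smb[0].click() with a
# JavaScript-driven click, which is not blocked by an overlapping element.
def safe_click(driver, element):
    # Center the element first so lazy-loaded overlays settle around it.
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)
```

In image_url_from_webpage, `smb[0].click()` would then become `safe_click(driver, smb[0])`.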

Styled with Colorama and f-strings

Since you helped me out, figured I'd share my mod.

Used colorama, pathlib, and f-strings to display the output more cleanly, and added number padding to the downloaded filenames. It might be a bit tacky with the emojis, but I'm sharing it in case anyone wants it. Some symbols might not display on GitHub. Preview images below. Apologies if it isn't perfect; I'm no pro, did this fast, and didn't check all the error messages.

I was going to add the search phrase to the output but was too busy; it should be easy to add.

This has only been tested on macOS Big Sur. ANSI blue displays as purple on Mojave for me, so you may want to change it if it looks off; I'm only running this on my laptop, and the blue looks fine there.

(preview image: bingMods)

bing.py

try:
    from util import get_file_name, rename, make_image_dir, download_image
except ImportError:  # Python 3
    from .util import get_file_name, rename, make_image_dir, download_image
try:
    from crawler import crawl_image_urls
except ImportError:  # Python 3
    from .crawler import crawl_image_urls
from typing import List
from multiprocessing.pool import ThreadPool
from time import time as timer
import os
import math
import pathlib
from colorama import init, Fore, Style
init(autoreset=True)

print(Fore.RED + r'''
    ██████  ██ ███    ██  ██████        ██ ███    ███  █████   ██████  ███████ ███████ 
    ██   ██ ██ ████   ██ ██             ██ ████  ████ ██   ██ ██       ██      ██      
    ██████  ██ ██ ██  ██ ██   ███ █████ ██ ██ ████ ██ ███████ ██   ███ █████   ███████ 
    ██   ██ ██ ██  ██ ██ ██    ██       ██ ██  ██  ██ ██   ██ ██    ██ ██           ██ 
    ██████  ██ ██   ████  ██████        ██ ██      ██ ██   ██  ██████  ███████ ███████ ''' + Style.DIM + '''
    //////// ''' + Style.NORMAL + '''Automated, Multithreaded Chrome URL Fetcher and Image Downloader''' + Style.DIM +''' ////////''' + Fore.WHITE + '''

    https://github.com/CatchZeng/bing_images
''')

_FINISH = False


def fetch_image_urls(
    query: str,
    limit: int = 20,
    file_type: str = '',
    filters: str = ''
) -> List[str]:
    result = list()
    keywords = query
    if len(file_type) > 0:
        keywords = query + " " + file_type
    urls = crawl_image_urls(keywords, filters, limit)
    for url in urls:
        if isValidURL(url, file_type) and url not in result:
            result.append(url)
            if len(result) >= limit:
                break
    return result


def isValidURL(url, file_type):
    if len(file_type) < 1:
        return True
    return url.endswith(file_type)


def download_images(
    query: str,
    limit: int = 20,
    output_dir='',
    pool_size: int = 20,
    file_type: str = '',
    filters: str = '',
    force_replace=False
):
    start = timer()
    image_dir = make_image_dir(output_dir, force_replace)
    print(f"📁 Save path: {Fore.BLUE}{image_dir}")

    # Fetch extra image URLs, since some may turn out to be invalid.
    max_number = math.ceil(limit*1.5)
    urls = fetch_image_urls(query, max_number, file_type, filters)
    entries = get_image_entries(urls, image_dir)

    print(f"⬇️  Downloading images\n")
    ps = pool_size
    if limit < pool_size:
        ps = limit
    download_image_entries(entries, ps, limit)

    rename_images(image_dir, query)

    print(f"✅ {Fore.GREEN}Done\n")
    elapsed = timer() - start
    print(f"โฑ  {Fore.WHITE}Elapsed time: {Fore.RED}%.2fs\n" % elapsed)


def rename_images(dir, prefix):
    files = os.listdir(dir)
    index = 1
    print(f"📝  {Fore.BLUE}Renaming images{Fore.LIGHTBLACK_EX}...")
    for f in files:
        if f.startswith("."):
            print(f"{Fore.YELLOW}Escaping name of {f}{Fore.LIGHTBLACK_EX}...\n")
            continue
        src = os.path.join(dir, f)
        name = rename(f, index, prefix)
        dst = os.path.join(dir, name)
        os.rename(src, dst)
        index = index + 1
    print(f"{Fore.GREEN}    Finished renaming 👍\n")


def download_image_entries(entries, pool_size, limit):
    global _FINISH
    counter = 1
    _FINISH = False
    pool = ThreadPool(pool_size)
    results = pool.imap_unordered(
        download_image_with_thread, entries)
    for (url, result) in results:
        if counter > limit:
            _FINISH = True
            pool.terminate()
            break
        if result:
            urldir = pathlib.PurePath(url)
            urlp = urldir.parents[0]
            print(f"{Fore.YELLOW}   #{str(format(counter, '03'))}{Fore.LIGHTBLACK_EX}: {Fore.LIGHTBLACK_EX}{urlp}/{Fore.WHITE}{urldir.name}\n\t {Fore.GREEN}Downloaded! \n")
            counter = counter + 1


def get_image_entries(urls, dir):
    entries = []
    i = 0
    for url in urls:
        name = get_file_name(url, i, "#tmp#")
        path = os.path.join(dir, name)
        entries.append((url, path))
        i = i + 1
    return entries


def download_image_with_thread(entry):
    if _FINISH:
        return
    url, path = entry
    result = download_image(url, path)
    return (url, result)


if __name__ == '__main__':
    download_images("cat",
                    20,
                    output_dir="/Users/catchzeng/Desktop/cat",
                    pool_size=10,
                    file_type="png",
                    force_replace=True)

util.py

import requests
import shutil
import posixpath
import urllib
import os
from colorama import init, Fore, Style
init(autoreset=True)

DEFAULT_OUTPUT_DIR = "bing-images"


def download_image(url, path) -> bool:
    try:
        r = requests.get(url, stream=True)
        if r.status_code == 200:
            with open(path, 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            return True
        else:
            print(f"{Fore.RED}      ⚠️  Download image: {Fore.YELLOW}{url}\n{Fore.RED}         Err :: {Fore.WHITE}{r.status_code}\n")
            return False
    except Exception as e:
        print(f"{Fore.RED}      ⚠️  Download image: {Fore.YELLOW}{url}{Fore.RED}\n         Err :: {Fore.WHITE}{e}\n")
        return False


def get_file_name(url, index, prefix='image') -> str:
    try:
        path = urllib.parse.urlsplit(url).path
        filename = posixpath.basename(path).split('?')[0]
        type, _ = file_data(filename)
        result = "{}_{}.{}".format(prefix, index, type)
        return result
    except Exception as e:
        print(f"⚠️  {Fore.RED}Get file name: {Fore.YELLOW}{url}{Fore.RED}\n   Err :: {Fore.WHITE}{e}\n")
        return prefix


def rename(name, index, prefix='image') -> str:
    try:
        type, _ = file_data(name)
        result = "{}_{}.{}".format(prefix, index, type)
        return result
    except Exception as e:
        print(f"{Fore.RED}⚠️  Rename: {Fore.YELLOW}{name}{Fore.RED}\n   Err :: {Fore.WHITE}{e}\n")
        return prefix


def file_data(name):
    try:
        type = name.split(".")[-1]
        name = name.split(".")[0]
        if type.lower() not in ["jpe", "jpeg", "jfif", "exif", "tiff", "gif", "bmp", "png", "webp", "jpg"]:
            type = "jpg"
        return (type, name)
    except Exception as e:
        print(f"{Fore.RED}⚠️  Issue getting: {Fore.YELLOW}{name}{Fore.RED}\n    Err :: {Fore.WHITE}{e}\n")
        return ("jpg", name)


def make_image_dir(output_dir, force_replace=False) -> str:
    image_dir = output_dir
    if len(output_dir) < 1:
        image_dir = os.path.join(os.getcwd(), DEFAULT_OUTPUT_DIR)

    if force_replace:
        if os.path.isdir(image_dir):
            shutil.rmtree(image_dir)
    try:
        if not os.path.isdir(image_dir):
            os.makedirs(image_dir)
    except OSError:
        pass

    return image_dir


if __name__ == '__main__':
    print("util")

crawler.py

from urllib.parse import quote
import shutil
from selenium import webdriver
import time
import json
from colorama import init, Fore, Style
init(autoreset=True)

BASE_URL = "https://www.bing.com/images/search?"


def gen_query_url(keywords, filters):
    keywords_str = "&q=" + quote(keywords)
    query_url = BASE_URL + keywords_str
    if len(filters) > 0:
        query_url += "&qft="+filters
    return query_url


def image_url_from_webpage(driver, max_number=10000):
    image_urls = list()

    time.sleep(10)
    img_count = 0

    while True:
        image_elements = driver.find_elements_by_class_name("iusc")
        if len(image_elements) > max_number:
            break
        if len(image_elements) > img_count:
            img_count = len(image_elements)
            driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);")
        else:
            smb = driver.find_elements_by_class_name("btn_seemore")
            if len(smb) > 0 and smb[0].is_displayed():
                smb[0].click()
            else:
                break
        time.sleep(3)
    for image_element in image_elements:
        m_json_str = image_element.get_attribute("m")
        m_json = json.loads(m_json_str)
        image_urls.append(m_json["murl"])
    return image_urls


def crawl_image_urls(keywords, filters, max_number=10000, proxy=None, proxy_type="http"):
    chrome_path = shutil.which("chromedriver")
    chrome_path = "./bin/chromedriver" if chrome_path is None else chrome_path
    chrome_options = webdriver.ChromeOptions()
    if proxy is not None and proxy_type is not None:
        chrome_options.add_argument(
            "--proxy-server={}://{}".format(proxy_type, proxy))
    driver = webdriver.Chrome(chrome_path, chrome_options=chrome_options)

    query_url = gen_query_url(keywords, filters)
    driver.set_window_size(1920, 1080)
    driver.get(query_url)
    image_urls = image_url_from_webpage(driver, max_number)
    driver.close()

    if max_number > len(image_urls):
        output_num = len(image_urls)
    else:
        output_num = max_number

    print(f"{Fore.YELLOW}\n🕷  Crawled {Fore.RED}{len(image_urls)}{Fore.YELLOW} image urls.\n")

    return image_urls[0:output_num]


if __name__ == '__main__':
    images = crawl_image_urls(
        "mbot png", "+filterui:aspect-square", max_number=10)
    for i in images:
        print(f"{Fore.BLUE}{i}\n")

Help with filters

Hello. Where can I see all the possible filters? I am trying to filter by size.
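There is no official documentation: the filters are the undocumented `qft` tokens Bing itself puts in the URL when you pick a filter in its web UI, so any token list is an observation that may change. A sketch below mirrors the URL construction in crawler.gen_query_url and lists a few tokens commonly seen (the repo's own example uses `+filterui:aspect-square`); the size ones are likely what you want:

```python
from urllib.parse import quote

BASE_URL = "https://www.bing.com/images/search?"

# Same construction as crawler.gen_query_url: the filter string is passed
# through verbatim as the qft query parameter.
def gen_query_url(keywords, filters):
    url = BASE_URL + "&q=" + quote(keywords)
    if filters:
        url += "&qft=" + filters
    return url

# Undocumented qft tokens observed in Bing's own UI (treat as assumptions):
#   +filterui:imagesize-small / -medium / -large / -wallpaper
#   +filterui:imagesize-custom_640_480   (minimum width_height)
#   +filterui:aspect-square / -wide / -tall
#   +filterui:photo-transparent
print(gen_query_url("cat", "+filterui:imagesize-large"))
```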

AttributeError: 'WebDriver' object has no attribute 'find_elements_by_class'

After running this snippet,

from bing_images import bing

bing.download_images("cat",
                      2,
                      output_dir="/path/to/imgs",
                      pool_size=10,
                      file_type="png",
                      force_replace=True,
                      extra_query_params='&first=1')

I get the following error:

Save path: /path/to/imgs
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 from bing_images import bing
----> 3 bing.download_images("cat",
      4                       2,
      5                       output_dir="/path/to/imgs",
      6                       pool_size=10,
      7                       file_type="png",
      8                       force_replace=True,
      9                       extra_query_params='&first=1')

File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/bing.py:60, in download_images(query, limit, output_dir, pool_size, file_type, filters, force_replace, extra_query_params)
     58 # Fetch more image URLs to avoid some images are invalid.
     59 max_number = math.ceil(limit*1.5)
---> 60 urls = fetch_image_urls(query, max_number, file_type, filters, extra_query_params=extra_query_params)
     61 entries = get_image_entries(urls, image_dir)
     63 print("Downloading images")

File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/bing.py:29, in fetch_image_urls(query, limit, file_type, filters, extra_query_params)
     27 if len(file_type) > 0:
     28     keywords = query + " " + file_type
---> 29 urls = crawl_image_urls(keywords, filters, limit, extra_query_params=extra_query_params)
     30 for url in urls:
     31     if isValidURL(url, file_type) and url not in result:

File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/crawler.py:59, in crawl_image_urls(keywords, filters, max_number, proxy, proxy_type, extra_query_params)
     57 driver.set_window_size(1920, 1080)
     58 driver.get(query_url)
---> 59 image_urls = image_url_from_webpage(driver, max_number)
     60 driver.close()
     62 if max_number > len(image_urls):

File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/crawler.py:26, in image_url_from_webpage(driver, max_number)
     23 img_count = 0
     25 while True:
---> 26     image_elements = driver.find_elements_by_class("iusc")
     27     if len(image_elements) > max_number:
     28         break

AttributeError: 'WebDriver' object has no attribute 'find_elements_by_class'

I tried changing find_elements_by_class("iusc") to find_elements("class", "iusc") in crawler.py, since the former is deprecated, but it did not work; it resulted in new errors.

I'm using Chrome 103.0.5060.114 and ChromeDriver 103.0.5060.53. I tried other versions unsuccessfully.

Thank you in advance.
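Selenium 4 removed the find_elements_by_* helpers, and the replacement locator string is "class name" with a space (the value of By.CLASS_NAME), which is likely why passing the bare string "class" raised new errors. A sketch of the fix, using a hypothetical helper name; in real code you would import By from selenium.webdriver.common.by instead of hard-coding the string:

```python
# Selenium 4 removed driver.find_elements_by_class_name(...). The modern call is
#   driver.find_elements(By.CLASS_NAME, "iusc")
# where By.CLASS_NAME is the string "class name" -- note the space; the bare
# string "class" is not a valid locator strategy.
CLASS_NAME = "class name"  # value of selenium.webdriver.common.by.By.CLASS_NAME

def find_by_class(driver, name):  # hypothetical helper for crawler.py
    return driver.find_elements(CLASS_NAME, name)
```

In image_url_from_webpage, both class-name lookups ("iusc" and "btn_seemore") would go through this helper, or directly through driver.find_elements(By.CLASS_NAME, ...).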
