A full-stack engineer who is passionate about AI (Machine Learning) and DevOps.
- Google Analytics Individual Qualification certification
Python library to fetch image URLs based on keywords and download them from Bing.com.
License: MIT License
I followed the install instructions using Anaconda. The script fires up Chrome and starts loading images, but then runs into the following error and fails without downloading any images.
I did download the Chrome 95 driver (matching my Chrome version) and added its location to .bash_profile.
Any ideas?
(BINGIMG) Ember:bing_images me$ python download.py
Save path: /Users/me/Documents/_testImages/train_images/bing/socks/dl_001
Traceback (most recent call last):
File "download.py", line 3, in <module>
bing.download_images("socks",
File "/Users/me/Documents/bing_images/bing_images/bing.py", line 58, in download_images
urls = fetch_image_urls(query, max_number, file_type, filters)
File "/Users/me/Documents/bing_images/bing_images/bing.py", line 28, in fetch_image_urls
urls = crawl_image_urls(keywords, filters, limit)
File "/Users/me/Documents/bing_images/bing_images/crawler.py", line 58, in crawl_image_urls
image_urls = image_url_from_webpage(driver, max_number)
File "/Users/me/Documents/bing_images/bing_images/crawler.py", line 35, in image_url_from_webpage
smb[0].click()
File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 693, in _execute
return self._parent.execute(command, params)
File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 418, in execute
self.error_handler.check_response(response)
File "/opt/anaconda3/envs/BINGIMG/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <a class="btn_seemore cbtn mBtn" role="button" href="javascript:void(0);" h="ID=images,6761.1">...</a> is not clickable at point (960, 26). Other element would receive the click: <div class="it">...</div>
(Session info: chrome=95.0.4638.69)
Stacktrace:
0 chromedriver 0x0000000109b79bb9 chromedriver + 2747321
1 chromedriver 0x000000010a22fe03 chromedriver + 9784835
2 chromedriver 0x0000000109906118 chromedriver + 176408
3 chromedriver 0x0000000109941e21 chromedriver + 421409
4 chromedriver 0x000000010993fa7e chromedriver + 412286
5 chromedriver 0x000000010993d25a chromedriver + 402010
6 chromedriver 0x000000010993bea7 chromedriver + 396967
7 chromedriver 0x000000010992fe49 chromedriver + 347721
8 chromedriver 0x0000000109957da2 chromedriver + 511394
9 chromedriver 0x000000010992fbd5 chromedriver + 347093
10 chromedriver 0x000000010995801e chromedriver + 512030
11 chromedriver 0x000000010996a2fb chromedriver + 586491
12 chromedriver 0x0000000109957fc3 chromedriver + 511939
13 chromedriver 0x000000010992e40e chromedriver + 341006
14 chromedriver 0x000000010992f735 chromedriver + 345909
15 chromedriver 0x0000000109b405df chromedriver + 2512351
16 chromedriver 0x0000000109b5326f chromedriver + 2589295
17 chromedriver 0x0000000109b24cbb chromedriver + 2399419
18 chromedriver 0x0000000109b546ea chromedriver + 2594538
19 chromedriver 0x0000000109b35c8c chromedriver + 2469004
20 chromedriver 0x0000000109b6df58 chromedriver + 2699096
21 chromedriver 0x0000000109b6e0e1 chromedriver + 2699489
22 chromedriver 0x0000000109b7ebc8 chromedriver + 2767816
23 libsystem_pthread.dylib 0x00007fff2051f8fc _pthread_start + 224
24 libsystem_pthread.dylib 0x00007fff2051b443 thread_start + 15
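The click on the "See more" button is being intercepted by an element overlapping it. A common workaround is to scroll the element into view and fall back to a JavaScript click when the normal click fails. This is only a sketch, not code from this repo; `safe_click` is a hypothetical helper name:

```python
def safe_click(driver, element):
    """Scroll `element` into view, try a normal click, and fall back to a
    JavaScript click if another element intercepts it.

    `driver` and `element` are assumed to be a Selenium WebDriver and
    WebElement; the failure seen above raises ElementClickInterceptedException.
    """
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    try:
        element.click()
    except Exception:
        # A JS click bypasses Selenium's "is another element on top?" check
        driver.execute_script("arguments[0].click();", element)
```

In crawler.py that would mean replacing `smb[0].click()` with `safe_click(driver, smb[0])`.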
Since you helped me out, I figured I'd share my mod.
I used colorama, pathlib, and f-strings to clean up the output display, and added zero-padding to the download counters. It might be a bit tacky with the emojis, but here it is in case anyone wants it. Some symbols might not display on GitHub. Preview images below. Apologies if it isn't perfect; I'm no pro, did this quickly, and didn't check all the error messages.
I was going to add the search phrase to the output, but was too busy. It should be easy to add.
This has only been tested on macOS Big Sur. ANSI blue displays as purple on Mojave for me, so you may want to change that if it looks ugly; I'm only running this on my laptop, where the blue looks fine.
try:
    from util import get_file_name, rename, make_image_dir, download_image
except ImportError:  # imported as part of the installed package
    from .util import get_file_name, rename, make_image_dir, download_image
try:
    from crawler import crawl_image_urls
except ImportError:  # imported as part of the installed package
    from .crawler import crawl_image_urls
from typing import List
from multiprocessing.pool import ThreadPool
from time import time as timer
import os
import math
import pathlib
from colorama import init, Fore, Style

init(autoreset=True)
print(Fore.RED + r'''
 ____   ___  _   _  ____    ___ __  __     _     ____ _____ ____
| __ ) |_ _|| \ | |/ ___|  |_ _|  \/  |   / \   / ___| ____/ ___|
|  _ \  | | |  \| | |  _    | || |\/| |  / _ \ | |  _|  _| \___ \
| |_) | | | | |\  | |_| |   | || |  | | / ___ \| |_| | |___ ___) |
|____/ |___||_| \_|\____|  |___|_|  |_|/_/   \_\\____|_____|____/
''' + Style.DIM + '''
    //////// ''' + Style.NORMAL + '''Automated, Multithreaded Chrome URL Fetcher and Image Downloader''' + Style.DIM + ''' ////////''' + Fore.WHITE + '''
    https://github.com/CatchZeng/bing_images
''')
_FINISH = False
def fetch_image_urls(
    query: str,
    limit: int = 20,
    file_type: str = '',
    filters: str = ''
) -> List[str]:
    result = list()
    keywords = query
    if len(file_type) > 0:
        keywords = query + " " + file_type
    urls = crawl_image_urls(keywords, filters, limit)
    for url in urls:
        if isValidURL(url, file_type) and url not in result:
            result.append(url)
            if len(result) >= limit:
                break
    return result
def isValidURL(url, file_type):
    if len(file_type) < 1:
        return True
    return url.endswith(file_type)
def download_images(
    query: str,
    limit: int = 20,
    output_dir='',
    pool_size: int = 20,
    file_type: str = '',
    filters: str = '',
    force_replace=False
):
    start = timer()
    image_dir = make_image_dir(output_dir, force_replace)
    print(f"📁 Save path: {Fore.BLUE}{image_dir}")

    # Fetch more image URLs than needed, since some may turn out to be invalid.
    max_number = math.ceil(limit * 1.5)
    urls = fetch_image_urls(query, max_number, file_type, filters)
    entries = get_image_entries(urls, image_dir)

    print("⬇️  Downloading images\n")
    ps = min(pool_size, limit)
    download_image_entries(entries, ps, limit)

    rename_images(image_dir, query)
    print(f"✅ {Fore.GREEN}Done\n")

    elapsed = timer() - start
    print(f"⏱ {Fore.WHITE}Elapsed time: {Fore.RED}{elapsed:.2f}s\n")
def rename_images(dir, prefix):
    files = os.listdir(dir)
    index = 1
    print(f"🔁 {Fore.BLUE}Renaming images{Fore.LIGHTBLACK_EX}...")
    for f in files:
        if f.startswith("."):
            # Skip hidden files such as .DS_Store
            print(f"{Fore.YELLOW}Skipping hidden file {f}{Fore.LIGHTBLACK_EX}...\n")
            continue
        src = os.path.join(dir, f)
        name = rename(f, index, prefix)
        dst = os.path.join(dir, name)
        os.rename(src, dst)
        index = index + 1
    print(f"{Fore.GREEN} Finished renaming 🎉\n")
def download_image_entries(entries, pool_size, limit):
    global _FINISH
    counter = 1
    _FINISH = False
    pool = ThreadPool(pool_size)
    results = pool.imap_unordered(download_image_with_thread, entries)
    for (url, result) in results:
        if counter > limit:
            _FINISH = True
            pool.terminate()
            break
        if result:
            urldir = pathlib.PurePath(url)
            urlp = urldir.parents[0]
            print(f"{Fore.YELLOW} #{counter:03}{Fore.LIGHTBLACK_EX}: {Fore.LIGHTBLACK_EX}{urlp}/{Fore.WHITE}{urldir.name}\n\t {Fore.GREEN}Downloaded! \n")
            counter = counter + 1
def get_image_entries(urls, dir):
    entries = []
    i = 0
    for url in urls:
        name = get_file_name(url, i, "#tmp#")
        path = os.path.join(dir, name)
        entries.append((url, path))
        i = i + 1
    return entries
def download_image_with_thread(entry):
    if _FINISH:
        return
    url, path = entry
    result = download_image(url, path)
    return (url, result)
if __name__ == '__main__':
    download_images("cat",
                    20,
                    output_dir="/Users/catchzeng/Desktop/cat",
                    pool_size=10,
                    file_type="png",
                    force_replace=True)
import requests
import shutil
import posixpath
import urllib.parse
import os
from colorama import init, Fore, Style
init(autoreset=True)
DEFAULT_OUTPUT_DIR = "bing-images"
def download_image(url, path) -> bool:
    try:
        r = requests.get(url, stream=True)
        if r.status_code == 200:
            with open(path, 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            return True
        else:
            print(f"{Fore.RED} ⚠️ Download image: {Fore.YELLOW}{url}\n{Fore.RED}   Err :: {Fore.WHITE}{r.status_code}\n")
            return False
    except Exception as e:
        print(f"{Fore.RED} ⚠️ Download image: {Fore.YELLOW}{url}{Fore.RED}\n   Err :: {Fore.WHITE}{e}\n")
        return False
def get_file_name(url, index, prefix='image') -> str:
    try:
        path = urllib.parse.urlsplit(url).path
        filename = posixpath.basename(path).split('?')[0]
        ext, _ = file_data(filename)
        return "{}_{}.{}".format(prefix, index, ext)
    except Exception as e:
        print(f"⚠️ {Fore.RED}Get file name: {Fore.YELLOW}{url}{Fore.RED}\n   Err :: {Fore.WHITE}{e}\n")
        return prefix

def rename(name, index, prefix='image') -> str:
    try:
        ext, _ = file_data(name)
        return "{}_{}.{}".format(prefix, index, ext)
    except Exception as e:
        print(f"{Fore.RED}⚠️ Rename: {Fore.YELLOW}{name}{Fore.RED}\n   Err :: {Fore.WHITE}{e}\n")
        return prefix
def file_data(name):
    try:
        ext = name.split(".")[-1]
        name = name.split(".")[0]
        if ext.lower() not in ["jpe", "jpeg", "jfif", "exif", "tiff", "gif", "bmp", "png", "webp", "jpg"]:
            ext = "jpg"
        return (ext, name)
    except Exception as e:
        print(f"{Fore.RED}⚠️ Issue getting file data for: {Fore.YELLOW}{name}{Fore.RED}\n   Err :: {Fore.WHITE}{e}\n")
        # Keep the (extension, name) ordering consistent with the success path
        return ("jpg", name)
def make_image_dir(output_dir, force_replace=False) -> str:
    image_dir = output_dir
    if len(output_dir) < 1:
        image_dir = os.path.join(os.getcwd(), DEFAULT_OUTPUT_DIR)
    if force_replace:
        if os.path.isdir(image_dir):
            shutil.rmtree(image_dir)
    try:
        if not os.path.isdir(image_dir):
            os.makedirs(image_dir)
    except OSError:
        # The directory may already exist or be unwritable; fall through
        pass
    return image_dir
if __name__ == '__main__':
    print("util")
from urllib.parse import quote
import shutil
from selenium import webdriver
import time
import json
from colorama import init, Fore, Style
init(autoreset=True)
BASE_URL = "https://www.bing.com/images/search?"
def gen_query_url(keywords, filters):
    keywords_str = "&q=" + quote(keywords)
    query_url = BASE_URL + keywords_str
    if len(filters) > 0:
        query_url += "&qft=" + filters
    return query_url
def image_url_from_webpage(driver, max_number=10000):
    image_urls = list()
    time.sleep(10)
    img_count = 0
    while True:
        image_elements = driver.find_elements_by_class_name("iusc")
        if len(image_elements) > max_number:
            break
        if len(image_elements) > img_count:
            img_count = len(image_elements)
            driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);")
        else:
            smb = driver.find_elements_by_class_name("btn_seemore")
            if len(smb) > 0 and smb[0].is_displayed():
                smb[0].click()
            else:
                break
        time.sleep(3)
    for image_element in image_elements:
        m_json_str = image_element.get_attribute("m")
        m_json = json.loads(m_json_str)
        image_urls.append(m_json["murl"])
    return image_urls
def crawl_image_urls(keywords, filters, max_number=10000, proxy=None, proxy_type="http"):
    chrome_path = shutil.which("chromedriver")
    chrome_path = "./bin/chromedriver" if chrome_path is None else chrome_path
    chrome_options = webdriver.ChromeOptions()
    if proxy is not None and proxy_type is not None:
        chrome_options.add_argument(
            "--proxy-server={}://{}".format(proxy_type, proxy))
    driver = webdriver.Chrome(chrome_path, chrome_options=chrome_options)
    query_url = gen_query_url(keywords, filters)
    driver.set_window_size(1920, 1080)
    driver.get(query_url)
    image_urls = image_url_from_webpage(driver, max_number)
    driver.close()
    output_num = min(max_number, len(image_urls))
    print(f"{Fore.YELLOW}\n📷 Crawled {Fore.RED}{len(image_urls)}{Fore.YELLOW} image urls.\n")
    return image_urls[0:output_num]
if __name__ == '__main__':
    images = crawl_image_urls(
        "mbot png", "+filterui:aspect-square", max_number=10)
    for i in images:
        print(f"{Fore.BLUE}{i}\n")
By default, the module seems to return only non-explicit search results (SafeSearch: moderate). Is there a way to change this to SafeSearch: off?
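I haven't seen this documented for the package, but Bing's image search reportedly honors an `adlt` query parameter (`off`, `moderate`, `strict`). One way to try it is to patch `gen_query_url` in crawler.py; the `safe_search` parameter below is my own addition, not part of the package, and the parameter's behavior is unverified:

```python
from urllib.parse import quote

BASE_URL = "https://www.bing.com/images/search?"

def gen_query_url(keywords, filters, safe_search="off"):
    # "adlt" reportedly controls SafeSearch: off / moderate / strict (unverified)
    query_url = BASE_URL + "&q=" + quote(keywords)
    if len(filters) > 0:
        query_url += "&qft=" + filters
    query_url += "&adlt=" + safe_search
    return query_url
```

If the parameter turns out to be ignored, setting Bing's SafeSearch cookie in the Selenium session would be the fallback to investigate.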
Hello. Where can I see all the possible filters? I am trying to filter by size.
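There doesn't seem to be an official list. The `filters` argument is passed straight through as Bing's `qft=` URL parameter, so you can apply any filter in Bing's image search UI and copy the `qft=` value out of the address bar. A few values I've collected by hand this way (treat them as unverified assumptions, not a supported API):

```python
# Unofficial qft filter strings, copied from Bing image-search URLs
SIZE_FILTERS = {
    "small": "+filterui:imagesize-small",
    "medium": "+filterui:imagesize-medium",
    "large": "+filterui:imagesize-large",
    "wallpaper": "+filterui:imagesize-wallpaper",
}
OTHER_FILTERS = {
    "square": "+filterui:aspect-square",
    "wide": "+filterui:aspect-wide",
    "transparent": "+filterui:photo-transparent",
}

# e.g. bing.download_images("cat", 20, filters=SIZE_FILTERS["large"])
```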
After running this snippet,
from bing_images import bing
bing.download_images("cat",
2,
output_dir="/path/to/imgs",
pool_size=10,
file_type="png",
force_replace=True,
extra_query_params='&first=1')
I get the following error:
Save path: /path/to/imgs
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [2], in <cell line: 3>()
1 from bing_images import bing
----> 3 bing.download_images("cat",
4 2,
5 output_dir="/path/to/imgs",
6 pool_size=10,
7 file_type="png",
8 force_replace=True,
9 extra_query_params='&first=1')
File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/bing.py:60, in download_images(query, limit, output_dir, pool_size, file_type, filters, force_replace, extra_query_params)
58 # Fetch more image URLs to avoid some images are invalid.
59 max_number = math.ceil(limit*1.5)
---> 60 urls = fetch_image_urls(query, max_number, file_type, filters, extra_query_params=extra_query_params)
61 entries = get_image_entries(urls, image_dir)
63 print("Downloading images")
File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/bing.py:29, in fetch_image_urls(query, limit, file_type, filters, extra_query_params)
27 if len(file_type) > 0:
28 keywords = query + " " + file_type
---> 29 urls = crawl_image_urls(keywords, filters, limit, extra_query_params=extra_query_params)
30 for url in urls:
31 if isValidURL(url, file_type) and url not in result:
File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/crawler.py:59, in crawl_image_urls(keywords, filters, max_number, proxy, proxy_type, extra_query_params)
57 driver.set_window_size(1920, 1080)
58 driver.get(query_url)
---> 59 image_urls = image_url_from_webpage(driver, max_number)
60 driver.close()
62 if max_number > len(image_urls):
File ~/anaconda3/envs/real_meal/lib/python3.10/site-packages/bing_images/crawler.py:26, in image_url_from_webpage(driver, max_number)
23 img_count = 0
25 while True:
---> 26 image_elements = driver.find_elements_by_class("iusc")
27 if len(image_elements) > max_number:
28 break
AttributeError: 'WebDriver' object has no attribute 'find_elements_by_class'
I tried changing find_elements_by_class("iusc") to find_elements("class", "iusc") in crawler.py, since the former is deprecated, but it did not work; it resulted in new issues.
I'm using Chrome 103.0.5060.114 and ChromeDriver 103.0.5060.53. I tried other versions without success.
Thank you in advance.
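Selenium 4 removed the `find_elements_by_*` helpers; the replacement is `driver.find_elements(By.CLASS_NAME, "iusc")` with `from selenium.webdriver.common.by import By`. Until the package is updated, one option is to patch crawler.py with a small compatibility shim; `find_elements_compat` is my own name, not the library's:

```python
def find_elements_compat(driver, class_name):
    """Look up elements by class name on both Selenium 3 and Selenium 4."""
    if hasattr(driver, "find_elements_by_class_name"):
        # Selenium 3.x still has the old helper
        return driver.find_elements_by_class_name(class_name)
    # Selenium 4.x: locate via the By strategy constants
    from selenium.webdriver.common.by import By
    return driver.find_elements(By.CLASS_NAME, class_name)
```

In crawler.py you would replace `driver.find_elements_by_class_name("iusc")` with `find_elements_compat(driver, "iusc")` (and likewise for "btn_seemore"). Note that Selenium 4 also changed driver construction: the chromedriver path now goes through a `Service` object rather than a positional argument.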