
easy-image-scraping's Introduction

arxiv project page

Easy Image Scraping from Google, Bing, Yahoo and Baidu

Automatically scrape images matching your query from the popular search engines

  • Google
  • Bing
  • Baidu
  • Yahoo (currently only low resolution)

using an easy-to-use front end or scripts.

This code is part of a paper (citation); also check the project page if you are interested in creating a dataset for instance segmentation.

Usage

Front End

Start the front end with a single command (adjust the /PATH/TO/OUTPUT to your desired output path)

docker run -it --rm --name easy_image_scraping --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output -p 5000:5000 ghcr.io/a-nau/easy-image-scraping:latest

Enter your query and wait for the results to appear in the output folder. The web application also shows a preview of the downloaded images.

Command Line

Start using the command line with

docker run -it --rm --name easy_image_scraping --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output -p 5000:5000 ghcr.io/a-nau/easy-image-scraping:latest bash

Search for a keyword

If you just want to search for a single keyword, adjust and run search_by_keyword.py

Search for a list of keywords

  • Write the list of search terms in the file search_terms_eng.txt.
  • You can then use Google Translate to translate the whole file into new languages. Change the suffix of the translated file to the respective language.
  • Adjust config.py to define the search engines for each language.
  • Run search_by_keywords_from_files
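As an illustration, a per-language engine mapping in config.py could look like the sketch below. The variable and function names here are assumptions for illustration; the repo's actual config.py options may be named differently.

```python
# Hypothetical per-language engine configuration (names are assumptions).
SEARCH_ENGINES_PER_LANGUAGE = {
    "eng": ["google", "bing", "yahoo"],  # search_terms_eng.txt
    "ger": ["google", "bing"],           # search_terms_ger.txt (translated)
    "chi": ["baidu"],                    # search_terms_chi.txt (translated)
}

def engines_for_file(filename: str) -> list:
    """Derive the engine list from a search_terms_<lang>.txt filename."""
    lang = filename.rsplit("_", 1)[-1].removesuffix(".txt")
    # fall back to a single default engine for unknown languages
    return SEARCH_ENGINES_PER_LANGUAGE.get(lang, ["google"])
```

With this layout, each translated terms file is routed to the engines that work best for its language (e.g. Baidu for Chinese queries).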

Installation (optional)

This step is optional; you can also use the provided container directly.

Docker

You can also build the image yourself using

docker build -t easy_image_scraping .

Then run it with

docker run -it --rm --name easy_image_scraping -p 5000:5000 --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output easy_image_scraping
For a local setup, see the next section.

Local installation

  • Set up an environment using
    conda env create -f environment.yml
    or
    pip install -r requirements.txt
  • To use Selenium, you need to download the ChromeDriver: check your Chrome version and download the matching webdriver version.
  • Unzip it and add it to your PATH. Alternatively, you can adjust scrape_and_download.py
    with webdriver.Chrome(
        executable_path="path/to/chrome_driver.exe",  # add this line
        options=set_chrome_options()
    ) as wd:

Affiliations

FZI Logo

License and Credits

Unless stated otherwise, this project is licensed under the MIT license.

Citation

If you use this code for scientific research, please consider citing

@inproceedings{naumannScrapeCutPasteLearn2022,
	title        = {Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics},
	author       = {Naumann, Alexander and Hertlein, Felix and Zhou, Benchun and Dörr, Laura and Furmans, Kai},
	booktitle    = {{{IEEE Conference}} on {{Machine Learning}} and Applications ({{ICMLA}})},
	date         = 2022
}

Disclaimer

Please be aware of copyright restrictions that might apply to images you download.

easy-image-scraping's People

Contributors

a-nau, m-a-x-s-e-e-l-i-g


easy-image-scraping's Issues

WebDriverException: failed to wait for extension background page to load

After clicking "Start Search" I'm getting the following error:

WebDriverException: Message: unknown error: failed to wait for extension background page to load: chrome-extension://fihnjjcciajhdojfnbdddfaoknhalnja/_generated_background_page.html
from unknown error: page could not be found: chrome-extension://fihnjjcciajhdojfnbdddfaoknhalnja/_generated_background_page.html
Stacktrace: #0 0x56128a9874e3 <unknown> #1 0x56128a6b6c76 <unknown> #2 0x56128a68d896 <unknown> #3 0x56128a6dfa58 <unknown> #4 0x56128a6dc029 <unknown> #5 0x56128a71accc <unknown> #6 0x56128a71a47f <unknown> #7 0x56128a711de3 <unknown> #8 0x56128a6e72dd <unknown> #9 0x56128a6e834e <unknown> #10 0x56128a9473e4 <unknown> #11 0x56128a94b3d7 <unknown> #12 0x56128a955b20 <unknown> #13 0x56128a94c023 <unknown> #14 0x56128a91a1aa <unknown> #15 0x56128a9706b8 <unknown> #16 0x56128a970847 <unknown> #17 0x56128a980243 <unknown> #18 0x7fbc22108fd4 <unknown>
Traceback:
File "/usr/local/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/usr/src/app/src/tools/frontend.py", line 53, in <module>
    main()
File "/usr/src/app/src/tools/frontend.py", line 35, in main
    search_by_keyword(
File "/usr/src/app/src/tools/search_by_keyword.py", line 17, in search_by_keyword
    search_and_download(
File "/usr/src/app/src/scraping/scrape_and_download.py", line 45, in search_and_download
    with webdriver.Chrome(
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/chrome/webdriver.py", line 49, in __init__
    super().__init__(
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/chromium/webdriver.py", line 54, in __init__
    super().__init__(
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 206, in __init__
    self.start_session(capabilities)
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 291, in start_session
    response = self.execute(Command.NEW_SESSION, caps)["value"]
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 346, in execute
    self.error_handler.check_response(response)
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
    raise exception_class(message, screen, stacktrace)

I have no idea how to solve this.

Idea for SearchGoogle class

I was running into the issue that Google seems to use different CSS class names. To address this, I came up with the idea of modifying the script to search for the most commonly occurring class names and iterate over them to find the desired results. However, I'm facing two challenges that need resolution:

  • The find_common_classnames function is called multiple times, but it should ideally be called only once.
  • I haven't found a way to select the full-size image more dynamically.

Any ideas on this?

class SearchGoogle(Search):
    def __init__(self, wd):
        super(SearchGoogle, self).__init__(wd)
        self.url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"
        # self.url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img&tbs=il:cl"  # creative commons license only

    def find_common_classnames(self):
        image_elements = self.wd.find_elements(By.CSS_SELECTOR, "img")
        # collect every class name of every <img>, splitting multi-class
        # attributes; guard against elements with no class attribute
        classes = [c for img in image_elements for c in (img.get_attribute("class") or "").split()]
        # deduplicate, order by frequency, and drop names that occur only once
        unique = sorted(dict.fromkeys(classes), key=classes.count, reverse=True)
        return [name for name in unique if classes.count(name) > 1]

    def find_thumbnail_elements(self):
        common_classnames = self.find_common_classnames()
        # loop over classnames until we get results
        for selector in common_classnames:
            thumbnail_results = self.wd.find_elements(By.CLASS_NAME, selector)
            if len(thumbnail_results) > 0:
                return thumbnail_results
        return []        

    def get_image_urls(self, thumbnail_results):
        img_urls = []
        for thumbnail in thumbnail_results:
            try:
                thumbnail.click()  # try to click thumbnail to get img src
                self.sleep()
            except Exception:
                continue
            
            # TODO: fix this selector
            images = self.wd.find_elements(By.CSS_SELECTOR, "#islsp img")
            for image in images:
                src = image.get_attribute("src")  # fetch once, not per check
                if src and src.startswith("http"):
                    img_urls.append(src)
        return img_urls

    def click_show_more_button(self):
        # <input jsaction="Pmjnye" class="mye4qd" type="button" value="Weitere Ergebnisse ansehen">
        show_more_btn = self.wd.find_elements(By.CLASS_NAME, "mye4qd")
        if (
            len(show_more_btn) == 1
            and show_more_btn[0].is_displayed()
            and show_more_btn[0].is_enabled()
        ):
            show_more_btn[0].click()
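For the first challenge, one idea is to compute the class-name list once and cache it on the instance, so repeated calls to find_thumbnail_elements reuse the same result. A minimal, self-contained sketch of that caching pattern (illustrative only, not the repo's actual SearchGoogle; the fetch callable stands in for the Selenium query):

```python
from collections import Counter

class ClassNameCache:
    """Sketch: count class names once and reuse the result on later calls."""

    def __init__(self, fetch_class_lists):
        # fetch_class_lists: callable returning per-element class-name lists,
        # e.g. [["a", "b"], ["a"]] (stands in for the Selenium DOM query)
        self._fetch = fetch_class_lists
        self._common = None

    def common_classnames(self):
        if self._common is None:  # computed only on the first call
            counts = Counter(c for classes in self._fetch() for c in classes)
            # keep names that occur more than once, most frequent first
            self._common = [c for c, n in counts.most_common() if n > 1]
        return self._common
```

In SearchGoogle itself the same effect could be had by storing the result in an attribute inside find_common_classnames and returning it early when it is already set.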
