
easy-image-scraping's Introduction

arxiv project page

Easy Image Scraping from Google, Bing, Yahoo and Baidu

Automatically scrape images matching your query from the popular search engines

  • Google
  • Bing
  • Baidu
  • Yahoo (currently only low resolution)

using an easy-to-use front end or scripts.

This code is part of a paper (citation); also check the project page if you are interested in creating a dataset for instance segmentation.

Usage

Front End

Start the front end with a single command (adjust the /PATH/TO/OUTPUT to your desired output path)

docker run -it --rm --name easy_image_scraping --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output -p 5000:5000 ghcr.io/a-nau/easy-image-scraping:latest

Enter your query and wait for the results to appear in the output folder. The web application also shows a preview of the downloaded images.

Command Line

Start using the command line with

docker run -it --rm --name easy_image_scraping --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output -p 5000:5000 ghcr.io/a-nau/easy-image-scraping:latest bash

Search for a keyword

If you just want to search for a single keyword, adjust and run search_by_keyword.py

Search for a list of keywords

  • Write the list of search terms in the file search_terms_eng.txt.
  • You can then use Google Translate to translate the whole file into new languages. Change the suffix of the translated file to the respective language.
  • Adjust config.py to define the search engines for each language.
  • Run search_by_keywords_from_files
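As an illustration, a per-language engine mapping in config.py could look like the sketch below. The variable and function names here are assumptions for illustration; the repo's actual config.py options may be named differently.

```python
# Hypothetical per-language engine configuration (names are assumptions).
SEARCH_ENGINES_PER_LANGUAGE = {
    "eng": ["google", "bing", "yahoo"],  # search_terms_eng.txt
    "ger": ["google", "bing"],           # search_terms_ger.txt (translated)
    "chi": ["baidu"],                    # search_terms_chi.txt (translated)
}

def engines_for_file(filename: str) -> list:
    """Derive the engine list from a search_terms_<lang>.txt filename."""
    lang = filename.rsplit("_", 1)[-1].removesuffix(".txt")
    # fall back to a single default engine for unknown languages
    return SEARCH_ENGINES_PER_LANGUAGE.get(lang, ["google"])
```

With this layout, each translated terms file is routed to the engines that work best for its language (e.g. Baidu for Chinese queries).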

Installation (optional)

This step is optional; you can also use the provided container directly.

Docker

You can also build the image yourself using

docker build -t easy_image_scraping .

Then run it with

docker run -it --rm --name easy_image_scraping -p 5000:5000 --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output easy_image_scraping
For a local setup, see the next section.

Local installation

  • Set up an environment using
    conda env create -f environment.yml
    or
    pip install -r requirements.txt
  • To use Selenium, you need to download the ChromeDriver: check your Chrome version and download the matching webdriver version.
  • Unzip it and add it to your PATH. Alternatively, you can adjust scrape_and_download.py
    with webdriver.Chrome(
        executable_path="path/to/chrome_driver.exe",  # add this line
        options=set_chrome_options()
    ) as wd:

Affiliations

FZI Logo

License and Credits

Unless stated otherwise, this project is licensed under the MIT license.

Citation

If you use this code for scientific research, please consider citing

@inproceedings{naumannScrapeCutPasteLearn2022,
	title        = {Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics},
	author       = {Naumann, Alexander and Hertlein, Felix and Zhou, Benchun and Dörr, Laura and Furmans, Kai},
	booktitle    = {{{IEEE Conference}} on {{Machine Learning}} and Applications ({{ICMLA}})},
	date         = 2022
}

Disclaimer

Please be aware of copyright restrictions that might apply to images you download.

easy-image-scraping's People

Contributors

a-nau, m-a-x-s-e-e-l-i-g


easy-image-scraping's Issues

WebDriverException: failed to wait for extension background page to load

After clicking "Start Search" I'm getting the following error:

WebDriverException: Message: unknown error: failed to wait for extension background page to load: chrome-extension://fihnjjcciajhdojfnbdddfaoknhalnja/_generated_background_page.html
from unknown error: page could not be found: chrome-extension://fihnjjcciajhdojfnbdddfaoknhalnja/_generated_background_page.html
Stacktrace: #0 0x56128a9874e3 <unknown> #1 0x56128a6b6c76 <unknown> #2 0x56128a68d896 <unknown> #3 0x56128a6dfa58 <unknown> #4 0x56128a6dc029 <unknown> #5 0x56128a71accc <unknown> #6 0x56128a71a47f <unknown> #7 0x56128a711de3 <unknown> #8 0x56128a6e72dd <unknown> #9 0x56128a6e834e <unknown> #10 0x56128a9473e4 <unknown> #11 0x56128a94b3d7 <unknown> #12 0x56128a955b20 <unknown> #13 0x56128a94c023 <unknown> #14 0x56128a91a1aa <unknown> #15 0x56128a9706b8 <unknown> #16 0x56128a970847 <unknown> #17 0x56128a980243 <unknown> #18 0x7fbc22108fd4 <unknown>
Traceback:
File "/usr/local/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/usr/src/app/src/tools/frontend.py", line 53, in <module>
    main()
File "/usr/src/app/src/tools/frontend.py", line 35, in main
    search_by_keyword(
File "/usr/src/app/src/tools/search_by_keyword.py", line 17, in search_by_keyword
    search_and_download(
File "/usr/src/app/src/scraping/scrape_and_download.py", line 45, in search_and_download
    with webdriver.Chrome(
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/chrome/webdriver.py", line 49, in __init__
    super().__init__(
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/chromium/webdriver.py", line 54, in __init__
    super().__init__(
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 206, in __init__
    self.start_session(capabilities)
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 291, in start_session
    response = self.execute(Command.NEW_SESSION, caps)["value"]
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 346, in execute
    self.error_handler.check_response(response)
File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
    raise exception_class(message, screen, stacktrace)

I have no idea how to solve this.

Idea for SearchGoogle class

I was running into the issue that Google seems to use different CSS class names. To address this, I came up with the idea of modifying the script to search for the most commonly occurring class names and iterate over them to find the desired results. However, I'm facing two challenges that need resolution:

  • The find_common_classnames function is called multiple times, but it should ideally be called only once.
  • I haven't found a way to select the full-size image more dynamically.

Any ideas on this?

class SearchGoogle(Search):
    def __init__(self, wd):
        super(SearchGoogle, self).__init__(wd)
        self.url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"
        # self.url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img&tbs=il:cl"  # creative commons license only

    def find_common_classnames(self):
        image_elements = self.wd.find_elements(By.CSS_SELECTOR, "img")
        # collect every class name of every <img>, splitting multi-class
        # attributes; guard against elements with no class attribute
        classes = [c for img in image_elements for c in (img.get_attribute("class") or "").split()]
        # deduplicate, order by frequency, and drop names that occur only once
        unique = sorted(dict.fromkeys(classes), key=classes.count, reverse=True)
        return [name for name in unique if classes.count(name) > 1]

    def find_thumbnail_elements(self):
        common_classnames = self.find_common_classnames()
        # loop over classnames until we get results
        for selector in common_classnames:
            thumbnail_results = self.wd.find_elements(By.CLASS_NAME, selector)
            if len(thumbnail_results) > 0:
                return thumbnail_results
        return []        

    def get_image_urls(self, thumbnail_results):
        img_urls = []
        for thumbnail in thumbnail_results:
            try:
                thumbnail.click()  # try to click thumbnail to get img src
                self.sleep()
            except Exception:
                continue
            
            # TODO: fix this selector
            images = self.wd.find_elements(By.CSS_SELECTOR, "#islsp img")
            for image in images:
                src = image.get_attribute("src")  # fetch once, not per check
                if src and src.startswith("http"):
                    img_urls.append(src)
        return img_urls

    def click_show_more_button(self):
        # <input jsaction="Pmjnye" class="mye4qd" type="button" value="Weitere Ergebnisse ansehen">
        show_more_btn = self.wd.find_elements(By.CLASS_NAME, "mye4qd")
        if (
            len(show_more_btn) == 1
            and show_more_btn[0].is_displayed()
            and show_more_btn[0].is_enabled()
        ):
            show_more_btn[0].click()
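For the first challenge, one idea is to compute the class-name list once and cache it on the instance, so repeated calls to find_thumbnail_elements reuse the same result. A minimal, self-contained sketch of that caching pattern (illustrative only, not the repo's actual SearchGoogle; the fetch callable stands in for the Selenium query):

```python
from collections import Counter

class ClassNameCache:
    """Sketch: count class names once and reuse the result on later calls."""

    def __init__(self, fetch_class_lists):
        # fetch_class_lists: callable returning per-element class-name lists,
        # e.g. [["a", "b"], ["a"]] (stands in for the Selenium DOM query)
        self._fetch = fetch_class_lists
        self._common = None

    def common_classnames(self):
        if self._common is None:  # computed only on the first call
            counts = Counter(c for classes in self._fetch() for c in classes)
            # keep names that occur more than once, most frequent first
            self._common = [c for c, n in counts.most_common() if n > 1]
        return self._common
```

In SearchGoogle itself the same effect could be had by storing the result in an attribute inside find_common_classnames and returning it early when it is already set.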
