chuanenlin / shutterscrape

Web scraper for Shutterstock

License: MIT License

Python 100.00%

shutterstock gettyimages webscraper scraper chromedriver selenium beautifulsoup python

shutterscrape's Introduction

ShutterScrape

ShutterScrape is a web scraper for quickly bulk downloading images and videos from Shutterstock. ⚡
It uses Selenium for browser automation and Beautiful Soup for parsing.

Setting up

Configure shutterscrape.py to your Python version.
Install requirements from Terminal:

pip install beautifulsoup4
pip install selenium
pip install lxml

Install ChromeDriver.
(Optional) Configure environment variable paths for python.exe and chromedriver.exe.
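
As a quick check that the packages above were installed, a minimal sketch like the following can help. The helper name missing_modules is hypothetical and not part of shutterscrape.py; the only fixed fact it relies on is that the beautifulsoup4 package is imported as bs4.

```python
import importlib.util

# Import names for the three pip packages listed above.
# Note that the beautifulsoup4 package is imported as "bs4".
REQUIRED_MODULES = ("bs4", "selenium", "lxml")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

This only checks the Python packages. ChromeDriver is a separate executable and is not covered by this check.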

Running

Open terminal in the directory of shutterscrape.py and enter:

python shutterscrape.py

Go grab a cup of coffee while waiting... oh wait, it's already done!

Definitions

Search mode: Enter i for scraping images and v for scraping videos.
Number of search terms: For example, if you want to search for drone single person, enter 3.
Search term: Keyword(s) for searching on Shutterstock.
Number of pages to scrape: Higher number of pages means greater quantity of content with lower keyword precision.
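
To illustrate how the inputs above fit together, here is a minimal sketch that turns a list of search terms and a page count into one URL per results page. The base URL, the query parameter names and the function build_search_urls are assumptions for illustration; the actual script may build its URLs differently.

```python
from urllib.parse import urlencode

# Assumed URL pattern, for illustration only; the real script may differ.
BASE_URL = "https://www.shutterstock.com/search"


def build_search_urls(search_terms, num_pages):
    """Return one search URL per page for the given list of search terms."""
    # Three terms such as ["drone", "single", "person"] become one query string.
    query = " ".join(search_terms)
    return [
        BASE_URL + "?" + urlencode({"searchterm": query, "page": page})
        for page in range(1, num_pages + 1)
    ]


for url in build_search_urls(["drone", "single", "person"], 2):
    print(url)
```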

Updates

10/1/2020
Updated for the new Shutterstock page layout as of 10/1/2020.

4/26/2019
Updated for the new Shutterstock page layout as of 4/26/2019.

10/1/2018
Added GUI for save directory selection.

07/31/2018
More stability fixes.

07/25/2018
Added gettyscrape.py for scraping videos from Getty Images.

07/23/2018
Stability fixes.

shutterscrape's Issues

ERROR:data_channel.cc(44)]

I have this error and nothing gets downloaded:

DevTools listening on ws://127.0.0.1:54653/devtools/browser/95f0ac6b-a67a-4d57-baa6-b29cc3412005
Page 1
[20232:25456:0621/042844.217:ERROR:data_channel.cc(44)] Accepting maxRetransmits = -1 for backwards compatibility
[20232:25456:0621/042844.217:ERROR:data_channel.cc(49)] Accepting maxRetransmitTime = -1 for backwards compatibility
[20232:25456:0621/042845.736:ERROR:data_channel.cc(44)] Accepting maxRetransmits = -1 for backwards compatibility
[20232:25456:0621/042845.736:ERROR:data_channel.cc(49)] Accepting maxRetransmitTime = -1 for backwards compatibility
Page 2

Fix the code plz

Doesn't work

Chromedriver error

I am attempting to scrape and keep getting this error. Any ideas?

Message: session not created: Chrome version must be between 70 and 73
(Driver info: chromedriver=73.0.3683.86,platform=Linux 4.18.0-18-generic x86_64)
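
The message suggests that the installed ChromeDriver (73) only supports Chrome major versions 70 to 73, so the installed Chrome is probably outside that range. A minimal sketch for checking a Chrome version against the range reported in the error text follows. The helper names and the example version 74.0.3729.108 are hypothetical and only illustrate the comparison.

```python
import re


def supported_range(message):
    """Extract the (low, high) Chrome major versions from the error text."""
    match = re.search(r"Chrome version must be between (\d+) and (\d+)", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(chrome_version, message):
    """Check whether a Chrome version string falls in the reported range."""
    bounds = supported_range(message)
    if bounds is None:
        raise ValueError("no version range found in message")
    # Only the major version (the part before the first dot) is compared.
    major = int(chrome_version.split(".")[0])
    return bounds[0] <= major <= bounds[1]


error = "Message: session not created: Chrome version must be between 70 and 73"
print(supported_range(error))
# "74.0.3729.108" is a made-up example of a Chrome version that is too new.
print(is_supported("74.0.3729.108", error))
```

If the check fails, the usual remedy is to install the ChromeDriver release that matches the installed Chrome major version.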

gettyscrape Indentation error

Line 36, section = container.find_element_by_xpath(".//section[@class='image-section']"), raises an indentation error.

Script runs through pages but does not scrape any images

I am having a weird issue running this script. It worked fine before, but now the script visits however many pages I tell it to without scraping any images from them (refer to the screenshot below). The only thing I have modified is under def imageScrape: I commented out the line driver.maximize_window(), since ChromeDriver has trouble maximizing the window and that line seems to crash the script. Otherwise the script is exactly the same. I have already tried copying and pasting the original script from here and commenting out just that line, to make sure it was the only change. I have no idea why it started doing this. What could be the problem?

Terminal Screen Shot

Length of image container

img_container = scraper.find_all("div", {"class":"z_c_b"})

The length of img_container comes out as 1.

So I am not able to retrieve all the images on the page.

How do I solve this?

[Errno socket error] [SSL: CERTIFICATE_VERIFY_FAILED]

Has anyone run into the SSL certificate error? Does anyone know what the solution is? [Errno socket error] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
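
One possible cause is that Python cannot find a usable CA certificate bundle, though the report does not confirm this. The sketch below assumes Python 3 and shows how to download with an explicit, verifying SSL context and optionally point it at a CA bundle file. The function names and the cafile parameter are illustrative and not part of shutterscrape.py.

```python
import ssl
from urllib.request import urlopen


def make_verified_context(cafile=None):
    """Build an SSL context that verifies certificates.

    `cafile` is an optional path to a CA bundle; when None, the
    system default trust store is used.
    """
    return ssl.create_default_context(cafile=cafile)


def fetch(url, cafile=None):
    """Download `url` and return the response body as bytes."""
    with urlopen(url, context=make_verified_context(cafile)) as response:
        return response.read()
```

Keeping verification on and fixing the trust store is generally safer than disabling certificate checks.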

This code is for Python 3.X and without the chromedriver deprecation issue on the "chrome_options" parameter

shutterscrape.zip

Video Scraping not working

Image scraping is working, but for video scraping the videos are not downloading and the script keeps looping on the first page. Any fix? Thanks

crawl data with full resolution

I can use your code to crawl data from Shutterstock, but I only get the image thumbnails, which are low resolution (300x300 pixels). How can I crawl the images at full resolution?

Use API endpoint instead of Scraping

https://www.shutterstock.com/studioapi/search?q={query}

https://www.shutterstock.com/studioapi/images/{image_id}

Use requests, not selenium

You do not need Selenium at all; you can just use requests, which will make your scripts run much faster.
Here is an example I created to show how to do this for Getty Images: https://gist.github.com/xtream1101/090aab1e00e245284a15af3f7cfaab05

Also, for Shutterstock you can hit this URL, where I searched for house:
https://www.shutterstock.com/sstk/api/footage/search?language=en&q=house&page%5Bsize%5D=50&page%5Boffset%5D=0&recordActivity=true&fields%5Bvideos%5D=description%2Cpreview_video_urls%2Cpreview_image_url%2Cduration%2Csizes%2Cuploaded_date

Which will yield nice json data of all the results.
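
As a minimal sketch, the search URL above can be rebuilt from its parts so that the query, page size and offset become parameters. The endpoint and parameter names are taken from the URL in this post; the function name footage_search_url is hypothetical, and whether the endpoint still responds this way is not verified here.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.shutterstock.com/sstk/api/footage/search"

# Fields requested for each video, as listed in the URL above.
VIDEO_FIELDS = (
    "description",
    "preview_video_urls",
    "preview_image_url",
    "duration",
    "sizes",
    "uploaded_date",
)


def footage_search_url(query, page_size=50, offset=0):
    """Build the footage search URL for a query, page size and offset."""
    params = {
        "language": "en",
        "q": query,
        "page[size]": page_size,
        "page[offset]": offset,
        "recordActivity": "true",
        "fields[videos]": ",".join(VIDEO_FIELDS),
    }
    # urlencode percent-encodes the brackets and commas (%5B, %5D, %2C).
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(footage_search_url("house"))
```

Increasing offset by page_size on each request would step through successive result pages, assuming the endpoint paginates the way its parameter names suggest.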

You can also thread the downloads to be even faster.
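
A minimal sketch of threading the downloads with the standard library's thread pool follows. The function download_all and its fetch argument are hypothetical: fetch stands for whatever function downloads a single URL, and it is passed in so the sketch stays independent of any particular HTTP library.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL in parallel and return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = pool.map(fetch, urls)
        return dict(zip(urls, results))


# Stand-in for a real download function, used here only to show the call.
print(download_all(["a", "b", "c"], str.upper))
```

Threads suit this job because downloads spend most of their time waiting on the network. Any exception raised by fetch is re-raised when the results are collected.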

Script runs through pages but does not scrape any images (2)

Hi,

Shutterstock's page format changed again. Line 87 should now be changed to img_container = scraper.find_all("img", {"class":"z_g_h"})

/martijn
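
Class names such as z_c_b and z_g_h appear to change whenever the page layout changes, so one option is to fall back to collecting every img tag when the class filter finds nothing. The sketch below uses the standard library's html.parser rather than Beautiful Soup so that it runs without extra packages. The class ImageSourceCollector and the function image_sources are hypothetical and not part of shutterscrape.py.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src of <img> tags, optionally filtered by CSS class."""

    def __init__(self, class_name=None):
        super().__init__()
        self.class_name = class_name
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.class_name is not None and self.class_name not in classes:
            return
        if attributes.get("src"):
            self.sources.append(attributes["src"])


def image_sources(html, class_name=None):
    """Return image URLs matching class_name, or all images if none match."""
    parser = ImageSourceCollector(class_name)
    parser.feed(html)
    if parser.sources or class_name is None:
        return parser.sources
    # Fallback: the class name may be stale, so collect every <img> instead.
    return image_sources(html)
```

The fallback may also pick up logos and icons, so the results would likely need filtering before download.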

import error urllib

Hi Chuan,

When I try to run shutterscrape.py, I get an import error.

Alexanders-MacBook-Pro:shutterscrape alexandersantiago$ python shutterscrape.py
Traceback (most recent call last):
  File "shutterscrape.py", line 6, in <module>
    from urllib import urlopen
ImportError: cannot import name 'urlopen'

Can you help me on that one please?
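
This error is what Python 3 reports for that import, because urlopen lives in urllib.request there. A minimal sketch of an import that works on both Python 2 and Python 3 is shown below; whether the rest of the script runs unchanged on Python 3 is not verified here.

```python
try:
    # Python 3: urlopen lives in the urllib.request submodule.
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen is importable from urllib directly.
    from urllib import urlopen

print(urlopen.__module__)
```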