Scraping Images

Introduction

Now that you've started to hone your scraping skills, let's look at another kind of data you're apt to want to pull from the web: images! In this lesson, you'll see how to save images from the web and how to display them in a pandas DataFrame for easy perusal.

Objectives

You will be able to:

  • Save Images from the Web
  • Display Images in a Pandas DataFrame

Grabbing an HTML Page

Start with the same page that you've been working with: books.toscrape.com.

from bs4 import BeautifulSoup
import requests

html_page = requests.get('http://books.toscrape.com/')  # Make a GET request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser')  # Pass the page contents to Beautiful Soup for parsing
warning = soup.find('div', class_="alert alert-warning")
# next_sibling is called twice because the first sibling is typically a whitespace text node
book_container = warning.next_sibling.next_sibling

Finding Images

First, simply retrieve a list of images by searching for img tags with Beautiful Soup:

images = book_container.find_all('img')
ex_img = images[0]  # Preview an entry
ex_img
<img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
# Use tab completion (type `ex_img.` and press Tab) to preview the methods and attributes available for the entry
# While there are plenty of other methods to explore, simply select the URL for the image for now.
ex_img.attrs['src']
'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
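The src value above is a relative path, so it has to be combined with the address of the page it came from before it can be requested. As a minimal sketch, the standard library's urljoin can do that resolution; the second call uses a hypothetical catalogue page and image path purely to illustrate how a "../" prefix is handled.

```python
from urllib.parse import urljoin

# Page the <img> tag was found on, and the relative src value from above.
page_url = "http://books.toscrape.com/"
src = "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"

# urljoin resolves src relative to the page URL, much as a browser would.
full_url = urljoin(page_url, src)
print(full_url)

# On a page in a subdirectory, a "../" prefix climbs back toward the site root.
# (This catalogue URL and image path are hypothetical values for illustration.)
nested = urljoin("http://books.toscrape.com/catalogue/page-2.html",
                 "../media/cache/example.jpg")
print(nested)
```

For the home page, plain string concatenation gives the same result; urljoin mostly pays off once you start scraping pages that live in subfolders.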

Saving Images

Great! Now that you have a URL (well, a relative URL, to be more precise), you can download the image locally!

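import os
import shutil

url_base = "http://books.toscrape.com/"
url_ext = ex_img.attrs['src']
full_url = url_base + url_ext
os.makedirs("images", exist_ok=True)  # Make sure the target folder exists before writing to it
r = requests.get(full_url, stream=True)  # stream=True defers downloading the body until it is read
if r.status_code == 200:
    with open("images/book1.jpg", 'wb') as f:
        r.raw.decode_content = True  # Decompress gzip/deflate content while reading the raw stream
        shutil.copyfileobj(r.raw, f)  # Copy the response body to the file in chunks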
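If you plan to save more than one image, it can help to wrap the copy step in a small function. The sketch below is one possible shape for such a helper; save_stream is a hypothetical name, and an in-memory BytesIO object stands in for r.raw so the example runs without a network connection.

```python
import io
import shutil
import tempfile
from pathlib import Path

def save_stream(source, destination):
    """Copy a binary file-like object to destination, creating parent folders."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with open(destination, "wb") as f:
        shutil.copyfileobj(source, f)
    return destination

# Stand-in for r.raw: any binary object with a .read() method works here.
fake_body = io.BytesIO(b"\xff\xd8\xff fake jpeg bytes")
target = Path(tempfile.mkdtemp()) / "images" / "book1.jpg"
saved = save_stream(fake_body, target)
print(saved.name, saved.stat().st_size)
```

With a real response, you would pass r.raw as the source after setting r.raw.decode_content = True, as in the code above.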

Showing Images in the File Directory

You can also run a simple bash command in a standalone cell to confirm that the image is indeed there:

ls images/
book-section.png  book1.jpg
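If you'd rather stay in Python (for example, on a system where ls isn't available), the same check can be sketched with pathlib. The list_images helper and its extension filter are assumptions made for this illustration, and the demo runs against a temporary folder rather than the lesson's images/ directory.

```python
import tempfile
from pathlib import Path

def list_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return sorted file names in folder whose suffix is in extensions."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )

# Demo on a throwaway folder so the example runs anywhere.
demo = Path(tempfile.mkdtemp())
for name in ("book1.jpg", "book-section.png", "notes.txt"):
    (demo / name).write_bytes(b"")
print(list_images(demo))
```

Calling list_images("images") in the lesson's notebook should give the same information as the ls command above.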

Previewing an Individual Image

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('images/book1.jpg')
imgplot = plt.imshow(img)
plt.show()

Displaying Images in Pandas DataFrames

You can even display images within a pandas DataFrame by using a little HTML yourself!

import pandas as pd
from IPython.display import HTML

row1 = [ex_img.attrs['alt'], '<img src="images/book1.jpg"/>']
df = pd.DataFrame([row1], columns=['title', 'cover'])
# escape=False keeps the <img> tag as HTML instead of showing it as literal text
HTML(df.to_html(escape=False))
   title                 cover
0  A Light in the Attic  [cover image]
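Because the table is rendered with escaping turned off, whatever goes into the cover column is treated as raw HTML. One cautious way to build that cell is sketched below: a hypothetical img_tag helper that escapes attribute values and can set a display width. The helper and its parameters are illustrative, not part of pandas or IPython.

```python
from html import escape

def img_tag(path, alt="", width=None):
    """Build an <img> tag, escaping attribute values."""
    attrs = 'src="{}" alt="{}"'.format(escape(path, quote=True),
                                       escape(alt, quote=True))
    if width is not None:
        attrs += ' width="{}"'.format(int(width))
    return "<img {}/>".format(attrs)

tag = img_tag("images/book1.jpg", alt="A Light in the Attic", width=80)
print(tag)

# Quotes in scraped text are escaped rather than breaking the attribute.
print(img_tag("a.jpg", alt='Say "hi"'))
```

The returned string can be used in place of the hand-written '<img src="images/book1.jpg"/>' in the row above.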

All Together Now

import os

url_base = "http://books.toscrape.com/"
os.makedirs("images", exist_ok=True)  # Make sure the target folder exists
data = []
for n, img in enumerate(images, start=1):
    full_url = url_base + img.attrs['src']
    path = "images/book{}.jpg".format(n)
    title = img.attrs['alt']
    r = requests.get(full_url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
        # Only add a row when the image was actually saved
        row = [title, '<img src="{}"/>'.format(path)]
        data.append(row)
df = pd.DataFrame(data, columns=['title', 'cover'])
print('Number of rows: ', len(df))
HTML(df.to_html(escape=False))
Number of rows:  20
[Rendered table with columns title and cover: rows 0 through 19, one per book, each showing the title and the saved cover image]
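The table above points at files in the local images/ folder, so the covers only show up while those files stay alongside the notebook. If you want the rendered HTML to be self-contained, one option is to embed each image as a base64 data URI. This is a minimal sketch of that alternative, not what the lesson's code does; to_data_uri is a hypothetical helper, and the three-byte payload is a placeholder rather than a real image.

```python
import base64
import mimetypes

def to_data_uri(image_bytes, filename):
    """Encode raw image bytes as a data: URI, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # Fallback for unknown extensions
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(mime, encoded)

# Placeholder bytes; in practice, read them from the saved file or the response.
uri = to_data_uri(b"abc", "book1.jpg")
cell = '<img src="{}"/>'.format(uri)
print(cell)
```

The trade-off is size: base64 inflates each image by roughly a third, so this suits small thumbnails better than full-size pictures.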

Summary

Voila! You really are turning into a scraping champion! Now go get scraping!
