
wswp's Introduction

Web Scraping with Python

Welcome to the code repository for Web Scraping with Python, Second Edition! I hope you find the code and data here useful. If you have any questions, reach out to @kjam on Twitter or GitHub.

Code Structure

All of the code samples are in folders separated by chapter. Scripts are intended to be run from the code folder, allowing you to easily import from the chapters.
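For example, a minimal session run from the code folder might look like the following sketch (module names are taken from elsewhere in this README; this only illustrates the import layout):

# Run from the repository's code/ directory so the chapter folders
# (chp1, chp2, ...) are importable.
from chp1.throttle import Throttle
from chp2.advanced_link_crawler import link_crawler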

Code Examples

I have not included every code sample found in the book, but I have included the majority of the finished scripts. Even though they are here, I encourage you to write out each code sample on your own and use these only as a reference.

Firefox Issues

Depending on your version of Firefox and Selenium, you may run into JavaScript errors. Here are some fixes:

  • Use an older version of Firefox.
  • Upgrade Selenium to >=3.0.2 and download geckodriver. Make sure geckodriver is findable via your PATH variable; you can do this by adding a line to your .bashrc or .bash_profile. (Wondering what these are? Please read Appendix C on learning the command line.) A minimal setup sketch follows this list.
  • Use PhantomJS with Selenium (change your browser line to webdriver.PhantomJS('path/to/your/phantomjs/installation')).
  • Use Chrome, Internet Explorer, or any other supported browser.
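
For instance, a minimal setup reflecting the second and third fixes might look like this (a sketch, assuming Selenium >= 3.0.2; the PhantomJS path is a placeholder, not a real install location):

from selenium import webdriver

try:
    # Requires geckodriver to be findable on your PATH (Selenium >= 3.0.2)
    driver = webdriver.Firefox()
except Exception:
    # Fallback: PhantomJS; replace the placeholder with your install path
    driver = webdriver.PhantomJS('path/to/your/phantomjs/installation')

driver.get('http://example.webscraping.com/')
print(driver.title)
driver.quit()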

Feel free to reach out if you have any questions!

Issues with Module Import

Seeing chp1 ModuleNotFound errors? Try adding this snippet to the top of the importing file:

import os
import sys

# Append the parent directory (the main code folder) to the module
# search path so the chapter packages can be imported.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)))

What this does is append the parent directory (the main code folder) to your system path, which is where Python looks for imports. On some installations, I have noticed the current directory is not automatically added (as it commonly is), so this code explicitly adds that directory to your path.

Corrections?

If you find any issues in these code examples, feel free to submit an Issue or Pull Request. I appreciate your input!

First edition repository

If you are looking for the first edition's repository, you can find it here: Web Scraping with Python, First Edition

Questions?

Reach out to @kjam on Twitter or GitHub. @kjam is also often on freenode. :)

wswp's People

Contributors

cianoflynn, kjam


wswp's Issues

threaded_crawler_with_queue.py (the updated one)

When running the above script from Chapter 4 (i.e., with the updates posted about 7 months prior to this post), inside
def mp_threaded_crawler(…….):
the call to proc.start() produces the following error:
"can't pickle _thread.lock objects"

Windows 10 Pro
Python 3.6.6
GPU: 960M
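
A hedged note on the likely cause, not a confirmed fix: on Windows, multiprocessing uses the spawn start method and pickles every argument passed to a Process, and _thread.lock objects are not picklable. The safe pattern is to pass only picklable arguments and create locks or threads inside the worker, guarded by an __main__ check (the worker below is hypothetical, not the book's crawler):

import multiprocessing

def worker(url):
    # Only picklable arguments cross the process boundary; create any
    # locks, queues, or threads inside the worker, not in the parent.
    print('crawling', url)

if __name__ == '__main__':  # required on Windows with the spawn start method
    proc = multiprocessing.Process(target=worker,
                                   args=('http://example.webscraping.com/',))
    proc.start()
    proc.join()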

There are many problems with the sample website.

Hi, while working through the code in Chapter 1, Section 4, I found that the results of running it against the sample website are completely inconsistent with the results shown in the book. Could the author please update this as soon as possible? Thank you!

CsvCallback

Using the posted code for the advanced_link_crawler and the CsvCallback class, whenever I run the example code for the link_crawler at the end of Chapter 2, I get back 'IndexError: list index out of range'.

The list comprehension does not seem to work.

Please help.

Please give details about the Python version, the virtual environment, and the installed packages.

I may be doing something wrong, but I'm not certain what. BTW, thanks very much for the book.

I couldn't get the example on pages 57-58 to work before I changed anything, and again, I assume I am doing something incorrectly or have something set up incorrectly. I am running on a Windows 10 system using Python 3.6.2.

Thanks for any help you can provide...

I made one change in "advanced_link_crawler.py" as follows:

if re.search(link_regex, link):  # was re.match(); the pattern does not occur at the beginning of the string, so match() never matched

I made some debug modifications to "csv_callback.py" because nothing was being written to the CSV file when the exception was thrown, as follows:

import sys  # DEBUG ONLY
import csv
import re
from lxml.html import fromstring

class CsvCallback:
    def __init__(self):
        self.filename = r'..\..\data\countries.csv' # CHANGE: to get something written to CSV file before exception thrown
        self.handle = open(self.filename, 'w')      # CHANGE
        try:
            self.writer = csv.writer(self.handle)   # CHANGE
            self.fields = ('area', 'population', 'iso', 'country', 'capital',
                           'continent', 'tld', 'currency_code', 'currency_name',
                           'phone', 'postal_code_format', 'postal_code_regex',
                           'languages', 'neighbours')
            self.writer.writerow(self.fields)
            self.handle.close() # CHANGE: to get something written to CSV file before exception thrown
        except csv.Error as e:
            sys.exit('file {}, line {}: {}'.format(self.filename, self.writer.line_num, e)) # DEBUG ONLY
        except:
            print("Unexpected error:", sys.exc_info()[0])   # DEBUG ONLY  
            raise

    def __call__(self, url, html):
        if re.search('/view/', url):
            print(url)  # DEBUG ONLY
            tree = fromstring(html)
            try:
                all_rows = [
                    tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)[0].text_content()
                    for field in self.fields]
            except:
                print(self.fields)  # DEBUG ONLY
                print("Unexpected error:", sys.exc_info()[0])   # DEBUG ONLY
                raise               # DEBUG ONLY
            self.handle = open(self.filename, 'a')  # CHANGE: to get something written to CSV file before exception thrown
            self.writer = csv.writer(self.handle)   # CHANGE
            self.writer.writerow(all_rows)
            self.handle.close()                     # CHANGE

This is my main routine calling everything:

import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)))

from chp2.advanced_link_crawler import link_crawler
from chp2.csv_callback import CsvCallback

link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=CsvCallback())

I get these exceptions:
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp1\ScrapedToCSV.py", line 9, in
link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=CsvCallback())
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\advanced_link_crawler.py", line 110, in link_crawler
data.extend(scrape_callback(url, html) or [])
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in call
for field in self.fields]
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in
for field in self.fields]

builtins.IndexError: list index out of range

I get this output:
Downloading: http://example.webscraping.com/
Downloading: http://example.webscraping.com/places/default/index/1
Downloading: http://example.webscraping.com/places/default/index/2
Downloading: http://example.webscraping.com/places/default/index/3
Downloading: http://example.webscraping.com/places/default/index/4
Downloading: http://example.webscraping.com/places/default/index/5
Downloading: http://example.webscraping.com/places/default/index/6
Downloading: http://example.webscraping.com/places/default/index/7
Downloading: http://example.webscraping.com/places/default/index/8
Downloading: http://example.webscraping.com/places/default/index/9
Downloading: http://example.webscraping.com/places/default/index/10
Downloading: http://example.webscraping.com/places/default/index/11
Downloading: http://example.webscraping.com/places/default/index/12
Downloading: http://example.webscraping.com/places/default/index/13
Downloading: http://example.webscraping.com/places/default/index/14
Downloading: http://example.webscraping.com/places/default/index/15
Downloading: http://example.webscraping.com/places/default/index/16
Downloading: http://example.webscraping.com/places/default/index/17
Downloading: http://example.webscraping.com/places/default/index/18
Downloading: http://example.webscraping.com/places/default/index/19
Downloading: http://example.webscraping.com/places/default/index/20
Downloading: http://example.webscraping.com/places/default/index/21
Downloading: http://example.webscraping.com/places/default/index/22
Downloading: http://example.webscraping.com/places/default/index/23
Downloading: http://example.webscraping.com/places/default/index/24
Downloading: http://example.webscraping.com/places/default/index/25
Downloading: http://example.webscraping.com/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/view/Zimbabwe-252
Downloading: http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
Unexpected error: <class 'IndexError'>

I get 2 rows added to my CSV file: a header and 1 row of data.
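
A hedged reading of the log above: the login URL matches re.search('/view/', url) only through its "_next=" query string, yet the login page has no country table, so the XPath query returns an empty list and the [0] index raises IndexError. One possible guard, sketched outside the book's code (field list abbreviated, function name hypothetical):

import re
from lxml.html import fromstring

FIELDS = ('area', 'population', 'iso', 'country', 'capital')  # abbreviated

def extract_row(url, html):
    # Skip login/register pages and anything else without a country table.
    if '/user/' in url or not re.search('/view/', url):
        return None
    tree = fromstring(html)
    row = []
    for field in FIELDS:
        cells = tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)
        if not cells:
            return None  # expected table row missing; avoid IndexError
        row.append(cells[0].text_content())
    return row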

Skipping due to depth

As I understand it, you changed the way the website works. In the book the URL is http://example.webscraping.com/view/Afghanistan-1, but now it is http://example.webscraping.com/places/default/view/Afghanistan-1. I am a beginner, and therefore I faced this problem:
Downloading: http://example.webscraping.com/places/default/index
Skipping http://example.webscraping.com/places/default/index/1 due to depth
Skipping http://example.webscraping.com/places/default/view/Antigua-and-Barbuda-10 due to depth
Skipping http://example.webscraping.com/places/default/view/Antarctica-9 due to depth
Skipping http://example.webscraping.com/places/default/view/Anguilla-8 due to depth
Skipping http://example.webscraping.com/places/default/view/Angola-7 due to depth
Skipping http://example.webscraping.com/places/default/view/Andorra-6 due to depth
Skipping http://example.webscraping.com/places/default/view/American-Samoa-5 due to depth
Skipping http://example.webscraping.com/places/default/view/Algeria-4 due to depth
Skipping http://example.webscraping.com/places/default/view/Albania-3 due to depth
Skipping http://example.webscraping.com/places/default/view/Aland-Islands-2 due to depth
Skipping http://example.webscraping.com/places/default/view/Afghanistan-1 due to depth
Skipping http://example.webscraping.com/places/default/user/login?_next=/places/default/index due to depth
Skipping http://example.webscraping.com/places/default/user/register?_next=/places/default/index due to depth
Skipping http://example.webscraping.com/places/default/index due to depth

The code is attached as final.txt.
P.S. PyCharm, Python 3.6.3. If there is a newer edition of the book, please give me some information; I will be very thankful!
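
A hedged adjustment for the new URL scheme, reusing the link_crawler call shown in the CsvCallback issue above (not an official fix): the link pattern must include the /places/default/ prefix, and max_depth=-1 disables the depth limit behind the "Skipping ... due to depth" messages.

from chp2.advanced_link_crawler import link_crawler

# Pattern updated for the site's current /places/default/... paths;
# max_depth=-1 turns off the depth cutoff.
link_crawler('http://example.webscraping.com/',
             '/places/default/(index|view)',
             max_depth=-1)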

Is example.webscraping.com down?

I've tried to reach example.webscraping.com, but unfortunately I couldn't access the site.
Could you resolve this? I cannot practice the code from your book.

from chp1.throttle import Throttle: Anaconda Error

Hi,

I am trying to use
from chp1.throttle import Throttle
in Anaconda 5 for Python 3.6, and I get the error: no module named chp1.

I tried !pip install chp1, but that does not work either: no matching distributions found for chp1.

Could you let me know what is wrong.

Regards,
Ren.
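
A hedged note rather than an official answer: chp1 is a folder inside this repository, not a PyPI package, so pip cannot install it. Running the script from the repo's code directory, or adding that directory to sys.path first, should make the import work (the path below is a placeholder):

import sys

# Placeholder path: point this at the code/ directory of your clone.
sys.path.append('/path/to/wswp/code')

from chp1.throttle import Throttle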
