
wswp's Introduction

Web Scraping with Python

Welcome to the code repository for Web Scraping with Python, Second Edition! I hope you find the code and data here useful. If you have any questions, reach out to @kjam on Twitter or GitHub.

Code Structure

All of the code samples are in folders separated by chapter. Scripts are intended to be run from the code folder, allowing you to easily import from the chapters.
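For example, a minimal session run from the code folder might look like the following sketch (module names are taken from elsewhere in this README; this only illustrates the import layout):

# Run from the repository's code/ directory so the chapter folders
# (chp1, chp2, ...) are importable.
from chp1.throttle import Throttle
from chp2.advanced_link_crawler import link_crawler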

Code Examples

I have not included every code sample found in the book, but I have included the majority of the finished scripts. Even though they are here, I encourage you to write out each code sample on your own and use these only as a reference.

Firefox Issues

Depending on your version of Firefox and Selenium, you may run into JavaScript errors. Here are some fixes:

  • Use an older version of Firefox.
  • Upgrade Selenium to >=3.0.2 and download geckodriver. Make sure geckodriver is findable via your PATH variable; you can do this by adding a line to your .bashrc or .bash_profile. (Wondering what these are? Please read Appendix C on learning the command line.) A minimal setup sketch follows this list.
  • Use PhantomJS with Selenium (change your browser line to webdriver.PhantomJS('path/to/your/phantomjs/installation')).
  • Use Chrome, Internet Explorer, or any other supported browser.
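
For instance, a minimal setup reflecting the second and third fixes might look like this (a sketch, assuming Selenium >= 3.0.2; the PhantomJS path is a placeholder, not a real install location):

from selenium import webdriver

try:
    # Requires geckodriver to be findable on your PATH (Selenium >= 3.0.2)
    driver = webdriver.Firefox()
except Exception:
    # Fallback: PhantomJS; replace the placeholder with your install path
    driver = webdriver.PhantomJS('path/to/your/phantomjs/installation')

driver.get('http://example.webscraping.com/')
print(driver.title)
driver.quit()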

Feel free to reach out if you have any questions!

Issues with Module Import

Seeing chp1 ModuleNotFound errors? Try adding this snippet to the top of the importing file:

import os
import sys

# Append the parent directory (the main code folder) to the module
# search path so the chapter packages can be imported.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)))

What this does is append the parent directory (the main code folder) to your system path, which is where Python looks for imports. On some installations, I have noticed the current directory is not automatically added (as it commonly is), so this code explicitly adds that directory to your path.

Corrections?

If you find any issues in these code examples, feel free to submit an Issue or Pull Request. I appreciate your input!

First edition repository

If you are looking for the first edition's repository, you can find it here: Web Scraping with Python, First Edition

Questions?

Reach out to @kjam on Twitter or GitHub. @kjam is also often on freenode. :)

wswp's People

Contributors

cianoflynn, kjam


wswp's Issues

threaded_crawler_with_queue.py (the updated one)

When running the above script from Chapter 4 (i.e., with the updates posted about 7 months prior to this post), inside
def mp_threaded_crawler(…….):
the call to proc.start() produces the following error:
"can't pickle _thread.lock objects"

Windows 10 Pro
Python 3.6.6
GPU: 960M
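
A hedged note on the likely cause, not a confirmed fix: on Windows, multiprocessing uses the spawn start method and pickles every argument passed to a Process, and _thread.lock objects are not picklable. The safe pattern is to pass only picklable arguments and create locks or threads inside the worker, guarded by an __main__ check (the worker below is hypothetical, not the book's crawler):

import multiprocessing

def worker(url):
    # Only picklable arguments cross the process boundary; create any
    # locks, queues, or threads inside the worker, not in the parent.
    print('crawling', url)

if __name__ == '__main__':  # required on Windows with the spawn start method
    proc = multiprocessing.Process(target=worker,
                                   args=('http://example.webscraping.com/',))
    proc.start()
    proc.join()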

There are many problems with the sample website.

Hi, while working through the code in Chapter 1, Section 4, I found that the results of running it against the sample website are completely inconsistent with the results shown in the book. Could the author please update this as soon as possible? Thank you!

CsvCallback

Using the posted code for the advanced_link_crawler and the CsvCallback class, whenever I run the example code for the link_crawler at the end of Chapter 2, I get back 'IndexError: list index out of range'.

The list comprehension does not seem to work.

Please help.

Please give details about the Python version, the virtual environment, and the installed packages.

I may be doing something wrong, but I'm not certain what. BTW, thanks very much for the book.

I couldn't get the example on pages 57-58 to work before I changed anything, and again, I assume I am doing something incorrectly or have something set up incorrectly. I am running on a Windows 10 system using Python 3.6.2.

Thanks for any help you can provide...

I made one change in "advanced_link_crawler.py" as follows:

if re.search(link_regex, link):  # was re.match(); the pattern does not occur at the beginning of the string, so match() never matched

I made some debug modifications to "csv_callback.py" because nothing was being written to the CSV file when the exception was thrown, as follows:

import sys  # DEBUG ONLY
import csv
import re
from lxml.html import fromstring

class CsvCallback:
    def __init__(self):
        self.filename = r'..\..\data\countries.csv' # CHANGE: to get something written to CSV file before exception thrown
        self.handle = open(self.filename, 'w')      # CHANGE
        try:
            self.writer = csv.writer(self.handle)   # CHANGE
            self.fields = ('area', 'population', 'iso', 'country', 'capital',
                           'continent', 'tld', 'currency_code', 'currency_name',
                           'phone', 'postal_code_format', 'postal_code_regex',
                           'languages', 'neighbours')
            self.writer.writerow(self.fields)
            self.handle.close() # CHANGE: to get something written to CSV file before exception thrown
        except csv.Error as e:
            sys.exit('file {}, line {}: {}'.format(self.filename, self.writer.line_num, e)) # DEBUG ONLY
        except:
            print("Unexpected error:", sys.exc_info()[0])   # DEBUG ONLY  
            raise

    def __call__(self, url, html):
        if re.search('/view/', url):
            print(url)  # DEBUG ONLY
            tree = fromstring(html)
            try:
                all_rows = [
                    tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)[0].text_content()
                    for field in self.fields]
            except:
                print(self.fields)  # DEBUG ONLY
                print("Unexpected error:", sys.exc_info()[0])   # DEBUG ONLY
                raise               # DEBUG ONLY
            self.handle = open(self.filename, 'a')  # CHANGE: to get something written to CSV file before exception thrown
            self.writer = csv.writer(self.handle)   # CHANGE
            self.writer.writerow(all_rows)
            self.handle.close()                     # CHANGE

This is my main routine calling everything:

import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)))

from chp2.advanced_link_crawler import link_crawler
from chp2.csv_callback import CsvCallback

link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=CsvCallback())

I get these exceptions:
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp1\ScrapedToCSV.py", line 9, in
link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=CsvCallback())
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\advanced_link_crawler.py", line 110, in link_crawler
data.extend(scrape_callback(url, html) or [])
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in call
for field in self.fields]
File "C:\Users\mrich\Documents\PythonCode\wswp-master\wswp-master\code\chp2\csv_callback.py", line 33, in
for field in self.fields]

builtins.IndexError: list index out of range

I get this output:
Downloading: http://example.webscraping.com/
Downloading: http://example.webscraping.com/places/default/index/1
Downloading: http://example.webscraping.com/places/default/index/2
Downloading: http://example.webscraping.com/places/default/index/3
Downloading: http://example.webscraping.com/places/default/index/4
Downloading: http://example.webscraping.com/places/default/index/5
Downloading: http://example.webscraping.com/places/default/index/6
Downloading: http://example.webscraping.com/places/default/index/7
Downloading: http://example.webscraping.com/places/default/index/8
Downloading: http://example.webscraping.com/places/default/index/9
Downloading: http://example.webscraping.com/places/default/index/10
Downloading: http://example.webscraping.com/places/default/index/11
Downloading: http://example.webscraping.com/places/default/index/12
Downloading: http://example.webscraping.com/places/default/index/13
Downloading: http://example.webscraping.com/places/default/index/14
Downloading: http://example.webscraping.com/places/default/index/15
Downloading: http://example.webscraping.com/places/default/index/16
Downloading: http://example.webscraping.com/places/default/index/17
Downloading: http://example.webscraping.com/places/default/index/18
Downloading: http://example.webscraping.com/places/default/index/19
Downloading: http://example.webscraping.com/places/default/index/20
Downloading: http://example.webscraping.com/places/default/index/21
Downloading: http://example.webscraping.com/places/default/index/22
Downloading: http://example.webscraping.com/places/default/index/23
Downloading: http://example.webscraping.com/places/default/index/24
Downloading: http://example.webscraping.com/places/default/index/25
Downloading: http://example.webscraping.com/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/view/Zimbabwe-252
Downloading: http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
http://example.webscraping.com/places/default/user/login?_next=/places/default/view/Zimbabwe-252
('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
Unexpected error: <class 'IndexError'>

I get 2 rows added to my CSV file: a header and 1 row of data.
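
A hedged reading of the log above: the login URL matches re.search('/view/', url) only through its "_next=" query string, yet the login page has no country table, so the XPath query returns an empty list and the [0] index raises IndexError. One possible guard, sketched outside the book's code (field list abbreviated, function name hypothetical):

import re
from lxml.html import fromstring

FIELDS = ('area', 'population', 'iso', 'country', 'capital')  # abbreviated

def extract_row(url, html):
    # Skip login/register pages and anything else without a country table.
    if '/user/' in url or not re.search('/view/', url):
        return None
    tree = fromstring(html)
    row = []
    for field in FIELDS:
        cells = tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)
        if not cells:
            return None  # expected table row missing; avoid IndexError
        row.append(cells[0].text_content())
    return row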

Skipping due to depth

As I understand it, you changed the way the website works. In the book the URL is http://example.webscraping.com/view/Afghanistan-1, but now it is http://example.webscraping.com/places/default/view/Afghanistan-1. I am a beginner, and therefore I faced this problem:
Downloading: http://example.webscraping.com/places/default/index
Skipping http://example.webscraping.com/places/default/index/1 due to depth
Skipping http://example.webscraping.com/places/default/view/Antigua-and-Barbuda-10 due to depth
Skipping http://example.webscraping.com/places/default/view/Antarctica-9 due to depth
Skipping http://example.webscraping.com/places/default/view/Anguilla-8 due to depth
Skipping http://example.webscraping.com/places/default/view/Angola-7 due to depth
Skipping http://example.webscraping.com/places/default/view/Andorra-6 due to depth
Skipping http://example.webscraping.com/places/default/view/American-Samoa-5 due to depth
Skipping http://example.webscraping.com/places/default/view/Algeria-4 due to depth
Skipping http://example.webscraping.com/places/default/view/Albania-3 due to depth
Skipping http://example.webscraping.com/places/default/view/Aland-Islands-2 due to depth
Skipping http://example.webscraping.com/places/default/view/Afghanistan-1 due to depth
Skipping http://example.webscraping.com/places/default/user/login?_next=/places/default/index due to depth
Skipping http://example.webscraping.com/places/default/user/register?_next=/places/default/index due to depth
Skipping http://example.webscraping.com/places/default/index due to depth

The code is attached as final.txt.
P.S. PyCharm, Python 3.6.3. If there is a newer edition of the book, please give me some information; I will be very thankful!
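
A hedged adjustment for the new URL scheme, reusing the link_crawler call shown in the CsvCallback issue above (not an official fix): the link pattern must include the /places/default/ prefix, and max_depth=-1 disables the depth limit behind the "Skipping ... due to depth" messages.

from chp2.advanced_link_crawler import link_crawler

# Pattern updated for the site's current /places/default/... paths;
# max_depth=-1 turns off the depth cutoff.
link_crawler('http://example.webscraping.com/',
             '/places/default/(index|view)',
             max_depth=-1)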

Is example.webscraping.com down?

I've tried to reach example.webscraping.com, but unfortunately I couldn't access the site.
Could you resolve this? I cannot practice the code from your book.

from chp1.throttle import Throttle: Anaconda Error

Hi,

I am trying to use
from chp1.throttle import Throttle
in Anaconda 5 for Python 3.6, and I get the error: no module named chp1.

I tried !pip install chp1, but that does not work either: no matching distributions found for chp1.

Could you let me know what is wrong.

Regards,
Ren.
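
A hedged note rather than an official answer: chp1 is a folder inside this repository, not a PyPI package, so pip cannot install it. Running the script from the repo's code directory, or adding that directory to sys.path first, should make the import work (the path below is a placeholder):

import sys

# Placeholder path: point this at the code/ directory of your clone.
sys.path.append('/path/to/wswp/code')

from chp1.throttle import Throttle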
