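Python-Crawling-Tutorial: Basic Web Crawling in Practice

Related Resources

The latest slides are on slideshare and are updated from time to time. The demo code can be downloaded via Clone or download on the right side of this page.

Slides from before 2017 are in release, though some of the practice websites they use no longer work. The slides can also be downloaded via link.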

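Environment Setup

Anaconda (recommended)

  • Download the Python 3.6 version: https://www.continuum.io/downloads
  • The exercises use the Chrome browser; please install Chrome for your own platform.
  • Crawling dynamic websites also requires a webdriver, which must be downloaded separately.
  • All exercises run in jupyter notebook; after installing Anaconda, open the .ipynb files with the bundled jupyter notebook.
  • Anaconda is the recommended setup. With Anaconda installed, only the following packages need to be added: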
$ pip install selenium tldextract Pillow

pip

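pip is Python's package management system. On some systems, pip3 denotes the Python 3 version. Install pip3 as appropriate for your system, then install the following Python 3 packages.

# Depending on your system, use pip or pip3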
$ pip install requests beautifulsoup4 lxml Pillow selenium tldextract
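
To confirm that the installation succeeded, the packages can be checked from Python. The following is a minimal sketch, not part of the course material: the helper name `missing_packages` is hypothetical, and the mapping covers only the packages listed in the command above. Note that some pip names differ from their import names (beautifulsoup4 is imported as `bs4`, Pillow as `PIL`).

```python
from importlib.util import find_spec

# pip name -> import name. These differ for some packages
# (beautifulsoup4 -> bs4, Pillow -> PIL).
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "Pillow": "PIL",
    "selenium": "selenium",
    "tldextract": "tldextract",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If any names are reported as missing, rerun the pip command above for those packages.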

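Optional: Data Analysis

There are no exercises for this part, but example code is provided that you can run, so installing these packages is optional. (If installing wordcloud fails, Visual Studio may be missing; you can download and install it from the URL given in the warning.)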

# Anaconda
$ pip install jieba wordcloud

# pip
$ pip3 install numpy pandas matplotlib scipy scikit-learn jieba wordcloud

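Please Follow Other People's Rules

Some websites place a robots.txt file under their root directory. This is essentially the crawling policy defined by the site owner, so please respect it when practicing.

For the detailed syntax and purpose of robots.txt, see the wiki and the Google documentation.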
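
As an illustration of respecting these rules, the Python standard library ships `urllib.robotparser`, which can decide whether a URL may be fetched. The sketch below runs offline against a made-up robots.txt; the sample rules, the `example.com` URLs, and the helper names `build_parser` and `is_allowed` are all assumptions for demonstration, not part of the course material.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so that the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/index.html"))
print(is_allowed(parser, "https://example.com/private/data.html"))
```

Against a real site, one would instead call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()`, then check each URL with `can_fetch` before requesting it.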


Q&A

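Q: What are some commonly used APIs?

As mentioned in class, crawling is only one means of obtaining data. If the site provides an API, you can use the API directly. An API usually returns data in an already organized format, or decides which data you can access based on your permissions.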
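
To show why an API is usually more convenient than parsing HTML, here is a rough sketch of calling a JSON API with only the standard library. The endpoint `https://api.example.com/v1/articles`, its `page` and `limit` parameters, and the `{"items": [{"title": ...}]}` response shape are all hypothetical; a real API documents its own URL, parameters, and format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only; substitute a real API's URL.
API_BASE = "https://api.example.com/v1/articles"


def build_url(base, **params):
    """Append query parameters to the endpoint URL."""
    return f"{base}?{urlencode(params)}" if params else base


def parse_titles(payload):
    """Extract titles from a JSON body shaped like {"items": [{"title": ...}]}."""
    data = json.loads(payload)
    return [item["title"] for item in data.get("items", [])]


def fetch_titles(base=API_BASE, **params):
    """Call the API and return the titles (requires network access)."""
    with urlopen(build_url(base, **params), timeout=10) as response:
        return parse_titles(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body.
sample = '{"items": [{"title": "first"}, {"title": "second"}]}'
print(build_url(API_BASE, page=1, limit=2))
print(parse_titles(sample))
```

Because the response is already structured, no HTML parsing or CSS selectors are needed; the data is read directly from the decoded JSON.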

Contributors

afuntw, dependabot[bot], jimmy15923
