ptt-web-crawler's Introduction

ptt-web-crawler (PTT web crawler)

Scrapy version by afunTW

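Features

  • Crawls a single article or a range of articles
  • Strips whitespace, blank lines, and special characters from the data
  • Outputs JSON
  • Supports Python 2.7 and 3.4-3.6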

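JSON output format

{
    "article_id": article ID,
    "article_title": article title,
    "author": author,
    "board": board name,
    "content": article content,
    "date": post time,
    "ip": poster's IP address,
    "message_count": { # comment tallies
        "all": total number of comments,
        "boo": number of boos (噓),
        "count": pushes minus boos,
        "neutral": number of neutral (→) comments,
        "push": number of pushes (推)
    },
    "messages": [ # comment contents
      {
        "push_content": comment text,
        "push_ipdatetime": comment time and IP address,
        "push_tag": 推/噓/→,
        "push_userid": commenter's ID
      },
      ...
      ]
}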
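
The relationship between "messages" and "message_count" can be sketched in Python as follows. This is a minimal illustration, not the project's own code: the function name `tally_messages` is hypothetical, and it assumes that any tag other than 推 or 噓 counts as neutral.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def tally_messages(messages):
    """Rebuild a message_count dict from a list of message dicts.

    Assumption: any tag that is not a push or a boo counts as neutral.
    """
    tags = Counter(m["push_tag"].strip() for m in messages)
    push = tags.get("推", 0)
    boo = tags.get("噓", 0)
    neutral = len(messages) - push - boo
    return {
        "all": len(messages),
        "boo": boo,
        "count": push - boo,  # pushes minus boos, as in the format above
        "neutral": neutral,
        "push": push,
    }


sample = [
    {"push_tag": "推", "push_userid": "a", "push_content": "x", "push_ipdatetime": ""},
    {"push_tag": "推", "push_userid": "b", "push_content": "y", "push_ipdatetime": ""},
    {"push_tag": "噓", "push_userid": "c", "push_content": "z", "push_ipdatetime": ""},
    {"push_tag": "→", "push_userid": "d", "push_content": "w", "push_ipdatetime": ""},
]
summary = tally_messages(sample)
```

With the sample above, `summary["count"]` is 1 (two pushes minus one boo) and `summary["all"]` is 4.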

Parameters

python crawler.py -b BOARD_NAME -i START_INDEX END_INDEX (a negative value counts pages back from the last page)
python crawler.py -b BOARD_NAME -a ARTICLE_ID

Examples

Crawl the PublicServan board from page 100 (https://www.ptt.cc/bbs/PublicServan/index100.html) to page 200 (https://www.ptt.cc/bbs/PublicServan/index200.html) and write the result to PublicServan-100-200.json

  • Run the script directly
cd PttWebCrawler
python crawler.py -b PublicServan -i 100 200
  • Run as a package
python setup.py install
python -m PttWebCrawler -b PublicServan -i 100 200
  • Call as a library
from PttWebCrawler.crawler import *

c = PttWebCrawler(as_lib=True)
c.parse_articles(100, 200, 'PublicServan')

Testing

python test.py

ptt-web-crawler is a crawler for the web version of PTT, the largest online community in Taiwan.

usage: python crawler.py [-h] -b BOARD_NAME (-i START_INDEX END_INDEX | -a ARTICLE_ID) [-v]
optional arguments:
  -h, --help                  show this help message and exit
  -b BOARD_NAME               Board name
  -i START_INDEX END_INDEX    Start and end index
  -a ARTICLE_ID               Article ID
  -v, --version               show program's version number and exit

Output would be BOARD_NAME-START_INDEX-END_INDEX.json (or BOARD_NAME-ID.json)
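
The usage text and output naming rule above can be sketched with `argparse`. This is a hedged reconstruction from the help output only, not the project's actual parser: `build_parser`, `output_filename`, and the `dest` names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage text (a sketch, not the real one)."""
    parser = argparse.ArgumentParser(prog="crawler.py")
    parser.add_argument("-b", metavar="BOARD_NAME", dest="board",
                        required=True, help="Board name")
    # Exactly one of -i or -a must be given, as in the usage line.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-i", metavar=("START_INDEX", "END_INDEX"), dest="index",
                       type=int, nargs=2, help="Start and end index")
    group.add_argument("-a", metavar="ARTICLE_ID", dest="article_id",
                       help="Article ID")
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s (sketch)")
    return parser


def output_filename(args):
    """Apply the naming rule: BOARD-START-END.json or BOARD-ID.json."""
    if args.index is not None:
        return "{}-{}-{}.json".format(args.board, args.index[0], args.index[1])
    return "{}-{}.json".format(args.board, args.article_id)


range_args = build_parser().parse_args(["-b", "PublicServan", "-i", "100", "200"])
article_args = build_parser().parse_args(["-b", "Gossiping", "-a", "M.1425003598.A.630"])
```

Here `output_filename(range_args)` gives `PublicServan-100-200.json`, matching the example earlier in this document.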

ptt-web-crawler's People

Contributors

david30907d, duckingod, gogochi, gogog22510, jwlin, kingispeak, marlboromoo, yoeugene

ptt-web-crawler's Issues

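Preserving special characters

Sorry, I'm new to Python.
I needed to crawl PTT data and found this handy tool.

While using it I noticed that the program automatically replaces
full-width spaces, line breaks, and similar characters.
Is there an option to turn the filter off,
or where in the Python code could I change this?

Thanks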

Comment contains a link

ex. Gossiping M.1425003598.A.630

<span class="f3 push-content">: <a href="http://ppt.cc/FMSc" rel="nofollow" 
target="_blank">http://ppt.cc/FMSc</a> 圍剿:包圍起來消滅  誰有要消滅她嗎</span>

line 131: 
push_content = push.find('span', 'push-content').string[1:].strip(' \t\n\r') 
TypeError: 'NoneType' object is not subscriptable
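
In BeautifulSoup, `.string` is None when a tag has more than one child, which is the case here because the span holds both text and an `<a>` element. One way around this is to collect every text node inside the span instead of expecting a single string. The sketch below shows that idea using only the Python 3 standard library (`html.parser`) so that it runs without extra packages. The class and function names are hypothetical, and this is not the fix the project itself adopted.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "wbr"}


class PushContentParser(HTMLParser):
    """Collect all text inside the first-level span with class push-content."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while we are inside the target span
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "push-content" in classes:
                self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.parts.append(data)


def extract_push_content(html):
    parser = PushContentParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Same post-processing as the failing line: drop the leading ':' and trim.
    return text[1:].strip(" \t\n\r")


with_link = extract_push_content(
    '<span class="f3 push-content">: <a href="http://ppt.cc/FMSc" '
    'rel="nofollow" target="_blank">http://ppt.cc/FMSc</a> some text</span>'
)
```

With BeautifulSoup itself, the equivalent idea would be to replace `.string` with `.get_text()` on the span.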

about time out

Hello, I'd like to ask about a Read timed out error that I get when running the crawler.
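
A common way to cope with intermittent read timeouts is to retry the request with an increasing delay. The sketch below is generic and uses only the standard library. `with_retries` and its parameters are hypothetical names, not part of the crawler, and whether retrying resolves this particular report is an assumption.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up.

    The delay doubles after each failure (base_delay, 2*base_delay, ...).
    The last failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Demonstration with a fake fetch that times out twice, then succeeds.
calls = []
delays = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("Read timed out")
    return "ok"


result = with_retries(flaky_fetch, attempts=3, base_delay=1.0,
                      retry_on=(TimeoutError,), sleep=delays.append)
```

With the requests library, `fetch` could be a small function that calls `requests.get(url, timeout=...)`, and `retry_on` could be `(requests.exceptions.Timeout,)`.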

No data found

python crawler.py -b PublicServan -i 100 200
Processing index: 100
Traceback (most recent call last):
  File "crawler.py", line 188, in <module>
    crawler()
  File "crawler.py", line 56, in crawler
    cookies={'over18': '1'}, verify=VERIFY
  File "/usr/local/lib/python2.6/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.6/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.6/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.6/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.6/site-packages/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [Errno 1] _ssl.c:493: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

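BeautifulSoup is not supported under Python 3.5

Hello, I really like the crawler you wrote; it has helped me a lot.
However, under Python 3.5, BeautifulSoup needs to be version 4.4.0 to be compatible.

So could you please update requirements.txt to:
beautifulsoup4==4.4.0

Sorry for the bother.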

Is end index only available with -1 ?

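Hello, this is about the option to set the end page to a negative number.
When I actually run it, only -1 is a valid value,
and the code also checks only for -1.
If that is the case, the description "(a negative value counts pages back from the last page)" is not accurate.
Could you please clarify? Thanks.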
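
For reference, the behavior the description promises (any negative value counting back from the last page) could look like the sketch below. `resolve_index` is a hypothetical helper, not code from the crawler, and it assumes the number of the board's last page is already known.

```python
def resolve_index(index, last_page):
    """Map a possibly negative page index to an absolute page number.

    -1 is the last page, -2 the page before it, and so on.
    Non-negative values are returned unchanged.
    """
    if index < 0:
        return last_page + index + 1
    return index


def resolve_range(start, end, last_page):
    """Resolve both ends of a start/end pair."""
    return resolve_index(start, last_page), resolve_index(end, last_page)


# Assuming the board's last page is 500:
last_ten = resolve_range(-10, -1, 500)
```

Under that assumption, `-i -10 -1` would cover pages 491 through 500.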

Feature Request: Filter articles by author / pattern

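Thanks for sharing this tool!

PCMan has a handy feature: you can search titles with / or search authors with a.
Do you plan to support these two features in this tool?
Sometimes I just want to save the articles of a particular author.

If you are willing but don't have time to implement it, I can try submitting a PR. Thanks.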
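
Until such a feature exists, the requested filtering can be applied to the crawler's output after the fact. The sketch below works on a list of article dicts shaped like the JSON output format. `filter_articles` is a hypothetical helper, and it assumes the "author" field begins with the user ID, optionally followed by a space and a nickname.

```python
import re


def filter_articles(articles, author=None, title_pattern=None):
    """Keep articles that match the given author ID and/or title regex.

    Assumption: "author" starts with the user ID, optionally followed
    by a space and a nickname in parentheses.
    """
    title_re = re.compile(title_pattern) if title_pattern else None
    kept = []
    for article in articles:
        if author is not None:
            user_id = article.get("author", "").split(" ")[0]
            if user_id != author:
                continue
        if title_re is not None and not title_re.search(article.get("article_title", "")):
            continue
        kept.append(article)
    return kept


sample_articles = [
    {"author": "alice (A)", "article_title": "[Question] exam dates"},
    {"author": "bob (B)", "article_title": "[News] new policy"},
    {"author": "alice (A)", "article_title": "[News] results out"},
]
by_alice = filter_articles(sample_articles, author="alice")
alice_news = filter_articles(sample_articles, author="alice", title_pattern=r"^\[News\]")
```

Because this runs on already-downloaded data, it does not reduce crawling time; filtering during the crawl would need changes inside the crawler itself.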
