dianping-crawler's Introduction

A crawler for Dianping (大众点评).

Pages crawled:

  1. shop profile
  2. shop review
  3. user profile

Usage

Scanner

Taking shop reviews as an example: the downloaded pages are saved under /home/jackon/media/dianping/reviews, and the goal is to extract every user-id that appears in the review pages.

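import re
from scanner import Scanner

# Matches links such as href="/member/12345" (optionally followed by a
# query string) and captures the numeric user id.
uid_ptn = re.compile(r'href="/member/(\d+)(?:\?[^"]+)?"')
json_name = 'uid.json'

s = Scanner(json_name, uid_ptn)
s.scan('/home/jackon/media/dianping/reviews')

# Judging by the output, s.data maps each file name to its list of matches.
for k, v in s.data.items():
    print('{} items in {}'.format(len(v), k))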

After the scan completes, the output looks like this:

20 items in 6845514_1.html
20 items in 550426_18.html
0 items in 3926803_2.html
20 items in 4550817_72.html
0 items in 6006104_3.html
0 items in 22281825_3.html
20 items in 2817364_18.html
20 items in 18221165_1.html
20 items in 550099_10.html
20 items in 21293756_2.html
20 items in 586687_31.html
20 items in 20815806_10.html
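The Scanner class itself is not shown in this README. As a rough illustration of the behaviour the example suggests, here is a minimal sketch that reads every file in a directory, applies the pattern with findall, and stores the matches per file name. The class name SketchScanner and the choice to write the results to json_name as JSON are assumptions made for illustration, not the repository's actual implementation. The sketch targets Python 3.

```python
import json
import os
import re


class SketchScanner(object):
    """Hypothetical stand-in for Scanner; not the repository's code."""

    def __init__(self, json_name, pattern):
        self.json_name = json_name  # assumed: path where results are saved
        self.pattern = pattern      # compiled regular expression
        self.data = {}              # file name -> list of matches

    def scan(self, directory):
        # Read each regular file and collect all matches of the pattern.
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding='utf-8', errors='ignore') as f:
                self.data[name] = self.pattern.findall(f.read())
        # Assumed: persist the results so later runs can reuse them.
        with open(self.json_name, 'w', encoding='utf-8') as f:
            json.dump(self.data, f)
```

Because the pattern has a single capturing group, findall returns just the captured user ids as strings, so a page with 20 member links yields a list of 20 ids and a page with none yields an empty list.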

Shell commands for compressing / decompressing the data

$ time 7z x shop_prof_20150821.7z
# Folders: 1
# Files: 164805
# Size:       18068875270
# Compressed: 359098116

# real    164m37.408s
# user    18m2.913s
# sys     39m50.784s
$ ls shop_prof | wc -l
# 164805
$ 7z l shop_prof_20150821.7z | grep -F '.html' | wc -l
# 164805
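The same sanity check can be done from Python. The sketch below counts the .html files in an extracted directory and works out the compression ratio from the sizes reported by 7z above. The function name count_html is an assumption introduced here for illustration (Python 3).

```python
import os


def count_html(directory):
    """Count entries in `directory` whose names end with .html."""
    return sum(1 for name in os.listdir(directory) if name.endswith('.html'))


# Sizes in bytes, copied from the 7z output above.
size = 18068875270
compressed = 359098116

ratio = size / compressed
print('compression ratio: about {:.1f}x'.format(ratio))
```

After extraction, count_html('shop_prof') should equal the number of files reported by 7z (164805 here). The ratio works out to roughly 50x, meaning the archive is about 2% of the uncompressed size.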

dianping-crawler's People

Contributors

vivian-xu