
58ucrawler

A user crawler for 58.com (58同城).

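## The project contains two programs:

  1. A crawler that collects user ids from search pages
  2. A crawler that collects user information from user pages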

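## Crawling approach:

  1. Crawl user ids from the search pages (city by city), keep paging through the results, and write each id found to the database
  2. Concatenate baseurl and a user id to form the URL of that user's profile page, crawl its content, and write it to the database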
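
The two stages above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the base URL, the table name `t_uid`, its columns, and the helper function names are all assumptions made for the example, and the network fetching itself is left out.

```python
import sqlite3

# Hypothetical base URL; the crawler's real base URL is not given in this README.
BASE_URL = "http://example.com/user/"


def save_uids(conn, city, uids):
    """Stage 1: store user ids found on a search page, ignoring duplicates."""
    # Hypothetical table layout: one row per uid, with a "crawled" flag.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS t_uid ("
        "uid TEXT PRIMARY KEY, city TEXT, crawled INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO t_uid (uid, city) VALUES (?, ?)",
        [(uid, city) for uid in uids],
    )
    conn.commit()


def pending_page_urls(conn):
    """Stage 2: build the profile page URL for every uid not yet crawled."""
    rows = conn.execute(
        "SELECT uid FROM t_uid WHERE crawled = 0 ORDER BY uid"
    ).fetchall()
    return [(uid, BASE_URL + uid) for (uid,) in rows]


def mark_crawled(conn, uid):
    """Record that the profile page for this uid has been fetched and stored."""
    conn.execute("UPDATE t_uid SET crawled = 1 WHERE uid = ?", (uid,))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_uids(conn, "bj", ["1001", "1002", "1001"])  # the duplicate id is ignored
    for uid, url in pending_page_urls(conn):
        print(uid, url)
        mark_crawled(conn, uid)
```

Keeping the ids in a table with a "crawled" flag lets the two crawlers run independently: the id crawler only inserts rows, and the page crawler only picks up rows it has not processed yet.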

## Runtime requirements:

  • python2.6+
  • sqlite

## Package dependencies:

  • lxml
  • sqlite3

## Running the program:

The entry point is app.py; the program also uses a resource file, city.txt.

  1. Create the database:

    $ python app.py setup

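  2. Start the uid crawler. Tune the number of threads to your network conditions; you can use nohup to run the program in the background

    $ python app.py crawluid # one thread

    $ python app.py crawluid 3 1000 # three threads, 1000 ms crawl interval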

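  3. Start the page crawler. Tune the number of threads to your network conditions; you can use nohup to run the program in the background

    $ python app.py crawlpage # one thread

    $ python app.py crawlpage 3 1000 # three threads, 1000 ms crawl interval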

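  4. Once crawling is finished, export the data from the database

    • Method 1:

      Go to the directory containing tc_skill.db, open a terminal, and run the following commands in order

      $ sqlite3 tc_skill.db # open the database shell

      sqlite> .output result.csv # redirect the shell's output to a file (the default is stdout)

      sqlite> select * from t_tc_user; # query all rows in the table (output was redirected in the previous step, so the results are written to the file)

    • Method 2:

      Keep using app.py

      $ python app.py export result.csv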
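
For reference, an export like this can also be done directly with Python's standard `sqlite3` and `csv` modules. The sketch below is an assumption about what such an export might look like, not the code behind `app.py export`, which this README does not show; the function name `export_table` is hypothetical, and the code is written for Python 3. Going through the `csv` module adds a header row and quotes values that contain commas.

```python
import csv
import sqlite3


def export_table(db_path, table, csv_path):
    """Write every row of `table` to `csv_path` with a header row.

    Returns the number of data rows written.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"%s"' % table.replace('"', '""')
        cursor = conn.execute("SELECT * FROM " + quoted)
        header = [col[0] for col in cursor.description]
        count = 0
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            for row in cursor:
                writer.writerow(row)
                count += 1
    finally:
        conn.close()
    return count


if __name__ == "__main__":
    # File and table names are the ones mentioned in this README.
    print(export_table("tc_skill.db", "t_tc_user", "result.csv"))
```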
