py-spider-for-wechat's Introduction

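Overview

A Python crawler that scrapes the historical articles of a WeChat Official Account together with their content, then extracts every holiday-related article (the article type to extract can be changed).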

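Progress

Completed
  1. The basic pipeline is built and retrieves most article data; a small amount of data is still lost, and this is being worked on.
  2. Data cleaning, organization, and persistent storage are done; the storage format is {time, title, url, text content}.
  3. Regex matching of holiday-related articles is done; the storage format is the same as above.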
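
Items 2 and 3 above can be illustrated with a minimal sketch. It assumes each article is a dict with the keys `time`, `title`, `url`, and `content` (English key names chosen here for illustration) and that holiday articles are recognized by a keyword pattern. `HOLIDAY_PATTERN`, `filter_holiday_articles`, and `save_records` are hypothetical names, and the keyword list is an assumption, not the pattern the project actually uses.

```python
import json
import re

# Assumed keyword list; swap in other keywords to match a different article type.
HOLIDAY_PATTERN = re.compile(r"春节|元宵|清明|端午|中秋|国庆|节日")


def is_holiday_article(record):
    """Return True if the title or the body text mentions a holiday keyword."""
    return bool(
        HOLIDAY_PATTERN.search(record["title"])
        or HOLIDAY_PATTERN.search(record["content"])
    )


def filter_holiday_articles(records):
    """Keep only the records whose title or content matches the pattern."""
    return [r for r in records if is_holiday_article(r)]


def save_records(records, path):
    """Persist records as JSON Lines: one {time, title, url, content} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

Because the filtered records keep the same four fields, the same `save_records` call could write both the full data set and the holiday subset.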
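To be optimized
  1. Streamline the workflow so that crawling and saving run in a single step [v]
  2. Stop collecting URLs by paging through the article list; find a new approach that retrieves more data in less time and gets blocked less often [ ]
  3. Use multithreading when fetching article content to speed up retrieval of the article text [v]
  4. Move the config into a separate file [ ]
  5. Obtain the cookie and token automatically [ ]
  6. Accept the Official Account fakeid as command-line input [ ]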
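
Item 3 (multithreaded content fetching) could look roughly like the following sketch, which uses the standard library's `ThreadPoolExecutor`. The function `fetch_all` and its `fetch_text` callback are hypothetical; the project's own threading code is not shown here, so this is one possible arrangement and not its actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_text, max_workers=8):
    """Fetch article text for every URL concurrently.

    fetch_text is any callable that takes a URL and returns the article text.
    Returns a list of (url, text) pairs in the same order as urls; text is
    None for any URL whose fetch raised an exception.
    """
    def safe_fetch(url):
        try:
            return fetch_text(url)
        except Exception:
            # One failed article should not abort the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of completion order.
        texts = list(pool.map(safe_fetch, urls))
    return list(zip(urls, texts))
```

Threads suit this step because each fetch spends most of its time waiting on the network. A small `max_workers` value is probably the safer choice, since sending many requests at once may increase the chance of being blocked.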

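Usage

  1. Log in to your own WeChat Official Account platform, obtain (or refresh) the cookie, the target account's unique fakeid, and the token, and set them in getAllUrls.py
  2. Run run-spider.sh and enter the requested values at the prompts to crawl the historical articles of the specified Official Account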
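
Step 1 currently means editing getAllUrls.py by hand. As a sketch of how the still-pending config file and command-line fakeid might work, the example below assumes a JSON file (default name `config.json`, an assumption) holding `cookie` and `token`, with the fakeid passed as a positional argument. `parse_args`, `validate_credentials`, and `load_credentials` are hypothetical helpers and are not part of the project today.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the fakeid and an optional config path from the command line."""
    parser = argparse.ArgumentParser(
        description="Crawl the historical articles of one Official Account."
    )
    parser.add_argument("fakeid", help="unique fakeid of the target Official Account")
    parser.add_argument(
        "--config",
        default="config.json",  # assumed default file name
        help="JSON file containing the cookie and token",
    )
    return parser.parse_args(argv)


def validate_credentials(config):
    """Return (cookie, token) from a config dict, or raise ValueError if one is missing."""
    missing = [key for key in ("cookie", "token") if not config.get(key)]
    if missing:
        raise ValueError("missing config keys: " + ", ".join(missing))
    return config["cookie"], config["token"]


def load_credentials(path):
    """Read the JSON config file at path and return (cookie, token)."""
    with open(path, encoding="utf-8") as f:
        return validate_credentials(json.load(f))
```

With helpers like these, the cookie and token would live outside the source file, so refreshing them after they expire would not require touching the code.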

Author

· 王思哲 · [email protected]

py-spider-for-wechat's People

Contributors

zzzz0zzzz
