bdwenku-spider

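A spider for Baidu Wenku (百度文库)

Supports downloading txt, word, pdf, and ppt resources.

The script analyzes the source of the page that hosts the resource to find the interface that serves it, requests the resource with the requests library, applies hand-written rules to stitch the text back together, and finally writes the text to a folder in the same directory as the script.

Detailed usage instructions on Jianshu: http://www.jianshu.com/p/8c103a566bd9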

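  • Some materials on Baidu Wenku can only be downloaded with download vouchers
  • Wenku does let us preview them, but it does not let us copy the content
  • We only need the text inside, and have no requirements for its styling

Running on Windows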

doc.gif

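Features of the downloader:

1. Detects the document type automatically from the URL you enter and saves the downloaded resource in the matching folder 自动分类.png
2. Converts ppt documents to images automatically and saves them named in their original order image.png
3. Strips all formatting from pdf, word, and txt data and saves the text in txt format image.png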
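The sorting and naming behaviour listed above could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the folder names, the `target_path` helper, the `.jpg` extension, and the zero-padded numbering are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from document type to output folder name.
FOLDER_BY_TYPE = {"txt": "txt", "word": "word", "pdf": "pdf", "ppt": "ppt"}


def target_path(base_dir: Path, doc_type: str, title: str,
                page: Optional[int] = None) -> Path:
    """Return the path where one downloaded item would be saved.

    ppt pages become numbered images so they keep their original order;
    every other type is flattened to a single .txt file.
    """
    if doc_type not in FOLDER_BY_TYPE:
        raise ValueError(f"unsupported document type: {doc_type}")
    folder = base_dir / FOLDER_BY_TYPE[doc_type]
    if doc_type == "ppt":
        if page is None:
            raise ValueError("ppt pages need a page number")
        # Zero-pad the page number so page 10 sorts after page 9.
        return folder / title / f"{page:03d}.jpg"
    return folder / f"{title}.txt"
```

Zero-padding is one simple way to keep file listings in slide order; without it, a plain alphabetical sort would place `10.jpg` before `2.jpg`.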

Results

下载word与pdf.png (downloading word and pdf)

下载ppt.png (downloading ppt)

下载txt.png (downloading txt)

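Where the downloader gets its data

The script analyzes the source of the page that hosts the resource to find the interface that serves it, requests the resource with the requests library, applies hand-written rules to stitch the text back together, and finally writes the text to a folder in the same directory as the script.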
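The request, stitch, and save steps described here might look something like the sketch below. It is only an illustration under stated assumptions: the payload shape (`body`, `text`, `y`) is hypothetical and is not Baidu Wenku's real response format, and `join_fragments`, `fetch`, and `save_text` are names invented for this example.

```python
import json
from pathlib import Path


def join_fragments(payload: str) -> str:
    """Rebuild plain text from a JSON payload of positioned fragments.

    Assumed (hypothetical) shape: {"body": [{"text": "...", "y": 12.5}, ...]}
    Consecutive fragments with the same y value are treated as one line.
    """
    parts = []
    last_y = None
    for fragment in json.loads(payload)["body"]:
        if last_y is not None and fragment["y"] != last_y:
            parts.append("\n")  # vertical position changed: start a new line
        parts.append(fragment["text"])
        last_y = fragment["y"]
    return "".join(parts)


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one resource as text using the requests library."""
    import requests  # third-party; imported lazily so the other helpers run without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def save_text(text: str, base_dir: Path, folder_name: str, file_name: str) -> Path:
    """Write text into base_dir/folder_name/file_name and return the path."""
    folder = base_dir / folder_name
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / file_name
    path.write_text(text, encoding="utf-8")
    return path
```

To write next to the script, as the description says, `base_dir` could be `Path(__file__).resolve().parent`. The URL passed to `fetch` would be whatever interface address was found in the page source.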

word documents

word.png

ppt documents

ppt.png

txt documents

image.png

I have compiled this script into an exe file. Windows users can find it by article title in the resource post below: http://www.jianshu.com/p/4f28e1ae08b1

Contributors

zhaoolee