yidao620c / core-scrapy

python-scrapy demo

Shell 1.26% Python 98.74%

core-scrapy's Introduction

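A Study of Scrapy, the Python Web Crawling Framework

Scrapy 1.0 Tutorial

Wiki

Scrapy is a fast, high-level screen scraping and web crawling framework written in Python. It crawls websites and extracts structured data from their pages, and it is useful for a wide range of tasks, including data mining, monitoring, and automated testing.

What makes Scrapy attractive is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and the sitemap spider, as well as support for crawling Web 2.0 sites.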

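"Scrape" means to grab or scratch something off, and the name Scrapy presumably comes from the same idea, so let's nickname the framework "Little Scratcher" (小刮刮).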

Written against Scrapy 1.0, the latest release at the time, and since updated to Python 3.6.

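The project collects content from multiple websites. The main features are:

Crawling lists of the latest articles
Storing the collected data in a MySQL database, including the title, publication date, source site, link URL, and other fields
URL deduplication: the program guarantees that the same link is never crawled twice
Avoiding IP bans: crawling too frequently gets the crawler's IP banned. Of the five strategies below, the project currently uses the first three:
- Strategy 1: set download_delay. The delay is set to 5 seconds, and larger values are safer
- Strategy 2: disable cookies. Some sites use cookies to identify users, so disabling them keeps the server from tracking the crawler
- Strategy 3: use a user agent pool, picking a different browser header at random from the pool for each request so the crawler does not reveal itself
- Strategy 4: use an IP pool. This needs a large number of IP addresses, which the project does not appear to have yet
- Strategy 5: distributed crawling. This is aimed at large crawling systems and is not needed here for now
Crawling after a simulated login
Crawling RSS feeds
Configurable crawl rules: when a new target site is added, or an existing site changes its layout, only a configuration file should need to change, rather than the source code or a new piece of spider code. The rules are mostly expressed as XPath
Scheduled crawling through a periodic scheduled task
Integration with the WeChat Official Accounts platform, randomly distributing the latest articles to a large number of subscription accounts
Crawling content rendered by page JavaScript, using scrapy-splash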
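
The first three anti-ban strategies in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: DOWNLOAD_DELAY and COOKIES_ENABLED are standard Scrapy settings, while USER_AGENT_POOL, the example header strings, and the RandomUserAgentMiddleware class name are assumptions made for this sketch.

```python
import random

# Strategies 1 and 2: two standard Scrapy settings (normally in settings.py).
DOWNLOAD_DELAY = 5        # wait 5 seconds between requests to the same site
COOKIES_ENABLED = False   # do not store or send cookies

# Strategy 3: a hypothetical pool of browser User-Agent strings.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 "
    "(KHTML, like Gecko) Version/10.1 Safari/603.1.30",
    "Mozilla/5.0 (X11; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENT_POOL):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request. Returning None
        # tells Scrapy to keep processing the request as usual.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

To enable such a middleware, it would be registered in the DOWNLOADER_MIDDLEWARES setting, for example under a module path like myproject.middlewares.RandomUserAgentMiddleware with a priority such as 543. Both the path and the number are placeholders.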

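Contributing

Fork the repository
Create your feature branch: git checkout -b my-new-feature
Commit your changes: git commit -am 'Added some feature'
Push your changes to the remote git repository: git push origin my-new-feature
Then open a Pull Request from the my-new-feature branch of that remote repository on GitHub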

License

Copyright (c) 2014-2016 Xiong Neng

Released under the MIT License: http://www.opensource.org/licenses/MIT

core-scrapy's Issues

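Crawler

Scheduled crawling

Hi, your README mentions that content can be crawled on a schedule, but I couldn't find this in the code. I ran into the same problem. With the following approach:
d = runner.crawl(spider)
d.addBoth(lambda _: reactor.stop())
the first run completes, but when the next scheduled run starts it fails with twisted.internet.error.ReactorNotRestartable. Any help would be appreciated.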
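
A note on the error above: a Twisted reactor cannot be started again once it has been stopped in the same process, which is what ReactorNotRestartable reports. One common workaround, sketched here as an assumption rather than as this project's own scheduling code, is to launch every crawl in a fresh process so that each run gets its own reactor. The function name crawl_periodically is hypothetical.

```python
import subprocess
import sys
import time


def crawl_periodically(command, interval_seconds, max_runs=None):
    """Run `command` in a new process every `interval_seconds`.

    Returns the list of exit codes, one per run. With max_runs=None the
    loop never ends, which suits a long-running scheduler.
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        # A separate process means a separate, never-before-stopped reactor.
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return exit_codes
```

For an hourly crawl this could be called as crawl_periodically(["scrapy", "crawl", "huxiu"], interval_seconds=3600), where "huxiu" is a placeholder spider name. A cron job that runs scrapy crawl achieves the same effect without a Python wrapper.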

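Scrapy Notes 02 - Complete Example: isn't the tutorial a bit messy?

Scrapy Notes 02 - Complete Example

The part about saving data to the database doesn't seem to match what comes before it, does it?
Aren't huxiu_spider.py and article_spider.py two different files?

Speaking as a PHP developer, this had me confused for quite a while.

What does session_scope do here?

Is Python 3 not supported? It fails with errors under Python 3.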

a = Article(url=item["url"],
            title=item["title"].encode("utf-8"),
            publish_time=item["publish_time"].encode("utf-8"),
            body=item["body"].encode("utf-8"),
            source_site=item["source_site"].encode("utf-8"))
with session_scope(self.Session) as session:  # here
    session.add(a)
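
On the session_scope question: its definition is not shown on this page, but the name and the way it is called match the "transactional scope" context manager recipe from the SQLAlchemy documentation. The sketch below assumes the project follows that recipe. FakeSession is a hypothetical stand-in that only records calls, so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()      # the with-block finished cleanly: commit
    except Exception:
        session.rollback()    # something failed: undo, then re-raise
        raise
    finally:
        session.close()       # always release the session


class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy session; records calls."""

    def __init__(self):
        self.calls = []

    def add(self, obj):
        self.calls.append(("add", obj))

    def commit(self):
        self.calls.append(("commit",))

    def rollback(self):
        self.calls.append(("rollback",))

    def close(self):
        self.calls.append(("close",))


with session_scope(FakeSession) as session:
    session.add("article")

print(session.calls)  # [('add', 'article'), ('commit',), ('close',)]
```

In other words, the with statement saves the caller from writing commit, rollback, and close by hand. Separately, under Python 3 the .encode("utf-8") calls in the snippet above produce bytes rather than str, which may be related to the reported errors. The issue does not include a traceback, so this is only a guess.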
