Giter Club home page Giter Club logo

bilibili_video_stat's Introduction

bilibili_video_stat

爬取b站视频信息,供大数据分析用户喜好。使用scrapy-redis分布式,在16核服务器上实现抓取2500万条/天。可长期部署抓取,实现视频趋势分析

  • 1.提供代理ip池
  • 2.提供user agent池
  • 3.使用scrapy-redis分布式
  • 4.使用mongodb保存数据(也可使用mysql,提供了相应代码)
  • 5.使用多进程(单个爬虫线程数请根据设备情况自行在settings.py中修改)
  • 6.提供完整error记录
  • 7.提供自定义重试中间件UserRetryMiddleware(如使用,请在settings.py中修改)
  • 8.提供在redis中添加starurls的代码,请阅读redis_n.py文件

启动命令

pip3 install -r requirements.txt
python3 start.py 进程数
# python3 start.py 32
  • 建议配合代理ip池使用,提供 https://github.com/arthurmmm/hq-proxies 代理池的调用中间件,需在settings中开启中间件

  • 请在使用前设置settings.py中的redis和mongodb的连接

  • pipeline.py注释了使用mysql保存的代码,如使用,请在settings.py添加相应设置

千万级大数据量/天的爬取经验:

  • 1.请准备较大的代理IP池:若数据量在400条/秒,根据b站封ip的测试(短时间请求1000次则封ip),代理池需维持300-400的代理ip

  • 2.请使用短有效期的代理ip:代理ip有效期在3-5分钟需300-400的代理池IP量,若有效期较长,需增加代理池的ip量

  • 3.请准备较多的user agent组成池,useragent.py中的量差不多了

  • 4.请开启重试:超时时间调小为10s,重试次数增加为10;实际测试中,重试1次约占30%,重试2次及以上约占10%,最多的重试了6次;重试占比于代理ip质量相关

  • 5.若出现所有请求全部都重试的情况,请开启DOWNLOAD_DELAY = 0.01;实际测试,即使只延迟0.01s也会显著减少重试次数;若重试情况仍严重,请增加至0.25s

  • 6.需详细记录失败的url,保持数据的完整性

bilibili_video_stat's People

Contributors

wangler2333 avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.