
caoliu-backup's Introduction

Notes from building a crawler-backed website for Caoliu

Main tools

SpringBoot, Redis, ElasticSearch, Jsoup

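Points worth noting

  1. The Alibaba Cloud host has only 2 GB of memory, so ES crashes frequently. The current workaround is to sync the data from Redis to ES after the crawl finishes. That said, ES probably would not crash if the crawler only collected information without downloading videos, since the suspected cause is the high memory usage of the Java process during video downloads
  2. In local testing, ES ran in Docker with a port mapping for 9300, but the transport client still could not connect to ES
  3. Redis pagination is still fairly crude; to be revisited later
  4. Releasing Redis resources: always return the jedis instance to the jedispool in a finally block, otherwise you get NoSuchElementException: Timeout waiting for idle object
  5. Configuring ES for remote connections, and the fact that ES cannot be operated as the root user
  6. Troubleshooting nginx 403 errors
  7. The crawler, especially the file download part, must be written robustly
  8. Limit the number of file downloads with something like a distributed lock built on Redis: put a code-level lock around the get and incr calls, then start the download. If the download throws an exception, do a rollback-style decr, which needs no lock
  9. Get familiar with Linux commands
  10. Thread pools and blocking queues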
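Points 8 and 10 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `DownloadCounter`, `InMemoryCounter`, `DownloadLimiter`, and `DownloadPool` are hypothetical names, and the in-memory counter only stands in for the Redis get/incr/decr calls so that the sketch runs without a Redis server. Note that a `synchronized` block only protects threads inside one JVM; it is a code-level lock as described above, not a true distributed lock.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical abstraction over the Redis counter (GET / INCR / DECR on one key).
interface DownloadCounter {
    long get();
    long incr();
    long decr();
}

// In-memory stand-in so the sketch runs without Redis (an assumption for illustration).
class InMemoryCounter implements DownloadCounter {
    private final AtomicLong value = new AtomicLong();
    public long get() { return value.get(); }
    public long incr() { return value.incrementAndGet(); }
    public long decr() { return value.decrementAndGet(); }
}

class DownloadLimiter {
    private final DownloadCounter counter;
    private final long maxDownloads;
    private final Object lock = new Object();

    DownloadLimiter(DownloadCounter counter, long maxDownloads) {
        this.counter = counter;
        this.maxDownloads = maxDownloads;
    }

    // Returns true if the download ran and succeeded; false if the quota
    // was already used up or the download threw an exception.
    boolean tryDownload(Runnable download) {
        // get and incr must happen as one step, so both sit inside the lock.
        synchronized (lock) {
            if (counter.get() >= maxDownloads) {
                return false;
            }
            counter.incr();
        }
        try {
            download.run(); // the download itself runs outside the lock
            return true;
        } catch (RuntimeException e) {
            counter.decr(); // rollback of the reserved slot; no lock needed
            return false;
        }
    }
}

class DownloadPool {
    // Fixed-size pool with a bounded blocking queue. When the queue is full,
    // CallerRunsPolicy makes the submitting thread run the task itself,
    // which slows the producer down instead of growing memory without bound.
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

A crawler thread would submit `() -> limiter.tryDownload(task)` to the pool. A bounded queue is one way to keep memory in check on a small host, since an unbounded queue of pending downloads would grow with every crawled page.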

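Open issues:

Ranking system

Problems to solve in the first iteration:

  1. Jumping from the front end directly to the video source runs into an access denied error (solved)
  2. Many previewUrl values were not crawled (solved)
  3. Many videos become unavailable (unsolved; the idea is to start a background thread that scans them once a day on a schedule)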
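The daily scan idea from item 3 could look something like the sketch below. It is only an illustration under stated assumptions: `DeadLinkScanner`, `findDead`, and `scheduleDaily` are hypothetical names, and the liveness check is passed in as a `Predicate<String>` because the source does not say how a video should be probed (an HTTP HEAD request would be one option).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.function.Supplier;

class DeadLinkScanner {
    // Hypothetical liveness probe: returns true while the video URL still works.
    private final Predicate<String> isAlive;

    DeadLinkScanner(Predicate<String> isAlive) {
        this.isAlive = isAlive;
    }

    // Returns the URLs whose probe reported "not alive" or threw an exception.
    // Treating an exception as dead is a simplification; a real scan might
    // retry first, since a single network error does not prove the video is gone.
    List<String> findDead(List<String> urls) {
        List<String> dead = new ArrayList<>();
        for (String url : urls) {
            boolean alive;
            try {
                alive = isAlive.test(url);
            } catch (RuntimeException e) {
                alive = false;
            }
            if (!alive) {
                dead.add(url);
            }
        }
        return dead;
    }

    // Runs a scan immediately and then every 24 hours on a daemon thread.
    ScheduledExecutorService scheduleDaily(Supplier<List<String>> urlSource,
                                           Consumer<List<String>> onDead) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "dead-link-scan");
                    t.setDaemon(true);
                    return t;
                });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                onDead.accept(findDead(urlSource.get()));
            } catch (RuntimeException e) {
                // Swallow the error: an exception escaping the task would
                // cancel all later runs of scheduleAtFixedRate.
            }
        }, 0, 24, TimeUnit.HOURS);
        return scheduler;
    }
}
```

The `onDead` callback is where the records would be marked or removed from Redis and ES; that part is left out because the source does not describe the data model.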

10.29

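Search and pagination are basically working now, but page loads seem somewhat slow. The exact cause is unclear; possible factors:

  1. Bandwidth fluctuations on Alibaba Cloud; it is a bottom-tier server, after all
  2. The speed of ES pagination
  3. Loading of image resources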

caoliu-backup's People

Contributors

xdcao
