Comments (5)

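BigDeep commented on June 11, 2024

You probably need to build a fairly large IP pool and route each request through a different proxy IP from it; that should avoid the problem. I haven't built a spider at work, so this is a solution I read about online.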

from fe-interview.

azl397985856 commented on June 11, 2024

Good answer. If I'm using Node, how exactly would I go about it?

BigDeep commented on June 11, 2024

I've written a little of this before. I'm broke and can't afford to buy proxies on Taobao, and people online say you can simply scrape the IPs listed on proxy-listing sites.

Scraping IPs

Scrape usable proxy IPs; I won't list the specific sites. Parse the target page's HTML with cheerio to get each proxy's protocol, ip, and port.
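
Text scraped out of an HTML table tends to carry stray whitespace and mixed case, so it may help to clean each record before using it. The following is only a sketch: `normalizeProxy` is a hypothetical helper, the field names follow the `proxy['protocol']`, `proxy['ip']`, `proxy['port']` shape used in this comment, and restricting the protocol to http/https is my assumption.

```javascript
const net = require('node:net');

// Hypothetical helper: tidy one scraped record, or return null if it looks invalid.
function normalizeProxy(raw) {
  const protocol = String(raw.protocol).trim().toLowerCase();
  const ip = String(raw.ip).trim();
  const portText = String(raw.port).trim();

  // Assumption: only plain http/https proxies are kept.
  const protocolOk = protocol === 'http' || protocol === 'https';
  // net.isIP returns 0 for anything that is not a valid IPv4/IPv6 address.
  const ipOk = net.isIP(ip) !== 0;
  const port = /^\d+$/.test(portText) ? Number(portText) : NaN;
  const portOk = Number.isInteger(port) && port >= 1 && port <= 65535;

  if (!protocolOk || !ipOk || !portOk) return null;
  return { protocol, ip, port };
}
```

Records that come back as `null` can simply be dropped before the availability check.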

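Checking availability

Check whether each scraped proxy IP actually works. I use request for this: if baidu can't be reached through the proxy within 5 seconds, the IP is treated as unusable. The callback then does whatever handling is needed, such as writing usable proxies to a file or storing them in an array for later use. My version wrote them to a file.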

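const request = require('request');

// proxy is one scraped record: { protocol, ip, port }
request({
    url: 'http://www.baidu.com',
    proxy: `${proxy.protocol.toLowerCase()}://${proxy.ip}:${proxy.port}`,
    method: 'GET',
    timeout: 5000 // give up on this proxy after 5 seconds
}, (err, res, body) => {
    // err is set on a timeout or connection failure, so the proxy is unusable;
    // otherwise record it here, e.g. append it to a file or push it to an array
});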
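
To make the "write usable proxies to a file" step concrete, here is a rough sketch. The helper names (`toProxyUrl`, `checkProxy`, `saveUsableProxy`), the one-URL-per-line file format, and the "status 200 means usable" rule are all my assumptions rather than anything fixed by the code above. The request function is passed in as a parameter, so the block itself needs nothing beyond Node's built-in modules.

```javascript
const fs = require('node:fs');

// Build the proxy URL in the same shape as the check above.
function toProxyUrl(proxy) {
  return `${proxy.protocol.toLowerCase()}://${proxy.ip}:${proxy.port}`;
}

// requestFn is any function with the request(options, callback) calling style.
// Resolves to true when the test page answers through the proxy in time.
function checkProxy(requestFn, proxy, timeoutMs = 5000) {
  return new Promise((resolve) => {
    requestFn(
      { url: 'http://www.baidu.com', proxy: toProxyUrl(proxy), method: 'GET', timeout: timeoutMs },
      // Assumption: no error plus HTTP 200 counts as "usable".
      (err, res) => resolve(!err && Boolean(res) && res.statusCode === 200)
    );
  });
}

// Hypothetical file format: one proxy URL per line.
function saveUsableProxy(filePath, proxy) {
  fs.appendFileSync(filePath, toProxyUrl(proxy) + '\n', 'utf8');
}
```

With the request package installed, usage might look like `checkProxy(require('request'), proxy).then((ok) => { if (ok) saveUsableProxy('proxies.txt', proxy); })`, where `proxies.txt` is just a placeholder name.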

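Using the scraped IP pool

With the availability check done, the scraped IPs can be used to write the crawler itself.

  1. Read the IPs from the file and build the IP pool.
  2. Also build a user-agent pool.
  3. Pick a request module: http, request, and superagent all work. superagent needs the superagent-proxy extension.
  4. Use a proxy for each request, choosing randomly from the IP pool and the user-agent pool. Whether to set the referer header depends on the site: my Douban image scraping kept failing until I found that referer had to be set.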
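
The four steps above could be sketched roughly like this. It is not the exact code I wrote back then: `parseProxyList`, `pickRandom`, and `buildRequestOptions` are hypothetical helpers, the file is assumed to hold one proxy URL per line, and the user-agent strings are only examples. The returned object uses the option names that request accepts (`url`, `proxy`, `headers`, `timeout`).

```javascript
// Step 1: turn the saved file contents into a pool (assumed: one proxy URL per line).
function parseProxyList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Step 2: a small user-agent pool; these strings are only examples.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];

// Step 4: pick a random entry from a pool.
function pickRandom(items) {
  if (items.length === 0) throw new Error('pool is empty');
  return items[Math.floor(Math.random() * items.length)];
}

// Build per-request options with a random proxy and user agent.
// referer is optional; pass it only for sites that require it.
function buildRequestOptions(url, proxyPool, userAgentPool, referer) {
  const headers = { 'User-Agent': pickRandom(userAgentPool) };
  if (referer) headers.Referer = referer;
  return { url, method: 'GET', timeout: 5000, proxy: pickRandom(proxyPool), headers };
}
```

For step 3, the resulting object can be handed to request directly, for example `request(buildRequestOptions(url, pool, USER_AGENTS), callback)`; other modules would need the proxy and headers mapped onto their own options.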

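Summary

In my spare time I wrote scrapers for Weibo images and Douban images, and the above is what I learned along the way. I'm not sure whether I've got anything wrong, and I'd also like to know how big companies handle this. I graduated not long ago and have mostly been doing fairly dull business work at my company, so my skills feel like they are improving slowly. That's why I sometimes try other things in my spare time, hoping to learn more.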

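azl397985856 commented on June 11, 2024

Big companies either use IPs from a proxy IP provider or build their own IP pool.

They can usually switch within seconds when an IP gets banned. Changing IPs has effectively become a service: anyone who wants it can just use it, and the cost of switching is close to zero.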
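
As a toy illustration of switching on a ban (not how any particular company or provider implements it), a rotator might drop the banned proxy and retry with the next one. Everything here is hypothetical: the `ProxyRotator` class, the `doRequest(url, proxy)` callback, and the guess that HTTP 403 or 429 signals a ban.

```javascript
class ProxyRotator {
  constructor(proxies) {
    if (proxies.length === 0) throw new Error('need at least one proxy');
    this.proxies = [...proxies];
    this.index = 0;
  }

  // The proxy that the next request should use.
  current() {
    return this.proxies[this.index];
  }

  // Drop the current proxy (e.g. after a ban) and switch to the next one.
  reportBanned() {
    this.proxies.splice(this.index, 1);
    if (this.proxies.length === 0) throw new Error('proxy pool exhausted');
    this.index %= this.proxies.length;
    return this.current();
  }
}

// Assumption: the target site signals a ban with HTTP 403 or 429.
function looksBanned(statusCode) {
  return statusCode === 403 || statusCode === 429;
}

// doRequest(url, proxy) must return a promise for an object with a statusCode.
async function fetchWithRotation(url, rotator, doRequest, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await doRequest(url, rotator.current());
    if (!looksBanned(res.statusCode)) return res;
    rotator.reportBanned();
  }
  throw new Error(`still blocked after ${maxAttempts} attempts`);
}
```

A real service would also refill the pool and re-check evicted proxies later; this sketch only shows the switch itself.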

stale commented on June 11, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
