
spider-snippet's Introduction

My Spider Snippet

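My spider snippets: a collection of small utilities I often reach for when writing crawlers, each wrapped in a thin layer of code. It covers forged request headers (based on fake-useragent), an IP proxy pool (based on proxy-pool), regular expressions, and similar helpers, along with simple examples to learn from.

A simple, easy-to-use Faker class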

import ast

import requests
from fake_useragent import UserAgent
from requests import Response


class Faker:
    def __init__(self, level=0):
        """
        Faker class to fool the server
        :param level: int, which fake level you want to use, default 0, the lowest, do not fake
        0: do not fake
        1: use fake-user-agent to get random headers
        2: use fake-user-agent + proxy-pool, make sure you have started the proxy-pool server
        """
        self.ua = UserAgent()
        self.proxy_api_url = "http://127.0.0.1:5010/get/"
        self.level = level

    def get_headers(self):
        # The standard HTTP header name is "User-Agent" (with a hyphen).
        headers = {
            "User-Agent": self.ua.random
        }
        return headers

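    def get_proxy(self):
        # Ask the local proxy-pool server for a proxy and parse the body
        # as a Python literal; the result is passed to requests as `proxies`.
        content = requests.get(self.proxy_api_url).content
        return ast.literal_eval(content.decode('utf8'))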

    def faked_get(self, url: str) -> Response:
        if self.level == 0:
            return requests.get(url)
        elif self.level == 1:
            return requests.get(url, headers=self.get_headers())
        else:
            proxy = self.get_proxy()
            return requests.get(url, headers=self.get_headers(), proxies=proxy)
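
The `proxies` argument of `requests.get` expects a mapping from URL scheme to proxy URL, so the proxy-pool response usually needs a small conversion step first. The following is a minimal sketch of such a step, not part of the original class. It assumes the proxy-pool `/get/` endpoint returns a JSON body shaped like `{"proxy": "host:port"}`; that shape and the helper name `to_requests_proxies` are assumptions made for illustration, so check them against the proxy-pool version you actually run.

```python
import json


def to_requests_proxies(body: bytes) -> dict:
    """Turn a proxy-pool response body into a requests-style proxies dict.

    Assumption: the body is JSON shaped like {"proxy": "host:port"}.
    """
    data = json.loads(body.decode("utf8"))
    address = data.get("proxy")
    if not address:
        # An empty pool or an unexpected payload should fail loudly.
        raise ValueError("proxy-pool returned no usable proxy")
    # requests expects a mapping of URL scheme -> proxy URL.
    return {
        "http": f"http://{address}",
        "https": f"http://{address}",
    }


# Example with a hard-coded payload instead of a live proxy-pool server.
print(to_requests_proxies(b'{"proxy": "127.0.0.1:8080"}'))
```

A helper like this could be called from `get_proxy` with the raw response content, which would keep the parsing logic testable without a running proxy-pool server.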

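Scraping examples

Implementation details

Pages are fetched and parsed mainly with the requests and bs4 libraries, which keeps the code short and easy to follow. The project was later refactored on top of Scrapy; the Scrapy-based code lives in a separate repository.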
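
As a rough illustration of the requests + bs4 approach mentioned above, the sketch below fetches a page and collects the `href` of every link. It is not taken from the repository's examples; the function names `extract_links` and `fetch_links` and the 10-second timeout are assumptions chosen for this sketch.

```python
from typing import List

import requests
from bs4 import BeautifulSoup


def extract_links(html: str) -> List[str]:
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def fetch_links(url: str, timeout: float = 10.0) -> List[str]:
    """Download a page and extract its links."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return extract_links(response.text)
```

Keeping the parsing in `extract_links` separate from the network call makes it possible to test the extraction logic against a saved HTML string.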

spider-snippet's People

Contributors

ronden
