ronden / spider-snippet

Some snippets for a small spider

Python 11.63% HTML 85.95% JavaScript 1.45% Shell 0.96%

spider-snippet's Introduction

My Spider Snippet

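My spider snippets: a collection of small tools that I often use when writing web crawlers, each wrapped in a simple interface. It covers forged request headers (based on fake-useragent), an IP proxy pool (based on proxy-pool), regular-expression helpers, and other small utilities, together with a few simple examples for learning.

A simple, easy-to-use Faker class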

import requests
from fake_useragent import UserAgent
from requests import Response


class Faker:
    def __init__(self, level=0):
        """
        Faker class to fool the server
        :param level: int, which fake level to use; default 0, the lowest, no faking
        0: do not fake
        1: use fake-useragent to get a random User-Agent header
        2: use fake-useragent + proxy-pool; make sure the proxy-pool server is running
        """
        self.ua = UserAgent()
        self.proxy_api_url = "http://127.0.0.1:5010/get/"
        self.level = level

    def get_headers(self):
        # The HTTP header name is "User-Agent" (with a hyphen)
        headers = {
            "User-Agent": self.ua.random
        }
        return headers

    def get_proxy(self):
        # Assumption: the proxy-pool API answers with JSON that carries a
        # "proxy" field of the form "host:port"
        address = requests.get(self.proxy_api_url).json()["proxy"]
        # requests expects a mapping from URL scheme to proxy URL
        return {"http": f"http://{address}", "https": f"http://{address}"}

    def faked_get(self, url: str) -> Response:
        if self.level == 0:
            # Plain request, nothing faked
            return requests.get(url)
        elif self.level == 1:
            # Random User-Agent only
            return requests.get(url, headers=self.get_headers())
        else:
            # Random User-Agent plus a proxy taken from the pool
            proxy = self.get_proxy()
            return requests.get(url, headers=self.get_headers(), proxies=proxy)
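
For level 2, the value passed as `proxies` has to be a mapping from URL scheme to proxy URL, which is what `requests` expects. The sketch below shows one way to build that mapping from the proxy-pool response body. It is an illustration under assumptions, not code from this repository: `to_requests_proxies` is a hypothetical helper, and the body is assumed to be either JSON with a "proxy" field or a bare "host:port" string.

```python
import json


def to_requests_proxies(body: str) -> dict:
    """Turn a proxy-pool response body into a requests-style proxies mapping.

    Assumption: the body is JSON with a "proxy" field ("host:port"),
    a JSON string, or the bare "host:port" text itself.
    """
    text = body.strip()
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None  # not JSON, treat the body as plain text

    if isinstance(payload, dict):
        address = payload.get("proxy", "")
    elif isinstance(payload, str):
        address = payload
    else:
        address = text

    if not address or ":" not in address:
        raise ValueError(f"no usable proxy in response: {body!r}")

    # The same HTTP proxy is used for both http:// and https:// targets
    proxy_url = f"http://{address}"
    return {"http": proxy_url, "https": proxy_url}
```

The result can be passed directly as `requests.get(url, proxies=...)`. Raising `ValueError` when the pool has nothing to offer keeps a request from silently going out without a proxy.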

Crawling examples

Crawling CSDN blog articles: csdn_spider
Crawling classical Chinese poetry data: ningyangtv
Testing the selenium tool: selenium-test
Testing the MongoDB Python connection: mongo-server

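Implementation details

Page fetching is mostly done with two libraries, requests and bs4, which keeps the code simple and clear. The code was later refactored with Scrapy; for the Scrapy-based code, see this repository.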
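
The fetch-then-parse split can be sketched on an inline HTML string, without network access. This is a hypothetical illustration, not code from the repository: `LinkCollector` and `extract_links` are made-up names, and the standard-library `html.parser` stands in for bs4 so that the sketch runs with no third-party packages. In a real crawler the HTML would come from `requests.get(url).text`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> list:
    """Parse an HTML string and return its links as (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<ul><li><a href="/post/1">First <b>post</b></a></li></ul>'
    print(extract_links(sample))
```

Keeping the parsing step a pure function of the HTML string makes it easy to test on saved pages before pointing the crawler at a live site.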
