goudan(狗蛋)

Goudan (狗蛋) is a tunnel proxy. In theory it supports any TCP-based proxy protocol, such as HTTP, HTTPS, and SOCKS. By default, goudan crawls free proxies from several websites, so you can use it out of the box.

Why do this

When I write a spider to crawl websites, most of the time the sites have some defensive measures in place.

So I have to change my IP address from time to time to keep crawling.

The usual way is to set a proxy address in an HTTP client library such as Requests, urllib, or aiohttp.

But then I need to write that code in every project, and I don't want to do that.

That is why I started this project.

How to use

Use with Docker (recommended)

docker run -p 1991:1991 -d --restart always --name goudan daoye/goudan

or

docker run -p 1991:1991 -d --restart always --name goudan daoye/goudan --log_level 10 -r 10 -i 60 -t socks

To see the help text:

docker run daoye/goudan -h

From source (requires Python 3.7)

git clone https://github.com/daoye/goudan.git
cd goudan
git checkout develop
python3 main.py

Running it inside a virtualenv is recommended.
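
Once goudan is running, a client only needs to point at it as its proxy. The sketch below uses the standard library's urllib for this. It assumes goudan is reachable at 127.0.0.1 on the published port 1991 and accepts HTTP proxy requests. The helper name build_goudan_opener is made up for this example and is not part of goudan.

```python
import urllib.request

# Assumption: goudan listens on localhost at the port published by
# "docker run -p 1991:1991" and accepts HTTP proxy requests.
GOUDAN_PROXY = "http://127.0.0.1:1991"


def build_goudan_opener(proxy_url=GOUDAN_PROXY):
    """Return a urllib opener that sends http and https requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

With goudan running, something like build_goudan_opener().open("http://example.com/", timeout=10) would then send the request through the tunnel. Other libraries take the same address through their own proxy settings.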

Add your proxies

If you have some other proxies, you can add them to the proxy pool.

To do this, create a new spider. For example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-


class MySpider():
    def run(self):
        return [
            {"host": "127.0.0.1", 'port': 1080, 'type': 'socks', 'loc': 'jp'},
            {"host": "127.0.0.1", 'port': 1087, 'type': 'http', 'loc': 'jp'}
        ]

This spider returns a list of proxies.

Alternatively, you can collect proxies from another website:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


from lxml import etree
from spiders.baseSpider import BaseSpider
import logging

class MySpider(BaseSpider):
    def __init__(self):
        BaseSpider.__init__(self)

        # These are target urls.
        self.urls = [
            'http://www.xxx.xxx/'
        ]

        # Crawl every 10 minutes (the interval is given in seconds).
        self.idle = 10 * 60

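    def _parse(self, results, text):
        # Parse "text" here (for example with lxml) and build "rows".
        # "rows" is a placeholder: each row must provide the proxy's ip and port.
        rows = []

        # Add every parsed proxy to "results".
        for r in rows:
            results.append({
                'host': r.ip,
                'port': r.port,
                'type': 'http',
                'loc': 'cn'
            })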
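
The parsing step depends entirely on the target site, so the following is only a minimal sketch. It assumes the page lists proxies as plain "ip:port" pairs; the function name parse_proxy_text and the regular expression are illustrative and not part of goudan. A real site may need lxml and XPath instead.

```python
import re

# Assumption: proxies appear in the text as "ip:port", e.g. "10.0.0.1:8080".
PROXY_PATTERN = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def parse_proxy_text(text, proxy_type="http", loc="cn"):
    """Extract proxy items from text that lists ip:port pairs."""
    results = []
    for host, port in PROXY_PATTERN.findall(text):
        results.append({
            "host": host,
            "port": int(port),  # the port must be an integer
            "type": proxy_type,
            "loc": loc,
        })
    return results
```

Inside a spider, the body of _parse could then extend results with the items returned by such a helper.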

A proxy item is a dictionary with these keys:

host: The IP address.

port: The port. It must be an integer.

type: The proxy's type. It can be http, https, http/https, or socks.

loc: The location of the proxy (not important; reserved for future use).
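
To catch malformed items before they reach the pool, a spider could check them against the rules above. This is a sketch only: validate_proxy is a hypothetical helper, not a goudan function, and treating loc as optional is an assumption based on it being described as unimportant.

```python
ALLOWED_TYPES = {"http", "https", "http/https", "socks"}
REQUIRED_KEYS = ("host", "port", "type")  # assumption: "loc" is optional


def validate_proxy(item):
    """Return a list of problems in a proxy item; an empty list means it looks valid."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in item:
            problems.append("missing key: " + key)
    if "port" in item:
        port = item["port"]
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(port, int) or isinstance(port, bool):
            problems.append("port must be an integer")
    if "type" in item and item["type"] not in ALLOWED_TYPES:
        problems.append("unknown type: " + str(item["type"]))
    return problems
```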

After you create a spider, you must register it in "setting.py".

Open "setting.py", find the "spiders" variable, and add your spider to it:

spiders = [
    ...
    'spiders.mySpider.MySpider'  # This is your spider.
]
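
Each entry is a dotted path of the form "package.module.ClassName". How goudan resolves these strings is not shown here. The sketch below shows one common way to turn such a path into a class with importlib; load_class is a hypothetical name used for illustration.

```python
import importlib


def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

Under that assumption, 'spiders.mySpider.MySpider' means the file spiders/mySpider.py must define a class named MySpider.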

The end

Enjoy!

License

MIT License
