EasyCrawl

An easy tool for crawling resources from URLs.

Install

conda create -n easycrawl -y python=3.11
conda activate easycrawl

pip install requests
pip install beautifulsoup4
pip install git+https://github.com/guanhuankang/easyBucket.git
pip install git+https://github.com/guanhuankang/easyCrawl.git

Run Toy Code

rm -rf easybucketdatabase
python crawl.py

Tutorial

from easycrawl import EasyCrawl
from easycrawl import defaultHanlder, md5

if __name__=="__main__":
    urls = ["https://..."]
    entrances = [(md5(str(x)), x) for x in urls]
    
    easyCrawl = EasyCrawl(entrances=entrances, handler=defaultHanlder, n_threads=16)
    easyCrawl.start()
    easyCrawl.join()

where "defaultHanlder" is the page handler function. You will usually write your own handler to suit your needs; it must follow the same interface as "defaultHanlder":

import requests
from bs4 import BeautifulSoup
from easycrawl import md5

def defaultHanlder(hash, url, queue):
    '''
    hash (str): the unique identifier that refers to the url.
    url (any): the URL resource; it can be of any type you define, such as https://...
    queue (EasyQueue): a thread-safe FIFO queue (from easycrawl import EasyQueue) that records all URLs in FIFO order.
    '''
    ## Code Here
    ## Toy Code: find all <img> tags until no more resources are available.
    res = requests.get(url)
    html = BeautifulSoup(res.content, 'html.parser')
    print(html.find_all('img'))  ## print all <img> tags found on the page

    ## append more urls to queue for future visiting (BFS)
    for link in html.find_all('a'):
        href = link.get("href", "#")
        ## visited() is keyed by hash (see the EasyQueue section), so hash the href before checking
        if href.startswith("http") and not queue.visited(md5(href)):
            queue.push(md5(href), href)  ## push href to queue so that we can visit it later.

    print(f"Queue: {queue.size()}", end="\r")  ## print the number of remaining links
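
The link-enqueueing step of a handler can be tried without network access or the easycrawl package. The sketch below is only an illustration under stated assumptions: StubQueue is a hypothetical in-memory stand-in that mimics the push, visited and size methods described in this README, md5 is a hashlib-based stand-in for the helper imported from easycrawl (assumed to return a hex digest), and LinkCollector and enqueue_links are names invented for this example. It uses the standard-library HTML parser instead of BeautifulSoup.

```python
import hashlib
from html.parser import HTMLParser


def md5(text):
    """Stand-in for easycrawl's md5 helper (assumed to return a hex digest)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class StubQueue:
    """Hypothetical in-memory stand-in exposing push, visited and size."""

    def __init__(self):
        self.items = []    # pending (hash, url) pairs in FIFO order
        self.seen = set()  # every hash that was ever pushed

    def push(self, hash, url):
        self.items.append((hash, url))
        self.seen.add(hash)

    def visited(self, hash):
        return hash in self.seen

    def size(self):
        return len(self.items)


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("http"):
                self.links.append(href)


def enqueue_links(html_text, queue):
    """Push every not-yet-seen absolute link in html_text onto the queue."""
    parser = LinkCollector()
    parser.feed(html_text)
    for href in parser.links:
        key = md5(href)  # the queue is keyed by hash, not by the raw URL
        if not queue.visited(key):
            queue.push(key, href)


page = (
    '<a href="https://a.example/">A</a>'
    '<a href="#top">top</a>'
    '<a href="https://a.example/">A again</a>'
)
queue = StubQueue()
enqueue_links(page, queue)
print(queue.size())  # the relative link and the duplicate are both skipped
```

Keying both visited and push on the same hash is what prevents a page from being queued twice when several pages link to it.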

EasyQueue

This repo also includes a thread-safe FIFO queue named EasyQueue. Its advantages are:

[x] Thread-Safe
[x] Simple to use: it supports push, pop, size, has, visited.
[x] Memory-efficient: it uses a paging mechanism that dumps part of the queue to storage to save memory, which makes it friendly to memory-limited machines such as most personal VPSs.

from easycrawl import EasyQueue, md5

data = {"str": "Hello World", "int": 666}
hash = md5(data["str"])  ## we use the md5 value as the unique identifier; you can choose any identifier you like

easyQueue = EasyQueue(name="anyname")

easyQueue.push(hash, data)   ## push an item into the queue
print("size:", easyQueue.size())  ## size of the queue
print("has:", easyQueue.has(hash))  ## whether the queue currently holds the item identified by "hash"
print("visited:", easyQueue.visited(hash))  ## whether the queue has ever seen this item (regardless of whether it is still in the queue)
print("Advantage usage# setVisitedData:", easyQueue.setVisitedData(hash, data={"status": "in queue"}))  ## attach additional data to this item's entry in the visiting tree
print("Advantage usage# info:", easyQueue.info(hash))  ## read back the entry: push/pop counts plus the attached data
print(easyQueue.pop())  ## pop the front item: (hash, data)

'''
size: 1
has: True
visited: True
Advantage usage# setVisitedData: None
Advantage usage# info: {'push': 1, 'pop': 0, 'data': {'status': 'in queue'}}
('b10a8db164e0754105b7a99be72e3fe5', {'str': 'Hello World', 'int': 666})
'''
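
The paging idea listed among the advantages can be illustrated with a small sketch. This is not EasyQueue's actual implementation, which this README does not describe; PagedQueue, its page_size parameter, and the pickle file layout are all assumptions made for illustration, and locking for thread safety is left out to keep the example short. Only the newest and oldest items stay in memory, while full pages in between are written to disk.

```python
import os
import pickle
import tempfile
from collections import deque


class PagedQueue:
    """Hypothetical FIFO queue that spills full pages of items to disk."""

    def __init__(self, page_size, directory):
        self.page_size = page_size
        self.directory = directory
        self.head = deque()   # oldest items, popped next
        self.tail = deque()   # newest items, not yet spilled
        self.pages = deque()  # paths of spilled pages, oldest first
        self.count = 0
        self.next_page_id = 0

    def push(self, item):
        self.tail.append(item)
        self.count += 1
        if len(self.tail) >= self.page_size:
            self._spill()

    def _spill(self):
        # Write the full tail page to disk and free its memory.
        path = os.path.join(self.directory, f"page_{self.next_page_id}.pkl")
        self.next_page_id += 1
        with open(path, "wb") as f:
            pickle.dump(list(self.tail), f)
        self.pages.append(path)
        self.tail.clear()

    def pop(self):
        if not self.head:
            self._refill()
        if not self.head:
            raise IndexError("pop from empty queue")
        self.count -= 1
        return self.head.popleft()

    def _refill(self):
        if self.pages:
            # Load the oldest spilled page back into memory.
            path = self.pages.popleft()
            with open(path, "rb") as f:
                self.head.extend(pickle.load(f))
            os.remove(path)
        else:
            # Nothing on disk: the remaining items are all in the tail.
            self.head, self.tail = self.tail, deque()

    def size(self):
        return self.count


with tempfile.TemporaryDirectory() as tmp:
    q = PagedQueue(page_size=3, directory=tmp)
    for i in range(10):
        q.push(i)
    spilled = len(os.listdir(tmp))        # pages currently on disk
    popped = [q.pop() for _ in range(10)]  # items come back in push order

print(spilled, popped)
```

With this layout memory use is bounded by roughly two pages regardless of queue length, at the cost of one file read per page when popping.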
