This is in alpha stage.

PCrawl

PCrawl is a PHP library for crawling and scraping web pages.
It supports multiple clients (curl, guzzle) and provides options to debug, modify and parse responses.

Features

  • Rapidly create custom clients. Switch clients and client options, such as the user-agent, fluently with method chaining.
  • Responses can be modified using reusable callback functions.
  • Debug responses using different criteria: HTTP code, regex, etc.
  • Parse responses using the querypath library. Several convenience functions are provided.
  • Fluent API. Debuggers, clients and response mod objects can be swapped on the fly.

Full Example

We'll fetch a page in a way that initially fails, detect the failure with a debugger, and then change the client options to fetch the page correctly.

  • Set up some clients
// A simple client with default options.
$gu = new GuzzleClient();

// Custom Client, that does not allow redirects.
$uptightNoRedirectClient = new CurlClient();
$uptightNoRedirectClient->setRedirects(0); // disable redirects

// Custom client - thin wrapper around curl
class ConvertToHttpsClient extends CurlClient
{
    public function get(string $url, array $options = []): PResponse
    {
        $url = str_replace('http://', 'https://', $url);
        return parent::get($url, $options);
    }
}
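One detail worth knowing about the wrapper above: `str_replace` rewrites every `http://` in the string, including one embedded in a query parameter. If only the leading scheme should change, a prefix-only rewrite is one option. The sketch below is self-contained and does not use PCrawl; `upgradeScheme` is a hypothetical helper name that could be called inside `get()` instead of `str_replace`.

```php
<?php
// Hypothetical helper (not part of PCrawl): upgrade only the leading scheme.
function upgradeScheme(string $url): string
{
    // Case-insensitive match that must start at position 0.
    if (stripos($url, 'http://') === 0) {
        return 'https://' . substr($url, strlen('http://'));
    }
    // Already https, or some other scheme: leave untouched.
    return $url;
}

// The nested URL in the query string is left as-is.
echo upgradeScheme('http://example.com/?next=http://other.example'), PHP_EOL;
```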
  • Let's create some debugger objects
$redirectDetector = new ResponseDebug();
$redirectDetector->setMustNotExistHttpCodes([301, 302, 303, 307, 308]);
$fullPageDetector = new ResponseDebug();
$fullPageDetector->setMustExistRegex(['#</html>#']);
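To make the two criteria concrete, here is a rough, library-free sketch of what such checks amount to: a response fails if its HTTP code is on a forbidden list or if a required pattern is missing from the body. The function `findFailures` and its parameters are hypothetical and only illustrate the idea; this is not how `ResponseDebug` is implemented.

```php
<?php
// Hypothetical stand-in for the checks configured above; not PCrawl's code.
function findFailures(int $httpCode, string $body, array $forbiddenCodes, array $requiredRegexes): array
{
    $failures = [];
    // "Must not exist" HTTP codes.
    if (in_array($httpCode, $forbiddenCodes, true)) {
        $failures[] = "forbidden HTTP code: $httpCode";
    }
    // "Must exist" patterns in the body.
    foreach ($requiredRegexes as $regex) {
        if (preg_match($regex, $body) !== 1) {
            $failures[] = "required pattern missing: $regex";
        }
    }
    return $failures;
}

$redirectCodes = [301, 302, 303, 307, 308];
print_r(findFailures(301, '', $redirectCodes, []));                // one failure: the redirect code
print_r(findFailures(200, '<html></html>', [], ['#</html>#']));    // no failures
```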
Start fetching!

For testing, we fetch the page with a client that does not follow redirects, then use redirectDetector to detect the 301. If a redirect is detected, we enable redirects on the client and fetch again.

$req = new Request();
$url = "http://www.whatsmyua.info";
$req->setClient($uptightNoRedirectClient);
$count = 0;
do {
    $res = $req->get($url);
    $redirectDetector->setResponse($res);
    if ($redirectDetector->isFail()) {
        var_dump($redirectDetector->getFailDetail());
        // Enable redirects; the next loop iteration fetches again and re-checks.
        $uptightNoRedirectClient->setRedirects(1);
    }
} while ($redirectDetector->isFail() && $count++ < 1); // retry at most once
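The control flow here is a bounded "fetch, check, adjust, retry" loop. The sketch below isolates that pattern without any network access or PCrawl classes; `fakeFetch`, `$followRedirects` and `$maxAttempts` are hypothetical names, and the stub simply pretends that a plain `http://` URL answers with a 301 unless redirects are followed.

```php
<?php
// Hypothetical stub that mimics a server redirecting http:// to https://.
// Returns the HTTP status code a client would see.
function fakeFetch(string $url, bool $followRedirects): int
{
    $isPlainHttp = strpos($url, 'http://') === 0;
    return ($isPlainHttp && !$followRedirects) ? 301 : 200;
}

$redirectCodes = [301, 302, 303, 307, 308];
$followRedirects = false;
$maxAttempts = 2;
$attempts = 0;

do {
    $attempts++;
    $code = fakeFetch('http://example.com', $followRedirects);
    $failed = in_array($code, $redirectCodes, true);
    if ($failed) {
        // Relax the client option before the next attempt.
        $followRedirects = true;
    }
} while ($failed && $attempts < $maxAttempts);

echo "status $code after $attempts attempt(s)", PHP_EOL;
```

Capping the attempts keeps the loop from spinning forever if the adjusted option does not resolve the failure.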

Use the fullPageDetector to check that the complete page was received, then parse the response body using ParserCommon.

if ($fullPageDetector->setResponse($res)->isFail()) {
    // Report why the full-page check failed.
    var_dump($fullPageDetector->getFailDetail());
} else {
    $parser = new ParserCommon($res->getBody());
    $h1 = $parser->find('h1')->text();
    $htmlClass = $parser->find('html')->attr('class');
}
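For comparison, the same two values can be extracted with PHP's bundled DOM extension alone. This is only a sketch of the equivalent lookups, assuming `ext-dom` is enabled; it does not use `ParserCommon` or querypath, and the sample HTML is made up for illustration.

```php
<?php
// Standard-library sketch; PCrawl's ParserCommon is not used here.
$html = '<html class="no-js"><body><h1>Hello</h1></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of emitting them
$doc->loadHTML($html);
libxml_clear_errors();

// Text of the first <h1>, or null when the page has none.
$xpath = new DOMXPath($doc);
$h1Node = $xpath->query('//h1')->item(0);
$h1 = $h1Node !== null ? trim($h1Node->textContent) : null;

// The class attribute of the root <html> element ('' when absent).
$htmlClass = $doc->documentElement->getAttribute('class');
```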

Note: debuggers, clients and parsers can all be reused.

Detailed Usage

Usage of functions can be divided into parts:

Installation

  • Composer:
composer init   # for new projects. 
composer config minimum-stability dev # Will be removed once stable.
composer require gyaaniguy/pcrawl
composer update
include __DIR__ . '/vendor/autoload.php'; #in PHP
  • GitHub:
git clone git@github.com:gyaaniguy/PCrawl.git # clone repo
cd PCrawl 
composer update # install dependencies
mv ../PCrawl /desired/location # Move dir to desired location.
require __DIR__ . '/../PCrawl/vendor/autoload.php'; #in PHP

TODO list

  • Leverage guzzlehttp asynchronous support

Standards

PSR-12
PHPUnit tests 

Contributors

gyaaniguy
