chrome-php

A Chrome Headless wrapper for PHP

Get the DOM of any webpage by using headless Chrome. Inspired by Browsershot.

Requirements

This package requires the Puppeteer Chrome Headless Node library. If you want to install it on Ubuntu 16.04 you can do it like this:

sudo apt-get update
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
sudo npm install --global --unsafe-perm puppeteer
sudo chmod -R o+rx /usr/lib/node_modules/puppeteer/.local-chromium

Installation

To add this package to your project, install it via Composer:

composer require milivojsa/chrome-php

Usage

Here is a quick example of how to use this package:

use ChromeHeadless\ChromeHeadless;

$html = ChromeHeadless::url('https://example.com')->getHtml();

Instead of getting the DOM as a string, you can also use the getDOMCrawler() method, which returns a Symfony\Component\DomCrawler\Crawler instance.

use ChromeHeadless\ChromeHeadless;

$dom = ChromeHeadless::url('https://example.com')->getDOMCrawler();
    
$title = $dom->filter('title')->text();

This makes it easy to filter the DOM for specific elements. See the Symfony DomCrawler documentation for the full API.

Timeout

You can specify a timeout after which the process will be killed. The timeout should be given in seconds.

ChromeHeadless::url('https://example.com')
                ->setTimeout(10)
                ->getDOMCrawler();

If the process runs out of time, a Symfony\Component\Process\Exception\ProcessTimedOutException will be thrown.
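If a timeout should not abort the whole script, the exception can be caught and turned into a fallback value. The following is only a sketch: fetchOrNull() is a hypothetical helper that is not part of this package, and it takes the exception class name as a parameter so that it can be tried without Chrome installed.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of chrome-php): runs $fetch and returns null
 * when it throws an exception of class $timeoutClass. Any other exception
 * is rethrown unchanged.
 */
function fetchOrNull(callable $fetch, string $timeoutClass)
{
    try {
        return $fetch();
    } catch (\Throwable $e) {
        if ($e instanceof $timeoutClass) {
            return null; // timed out: let the caller decide what to do
        }
        throw $e;
    }
}

// Intended use with this package (requires Composer autoloading):
//
// $dom = fetchOrNull(
//     function () {
//         return \ChromeHeadless\ChromeHeadless::url('https://example.com')
//             ->setTimeout(10)
//             ->getDOMCrawler();
//     },
//     \Symfony\Component\Process\Exception\ProcessTimedOutException::class
// );
```

Matching on the specific exception class keeps unrelated failures visible instead of silently swallowing them.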

Custom Chrome Path

You can specify a custom path to your Chrome installation.

ChromeHeadless::url('https://example.com')
                ->setChromePath('/path/to/chrome')
                ->getDOMCrawler();

Custom User Agent

You can specify a custom user agent. By default the standard Chrome Headless user agent will be used.

ChromeHeadless::url('https://example.com')
                ->setUserAgent('nice-user-agent')
                ->getDOMCrawler();

Custom Headers

You can specify custom headers which will be used for the request.

ChromeHeadless::url('https://example.com')
                ->setHeaders([
                    'DNT' => 1 // DO NOT TRACK
                ])
                ->getDOMCrawler();

Blacklist

You can specify a list of regular expressions for files that should not be loaded when you request a website. Each expression is checked against the URL of the file. By default, setBlacklist(array $blacklist, $clean = false) merges the array passed as $blacklist into the current blacklist. To replace the current blacklist instead, set $clean to true.

ChromeHeadless::url('https://example.com')
                ->setBlacklist([
                    'www.example.com'
                ])
                ->setBlacklist([
                    'www.google-analytics.com',
                    'analytics.js'
                ]) // the blacklist now contains www.example.com plus these two entries
                ->getDOMCrawler();

ChromeHeadless::url('https://example.com')
                ->setBlacklist([
                    'www.google-analytics.com',
                    'analytics.js'
                ])
                ->setBlacklist([
                    'www.example.com'
                ], true) // the blacklist now contains only www.example.com
                ->getDOMCrawler();
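The merge and replace behaviour, and the way a pattern is tested against a URL, can be pictured with plain PHP. This is a sketch under assumptions, not the package's actual code: mergeList() and isBlocked() are hypothetical names, and the patterns are assumed to be bare regular expressions without delimiters.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the list handling described above:
 * append by default, replace when $clean is true.
 */
function mergeList(array $current, array $new, bool $clean = false): array
{
    return $clean ? array_values($new) : array_merge($current, array_values($new));
}

/**
 * Hypothetical URL check: true if any pattern matches somewhere in $url.
 * Assumes patterns contain no literal "#", which is used as the delimiter.
 */
function isBlocked(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match('#' . $pattern . '#', $url) === 1) {
            return true;
        }
    }

    return false;
}

$list = mergeList(['www.example.com'], ['www.google-analytics.com', 'analytics.js']);
// $list now holds all three patterns

$list = mergeList($list, ['www.example.com'], true);
// $list now holds only 'www.example.com'
```

Because the entries are regular expressions, an unescaped dot matches any character. Escape it as `\.` when only a literal dot should match.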

Excluded

You can specify a list of resource types that should not be loaded when you request a website. Each entry is compared with the resource type of the requested file. Accepted values are document, stylesheet, image, media, font and script. By default, setExcluded(array $excluded, $clean = false) merges the array passed as $excluded into the current excluded list. To replace the current list instead, set $clean to true.

ChromeHeadless::url('https://example.com')
                ->setExcluded([
                    'document'
                ])
                ->setExcluded([
                    'stylesheet',
                    'image'
                ]) // the excluded list now contains document plus these two entries
                ->getDOMCrawler();

ChromeHeadless::url('https://example.com')
                ->setExcluded([
                    'stylesheet',
                    'image'
                ])
                ->setExcluded([
                    'document'
                ], true) // the excluded list now contains only document
                ->getDOMCrawler();
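Since only a fixed set of resource types is listed above, it can help to check a list for typos before passing it to setExcluded(). The sketch below assumes nothing about whether the package validates its input. unknownResourceTypes() and the RESOURCE_TYPES constant are hypothetical names introduced for illustration.

```php
<?php

declare(strict_types=1);

// Resource types listed in this README as accepted by setExcluded().
const RESOURCE_TYPES = ['document', 'stylesheet', 'image', 'media', 'font', 'script'];

/**
 * Hypothetical helper (not part of chrome-php): returns the entries of
 * $excluded that are not in the documented list, in their original order.
 */
function unknownResourceTypes(array $excluded): array
{
    return array_values(array_diff($excluded, RESOURCE_TYPES));
}

$unknown = unknownResourceTypes(['image', 'images', 'font']);
// $unknown contains only the misspelled entry 'images'
```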

Viewport

You can specify a custom viewport that will be used when you make a request. By default the Chrome Headless standard of 800x600px will be used.

ChromeHeadless::url('https://example.com')
                ->setViewport([
                    'width' => 1920,
                    'height' => 1080
                ])
                ->getDOMCrawler();

Testing

You can run the tests with:

composer test

Contributors

helloiamlukas, milivojsa, torgheh