browserext's Introduction

BrowserExt - PHP extension for web scraping and browser emulation

LAST UPDATE: Upgrading to Qt5.

Features

The BrowserExt PHP extension is a programmatic browser based on QtWebKit and intended for web scraping.

  • Supports JavaScript and AJAX
  • Uses XPath to select elements
  • Lets you fill in forms and click on elements of the document
  • Lets you retrieve attributes, properties and other parameters of document elements, and iterate through the elements in the tree
  • Lets you download files by their links
  • Lets you scroll the page vertically
  • Supports a list of proxy servers and checks proxies in several threads

A short example:

$br = new PhpBrowser();
$br->load('http://localhost/index.html');

//retrieves all links to files
$links = $br->elements('//div[@id="files"]/a');
foreach ($links as $l)
{
    //extracts href property
    $href = $l->prop('href');
    //downloads a file by href link and saves it in C:\test
    $br->download($href, 'C:\\test\\'.basename($href));
}

Installation

Linux (Ubuntu)

The extension requires an X11 server. On a server without a desktop you must use Xvfb. To start Xvfb at system startup, add this line to /etc/rc.local:

Xvfb :0 -screen 0 1024x768x16 > /dev/null 2>&1 &

For the user that the web server runs as, set the DISPLAY environment variable to the display number of the X server that the extension will use. For apache2 this can be done by adding this line to the envvars file:

export DISPLAY=:0.0

On Linux the extension must be compiled from source. This requires:

  • gcc (g++)
  • make
  • php (php5, php5-dev)
  • Qt5 (qtbase5-dev, qt5-qmake, libqt5webkit5, libqt5webkit5-dev, qt5-image-formats-plugins, qt5-default)

Follow these steps:

  1. To compile, run the build script:

    $ ./build.sh

  2. To install the compiled extension:

    $ sudo ./install.sh

  3. Next, add this line to php.ini:

    extension=browserext.so

  4. Restart the web server, for example:

    $ sudo service apache2 restart

Windows

For Windows the extension ships with prebuilt binaries for PHP 5.6, located in the directory binaries\win32.

The extension requires:

  • php 5.6
  • Qt5 with QtWebKit, for example Qt 5.4 or 5.3 (install Qt5 and set the QTDIR environment variable)
  • Microsoft Visual C++ 2012 Redistributable Package (x86)

To install, do the following:

  1. Copy the php_browserext.dll that matches your PHP version into the PHP extensions directory, for example C:\php\ext.

  2. Download and install Microsoft Visual C++ 2012 Redistributable Package (x86).

  3. In the php.ini file, enable the extension by adding the line:

    extension=php_browserext.dll

If you want to compile BrowserExt on Windows, see BUILDWIN.md.
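
After restarting the web server, one quick way to confirm that the extension is active is to check for the PhpBrowser class used throughout this README. The snippet below is a minimal sketch; the helper name browserextAvailable is hypothetical and not part of the extension.

```php
<?php
// Hypothetical helper, not part of BrowserExt: reports whether the
// PhpBrowser class (provided by the extension) is available.
function browserextAvailable()
{
    // Second argument false: do not trigger autoloaders.
    return class_exists('PhpBrowser', false);
}

echo browserextAvailable()
    ? "BrowserExt is loaded\n"
    : "BrowserExt is not loaded; check the extension= line in php.ini\n";
```

The CLI and the web server may read different php.ini files, so run the check in the environment you intend to use.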

Usage

First, create a browser object:

$br = new PhpBrowser();

Next, load a page:

$br->load('http://localhost');

Each page is loaded in a new tab. To load a page in the same tab, pass true as the second parameter. To go back to the previous page, call back().
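
The tab behaviour described here can be sketched as follows. The function name visitThenReturn and the URLs are hypothetical; only load() with its optional second parameter and back() come from the description above.

```php
<?php
// Sketch only: $br is expected to be a PhpBrowser instance.
// The function name and the URLs below are hypothetical.
function visitThenReturn($br, $firstUrl, $secondUrl)
{
    $br->load($firstUrl);         // opens in a new tab (default)
    $br->load($secondUrl, true);  // true: reuse the current tab
    $br->back();                  // go back to the previous page
}

// Run the sketch only when the extension is actually loaded.
if (class_exists('PhpBrowser', false)) {
    visitThenReturn(
        new PhpBrowser(),
        'http://localhost/index.html',
        'http://localhost/page2.html'
    );
}
```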

You can click on a link or button by passing its XPath:

$br->click('//input[@type="submit"]');  

The resulting page is loaded in a new tab. To load it in the same tab, pass true as the second parameter.

You can select elements by XPath:

$els = $br->elements('//a');

This method returns an array of PhpWebElement objects. For each element you can retrieve attributes, properties, the tag name, the element value and more:

$id = $els[0]->attr('id');
$prop = $els[0]->prop('href');
$tag = $els[0]->tagName();
$text = $els[0]->text();

You can navigate to the parent or child elements, which are also PhpWebElement objects:

$parent = $els[0]->parent();
$arr = array();
while (!$parent->isNull())
{
    $arr[] = $parent->tagName();
    $parent = $parent->parent();
}

The code above iterates through all ancestors of the element and stores their tag names in an array.

A relative XPath can be evaluated against an element:

$items = $br->elements('//*[@class="item"]');
foreach ($items as $item)
{
    $a1 = $item->elements('./a[1]');
    $a2 = $item->elements('./a[2]');
    echo $item->tagName().' '.$a1[0]->text().' '.$a2[0]->text();
}

This code loops through all elements with the class item and prints each element's tag name followed by the text of its first and second links.
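
The loop above indexes $a1[0] and $a2[0] directly, which fails for items that have fewer than two links. A small guard can be sketched like this, assuming elements() returns an empty array when nothing matches (not confirmed by this README). The helper name firstText is hypothetical.

```php
<?php
// Hypothetical helper: returns the text of the first element matched by a
// relative XPath, or null when nothing matches. $element is expected to be
// a PhpWebElement; any object with an elements() method works.
function firstText($element, $xpath)
{
    $found = $element->elements($xpath);
    if (empty($found)) {
        return null; // no match (assumes an empty result is "empty")
    }
    return $found[0]->text();
}
```

With this helper, firstText($item, './a[2]') yields null instead of raising an error when the second link is missing.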

You can retrieve the XPath of an element or click on it:

$xp = $items[0]->getXPath();
$items[0]->click();

The browser can use a list of proxy servers for loading pages. Each new page is loaded through a new proxy:

$proxy = array('192.168.0.2:3128', 'user:[email protected]:8888');
$br->setProxyList($proxy, true);
var_dump($br->proxyList());

The code above passes an array of two proxies to the browser. The second parameter specifies whether the proxies should be checked. The last call returns the list of proxies that remain after checking.
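
Because checking can remove proxies from the list, it may be worth stopping early when none remain. The following is a sketch under the assumption that proxyList() returns an array that already reflects the check; the helper name useCheckedProxies is hypothetical.

```php
<?php
// Hypothetical helper: installs a proxy list with checking enabled and
// returns the proxies that remain afterwards. Throws when none remain.
// $browser is expected to be a PhpBrowser instance.
function useCheckedProxies($browser, array $proxies)
{
    $browser->setProxyList($proxies, true); // true: check the proxies
    $remaining = $browser->proxyList();
    if (empty($remaining)) {
        throw new RuntimeException('No working proxies left after checking');
    }
    return $remaining;
}
```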

API

For a detailed description of the classes, see API.md.

XPath Inspector - a Browser utility

The extension also ships with a program called browser, which can be used to retrieve or test the XPath of page elements. It is a very simple web browser. Enter the page URL in the top line. Every page opens in a new tab. Right-clicking on the page displays a context menu with the items Close tab and Get XPath (the Inspector window shows the XPath). Several elements can be selected by holding Ctrl; in that case a common XPath expression is extracted. For example, you can obtain the XPath of all items in a list by selecting two of them and clicking Get XPath. Selecting multiple items also works in the Inspector.

License

BrowserExt is licensed under the MIT license.

browserext's People

Contributors

scraperlab

browserext's Issues

Segmentation fault on some webpages

Hi,
I get Segmentation fault on some web pages. Maybe there's a memory limit problem?
Please advise how to debug and report for a fix.
I run it on php5.4 / Debian Wheezy 64bit
Thank you!

not working with windows 2012

please help i have problem with installing on windows server 2012. when i add line
extension=php_browserext.dll
error1
error2

please help

Form submission problem

Hi, I haven't been able to use the click methods to submit a form successfully. Here's a simple test; if I serve this page on my localhost:

<title>FormTest</title> Content: " . $_POST['content'] . "

"; } ?> Submit

then hit it using browserext, by running this script via CLI:

#!/usr/bin/env php
<?php
$br = new PhpBrowser();
$br->load("http://localhost/formtest.php");
$inp = $br->elements("//input");
foreach($inp as $input) {
    if($input->attr('name') == 'content') {
        if ($br->fill($input->getXPath(), 'Hello World')) echo "Fill OK\n";
    }
}
$inp = $br->elements("//button");
foreach($inp as $input) {
    if($input->attr('class') == 'submit') {
        if($input->click()) echo "Click OK\n";
    }
}
$br->wait(8);
?>

The fill works, I can see the text appear in the phpbrowser window on the Ubuntu desktop, and if I manually click Submit during the final wait, the form is submitted correctly, but I can't get the click call to work. It returns true, but the form is not submitted.

Click seems to work for <a href="..."> elements but not <button>. I noticed that calling click() in php ultimately calls PhpWebView::click2(), and there is an unused PhpWebView::click() function. I tried adding code to use this instead for <button>, but this was not effective. Any ideas?

EDIT: Looks like there is something strange happening with QWebElement::evaluateJavaScript. I enabled console output to do some debugging, and all I can see is alerts etc.. Even if I try to run garbage as a JS string, no errors are displayed. PhpWebPage::javaScriptConsoleMessage doesn't seem to even get called.

Ubuntu 16.04 64bit: stack smashing error on wait

Hi, I was seeing this error when calling the wait method:

 *** stack smashing detected ***: /usr/sbin/apache2 terminated
[Fri Jul 29 16:31:33.311719 2016] [core:notice] [pid 15513] AH00051: child pid 15525 exit signal Aborted (6), possible coredump in /etc/apache2

After running phpbrowser/test.php through valgrind, it turned out that the problem was on line 394 of phpbrowser/browserext.php. I changed int sec = 0; to long sec = 0; and the error has gone away.

issue with centos

hello i am trying to install this ext on centos i got this error while run ./build.sh

make: g++: Command not found // fix by yum install gcc-c++
make: *** [downloader.o] Error 127
make: *** No rule to make target `/browserext-master/browser/../browserext-static/Linux_Debug/libphpbrowser.a', needed by `browser'. Stop.

Screenshot of the page

Hi, I have a suggestion to add to the next update: a method for creating a screenshot of the page, that returns JPG or PNG :)

installing on centos 7

Hi

can you please check this error

[root@omega browserext]# ./build.sh
make: Nothing to be done for `first'.
g++ -Wl,-O1 -Wl,-z,relro -Wl,-rpath-link,/usr/lib64 -o browser main.o -L/root/browserext/browser/../browserext-static -lphpbrowser -lQt5WebKitWidgets -lQt5WebKit -lQt5Widgets -lQt5Gui -lQt5Network -lQt5Core -lGL -lpthread
/root/browserext/browser/../browserext-static/libphpbrowser.a(downloader.o): In function `Downloader::~Downloader()':
/usr/include/QtCore/qstring.h:880: undefined reference to `QString::free(QString::Data*)'
/usr/include/QtCore/qstring.h:880: undefined reference to `QString::free(QString::Data*)'
/root/browserext/browser/../browserext-static/libphpbrowser.a(downloader.o): In function `Downloader::Downloader(QNetworkRequest const&, QString const&, QObject*)':
/usr/include/QtCore/qstring.h:879: undefined reference to `QString::shared_null'
/root/browserext/browser/../browserext-static/libphpbrowser.a(downloader.o): In function `Downloader::Downloader(QString const&, QString const&, QObject*)':
/usr/include/QtCore/qstring.h:879: undefined reference to `QString::shared_null'
/root/browserext/browser/../browserext-static/libphpbrowser.a(downloader.o): In function `Downloader::Downloader(QString const&, QString const&, QObject*)':
/root/browserext/browserext-static/downloader.cpp:20: undefined reference to `QUrl::QUrl(QString const&)'
/root/browserext/browser/../browserext-static/libphpbrowser.a(downloader.o): In function `Downloader::Downloader(QNetworkReply*, QString const&, QObject*)':
/usr/include/QtCore/qstring.h:879: undefined reference to `QString::shared_null'
/root/browserext/browser/../browserext-static/libphpbrowser.a(downloader.o): In function `Downloader::slotError(QNetworkReply::NetworkError)':
/usr/include/QtCore/qstring.h:880: undefined reference to `QString::free(QString::Data*)'
/root/browserext/browser/../browserext-static/libphpbrowser.a(downloader.o): In function `Downloader::download(bool)':
/usr/include/QtCore/qstring.h:1031: undefined reference to `QString::fromAscii(char const*, int)'
/usr/include/QtCore/qstring.h:1029: undefined reference to `QString::fromAscii(char const*, int)'
/usr/include/QtCore/qstring.h:880: undefined reference to `QString::free(QString::Data*)'
/usr/include/QtCore/qstring.h:880: undefined reference to `QString::free(QString::Data*)'
/usr/include/QtCore/qstring.h:880: undefined reference to `QString::free(QString::Data*)'
collect2: error: ld returned 1 exit status
make: *** [browser] Error 1
Configuring for:
PHP Api Version: 20131106
Zend Module Api No: 20131226
Zend Extension Api No: 220131226
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for a sed that does not truncate output... /usr/bin/sed
checking for cc... cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether cc accepts -g... yes
checking for cc option to accept ISO C89... none needed
checking how to run the C preprocessor... cc -E
checking for icc... no
checking for suncc... no
checking whether cc understands -c and -o together... yes
checking for system library directory... lib
checking if compiler supports -R... no
checking if compiler supports -Wl,-rpath,... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
configure: error: Cannot find php-config. Please use --with-php-config=PATH
make: *** No targets specified and no makefile found. Stop.
[root@omega browserext]#


Segmentation fault

Don't know why, but I can't launch an example. It's segmentation fault

Errors

phpbrowser: cannot connect to X server :0.0
No protocol specified

I get that on running the thing.
