querypath_guide's Introduction

How to parse HTML in PHP using the QueryPath library

Intro:

QueryPath - HTML DOM parsing and manipulation library for PHP
The original project is abandoned. An active fork lives here: https://github.com/GravityPDF/querypath

This guide explains how to parse HTML and XML documents using QueryPath. It is written from the point of view of web scraping.

Sources

Article on ibm.com: Archive.org link
That link is now dead and un-googleable. So its content now can be freely stolen without guilt.

API docs relevant to parsing: http://querypath.org/classes/QueryPath.DOMQuery.html

Guide:

Quick example

// Create a new QueryPath object and supply it with the source $html page
$qp = QueryPath::withHTML($html);
// Find the desired HTML nodes
$linkNodes = $qp->find('a');
// Loop through all the links in the page
foreach ($linkNodes as $link) {
    echo $link->text();
}
// Quickly get the title text
$titleText = $qp->find('title')->text();

Generally this is the flow:

  • We create a QueryPath object and supply it with the HTML source.
  • Then various traversing functions can be used to find matching HTML nodes.
  • We can then optionally loop through the nodes.
  • Finally we can use attr(), text() or other functions to extract data from individual nodes.
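
The flow above can be sketched end to end as follows. This is a minimal sketch, not the library's canonical usage: it assumes QueryPath was installed with Composer (so that vendor/autoload.php exists), it uses the htmlqp() helper that also appears later in this guide, and the $html string is a made-up sample page.

```php
<?php
// Assumption: QueryPath (or the fork linked above) was installed with Composer.
require 'vendor/autoload.php';

// Hypothetical sample page, used only for illustration.
$html = '<html><head><title>Demo</title></head><body>'
      . '<a href="/first">First</a><a href="/second">Second</a>'
      . '</body></html>';

// Step 1: create the QueryPath object from the HTML source.
$qp = htmlqp($html);

// Steps 2 to 4: find the nodes, loop over them, extract data from each one.
$links = [];
foreach ($qp->find('a') as $link) {
    // Inside the loop, $link wraps exactly one <a> node.
    $links[$link->attr('href')] = $link->text();
}

print_r($links);
```

Collecting the results into an array keyed by href keeps the scraping step separate from whatever is done with the data afterwards.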

Common traversing methods

| Method | Description | Takes CSS selector? |
| find() | Select any element (beneath the currently selected nodes) that matches the selector | Yes |
| xpath() | Select any elements matching the given XPath query | No (XPath query instead) |
| top() | Select the document element (the root element); if a selector is given, it is then applied from the root | Optional |
| parents() | Select any ancestor element | Yes |
| parent() | Select the direct parent element | Yes |
| siblings() | Select all siblings (both previous and next) | Yes |
| next() | Select the next sibling element | Yes |
| nextAll() | Select all siblings after the present element | Yes |
| prev() | Select the previous sibling | Yes |
| prevAll() | Select all previous siblings | Yes |
| children() | Select elements immediately beneath this one | Yes |
| deepest() | Select the deepest node or nodes beneath this one | No |

Observe: most traversing functions accept a CSS selector to narrow down the search, while xpath() takes an XPath query instead.
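
A few of these traversing methods are shown together below. This is a sketch under assumptions: QueryPath is installed with Composer, the markup is a hypothetical sample, and each chain starts again from top() so that the lookups stay independent of each other.

```php
<?php
// Assumption: QueryPath (or the fork linked above) was installed with Composer.
require 'vendor/autoload.php';

// Hypothetical sample markup, used only for illustration.
$html = '<html><body><ul id="menu">'
      . '<li class="first">One</li><li>Two</li><li>Three</li>'
      . '</ul></body></html>';

$qp = htmlqp($html);

// next(): the sibling element right after the first <li>.
$nextText = $qp->top()->find('li.first')->next()->text();

// parent(): the direct parent of the first <li>, i.e. the <ul>.
$parentId = $qp->top()->find('li.first')->parent()->attr('id');

// children(): elements directly beneath the <ul>, narrowed by a selector.
$childCount = $qp->top()->find('#menu')->children('li')->count();

// nextAll(): every sibling that comes after the first <li>.
$laterCount = $qp->top()->find('li.first')->nextAll()->count();

echo "$nextText $parentId $childCount $laterCount\n";
```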

Common functions to extract data from nodes

text() // Get combined text contents of each element in the set of matched elements, including their descendants.
attr('src') // Get value of an attribute with a given name.
html() // Get HTML contents of matching node
innerHtml() // Get the HTML contents INSIDE the node.

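IMPORTANT: If a traversing function matches multiple nodes, attr(), html() and innerHtml() return data from the first matched node only. text() instead returns the combined text of all matched nodes, as described above. Loop over the nodes to read them one by one.

Example: find('a') matches multiple links. attr('href') returns the href of the first link, while text() returns the text of all the links joined together.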
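
To read a value from every matched node rather than only the first, loop over the result. The sketch below assumes QueryPath is installed with Composer and uses a hypothetical two-link page.

```php
<?php
// Assumption: QueryPath (or the fork linked above) was installed with Composer.
require 'vendor/autoload.php';

// Hypothetical sample page, used only for illustration.
$html = '<html><body><a href="/a">Alpha</a><a href="/b">Beta</a></body></html>';

$links = htmlqp($html)->find('a');

// Called on the whole set, attr() only looks at the first matched node.
$firstHref = $links->attr('href');

// Iterating yields one wrapped node at a time, so every href is collected.
$hrefs = [];
foreach ($links as $link) {
    $hrefs[] = $link->attr('href');
}

print_r($hrefs);
```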

Advanced usage examples:

Convert the encoding of an HTML page to UTF-8:

htmlqp($html, 'body', array('convert_to_encoding' => 'utf-8'))->children('p.a');

Use a chain of traversing functions to find nodes:

$tr = $this->qp->top('body')->find('table[id="main"]')->find('tr:nth-child(3)');

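Here top('body') goes back to the document root and then selects the elements matching the selector. The find() calls that follow use CSS selectors.
Roughly the same query can be written as an XPath. Since find() searches all descendants, the XPath uses // between the steps:

$tr = $this->qp->xpath('//body//table[@id="main"]//tr[3]');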
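
The two styles can be compared side by side on a small table. This is a sketch under assumptions: QueryPath is installed with Composer, the table markup is hypothetical, and the XPath uses the descendant axis (//) between steps to mirror how find() searches beneath the current nodes.

```php
<?php
// Assumption: QueryPath (or the fork linked above) was installed with Composer.
require 'vendor/autoload.php';

// Hypothetical sample table, used only for illustration.
$html = '<html><body><table id="main">'
      . '<tr><td>r1</td></tr><tr><td>r2</td></tr><tr><td>r3</td></tr>'
      . '</table></body></html>';

$qp = htmlqp($html);

// CSS selectors, chained one traversal step at a time.
$viaCss = $qp->top('body')
             ->find('table[id="main"]')
             ->find('tr:nth-child(3)')
             ->text();

// A single XPath query that describes a similar path.
$viaXpath = $qp->top()
               ->xpath('//body//table[@id="main"]//tr[3]')
               ->text();

echo $viaCss, ' ', $viaXpath, "\n";
```

The two queries are not strictly identical: tr:nth-child(3) counts every child element of the parent, while tr[3] counts only tr elements, so they can differ on tables that mix other elements among the rows.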

TODO : add more examples as we find them
