code4craft / xsoup

When jsoup meets XPath.

License: MIT License

Java 100.00%

xsoup's Introduction

Xsoup

XPath selector based on Jsoup.

Get started:

    @Test
    public void testSelect() {

        String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
                "<table><tr><td>a</td><td>b</td></tr></table></html>";

        Document document = Jsoup.parse(html);

        String result = Xsoup.compile("//a/@href").evaluate(document).get();
        Assert.assertEquals("https://github.com", result);

        List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
        Assert.assertEquals("a", list.get(0));
        Assert.assertEquals("b", list.get(1));
    }

Performance:

Xsoup uses Jsoup as its HTML parser.

Compared with HtmlCleaner, another widely used XPath selector for HTML, Xsoup is much faster:

Normal HTML, size 44 KB
XPath: "//a"
Run 2000 times

Environment: Mac Air MD231CH/A
CPU: 1.8 GHz Intel Core i5

Operation	Xsoup	HtmlCleaner
parse	3,207 ms	7,999 ms
select	95 ms	380 ms

Syntax supported:

XPath1.0:

Name	Expression	Support
nodename	nodename	yes
immediate parent	/	yes
parent	//	yes
attribute	[@key=value]	yes
nth child	tag[n]	yes
attribute	/@key	yes
wildcard in tagname	/*	yes
wildcard in attribute	/[@*]	yes
function	function()	part
or	a | b	yes since 0.2.0
parent in path	. or ..	no
predicates	price>35	no
predicates logic	@class=a or @class=b	yes since 0.2.0

Function supported:

Xsoup provides the following functions, some of which are not part of standard XPath 1.0:

Expression	Description	Standard XPath
text(n)	nth text content of element (0 for all)	text() only
allText()	text including children	not supported
tidyText()	text including children, well formatted	not supported
html()	inner HTML of element	not supported
outerHtml()	outer HTML of element	not supported
regex(@attr,expr,group)	use regex to extract content	not supported

Extended syntax supported:

The following syntax is an Xsoup-only extension (for convenience when extracting from HTML; see the Jsoup CSS selector):

Name	Expression	Support
attribute value not equals	[@key!=value]	yes
attribute value starts with	[@key^=value]	yes
attribute value ends with	[@key$=value]	yes
attribute value contains	[@key*=value]	yes
attribute value matches regex	[@key~=value]	yes

License

MIT License, see file LICENSE

xsoup's Issues

Unexpected SelectorParseException

My query is:
( //script | //*[@id] | //*[@class] | //*[@for] )
My query works on Selenium. But when I try to run the query to get a list of jsoup Elements by the following function:

	public static void normaliseHtmlDom(Document htmlDom) throws ConfigInitializationException {
		Elements elements = htmlDom.getAllElements();
		Elements ignoredElements = Xsoup.compile(ConfigUtils.getInstance().getIgnoredXPathUnion()).evaluate(htmlDom).getElements();
		for(Element element : elements) {
			if(ignoredElements.contains(element)) {
				elements.remove(element);
			}
		}
		
//		normaliseElementsByTagNames(elements);
//		normaliseElementsByAttributeNames(elements);
	}

The following Exception is thrown:
org.jsoup.select.Selector$SelectorParseException: Could not parse query '( //script | //*[@id] | //*[@class] | //*[@for] )': unexpected token at '( //script | //*[@id] | //*[@class] | //*[@for] )'
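
Separately from the parse error, the loop above removes items from `elements` while iterating over it, which typically throws a ConcurrentModificationException for list-backed collections. A minimal sketch of a safer pattern, using plain strings in place of jsoup elements (the class and method names are hypothetical and not part of Xsoup or jsoup):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SafeRemovalSketch {
    /** Returns a copy of all items with every ignored item removed. */
    static List<String> withoutIgnored(List<String> all, Set<String> ignored) {
        List<String> kept = new ArrayList<>(all);
        // removeIf performs the removal internally, so no iterator is invalidated.
        kept.removeIf(ignored::contains);
        return kept;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("div", "script", "p");
        Set<String> ignored = new HashSet<>(Arrays.asList("script"));
        System.out.println(withoutIgnored(all, ignored));
    }
}
```

The same idea should carry over to a collection of elements, since `removeIf` is defined on `java.util.Collection`.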

And condition does not work if arguments are reversed.

Hi,

I have found that the xpath

//div[@data-hveid and @class='g']

does not work, returns 0 elements

but the xpath

//div[@class='g' and @data-hveid]

does work, returning 1 element. The code example is below (jsoup 1.11.3, xsoup 0.3.1).

	String html = "<!DOCTYPE html>" +
		"<html>" +
		"  <head>" +
		"    <title>test</title>" +
		"  </head>" +
		"  <body>" +
		"  <div class=\"g\" data-hveid=\"CAYQAA\">" +
		"  </div>" +
		"  </body>" +
		"</html>";
	Document document = Jsoup.parse(html);

	// does not work
	String xpath = "//div[@data-hveid and @class='g']";

	// does work
	//String xpath = "//div[@class='g' and @data-hveid]";

	XElements elements = Xsoup.compile(xpath).evaluate(document);
	System.out.println(elements.getElements().size());

	for (Element element : elements.getElements())
	{
	  System.out.println(element.toString());
	}

I think it should work both ways.
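
For comparison, standard XPath 1.0 treats `and` as order-independent. The sketch below checks both orderings with the JDK's built-in `javax.xml.xpath` engine (not Xsoup) on a well-formed version of the markup above; the class name is made up for illustration:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AndOrderCheck {
    // Well-formed XML equivalent of the HTML in the report.
    static final String XML =
            "<html><body><div class=\"g\" data-hveid=\"CAYQAA\"></div></body></html>";

    /** Counts the nodes selected by a standard XPath 1.0 expression. */
    static int count(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(XML)),
                            XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//div[@data-hveid and @class='g']"));
        System.out.println(count("//div[@class='g' and @data-hveid]"));
    }
}
```

Both expressions select the same single div under the standard rules, which supports the expectation that the order of the operands should not matter.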

Selecting a node by index is not supported: xpath("span[2]/small/text()")

Selecting a node by index, as in "span[2]", is not supported.

Example:
page.getHtml().xpath("span[2]/small/text()");

I suggest adding the following code to XPathParser.java:
    private Evaluator consumePredicates(String queue) {
        // +++ start of added code +++
        if (StringUtils.isNumericSpace(queue)) {
            return new XEvaluators.IsNthOfType(0, Integer.parseInt(queue.trim()));
        }
        // +++ end of added code +++
        XTokenQueue predicatesQueue = new XTokenQueue(queue);
        EvaluatorStack evaluatorStack = new EvaluatorStack();
        Operation currentOperation = null;
        predicatesQueue.consumeWhitespace();
        while (!predicatesQueue.isEmpty()) {
            ...
        }

Error description in readme.md

In the "Extended syntax supported" section of the README:
https://github.com/code4craft/xsoup/blob/master/README.md

The expressions for "start with" and "match regex" conflict:
attribute value start with: [@key~=value] yes
attribute value match regex: [@key~=value] yes

According to the Jsoup selector syntax:
http://jsoup.org/cookbook/extracting-data/selector-syntax
the selector overview says:
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]

Hence I think the "start with" expression in Xsoup should be:
attribute value start with: [@key^=value]

Support for concat

I'm trying to use the XPath CONCAT function without success; I think it isn't implemented yet.

Example:
String result = Xsoup.compile("CONCAT('https://themissingaddress.com/', //a/@href)").evaluate(document).get();
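
In standard XPath 1.0, function names are case-sensitive and the function is spelled `concat`. The sketch below evaluates the expression with the JDK's `javax.xml.xpath` engine rather than Xsoup, on an assumed one-link document; the href value `page.html` and the class name are invented for illustration:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ConcatCheck {
    // Assumed sample document; the real page from the report is not available.
    static final String XML =
            "<html><body><a href=\"page.html\">link</a></body></html>";

    /** Evaluates a standard XPath 1.0 expression and returns its string value. */
    static String evaluate(String xpath) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // concat() converts the node-set //a/@href to the string value of its first node.
        System.out.println(evaluate("concat('https://themissingaddress.com/', //a/@href)"));
    }
}
```

Where a function is unavailable in the selector library, one option is to select `//a/@href` alone and do the concatenation in Java.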

Fetch value from inside an attribute

Here is some HTML from one of the web pages I am parsing:

<section class="property-header
                         " data-locale="de-DE" data-campsite-data="{&quot;id&quot;:2645,&quot;property_id&quot;:2645,&quot;property_name&quot;:&quot;Erzgebirgscamp Neuclausnitz&quot;,&quot;premium&quot;:false,&quot;amenities&quot;:&quot;[\&quot;WiFi\&quot;, \&quot;Free WiFi\&quot;]&quot;,&quot;activities&quot;:&quot;[\&quot;Cycle tracks\&quot;,\&quot;Hiking\&quot;,\&quot;Playground\&quot;,\&quot;Soccer\&quot;,\&quot;Mountainbiking\&quot;,\&quot;Cross-country skiing\&quot;]&quot;,&quot;leisure&quot;:&quot;[\&quot;Washing-up area\&quot;,\&quot;Washing machines\&quot;,\&quot;Tumble dryer\&quot;,\&quot;Washbasins\&quot;,\&quot;Shower cubicles\&quot;,\&quot;Heated sanitary facilities\&quot;,\&quot;Shared barbecueing area\&quot;,\&quot;Bread delivery\&quot;]&quot;,&quot;utilities&quot;:&quot;[\&quot;Guest boat moorings\&quot;,\&quot;Firewood available\&quot;,\&quot;Electric bike charging station\&quot;]&quot;,&quot;rules&quot;:&quot;[\&quot;Dogs Allowed\&quot;,\&quot;Barbecueing On Pitch Allowed\&quot;,\&quot;Car On Pitch Allowed\&quot;]&quot;,&quot;slug&quot;:&quot;erzgebirgscamp-neuclausnitz&quot;,&quot;coordinates&quot;:&quot;(50.7411,13.516)&quot;,&quot;distance&quot;:&quot;3594.66454744824&quot;,&quot;zip&quot;:&quot;09623&quot;,&quot;town&quot;:&quot;Rechenberg-Bienenm\u00fchle&quot;,&quot;latitude&quot;:50.7411,&quot;longitude&quot;:13.516,&quot;street&quot;:&quot;Hauptstra\u00dfe 
25&quot;,&quot;city_id&quot;:16,&quot;city_slug&quot;:&quot;sachsen&quot;,&quot;city&quot;:&quot;Sachsen&quot;,&quot;country&quot;:&quot;Deutschland&quot;,&quot;country_slug&quot;:&quot;de&quot;,&quot;currency&quot;:&quot;EUR&quot;,&quot;rating&quot;:&quot;5&quot;,&quot;podio_id&quot;:508050624,&quot;older_children_lower_age_limit&quot;:6,&quot;older_children_upper_age_limit&quot;:17,&quot;younger_children_lower_age_limit&quot;:0,&quot;younger_children_upper_age_limit&quot;:5,&quot;photo&quot;:&quot;https:\/\/intcamp-eu-west-1-live01-public.s3-eu-west-1.amazonaws.com\/de-live\/gallery_970x545_270661681-campsite.jpg,https:\/\/intcamp-eu-west-1-live01-public.s3-eu-west-1.amazonaws.com\/de-live\/gallery_970x545_270661691-campsite.jpg,https:\/\/intcamp-eu-west-1-live01-public.s3-eu-west-1.amazonaws.com\/de-live\/gallery_970x545_270661694-campsite.jpg&quot;,&quot;availability&quot;:{&quot;9962&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Motorhome\&quot;]&quot;},&quot;9959&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Tent\&quot;]&quot;},&quot;9960&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Tent\&quot;]&quot;},&quot;9961&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Caravan\&quot;]&quot;},&quot;9963&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Motorhome\&quot;]&quot;}},&quot;instant_booking&quot;:true,&quot;category&quot;:&quot;Regul\u00e4r&quot;,&quot;address&quot;:&quot;Hauptstra\u00dfe 25, 09623 Rechenberg-Bienenm\u00fchle, Sachsen, Deutschland&quot;,&quot;slashed_price&quot;:null,&quot;coordinatesShow&quot;:&quot;50.741, 13.516&quot;}" 
data-all-pitches="{&quot;tent&quot;:&quot;Zelt&quot;,&quot;caravan&quot;:&quot;Wohnwagen&quot;,&quot;motorhome&quot;:&quot;Wohnmobil&quot;,&quot;cabin&quot;:&quot;Mietunterkunft&quot;}" data-price-request="price" data-currency-code="EUR">
<div class="row">
<div class="col-xs-12 col-sm-9">
<h1 itemprop="name">Campingurlaub auf dem Erzgebirgscamp Neuclausnitz</h1>
</div>
<div class="col-sm-3 text-right label-instant-wrapper">
<span class="label-instant">
<i class="icon-ic_flash"></i>
Sofortbuchung</span>
</div>
</div>
<div class="row">
<div class="col-xs-12 col-sm-9">
<div class="property-address" itemprop="address" itemscope="" itemtype="https://schema.org/PostalAddress">
<span itemprop="streetAddress">Hauptstraße 25, 09623 Rechenberg-Bienenmühle, Sachsen, Deutschland</span>
<meta itemprop="addressCountry" content="Deutschland">
<meta itemprop="postalCode" content="09623">
</div>
<a class="property-address" target="_blank" href="//google.com/maps/search/Erzgebirgscamp+Neuclausnitz/@">
Hauptstraße 25, 09623 Rechenberg-Bienenmühle, Sachsen, Deutschland </a>
</div>
<div class="col-xs-12 col-sm-9 col-lg-3 rating-wrapper">
<span class="rating-number">5,0</span>
<span class="rating-stars icon-ic-11-ratings"></span>
<a class="rating-popup-link rating-popup-link-desktop" href="#" data-property-id="2645">(Bewertungen)</a>
<a class="rating-popup-link rating-popup-link-mobile" href="#" data-property-id="2645" data-more="(Bewertungen)" data-less="(Bewertungen ausblenden)">(Bewertungen)</a>
</div>
</div>
</section>

Q: I need to fetch the value of data-campsite-data.

In Firefox FirePath this works: //@data-campsite-data

What would be the equivalent using xsoup?

Add hasAttribute() to XElement

Sometimes we want to detect whether the selected XElement can be treated directly as an element or is just a String holding an attribute value. Add hasAttribute() to XElement to detect this.

Use Antlr4 for XPath parsing

Currently I use TokenQueue and write the parser by hand, which makes it difficult to cover all of the syntax.
I did some research on Antlr and found some grammar files for XPath 1.0.
I will try to use Antlr4 to parse XPath.

How to extract all text data from an HTML tag which has one or more than one child tags?

`<div class="columns small>

Xsoup use Jsoup as HTML parser.

` In the above example, the text data is present inside and

. Please give me a solution to extract the data from both tags together as "Xsoup use Jsoup as HTML parser." Thanks.

Problem with text() method

Hi,
using Xsoup with the XPath query //td[text()='Unverb. Preisempf.:'], I get the following exception:
Could not parse query 'td[text()='Unverb. Preisempf.:']': unexpected token at 'text()='Unverb. Preisempf.:''

The same query works fine in Chrome.

How to use the "and" operator

//span[@class='info-name' and contains(text(), 'sometext')]/following-sibling::span/text()
This does not work on 0.3.1; it is parsed to "span .info-name null :parent:root".

Support //*[text()='mytext']

Hey there, I am trying to use text() selectors matching a given string:

//*[text()='mytext']

From the documentation I see that some of the parts work, but the combination does not seem to. Could you make this clearer in the documentation?

Thanks for the work, xsoup + jsoup might replace jdom2 in our implementation here.

Add getter of Element to XElement

Now XElement has Element as private field but no accessor. Add getter of Element to XElement so we can get Element for further operation.

CombiningEvaluator.Or() works as AND

The code vendored from Jsoup 1.7.2:

 /**
         * Create a new Or evaluator. The initial evaluators are ANDed together and used as the first clause of the OR.
         * @param evaluators initial OR clause (these are wrapped into an AND evaluator).
         */
        Or(Collection<Evaluator> evaluators) {
            super();
            if (evaluators.size() > 1)
                this.evaluators.add(new CombiningEvaluator.And(evaluators));
            else // 0 or 1
                this.evaluators.addAll(evaluators);
        }

So CombiningEvaluator.Or(a,b) behaves as AND instead of OR.

I changed it to OR for my own use:

Or(Collection<Evaluator> evaluators) {
            super();
            this.evaluators.addAll(evaluators);
}
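
The intended difference between the two combinators can be sketched with plain `java.util.function.Predicate` objects. This is an illustration of the semantics only, not Xsoup's or Jsoup's evaluator classes; all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;

class OrSketch {
    /** Matches when at least one predicate matches (OR). */
    static <T> Predicate<T> anyOf(Collection<Predicate<T>> predicates) {
        return value -> predicates.stream().anyMatch(p -> p.test(value));
    }

    /** Matches only when every predicate matches (AND). */
    static <T> Predicate<T> allOf(Collection<Predicate<T>> predicates) {
        return value -> predicates.stream().allMatch(p -> p.test(value));
    }

    /** Builds one "id equals X" predicate per given id. */
    static List<Predicate<String>> idEquals(String... ids) {
        List<Predicate<String>> result = new ArrayList<>();
        for (String id : ids) {
            result.add(id::equals);
        }
        return result;
    }

    public static void main(String[] args) {
        // Mirrors @id=te or @id=test versus @id=te and @id=test for an element with id "test".
        System.out.println(anyOf(idEquals("te", "test")).test("test"));
        System.out.println(allOf(idEquals("te", "test")).test("test"));
    }
}
```

Wrapping the initial collection in an AND, as the vendored constructor does, makes the two cases indistinguishable for the first clause, which is the behavior the report describes.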

Problem getting attribute value on specific element from element list (by index)

An xpath query like this one doesn't work with xsoup:

(//img[contains(@class, 'product_image')])[1]/@src

org.jsoup.select.Selector$SelectorParseException: Could not parse query '(//img[contains(@class, 'product_image')])[1]/@src': unexpected token at '(//img[contains(@class, 'product_image')])[1]/@src'
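
For reference, this is how the expression behaves in a standard XPath 1.0 engine. The sketch uses the JDK's `javax.xml.xpath` (not Xsoup) on an assumed two-image document; the file names and the class name are invented:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FirstImageCheck {
    // Assumed sample markup with two matching images under different parents.
    static final String XML =
            "<html><body>"
            + "<div><img class=\"product_image main\" src=\"a.png\"/></div>"
            + "<div><img class=\"product_image\" src=\"b.png\"/></div>"
            + "</body></html>";

    /** Evaluates a standard XPath 1.0 expression and returns its string value. */
    static String evaluate(String xpath) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // The parentheses collect all matches first; [1] then picks the first in document order.
        System.out.println(evaluate("(//img[contains(@class, 'product_image')])[1]/@src"));
    }
}
```

With a library that cannot parse the parenthesized form, a possible workaround is to select all matching `@src` values as a list and take the first entry in Java.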

XPath string() is not supported

xpath("string(.)")
java.lang.IllegalArgumentException: Unsupported function string(.)

Parsing error caused by separator characters inside quotes

Separator characters such as "/" and "|" are recognized first, even inside quoted strings.

For example, in XPath:

      //div/regex('/code4craft/(\w+)')

"/" in '/code4craft/(\w+)' will be recognized as a separator and cause parsing error.

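One way to avoid this kind of error is to track quote state while scanning for separators. The following is a minimal sketch of that idea, not Xsoup's actual tokenizer; the class and method names are hypothetical. It relies on the fact that XPath 1.0 string literals have no escape sequences, so toggling on the opening quote character is sufficient:

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {
    /** Splits input on separator, ignoring separators inside '...' or "..." literals. */
    static List<String> split(String input, char separator) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0; // 0 means "not inside a quoted literal"
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // closing quote found
                }
                current.append(c);
            } else if (c == '\'' || c == '"') {
                quote = c; // opening quote: remember which kind
                current.append(c);
            } else if (c == separator) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("div/regex('/code4craft/(\\w+)')", '/'));
    }
}
```

Here the path is split into two steps, and the slashes inside the regex literal are left untouched.
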
Valid xpath crashes the lib with NPE

Example xml snippet:

<tbody>
    <tr>
        <td>NoLuck</td>
        <td>
            <span>NoHit</span>
        </td>
    </tr>
    <tr>
        <td>whatever</td>
        <td>
            <span>ShouldHit</span>
        </td>
    </tr>
    <tr>
        <td>Again</td>
        <td>
            <span>NoHit</span>
        </td>
    </tr>
</tbody>
XPath:
//table/tbody/tr[contains(td,'whatever']/td/span/text()

It should return: ShouldHit as verified with w3cSchool online tool.
Instead it throws NullPointerException:

Exception in thread "main" java.lang.NullPointerException
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at us.codecraft.xsoup.xevaluator.StructuralEvaluator$ImmediateParent.matches(StructuralEvaluator.java:84)
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at us.codecraft.xsoup.xevaluator.StructuralEvaluator$ImmediateParent.matches(StructuralEvaluator.java:84)
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at org.jsoup.select.Collector$Accumulator.head(Collector.java:42)
at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:31)
at org.jsoup.select.Collector.collect(Collector.java:24)
at us.codecraft.xsoup.xevaluator.DefaultXPathEvaluator.evaluate(DefaultXPathEvaluator.java:29)
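
Note that the expression as posted is missing the closing parenthesis of `contains(...)`. With the parenthesis added, a standard XPath 1.0 engine returns the expected text. The sketch below checks this with the JDK's `javax.xml.xpath` (not Xsoup); the snippet is wrapped in a `table` element, which is an assumption, since the report shows only the `tbody`:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ContainsCheck {
    // The snippet from the report, compacted and wrapped in an assumed <table>.
    static final String XML =
            "<table><tbody>"
            + "<tr><td>NoLuck</td><td><span>NoHit</span></td></tr>"
            + "<tr><td>whatever</td><td><span>ShouldHit</span></td></tr>"
            + "<tr><td>Again</td><td><span>NoHit</span></td></tr>"
            + "</tbody></table>";

    /** Returns the string value of the first node selected by the expression. */
    static String firstText(String xpath) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // contains(td, ...) tests the string value of the first td child of each row.
        System.out.println(
                firstText("//table/tbody/tr[contains(td,'whatever')]/td/span/text()"));
    }
}
```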

Xsoup cannot compile valid xpath expression / ()[1] / first element

I need only the first element for a selector (div[@class="fh-breadcrumb"])[1]. This expression works fine in the Chrome browser.

page.getHtml().xpath("//(div[@class=\"fh-breadcrumb\"])[1]//li").nodes();

But when I try it, I get this exception from Xsoup:

org.jsoup.select.Selector$SelectorParseException: Could not parse query '(div[@class="fh-breadcrumb"])[1]': unexpected token at '(div[@class="fh-breadcrumb"])[1]'
at us.codecraft.xsoup.xevaluator.XPathParser.findElements(XPathParser.java:166) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:76) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:408) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.combinator(XPathParser.java:110) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:74) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:408) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.Xsoup.compile(Xsoup.java:25) ~[xsoup-0.3.1.jar:na]

Sorry, I did not see that you have a separate project for Xsoup (so this duplicates issue code4craft/webmagic#339).

Problem with XPath

Hi, I'm using Xsoup with the XPath query //div[@class="sp"], but I got this match result: <div class="logo sp">北邮人论坛手机版</div>. Is this a bug?

Element not found in <head>

In version 0.3.1, evaluating "/html/head/link[@rel='canonical']/@href" over the DOM that jsoup 1.11.3 generates from the attached file does not find the link element. It is actually found under the body node ("/html/body/link[@rel='canonical']/@href"), which is obviously wrong.
milanuncios_busqueda_modular_synth.htm.zip

nth-of-type selector does not work with tag "SVG"

In XPath like //div/svg[2], only the first element of tag "svg" will be selected.
I checked code in Jsoup,

        protected int calculatePosition(Element root, Element element) {
            int pos = 0;
            Elements family = element.parent().children();
            for (int i = 0; i < family.size(); i++) {
                if (family.get(i).tag() == element.tag()) pos++;
                if (family.get(i) == element) break;
            }
            return pos;
        }

The elements are compared by Tag object, and the Tag object is created by Tag.valueOf().
For known tags, the object is taken from the map Tag.tags. But for an unknown tag (such as svg), a new object is created on every call, so the comparison "if (family.get(i).tag() == element.tag()) pos++;" returns false.
I have sent a pull request to Jsoup: jhy/jsoup#402.
Until it is fixed, I will use XEvaluators.IsNthOfType instead of org.jsoup.select.Evaluator.IsNthOfType.
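
The identity-versus-equality problem described here can be reproduced in isolation. The `Tag` class below is a hypothetical stand-in written for this sketch, not jsoup's class: known names are cached, unknown names get a fresh instance on each call.

```java
import java.util.HashMap;
import java.util.Map;

class TagIdentitySketch {
    /** Hypothetical stand-in for a tag type with a cache of known tags. */
    static final class Tag {
        private static final Map<String, Tag> KNOWN = new HashMap<>();
        static {
            KNOWN.put("div", new Tag("div"));
        }

        final String name;

        private Tag(String name) {
            this.name = name;
        }

        /** Returns the cached instance for known names, otherwise a new instance. */
        static Tag valueOf(String name) {
            Tag known = KNOWN.get(name);
            return known != null ? known : new Tag(name);
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Tag && ((Tag) other).name.equals(name);
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    static boolean sameByIdentity(String a, String b) {
        return Tag.valueOf(a) == Tag.valueOf(b);
    }

    static boolean sameByEquals(String a, String b) {
        return Tag.valueOf(a).equals(Tag.valueOf(b));
    }

    public static void main(String[] args) {
        System.out.println(sameByIdentity("div", "div")); // cached instance
        System.out.println(sameByIdentity("svg", "svg")); // two fresh instances
        System.out.println(sameByEquals("svg", "svg"));   // compares by name
    }
}
```

Comparing with `equals` (or by tag name) avoids depending on whether a tag happens to be cached.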

IsNthOfType does not support Nth of matching elements

Looking at the code for IsNthOfType, it currently only supports finding the Nth element among the children of the element's parent.

Consider: div[@id='rr_soc_top'][1]

This xpath is saying find all divs with id='rr_soc_top' and return the first match.

If the html document has, say 2, div[@id='rr_soc_top'] spread throughout the document then IsNthOfType does not work for this scenario.
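
It may help to separate two forms that standard XPath 1.0 distinguishes: `//div[@id='rr_soc_top'][1]` selects every matching div that is the first match among its own parent's children, while `(//div[@id='rr_soc_top'])[1]` selects the first match in the whole document. A sketch with the JDK's `javax.xml.xpath` engine (not Xsoup) on an assumed document with two such divs under different parents:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NthMatchCheck {
    // Assumed sample markup: the same id appears under two different parents.
    static final String XML =
            "<html><body>"
            + "<section><div id=\"rr_soc_top\">first</div></section>"
            + "<section><div id=\"rr_soc_top\">second</div></section>"
            + "</body></html>";

    /** Counts the nodes selected by a standard XPath 1.0 expression. */
    static int count(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(XML)),
                            XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//div[@id='rr_soc_top'][1]"));   // per-parent position
        System.out.println(count("(//div[@id='rr_soc_top'])[1]")); // document-wide position
    }
}
```

The scenario in this report corresponds to the second, parenthesized form.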

CVE-2022-36033 on jsoup

jsoup 1.15.1 has a security vulnerability (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-36033), which is fixed in 1.15.3.
Could you release a new version with only the dependency upgraded?

Wrong One

NPE in only attribute selector

An XPath such as '@href' generates no evaluator and causes a NullPointerException:

java.lang.NullPointerException
    at org.jsoup.select.Collector$Accumulator.head(Collector.java:42)
    at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:30)
    at org.jsoup.select.Collector.collect(Collector.java:24)
    at us.codecraft.xsoup.DefaultXPathEvaluator.evaluate(DefaultXPathEvaluator.java:27)
    at us.codecraft.webmagic.selector.XpathSelector.selectList(XpathSelector.java:31)
    at us.codecraft.webmagic.selector.HtmlNode.selectElements(HtmlNode.java:80)
    at us.codecraft.webmagic.selector.HtmlNode.xpath(HtmlNode.java:43)

Improvement: Xsoup should return xpath attribute matches using getElements() method

At the moment, I have to match the query against a regex to decide which method to use (whether it's an attribute XPath query or not). I think getElements() should also return matches for attribute XPath queries; this would be a nice improvement.

if (query.matches("(.*/@.*)")) {
            String result = Xsoup.compile(query).evaluate(document).get();
            matches.add(result);
} else {
            Elements results = Xsoup.compile(query).evaluate(document).getElements();
             // more code follows, which adds the results to the matches list...
}

Best regards

Question: Xpath expression fails with Could not parse query

Hi, I am trying to parse the Stay Safe section at this URL, http://wikitravel.org/en/San_Francisco, and my XPath is //h2[span[text()='Stay safe']]/following-sibling::p//text()

When I run the XPath in the Chrome dev console, it evaluates properly and returns text nodes. However, it fails in Xsoup at XpathParser.byFunction() and throws
Could not parse query 'h2[span[text()='Stay safe']]': unexpected token at 'span[text()='Stay safe']'

Do you have any suggestions? TIA.

More Syntax Support

Syntax todo:

[@class=xxx][2]
[text()='xxx']
[contains(text(),'xx')]
tr[position()>3]

Xpath @class= does not work as in chrome.

If I run the following code:

String html = "<!DOCTYPE html>" +
    "<html>" +
    "  <head>" +
    "    <title>test</title>" +
    "  </head>" +
    "  <body>" +
    "  <div class=\"g\">" +
    "  </div>" +
    "  <div class=\"g x y t\">" +
    "  </div>" +
    "  </body>" +
    "</html>";
Document document = Jsoup.parse(html);

String xpath = "//div[@class='g']";

XElements elements = Xsoup.compile(xpath).evaluate(document);
System.out.println(elements.getElements().size());

for (Element element : elements.getElements())
{
  System.out.println(element.toString());
}

I get two elements as output:

<div class="g">

and

<div class="g x y t">

In Chrome, I get only one, the one with the exact match. Who is wrong, xsoup or Chrome?
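
Under the XPath 1.0 rules, `@class='g'` compares the whole attribute value, so the single exact match is the standard result. Matching one class token among several is usually written with the `concat`/`normalize-space` idiom. The sketch below shows both with the JDK's `javax.xml.xpath` engine (not Xsoup); the class name is invented:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassMatchCheck {
    // Well-formed XML equivalent of the two divs from the report.
    static final String XML =
            "<html><body><div class=\"g\"></div><div class=\"g x y t\"></div></body></html>";

    /** Counts the nodes selected by a standard XPath 1.0 expression. */
    static int count(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(XML)),
                            XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Exact comparison of the whole attribute value.
        System.out.println(count("//div[@class='g']"));
        // Token match: pad the value with spaces and look for " g ".
        System.out.println(count(
                "//div[contains(concat(' ', normalize-space(@class), ' '), ' g ')]"));
    }
}
```

The two-element result reported above matches the token-style behavior of a CSS class selector rather than XPath string equality.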

Export Document to w3c dom for more XPath evaluator

Is XPath 2.0 supported?

Hello, I would like to ask whether this project supports XPath 2.0.

Logic Operation Support

Support for logic operation and/or/() in attribute selector.

@Test
    public void testLogicOperation() {

        Document document = Jsoup.parse(html);

        String result = Xsoup.select(document, "//*[@id=te or @id=test]/text()").get();
        assertEquals("aaa", result);

        result = Xsoup.select(document, "//*[@id=te and @id=test]/text()").get();
        assertNull(result);

        result = Xsoup.select(document, "//*[(@id=te or @id=test) and @id=test]/text()").get();
        assertEquals("aaa", result);

        result = Xsoup.select(document, "//*[@id=te or (@id=test and @id=test)]/text()").get();
        assertEquals("aaa", result);
    }

Roadmap of XPath syntax

Name	Expression	Version
condition and nth	//div[@class='a'][1]
last	//div/span[last()-1]

Support for xpath axes

Does xsoup support xpath axes? If not it would be fine to support them.

XPath '|'(or) support

Support for multiple XPath expressions with | as the separator.

e.g.

 //book/title | //book/price

Xpath function "contains" support

https://developer.mozilla.org/en-US/docs/XPath/Functions/contains

example:

     //th[contains(text(),'xxx')]

class="class-name-with-a-space " cannot be found using @class="class-name-with-a-space "

Document doc = Jsoup.parse("<span><div class=\"class-name-with-a-space \" >This is a test element</div></span>");
Elements elems = Xsoup.compile("//div[@class=\"class-name-with-a-space \"]").evaluate(doc).getElements();
System.out.println(elems.size());// Output is 0- no elements are extracted.

When the XPath is evaluated, the class name seems to get trimmed, as the following code returns an element.

Document doc = Jsoup.parse("<span><div class=\"class-name-with-out-space\" >This is a test element</div></span>");
Elements elems = Xsoup.compile("//div[@class=\"class-name-with-out-space \"]").evaluate(doc).getElements();
System.out.println(elems.size());// Output is 1

Even though the XPath contains a space, the space is ignored and an element is returned.
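
For comparison, a standard XPath 1.0 engine keeps the trailing space significant in both directions. A sketch using the JDK's `javax.xml.xpath` (not Xsoup) on the first markup sample from this report; the class name is invented:

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TrailingSpaceCheck {
    // Same markup as in the report; the class value ends with a space.
    static final String XML =
            "<span><div class=\"class-name-with-a-space \">This is a test element</div></span>";

    /** Counts the nodes selected by a standard XPath 1.0 expression. */
    static int count(String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(XML)),
                            XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//div[@class='class-name-with-a-space ']")); // with space
        System.out.println(count("//div[@class='class-name-with-a-space']"));  // without space
    }
}
```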

Include dependency on README file

Include the dependency

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>xsoup</artifactId>
    <version>RELEASE</version>
</dependency>

on README, so users don't have to search mvnrepository.

/cc @code4craft

Next release?

Hi @sutra

any chance to get a new release with updated jsoup and removed assertj dep? It would help to drop some exclusions in other projects.

Thanks in advance
Richard

Problem using the XPath contains function

Example: https://club.autohome.com.cn/bbs/thread/86f2870bac840396/72293736-1.html
Entering the following in the console of the browser's developer tools:
$x("//div[@class='conleft fl']/ul[@class='leftlist']/li[contains(text(),'帖子')]/a[1]/text()")
returns (7) [text, text, text, text, text, text, text]
But with the Xsoup bundled in WebMagic,
page.getHtml().xpath("//div[@class='conleft fl']/ul[@class='leftlist']/li[contains(text(),'帖子')]/a[1]/text()").all();
returns an empty collection. Any help would be appreciated.

Why doesn't last() work?

String lastHref = Xsoup.compile("//li[last()]/p/span/a/@href").evaluate(Jsoup.parse(htmlStr)).get();
The error reported is: Could not parse query '[last()]': unexpected token at 'last()'

Support for XPath starts-with

String source = "\n" +
" AnnotationsBasedJMXAutoExporter\n" +
" org.springframework.jmx.export.MBeanExporter\n" +
" false\n" +
" assembler\n" +
" \n" +
"";
XpathSelector selector = new XpathSelector("//id[starts-with(text(),'Annotations')]");
selector.selectList(source);

The above throws a NullPointerException, but it works fine with xpathStr="//id[starts-with(@id,'Annotations')]"

Seems not used correctly? Could you help me check?

/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/bin/java "-javaagent:/Applications/IntelliJ IDEA CE.app/Contents/lib/idea_rt.jar=59972:/Applications/IntelliJ IDEA CE.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/tools.jar:/Users/linyuechu/IdeaProjects/webCrawler/target/classes:/Users/linyuechu/Downloads/jsoup-1
.12.1.jar:/Users/linyuechu/Downloads/xsoup-master/target/xsoup-0.3.2-SNAPSHOT.jar MyExtractor
Exception in thread "main" java.lang.NoClassDefFoundError: org/jsoup/helper/StringUtil
at us.codecraft.xsoup.XTokenQueue.matchesWhitespace(XTokenQueue.java:159)
at us.codecraft.xsoup.XTokenQueue.consumeWhitespace(XTokenQueue.java:398)
at us.codecraft.xsoup.xevaluator.XPathParser.consumeSubQuery(XPathParser.java:133)
at us.codecraft.xsoup.xevaluator.XPathParser.combinator(XPathParser.java:109)
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:74)
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:408)
at us.codecraft.xsoup.Xsoup.compile(Xsoup.java:25)
at MyExtractor.main(MyExtractor.java:38)
Caused by: java.lang.ClassNotFoundException: org.jsoup.helper.StringUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 8 more

CVE-2021-37714 on jsoup

jsoup 1.13.1 has a security vulnerability (CVE-2021-37714), which is fixed in 1.14.2.
I wonder when the next version of xsoup will be released.