xsoup's Introduction

Xsoup

XPath selector based on Jsoup.

Get started:

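    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Assert;
    import org.junit.Test;
    import us.codecraft.xsoup.Xsoup;

    // The class name is illustrative only.
    public class XsoupExampleTest {

        @Test
        public void testSelect() {

            String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
                    "<table><tr><td>a</td><td>b</td></tr></table></html>";

            Document document = Jsoup.parse(html);

            // Select an attribute value.
            String result = Xsoup.compile("//a/@href").evaluate(document).get();
            Assert.assertEquals("https://github.com", result);

            // Select the text of every matching cell.
            List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
            Assert.assertEquals("a", list.get(0));
            Assert.assertEquals("b", list.get(1));
        }
    }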

Performance:

Xsoup uses Jsoup as its HTML parser.

Compared with HtmlCleaner, another widely used XPath selector for HTML, Xsoup is much faster (about 2.5x at parsing and 4x at selecting in the benchmark below):

Input: normal HTML, size 44 KB
XPath: "//a"
Runs: 2000

Environment: MacBook Air MD231CH/A
CPU: 1.8 GHz Intel Core i5

Operation   Xsoup      HtmlCleaner
parse       3,207 ms   7,999 ms
select      95 ms      380 ms

Syntax supported:

XPath1.0:

Name                    Expression             Support
nodename                nodename               yes
immediate parent        /                      yes
parent                  //                     yes
attribute               [@key=value]           yes
nth child               tag[n]                 yes
attribute               /@key                  yes
wildcard in tagname     /*                     yes
wildcard in attribute   /[@*]                  yes
function                function()             partial
or                      a | b                  yes, since 0.2.0
parent in path          . or ..                no
predicates              price>35               no
predicates logic        @class=a or @class=b   yes, since 0.2.0

Function supported:

Xsoup provides the following functions, some of which are not part of standard XPath 1.0:

Expression                Description                                Standard XPath
text(n)                   nth text content of element (0 for all)    text() only
allText()                 text including children                    not supported
tidyText()                text including children, well formatted    not supported
html()                    inner HTML of element                      not supported
outerHtml()               outer HTML of element                      not supported
regex(@attr,expr,group)   use regex to extract content               not supported
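
As an illustration of what "use regex to extract content" with a group argument means, the sketch below applies a pattern to a string and returns one capture group. It uses only java.util.regex; the class and method names are hypothetical, and this is not Xsoup's own implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGroupSketch {

    // Returns the requested capture group of the first match, or null when nothing matches.
    static String extract(String input, String expr, int group) {
        Matcher matcher = Pattern.compile(expr).matcher(input);
        return matcher.find() ? matcher.group(group) : null;
    }

    public static void main(String[] args) {
        // Group 1 is the part captured by (\w+).
        System.out.println(extract("https://github.com/code4craft/xsoup", "/code4craft/(\\w+)", 1)); // prints xsoup
    }
}
```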

Extended syntax supported:

These syntax extensions exist only in Xsoup. They are modeled on the Jsoup CSS selector syntax for convenience when extracting HTML:

Name                          Expression      Support
attribute value not equals    [@key!=value]   yes
attribute value starts with   [@key^=value]   yes
attribute value ends with     [@key$=value]   yes
attribute value contains      [@key*=value]   yes
attribute value match regex   [@key~=value]   yes

License

MIT License, see file LICENSE

xsoup's People

Contributors

bitdeli-chef, code4craft, dependabot[bot], donsunsoft, patricklam, rzo1, snyk-bot, sutra, umishu

xsoup's Issues

Unexpected SelectorParseException

My query is:
( //script | //*[@id] | //*[@class] | //*[@for] )
My query works in Selenium, but it fails when I run it to get a list of jsoup Elements with the following function:

	public static void normaliseHtmlDom(Document htmlDom) throws ConfigInitializationException {
		Elements elements = htmlDom.getAllElements();
		Elements ignoredElements = Xsoup.compile(ConfigUtils.getInstance().getIgnoredXPathUnion()).evaluate(htmlDom).getElements();
		// Remove in bulk: calling elements.remove() inside a for-each loop over the
		// same list can throw ConcurrentModificationException.
		elements.removeAll(ignoredElements);
		
//		normaliseElementsByTagNames(elements);
//		normaliseElementsByAttributeNames(elements);
	}

The following Exception is thrown:
org.jsoup.select.Selector$SelectorParseException: Could not parse query '( //script | //*[@id] | //*[@class] | //*[@for] )': unexpected token at '( //script | //*[@id] | //*[@class] | //*[@for] )'

And condition does not work if arguments are reversed.

Hi,

I have found that the xpath

//div[@data-hveid and @class='g']

does not work, returns 0 elements

but the xpath

//div[@class='g' and @data-hveid]

does work, returning 1 element. The code example is below (jsoup 1.11.3, xsoup 0.3.1).

	String html = "<!DOCTYPE html>" +
		"<html>" +
		"  <head>" +
		"    <title>test</title>" +
		"  </head>" +
		"  <body>" +
		"  <div class=\"g\" data-hveid=\"CAYQAA\">" +
		"  </div>" +
		"  </body>" +
		"</html>";
	Document document = Jsoup.parse(html);

	// does not work
	String xpath = "//div[@data-hveid and @class='g']";

	// does work
	//String xpath = "//div[@class='g' and @data-hveid]";

	XElements elements = Xsoup.compile(xpath).evaluate(document);
	System.out.println(elements.getElements().size());

	for (Element element : elements.getElements())
	{
	  System.out.println(element.toString());
	}

I think it should work both ways.

Node array (index) selection is not supported: xpath("span[2]/small/text()")

Selecting a node by index, such as "span[2]", is not supported.

Example:
page.getHtml().xpath("span[2]/small/text()");

I suggest adding the following code to XPathParser.java:

    private Evaluator consumePredicates(String queue) {
        // +++ start add code ++++
        if (StringUtils.isNumericSpace(queue)) {
            return new XEvaluators.IsNthOfType(0, Integer.parseInt(queue.trim()));
        }
        // +++ end add code ++++
        XTokenQueue predicatesQueue = new XTokenQueue(queue);
        EvaluatorStack evaluatorStack = new EvaluatorStack();
        Operation currentOperation = null;
        predicatesQueue.consumeWhitespace();
        while (!predicatesQueue.isEmpty()) {
        ...
    }

Error description in readme.md

In the Extended syntax supported section:
https://github.com/code4craft/xsoup/blob/master/README.md

The expressions listed for the start-with operator and the regex operator are identical, so they conflict:
attribute value start with: [@key~=value] yes
attribute value match regex: [@key~=value] yes

According to Jsoup selector syntax:
http://jsoup.org/cookbook/extracting-data/selector-syntax
In the selector overview of Jsoup:
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]

Hence I think the start-with expression in Xsoup should be:
attribute value start with: [@key^=value]

Support for concat

I'm trying to use the XPath CONCAT function with no success. I think it isn't implemented yet.

Example:
String result = Xsoup.compile("CONCAT('https://themissingaddress.com/', //a/@href)").evaluate(document).get();

Fetch value from inside an attribute

Here is the HTML from one of the web pages I am parsing:

<section class="property-header
                         " data-locale="de-DE" data-campsite-data="{&quot;id&quot;:2645,&quot;property_id&quot;:2645,&quot;property_name&quot;:&quot;Erzgebirgscamp Neuclausnitz&quot;,&quot;premium&quot;:false,&quot;amenities&quot;:&quot;[\&quot;WiFi\&quot;, \&quot;Free WiFi\&quot;]&quot;,&quot;activities&quot;:&quot;[\&quot;Cycle tracks\&quot;,\&quot;Hiking\&quot;,\&quot;Playground\&quot;,\&quot;Soccer\&quot;,\&quot;Mountainbiking\&quot;,\&quot;Cross-country skiing\&quot;]&quot;,&quot;leisure&quot;:&quot;[\&quot;Washing-up area\&quot;,\&quot;Washing machines\&quot;,\&quot;Tumble dryer\&quot;,\&quot;Washbasins\&quot;,\&quot;Shower cubicles\&quot;,\&quot;Heated sanitary facilities\&quot;,\&quot;Shared barbecueing area\&quot;,\&quot;Bread delivery\&quot;]&quot;,&quot;utilities&quot;:&quot;[\&quot;Guest boat moorings\&quot;,\&quot;Firewood available\&quot;,\&quot;Electric bike charging station\&quot;]&quot;,&quot;rules&quot;:&quot;[\&quot;Dogs Allowed\&quot;,\&quot;Barbecueing On Pitch Allowed\&quot;,\&quot;Car On Pitch Allowed\&quot;]&quot;,&quot;slug&quot;:&quot;erzgebirgscamp-neuclausnitz&quot;,&quot;coordinates&quot;:&quot;(50.7411,13.516)&quot;,&quot;distance&quot;:&quot;3594.66454744824&quot;,&quot;zip&quot;:&quot;09623&quot;,&quot;town&quot;:&quot;Rechenberg-Bienenm\u00fchle&quot;,&quot;latitude&quot;:50.7411,&quot;longitude&quot;:13.516,&quot;street&quot;:&quot;Hauptstra\u00dfe 
25&quot;,&quot;city_id&quot;:16,&quot;city_slug&quot;:&quot;sachsen&quot;,&quot;city&quot;:&quot;Sachsen&quot;,&quot;country&quot;:&quot;Deutschland&quot;,&quot;country_slug&quot;:&quot;de&quot;,&quot;currency&quot;:&quot;EUR&quot;,&quot;rating&quot;:&quot;5&quot;,&quot;podio_id&quot;:508050624,&quot;older_children_lower_age_limit&quot;:6,&quot;older_children_upper_age_limit&quot;:17,&quot;younger_children_lower_age_limit&quot;:0,&quot;younger_children_upper_age_limit&quot;:5,&quot;photo&quot;:&quot;https:\/\/intcamp-eu-west-1-live01-public.s3-eu-west-1.amazonaws.com\/de-live\/gallery_970x545_270661681-campsite.jpg,https:\/\/intcamp-eu-west-1-live01-public.s3-eu-west-1.amazonaws.com\/de-live\/gallery_970x545_270661691-campsite.jpg,https:\/\/intcamp-eu-west-1-live01-public.s3-eu-west-1.amazonaws.com\/de-live\/gallery_970x545_270661694-campsite.jpg&quot;,&quot;availability&quot;:{&quot;9962&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Motorhome\&quot;]&quot;},&quot;9959&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Tent\&quot;]&quot;},&quot;9960&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Tent\&quot;]&quot;},&quot;9961&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Caravan\&quot;]&quot;},&quot;9963&quot;:{&quot;is_available&quot;:true,&quot;has_instant_booking&quot;:true,&quot;pitch_type&quot;:&quot;[\&quot;Motorhome\&quot;]&quot;}},&quot;instant_booking&quot;:true,&quot;category&quot;:&quot;Regul\u00e4r&quot;,&quot;address&quot;:&quot;Hauptstra\u00dfe 25, 09623 Rechenberg-Bienenm\u00fchle, Sachsen, Deutschland&quot;,&quot;slashed_price&quot;:null,&quot;coordinatesShow&quot;:&quot;50.741, 13.516&quot;}" 
data-all-pitches="{&quot;tent&quot;:&quot;Zelt&quot;,&quot;caravan&quot;:&quot;Wohnwagen&quot;,&quot;motorhome&quot;:&quot;Wohnmobil&quot;,&quot;cabin&quot;:&quot;Mietunterkunft&quot;}" data-price-request="price" data-currency-code="EUR">
<div class="row">
<div class="col-xs-12 col-sm-9">
<h1 itemprop="name">Campingurlaub auf dem Erzgebirgscamp Neuclausnitz</h1>
</div>
<div class="col-sm-3 text-right label-instant-wrapper">
<span class="label-instant">
<i class="icon-ic_flash"></i>
Sofortbuchung</span>
</div>
</div>
<div class="row">
<div class="col-xs-12 col-sm-9">
<div class="property-address" itemprop="address" itemscope="" itemtype="https://schema.org/PostalAddress">
<span itemprop="streetAddress">Hauptstraße 25, 09623 Rechenberg-Bienenmühle, Sachsen, Deutschland</span>
<meta itemprop="addressCountry" content="Deutschland">
<meta itemprop="postalCode" content="09623">
</div>
<a class="property-address" target="_blank" href="//google.com/maps/search/Erzgebirgscamp+Neuclausnitz/@">
Hauptstraße 25, 09623 Rechenberg-Bienenmühle, Sachsen, Deutschland </a>
</div>
<div class="col-xs-12 col-sm-9 col-lg-3 rating-wrapper">
<span class="rating-number">5,0</span>
<span class="rating-stars icon-ic-11-ratings"></span>
<a class="rating-popup-link rating-popup-link-desktop" href="#" data-property-id="2645">(Bewertungen)</a>
<a class="rating-popup-link rating-popup-link-mobile" href="#" data-property-id="2645" data-more="(Bewertungen)" data-less="(Bewertungen ausblenden)">(Bewertungen)</a>
</div>
</div>
</section>

Q: I need to fetch the value of the data-campsite-data attribute.

In Firefox FirePath this works: //@data-campsite-data

What would be the equivalent in Xsoup?

Add hasAttribute() to XElement

Sometimes we need to know whether a selected XElement can be treated directly as an element or is just an attribute value (a String). Add hasAttribute() to XElement to detect this.

Use Antlr4 for XPath parsing

Currently I use TokenQueue and write the parser by hand, which makes it difficult to cover the whole syntax.
I did some research on ANTLR and found some grammar files for XPath 1.0.
I will try to use ANTLR 4 to parse XPath.

Problem with text() method

Hi,
Using Xsoup with the XPath query //td[text()='Unverb. Preisempf.:'], I get the following exception:
Could not parse query 'td[text()='Unverb. Preisempf.:']': unexpected token at 'text()='Unverb. Preisempf.:''

Using the same query within chrome works fine.

How to use the "and" operator

//span[@class='info-name' and contains(text(), 'sometext')]/following-sibling::span/text()
This does not work in 0.3.1; it is parsed as "span .info-name null :parent:root".

Support //*[text()='mytext']

Hey there, I am trying to use text() selectors matching a given string:

//*[text()='mytext']

From the documentation I see that some of the parts work, but it seems that the combination doesn't. Could you make this clearer in the documentation?

Thanks for the work, xsoup + jsoup might replace jdom2 in our implementation here.

Add getter of Element to XElement

XElement currently holds its Element in a private field with no accessor. Add a getter for the Element so that callers can use it for further operations.

CombiningEvaluator.Or() works as AND

The code vendored from Jsoup 1.7.2:

    /**
     * Create a new Or evaluator. The initial evaluators are ANDed together and used as the first clause of the OR.
     * @param evaluators initial OR clause (these are wrapped into an AND evaluator).
     */
    Or(Collection<Evaluator> evaluators) {
        super();
        if (evaluators.size() > 1)
            this.evaluators.add(new CombiningEvaluator.And(evaluators));
        else // 0 or 1
            this.evaluators.addAll(evaluators);
    }

So CombiningEvaluator.Or(a, b) behaves as AND instead of OR.

I changed it to OR for my own use:

    Or(Collection<Evaluator> evaluators) {
        super();
        this.evaluators.addAll(evaluators);
    }
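
To make the difference between the two constructors concrete, here is a minimal sketch that models evaluators as java.util.function.Predicate values. The class and method names are hypothetical stand-ins, not the Jsoup or Xsoup types; the first method mirrors the vendored constructor (several initial evaluators become one AND clause), the second mirrors the change described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;

class OrEvaluatorSketch {

    // Mirrors the vendored constructor: more than one initial evaluator is wrapped into a single AND clause.
    static <T> Predicate<T> orWrappingInitialAnd(Collection<Predicate<T>> evaluators) {
        List<Predicate<T>> clauses = new ArrayList<>();
        if (evaluators.size() > 1) {
            clauses.add(t -> evaluators.stream().allMatch(p -> p.test(t)));
        } else { // 0 or 1
            clauses.addAll(evaluators);
        }
        return t -> clauses.stream().anyMatch(p -> p.test(t));
    }

    // Mirrors the changed constructor: every initial evaluator is its own OR clause.
    static <T> Predicate<T> plainOr(Collection<Predicate<T>> evaluators) {
        List<Predicate<T>> clauses = new ArrayList<>(evaluators);
        return t -> clauses.stream().anyMatch(p -> p.test(t));
    }

    public static void main(String[] args) {
        Predicate<String> startsWithA = s -> s.startsWith("a");
        Predicate<String> endsWithZ = s -> s.endsWith("z");
        List<Predicate<String>> evaluators = Arrays.asList(startsWithA, endsWithZ);

        // "abc" satisfies only the first predicate.
        System.out.println(orWrappingInitialAnd(evaluators).test("abc")); // prints false
        System.out.println(plainOr(evaluators).test("abc"));              // prints true
    }
}
```

The Javadoc on the vendored constructor suggests the AND wrapping was deliberate in Jsoup's own parser, so the plain version is a fit for the use described here rather than a general correction.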

Parsing error caused by separator characters inside quotes

Separator characters such as "/" and "|" are recognized first, even inside quoted strings.

For example, in the XPath:

      //div/regex('/code4craft/(\w+)')

the "/" in '/code4craft/(\w+)' is recognized as a separator and causes a parsing error.

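One way to avoid this kind of error is to track whether the scanner is inside a quoted string and to treat separators as plain characters there. The sketch below shows the idea in isolation; the class and method names are hypothetical, and it is not the Xsoup parser (it does not handle "//", escapes, or nested parentheses).

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitSketch {

    // Splits on the separator only where it appears outside single or double quotes.
    static List<String> splitOutsideQuotes(String expression, char separator) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char openQuote = 0; // 0 means "not inside a quoted string"
        for (char c : expression.toCharArray()) {
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0; // closing quote
                }
                current.append(c);
            } else if (c == '\'' || c == '"') {
                openQuote = c; // opening quote
                current.append(c);
            } else if (c == separator) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        // The "/" characters inside the quoted pattern are kept as part of the second step.
        System.out.println(splitOutsideQuotes("div/regex('/code4craft/(\\w+)')", '/'));
    }
}
```
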
Valid xpath crashes the lib with NPE

Example xml snippet:

<tbody>
    <tr>
        <td>NoLuck</td>
        <td>
            <span>NoHit</span>
        </td>
    </tr>
    <tr>
        <td>whatever</td>
        <td>
            <span>ShouldHit</span>
        </td>
    </tr>
    <tr>
        <td>Again</td>
        <td>
            <span>NoHit</span>
        </td>
    </tr>
</tbody>

XPath:
//table/tbody/tr[contains(td,'whatever']/td/span/text()

It should return: ShouldHit as verified with w3cSchool online tool.
Instead it throws NullPointerException:

Exception in thread "main" java.lang.NullPointerException
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at us.codecraft.xsoup.xevaluator.StructuralEvaluator$ImmediateParent.matches(StructuralEvaluator.java:84)
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at us.codecraft.xsoup.xevaluator.StructuralEvaluator$ImmediateParent.matches(StructuralEvaluator.java:84)
at us.codecraft.xsoup.xevaluator.CombiningEvaluator$And.matches(CombiningEvaluator.java:53)
at org.jsoup.select.Collector$Accumulator.head(Collector.java:42)
at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:31)
at org.jsoup.select.Collector.collect(Collector.java:24)
at us.codecraft.xsoup.xevaluator.DefaultXPathEvaluator.evaluate(DefaultXPathEvaluator.java:29)

Xsoup cannot compile valid xpath expression / ()[1] / first element

I need only the first element for the selector (div[@class="fh-breadcrumb"])[1]. This expression works fine in the Chrome browser.

page.getHtml().xpath("//(div[@class="fh-breadcrumb"])[1]//li").nodes();

But when I try it, I get this exception from Xsoup:

org.jsoup.select.Selector$SelectorParseException: Could not parse query '(div[@class="fh-breadcrumb"])[1]': unexpected token at '(div[@class="fh-breadcrumb"])[1]'
at us.codecraft.xsoup.xevaluator.XPathParser.findElements(XPathParser.java:166) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:76) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:408) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.combinator(XPathParser.java:110) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:74) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:408) ~[xsoup-0.3.1.jar:na]
at us.codecraft.xsoup.Xsoup.compile(Xsoup.java:25) ~[xsoup-0.3.1.jar:na]

Sorry, I did not see that there is a separate project for Xsoup, so this duplicates issue code4craft/webmagic#339.

Problem with XPath

Hi, I'm using Xsoup with the XPath query //div[@class="sp"], but I got this match result: <div class="logo sp">北邮人论坛手机版</div>. Is it a bug?

nth-of-type selector does not work with tag "SVG"

In an XPath like //div/svg[2], only the first element with the tag "svg" is selected.
I checked the code in Jsoup:

        protected int calculatePosition(Element root, Element element) {
            int pos = 0;
            Elements family = element.parent().children();
            for (int i = 0; i < family.size(); i++) {
                if (family.get(i).tag() == element.tag()) pos++;
                if (family.get(i) == element) break;
            }
            return pos;
        }

Elements are compared by their Tag objects, and each Tag object is created by Tag.valueOf().
Known tags are fetched from the map Tag.tags, but for an unknown tag (such as svg) a new object is created on every call, so the comparison "if (family.get(i).tag() == element.tag()) pos++;" returns false.
I have sent a pull request to Jsoup: jhy/jsoup#402.
Until it is fixed, I will use XEvaluators.IsNthOfType instead of org.jsoup.select.Evaluator.IsNthOfType.
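
The effect of comparing by identity rather than by value can be reproduced without Jsoup. The sketch below uses a hypothetical Tag class (not Jsoup's) that caches known names and creates a new object for unknown ones, then counts positions with == and with equals().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TagComparisonSketch {

    // Hypothetical stand-in for a tag type: known names are cached, unknown names get a new object per call.
    static final class Tag {
        private static final Map<String, Tag> KNOWN = new HashMap<>();
        static {
            KNOWN.put("div", new Tag("div"));
        }

        final String name;

        private Tag(String name) {
            this.name = name;
        }

        static Tag valueOf(String name) {
            Tag known = KNOWN.get(name);
            return known != null ? known : new Tag(name);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Tag && ((Tag) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    // 1-based position of siblings.get(target) among siblings with a matching tag.
    static int position(List<Tag> siblings, int target, boolean useEquals) {
        int pos = 0;
        Tag wanted = siblings.get(target);
        for (int i = 0; i < siblings.size(); i++) {
            Tag tag = siblings.get(i);
            if (useEquals ? tag.equals(wanted) : tag == wanted) pos++;
            if (i == target) break;
        }
        return pos;
    }

    public static void main(String[] args) {
        List<Tag> svgs = Arrays.asList(Tag.valueOf("svg"), Tag.valueOf("svg"));
        System.out.println(position(svgs, 1, false)); // prints 1: identity misses the first svg
        System.out.println(position(svgs, 1, true));  // prints 2: value equality counts both
    }
}
```

With identity comparison every uncached tag ends up at position 1, so an index of 2 can never match.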

IsNthOfType does not support Nth of matching elements

Looking at the code for IsNthOfType it currently only supports finding the Nth element of the elements parent->children.

Consider: div[@id='rr_soc_top'][1]

This xpath is saying find all divs with id='rr_soc_top' and return the first match.

If the html document has, say 2, div[@id='rr_soc_top'] spread throughout the document then IsNthOfType does not work for this scenario.
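
For reference, XPath 1.0 applies predicates from left to right, and a positional predicate counts among the nodes that passed the earlier predicates within the same context (for a child step, within the same parent). The sketch below contrasts that with an nth-of-type count over a flat list of siblings. The Node type and method names are hypothetical and the code does not use Xsoup.

```java
import java.util.Arrays;
import java.util.List;

class ChainedPredicateSketch {

    // Hypothetical minimal element: a tag name and an id attribute.
    static final class Node {
        final String tag;
        final String id;

        Node(String tag, String id) {
            this.tag = tag;
            this.id = id;
        }
    }

    // nth-of-type: the position is counted among all siblings with the same tag.
    static Node nthOfType(List<Node> siblings, String tag, int n) {
        int pos = 0;
        for (Node node : siblings) {
            if (node.tag.equals(tag) && ++pos == n) return node;
        }
        return null;
    }

    // Chained predicates: filter by tag and id first, then take the nth of what remains.
    static Node nthOfFiltered(List<Node> siblings, String tag, String id, int n) {
        int pos = 0;
        for (Node node : siblings) {
            if (node.tag.equals(tag) && id.equals(node.id) && ++pos == n) return node;
        }
        return null;
    }

    public static void main(String[] args) {
        List<Node> siblings = Arrays.asList(
                new Node("div", "other"),
                new Node("div", "rr_soc_top"));

        // nth-of-type picks the first div, which then fails the id test.
        System.out.println(nthOfType(siblings, "div", 1).id);                   // prints other
        // Filtering first picks the first div that has the wanted id.
        System.out.println(nthOfFiltered(siblings, "div", "rr_soc_top", 1).id); // prints rr_soc_top
    }
}
```

Note that in standard XPath the first match across the whole document is written with parentheses, as in (//div[@id='rr_soc_top'])[1].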

NPE in only attribute selector

An XPath like '@href' generates no evaluator and causes a NullPointerException:

java.lang.NullPointerException
    at org.jsoup.select.Collector$Accumulator.head(Collector.java:42)
    at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:30)
    at org.jsoup.select.Collector.collect(Collector.java:24)
    at us.codecraft.xsoup.DefaultXPathEvaluator.evaluate(DefaultXPathEvaluator.java:27)
    at us.codecraft.webmagic.selector.XpathSelector.selectList(XpathSelector.java:31)
    at us.codecraft.webmagic.selector.HtmlNode.selectElements(HtmlNode.java:80)
    at us.codecraft.webmagic.selector.HtmlNode.xpath(HtmlNode.java:43)

Improvement: Xsoup should return xpath attribute matches using getElements() method

At this time, I have to match the query against a regex to decide which method should be used (whether it is an attribute XPath query or not). I think the getElements() method should also return matches for attribute XPath queries; this would be a fine improvement.

if (query.matches("(.*/@.*)")) {
            String result = Xsoup.compile(query).evaluate(document).get();
            matches.add(result);
} else {
            Elements results = Xsoup.compile(query).evaluate(document).getElements();
             // more code follows, which adds the results to the matches list...
}

Best regards

Question: Xpath expression fails with Could not parse query

Hi, I am trying to parse the Stay Safe section of http://wikitravel.org/en/San_Francisco and my XPath is //h2[span[text()='Stay safe']]/following-sibling::p//text()

When I run the XPath in the Chrome dev console, it evaluates properly and returns text nodes. However, it fails in Xsoup at XPathParser.byFunction() and throws:
Could not parse query 'h2[span[text()='Stay safe']]': unexpected token at 'span[text()='Stay safe']'

Do you have any suggestions? TIA.

Xpath @class= does not work as in chrome.

If I run the following code:

String html = "<!DOCTYPE html>" +
    "<html>" +
    "  <head>" +
    "    <title>test</title>" +
    "  </head>" +
    "  <body>" +
    "  <div class=\"g\">" +
    "  </div>" +
    "  <div class=\"g x y t\">" +
    "  </div>" +
    "  </body>" +
    "</html>";
Document document = Jsoup.parse(html);

String xpath = "//div[@class='g']";

XElements elements = Xsoup.compile(xpath).evaluate(document);
System.out.println(elements.getElements().size());

for (Element element : elements.getElements())
{
  System.out.println(element.toString());
}

I get two elements as output:

<div class="g">

and

<div class="g x y t">

In Chrome, I get only one, the one with the exact match. Who is wrong, xsoup or Chrome?
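
XPath 1.0 defines @class='g' as a comparison against the whole attribute string, so Chrome's single result follows the standard. The two results reported here are what a CSS-style class selector, which matches whitespace-separated tokens, would return. The sketch below contrasts the two comparisons on plain strings; the class and method names are hypothetical and this is not Xsoup code.

```java
import java.util.Arrays;

class ClassMatchSketch {

    // XPath 1.0 semantics for @class='value': the whole attribute string must be equal.
    static boolean exactMatch(String classAttribute, String value) {
        return classAttribute.equals(value);
    }

    // CSS class selector semantics: the value must be one of the whitespace-separated tokens.
    static boolean tokenMatch(String classAttribute, String value) {
        return Arrays.asList(classAttribute.trim().split("\\s+")).contains(value);
    }

    public static void main(String[] args) {
        System.out.println(exactMatch("g", "g"));       // prints true
        System.out.println(exactMatch("g x y t", "g")); // prints false
        System.out.println(tokenMatch("g x y t", "g")); // prints true
    }
}
```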

Logic Operation Support

Support for the logic operations and, or, and () in attribute selectors.

    @Test
    public void testLogicOperation() {

        Document document = Jsoup.parse(html);

        String result = Xsoup.select(document, "//*[@id=te or @id=test]/text()").get();
        assertEquals("aaa", result);

        result = Xsoup.select(document, "//*[@id=te and @id=test]/text()").get();
        assertNull(result);

        result = Xsoup.select(document, "//*[(@id=te or @id=test) and @id=test]/text()").get();
        assertEquals("aaa", result);

        result = Xsoup.select(document, "//*[@id=te or (@id=test and @id=test)]/text()").get();
        assertEquals("aaa", result);
    }

XPath '|'(or) support

Support for multiple XPath expressions with | as the separator.

e.g.

 //book/title | //book/price

class="class-name-with-a-space " cannot be found using @class="class-name-with-a-space "

Document doc = Jsoup.parse("<span><div class=\"class-name-with-a-space \" >This is a test element</div></span>");
Elements elems = Xsoup.compile("//div[@class=\"class-name-with-a-space \"]").evaluate(doc).getElements();
System.out.println(elems.size()); // Output is 0: no elements are extracted.

When the XPath is evaluated, the class name seems to be trimmed, because the following code does return an element.

Document doc = Jsoup.parse("<span><div class=\"class-name-with-out-space\" >This is a test element</div></span>");
Elements elems = Xsoup.compile("//div[@class=\"class-name-with-out-space \"]").evaluate(doc).getElements();
System.out.println(elems.size()); // Output is 1

Even though the XPath contains a trailing space, the space is ignored and an element is returned.

Next release?

Hi @sutra

any chance to get a new release with updated jsoup and removed assertj dep? It would help to drop some exclusions in other projects.

Thanks in advance
Richard

Problem using the XPath contains function

Example: https://club.autohome.com.cn/bbs/thread/86f2870bac840396/72293736-1.html
Entering the following in the console of the browser's Inspect Element tool:
$x("//div[@class='conleft fl']/ul[@class='leftlist']/li[contains(text(),'帖子')]/a[1]/text()")
returns (7) [text, text, text, text, text, text, text]
But using the Xsoup integrated in WebMagic:
page.getHtml().xpath("//div[@class='conleft fl']/ul[@class='leftlist']/li[contains(text(),'帖子')]/a[1]/text()").all();
returns an empty collection. Any help would be appreciated.

Why doesn't last() work?

String lastHref = Xsoup.compile("//li[last()]/p/span/a/@href").evaluate(Jsoup.parse(htmlStr)).get();
It reports the error: Could not parse query '[last()]': unexpected token at 'last()'

Support for XPath starts-with

String source = "\n" +
" AnnotationsBasedJMXAutoExporter\n" +
" org.springframework.jmx.export.MBeanExporter\n" +
" false\n" +
" assembler\n" +
" \n" +
"";
XpathSelector selector = new XpathSelector("//id[starts-with(text(),'Annotations')]");
selector.selectList(source);

The code above throws a NullPointerException, but it works fine with xpathStr="//id[starts-with(@id,'Annotations')]"

Seems not used correctly? Could you help me check?

/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/bin/java "-javaagent:/Applications/IntelliJ IDEA CE.app/Contents/lib/idea_rt.jar=59972:/Applications/IntelliJ IDEA CE.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/lib/tools.jar:/Users/linyuechu/IdeaProjects/webCrawler/target/classes:/Users/linyuechu/Downloads/jsoup-1.12.1.jar:/Users/linyuechu/Downloads/xsoup-master/target/xsoup-0.3.2-SNAPSHOT.jar MyExtractor
Exception in thread "main" java.lang.NoClassDefFoundError: org/jsoup/helper/StringUtil
at us.codecraft.xsoup.XTokenQueue.matchesWhitespace(XTokenQueue.java:159)
at us.codecraft.xsoup.XTokenQueue.consumeWhitespace(XTokenQueue.java:398)
at us.codecraft.xsoup.xevaluator.XPathParser.consumeSubQuery(XPathParser.java:133)
at us.codecraft.xsoup.xevaluator.XPathParser.combinator(XPathParser.java:109)
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:74)
at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:408)
at us.codecraft.xsoup.Xsoup.compile(Xsoup.java:25)
at MyExtractor.main(MyExtractor.java:38)
Caused by: java.lang.ClassNotFoundException: org.jsoup.helper.StringUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 8 more
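
The trace shows that the class org.jsoup.helper.StringUtil, which the Xsoup snapshot references, could not be loaded from the jsoup jar on the classpath. That points to a mismatch between the jsoup version Xsoup was built against and the one in use here, although this is an inference from the trace alone. A quick way to check whether a class is available at runtime is sketched below; the helper class and method names are hypothetical.

```java
class ClasspathCheckSketch {

    // Returns true if the named class can be located by this class's loader (without initializing it).
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class that the stack trace reports as missing; the result depends on the jars on the classpath.
        System.out.println(isClassPresent("org.jsoup.helper.StringUtil"));
    }
}
```

Running this with the same -classpath as the failing command shows whether the jsoup jar in use still provides that class.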
