htmlunit-neko's Introduction

HtmlUnit

Version 4.3.0 / June 27, 2024

❤️ Sponsor

Homepage

htmlunit.org

HtmlUnit@mastodon | HtmlUnit@Twitter

HtmlUnit Kanban Board

Check out HtmlUnit satellite projects, such as:

Note as well that you can use HtmlUnit with Selenium via their htmlunit-driver!

Sponsoring

Constantly updating and maintaining the HtmlUnit code base already takes a lot of time.

I would like to make two major extensions in the next few months.

To do this, I need your sponsorship.

Get it!

Maven

Add to your pom.xml:

<dependency>
    <groupId>org.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>4.3.0</version>
</dependency>

Gradle

Add to your build.gradle:

implementation group: 'org.htmlunit', name: 'htmlunit', version: '4.3.0'

Vulnerabilities

List of Vulnerabilities

Security Policy

Overview

HtmlUnit is a "GUI-less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating Chrome, Firefox or Internet Explorer depending on the configuration used.

HtmlUnit is typically used for testing purposes or to retrieve information from web sites.

Features

  • Support for the HTTP and HTTPS protocols
  • Support for cookies
  • Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
  • Support for submit methods POST and GET (as well as HEAD, DELETE, ...)
  • Ability to customize the request headers being sent to the server
  • Support for HTML responses
    • Wrapper for HTML pages that provides easy access to all information contained inside them
    • Support for submitting forms
    • Support for clicking links
    • Support for walking the DOM model of the HTML document
  • Proxy server support
  • Support for basic and NTLM authentication
  • Excellent JavaScript support

Getting Started

You can start here:

Contributing

Pull requests and all other community contributions are essential for open source software. Every contribution, from bug reports to feature requests, from typo fixes to full new features, is greatly appreciated.

Last CI build

The latest builds are available from our Jenkins CI build server

Read on if you want to try the latest bleeding-edge snapshot.

Maven

Add the snapshot repository and dependency to your pom.xml:

    <!-- ... -->
    <repository>
      <id>OSS Sonatype snapshots</id>
      <url>https://s01.oss.sonatype.org/content/repositories/snapshots/</url>
      <snapshots>
        <enabled>true</enabled>
        <updatePolicy>always</updatePolicy>
      </snapshots>
      <releases>
        <enabled>false</enabled>
      </releases>
    </repository>

    <!-- ... -->
    <dependencies>
      <dependency>
          <groupId>org.htmlunit</groupId>
          <artifactId>htmlunit</artifactId>
          <version>4.4.0-SNAPSHOT</version>
      </dependency>
      <!-- ... -->
    </dependencies>

    <!-- ... -->

Gradle

Add the snapshot repository and dependency to your build.gradle:

repositories {
  maven { url "https://s01.oss.sonatype.org/content/repositories/snapshots" }
  // ...
}
// ...
dependencies {
    implementation group: 'org.htmlunit', name: 'htmlunit', version: '4.4.0-SNAPSHOT'
  // ...
}

License

This project is licensed under the Apache 2.0 License

Development

useful mvn command lines

set up or refresh the eclipse project

mvn eclipse:eclipse -DdownloadSources=true

run the whole core test suite (no huge tests, no library tests)

mvn test -U -P without-library-and-huge-tests -Dgpg.skip -Djava.awt.headless=true

check dependencies for known security problems

mvn dependency-check:check

Contributing

I welcome contributions, especially in the form of pull requests. Please try to keep your pull requests small (don't bundle unrelated changes) and try to include test cases.

Some insights

HtmlUnit at openhub

htmlunit-neko's People

Contributors

atnak, dependabot[bot], duonglaiquang, flavorjones, markusheiden, rbri, rschwietzke

htmlunit-neko's Issues

2.68.0 changes DefaultFilter API?

When I upgrade to 2.68.0, my project, AntiSamy (https://github.com/nahsra/antisamy), gets a build error that I didn't get with 2.67.0 and prior. I notice a major change was made in 2.68.0 per the README: "As of version 2.68.0, neko-htmlunit also uses its own fork of Xerces (https://github.com/apache/xerces2-j). This made it possible to remove many unneeded parts and dependencies to ensure e.g. compatibility with Android." <-- The end of this sentence is a bit awkward, by the way.

When I upgrade to 2.68.0, I get the following build error:

antisamy_main/src/main/java/org/owasp/validator/html/scan/MagicSAXFilter.java:[54,8] org.owasp.validator.html.scan.MagicSAXFilter is not abstract and does not override abstract method getDocumentSource() in org.apache.xerces.xni.XMLDocumentHandler

The code in question is here: https://github.com/nahsra/antisamy/blob/main/src/main/java/org/owasp/validator/html/scan/MagicSAXFilter.java

The class declaration is: public class MagicSAXFilter extends net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter implements org.apache.xerces.xni.parser.XMLDocumentFilter {

I'm guessing that the net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter API changed somehow? Or did the org.apache.xerces.xni.parser.XMLDocumentFilter API change?

Any suggestions on how to deal with this? Do I need to change another import in my pom? Which is here: https://github.com/nahsra/antisamy/blob/main/pom.xml

I don't see anything in the README that says that when you upgrade to 2.68.0 you also need to upgrade to version X of something, or change your import of Y to Z. Is this an undocumented API change, or am I doing something wrong?

Thanks, Dave

Nested template tags not supported

Nested <template> tags are not supported yet. Recently I had a case similar to the following:

<html>                   
    <head>                   
    </head>                   
    <body>           
        <div>                                                                          
            <template id="outer">                                                                                           
                <div>                                                                                                       
                    <template id="nested-1">                                                                                                                   
                    </template>                                                                                       
                </div>                                                                                       
                <template id="nested-2">                                                                                                        
                </template>                                                                           
            </template>        
        </div>              
    </body>
</html>

After loading this snippet with HtmlUnit's WebClient.getPage() and printing the page as XML, we get:

<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
  <head>
  </head>
  <body>
    <div>
      <template id="outer">
        <div>
        </div>
      </template>
      <template id="nested-1">
      </template>
    </div>
    <template id="nested-2">
    </template>
  </body>
</html>

Looks like the parser auto-closes the outer template (and also its div) when the first nested template is detected, effectively unnesting the inner templates.

Meet a Veracode issue.

https://cwe.mitre.org/data/definitions/597.html

Hi Rbri,
after we updated the ESAPI version in our Maven dependencies, a Veracode scan reported this issue.
Could the != comparison be replaced with a null check and String.equals()?
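The difference between the two comparisons that CWE-597 is about can be shown with a minimal sketch. The class and method names below are hypothetical and are not taken from htmlunit-neko; whether the flagged comparison is intentional (for example on interned strings) is not stated in the report.

```java
import java.util.Objects;

/** Hypothetical helper, not part of htmlunit-neko: reference vs. value comparison. */
final class StringCompare {

    /** Reference comparison: true only when both variables point at the same object. */
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    /** Value comparison with a built-in null check, as suggested in the report above. */
    static boolean sameValue(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String literal = "body";
        String built = new String("body"); // distinct object, same characters

        System.out.println(sameReference(literal, built));
        System.out.println(sameValue(literal, built));
        System.out.println(sameValue(null, null));
    }
}
```

Objects.equals is used here because it covers the null check and the equals() call in one step.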

Duplicate forks of Nekohtml resulting in some confusion

@rbri Can we use this codelibs fork of nekohtml instead of htmlunit-neko? codelibs also seems to be a working fork without any major vulnerabilities. Or is the codelibs.nekohtml project no longer maintained, so that we should switch to htmlunit-neko?

The problem is that htmlunit-neko seems to use its own version of the xerces2j jar instead of xercesImpl. The xerces2j used by htmlunit-neko does not seem to be actively maintained, and I could not find a Maven link to download the xerces2j jar either.

Please advise.

Support for missing method in xerces

@rbri In one of my company's projects we use the getNonNormalizedValue() method from Xerces. In the new htmlunit-neko implementation, this method is not present in XMLAttributesImpl.java.

Can you help me understand whether there is an alternative way to get the behavior of getNonNormalizedValue()?

Self closing elements in SVG break markup

A self-closing 'title' element in SVG breaks all of the markup that follows, because it swallows and encodes everything after it.

Example:

<svg viewBox="0 0 32 32" xmlns="http://www.w3.org/2000/svg">
    <defs><style>.cls-1{fill:none;}</style></defs>
    <title/>
    <g data-name="Layer 2" id="Layer_2"><path d="M19,26a1,1,0,0,1-.71-.29,1,1,0,0,1,0-1.42L26.59,16l-8.3-8.29a1,1,0,0,1,1.42-1.42l9,9a1,1,0,0,1,0,1.42l-9,9A1,1,0,0,1,19,26Z"/>
    <path d="M28,17H4a1,1,0,0,1,0-2H28a1,1,0,0,1,0,2Z"/></g><g id="frame"><rect class="cls-1" height="32" width="32"/>
    </g>
</svg>
<span>hellloooooo</span>

Should misplaced elements inside a table be parsed as appearing before the table?

Hello,

I lost an afternoon to this one while writing some code to scrape a legacy system and migrate the data... Anyway, from a quick check Firefox and Chrome seem to parse the following:

<table>
<td>Callout Order</td></tr>
<h2>Motion Control:</h2>
<td>1st</td></tr>

as

<h2>Motion Control:</h2>
<table>
<tr><td>Callout Order</td></tr>
<tr><td>1st</td></tr>
</table>

i.e. they move the h2 in front of the table.

I added a quick unit test and htmlunit-neko (latest ab9f8f5) seems to leave the h2 inside the table.

Thanks for all the effort. I'm a very long-time user of htmlunit/neko.

Cheers

Sam

Review Neko to increase performance and reduce memory usage

This is just a bookmark for my ongoing task of tuning without larger rewrites. So far, we got to this:

Wikipedia DE Homepage, DOM Parser, JDK 17, JDK 8 target

Old, v3.8.0

Benchmark                                               Mode  Cnt      Score     Error   Units
HtmlParser_v380_Benchmark.domParser                     avgt    3  1,596,511 ± 136,413   ns/op
HtmlParser_v380_Benchmark.domParser:gc.alloc.rate.norm  avgt    3  1,091,867 ±   1,868    B/op

New, JDK 8 target

Benchmark                                               Mode  Cnt      Score     Error   Units
HtmlParser_v380_Benchmark.domParser                     avgt    3  1,178,706 ± 130,720   ns/op
HtmlParser_v380_Benchmark.domParser:gc.alloc.rate.norm  avgt    3    870,992 ±       0    B/op

New, JDK 11 target

Benchmark                                               Mode  Cnt      Score     Error   Units
HtmlParser_v380_Benchmark.domParser                     avgt    3  1,165,087 ± 112,361   ns/op
HtmlParser_v380_Benchmark.domParser:gc.alloc.rate.norm  avgt    3    870,656 ±       0    B/op

Summary: about 25% faster, and about 20% less memory is needed. A JDK 11 compile is another 1-2% faster than a JDK 8 one, due to improvements in the JDK 11 code generation (no accessor methods for inner classes anymore).
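The percentages in the summary can be recomputed from the scores in the tables above. The following is only a small arithmetic sketch; the class and method names are made up for illustration, and the error margins reported by the benchmark are ignored.

```java
/** Recomputes the relative improvements from the benchmark tables above. */
final class BenchmarkSummary {

    /** Relative reduction from {@code before} to {@code after}, in percent. */
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the tables above (ns/op and B/op).
        double time = reductionPercent(1_596_511, 1_178_706);
        double memory = reductionPercent(1_091_867, 870_992);
        double jdk11VsJdk8 = reductionPercent(1_178_706, 1_165_087);

        System.out.printf("time: %.1f%%, memory: %.1f%%, JDK 11 vs JDK 8: %.1f%%%n",
                time, memory, jdk11VsJdk8);
    }
}
```

With the numbers above, the time per operation drops by roughly 26%, the allocation per operation by roughly 20%, and the JDK 11 target is roughly 1% faster than the JDK 8 target.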

How do we create DOMParser object in htmlunit-neko

We are upgrading from nekohtml to htmlunit-neko to mitigate CVE-2022-28366.
I could not find any class that implements net.sourceforge.htmlunit.xerces.dom.DocumentImpl. Could you please suggest an example of how to create a net.sourceforge.htmlunit.cyberneko.parsers.DOMParser(<? extends DocumentImpl>) object?

Here is the code we used to parse HTML to XML with nekohtml-1.9.22:

public Document openHTMLDoc(Reader in) throws IOException, SAXException {
    org.cyberneko.html.parsers.DOMParser ps = new org.cyberneko.html.parsers.DOMParser();
    ps.setFeature("http://xml.org/sax/features/namespaces", false);
    ps.parse(new InputSource(in));
    return ps.getDocument();
}

Please advise how we can achieve the same with htmlunit-neko 2.x.

CVE-2023-26119 at neko-htmlunit

[ERROR] Failed to execute goal org.owasp:dependency-check-maven:8.2.1:check (default-cli) on project ins-app: 
[ERROR] 
[ERROR] One or more dependencies were identified with vulnerabilities that have a CVSS score greater than or equal to '8.0': 
[ERROR] 
[ERROR] neko-htmlunit-2.66.0.jar: CVE-2023-26119(9.8)

See https://nvd.nist.gov/vuln/detail/CVE-2023-26119.

Frameset not added to DOM in some malformed HTML

Note: This issue is a migration (for our convenience) of this issue on sourceforge which can now be closed.

Problem in brief

<frameset> is lost and not added to the DOM in some malformed HTML when it should be.

Examples

These examples demonstrate the issue with input HTML, the corresponding expected DOM, and what HtmlUnit produces.

Example 1

Input HTML:

<html>
<div></div>
<frameset>
  <frame src="about:blank"/>
  <frame src="about:blank"/>
</frameset>
</form>
</html>

Expected DOM:

<html>
<head></head>
<frameset>
  <frame src="about:blank"/>
  <frame src="about:blank"/>
</frameset>
</html>

HtmlUnit's DOM:

<html>
<head></head>
<body>
  <div></div>
</body>
</html>

Example 2

Input HTML:

<html>
<div>
  <frameset>
    <frame src="about:blank"/>
    <frame src="about:blank"/>
  </frameset>
</div>
</html>

Expected DOM: same as in example 1.

HtmlUnit's DOM: same as in example 1.

Example 3

Input HTML:

<html>
<form>
  <frameset>
    <frame src="about:blank"/>
    <frame src="about:blank"/>
  </frameset>
</form>
</html>

Expected DOM: same as in example 1.

HtmlUnit's DOM:

<html>
<head></head>
<body>
  <form></form>
</body>
</html>

Note: These already exist as test cases in org.htmlunit.html.parser.MalformedHtmlTest as siblingWithoutContentBeforeFrameset(), framesetInsideForm(), as well as others not covered above.

Remarks

Where can I get the source code for old versions below 2.32?

I need to cross-check our project jars against the new ones, and for that I need old versions of this project: 2.27, 2.26 and 2.25.
Where can I get the source code for these versions? I can't find them in this repository, since it only starts at 2.32.
Can you point me in the right direction or provide some resources?

Support for porting from Cyberneko-NekoHTML to HtmlUnit-Neko

Hi, our project has been using Antisamy-1.6.4 and nekohtml-1.9.22 for a very long time. We have decided to move to a newer version of AntiSamy, but AntiSamy has been ported from nekohtml to htmlunit-neko. In our project we also use nekohtml separately for additional XSS filtering, via "org.cyberneko.html.filters.ElementRemover" and "org.cyberneko.html.filters.Writer". HtmlUnit-Neko seems to have removed these classes. We also found that HTMLEntities.java has been removed and that entities are now handled by HTMLNamesEntitiesParser.java.

May we know why those classes were removed, and how to work around this while porting?

scanEntityRef should not rewind past the beginning of the fCurrentEntity buffer

I am seeing the exception:

java.lang.ArrayIndexOutOfBoundsException: -1
	at net.sourceforge.htmlunit.cyberneko.HTMLScanner$CurrentEntity.read(HTMLScanner.java:1901)
	at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:3075)
	at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2900)
	at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2747)
	at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2127)
	at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937)
	at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443)
	at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394)
	at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758)
	at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:236)
	at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parseHtml(HtmlUnitNekoHtmlParser.java:179)
	at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:280)
	at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:163)

When trying to parse the following file: https://gist.github.com/jzheaux/18f32257c66a02f95c6f0f9243a913ae

Or, I've got a test here to reproduce:

@Test
public void test() throws Exception {
	HTMLConfiguration htmlConfiguration = new HTMLConfiguration();
	String content = "<html blah=\"" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfunfun" +
		"funfunfun&fin\"></html>";
	InputStream byteStream = new ByteArrayInputStream(content.getBytes());
	XMLInputSource inputSource = new XMLInputSource("", "", "", byteStream, "UTF-8");
	htmlConfiguration.parse(inputSource);
}

I believe the problem is with this commit, which tries to rewind after it has read ahead to look for an entity.

The rewind here, if performed soon after the fCurrentEntity refreshes its buffer, could rewind past the beginning, setting the offset to a negative value:

if (match == null) {
    // we can't rewind if at EOF
    if (nextChar != -1) {
        final String consumed = str.toString();
        fCurrentEntity.rewind(consumed.length() - 1); // <-- here
        str.clear();
        str.append('&');
    }
}

This logic here:

private void rewind(int i) {
    offset -= i;
    characterOffset_ -= i;
    columnNumber_ -= i;
}

may be problematic since it can set offset to what appears to be an invalid value.
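One way to avoid a negative offset is to check the bound before rewinding. The sketch below is a hypothetical stand-in for the scanner's bookkeeping (the class RewindGuard and its methods are not HtmlUnit code), and it is not necessarily how the maintainers fixed the issue.

```java
/** Hypothetical stand-in for the scanner's buffer bookkeeping; not HtmlUnit code. */
final class RewindGuard {
    private int offset;

    RewindGuard(int offset) {
        this.offset = offset;
    }

    /**
     * Rewinds by {@code count} characters only if that stays inside the buffer.
     *
     * @return true if the rewind was applied, false if it would move the offset below zero
     */
    boolean tryRewind(int count) {
        if (count < 0 || count > offset) {
            // The caller has to handle this case, e.g. by keeping the consumed text itself.
            return false;
        }
        offset -= count;
        return true;
    }

    int offset() {
        return offset;
    }
}
```

Returning a flag instead of silently clamping keeps the failure visible: after a buffer refresh, the characters that were read ahead are no longer in the buffer, so moving the offset to zero would not restore them.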

HTMLTagBalancer does strange things with <table> inside <template>

Problem in brief

When a <table> is inside a <template>, the resulting DOM tree is created with the table's children in a strange location.

Reproducing

Here is an example HTML document to reproduce the issue:

<!DOCTYPE html>
<html>
<head></head>
<body>
    
<template>
  <table>
    <tr>
      <ul></ul>
      <td></td>
    </tr>
  </table>
</template>
    
</body>
</html> 
  • Note: The <ul></ul> isn't required for the test. It's there just to demonstrate a case where HTMLTagBalancer correctly moves it outside of <table>.

Chrome creates this DOM tree:

<!DOCTYPE html>
<html>
<head></head>
<body>

<template>
  <ul></ul>
  <table>
    <tbody>
      <tr>
        <td></td>
      </tr>
    </tbody>
  </table>
</template>

</body>
</html> 

HtmlUnit 2.66 creates this:

<!DOCTYPE html>
<html>
<head></head>
<body>
    
<template>
  <tr>
    <td></td>
  </tr>
  <ul></ul>
  <table></table>
</template>
    
</body>
</html>
  • <tr></tr> is moved to a strange location
  • <tbody> was not created

More details

This issue cropped up as a regression when we updated HtmlUnit and is likely caused by this code:

else if (fTemplateFragment) {
    // nothing, don't force/check parent for the direct template children
}

The comment says "don't force/check parent for the direct template children", but the code does not distinguish direct from indirect children, so perhaps that is the cause. (I don't know what the test case for this new code is, so this is just a guess.)

CVE-2017-10355: Xerces Security Vulnerability

EXPLANATION
Apache Xerces-J is vulnerable to a Denial of Service (DoS) attack. The setupCurrentEntity() method in the XMLEntityManager class lacks a connection timeout mechanism. A remote attacker can exploit this vulnerability by supplying an XML document containing a URL to their malicious FTP server. This URL is then retrieved and stored in the expandedSystemId object, and used to instantiate a URLConnection. Once the server begins fetching the resource, the attacker's server would then exit abruptly, leaving the connection in a CLOSE_WAIT status. The attacker would need to issue one request per thread, eventually leading to a DoS as the application repeatedly attempts to fetch the FTP resource.

NOTE: This vulnerability was assigned CVE-2017-10355.

Not sure if switching to this version fixes this issue:

<dependency>
  <groupId>org.codelibs.xerces</groupId>
  <artifactId>xercesImpl</artifactId>
  <version>2.12.1-sp1</version>
</dependency>

<noscript> parsing problem

Hi,

I have noticed a parsing issue in htmlunit-neko with <noscript> tag.
Consider the following inputs:
<div><noscript><!-- </noscript> --></noscript>
<div><noscript><img src="</noscript>"/></noscript>

These are parsed as is, but in reality the noscript tag closes itself when it encounters the string </noscript> or </noscript . The odd part is that, unlike other special tags such as <title>, <noframes>, <style>, etc., the content of a <noscript> element is parsed as HTML by all modern browsers.

I identified this when I discovered a security bypass in a third-party whitelist HTML filter that is based on NekoHtml. NekoHtml isn't maintained, and when I checked HtmlUnit's fork of Neko I saw that many of NekoHtml's issues had been fixed (HTML comment parsing, etc.), so it would be great to fix this as well, so that users of NekoHtml can move to this fork.

I'm aware of the PARSE_NOSCRIPT_CONTENT flag, but when it is set to true, the element content is treated as text and not parsed. Browsers treat it as HTML, which could cause security issues.

Possible fix:
Parse the content of noscript as HTML, but close the noscript tag when the parser encounters the string </noscript> or </noscript .
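The end-tag rule from the possible fix can be sketched as a small search function. This is only an illustration of the rule as stated in this report; the class NoscriptEnd is hypothetical, is not part of htmlunit-neko, and does not do any HTML parsing of the content itself.

```java
/** Hypothetical helper illustrating the proposed end-tag rule; not htmlunit-neko code. */
final class NoscriptEnd {
    private static final String END_TAG = "</noscript";

    /**
     * Returns the index at which the noscript content ends, or -1 if no end tag is found.
     * An end tag is "</noscript" followed by '>' or a space, matched case-insensitively.
     */
    static int indexOfEndTag(String content) {
        for (int i = 0; i + END_TAG.length() < content.length(); i++) {
            if (content.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                char next = content.charAt(i + END_TAG.length());
                if (next == '>' || next == ' ') {
                    return i;
                }
            }
        }
        return -1;
    }
}
```

Applied to the text following the opening tag in the two inputs above, the function reports the end tag inside the comment and inside the attribute value, which is where the report says browsers close the element.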

Cheers,
Vivek Krishna

An error occurred when using xpath to get nodelist from document

When I used org.apache.xpath.XPathAPI to extract some nodes by XPath, an error occurred.
Here is the stack trace:

org.w3c.dom.DOMException: HIERARCHY_REQUEST_ERR: An attempt was made to insert a node where it is not permitted.
        at org.htmlunit.cyberneko.xerces.dom.ParentNode.internalInsertBefore(ParentNode.java:332)
        at org.htmlunit.cyberneko.xerces.dom.ParentNode.insertBefore(ParentNode.java:267)
        at org.htmlunit.cyberneko.xerces.dom.NodeImpl.appendChild(NodeImpl.java:172)
        at org.htmlunit.cyberneko.html.dom.HTMLDocumentImpl.getDocumentElement(HTMLDocumentImpl.java:275)
        at org.apache.xpath.XPathAPI.eval(XPathAPI.java:233)
        at org.apache.xpath.XPathAPI.selectNodeList(XPathAPI.java:167)
        at org.apache.xpath.XPathAPI.selectNodeList(XPathAPI.java:147)

I found two things that can cause this error:

  1. the project does not import serializer.jar
  2. adding a node to the document

I'm sure I have imported serializer.jar into my project, so I guess the error happens when a node is appended.

Lowercase tags and attributes

CyberNeko has a feature where it can convert HTML tags and attributes to lower case using a configuration setting.

parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

This feature was still available in version 2.25, but the current version, 2.36.0, throws an exception when you try to use it.

org.xml.sax.SAXNotRecognizedException: Property 'http://cyberneko.org/html/properties/names/elems' is not recognized.
	at org.apache.xerces.parsers.DOMParser.setProperty(Unknown Source)

Has it been renamed? Deleted? Is there an alternative way of doing this?
I can't find any documentation.

Porting guidance for v3.0.0?

Can you create a porting guide for this major new release? It's not just as simple as changing the imports.

For example, in my project:
I changed all the imports as needed, but:
https://github.com/nahsra/antisamy/blob/main/src/main/java/org/owasp/validator/html/scan/MagicSAXFilter.java#L194 and L196, use new AugmentationsImpl(), but that class doesn't exist anymore. I tried setting them to just null, but that failed horribly.

There is now a branch that includes these changes called: upgradeCyberNekoHTMLUnit

Can you either explain to me what I need to do to finish porting over to your new 3.0.0. API, or even better fork my branch, fix it for me, and send me a pull request? And then write a porting guide to help others, like me?

Hopefully this is an easy fix, but I don't see any good documentation that guides me on what I should be doing here.

Thanks, Dave

Maven central POM (for version 4.2.0) has wrong (4.3.0-SNAPSHOT) version

see https://search.maven.org/remotecontent?filepath=org/htmlunit/neko-htmlunit/4.2.0/neko-htmlunit-4.2.0.pom

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.htmlunit</groupId>
<artifactId>neko-htmlunit</artifactId>
<version>4.3.0-SNAPSHOT</version>
<name>HtmlUnit NekoHtml</name>
...

CloneNode is not working on org.w3c.dom.Node

@rbri I am using the htmlunit-neko parser to read an HTML document. I was able to fetch the nodes successfully, but when I try to clone an org.w3c.dom.Node I get an "unimplemented" exception:

java.lang.UnsupportedOperationException: unimplemented
at org.htmlunit.cyberneko.util.SimpleArrayList.iterator(SimpleArrayList.java:234)
at org.htmlunit.cyberneko.xerces.dom.AttributeMap.cloneContent(AttributeMap.java:388)
at org.htmlunit.cyberneko.xerces.dom.AttributeMap.cloneMap(AttributeMap.java:367)
at org.htmlunit.cyberneko.xerces.dom.ElementImpl.cloneNode(ElementImpl.java:146)

How can I resolve this issue?

Remove reference to old xalan/xerces

Xalan/Xerces implement an old version of JAXP that is incompatible with newer JAXP implementations. Anyone using the old implementation might end up with:
java.lang.IllegalArgumentException: Not supported: http://javax.xml.XMLConstants/property/accessExternalDTD
	at org.apache.xalan.processor.TransformerFactoryImpl.setAttribute(TransformerFactoryImpl.java:571)
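The failing call is the standard JAXP attribute XMLConstants.ACCESS_EXTERNAL_DTD. Until the old implementation is removed from the classpath, calling code can guard the attribute as in the sketch below. The class SafeTransformerFactory is a hypothetical helper, and this is a workaround on the caller's side under that assumption, not a change to this project.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerFactory;

/** Hypothetical helper; shows one defensive way to apply the JAXP attribute. */
final class SafeTransformerFactory {

    /** Creates a factory and restricts external DTD access where the implementation supports it. */
    static TransformerFactory create() {
        TransformerFactory factory = TransformerFactory.newInstance();
        try {
            factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        } catch (IllegalArgumentException e) {
            // Older implementations, such as the Xalan version in the stack trace above,
            // reject the attribute; access then has to be restricted some other way.
            System.err.println("ACCESS_EXTERNAL_DTD not supported: " + e.getMessage());
        }
        return factory;
    }
}
```

Which implementation TransformerFactory.newInstance() returns depends on the classpath, so the catch branch is only taken when an old implementation is picked up.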
