jtidy's Introduction

JTidy

About

A revival of the JTidy project, updated to work with the new HTML5 tags, along with a new option to keep unknown tags instead of removing them:

Tidy tidy = new Tidy();
tidy.setDropProprietaryTags(false);

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which lets you use JTidy as a DOM parser for real-world HTML.

HOW TO

You can use JTidy as an HTML checker/pretty-printer or as a DOM parser.

First, download a JTidy distribution. Inside it you will find a jtidy.jar containing all classes; no other libraries are needed.

Now that you have JTidy you can use it in different ways.

JTidy Executable

Run java -jar jtidy.jar {options} to access the JTidy command line interface.

java -jar jtidy.jar -h prints brief help for the JTidy command line, with a few examples.

java -jar jtidy.jar -help-config prints all the available configuration options, and java -jar jtidy.jar -show-config prints the current (default) values.

Ant Task

Detailed instructions on how to use the JTidy Ant task can be found in the JTidyTask Javadoc.

JTidy API

To embed JTidy in your program, the best approach is to add a Maven dependency on the official release from Maven Central:

<dependency>
    <groupId>com.github.jtidy</groupId>
    <artifactId>jtidy</artifactId>
    <version>1.0.5</version>
</dependency>

If you require a Java-6-compatible version, you can use the back-ported artifact:

<dependency>
    <groupId>com.github.jtidy</groupId>
    <artifactId>jtidy-java6</artifactId>
    <version>1.0.4</version>
</dependency>

The entry point for accessing JTidy functionality is the org.w3c.tidy.Tidy class. This is a simple usage example:

Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true);    // set the desired config options using Tidy setters
// ...                  // (equivalent to the command-line options)

tidy.parse(inputStream, System.out); // run Tidy on an input stream, writing to an output stream

If you use parseDOM(java.io.InputStream in, java.io.OutputStream out) instead of parse(), you also obtain a DOM document that you can process and print later with pprint(org.w3c.dom.Document doc, java.io.OutputStream out). Note that the JTidy DOM implementation is not fully featured, and many DOM methods are not supported.

JTidy also provides a MessageListener interface you can implement to be notified about warnings and errors in your HTML code. For details on advanced uses, refer to the JTidy Javadoc.

History

JTidy was initially written by Andy Quick. Fabrizio Giustina maintained the project on SourceForge.net from 2004 to 2010. After that the SourceForge project fell into disrepair and went years without an update, and a few people forked it on GitHub. William L. Thomson Jr. created a fork of those forks, with a tag for his packaging needs, as a dependency for JMeter. Dell Green then noticed some issues and failing tests and undertook fixing both.

Since the code belonged to neither of them, William decided to create a JTidy organization and revive the project through community support, which you are welcome to join. Eventually this should become the official new home for JTidy.

Thanks to all past authors and developers. Those who could be found on GitHub have been invited to join this project, along with the owners of the repositories this one was forked from.

Contributing

You are welcome to contribute issues and pull requests. Please have a look at the coding conventions and test cases.

License

This project is licensed under the zlib/libpng license. More information is available in the LICENSE.txt file.

Future

The project is looking for new contributors and project maintainers.

Check out the v.Nu validator for a possible modern replacement.

jtidy's People

Contributors

dellgreen, dependabot[bot], desiderantes, dunamariuscosmin, guidotrueb, haumacher, jwickers, knight0x, michelerfbj5, nanndoj, spullara, steffenyount, wltjr

jtidy's Issues

Building produces slide*.html files in project directory

The project root directory is polluted with slide files when the project is built.
Investigate what these files are and decide whether they should be written to the build directory as build artifacts.

slide001.html
slide002.html
slide003.html
slide004.html
slide005.html
slide006.html
slide007.html
slide008.html
slide009.html
slide010.html
slide011.html
slide012.html
slide013.html
slide014.html
slide015.html
slide016.html
slide017.html
slide018.html
slide019.html
slide020.html
slide021.html
slide022.html
slide023.html
slide024.html
slide025.html
slide026.html

Parsing of Unicode code points >U+FFFF does not work

There is an incompatibility in the handling of Unicode characters inside the implementation of JTidy's lexer:

The method Lexer.addCharToLexer(int c) obviously expects 32-bit Unicode code points, which are internally converted to UTF-8 using EncodingUtils.encodeCharToUTF8Bytes(). However, all calls of Lexer.addCharToLexer use UTF-16 Unicode code units, for example from String.charAt() or Reader.read().

The result is that all Unicode characters >U+FFFF are replaced by two 0xFFFD characters during parsing, since their Unicode code units are recognized as invalid.
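
The mismatch described above can be illustrated without JTidy itself. The following is a minimal sketch that assumes nothing about JTidy's internals: the class CodePointReaderDemo and the method readCodePoints are hypothetical names. It only shows how the UTF-16 code units returned by Reader.read() could be paired into full code points before being passed to a method that expects them.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JTidy.
final class CodePointReaderDemo {

    /** Reads the reader to the end and returns code points, pairing surrogates. */
    static List<Integer> readCodePoints(Reader in) {
        List<Integer> out = new ArrayList<>();
        try {
            int c = in.read();
            while (c != -1) {
                int next = in.read();
                if (Character.isHighSurrogate((char) c) && next != -1
                        && Character.isLowSurrogate((char) next)) {
                    // Two UTF-16 code units form one code point above U+FFFF.
                    out.add(Character.toCodePoint((char) c, (char) next));
                    next = in.read(); // the pair consumed both units
                } else if (Character.isSurrogate((char) c)) {
                    out.add(0xFFFD); // unpaired surrogate: replacement character
                } else {
                    out.add(c);
                }
                c = next;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, written as a surrogate pair.
        String text = "a\uD83D\uDE00";
        System.out.println("code units:  " + text.length());
        for (int cp : readCodePoints(new StringReader(text))) {
            System.out.println("code point U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

For text that is already in a String, String.codePoints() or String.codePointAt() gives the same pairing without a hand-written loop.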

Escaping of "<" characters into "&lt;"

Dear all,
this is more a question than an issue. I have an application where users can enter HTML source code for formatting. Sometimes users enter a less-than character ("<") instead of the entity "&lt;", which leads to a damaged DOM and hidden content in some browsers. Is there a way to deal with such characters in JTidy?
Currently I'm using JTidy with the following settings:

Tidy tidy = new Tidy();
tidy.setPrintBodyOnly(true);
tidy.setRawOut(true);
tidy.setWraplen(0);
tidy.setOutputEncoding("UTF-8");
tidy.setQuiet(true);

Thank you and best regards,
Christof
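
One possible workaround, independent of any JTidy option, is to pre-process the input before it reaches the parser. The sketch below is a heuristic under stated assumptions, not a JTidy feature: the class StrayLessThanEscaper and its method are hypothetical names, and it only escapes a "<" that cannot start a tag, end tag, comment, doctype, or processing instruction. A stray "<" directly followed by a letter (as in "a <b") is not caught.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not part of JTidy.
final class StrayLessThanEscaper {

    // A "<" that is not followed by a letter, "/", "!" or "?" cannot open markup.
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    /** Replaces every stray "<" with the entity "&lt;" and leaves markup alone. */
    static String escapeStrayLessThan(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayLessThan("<p>1 < 2</p>"));
    }
}
```

The result could then be passed to tidy.parse(...) in place of the raw user input.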

Site documentation source code links to sourceforge

While it is important to acknowledge and link to the original SourceForge page in the documentation so that the user is informed, the links that point to the source code that produced the documentation should now reference the GitHub project, not SourceForge.

Change Clover code coverage to open source Jacoco

Clover is commercial, and we should support the use of open source alternatives. It appears that all build tools, IDEs, and CI tools have built-in support for JaCoCo.
At the time of writing, JaCoCo seems to attract more interest, and it is the one I am most familiar with.

Can't load the scm provider. No such provider: '[email protected]'

When running the 'site' goal, the following error is produced:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.3:site (default-site) on project jtidy: Error during page generation: Error rendering Maven report: Cannot run changelog command : Can't load the scm provider. No such provider: '[email protected]'. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Gitignore direnv config file

For Linux/Mac developers who use direnv to help cope with eclectic project environment demands, .gitignore should be updated to ignore .envrc files, as they are often machine- or user-dependent and should not be accidentally committed.

direnv

When there is excessive nesting, the output mentions an error but does not print what it is

Example:

line 10 column 21 - Warning: <style> lacks "type" attribute
line 1,204 column 1 - Warning: inserting missing 'title' element
test.html: Doctype given is "-//W3C//DTD HTML 4.01 Transitional//EN"
test.html: Document content looks like HTML 4.01 Transitional
2 warnings, 1 error were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
To learn more about JTidy see https://github.com/jtidy/jtidy
Please report bugs at https://github.com/jtidy/jtidy/issues
HTML & CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium

Instead of:

line 10 column 21 - Warning: <style> lacks "type" attribute
line 1,204 column 1 - Error: Document with excessive nesting
line 1,204 column 1 - Warning: inserting missing 'title' element
test.html: Doctype given is "-//W3C//DTD HTML 4.01 Transitional//EN"
test.html: Document content looks like HTML 4.01 Transitional
2 warnings, 1 error were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
To learn more about JTidy see https://github.com/jtidy/jtidy
Please report bugs at https://github.com/jtidy/jtidy/issues
HTML & CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium

Keep or drop ant build.xml, switch to gradle from maven

I am not sure it makes sense to keep two build systems, and I doubt any time will be spent on the Ant build.xml, so maybe it should be removed. I do not believe there is any benefit to maintaining two build systems. Most people will likely go with Maven for dependencies and so on.

Along the same lines, we could also switch from Maven to Gradle. That is pretty moot to me, but it is worth considering if we are making changes to the build system anyway. I have no preference either way, though for new projects I tend to go with Gradle over Maven because the configuration/build files tend to be smaller (build.gradle vs. pom.xml). It may also be more work than it is worth. I am just mentioning it for discussion, since we are revisiting the build system for the project.

Html5 docType not supported

<!DOCTYPE> for HTML5 is this:

<!DOCTYPE html>

But jtidy does not recognise it. It reports:

InputStream: Doctype given is ""
InputStream: Document content looks like HTML 4.01 Transitional

Update or remove distributionManagement section of pom.xml

I am pretty sure we need to replace or update the following:

    <distributionManagement>
        <repository>
            <id>jtidy</id>
            <url>scp://shell.sourceforge.net/home/groups/j/jt/jtidy/htdocs/maven2/</url>
        </repository>
        <snapshotRepository>
            <id>jtidy</id>
            <url>scp://shell.sourceforge.net/home/groups/j/jt/jtidy/htdocs/snapshots/</url>
        </snapshotRepository>
        <site>
            <id>jtidy</id>
            <url>scp://shell.sourceforge.net/home/groups/j/jt/jtidy/htdocs/</url>
        </site>
    </distributionManagement>

It is not really clear what these should be changed to, or whether they can be removed. They need to be updated or removed at some point.

Enable Failing tests to break the build

Currently, failing tests do not fail the build. Unfortunately, this has left the tests in an unusable state as changes have been added. To ensure that new changes do not become breaking changes, the build should be configured to fail if any test fails.

This depends on first fixing the failing tests, so that we have a reliable suite of tests to program against. See: Fix broken tests

Html5: Validation errors reported if empty tags have no corresponding closing tag

There's technically no reason to close meta, img, br and other empty tags. HTML5 does not require empty elements to be closed.

Yet jTidy does not accept that when it's configured with tidy.setXmlTags(true). And only then it reports on unclosed tags - so it seems like it's all or nothing for now ;)

Also:

Given say , it will output a warning lacks "content" attribute... looks like another html4 vs 5 issue...


I went through the history of commits and I realise html5 support is WIP and that html5 specific tags were added but that's basically it for now. I'm reporting the issue to highlight it although you probably know about it very well... So just to let you know I really appreciate the effort! Hopefully this project will be maintained and won't end up like the previous one. That would be a shame...
So fingers crossed!

SVG reformatting attempt errors out

When using an SVG (created with Inkscape) as input, I get this:

line 2 column 1 - Warning: missing <!DOCTYPE> declaration
line 2 column 1 - Warning: discarding unexpected <svg>
line 19 column 28 - Error: <metadata> is not recognized!
line 19 column 28 - Warning: discarding unexpected <metadata>
...

I ran this:

java -jar jtidy-1.0.2.jar -xml logo.svg

Re-write history to associate past committers with their github accounts

The following committers have accounts on github, but their commits are not linked. Ideally they add the email addresses used for those commits to their github accounts.

@aditsu @fgiust and @steffenyount

Short of them adding the email addresses to their accounts, the other option is to rewrite history so that we can update the author and match the commits to their accounts that way. It is pretty moot, but it would be ideal to match the commits to the authors who have accounts on GitHub. If they see this and add the emails themselves, there is nothing to be done here. If they do not act, someone may need to rewrite history and force push, which will affect everyone else, as they will need to force pull. It would be a one-time thing that can hopefully be avoided, but it may be necessary someday.

jTidy fails with NullPointerException in case of white space before <br>

Hi,

I'm using jTidy version 1.0.2, and here is my configuration setup:

    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setForceOutput(true);
    tidy.setWrapSection(false);
    tidy.setWraplen(0);
    tidy.setShowErrors(0);
    tidy.setShowWarnings(false);
    tidy.setEscapeCdata(true);
    tidy.setFixComments(true);
    tidy.setIndentCdata(true);
    tidy.setDropProprietaryTags(true);

A few of my HTML documents fail when I try to parse them with jTidy. After debugging, I found that a space before <br> is causing the issue. Is there any way I can handle this whitespace? Please advise.

Here is the error from the console:
Exception in thread "main" java.lang.NullPointerException
at org.w3c.tidy.PPrint.afterSpace(PPrint.java:1358)
at org.w3c.tidy.PPrint.afterSpace(PPrint.java:1366)
at org.w3c.tidy.PPrint.printTag(PPrint.java:1423)
at org.w3c.tidy.PPrint.printTree(PPrint.java:2209)
at org.w3c.tidy.PPrint.printTree(PPrint.java:2306)
at org.w3c.tidy.PPrint.printTree(PPrint.java:2363)
at org.w3c.tidy.PPrint.printTree(PPrint.java:2363)
at org.w3c.tidy.PPrint.printTree(PPrint.java:2363)
at org.w3c.tidy.PPrint.printTree(PPrint.java:2363)
at org.w3c.tidy.PPrint.printTree(PPrint.java:2152)

tidy img src issue

String html = "

1

";
ByteArrayInputStream stream = new ByteArrayInputStream(html.getBytes());
ByteArrayOutputStream tidyOutStream = new ByteArrayOutputStream();
// Instantiate the Tidy object
Tidy tidy = new Tidy();
// Set the input encoding
tidy.setInputEncoding("UTF-8");
// If true, do not output comments, warnings, or error messages
tidy.setQuiet(true);
// Set the output encoding
tidy.setOutputEncoding("UTF-8");
// Do not show warnings
tidy.setShowWarnings(false);
// If this is not disabled, a meta tag is added to the output file
tidy.setTidyMark(false);
// Make it add the XML processing instruction
tidy.setXmlPi(true);
// Indent the content of appropriate tags
tidy.setIndentContent(true);
// Indent content
tidy.setSmartIndent(true);
tidy.setIndentAttributes(false);
// Output as XHTML
tidy.setXHTML(true);
// Remove useless tags
tidy.setMakeClean(true);
// Clean up Word 2000 content
tidy.setWord2000(true);
// Set the error output
tidy.setErrout(new PrintWriter(System.out));
tidy.parse(stream, tidyOutStream);
String target = tidyOutStream.toString();
System.out.println(target);

<title></title>


<p><img src="%65E0_16287796991580.png" />1</p>

source = img src="无_16287796991580.png
target = img src="%65E0_16287796991580.png
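
For comparison, the character 无 is U+65E0, so the reported "%65E0" appears to be the hexadecimal code point rather than a byte-wise percent-encoding. The sketch below is not JTidy code (the class and method names are hypothetical). It shows what a UTF-8 percent-encoding of the non-ASCII characters in that file name would look like, leaving ASCII characters untouched.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not part of JTidy.
final class PercentEncodeDemo {

    /** Percent-encodes every non-ASCII character as its UTF-8 bytes. */
    static String percentEncodeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int unsigned = b & 0xFF;
            if (unsigned < 0x80) {
                sb.append((char) unsigned); // ASCII is copied as-is in this sketch
            } else {
                sb.append('%').append(String.format("%02X", unsigned));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u65E0 is the character from the report above.
        System.out.println(percentEncodeNonAscii("\u65E0_16287796991580.png"));
        // prints %E6%97%A0_16287796991580.png
    }
}
```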

JTidy converts HTML 5 Doctype declaration into HTML 2.0 Doctype declaration

When migrating from JTidy r938 to the latest JTidy 1.0.4, I noticed that HTML 5 Doctype declarations present in the input HTML code are now converted to HTML 2.0 Doctype declarations. This is a new behavior. While JTidy r938 did not have official support for HTML 5, it did not change an HTML 5 Doctype declaration that was present in the input.

HTML input:

<!DOCTYPE html>
<html>
   [...]
</html>

HTML output with JTidy r938:

<!DOCTYPE html>
<html>
   [...]
</html>

HTML output with JTidy 1.0.4:

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
   [...]
</html>

Here is the source code that I use to parse and reformat the HTML code:

public static String formatHtmlCode(InputStream htmlStream) {
	Tidy tidy = new Tidy();
	tidy.setIndentContent(true);
	tidy.setErrout(new PrintWriter(NullOutputStream.NULL_OUTPUT_STREAM)); // do not write messages to STDERR
	ByteArrayOutputStream buffer = new ByteArrayOutputStream();
	tidy.parse(htmlStream, buffer);
	return buffer.toString();
}

Is this the expected behavior and I'm missing something, maybe some change between r938 and 1.0.4 which requires changes in my code? Or is it a bug? Is something broken with the automatic detection of the Doctype?

Thanks in advance for any help on this topic.

Best regards,
Ste

setDropProprietaryTags config not working for me

Here is my setup.

def tidyUp(str: String): String = {
  val nl = """\n""".r
  val tidy = new Tidy()
  tidy.setXmlOut(true)
  tidy.setXHTML(true)
  tidy.setInputEncoding("UTF-8")
  tidy.setOutputEncoding("UTF-8")
  tidy.setIndentAttributes(false)
  tidy.setDropProprietaryTags(false)
  tidy.setIndentContent(false)
  tidy.setSpaces(0)
  tidy.setWraplen(Integer.MAX_VALUE)
  tidy.setPrintBodyOnly(true)
  tidy.setQuiet(true)
  tidy.setDropEmptyParas(false)
  tidy.setTrimEmptyElements(false)
  tidy.setEscapeCdata(false)
  tidy.setShowWarnings(false)
  val inputStream = new ByteArrayInputStream(str.getBytes("UTF-8"))
  val outputStream = new ByteArrayOutputStream()
  tidy.parseDOM(inputStream, outputStream)
  nl.replaceAllIn(outputStream.toString(), " ").trim
}

For:

tidyUp("<a-b-c>hello</a-b-c>")

I am expecting it to return: <a-b-c>hello</a-b-c>
But it returns empty string with Error: line 1 column 1 - Error: <a-b-c> is not recognized!
Also, can someone help on how can I get wrapped output of HTML instead of replacing \n with " "?

Thanks in advance!

Log4j tests log file not written to build directory

When running unit tests a tests.log file is being produced and written to the project root directory.
It should really be written to the build directory so that it automatically gets cleaned between runs and also does not pollute the project directory, as it could be accidentally committed to the repo

Html 3.2 "type" attribute warning incorrect

When checking an HTML 3.2 document, JTidy logs warnings that a "type" attribute is missing, which is incorrect for the version being checked.

Error logged is: "line 1 column 167 - Warning: lacks "type" attribute"

In HTML 3.2 the type attribute is not valid on "link" elements, and possibly on other elements as well.

Move to proper license

I am not clear which license the license in this repository is based on; it could be MIT, BSD, or something else. It does not seem to be GPL-based, and most classify it as "other". I think we should look to move to a proper common, standard FOSS license, likely MIT, BSD, or maybe Apache-2.0. The original license is dated 2000, so the code would still be covered by it. Nonetheless, going forward I think it is best to move to a proper license.

Bring dependencies up to date

Update dependencies versions so they don't get stuck on older versions and hence break with newer versions of build tools and JDK

Become root of all forks or detach from other forks

Pretty minor and moot for now. In the long term, for contributors' commits to count on their contribution charts, we need to either detach from the other forks or become the root fork. Those are the two available routes.

  1. Get permission from the original root fork owner @spullara to replace them as the root fork.
    Need to contact Github once we get @spullara permission to re-root.

  2. Detach from other forks and become a stand alone repo.
    Also requires contacting Github to request such. Lose relation to other forks/repos.

At some point one of those two needs to take place, ideally #1, as that would keep the association with other forks, though that is not strictly necessary. The main goal is to have commits and contributions show on people's contribution charts, and to be seen not as a fork but as the main project for others to contribute to, rather than a bunch of forks with different modifications spread all over the place. This is just an attempt to clean house and keep the project's relation to other forks.

Fix Javadoc warnings under java 8

Javadoc creation under Java 8 became stricter with the introduction of doclint.
Rather than disabling doclint, which hides the problem, the Javadoc comments should be cleaned up to conform to the new, stricter requirements.
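
As an illustration of the kind of cleanup involved, here is a small sketch of a comment written to pass doclint. The class and method are hypothetical and are not taken from the JTidy sources. It documents every parameter, the return value, and the exception, closes its HTML tags, and keeps the less-than sign inside an inline code tag instead of writing it bare.

```java
import java.util.List;

// Hypothetical example, not taken from the JTidy code base.
final class DoclintExample {

    /**
     * Counts the strings whose length is below a limit.
     *
     * <p>A string is counted when {@code s.length() < limit}. The comparison
     * sits inside an inline code tag because a bare less-than sign is
     * reported as malformed HTML.</p>
     *
     * @param values the strings to inspect; must not be {@code null}
     * @param limit the exclusive upper bound on the length
     * @return the number of strings shorter than {@code limit}
     * @throws NullPointerException if {@code values} is {@code null}
     */
    static int countShorterThan(List<String> values, int limit) {
        int count = 0;
        for (String s : values) {
            if (s.length() < limit) {
                count++;
            }
        }
        return count;
    }
}
```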

XPath

When I used XPath through your code, it was very slow.
How can I fix that? Thank you.

Replacing bug

JTidy replaces Chinese quotation marks with English ones. For instance, <img alt="测试“123”" /> is changed to <img alt="测试"123"" />, which is clearly an unreasonable substitution.
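
To make the problem concrete: the curly quotes U+201C and U+201D are ordinary characters inside a double-quoted attribute, while the ASCII quote U+0022 ends the attribute value unless it is written as &quot;. The sketch below uses hypothetical names and does not reproduce JTidy's serializer. It shows an attribute writer that leaves the curly quotes alone and escapes only the characters that need it.

```java
// Hypothetical demo, not part of JTidy.
final class AttributeQuoteDemo {

    /** Escapes the characters that are unsafe inside a double-quoted attribute. */
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;");
    }

    /** Builds an img tag whose alt attribute holds the given text. */
    static String imgAlt(String alt) {
        return "<img alt=\"" + escapeAttribute(alt) + "\" />";
    }

    public static void main(String[] args) {
        // Curly quotes (U+201C, U+201D) pass through unchanged.
        System.out.println(imgAlt("\u6D4B\u8BD5\u201C123\u201D"));
        // ASCII quotes are written as &quot; so the attribute stays intact.
        System.out.println(imgAlt("say \"hi\""));
    }
}
```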
