Giter Club home page Giter Club logo

Comments (35)

mikesamuel avatar mikesamuel commented on July 17, 2024 3

i can understand that you do not have time but that should not be an argument to close this issue. you can give it a low priority maybe some one else can take this one.

It's not just that I don't have a lot of time, but also that I've seen nothing that leads me to believe that it serves a purpose ; I don't just prioritize it low, I can't distinguish its priority from Priority:Worthless.

from java-html-sanitizer.

mikesamuel avatar mikesamuel commented on July 17, 2024 1

I'm not going to expand the space of configurations to specify different policies on entity encoding.

Expanding the configuration space reduces the mean test coverage per configuration and the minimum test coverage per configuration.

Noone has yet outlined a use-case that requires this extra configuration.

I have too many other demands on time to write and maintain lots of extra tests for configurations that don't meet a required use-case.

I can't recommend this library for use in production if the minimum test coverage per configuration is low.

I want to be able to recommend this library for use in production.

CJK languages compress better when arbitrary code-points are not encoded as character reference.

Therefore, until I see a clear and well-argued use-case,
I will not complicate the code that decides whether to encode a code-point as a character reference.

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
This is a bug for sure. The sanitize should not replace entities as they are 
the generalized way of expressing extended characters regardless of encoding. 
They are perfectly valid and safe in HTML.

Original comment by [email protected] on 24 Jun 2014 at 4:32

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
Are you not seeing codepoint U+A0 in the output?

Original comment by [email protected] on 24 Jun 2014 at 9:10

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
I'm seeing "?" in the output. We use CKEditor which has a feature to insert 
extended characters which are inserted as HTML entities. 

You can see the ckeditor demo here:
http://ckeditor.com/demo

If the user does this, the result after sanitizing replaces these entities with 
"?". This occurs whether I use the entity name or number. E.g.

<p>blah blah blah ♦</p>
<p>blah blah blah ♦!</p>



Original comment by [email protected] on 25 Jun 2014 at 10:09

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
I don't know what CKEditor has to do with this bug, and you talk about '?' in 
the output but to understand the output, I'd have to actually see the headers 
and meta-content of the response you're serving.

I added a test (see patch below) and entities seems to work fine.  The HTML 
sanitizer decodes entities just fine and normalizes them in the output.

As long as your content-type header's charset matches the charset you used to 
encode your string, then the content should reach the browser just fine.

Index: src/tests/org/owasp/html/HtmlPolicyBuilderTest.java
===================================================================
--- src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (revision 235)
+++ src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (working copy)
@@ -282,6 +282,17 @@
             "<select>\n  <option>1</option>\n  <option>2</option>\n</select>"));
   }

+  @Test
+  public static final void testEntities() throws Exception {
+    assertEquals(
+        "(Foo)\u00a0(Bar)\u2666\u2666\u2666\u2666(Baz)"
+        + "𔠴𔠴𔠴(Boo)",
+        apply(
+            new HtmlPolicyBuilder(),
+            "(Foo) (Bar)♦♦♦♦(Baz)"
+            + "\ud812\udc34𔠴𔠴(Boo)"));
+  }
+
   private static String apply(HtmlPolicyBuilder b) throws Exception {
     return apply(b, EXAMPLE);
   }

Original comment by [email protected] on 25 Jun 2014 at 1:09

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
I added a test to HtmlSanitizerTest

    @Test
    public static final void testDiamond() throws Exception {
        assertEquals("<p>blah blah blah ♦</p>", sanitize("<p>blah blah blah ♦</p>"));
    }

I get:

    junit.framework.ComparisonFailure: null 
    Expected :<p>blah blah blah ♦</p>
    Actual   :<p>blah blah blah ♦</p>

Original comment by [email protected] on 25 Jun 2014 at 3:10

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
And what I don't understand is

> Expected :<p>blah blah blah ♦</p>
> Actual   :<p>blah blah blah ♦</p>

Why is the actual result a problem?

Original comment by [email protected] on 5 Jul 2014 at 8:28

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
The issue is text portability. Named entities preserve the text as simple ascii 
which is universal. In our situation we send various html content to 3rd party 
systems so we  can't guarantee a common unicode for special characters.  
Currently we added jericho as a dependency to reparse and convert these back to 
entities.   

Unless you think there's some security issue with entities I think they should 
be preserved. Best would probably be to have a setting for this in the 
HtmlPolicyBuilder.

Thanks 

Original comment by [email protected] on 6 Jul 2014 at 10:53

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
Having same issue ' is replaced with it's ' code @ with @
Version: r239
What shoud I do to avoid such behaviour?

Original comment by [email protected] on 11 Oct 2014 at 5:42

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
The adventures of ' from the input:

Version: r239
org.owasp.html.Encoding
line 55-60:
it calls method HtmlEntities.decodeEntityAt(s, amp, n);
it calculates codepoint 39 for ' so it puts ' symbol in StringBuilder instead 
of '...
After it in encodeHtmlOnto method line:168-176
it rereplaces ' with it's code from String[] REPLACEMENTS (it's for sure ').
but yet I don't understand why, or how to easily add ' to allowed symbols?



Original comment by [email protected] on 11 Oct 2014 at 6:29

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
Why are these replacements problems?

Original comment by [email protected] on 13 Oct 2014 at 10:23

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
Numerical entities would be interpreted differently depending on unicode. Named 
entities should really be preserved for text portability.

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references


Original comment by [email protected] on 13 Oct 2014 at 11:10

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
Agree with [email protected].
Character entity references in HTML are already a symbol references, widely 
used.
It will be great if they will be just ignored by, maybe some optional setting 
implemented to HtmlEntities.
Text pocessing could be bore predictable and reversable in this case.

Original comment by [email protected] on 14 Oct 2014 at 8:31

from java-html-sanitizer.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 17, 2024
Replacing   with a unicode no-break-space messes me up because I use UTF-8 
encoding, and when I UTF-8 encode a unicode no-break-space it results in Â.  
If I pass in " " then I want to get " " back. Why bother replacing it?

Original comment by [email protected] on 28 Jan 2015 at 4:41

from java-html-sanitizer.

Rancho007 avatar Rancho007 commented on July 17, 2024

We have an another issue example, if you enter &#8217; sanitizer covert this to html character (´) which is not recognized by rendered engine (UTF-8).

Actually we are not understanding why santizer required to change html entities to characters. This happend to only some entites not for all (example &lt; is not converted)

Kr,
Urvish

from java-html-sanitizer.

Rancho007 avatar Rancho007 commented on July 17, 2024

Hi,

In fact if use different encoding rather than UTF-8 there is problem.
If use ISO-8859-1 it does not convert HTMLentities well to character.
But we are not understanding why it is required to convert HTML entities to character.
Example (default file encoding ISO-8599-1),
Before sanitize: &#8217
After sanitize: ? (if we use encoding UTF-8 it gives ´(SINGLE RIGHT QUOTATION MARK))

According to us this is incorrect?

Kr,
U

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

i can understand that you do not have time but that should not be an argument to close this issue. you can give it a low priority maybe some one else can take this one.

you need a clear and well argued use case. ok here is mine.

i have a front-end where, via a rich text editor, a user can add content. which will be stored in the db and shown to users on some other page.
i want to make this secure so the editor can't add tags which are not allowed but he should still be able to add all the text he wants.
i also want to give clear error messages.

so ideally i would receive the added text. i would run it through the sanitizer to remove tags i don't want. then i would compare the added text with the cleaned text. if this is different i return an error telling the editor he added wrong input.
second spaces should still be kept as &nbsp because multiple &nbsp is different presented in html than multiple spaces (they are compressed to one space).

the strange thin is i have all the options to configure which tags i want with even which arguments but i have no option at all to toggle the encoding.

both scenario's are not possible because the text is encoded when i sanitize. encoding and sanitation are 2 different steps which should be two different steps in the code. this will not add you more unit tests it just would make the lib more modular.

i won't mind helping out 😄

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

from #43 to put all the examples in 1 issue
test case:

    @Test
    public void testSpace() {
        String text = "L&amp;nbsp;&amp;nbsp;&amp;nbsp;L";
        assertEquals(text, Sanitizers.FORMATTING.sanitize(text));
     }

why:
1)
a &nbsp; is something different than a space. when i get a &nbsp; from my richttext editor i want to preserve it. in the above example when i would add the sanitized text into an html page it would look like L L instead of L   L.
2)
i want to know if the user added something wrong and present an error:

    if (!StringUtils.equals(text, clean)) {
        addFieldError("wrong input! please check cleaned text");
    }

i don't want this to happen after a spcae replacement.

todo:
remove the &nbsp; to space part or make it optional.

work around:
replace the &nbsp; by spaces. do the sanitizing and checking and re add the &nbsp;

from java-html-sanitizer.

mikesamuel avatar mikesamuel commented on July 17, 2024

&nbsp; is something different than a space

Agreed. That's why I don't normalize it as space (U+20) and instead re-encode it as non-breaking space (U+A0).

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

ok but for me this tool is worthless now because i need to sanitize and not decode/encode.

somehow it is not important for you to preserve added text. how can i sanitize and keep the text equal. how can i tell my users that they added wrong tags?

from java-html-sanitizer.

mikesamuel avatar mikesamuel commented on July 17, 2024

Testing string equality is extremely brittle -- there are many HTML strings with the same semantics.

What does HtmlChangeListener not provide that you need to do your check?

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

yes i was looking into that one interesting that i get an call for every change well every sanitize change. the decode encode changes are not send to a listener.
but its not a problem that html strings have the same semantics when it stays the same i do not have to report any thing to the user.
but i agree with the listener i think i can report back to the user.

so how can i keep the text equal?

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

i ran the test again with some extra chars and they all get changed.:

@Test
public void testSpace() {
    String text = "&nbsp;&nbsp;&nbsp; ' =  _+ ;'\" ";
    String result = Sanitizers.FORMATTING.sanitize(text);
    assertEquals(text, result);
}

so the nbsp's get decoded and the others get encoded. what i find hard to understand is that you seem a very smart coder but don't mind a method does not do what it is called. when you create a method sanitize and what it does is sanitize and encode/decode i find it really confusing. apart that i cant' use the tool now. i either need to fork this or need to get the text out first and re-add it later.

from java-html-sanitizer.

mikesamuel avatar mikesamuel commented on July 17, 2024

If you have problems with HtmlChangeListener, open an issue about that.

I define "sanitize" in a specific way that allows semantics preserving changes to documents in addition to elision.

If you choose to fork or not use the tool then good luck to you.

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

i don't have any problem with the HtmlChangeListhren but it is interesting that you find some changes changes reportable and others are quietly done.

i think we agree on all but what you find semantics preserving changes for me (and the others above) are real changes which are not acceptable for me nor for my client. i will go on and fork it than.

i think your tool would be much better when you split the method sanitize into sanitize which will clean the text and a method dencode which will change chars for some reason.

from java-html-sanitizer.

mikesamuel avatar mikesamuel commented on July 17, 2024

i think we agree on all but what you find semantics preserving changes for me

This is defined by the HTML5 specification. It's not just a matter of opinion.

i think your tool would be much better when you split the method sanitize into sanitize which will clean the text and a method dencode which will change chars for some reason.

This is not just a whim. Sanitizing requires taking a large and inconsistently interpreted set of strings and producing an output that is in a safe and consistently interpreted set. You seem to think that defining that subset without encoding/decoding is easy. My experience writing sanitizers has taught me otherwise.

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

just for future references I ran into another issue. with encoding text which is that we use place holders which get replaced by a process later in the chain. these text can be anything.
so if the place holder text would be te'st than it will be encoded to te&#39;st and lose its place holder capability.

one more thing when reading through the documentation i could not find anything about encoding. can you point me to these documents?

from java-html-sanitizer.

mikesamuel avatar mikesamuel commented on July 17, 2024

Encoding and decoding of HTML text nodes are specified in http://www.w3.org/html/wg/drafts/html/master/single-page.html#character-references

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

thanks. but w3 does not say what this html sanitizer does.
i would expect somewhere a remark or documentation about what this tool does besides sanitizing. which chars are decoded (at least the &nbsp) and which are encoded.

from java-html-sanitizer.

mikesamuel avatar mikesamuel commented on July 17, 2024

I'm intentionally not specifying that. Clients should not rely on the exact character output because I may change that in response to new threats.

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

ah ok so basically user who use this tool should be warned that by upgrading the encoding and decoding could changed without notice.

from java-html-sanitizer.

jmanico avatar jmanico commented on July 17, 2024

Those changes would be discussed in release notes.

Again, if you are upset at Mr. Samuels choices, I encourage you to fork
the project.

Please note that Michael is a volunteer and this is an open source
project that he has donated to the OWASP foundation. He is also kind
enough to continue maintaining the project to the best of his ability.

Please be sure to maintain respectful, helpful discourse.

Aloha,
Jim

On 10/6/15 1:44 PM, tibistibi wrote:

ah ok so basically user who use this tool should be warned that by
upgrading the encoding and decoding could changed without notice.


Reply to this email directly or view it on GitHub
#30 (comment).

from java-html-sanitizer.

tibistibi avatar tibistibi commented on July 17, 2024

jim first I’m not addressing anyone in particular, anyone with knowledge of the project can respond.

second I spend my time on this forum/issue because I think it will improve the product so it could be used in more real life situations. I see some other posters who have problems with the encoding and decoding like myself.

third there was a question to provide clear and well argued cases, so that's what I’m doing.

off course I can fork the project to change it to whatever I want but that will not improve the community it will just clutter it and probably my implementation will not be as good because off lake of experience and overview being new to the code.

and finally if you think I’m in any way not respectful just point me to what sentence let you think that so I can update it. I’m not a native English speaker so it could be that I made some mistakes.

from java-html-sanitizer.

jmanico avatar jmanico commented on July 17, 2024

Fair comments and thank you for your support.

Jim Manico
@manicode
(808) 652-3805

On Oct 6, 2015, at 2:04 PM, tibistibi [email protected] wrote:

jim first I’m not addressing anyone in particular, anyone with knowledge of the project can respond.

second I spend my time on this forum/issue because I think it will improve the product so it could be used in more real life situations. I see some other posters who have problems with the encoding and decoding like myself.

third there was a question to provide clear and well argued cases, so that's what I’m doing.

off course I can fork the project to change it to whatever I want but that will not improve the community it will just clutter it and probably my implementation will not be as good because off lake of experience and overview being new to the code.

and finally if you think I’m in any way not respectful just point me to what sentence let you think that so I can update it. I’m not a native English speaker so it could be that I made some mistakes.


Reply to this email directly or view it on GitHub.

from java-html-sanitizer.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.