<div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clip

" " returns space after sanitize instead of returning same " ",about owasp/java-html-sanitizer

Comments (35)

mikesamuel commented on July 17, 2024 3

i can understand that you do not have time but that should not be an argument to close this issue. you can give it a low priority maybe some one else can take this one.

It's not just that I don't have a lot of time, but also that I've seen nothing that leads me to believe that it serves a purpose ; I don't just prioritize it low, I can't distinguish its priority from Priority:Worthless.

from java-html-sanitizer.

mikesamuel commented on July 17, 2024 1

I'm not going to expand the space of configurations to specify different policies on entity encoding.

Expanding the configuration space reduces the mean test coverage per configuration and the minimum test coverage per configuration.

Noone has yet outlined a use-case that requires this extra configuration.

I have too many other demands on time to write and maintain lots of extra tests for configurations that don't meet a required use-case.

I can't recommend this library for use in production if the minimum test coverage per configuration is low.

I want to be able to recommend this library for use in production.

CJK languages compress better when arbitrary code-points are not encoded as character reference.

Therefore, until I see a clear and well-argued use-case,
I will not complicate the code that decides whether to encode a code-point as a character reference.

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

This is a bug for sure. The sanitize should not replace entities as they are 
the generalized way of expressing extended characters regardless of encoding. 
They are perfectly valid and safe in HTML.

Original comment by [email protected] on 24 Jun 2014 at 4:32

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

Are you not seeing codepoint U+A0 in the output?

Original comment by [email protected] on 24 Jun 2014 at 9:10

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

I'm seeing "?" in the output. We use CKEditor which has a feature to insert 
extended characters which are inserted as HTML entities. 

You can see the ckeditor demo here:
http://ckeditor.com/demo

If the user does this, the result after sanitizing replaces these entities with 
"?". This occurs whether I use the entity name or number. E.g.

<p>blah blah blah ♦</p>
<p>blah blah blah ♦!</p>

Original comment by [email protected] on 25 Jun 2014 at 10:09

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

I don't know what CKEditor has to do with this bug, and you talk about '?' in 
the output but to understand the output, I'd have to actually see the headers 
and meta-content of the response you're serving.

I added a test (see patch below) and entities seems to work fine.  The HTML 
sanitizer decodes entities just fine and normalizes them in the output.

As long as your content-type header's charset matches the charset you used to 
encode your string, then the content should reach the browser just fine.

Index: src/tests/org/owasp/html/HtmlPolicyBuilderTest.java
===================================================================
--- src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (revision 235)
+++ src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (working copy)
@@ -282,6 +282,17 @@
             "<select>\n  <option>1</option>\n  <option>2</option>\n</select>"));
   }

+  @Test
+  public static final void testEntities() throws Exception {
+    assertEquals(
+        "(Foo)\u00a0(Bar)\u2666\u2666\u2666\u2666(Baz)"
+        + "𔠴𔠴𔠴(Boo)",
+        apply(
+            new HtmlPolicyBuilder(),
+            "(Foo) (Bar)♦♦♦♦(Baz)"
+            + "\ud812\udc34𔠴𔠴(Boo)"));
+  }
+
   private static String apply(HtmlPolicyBuilder b) throws Exception {
     return apply(b, EXAMPLE);
   }

Original comment by [email protected] on 25 Jun 2014 at 1:09

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

I added a test to HtmlSanitizerTest

    @Test
    public static final void testDiamond() throws Exception {
        assertEquals("<p>blah blah blah ♦</p>", sanitize("<p>blah blah blah ♦</p>"));
    }

I get:

    junit.framework.ComparisonFailure: null 
    Expected :<p>blah blah blah ♦</p>
    Actual   :<p>blah blah blah ♦</p>

Original comment by [email protected] on 25 Jun 2014 at 3:10

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

And what I don't understand is

> Expected :<p>blah blah blah ♦</p>
> Actual   :<p>blah blah blah ♦</p>

Why is the actual result a problem?

Original comment by [email protected] on 5 Jul 2014 at 8:28

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

The issue is text portability. Named entities preserve the text as simple ascii 
which is universal. In our situation we send various html content to 3rd party 
systems so we  can't guarantee a common unicode for special characters.  
Currently we added jericho as a dependency to reparse and convert these back to 
entities.   

Unless you think there's some security issue with entities I think they should 
be preserved. Best would probably be to have a setting for this in the 
HtmlPolicyBuilder.

Thanks

Original comment by [email protected] on 6 Jul 2014 at 10:53

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

Having same issue ' is replaced with it's ' code @ with @
Version: r239
What shoud I do to avoid such behaviour?

Original comment by [email protected] on 11 Oct 2014 at 5:42

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

The adventures of ' from the input:

Version: r239
org.owasp.html.Encoding
line 55-60:
it calls method HtmlEntities.decodeEntityAt(s, amp, n);
it calculates codepoint 39 for ' so it puts ' symbol in StringBuilder instead 
of '...
After it in encodeHtmlOnto method line:168-176
it rereplaces ' with it's code from String[] REPLACEMENTS (it's for sure ').
but yet I don't understand why, or how to easily add ' to allowed symbols?

Original comment by [email protected] on 11 Oct 2014 at 6:29

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

Why are these replacements problems?

Original comment by [email protected] on 13 Oct 2014 at 10:23

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

Numerical entities would be interpreted differently depending on unicode. Named 
entities should really be preserved for text portability.

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

Original comment by [email protected] on 13 Oct 2014 at 11:10

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

Agree with [email protected].
Character entity references in HTML are already a symbol references, widely 
used.
It will be great if they will be just ignored by, maybe some optional setting 
implemented to HtmlEntities.
Text pocessing could be bore predictable and reversable in this case.

Original comment by [email protected] on 14 Oct 2014 at 8:31

from java-html-sanitizer.

GoogleCodeExporter commented on July 17, 2024

Replacing   with a unicode no-break-space messes me up because I use UTF-8 
encoding, and when I UTF-8 encode a unicode no-break-space it results in Â.  
If I pass in " " then I want to get " " back. Why bother replacing it?

Original comment by [email protected] on 28 Jan 2015 at 4:41

from java-html-sanitizer.

Rancho007 commented on July 17, 2024

We have an another issue example, if you enter ’ sanitizer covert this to html character (´) which is not recognized by rendered engine (UTF-8).

Actually we are not understanding why santizer required to change html entities to characters. This happend to only some entites not for all (example < is not converted)

Kr,
Urvish

from java-html-sanitizer.

Rancho007 commented on July 17, 2024

Hi,

In fact if use different encoding rather than UTF-8 there is problem.
If use ISO-8859-1 it does not convert HTMLentities well to character.
But we are not understanding why it is required to convert HTML entities to character.
Example (default file encoding ISO-8599-1),
Before sanitize: &#8217
After sanitize: ? (if we use encoding UTF-8 it gives ´(SINGLE RIGHT QUOTATION MARK))

According to us this is incorrect?

Kr,
U

from java-html-sanitizer.

tibistibi commented on July 17, 2024

i can understand that you do not have time but that should not be an argument to close this issue. you can give it a low priority maybe some one else can take this one.

you need a clear and well argued use case. ok here is mine.

i have a front-end where, via a rich text editor, a user can add content. which will be stored in the db and shown to users on some other page.
i want to make this secure so the editor can't add tags which are not allowed but he should still be able to add all the text he wants.
i also want to give clear error messages.

so ideally i would receive the added text. i would run it through the sanitizer to remove tags i don't want. then i would compare the added text with the cleaned text. if this is different i return an error telling the editor he added wrong input.
second spaces should still be kept as &nbsp because multiple &nbsp is different presented in html than multiple spaces (they are compressed to one space).

the strange thin is i have all the options to configure which tags i want with even which arguments but i have no option at all to toggle the encoding.

both scenario's are not possible because the text is encoded when i sanitize. encoding and sanitation are 2 different steps which should be two different steps in the code. this will not add you more unit tests it just would make the lib more modular.

i won't mind helping out 😄

from java-html-sanitizer.

tibistibi commented on July 17, 2024

from #43 to put all the examples in 1 issue
test case:

    @Test
    public void testSpace() {
        String text = "L&amp;nbsp;&amp;nbsp;&amp;nbsp;L";
        assertEquals(text, Sanitizers.FORMATTING.sanitize(text));
     }

why:
1)
a   is something different than a space. when i get a   from my richttext editor i want to preserve it. in the above example when i would add the sanitized text into an html page it would look like L L instead of L L.
2)
i want to know if the user added something wrong and present an error:

    if (!StringUtils.equals(text, clean)) {
        addFieldError("wrong input! please check cleaned text");
    }

i don't want this to happen after a spcae replacement.

todo:
remove the   to space part or make it optional.

work around:
replace the   by spaces. do the sanitizing and checking and re add the  

from java-html-sanitizer.

mikesamuel commented on July 17, 2024

  is something different than a space

Agreed. That's why I don't normalize it as space (U+20) and instead re-encode it as non-breaking space (U+A0).

from java-html-sanitizer.

tibistibi commented on July 17, 2024

ok but for me this tool is worthless now because i need to sanitize and not decode/encode.

somehow it is not important for you to preserve added text. how can i sanitize and keep the text equal. how can i tell my users that they added wrong tags?

from java-html-sanitizer.

mikesamuel commented on July 17, 2024

Testing string equality is extremely brittle -- there are many HTML strings with the same semantics.

What does HtmlChangeListener not provide that you need to do your check?

from java-html-sanitizer.

tibistibi commented on July 17, 2024

yes i was looking into that one interesting that i get an call for every change well every sanitize change. the decode encode changes are not send to a listener.
but its not a problem that html strings have the same semantics when it stays the same i do not have to report any thing to the user.
but i agree with the listener i think i can report back to the user.

so how can i keep the text equal?

from java-html-sanitizer.

tibistibi commented on July 17, 2024

i ran the test again with some extra chars and they all get changed.:

@Test
public void testSpace() {
    String text = "&nbsp;&nbsp;&nbsp; ' =  _+ ;'\" ";
    String result = Sanitizers.FORMATTING.sanitize(text);
    assertEquals(text, result);
}

so the nbsp's get decoded and the others get encoded. what i find hard to understand is that you seem a very smart coder but don't mind a method does not do what it is called. when you create a method sanitize and what it does is sanitize and encode/decode i find it really confusing. apart that i cant' use the tool now. i either need to fork this or need to get the text out first and re-add it later.

from java-html-sanitizer.

mikesamuel commented on July 17, 2024

If you have problems with HtmlChangeListener, open an issue about that.

I define "sanitize" in a specific way that allows semantics preserving changes to documents in addition to elision.

If you choose to fork or not use the tool then good luck to you.

from java-html-sanitizer.

tibistibi commented on July 17, 2024

i don't have any problem with the HtmlChangeListhren but it is interesting that you find some changes changes reportable and others are quietly done.

i think we agree on all but what you find semantics preserving changes for me (and the others above) are real changes which are not acceptable for me nor for my client. i will go on and fork it than.

i think your tool would be much better when you split the method sanitize into sanitize which will clean the text and a method dencode which will change chars for some reason.

from java-html-sanitizer.

mikesamuel commented on July 17, 2024

i think we agree on all but what you find semantics preserving changes for me

This is defined by the HTML5 specification. It's not just a matter of opinion.

i think your tool would be much better when you split the method sanitize into sanitize which will clean the text and a method dencode which will change chars for some reason.

This is not just a whim. Sanitizing requires taking a large and inconsistently interpreted set of strings and producing an output that is in a safe and consistently interpreted set. You seem to think that defining that subset without encoding/decoding is easy. My experience writing sanitizers has taught me otherwise.

from java-html-sanitizer.

tibistibi commented on July 17, 2024

just for future references I ran into another issue. with encoding text which is that we use place holders which get replaced by a process later in the chain. these text can be anything.
so if the place holder text would be te'st than it will be encoded to te'st and lose its place holder capability.

one more thing when reading through the documentation i could not find anything about encoding. can you point me to these documents?

from java-html-sanitizer.

mikesamuel commented on July 17, 2024

Encoding and decoding of HTML text nodes are specified in http://www.w3.org/html/wg/drafts/html/master/single-page.html#character-references

from java-html-sanitizer.

tibistibi commented on July 17, 2024

thanks. but w3 does not say what this html sanitizer does.
i would expect somewhere a remark or documentation about what this tool does besides sanitizing. which chars are decoded (at least the &nbsp) and which are encoded.

from java-html-sanitizer.

mikesamuel commented on July 17, 2024

I'm intentionally not specifying that. Clients should not rely on the exact character output because I may change that in response to new threats.

from java-html-sanitizer.

tibistibi commented on July 17, 2024

ah ok so basically user who use this tool should be warned that by upgrading the encoding and decoding could changed without notice.

from java-html-sanitizer.

jmanico commented on July 17, 2024

Those changes would be discussed in release notes.

Again, if you are upset at Mr. Samuels choices, I encourage you to fork
the project.

Please note that Michael is a volunteer and this is an open source
project that he has donated to the OWASP foundation. He is also kind
enough to continue maintaining the project to the best of his ability.

Please be sure to maintain respectful, helpful discourse.

Aloha,
Jim

On 10/6/15 1:44 PM, tibistibi wrote:

ah ok so basically user who use this tool should be warned that by
upgrading the encoding and decoding could changed without notice.

—
Reply to this email directly or view it on GitHub
#30 (comment).

from java-html-sanitizer.

tibistibi commented on July 17, 2024

jim first I’m not addressing anyone in particular, anyone with knowledge of the project can respond.

second I spend my time on this forum/issue because I think it will improve the product so it could be used in more real life situations. I see some other posters who have problems with the encoding and decoding like myself.

third there was a question to provide clear and well argued cases, so that's what I’m doing.

off course I can fork the project to change it to whatever I want but that will not improve the community it will just clutter it and probably my implementation will not be as good because off lake of experience and overview being new to the code.

and finally if you think I’m in any way not respectful just point me to what sentence let you think that so I can update it. I’m not a native English speaker so it could be that I made some mistakes.

from java-html-sanitizer.

jmanico commented on July 17, 2024

Fair comments and thank you for your support.

Jim Manico
@manicode
(808) 652-3805

On Oct 6, 2015, at 2:04 PM, tibistibi [email protected] wrote:

jim first I’m not addressing anyone in particular, anyone with knowledge of the project can respond.

second I spend my time on this forum/issue because I think it will improve the product so it could be used in more real life situations. I see some other posters who have problems with the encoding and decoding like myself.

third there was a question to provide clear and well argued cases, so that's what I’m doing.

off course I can fork the project to change it to whatever I want but that will not improve the community it will just clutter it and probably my implementation will not be as good because off lake of experience and overview being new to the code.

and finally if you think I’m in any way not respectful just point me to what sentence let you think that so I can update it. I’m not a native English speaker so it could be that I made some mistakes.

—
Reply to this email directly or view it on GitHub.

from java-html-sanitizer.

" " returns space after sanitize instead of returning same " " about java-html-sanitizer HOT 35 CLOSED

Comments (35)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent