

mity commented on June 3, 2024

Ack. This is needed.

The open question is what exactly it should do.

My current blurry idea includes these points:

  • Disable raw HTML blocks and spans (at least for now; I don't want to write a full HTML parser to do more complex filtering). This is to prevent any <script> etc.
  • Disable all links but those with a whitelisted scheme (http:, https:, mailto:, maybe a few others); a rough sketch of such a scheme check follows below.
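
A minimal sketch of the scheme-whitelisting idea, written as an app-side helper. This is purely illustrative: the function name, the whitelist contents and the decision to allow scheme-less (relative) URLs are all assumptions, not part of MD4C.

#include <string.h>

static int
is_safe_link(const char* url, size_t size)
{
    /* Illustrative whitelist only; a real check should also compare
     * case-insensitively, since URL schemes are case-insensitive. */
    static const char* const whitelist[] = { "http:", "https:", "mailto:" };
    size_t i, w;

    for(i = 0; i < size; i++) {
        if(url[i] == ':') {
            /* url[0..i] (including the colon) is the scheme. */
            for(w = 0; w < sizeof(whitelist) / sizeof(whitelist[0]); w++) {
                size_t len = strlen(whitelist[w]);
                if(i + 1 == len  &&  strncmp(url, whitelist[w], len) == 0)
                    return 1;
            }
            return 0;       /* scheme present but not whitelisted */
        }
        if(url[i] == '/'  ||  url[i] == '?'  ||  url[i] == '#')
            break;          /* path/query/fragment before any ':' => no scheme */
    }
    return 1;               /* scheme-less (relative) URL; allowed here */
}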

Opinions? Other ideas?

DemiMarie commented on June 3, 2024

It would be nice to be able to whitelist some HTML tags, like Snudown (Reddit’s non-conformant parser) can. But that would be hard. On the other hand, MD4C already must be able to parse out a tag for inline HTML, and in safe mode we can simply disallow anything that isn’t strictly valid. (Checking for basic syntax correctness isn’t that hard. The hard part in parsing HTML is handling invalid markup. But we can simply reject invalid markup.)

mity commented on June 3, 2024

Whitelisting inline (correctly formed) tags is fine.

But the raw HTML block, with its start and end conditions, is quite a different story.

So, to be sure: you propose to treat raw HTML blocks differently in the safe mode and recognize them only if they are a sequence of complete and valid (whitelisted) tags. I.e. treat them almost the same way as a paragraph composed of nothing but raw HTML inlines, the only difference being that the block is not enclosed in <p>. Right?

I'll think about it.

DemiMarie commented on June 3, 2024

@mity I was thinking of ignoring HTML blocks entirely, which MD4C can already do. Much simpler and less prone to security vulnerabilities, both memory safety violations and whitelist validation failures. And HTML blocks are not needed for most applications that want a safe mode.
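
For reference, a minimal sketch of that configuration, assuming the current md4c.h interface; the callbacks are no-op placeholders here, and a real application would emit its output in them.

#include "md4c.h"

/* Sketch only: no-op callbacks, just to show where the MD_FLAG_NOHTML*
 * flags plug in. */
static int enter_block(MD_BLOCKTYPE t, void* det, void* ud) { (void)t; (void)det; (void)ud; return 0; }
static int leave_block(MD_BLOCKTYPE t, void* det, void* ud) { (void)t; (void)det; (void)ud; return 0; }
static int enter_span(MD_SPANTYPE t, void* det, void* ud)   { (void)t; (void)det; (void)ud; return 0; }
static int leave_span(MD_SPANTYPE t, void* det, void* ud)   { (void)t; (void)det; (void)ud; return 0; }
static int text_cb(MD_TEXTTYPE t, const MD_CHAR* s, MD_SIZE n, void* ud) { (void)t; (void)s; (void)n; (void)ud; return 0; }

static const MD_PARSER parser = {
    0,                  /* abi_version */
    MD_FLAG_NOHTML,     /* MD_FLAG_NOHTMLBLOCKS | MD_FLAG_NOHTMLSPANS */
    enter_block, leave_block, enter_span, leave_span, text_cb,
    NULL,               /* debug_log */
    NULL                /* syntax (reserved) */
};

/* Usage:  md_parse(input, input_size, &parser, userdata); */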

mity commented on June 3, 2024

Ugh. That means that this:

<div>

Paragraph 1.

Paragraph 2.

</div>

would be translated to this:

<p><div></p>

<p>Paragraph 1.</p>

<p>Paragraph 2.</p>

<p></div></p>

mity commented on June 3, 2024

I tried to implement something as discussed above. It is very incomplete, but I can already see this approach has some big problems. So I consider it just a proof of concept which shows its severe limitations, and that the real solution needs more work than I thought. (As always ;-)

The open problems:

  • Raw HTML blocks: Currently not handled at all.

  • filter_url_scheme() callback: Is it the right way? Maybe the app wants to see the whole link destination and not just the scheme. (Consider e.g. relative links without any scheme. The app might still want to forbid them if they start e.g. with .. or /.) That way, the interface could be unified with the attribute callback (see below). On the flip side, the app then must parse the scheme on its own if that's what it wants to filter.

  • Tag attributes: Shouldn't there also be a filter_raw_html_tag_attribute()? Consider e.g. all those onclick, onload and similar HTML attributes. (A purely hypothetical sketch of these callback shapes follows after this list.)

  • Markdown construct versus semantically equivalent raw HTML: Consider e.g. [innocent text](dangerous_url) versus <a href="dangerous_url">innocent text</a>. Shouldn't this be somehow unified in the filtering interface? Maybe the Markdown construct should be translated to a raw HTML tag/attribute for the purposes of the filtering, so the app does not get a chance to screw it up by forbidding one and allowing the other, or vice versa.

  • HTML comments or CDATA? Do we want to support them as well in the safe mode? I tend to say no, and to keep it relatively simple.
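
To make the callback questions above concrete, here is a purely hypothetical sketch of the signatures under discussion. None of these names exist in md4c.h; they only restate the options from the list.

#include "md4c.h"   /* for MD_CHAR, MD_SIZE */

/* From the filter_url_scheme() bullet: the app sees only the scheme... */
typedef int (*filter_url_scheme_fn)(const MD_CHAR* scheme, MD_SIZE scheme_size,
                                    void* userdata);

/* ...or the whole destination, so it can also reject relative links
 * starting with ".." or "/". */
typedef int (*filter_url_fn)(const MD_CHAR* url, MD_SIZE url_size,
                             void* userdata);

/* From the tag-attributes bullet: an attribute-level filter covering
 * onclick, onload, style and similar attributes.  Returning 0 rejects. */
typedef int (*filter_raw_html_tag_attribute_fn)(
                const MD_CHAR* tag, MD_SIZE tag_size,
                const MD_CHAR* attr, MD_SIZE attr_size,
                const MD_CHAR* value, MD_SIZE value_size,
                void* userdata);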

craigbarnes commented on June 3, 2024

Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:

<div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">

mity commented on June 3, 2024

> Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:
>
> <div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">

Yeah, I have realized that too (see the comment with the list of open problems above). I am in the process of thinking about how the filtering interface should look.

mity commented on June 3, 2024

Right now, I tend to discard the attempt above and start again.

As said, I am mostly wondering how the interface should look. I am especially interested in feedback on the following questions:

  • Is it enough to call a tag callback and then an attribute callback n times (once for each attribute)? Or are there cases where the filter needs to know all of a tag's attributes (and their values) at the same time?

  • If a single attribute is refused, should just that single attribute be skipped, or should the whole tag be downgraded from the "HTML tag" status and treated as normal text? (Well, even allowing the attribute callback to choose either way could perhaps be quite simple to do. But is it a good idea?)

  • It would be great for the compactness of the API if Markdown links and images were treated the same way (for the purposes of filtering), and through the same callbacks, as raw <a> and <img> tags (including the attributes href and title; and src and alt). However, CommonMark allows complicated things (e.g. escapes, entities) in those contexts; and especially the image contents, which may be any valid Markdown including links and nested images and which map to the alt attribute, are a huge complication. This has already meant quite big complications and API changes in the past (the current struct MD_ATTRIBUTE and the change in how images are rendered). So I am wondering whether this unification is a good or a bad idea. (The MD_ATTRIBUTE sketch after this list shows part of the complication.)
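
To illustrate part of that complication: a Markdown link destination arrives as an MD_ATTRIBUTE whose text is split into typed substrings (normal text, entities, null-character replacements), so feeding it into the same filter as a raw href value means flattening it first. A rough sketch, assuming the MD_ATTRIBUTE layout from md4c.h and leaving entity resolution as a stub:

#include <stdio.h>
#include "md4c.h"

/* Rough sketch: walk the typed substrings of an MD_ATTRIBUTE (e.g. the href
 * member of MD_SPAN_A_DETAIL) as one flat string.  Entities are only marked
 * here; a real implementation would resolve them before filtering, which is
 * exactly the complication noted above. */
static void
dump_attribute(const MD_ATTRIBUTE* attr)
{
    int i;

    /* The substring list is terminated by substr_offsets[last+1] == attr->size. */
    for(i = 0; attr->substr_offsets[i] < attr->size; i++) {
        MD_OFFSET off = attr->substr_offsets[i];
        MD_SIZE len = attr->substr_offsets[i + 1] - off;

        if(attr->substr_types[i] == MD_TEXT_ENTITY)
            printf("[entity %.*s]", (int) len, attr->text + off);   /* e.g. "&amp;" */
        else
            printf("%.*s", (int) len, attr->text + off);            /* MD_TEXT_NORMAL, MD_TEXT_NULLCHAR */
    }
}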

craigbarnes commented on June 3, 2024

> Is it enough to call a tag callback and then an attribute callback n times (once for each attribute)? Or are there cases where the filter needs to know all of a tag's attributes (and their values) at the same time?

The only cases where you need to know the element name, attribute name and attribute value at the same time are a[href], area[href] and img[src].

The large majority of attributes suitable for whitelisting are global attributes (applicable/safe for all elements), where the value is implicitly safe (doesn't need to be known by the filter, doesn't depend on the presence of other attributes).

I've implemented a minimal HTML5 DOM sanitizer in Lua here. The safety of the code is dependent on the correctness of the underlying HTML5 parser, but the set of whitelisting rules should be similar to what you might want to use here.

> If a single attribute is refused, should just that single attribute be skipped, or should the whole tag be downgraded from the "HTML tag" status and treated as normal text?

Just that attribute should be omitted. I can't think of any case where removing a non-whitelisted attribute from a whitelisted element would make it unsafe.
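
A sketch of a filter along those lines. The whitelists are illustrative only, not a vetted sanitization policy, and is_safe_link() refers to the scheme sketch earlier in the thread.

#include <string.h>

static int is_safe_link(const char* url, size_t size);  /* scheme sketch above */

/* Illustrative only: returns non-zero if the attribute may be kept on the
 * given element.  Everything not explicitly allowed is dropped. */
static int
attr_allowed(const char* tag, const char* attr, const char* value)
{
    /* Global attributes whose value is implicitly safe. */
    static const char* const global_attrs[] = { "class", "id", "lang", "dir", "title" };
    size_t i;

    /* The element-specific cases where the value itself must be checked. */
    if(strcmp(attr, "href") == 0  &&  (strcmp(tag, "a") == 0 || strcmp(tag, "area") == 0))
        return is_safe_link(value, strlen(value));
    if(strcmp(attr, "src") == 0  &&  strcmp(tag, "img") == 0)
        return is_safe_link(value, strlen(value));

    for(i = 0; i < sizeof(global_attrs) / sizeof(global_attrs[0]); i++) {
        if(strcmp(attr, global_attrs[i]) == 0)
            return 1;
    }
    return 0;   /* style, on*, and anything else unknown is dropped */
}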
