Comments (13)
Ack. This is needed.
The open question is what exactly it should do. My current blurry idea includes these points:
- Disable raw HTML blocks and spans (at least for now; I don't want to write a full HTML parser to do more complex filtering). This is to prevent any `<script>` etc.
- Disable all links except those with a whitelisted scheme (`http:`, `https:`, `mailto:`, maybe a few others).
Opinions? Other ideas?
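For illustration, a scheme whitelist could look something like the sketch below. This is a hypothetical helper, not MD4C's actual API; the function name and whitelist contents are made up. Scheme matching is case-insensitive, since URL schemes are case-insensitive per RFC 3986.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return nonzero if `url` begins with one of the
 * whitelisted schemes. Comparison is case-insensitive; `url` being
 * shorter than a scheme is safe because the '\0' never matches. */
static int
is_url_scheme_allowed(const char* url)
{
    static const char* const whitelist[] = { "http:", "https:", "mailto:" };
    size_t i;

    for(i = 0; i < sizeof(whitelist) / sizeof(whitelist[0]); i++) {
        const char* scheme = whitelist[i];
        size_t n = strlen(scheme);
        size_t j;

        for(j = 0; j < n; j++) {
            if(tolower((unsigned char) url[j]) != scheme[j])
                break;
        }
        if(j == n)
            return 1;
    }
    return 0;
}
```

Note that a pure scheme check like this deliberately rejects relative links (no scheme at all), which may or may not be what an application wants.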
from md4c.
It would be nice to be able to whitelist some HTML tags, like Snudown (Reddit’s non-conformant parser) can. But that would be hard. On the other hand, MD4C already must be able to parse out a tag for inline HTML, and in safe mode we can simply disallow anything that isn’t strictly valid. (Checking for basic syntax correctness isn’t that hard. The hard part in parsing HTML is handling invalid markup. But we can simply reject invalid markup.)
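A rough sketch of the "reject anything that isn't strictly valid" idea, for the simplest case of an attribute-less open or close tag. This is hypothetical illustration only, not MD4C code; real inline HTML also allows attributes, comments, etc., which this deliberately refuses.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical check: does `text` (NUL-terminated) consist of exactly one
 * well-formed, attribute-less open or close tag with a whitelisted name?
 * Anything else is rejected, so invalid markup never passes through. */
static int
is_simple_whitelisted_tag(const char* text)
{
    static const char* const whitelist[] = { "em", "strong", "code", "a" };
    char name[16];
    size_t n = 0;
    size_t i;

    if(*text != '<')
        return 0;
    text++;
    if(*text == '/')
        text++;                         /* closing tag */
    while(isalnum((unsigned char) *text)) {
        if(n + 1 >= sizeof(name))
            return 0;                   /* absurdly long name: reject */
        name[n++] = (char) tolower((unsigned char) *text);
        text++;
    }
    name[n] = '\0';
    if(n == 0  ||  *text != '>'  ||  *(text + 1) != '\0')
        return 0;                       /* not strictly `<name>`: reject */

    for(i = 0; i < sizeof(whitelist) / sizeof(whitelist[0]); i++) {
        if(strcmp(name, whitelist[i]) == 0)
            return 1;
    }
    return 0;
}
```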
from md4c.
Whitelisting inline (correctly formed) tags is fine. But the raw HTML block, with its start and end conditions, is quite a different story.
So, to be sure: you propose to treat raw HTML blocks in the safe mode differently and recognize them only if they are a sequence of complete and valid (whitelisted) tags? I.e., treat them almost the same way as a paragraph composed of nothing but raw HTML inlines, the only difference being that the block is not enclosed in `<p>`. Right?
I'll think about it.
from md4c.
@mity I was thinking of ignoring HTML blocks entirely, which MD4C can already do. Much simpler and less prone to security vulnerabilities, both memory safety violations and whitelist validation failures. And HTML blocks are not needed for most applications that want a safe mode.
from md4c.
Ugh. That means that this:

```html
<div>
Paragraph 1.

Paragraph 2.
</div>
```

would be translated to this:

```html
<p><div></p>
<p>Paragraph 1.</p>
<p>Paragraph 2.</p>
<p></div></p>
```
from md4c.
I tried to implement something as discussed above. It is very incomplete, but I can already see this approach has some big problems. So I consider it just a proof of concept which shows its severe limitations and that a real solution needs more work than I thought. (As always. ;-)
The open problems:

- Raw HTML blocks: Currently not handled at all.
- `filter_url_scheme()` callback: Is it the right way? Maybe the app might want to see the whole link destination and not just the scheme. (Consider e.g. relative links without any scheme. The app might still want to forbid them if they start e.g. with `..` or `/`.) That way, the interface could be unified with the attribute callback (see below). On the flip side, the app then must parse the scheme on its own if that's what it wants to filter.
- Tag attributes: Should there not also be a `filter_raw_html_tag_attribute()` callback? Consider e.g. all those `onclick`, `onload` and similar HTML attributes.
- Markdown construct versus semantically equivalent raw HTML: Consider e.g. `[innocent text](dangerous_url)` versus `<a href="dangerous_url">innocent text</a>`. Should this not be somehow unified in the filtering interface? Maybe, for the purposes of filtering, the Markdown construct should be translated to a raw HTML tag/attribute, so the app does not have a chance to screw it up by forbidding one and allowing the other, or vice versa.
- HTML comments or CDATA? Do we want to support them as well in the safe mode? I tend to say no, and to keep it relatively simple.
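The `onclick`/`onload` concern amounts to a deny-by-default attribute-name policy. A sketch of what such a check might look like (hypothetical; `filter_raw_html_tag_attribute()` is only a proposed name above, not an existing MD4C function, and the whitelist here is illustrative):

```c
#include <ctype.h>
#include <string.h>

/* Sketch of a deny-by-default attribute policy: refuse any event handler
 * ("on..." attributes) and `style`, accept only a short whitelist of
 * harmless attributes. Names are matched case-insensitively, since HTML
 * attribute names are case-insensitive. */
static int
is_attribute_allowed(const char* name)
{
    static const char* const whitelist[] = {
        "href", "src", "title", "alt", "class", "id"
    };
    char lower[32];
    size_t i, n = strlen(name);

    if(n >= sizeof(lower))
        return 0;
    for(i = 0; i < n; i++)
        lower[i] = (char) tolower((unsigned char) name[i]);
    lower[n] = '\0';

    if(strncmp(lower, "on", 2) == 0  ||  strcmp(lower, "style") == 0)
        return 0;

    for(i = 0; i < sizeof(whitelist) / sizeof(whitelist[0]); i++) {
        if(strcmp(lower, whitelist[i]) == 0)
            return 1;
    }
    return 0;
}
```

Whitelisting rather than blacklisting matters here: a blacklist of known-bad attributes would silently admit any new or overlooked scripting vector.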
from md4c.
Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:

```html
<div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">
```
from md4c.
> Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:
> `<div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">`

Yeah, I have realized that too (see the comment with the open problems). I am in the process of thinking about how the interface of the filtering should look.
from md4c.
Right now, I tend to discard the attempt above and start again.
As said, I am mostly wondering how the interface should look. I am especially interested in feedback on the following questions:
- Is it enough to call a tag callback and then (n times) an attribute callback (for each attribute)? Or are there cases where the filter needs to know all the tag's attributes (and their values) at the same time?
- If a single attribute is refused, should just that single attribute be skipped, or should the whole tag be downgraded from "HTML tag" status and be treated as normal text? (Well, even allowing the attribute callback to decide between the two could perhaps be quite simple to do. But is it a good idea?)
- It would be great for compactness of the API if Markdown links and images were treated the same way (for the purposes of filtering), and through the same callbacks, as raw `<a>` and `<img>` tags (including the attributes `href`, `title`; and `src`, `alt`). However, CommonMark allows complicated things (e.g. escapes, entities) in those contexts, and especially the image contents, which may contain any valid Markdown including links and nested images and which are mapped to the `alt` attribute, are a huge complication. This already meant quite big complications and changes of the API in the past (the current struct `MD_ATTRIBUTE` and the change in how images are rendered). So I am wondering whether this unification is a good or bad idea.
from md4c.
> Is it enough to call a tag callback and then (n times) an attribute callback (for each attribute)? Or are there cases where the filter needs to know all the tag's attributes (and their values) at the same time?

The only cases where you need to know the element name, attribute name and attribute value at the same time are `a[href]`, `area[href]` and `img[src]`.
The large majority of attributes suitable for whitelisting are global attributes (applicable/safe for all elements), where the value is implicitly safe: it doesn't need to be known by the filter, and it doesn't depend on the presence of other attributes.
I've implemented a minimal HTML5 DOM sanitizer in Lua here. The safeness of the code is dependent on the correctness of the underlying HTML5 parser, but the set of whitelisting rules should be similar to what you might want to use here.

> If a single attribute is refused, should just that single attribute be skipped, or should the whole tag be downgraded from "HTML tag" status and be treated as normal text?

Just that attribute should be omitted. I can't think of any case where removing a non-whitelisted attribute from a whitelisted element would make it unsafe.
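The `a[href]`/`area[href]`/`img[src]` special cases described above could be expressed as a single element+attribute dispatch, roughly as below. This is a hypothetical sketch, not a proposed MD4C signature; `is_url_ok()` stands in for whatever URL policy the application uses, and the name-only fallback is heavily abbreviated.

```c
#include <string.h>

/* Stand-in URL policy for the sketch: allow http(s) only. */
static int
is_url_ok(const char* url)
{
    return strncmp(url, "http://", 7) == 0
        || strncmp(url, "https://", 8) == 0;
}

/* Hypothetical value-sensitive filter: for most whitelisted attributes
 * only the name matters, but URL-bearing attributes defer to the URL
 * policy. Returns nonzero if the attribute may be kept. */
static int
filter_attribute(const char* element, const char* attr, const char* value)
{
    /* The only element+attribute pairs whose *value* must be inspected. */
    if((strcmp(element, "a") == 0 || strcmp(element, "area") == 0)
            && strcmp(attr, "href") == 0)
        return is_url_ok(value);
    if(strcmp(element, "img") == 0 && strcmp(attr, "src") == 0)
        return is_url_ok(value);

    /* For everything else only the attribute name matters; e.g. refuse
     * event handlers (a real filter would also apply a name whitelist). */
    return strncmp(attr, "on", 2) != 0;
}
```

A design like this also matches the "just omit the refused attribute" policy: the caller drops any attribute for which the filter returns zero and keeps the element itself.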
from md4c.