mozilla / bleach
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
Home Page: https://bleach.readthedocs.io/en/latest/
License: Other
The string "libgl.so.1" is linkified but shouldn't be. Failing test in e7c421e83a8e28aa726113839aedaf31e297d805
Not all inline styles get cleaned appropriately, especially when multiple properties are listed. The first example below produces what I would expect to be accurate output.
>>> from bleach import clean
>>> html = '<p style="font-family: Arial; color: red; float: left;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="color: red;">bar</p>'
However, the following examples produce unexpected results. Specifically, properties other than those listed in the whitelist are included in the output.
>>> html = '<p style="font-family: Arial; color: red; float: left; background-color: red;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="color: red; background-color: red;">bar</p>'
>>> html = '<p style="border: 1px solid blue; color: red; float: left;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="border: 1px solid blue; color: red;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color', 'float'])
'<p style="border: 1px solid blue; color: red; float: left;">bar</p>'
>>> html = '<p style="color: red; float: left; padding: 1em;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color', 'float'])
'<p style="color: red; float: left; padding: 1em;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="color: red; padding: 1em;">bar</p>'
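For reference, the expected behavior in these examples amounts to a simple declaration filter. Here is a minimal sketch of those semantics (an illustration only, not Bleach's actual implementation):

```python
def filter_style(style, allowed):
    """Keep only declarations whose property name is in the allowed list."""
    kept = []
    for decl in style.split(';'):
        if ':' not in decl:
            continue  # Skip empty or malformed declarations.
        prop, _, value = decl.partition(':')
        if prop.strip().lower() in allowed:
            kept.append('%s: %s;' % (prop.strip(), value.strip()))
    return ' '.join(kept)

filter_style('font-family: Arial; color: red; float: left;', ['color'])
# 'color: red;'
```

With this filter, the `background-color`, `border`, and `padding` declarations in the failing examples above would all be dropped, as expected.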
So this is an edge case, but still, here's the test:
def test_delinkify_no_whitespace():
    html = ('<a href="http://ex.mp">a</a>'
            '<a href="http://ex.mp">b</a>'
            '<a href="http://ex.mp">c</a>'
            '<a href="http://ex.mp">d</a>')
    eq_('abcd', bleach.delinkify(html))
This passes:
def test_delinkify_some_whitespace():
    html = ('<a href="http://ex.mp">a</a>'
            '<a href="http://ex.mp">b</a> '
            '<a href="http://ex.mp">c</a> '
            '<a href="http://ex.mp">d</a>')
    eq_('abcd', bleach.delinkify(html))
bleach.clean(u'<hr>', strip=True, tags=[])
…makes this happen:
ERROR:bleach:HTML: AssertionError('Node has no children',) ::: u'<hr>'
(assuming you have your log handlers set up)
It does return the right thing in this case—''—but I feel like it shouldn't be having this error.
For instance, the <br /> tag is cleaned as <br>, which is not valid XHTML. It would probably be wise to specify a mode in settings.py, in a variable named something like BLEACH_HTML, to control whether cleaning targets XHTML or HTML.
import bleach
bleach.linkify('http://aslobodyan.net/sysiq2/)\n')
u'<a href="http://aslobodyan.net/sysiq2/)" rel="nofollow">http://aslobodyan.net/sysiq2/)</a>\n'
As you can see, the closing bracket is included in the link text.
In [18]: linkify('(http://ex.mp/)')
Out[18]: u'(<a href="http://ex.mp/" rel="nofollow">http://ex.mp/</a>)'
In [19]: linkify('(http://ex.mp/).')
Out[19]: u'(<a href="http://ex.mp/)" rel="nofollow">http://ex.mp/)</a>.'
In [20]: linkify('(http://ex.mp).')
Out[20]: u'(<a href="http://ex.mp" rel="nofollow">http://ex.mp</a>).'
Note that you need the URL to end with / for this to fail.
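One possible fix (a hypothetical sketch, not Bleach's code) is to strip trailing punctuation from a candidate URL only while its parentheses stay unbalanced:

```python
def strip_trailing_punct(url):
    # Strip trailing '.', ',' and ')' characters, but keep a trailing ')'
    # whenever the URL contains at least as many '(' as ')' -- i.e. the
    # parentheses are balanced, as in Wikipedia-style URLs.
    while url and url[-1] in '.,)':
        if url[-1] == ')' and url.count('(') >= url.count(')'):
            break
        url = url[:-1]
    return url

strip_trailing_punct('http://ex.mp/)')
# 'http://ex.mp/'
strip_trailing_punct('http://en.wikipedia.org/wiki/Foo_(bar)')
# 'http://en.wikipedia.org/wiki/Foo_(bar)'
```

This handles both the `(http://ex.mp/).` case above and Wikipedia-style URLs with balanced parentheses.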
It would allow people to encode emails (to make them less visible to spam crawlers) or route links through tracking scripts.
It seems that linkify will turn anything that looks like a domain into a URL. There should be a way to turn this off, maybe eager_domain_linking=False.
For example, the filenames in amo-validator output are getting linked. Like /path/to/somescript.sh gets linked as http://somescript.sh . Screenshot: https://bug670047.bugzilla.mozilla.org/attachment.cgi?id=544675 More details: https://bugzilla.mozilla.org/show_bug.cgi?id=670047
(I have the AMO issue assigned to me but it keeps getting bumped due to priority. I should get to it soon!)
Currently you can construct a string that will cause linkify to raise a RuntimeError. That should be fixed ASAP.
If a URL starts with /, bleach shouldn't set rel="nofollow".
To be honest, I've never been a huge fan of delinkify. It comes from a position of blacklisting instead of whitelisting, and I think there are better ways to achieve similar functionality.
With clean, the attributes list or dict can accept callables, which get the attribute and its value and either return True (leave it in) or False (remove it). I think this concept could be extended to the tags list, somehow. This will be trickier and requires some serious API consideration.
Thinking about how to solve it in a way that also solves Jeff's old complaint about repeating the tag list in the attributes dict.
An interesting idea from @aehlke in pull req #43 is to pass links through a callback or callbacks that accept and return (possibly among other things) a dict of attributes.
I think this has a lot of potential to simplify the linkify API and address the issues that delinkify aimed to, but in a better way.
Total straw man API might be:
def linkify(text, callback):
    # ... snip ...

def some_callback(text, attributes=None, new=False):
    # This signature would be specified in docs.
    # Do processing, including things like url and text munging that are
    # currently callbacks.
    # Create a link:
    return {'href': href, ...}  # Return a dict, everything's OK.
    # Or don't create a link/remove the link:
    return None  # Existing links removed, new links not created.
The obvious question that arises is what to do with multiple callbacks. It'd be great to make these callback functions small, atomic, etc, so something along the lines of Django context processors seems obvious:
def linkify(text, callbacks=None):
    if callbacks is None:
        callbacks = []
    elif callable(callbacks):
        callbacks = [callbacks]
    # ... snip ...
    # Found a link!
    attrs = {'some': 'dict', ...}
    for cb in callbacks:
        a = cb(link_text, attrs, new)
        if a is None:
            # Break out of this somehow; just replace everything
            # with the text.
            break
        attrs.update(a)
    # Render the link with the attributes in attrs
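The core of the pipeline can be made concrete with a small runnable sketch. The names here (apply_callbacks, nofollow, only_relative) are illustrative, not part of Bleach's API:

```python
def apply_callbacks(attrs, callbacks, new=False):
    # Run each callback over the attribute dict. A None return vetoes the
    # link (it gets replaced with its text); otherwise the dict flows on
    # to the next callback.
    for cb in callbacks:
        attrs = cb(attrs, new)
        if attrs is None:
            return None
    return attrs

def nofollow(attrs, new=False):
    attrs['rel'] = 'nofollow'
    return attrs

def only_relative(attrs, new=False):
    # Drop any link whose href isn't site-relative.
    return attrs if attrs.get('href', '').startswith('/') else None

apply_callbacks({'href': '/about'}, [nofollow, only_relative])
# {'href': '/about', 'rel': 'nofollow'}
apply_callbacks({'href': 'http://ex.mp'}, [nofollow, only_relative])
# None
```

Note how only_relative gives the delinkify behavior for free: it is just one more small callback in the list.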
Pros of this approach:

- It unifies what linkify currently does. filter_url and filter_text are already callbacks, so they could easily be updated to return dicts. Adding or setting attributes like rel and target is trivial, etc.
- It addresses what delinkify was meant to do--just write a callback that only allows relative links or links from certain domains.
- It handles the skip_existing_links idea from pull req #43.

Cons, and some suggestions:

- What should the default callbacks be? (Probably documented, e.g. linkify(text, DEFAULT_CALLBACKS + [my_callback]) or something.)
- How does a callback change the link text? (Maybe a special key like _text? Something that can never be an HTML attribute.)
- How do you remove an attribute? (Set 'some_attr': None to remove it? Or make sure you return the whole attr dict, unlike Django context processors.)

Cleaning this string with styles and the CSS border property allowed results in the entire border definition being stripped, because of the -moz-use-text-color definition.
<table><tbody><tr><td style="border: medium 1pt 1pt none solid solid -moz-use-text-color windowtext windowtext;"></td></tr></tbody></table>
Wikipedia makes lots of links with balanced parenthesis. You should test the linker for those.
There are some related tests here to borrow from:
http://codespeak.net/svn/lxml/trunk/src/lxml/html/tests/test_autolink.txt
If HTML serialization doesn't work, we currently log with an error
level. That's extreme. The first attempt should be a warning, then an error.
linkify("comment a comment")
Result:
comment\n<a rel="nofollow" href="http://a comment">a comment</a>
linkify("have you try internet bar?")
Result:
have you try\n<a rel="nofollow" href="http://internet">internet</a>\nbar?
When the attributes argument passed to the clean function is a list, it matches against every element; if it is a dictionary, it only matches elements that are dictionary keys. There should be a wildcard key, perhaps *, that would match all elements. Right now, using a dictionary while allowing the attributes id or class is a big pain without a wildcard.
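The proposed lookup is simple to sketch. This is a hypothetical illustration of the wildcard-key semantics, not current Bleach behavior:

```python
def attr_allowed(tag, attr, allowed):
    # A list applies to every tag; a dict applies per tag, with '*' as a
    # wildcard key whose attributes are allowed on any tag.
    if isinstance(allowed, list):
        return attr in allowed
    return attr in allowed.get(tag, []) or attr in allowed.get('*', [])

allowed = {'*': ['id', 'class'], 'a': ['href']}
attr_allowed('p', 'class', allowed)  # True
attr_allowed('a', 'href', allowed)   # True
attr_allowed('p', 'href', allowed)   # False
```

With this, id and class could be allowed everywhere via the '*' key without repeating them for every tag.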
>>> bleach.linkify('<iframe allowfullscreen="true" frameborder="0" height="315" src="https://www.youtube.com/embed/nZbEjjz4GL0" width="560"></iframe>')
u'<iframe allowfullscreen="true" frameborder="0" height="315" src="<a href="https://www.youtube.com/embed/nZbEjjz4GL0" rel="nofollow">https://www.youtube.com/embed/nZbEjjz4GL0</a>" width="560"></iframe>'
This shouldn't happen...
Apparently it's caused by the HTMLSanitizer from html5lib.
Bleach ignores only second-level-domain emails like [email protected], but if the domain has a third level, it gets linkified.
In [5]: bleach.linkify('[email protected]')
Out[5]: u'www@example.<a href="http://com.ua" rel="nofollow">com.ua</a>'
Expected: ignoring 3rd level domains.
Looks like bleach doesn't recognize less popular TLDs such as 'ly':
http://bit.ly/ - URL shortening service
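Both this issue and the over-eager linkification ones above come down to checking the last label of a candidate domain against a list of known TLDs. A minimal sketch (the TLD set here is a toy example; a real implementation might pull from the IANA TLD list):

```python
# Toy allowlist for illustration only -- nowhere near complete.
KNOWN_TLDS = {'com', 'net', 'org', 'io', 'ly', 'at', 'ua'}

def looks_like_domain(text):
    # A bare string is treated as a domain only when it has at least two
    # non-empty dot-separated labels and its last label is a known TLD.
    labels = text.lower().rstrip('.').split('.')
    return len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_TLDS

looks_like_domain('bit.ly')    # True
looks_like_domain('internet')  # False
```

This would also stop filenames like somescript.sh from being linkified, unless their extension happens to be a real TLD.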
Bleach will delete all comments, and there is no option to change that. In my use case, I need to leave all comments intact.
It would be nice to allow subdomain matching in allow_domains for delinkify, but it needs to be opt-in, like allow_domains=['*.ex.mp'].
I think it's kinder to users if *.ex.mp matches ex.mp, too.
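The proposed matching rule could look like this (a sketch of the suggested semantics, not existing Bleach code):

```python
def domain_allowed(host, allowed):
    # '*.ex.mp' matches any subdomain of ex.mp and, to be kind to users,
    # the bare ex.mp as well. Plain entries must match exactly.
    for pattern in allowed:
        if pattern.startswith('*.'):
            base = pattern[2:]
            if host == base or host.endswith('.' + base):
                return True
        elif host == pattern:
            return True
    return False

domain_allowed('www.ex.mp', ['*.ex.mp'])  # True
domain_allowed('ex.mp', ['*.ex.mp'])      # True
domain_allowed('badex.mp', ['*.ex.mp'])   # False
```

Checking for the leading dot (rather than a bare endswith) is what keeps badex.mp from matching *.ex.mp.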
One thing that occurred to me while working on BleachSanitizerMixin.sanitize_token is that its code is getting pretty long and kind of ugly, the more functionality one needs to add.
One possible way to counter this problem might be to implement a "pipeline". I first saw this concept in django-social-auth, which I'm also using in a project. To be more specific, this part of the README explains it pretty well. Personally, I really love the concept. The code of the default social auth pipeline is here.
So what I am proposing is something very much in the vein of social auth's pipeline:
Instead of having to tie into the (big) if condition, we iterate over an iterable of functions::
PIPELINE = ('SkipAllowedElements', 'StripScripts', ...)
While iterating, each function gets called and returns either a cleaned token, which gets passed on to the next function in the iterable, or None to skip. Note that it might be more expressive to use exceptions (like SkipToken) here.
This may also clean up a bit of the code I introduced in the strip_script_content fork (like doing self.previous_token = token right before every return).
So, basically this is me requesting comments on refactoring things a bit :D – Maybe the same kind of pipeline logic could be used for #56.
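The pipeline idea above can be sketched in a few lines. The step functions here (lowercase_name, strip_scripts) are invented for illustration:

```python
class SkipToken(Exception):
    """Raised by a pipeline step to drop the current token entirely."""

def run_pipeline(token, pipeline):
    # Each step receives the token and returns a (possibly modified) token;
    # raising SkipToken short-circuits the pipeline and drops the token.
    for step in pipeline:
        try:
            token = step(token)
        except SkipToken:
            return None
    return token

def lowercase_name(token):
    token['name'] = token['name'].lower()
    return token

def strip_scripts(token):
    if token['name'] == 'script':
        raise SkipToken()
    return token

run_pipeline({'name': 'P'}, [lowercase_name, strip_scripts])
# {'name': 'p'}
run_pipeline({'name': 'script'}, [lowercase_name, strip_scripts])
# None
```

Each step stays small and testable, and adding new sanitization behavior means appending a function rather than growing the big if condition.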
Right now, all the interesting code in Bleach is in __init__.py
, which is convenient but not a great practice. I'd like to shuffle things around a little and clean up that file.
This test fails in an interesting way:
def test_sarcasm():
    """Jokes should crash.<sarcasm/>"""
    dirty = u'Yeah right <sarcasm/>'
    clean = u'Yeah right <sarcasm/>'
    eq_(clean, linkify(dirty))
The failure is:
AssertionError: u'Yeah right <sarcasm/>' != u'Yeah right <sarcasm></sarcasm>'
In clean(), there's no way to remove the contents of a <script> tag. In discussion in #57, we decided the best approach was an additional kwarg and an optional treewalk to remove a <script> and any of its children, including text nodes (e.g. the script content).
In [17]: bleach.linkify('this breaks: http://www.adsaüüüääds.at')
Out[17]: u'this breaks: http://www.adsa\xc3\xbc\xc3\xbc\xc3\xbc\xc3\xa4\xc3\xa4<a href="http://ds.at" rel="nofollow">ds.at</a>'
The particular instance I noticed: the nofollow and target parameters don't exist anymore for linkify, and there's a callback parameter.
>>> linkify('[email protected]')
u'doesntmatter@foo<a href="http://-bar.de" rel="nofollow">-bar.de</a>'
lxml.html has cleaning tests you might want to use:
http://codespeak.net/svn/lxml/trunk/src/lxml/html/tests/test_feedparser_data.py
It includes tests from feedparser (which has its own restrictive cleaner), and also a bunch taken from http://ha.ckers.org/xss.html
The AMO validator has sentences with links in them and one looked like this:
For more information, see http://blog.mozilla.com/addons/2009/01/16/firefox-extensions-global-namespace-pollution/, or use JavaScript modules.
bleach.linkify turned that into
http://outgoing.mozilla.org/v1/f96d4df7dfad6d4ec94310deb58d5da8217f8860/http%3A//blog.mozilla.com/addons/2009/01/16/firefox-extensions-global-namespace-pollution/%2C
which is incorrect because it captured the comma at the end. The exact code was: bleach.linkify(text, nofollow=True, filter_url=amo.urlresolvers.get_outgoing_url)
Using linkify with parse_email=True doesn't linkify URLs:
>>> import bleach
>>> bleach.linkify('http://example.com [email protected]')
u'<a href="http://example.com" rel="nofollow">http://example.com</a> [email protected]'
>>> bleach.linkify('http://example.com [email protected]', parse_email=True)
u'http://example.com <a href="mailto:[email protected]">[email protected]</a>'
Adding all the tests from http://ha.ckers.org/xss.html would be a good idea
Placing tags in the wrong order leads to extra tags in some circumstances:
>>> bl.clean('<i><blockquote>text</i></blockquote>')
u'<i/><blockquote><i>text</i></blockquote>'
but
>>> bl.clean('<i><b>text</i></b>')
u'<i><b>text</b></i>'
I don't know if this is a bug, or if it is just bad practice to place the tags in the wrong order.
Another top-level method: delinkify!
delinkify() will strip <a> tags and accepts a whitelist of domains that are allowed to stay. It replaces links with their link text. It also has an option to leave relative links alone.
It seems that html5lib is taking named HTML entities like &nbsp; and &reg; and outputting them as characters. I haven't dug into that library enough to know if there's a way to instruct it to pass over these entities, but it may be useful for those who value keeping them, especially in the case of &nbsp;, where some browsers/renderers handle it differently than a normal space character.
I can fork and work on this if it's a feature deemed worthy of adding and html5lib has an available solution.
Low priority, but it might be useful for some people to skip linkification inside <pre> elements.
If I do
value = bleach.clean('<p><iframe frameborder="0" height="315" src="http://www.youtube.com/embed/6ydeY0tTtF4" width="560"></iframe></p>', strip=True, tags=[], attributes=[])
is it OK that it prints an AssertionError?
2012-04-20 15:23:27,885 ERROR /usr/lib/python2.7/site-packages/bleach/__init__.py:328 HTML: AssertionError('Node has no children',)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/bleach/__init__.py", line 326, in _render
return force_unicode(_serialize(tree))
File "/usr/lib/python2.7/site-packages/bleach/__init__.py", line 341, in _serialize
return serializer.render(stream)
File "/usr/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 302, in render
return u"".join(list(self.serialize(treewalker)))
File "/usr/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 192, in serialize
for token in treewalker:
File "/usr/lib/python2.7/site-packages/html5lib/treewalkers/_base.py", line 154, in __iter__
firstChild = self.getFirstChild(currentNode)
File "/usr/lib/python2.7/site-packages/html5lib/treewalkers/simpletree.py", line 58, in getFirstChild
assert node.hasContent(), "Node has no children"
AssertionError: Node has no children
value is u'', but the assertion error gets printed in the Python shell.
Maybe don't print it?
Thanks.
Right now, if you chain methods (e.g. linkify(clean(text))), the whole tree-building, which is expensive, has to happen twice.
I can see two ways of solving this. One is a total API rewrite, one is less complicated. I'm leaning toward the latter.
Total rewrite:
Recreate a Bleach class, but this time it is the text you want to clean, e.g.:
from bleach import Bleach
text = Bleach(text)
print text.clean()
It could store state of what's already been done (like cleaning, linkifying, etc) and either return text (as above) or return itself to allow chaining, e.g.:
# Implement __unicode__ and __str__, seems nicer than above.
print Bleach(text).clean().linkify()
Less-crazy idea:
Instead of actual unicode objects, the top-level methods return a lazy object that looks like a string--maybe even descends from unicode--but is smarter, stores the tree, and only renders itself if anything needs to treat it as a string.
Then the methods would be reworked to check for the tree before rebuilding it, and if found, they would just use the existing tree. clean() might need special handling--i.e. always go back to text first--because of the nature of what it does. But I generally say you should always call clean() first if you're chaining, anyway.
Pros and cons
The rewrite might make it easier to deal with the increasing complexity of linkify(). You could (again) subclass Bleach and override a number of things there. I don't love APIs where you have to subclass things. I'm also not sure how complex this would make the upgrade for consumers--it would be a major version change.
The less-crazy idea doesn't involve subclassing anything, ever. The API can stay the same and it can be a minor version change. I am guessing it won't be too difficult to write some sort of LazyTreeStringObjectThing class. It would be a good opportunity to do some refactoring and get some stuff out of bleach/__init__.py. (I wouldn't change import paths, just where the code lives.)
from bleach import Bleach
b = Bleach()
b.linkify(' brie')
Actual:
' brie'
Expected:
' brie'
This is an odd behavior I boiled down from a bigger piece of content. Seems like the data has to be longer than 3 characters and contain 'ie'. Bleach loves to linkify Hippies too.
When calling .clean with certain inputs of ambiguous encoding, the returned unicode value is invalid. You might want to try decoding from utf-8 if the input string is not unicode, or use one of the open source libraries that detect the encoding. Either chardet from Universal Feed Parser or UnicodeDammit from BeautifulSoup should work.
Consider:
café
utf-8: 'caf\xc3\xa9'
unicode: u'caf\xe9'
>>> bleach.clean('caf\xc3\xa9')
u'caf\xc3\xa9'
>>> print bleach.clean('caf\xc3\xa9')
café
>>> bleach.clean('caf\xc3\xa9'.decode('utf-8'))
u'caf\xe9'
>>> print bleach.clean('caf\xc3\xa9'.decode('utf-8'))
café
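A sketch of the suggested normalization (a hypothetical helper written in modern Python; a real fix would use chardet or UnicodeDammit as suggested, since a blind fallback can only guess):

```python
def to_unicode(text, encoding='utf-8'):
    # Pre-clean step: decode byte strings (utf-8 first, then a latin-1
    # fallback, which never fails); unicode input passes through untouched.
    if isinstance(text, bytes):
        try:
            return text.decode(encoding)
        except UnicodeDecodeError:
            return text.decode('latin-1')
    return text

to_unicode(b'caf\xc3\xa9')  # u'caf\xe9'
```

With input normalized up front, clean() would always operate on real unicode, and the double-encoded u'caf\xc3\xa9' result above could not occur.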
Now that I've got Bleach on RTD I need to actually document all this. It's long outgrown the simple README.
>>> from bleach import Bleach
>>> b = Bleach()
>>> b.clean("<p>Foo<div>bar</div></p>", tags=['div', 'p'])
Actual: <p>Foo</p><divbar</div><p/>
Expected: <p>Foo<div>bar</div></p>
If the formatting above is messed up:
http://pastebin.mozilla.org/731947
Bleach adds tbody tags within tables. I don't know if it's a big problem, but according to w3schools, tbody is optional within a table; if it's present, you also need thead and tfoot. So as far as I can tell, this actually turns valid markup invalid. I'm using version 0.5.0.
Example from the python interpreter:
>>> b.clean('<table><tr><td>hi</td></tr></table>', ['table','td','tr'])
u'<table><tbody><tr><td>hi</td></tr></tbody></table>'
Calling linkify() on a URL like http://example.com/ ends up stripping the trailing slash from the link, resulting in output like:
<a href="http://example.com">http://example.com</a>/
I've added a test case that demonstrates the issue in a fork, danfairs/bleach, but I'm afraid my regex foo isn't up to actually fixing the problem!
(Note that this also happens on longer URLs, such as http://example.com/page/.)
Fred brings up two salient points about nofollow_relative:

- Protocol-relative URLs (e.g. //google.com) incorrectly match.
- Relative paths (e.g. foo.html, ../foo.html) incorrectly fail to match.

I've been thinking about the use case for this. The motivating use case was SUMO: generated internal links should not have rel="nofollow" added (i.e. links created with [[wiki markup]]), but URLs or <a> tags manually inserted should.
I think this case can be addressed by running linkify() before parsing wiki content. The downside is that we then lose most of the benefit from skip_pre. (Only manually created <pre> blocks would be skipped, not wiki-marked-up blocks.)
Either nofollow_relative needs a less naïve algorithm for deciding a link is relative, or it needs to go, as it gives a false impression of security.
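A less naïve relativity check could lean on the standard-library URL parser instead of string prefixes (sketched here with Python 3's urllib.parse; Python 2's urlparse module behaves the same way):

```python
from urllib.parse import urlparse

def is_relative(href):
    # A link is relative only if it has neither a scheme nor a netloc.
    # This rejects protocol-relative URLs like //google.com and accepts
    # path-relative ones like foo.html and ../foo.html.
    parts = urlparse(href)
    return not parts.scheme and not parts.netloc

is_relative('//google.com')  # False
is_relative('../foo.html')   # True
```

This addresses both of Fred's points: protocol-relative URLs carry a netloc, so they no longer match, while bare path-relative links now do.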
http://example.com:9000/foo should get linkified as a whole, but it stops at .com.
KeyError Traceback (most recent call last)
/home/aking/patchouli/ in ()
/home/aking/.virtualenvs/patchouli/lib/python2.5/site-packages/bleach/__init__.pyc in linkify(self, text, nofollow)
392 self.filter_text(url))
393
--> 394 linkify_nodes(forest)
395
396 return force_unicode(forest.toxml())
/home/aking/.virtualenvs/patchouli/lib/python2.5/site-packages/bleach/__init__.pyc in linkify_nodes(tree)
375 if nofollow:
376 node.attributes['rel'] = 'nofollow'
--> 377 href = self.filter_url(node.attributes['href'])
378 node.attributes['href'] = href
379 else:
KeyError: 'href'