mozilla / bleach
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
Home Page: https://bleach.readthedocs.io/en/latest/
License: Other
The string "libgl.so.1" is linkified but shouldn't be. Failing test in e7c421e83a8e28aa726113839aedaf31e297d805
Not all inline styles get cleaned appropriately, especially when multiple properties are listed. The first example below produces what I would expect to be accurate output.
>>> from bleach import clean
>>> html = '<p style="font-family: Arial; color: red; float: left;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="color: red;">bar</p>'
However, the following examples produce unexpected results. Specifically, properties other than those listed in the whitelist are included in the output.
>>> html = '<p style="font-family: Arial; color: red; float: left; background-color: red;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="color: red; background-color: red;">bar</p>'
>>> html = '<p style="border: 1px solid blue; color: red; float: left;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="border: 1px solid blue; color: red;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color', 'float'])
'<p style="border: 1px solid blue; color: red; float: left;">bar</p>'
>>> html = '<p style="color: red; float: left; padding: 1em;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color', 'float'])
'<p style="color: red; float: left; padding: 1em;">bar</p>'
>>> clean(html, tags=['p'], attributes=['style'], styles=['color'])
'<p style="color: red; padding: 1em;">bar</p>'
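For reference, the expected behavior in these examples amounts to a simple declaration filter. Here is a minimal sketch of those semantics (an illustration only, not Bleach's actual implementation):

```python
def filter_style(style, allowed):
    """Keep only declarations whose property name is in the allowed list."""
    kept = []
    for decl in style.split(';'):
        if ':' not in decl:
            continue  # Skip empty or malformed declarations.
        prop, _, value = decl.partition(':')
        if prop.strip().lower() in allowed:
            kept.append('%s: %s;' % (prop.strip(), value.strip()))
    return ' '.join(kept)

filter_style('font-family: Arial; color: red; float: left;', ['color'])
# 'color: red;'
```

With this filter, the `background-color`, `border`, and `padding` declarations in the failing examples above would all be dropped, as expected.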
So this is an edge case, but still, here's the test:
def test_delinkify_no_whitespace():
    html = ('<a href="http://ex.mp">a</a>'
            '<a href="http://ex.mp">b</a>'
            '<a href="http://ex.mp">c</a>'
            '<a href="http://ex.mp">d</a>')
    eq_('abcd', bleach.delinkify(html))
This passes:
def test_delinkify_some_whitespace():
    html = ('<a href="http://ex.mp">a</a>'
            '<a href="http://ex.mp">b</a> '
            '<a href="http://ex.mp">c</a> '
            '<a href="http://ex.mp">d</a>')
    eq_('abcd', bleach.delinkify(html))
bleach.clean(u'<hr>', strip=True, tags=[])
…makes this happen:
ERROR:bleach:HTML: AssertionError('Node has no children',) ::: u'<hr>'
(assuming you have your log handlers set up)
It does return the right thing in this case—''—but I feel like it shouldn't be having this error.
For instance, the <br /> tag is cleaned as <br>, which is not valid XHTML. It would probably be wise to specify a mode in settings.py, in a variable named something like BLEACH_HTML, to control whether cleaning targets XHTML or HTML.
import bleach
bleach.linkify('http://aslobodyan.net/sysiq2/)\n')
u'<a href="http://aslobodyan.net/sysiq2/)" rel="nofollow">http://aslobodyan.net/sysiq2/)</a>\n'
As you can see, the closing bracket is included in the link text.
In [18]: linkify('(http://ex.mp/)')
Out[18]: u'(<a href="http://ex.mp/" rel="nofollow">http://ex.mp/</a>)'
In [19]: linkify('(http://ex.mp/).')
Out[19]: u'(<a href="http://ex.mp/)" rel="nofollow">http://ex.mp/)</a>.'
In [20]: linkify('(http://ex.mp).')
Out[20]: u'(<a href="http://ex.mp" rel="nofollow">http://ex.mp</a>).'
Note that you need the URL to end with / for this to fail.
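One possible fix (a hypothetical sketch, not Bleach's code) is to strip trailing punctuation from a candidate URL only while its parentheses stay unbalanced:

```python
def strip_trailing_punct(url):
    # Strip trailing '.', ',' and ')' characters, but keep a trailing ')'
    # whenever the URL contains at least as many '(' as ')' -- i.e. the
    # parentheses are balanced, as in Wikipedia-style URLs.
    while url and url[-1] in '.,)':
        if url[-1] == ')' and url.count('(') >= url.count(')'):
            break
        url = url[:-1]
    return url

strip_trailing_punct('http://ex.mp/)')
# 'http://ex.mp/'
strip_trailing_punct('http://en.wikipedia.org/wiki/Foo_(bar)')
# 'http://en.wikipedia.org/wiki/Foo_(bar)'
```

This handles both the `(http://ex.mp/).` case above and Wikipedia-style URLs with balanced parentheses.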
It would allow people to encode emails (to make them less visible to spam crawlers) or route links through tracking scripts.
It seems that linkify will turn anything that looks like a domain into a URL. There should be a way to turn this off, maybe eager_domain_linking=False.
For example, the filenames in amo-validator output are getting linked. Like /path/to/somescript.sh gets linked as http://somescript.sh . Screenshot: https://bug670047.bugzilla.mozilla.org/attachment.cgi?id=544675 More details: https://bugzilla.mozilla.org/show_bug.cgi?id=670047
(I have the AMO issue assigned to me but it keeps getting bumped due to priority. I should get to it soon!)
Currently you can construct a string that will cause linkify to raise a RuntimeError. That should be fixed ASAP.
If a URL starts with /, bleach shouldn't set rel="nofollow".
To be honest, I've never been a huge fan of delinkify. It comes from a position of blacklisting instead of whitelisting, and I think there are better ways to achieve similar functionality.
With clean, the attributes list or dict can accept callables, which get the attribute and its value and either return True (leave it in) or False (remove it). I think this concept could be extended to the tags list, somehow. This will be trickier and requires some serious API consideration.
Thinking about how to solve it in a way that also solves Jeff's old complaint about repeating the tag list in the attributes dict.
An interesting idea from @aehlke in pull req #43 is to pass links through a callback or callbacks that accept and return (possibly among other things) a dict of attributes.
I think this has a lot of potential to simplify the linkify API and address the issues that delinkify aimed to, but in a better way.
Total straw man API might be:
def linkify(text, callback):
    # ... snip ...

def some_callback(text, attributes=None, new=False):
    # This signature would be specified in docs.
    # Do processing, including things like url and text munging that are
    # currently callbacks.
    # Create a link:
    return {'href': href, ...}  # Return a dict, everything's OK.
    # Or don't create a link/remove the link:
    return None  # Existing links removed, new links not created.
The obvious question that arises is what to do with multiple callbacks. It'd be great to make these callback functions small, atomic, etc, so something along the lines of Django context processors seems obvious:
def linkify(text, callbacks=None):
    if callbacks is None:
        callbacks = []
    elif callable(callbacks):
        callbacks = [callbacks]
    # ... snip ...
    # Found a link!
    attrs = {'some': 'dict', ...}
    for cb in callbacks:
        a = cb(link_text, attrs, new)
        if a is None:
            # Break out of this somehow; just replace everything
            # with the text.
            break
        attrs.update(a)
    # Render the link with the attributes in attrs
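The core of the pipeline can be made concrete with a small runnable sketch. The names here (apply_callbacks, nofollow, only_relative) are illustrative, not part of Bleach's API:

```python
def apply_callbacks(attrs, callbacks, new=False):
    # Run each callback over the attribute dict. A None return vetoes the
    # link (it gets replaced with its text); otherwise the dict flows on
    # to the next callback.
    for cb in callbacks:
        attrs = cb(attrs, new)
        if attrs is None:
            return None
    return attrs

def nofollow(attrs, new=False):
    attrs['rel'] = 'nofollow'
    return attrs

def only_relative(attrs, new=False):
    # Drop any link whose href isn't site-relative.
    return attrs if attrs.get('href', '').startswith('/') else None

apply_callbacks({'href': '/about'}, [nofollow, only_relative])
# {'href': '/about', 'rel': 'nofollow'}
apply_callbacks({'href': 'http://ex.mp'}, [nofollow, only_relative])
# None
```

Note how only_relative gives the delinkify behavior for free: it is just one more small callback in the list.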
Pros of this approach:

- It unifies what linkify currently does. filter_url and filter_text are already callbacks, so they could easily be updated to return dicts. Adding or setting attributes like rel and target is trivial, etc.
- It addresses what delinkify was meant to do--just write a callback that only allows relative links or links from certain domains.
- It handles the skip_existing_links idea from pull req #43.

Cons, and some suggestions:

- What should the default callbacks be? (Probably documented, e.g. linkify(text, DEFAULT_CALLBACKS + [my_callback]) or something.)
- How does a callback change the link text? (Maybe a special key like _text? Something that can never be an HTML attribute.)
- How do you remove an attribute? (Set 'some_attr': None to remove it? Or make sure you return the whole attr dict, unlike Django context processors.)

Cleaning this string with styles and the CSS border property allowed results in the entire border definition being stripped, because of the -moz-use-text-color definition.
<table><tbody><tr><td style="border: medium 1pt 1pt none solid solid -moz-use-text-color windowtext windowtext;"></td></tr></tbody></table>
Wikipedia makes lots of links with balanced parenthesis. You should test the linker for those.
There are some related tests here to borrow from:
http://codespeak.net/svn/lxml/trunk/src/lxml/html/tests/test_autolink.txt
If HTML serialization doesn't work, we currently log with an error
level. That's extreme. The first attempt should be a warning, then an error.
linkify("comment a comment")
Result:
comment\n<a rel="nofollow" href="http://a comment">a comment</a>
linkify("have you try internet bar?")
Result:
have you try\n<a rel="nofollow" href="http://internet">internet</a>\nbar?
When the attributes argument passed to the clean function is a list, it matches against every element; if it is a dictionary, it only matches elements that are dictionary keys. There should be a wildcard key, perhaps *, that would match all elements. Right now, using a dictionary while allowing the attributes id or class is a big pain without a wildcard.
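The proposed lookup is simple to sketch. This is a hypothetical illustration of the wildcard-key semantics, not current Bleach behavior:

```python
def attr_allowed(tag, attr, allowed):
    # A list applies to every tag; a dict applies per tag, with '*' as a
    # wildcard key whose attributes are allowed on any tag.
    if isinstance(allowed, list):
        return attr in allowed
    return attr in allowed.get(tag, []) or attr in allowed.get('*', [])

allowed = {'*': ['id', 'class'], 'a': ['href']}
attr_allowed('p', 'class', allowed)  # True
attr_allowed('a', 'href', allowed)   # True
attr_allowed('p', 'href', allowed)   # False
```

With this, id and class could be allowed everywhere via the '*' key without repeating them for every tag.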
>>> bleach.linkify('<iframe allowfullscreen="true" frameborder="0" height="315" src="https://www.youtube.com/embed/nZbEjjz4GL0" width="560"></iframe>')
u'<iframe allowfullscreen="true" frameborder="0" height="315" src="<a href="https://www.youtube.com/embed/nZbEjjz4GL0" rel="nofollow">https://www.youtube.com/embed/nZbEjjz4GL0</a>" width="560"></iframe>'
This shouldn't happen...
Apparently it's caused by the HTMLSanitizer from html5lib.
Bleach ignores only second-level-domain emails like [email protected], but if the domain has a third level, it gets linkified.
In [5]: bleach.linkify('[email protected]')
Out[5]: u'www@example.<a href="http://com.ua" rel="nofollow">com.ua</a>'
Expected: ignoring 3rd level domains.
Looks like bleach doesn't recognize less popular TLDs such as 'ly':
http://bit.ly/ - URL shortening service
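Both this issue and the over-eager linkification ones above come down to checking the last label of a candidate domain against a list of known TLDs. A minimal sketch (the TLD set here is a toy example; a real implementation might pull from the IANA TLD list):

```python
# Toy allowlist for illustration only -- nowhere near complete.
KNOWN_TLDS = {'com', 'net', 'org', 'io', 'ly', 'at', 'ua'}

def looks_like_domain(text):
    # A bare string is treated as a domain only when it has at least two
    # non-empty dot-separated labels and its last label is a known TLD.
    labels = text.lower().rstrip('.').split('.')
    return len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_TLDS

looks_like_domain('bit.ly')    # True
looks_like_domain('internet')  # False
```

This would also stop filenames like somescript.sh from being linkified, unless their extension happens to be a real TLD.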
Bleach will delete all comments, and there is no option to change that. In my use case, I need to leave all comments intact.
It would be nice to allow subdomain matching in allow_domains for delinkify, but it needs to be opt-in, like allow_domains=['*.ex.mp'].
I think it's kinder to users if *.ex.mp matches ex.mp, too.
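The proposed matching rule could look like this (a sketch of the suggested semantics, not existing Bleach code):

```python
def domain_allowed(host, allowed):
    # '*.ex.mp' matches any subdomain of ex.mp and, to be kind to users,
    # the bare ex.mp as well. Plain entries must match exactly.
    for pattern in allowed:
        if pattern.startswith('*.'):
            base = pattern[2:]
            if host == base or host.endswith('.' + base):
                return True
        elif host == pattern:
            return True
    return False

domain_allowed('www.ex.mp', ['*.ex.mp'])  # True
domain_allowed('ex.mp', ['*.ex.mp'])      # True
domain_allowed('badex.mp', ['*.ex.mp'])   # False
```

Checking for the leading dot (rather than a bare endswith) is what keeps badex.mp from matching *.ex.mp.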
One thing that occurred to me while working on BleachSanitizerMixin.sanitize_token is that its code is getting pretty long and kind of ugly, the more functionality one needs to add.
One possible way to counter this problem might be to implement a "pipeline". I first saw this concept in django-social-auth, which I'm also using in a project. To be more specific, this part of the README explains it pretty well. Personally, I really love the concept. The code of the default social auth pipeline is here.
So what I am proposing is something very much in the vein of social auth's pipeline:
Instead of having to tie into the (big) if condition, we iterate over an iterable of functions::
PIPELINE = ('SkipAllowedElements', 'StripScripts', ...)
While iterating, each function gets called and returns either a cleaned token, which gets passed on to the next function in the iterable, or None to skip. Note that it might be more expressive to use exceptions (like SkipToken) here.
This may also clean up a bit of the code I introduced in the strip_script_content fork (like doing self.previous_token = token right before every return).
So, basically this is me requesting comments on refactoring things a bit :D – Maybe the same kind of pipeline logic could be used for #56.
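The pipeline idea above can be sketched in a few lines. The step functions here (lowercase_name, strip_scripts) are invented for illustration:

```python
class SkipToken(Exception):
    """Raised by a pipeline step to drop the current token entirely."""

def run_pipeline(token, pipeline):
    # Each step receives the token and returns a (possibly modified) token;
    # raising SkipToken short-circuits the pipeline and drops the token.
    for step in pipeline:
        try:
            token = step(token)
        except SkipToken:
            return None
    return token

def lowercase_name(token):
    token['name'] = token['name'].lower()
    return token

def strip_scripts(token):
    if token['name'] == 'script':
        raise SkipToken()
    return token

run_pipeline({'name': 'P'}, [lowercase_name, strip_scripts])
# {'name': 'p'}
run_pipeline({'name': 'script'}, [lowercase_name, strip_scripts])
# None
```

Each step stays small and testable, and adding new sanitization behavior means appending a function rather than growing the big if condition.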
Right now, all the interesting code in Bleach is in __init__.py
, which is convenient but not a great practice. I'd like to shuffle things around a little and clean up that file.
This test fails in an interesting way:
def test_sarcasm():
    """Jokes should crash.<sarcasm/>"""
    dirty = u'Yeah right <sarcasm/>'
    clean = u'Yeah right <sarcasm/>'
    eq_(clean, linkify(dirty))
The failure is:
AssertionError: u'Yeah right <sarcasm/>' != u'Yeah right <sarcasm></sarcasm>'
In clean(), there's no way to remove the contents of a <script> tag. In discussion in #57, we decided the best approach was an additional kwarg and an optional treewalk to remove a <script> and any of its children, including text nodes (e.g. the script content).
In [17]: bleach.linkify('this breaks: http://www.adsaüüüääds.at')
Out[17]: u'this breaks: http://www.adsa\xc3\xbc\xc3\xbc\xc3\xbc\xc3\xa4\xc3\xa4<a href="http://ds.at" rel="nofollow">ds.at</a>'
The particular instance I noticed: the nofollow and target parameters don't exist anymore for linkify, and there's a callback parameter.
>>> linkify('[email protected]')
u'doesntmatter@foo<a href="http://-bar.de" rel="nofollow">-bar.de</a>'
lxml.html has cleaning tests you might want to use:
http://codespeak.net/svn/lxml/trunk/src/lxml/html/tests/test_feedparser_data.py
It includes tests from feedparser (which has its own restrictive cleaner), and also a bunch taken from http://ha.ckers.org/xss.html
The AMO validator has sentences with links in them and one looked like this:
For more information, see http://blog.mozilla.com/addons/2009/01/16/firefox-extensions-global-namespace-pollution/, or use JavaScript modules.
bleach.linkify turned that into
http://outgoing.mozilla.org/v1/f96d4df7dfad6d4ec94310deb58d5da8217f8860/http%3A//blog.mozilla.com/addons/2009/01/16/firefox-extensions-global-namespace-pollution/%2C
which is incorrect because it captured the comma at the end. The exact code was: bleach.linkify(text, nofollow=True, filter_url=amo.urlresolvers.get_outgoing_url)
Using linkify with parse_email=True doesn't linkify URLs:
>>> import bleach
>>> bleach.linkify('http://example.com [email protected]')
u'<a href="http://example.com" rel="nofollow">http://example.com</a> [email protected]'
>>> bleach.linkify('http://example.com [email protected]', parse_email=True)
u'http://example.com <a href="mailto:[email protected]">[email protected]</a>'
Adding all the tests from http://ha.ckers.org/xss.html would be a good idea
Placing tags in the wrong order leads to extra tags in some circumstances:
>>> bl.clean('<i><blockquote>text</i></blockquote>')
u'<i/><blockquote><i>text</i></blockquote>'
but
>>> bl.clean('<i><b>text</i></b>')
u'<i><b>text</b></i>'
I don't know if this is a bug, or if it is just bad practice to place the tags in the wrong order.
Another top-level method: delinkify!
delinkify() will strip <a> tags and accepts a whitelist of domains that are allowed to stay. It replaces links with their link text. It also has an option to leave relative links alone.
It seems that html5lib is taking named HTML entities like &nbsp; and &reg; and outputting them as characters. I haven't dug into that library enough to know if there's a way to instruct it to pass over these entities, but it may be useful for those who value keeping them, especially in the case of &nbsp;, where some browsers/renderers handle it differently than a normal space character.
I can fork and work on this if it's a feature deemed worthy of adding and html5lib has an available solution.
Low priority, but it might be useful for some people to skip linkification inside <pre> elements.
If I do
value = bleach.clean('<p><iframe frameborder="0" height="315" src="http://www.youtube.com/embed/6ydeY0tTtF4" width="560"></iframe></p>', strip=True, tags=[], attributes=[])
is it OK that it prints an AssertionError?
2012-04-20 15:23:27,885 ERROR /usr/lib/python2.7/site-packages/bleach/__init__.py:328 HTML: AssertionError('Node has no children',)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/bleach/__init__.py", line 326, in _render
return force_unicode(_serialize(tree))
File "/usr/lib/python2.7/site-packages/bleach/__init__.py", line 341, in _serialize
return serializer.render(stream)
File "/usr/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 302, in render
return u"".join(list(self.serialize(treewalker)))
File "/usr/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 192, in serialize
for token in treewalker:
File "/usr/lib/python2.7/site-packages/html5lib/treewalkers/_base.py", line 154, in __iter__
firstChild = self.getFirstChild(currentNode)
File "/usr/lib/python2.7/site-packages/html5lib/treewalkers/simpletree.py", line 58, in getFirstChild
assert node.hasContent(), "Node has no children"
AssertionError: Node has no children
value is u'', but the assertion error gets printed in the Python shell.
Maybe don't print it?
Thanks.
Right now, if you chain methods (e.g. linkify(clean(text))), the whole tree-building, which is expensive, has to happen twice.
I can see two ways of solving this. One is a total API rewrite, one is less complicated. I'm leaning toward the latter.
Total rewrite:
Recreate a Bleach class, but this time it is the text you want to clean, e.g.:
from bleach import Bleach
text = Bleach(text)
print text.clean()
It could store state of what's already been done (like cleaning, linkifying, etc) and either return text (as above) or return itself to allow chaining, e.g.:
# Implement __unicode__ and __str__, seems nicer than above.
print Bleach(text).clean().linkify()
Less-crazy idea:
Instead of actual unicode objects, the top-level methods return a lazy object that looks like a string--maybe even descends from unicode--but is smarter, stores the tree, and only renders itself if anything needs to treat it as a string.
Then the methods would be reworked to check for the tree before rebuilding it, and if found, they would just use the existing tree. clean() might need special handling--i.e. always go back to text first--because of the nature of what it does. But I generally say you should always call clean() first if you're chaining, anyway.
Pros and cons
The rewrite might make it easier to deal with the increasing complexity of linkify(). You could (again) subclass Bleach and override a number of things there. I don't love APIs where you have to subclass things. I'm also not sure how complex this would make the upgrade for consumers--it would be a major version change.
The less-crazy idea doesn't involve subclassing anything, ever. The API can stay the same and it can be a minor version change. I am guessing it won't be too difficult to write some sort of LazyTreeStringObjectThing class. It would be a good opportunity to do some refactoring and get some stuff out of bleach/__init__.py. (I wouldn't change import paths, just where the code lives.)
from bleach import Bleach
b = Bleach()
b.linkify(' brie')
Actual:
' brie'
Expected:
' brie'
This is an odd behavior I boiled down from a bigger piece of content. Seems like the data has to be longer than 3 characters and contain 'ie'. Bleach loves to linkify Hippies too.
When calling .clean with certain inputs of ambiguous encoding, the returned unicode value is invalid. You might want to try decoding from utf-8 if the input string is not unicode, or use one of the open source libraries that detect the encoding. Either chardet from Universal Feed Parser or UnicodeDammit from BeautifulSoup should work.
Consider:
café
utf-8: 'caf\xc3\xa9'
unicode: u'caf\xe9'
>>> bleach.clean('caf\xc3\xa9')
u'caf\xc3\xa9'
>>> print bleach.clean('caf\xc3\xa9')
café
>>> bleach.clean('caf\xc3\xa9'.decode('utf-8'))
u'caf\xe9'
>>> print bleach.clean('caf\xc3\xa9'.decode('utf-8'))
café
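A sketch of the suggested normalization (a hypothetical helper written in modern Python; a real fix would use chardet or UnicodeDammit as suggested, since a blind fallback can only guess):

```python
def to_unicode(text, encoding='utf-8'):
    # Pre-clean step: decode byte strings (utf-8 first, then a latin-1
    # fallback, which never fails); unicode input passes through untouched.
    if isinstance(text, bytes):
        try:
            return text.decode(encoding)
        except UnicodeDecodeError:
            return text.decode('latin-1')
    return text

to_unicode(b'caf\xc3\xa9')  # u'caf\xe9'
```

With input normalized up front, clean() would always operate on real unicode, and the double-encoded u'caf\xc3\xa9' result above could not occur.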
Now that I've got Bleach on RTD I need to actually document all this. It's long outgrown the simple README.
>>> from bleach import Bleach
>>> b = Bleach()
>>> b.clean("<p>Foo<div>bar</div></p>", tags=['div', 'p'])
Actual: <p>Foo</p><divbar</div><p/>
Expected: <p>Foo<div>bar</div></p>
If the formatting above is messed up:
http://pastebin.mozilla.org/731947
Bleach adds tbody tags within tables. I don't know if it's a big problem, but according to w3schools, tbody is optional within a table; if it's present, you also need thead and tfoot. So as far as I can tell, this actually turns valid markup invalid. I'm using version 0.5.0.
Example from the python interpreter:
>>> b.clean('<table><tr><td>hi</td></tr></table>', ['table','td','tr'])
u'<table><tbody><tr><td>hi</td></tr></tbody></table>'
Calling linkify() on a URL like http://example.com/ ends up stripping the trailing slash from the link, resulting in output like:
<a href="http://example.com">http://example.com</a>/
I've added a test case that demonstrates the issue in a fork, danfairs/bleach, but I'm afraid my regex foo isn't up to actually fixing the problem!
(Note that this also happens on longer URLs, such as http://example.com/page/.)
Fred brings up two salient points about nofollow_relative:

- Protocol-relative URLs (e.g. //google.com) incorrectly match.
- Relative paths (e.g. foo.html, ../foo.html) incorrectly fail to match.

I've been thinking about the use case for this. The motivating use case was SUMO: generated internal links should not have rel="nofollow" added (i.e. links created with [[wiki markup]]), but URLs or <a> tags manually inserted should.
I think this case can be addressed by running linkify() before parsing wiki content. The downside is that we then lose most of the benefit from skip_pre. (Only manually created <pre> blocks would be skipped, not wiki-marked-up blocks.)
Either nofollow_relative needs a less naïve algorithm for deciding a link is relative, or it needs to go, as it gives a false impression of security.
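A less naïve relativity check could lean on the standard-library URL parser instead of string prefixes (sketched here with Python 3's urllib.parse; Python 2's urlparse module behaves the same way):

```python
from urllib.parse import urlparse

def is_relative(href):
    # A link is relative only if it has neither a scheme nor a netloc.
    # This rejects protocol-relative URLs like //google.com and accepts
    # path-relative ones like foo.html and ../foo.html.
    parts = urlparse(href)
    return not parts.scheme and not parts.netloc

is_relative('//google.com')  # False
is_relative('../foo.html')   # True
```

This addresses both of Fred's points: protocol-relative URLs carry a netloc, so they no longer match, while bare path-relative links now do.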
http://example.com:9000/foo should get linkified as a whole, but it stops at .com.
KeyError Traceback (most recent call last)
/home/aking/patchouli/ in ()
/home/aking/.virtualenvs/patchouli/lib/python2.5/site-packages/bleach/__init__.pyc in linkify(self, text, nofollow)
392 self.filter_text(url))
393
--> 394 linkify_nodes(forest)
395
396 return force_unicode(forest.toxml())
/home/aking/.virtualenvs/patchouli/lib/python2.5/site-packages/bleach/__init__.pyc in linkify_nodes(tree)
375 if nofollow:
376 node.attributes['rel'] = 'nofollow'
--> 377 href = self.filter_url(node.attributes['href'])
378 node.attributes['href'] = href
379 else:
KeyError: 'href'