
adblockparser


adblockparser is a package for working with Adblock Plus filter rules. It can parse Adblock Plus filters and match URLs against them.

Installation

pip install adblockparser

Python 2.7 and Python 3.3+ are supported.

If you plan to use this library with a large number of filters, installing the pyre2 library is highly recommended: the speedup on a default list of EasyList filters can be greater than 1000x.

pip install 're2 >= 0.2.21'

Note that pyre2 requires the C++ re2 library to be installed. On OS X you can get it using Homebrew (brew install re2).
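Since pyre2 is optional, it can be useful to check at runtime which engine will actually be used. A minimal sketch, assuming the pip package installs a module importable as re2:

```python
# Minimal sketch: detect whether the optional re2 speedup is importable.
# Assumes the pyre2 package installs a module named "re2" (as the pip
# package does); falls back to the stdlib engine otherwise.
def has_re2() -> bool:
    try:
        import re2  # noqa: F401
        return True
    except ImportError:
        return False

print("re2 available" if has_re2() else "falling back to stdlib re")
```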

Usage

To learn about Adblock Plus filter syntax, consult the Adblock Plus filter documentation.

  1. Get filter rules somewhere: write them manually, read lines from a file downloaded from EasyList, etc.:

    >>> raw_rules = [
    ...     "||ads.example.com^",
    ...     "@@||ads.example.com/notbanner^$~script",
    ... ]
    
  2. Create an AdblockRules instance from the rule strings:

    >>> from adblockparser import AdblockRules
    >>> rules = AdblockRules(raw_rules)
    
  3. Use this instance to check whether a URL should be blocked:

    >>> rules.should_block("http://ads.example.com")
    True
    

    Rules with options are ignored unless you pass a dict of option values:

    >>> rules.should_block("http://ads.example.com/notbanner")
    True
    >>> rules.should_block("http://ads.example.com/notbanner", {'script': False})
    False
    >>> rules.should_block("http://ads.example.com/notbanner", {'script': True})
    True
    

Consult the Adblock Plus docs for a description of the options. Options let you write filters that depend on external information not available in the URL itself.

Performance

Regex engines

The AdblockRules class combines filters that don't use options into one huge regex. pyre2 handles such regexes much better than the stdlib re module, so with pyre2 installed AdblockRules should work faster, and the speedup can be dramatic - more than 1000x in some cases.

Sometimes pyre2 prints something like re2/dfa.cc:459: DFA out of memory: prog size 270515 mem 1713850 to stderr. Give the re2 library more memory to fix that:

>>> rules = AdblockRules(raw_rules, use_re2=True, max_mem=512*1024*1024)  # doctest: +SKIP

Make sure you are not using re2 0.2.20 installed from PyPI: it doesn't work.

Parsing rules with options

Rules that have options are currently matched in a loop, one by one. They are also checked for compatibility with the options passed by the user: for example, if the user didn't pass the 'script' option (with a True or False value), all rules involving 'script' are discarded.

This is slow if there are thousands of such rules. To make it faster, explicitly list all options you want to support in the AdblockRules constructor, disable skipping of unsupported rules, and always pass a dict with all options to the should_block method:

>>> rules = AdblockRules(
...    raw_rules,
...    supported_options=['script', 'domain'],
...    skip_unsupported_rules=False
... )
>>> options = {'script': False, 'domain': 'www.mystartpage.com'}
>>> rules.should_block("http://ads.example.com/notbanner", options)
False

This way, rules with unsupported options are filtered out once, when the AdblockRules instance is created.
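The "always pass all options" pattern above can be sketched by precomputing a defaults dict once and overlaying per-request values; the defaults chosen here are assumptions, and the option names follow the README example:

```python
# Sketch: build a complete options dict once, then overlay per-request
# values, so should_block always sees every supported option.
def make_options(**overrides):
    # defaults for every supported option ('script', 'domain' as in the
    # README example); the default values are illustrative assumptions
    opts = {'script': False, 'domain': ''}
    opts.update(overrides)
    return opts

print(make_options(domain='www.mystartpage.com'))
# {'script': False, 'domain': 'www.mystartpage.com'}
```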

Limitations

There are some known limitations of the current implementation:

  • element hiding rules are ignored;
  • matching URLs against a large number of filters can be slow-ish, especially if pyre2 is not installed and many filter options are enabled;
  • match-case filter option is not properly supported (it is ignored);
  • document filter option is not properly supported;
  • rules are not validated before parsing, so invalid rules may raise inconsistent exceptions or silently work incorrectly.

It is possible to remove all these limitations. Pull requests are welcome if you want to make it happen sooner!

Contributing

In order to run tests, install tox and type

tox

from the source checkout.

The license is MIT.

adblockparser's People

Contributors

kmike, limonte, mlyko, mozbugbox, roman-dowakin


adblockparser's Issues

use_re2 option causes segmentation fault

Hi all,

I've been attempting to use re2 with EasyPrivacy in adblockparser; however, setting the use_re2=True option causes a segfault under Python 3.10.

Currently, the code looks like this:

import os
from adblockparser import AdblockRules

ROOT = os.path.dirname(os.path.abspath(__file__))
ADBLOCK_RULES = os.path.join(ROOT, 'adblockrules')
EASYPRIVACY = os.path.join(ADBLOCK_RULES, 'easyprivacy.txt')

with open(EASYPRIVACY, 'rb') as f:
    raw_rules = f.read().decode('utf8').splitlines()
privacy_rules = AdblockRules(raw_rules, use_re2=True)

Versions:

Python 3.10.6
Pyre2 0.3.6 from https://github.com/andreasvc/pyre2
Fedora 36

UPDATE: I resolved the issue by switching my OS to Ubuntu 20.04 with Python 3.8.10. Everything installs as needed and adblockparser can now use re2 properly; I did have to install pyre2 from pip (pip3 install pyre2).

"ValueError: Invalid rule" with easylist filter

I intend to use the default EasyList filter with Splash, but when I try to do so, I get the following error:

2016-10-17 11:33:54+0000 [-] Log opened.
2016-10-17 11:33:54.636355 [-] Splash version: 2.2
2016-10-17 11:33:54.637039 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17, Twisted 16.1.1, Lua 5.2
2016-10-17 11:33:54.637173 [-] Python 3.4.3 (default, Oct 14 2015, 20:28:29) [GCC 4.8.4]
2016-10-17 11:33:54.637485 [-] Open files limit: 1048576
2016-10-17 11:33:54.638120 [-] Can't bump open files limit
2016-10-17 11:33:54.743707 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0', '1024x768x24']
2016-10-17 11:33:55.446450 [-] Traceback (most recent call last):
2016-10-17 11:33:55.446740 [-]   File "/app/bin/splash", line 4, in <module>
2016-10-17 11:33:55.448561 [-]     main()
2016-10-17 11:33:55.449293 [-]   File "/app/splash/server.py", line 372, in main
2016-10-17 11:33:55.450906 [-]     server_factory=server_factory,
2016-10-17 11:33:55.451930 [-]   File "/app/splash/server.py", line 273, in default_splash_server
2016-10-17 11:33:55.453654 [-]     allowed_schemes=allowed_schemes,
2016-10-17 11:33:55.454380 [-]   File "/app/splash/network_manager.py", line 58, in __init__
2016-10-17 11:33:55.456183 [-]     self.adblock_rules = AdblockRulesRegistry(filters_path, verbosity=verbosity)
2016-10-17 11:33:55.456661 [-]   File "/app/splash/request_middleware.py", line 161, in __init__
2016-10-17 11:33:55.458055 [-]     self._load(path)
2016-10-17 11:33:55.458534 [-]   File "/app/splash/request_middleware.py", line 204, in _load
2016-10-17 11:33:55.460139 [-]     max_mem=512*1024*1024,  # this doesn't actually use 512M
2016-10-17 11:33:55.460627 [-]   File "/usr/local/lib/python3.4/dist-packages/adblockparser/parser.py", line 301, in __init__
2016-10-17 11:33:55.462552 [-]     for r in rules
2016-10-17 11:33:55.463284 [-]   File "/usr/local/lib/python3.4/dist-packages/adblockparser/parser.py", line 299, in <listcomp>
2016-10-17 11:33:55.464906 [-]     r for r in (
2016-10-17 11:33:55.465371 [-]   File "/usr/local/lib/python3.4/dist-packages/adblockparser/parser.py", line 301, in <genexpr>
2016-10-17 11:33:55.466984 [-]     for r in rules
2016-10-17 11:33:55.467492 [-]   File "/usr/local/lib/python3.4/dist-packages/adblockparser/parser.py", line 112, in __init__
2016-10-17 11:33:55.469439 [-]     self.regex = self.rule_to_regex(rule_text)
2016-10-17 11:33:55.469940 [-]   File "/usr/local/lib/python3.4/dist-packages/adblockparser/parser.py", line 221, in rule_to_regex
2016-10-17 11:33:55.471339 [-]     raise ValueError("Invalid rule")
2016-10-17 11:33:55.471969 [-] ValueError: Invalid rule

I'm using scrapinghub/splash:2.2 docker image. My dockerfile looks like the following:

FROM scrapinghub/splash:2.2

# Adblock filters
ENV FILTDIR="/etc/splash/ad-filters/"
RUN wget -P $FILTDIR http://easylist.to/easylist/easylist.txt         #Error on run
RUN wget -P $FILTDIR http://easylist.to/easylist/fanboy-social.txt    #No error on run

CMD ["--filters-path=/etc/splash/ad-filters"]

Apologies if this isn't the correct project to create this issue.

rule with single negated domain not matched correctly

A rule with a single negated domain, like "adv$domain=~example.com", is very common but is not matched correctly.

Here's a snippet that makes the tests in test/test_parsing.py fail:

    "adv$domain=~example.com": [
        ("http://example.net/adv", {'domain': 'otherdomain.com'}, True),
        ("http://somewebsite.com/adv", {'domain': 'example.com'}, False),
    ],

adblockparser doesn't work with latest easylist.txt

Some rules can't be parsed (the ones with $websocket?), so a ValueError is raised.

It could make sense to raise a more specific exception and add an option to AdblockRules to warn on errors instead of raising an exception.
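Until such an option exists, one workaround is to pre-filter rules likely to be rejected before constructing AdblockRules. The heuristic below is a deliberately conservative sketch, not adblockparser API:

```python
# Workaround sketch: drop comments, element-hiding rules, and rules with
# an option known to be unsupported before building AdblockRules.
# The heuristic is illustrative; adjust it for your filter list.
def looks_unsupported(rule: str) -> bool:
    return (
        rule.startswith("!")        # comment lines
        or "##" in rule             # element hiding
        or "#@#" in rule            # element hiding exceptions
        or "$websocket" in rule     # option mentioned in this report
    )

raw_rules = ["||ads.example.com^", "! comment", "example.com##.ad", "ws$websocket"]
clean = [r for r in raw_rules if r.strip() and not looks_unsupported(r)]
print(clean)  # ['||ads.example.com^']
```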

Easylist no longer supported?

I can't get any of the easylists to work at all?

>>> with open('fanboy_social_general_block.txt', 'rb') as f:
...     raw_rules = f.read().decode('utf8').splitlines()
>>> rules = AdblockRules(raw_rules)
>>> rules.should_block("http://www.facebook.com")
False
>>> with open('easylist.txt', 'rb') as f:
...     raw_rules = f.read().decode('utf8').splitlines()
>>> rules = AdblockRules(raw_rules)
>>> rules.should_block("http://ads.example.com")
False

Any ideas? Are they no longer compatible?

Help with easylist.txt

Any help on getting the filter rules from easylist.txt, please?
Also, wouldn't it be faster to use a dictionary/set instead of a list?

||domain.com should match wss:subdomain.domain.com (but it doesn't)

The regex at

    rule = r"^(?:[^:/?#]+:)?(?://(?:[^/?#]*\.)?)?" + rule[2:]

appears to be too restrictive. According to https://help.eyeo.com/en/adblockplus/how-to-write-filters#anchors:

You might want to block http://example.com/banner.gif as well as https://example.com/banner.gif and http://www.example.com/banner.gif. You can do this by putting two pipe symbols in front of the filter. This ensures that the filter matches at the beginning of the domain name: ||example.com/banner.gif, and blocks all of these addresses while not blocking http://badexample.com/banner.gif or http://gooddomain.example/analyze?http://example.com/banner.gif.

If I understand this correctly, it should also block wss:www.example.com/banner.gif but in this implementation, it doesn't.

>>> from adblockparser import AdblockRules
>>> rules = AdblockRules(['||example.com/banner.gif'])
>>> rules.should_block('http://example.com/banner.gif')
True
>>> rules.should_block('http://www.example.com/banner.gif')
True
>>> rules.should_block('wss:example.com/banner.gif')
True
>>> rules.should_block('wss:www.example.com/banner.gif')
False

(should be True)
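The behaviour above can be reproduced with the stdlib re module alone. The anchor pattern is the one quoted in this report; the escaped suffix is an assumption about what rule_to_regex produces for the literal part of the rule:

```python
import re

# The "||" anchor regex quoted above, plus the rule's literal part.
# A subdomain ("www.") only matches inside the optional "//..." group,
# so "wss:www.example.com/..." (scheme, no "//") cannot match.
anchor = r"^(?:[^:/?#]+:)?(?://(?:[^/?#]*\.)?)?"
pattern = re.compile(anchor + re.escape("example.com/banner.gif"))

print(bool(pattern.match("http://www.example.com/banner.gif")))  # True: "//" present
print(bool(pattern.match("wss:example.com/banner.gif")))         # True: scheme, no subdomain
print(bool(pattern.match("wss:www.example.com/banner.gif")))     # False: subdomain requires "//"
```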

Invalid rule easyprivacy.txt + better errors

Hey,

I am getting an error when using https://easylist.to/easylist/easyprivacy.txt
It would be great to get the erroneous rule in the error message.

from adblockparser import AdblockRules

with open('easyprivacy.txt', 'rb') as f:
    content = f.read().decode('utf8').splitlines()
rules = AdblockRules(content)

Error

File "/usr/local/lib/python3.6/site-packages/adblockparser/parser.py", line 306, in __init__
    for r in rules
File "/usr/local/lib/python3.6/site-packages/adblockparser/parser.py", line 304, in <listcomp>
    r for r in (
File "/usr/local/lib/python3.6/site-packages/adblockparser/parser.py", line 306, in <genexpr>
    for r in rules
File "/usr/local/lib/python3.6/site-packages/adblockparser/parser.py", line 117, in __init__
    self.regex = self.rule_to_regex(rule_text)
File "/usr/local/lib/python3.6/site-packages/adblockparser/parser.py", line 233, in rule_to_regex
    raise AdblockParsingError('Invalid rule')
adblockparser.parser.AdblockParsingError: Invalid rule

Group filters into sub groups

Most filter rules are domain-specific. If we put filters starting with "|", "||", "@@||" into their own lists of rules keyed by domain, the giant regex will shrink to about 1/3 of its original size. The domain-specific rules will be quite short.

Each domain can then be put into a dict mapping to the respective filter rules.

Most of the remaining rules have the string "ad" in them. If we separate the remaining rules by whether they contain "ad", we get a much smaller regex for URLs without "ad".

domain_rules = {
    "host1": merged_regex,
    "host2": merged_regex,
    ...
}
rules_with_ad = merged_regex
rules_without_ad = merged_regex

Since most false-positive URLs don't match any domain and have no "ad" in them, rejecting them should be much faster.
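The bucketing step of this proposal can be sketched with the stdlib alone; group_by_domain is an illustrative helper, not library API:

```python
import re
from collections import defaultdict

# Sketch of the grouping idea above: bucket "||"-anchored rules by the
# domain they start with, so only a small per-domain set of rules needs
# to be tried for a given URL.
def group_by_domain(rules):
    buckets = defaultdict(list)
    for rule in rules:
        if rule.startswith("||"):
            # the domain part ends at the first separator ('/', '^' or '$')
            domain = re.split(r"[/^$]", rule[2:], maxsplit=1)[0]
            buckets[domain].append(rule)
    return buckets

rules = ["||ads.example.com^", "||ads.example.com/banner", "||tracker.net^"]
groups = group_by_domain(rules)
print(sorted(groups))  # ['ads.example.com', 'tracker.net']
```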
