ghostery / adblocker

Efficient embeddable adblocker library

Home Page: https://www.ghostery.com

License: Mozilla Public License 2.0

TypeScript 82.30% JavaScript 2.48% Makefile 0.06% Python 0.20% Jupyter Notebook 14.95%

content-blocking javascript adblocker easylist adblock puppeteer privacy

adblocker's Introduction

Adblocker

Efficient · Minimal · JavaScript · TypeScript · uBlock Origin- and Easylist-compatible
Node.js · Puppeteer · Electron · WebExtension

Cliqz' adblocker is a JavaScript library for blocking ads, trackers, and annoyances with a strong focus on efficiency. It was designed with compatibility in mind and integrates seamlessly with Node.js, Puppeteer, Electron, and WebExtension environments.

Getting Started

Cliqz' adblocker is the easiest and most efficient way to block ads and trackers in your project. Only a few lines of code are required to integrate smoothly with Puppeteer, Electron, a Chrome- and Firefox-compatible browser extension, or any environment supporting JavaScript (e.g. Node.js or React Native).

Here is how to do it in two steps for a Chrome- and Firefox-compatible WebExtension:

Install: npm install --save @cliqz/adblocker-webextension
Add the following in your background script:

import { WebExtensionBlocker } from '@cliqz/adblocker-webextension';

WebExtensionBlocker.fromPrebuiltAdsAndTracking().then((blocker) => {
  blocker.enableBlockingInBrowser(browser);
});

Congratulations, you are now blocking all ads and trackers! 🎉

Compatibility

The library supports 99% of all filters from the Easylist and uBlock Origin projects. Check the compatibility matrix on the wiki for more details.

Contributing

This project makes use of lerna and yarn workspaces under the hood. Quickly get started with:

Fork and clone the repository,
Enable corepack: corepack enable,
Install dependencies: yarn install --immutable,
Build: yarn build,
Test: yarn test.

For any question, feel free to open an issue or a pull request to get some help!

Who is using it?

This library is the building block technology used to power the adblockers from Ghostery and Cliqz on both desktop and mobile platforms. It is already running in production for millions of users and has been battle-tested to satisfy the following use-cases:

Mobile-friendly adblocker in react-native, WebExtension, or custom JavaScript context: Ghostery for iOS.
Ads and trackers blocker in Electron applications, Puppeteer headless browsers, the Cliqz browser, Ghostery, and standalone setups.
Batch requests processing in Node.js, HTML fuzzy keyword matcher, and more.

The innovative algorithms and architecture designed and implemented in this project have been shown to be among the most efficient ways to implement ad-blockers and have been used in other projects to implement highly performant adblockers such as Brave.

Swag

Show the world you're using ghostery/adblocker →

[![powered by Ghostery](https://img.shields.io/badge/ghostery-powered-blue?logo=ghostery)](https://github.com/ghostery/adblocker)

Or HTML:

<a href="https://github.com/ghostery/adblocker/" target="_blank" rel="noopener noreferrer">
    <img alt="powered by Ghostery" src="https://img.shields.io/badge/ghostery-powered-blue?logo=ghostery">
</a>

License

Mozilla Public License 2.0


adblocker's Issues

Extremely slow request decision

Hi, I noticed that after an update (0.10, I guess) the engine.match method became extremely slow. Here's the code I'm using in my Electron app, which literally hangs:

 webRequest.onBeforeRequest(
    { urls: ['<all_urls>'] },
    async (details: Electron.OnBeforeRequestDetails, callback: any) => {
      if (engine && settings.isShieldToggled) {
        console.time('engine.match');
        const { match, redirect } = engine.match(
          Request.fromRawDetails({
            type: details.resourceType as any,
            url: details.url,
          }),
        );
        console.timeEnd('engine.match');

        if (match || redirect) {
          appWindow.webContents.send(`blocked-ad-${details.webContentsId}`);

          if (redirect) {
            callback({ redirectURL: redirect });
          } else {
            callback({ cancel: true });
          }

          return;
        }
      }

      callback({ cancel: false });
    },
  );

Bring back some patterns which look like RegExps

We currently drop RegExp filters /.../. Although there does not seem to be any example of such a filter in the lists so far, it would be nice to only drop the filters which actually contain RegExp-specific characters like *. This way, we would be able to retain /ads/ (for example).
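A minimal sketch of the check described above, assuming a filter counts as a RegExp only when the text between the slashes contains RegExp syntax. The function name `classifySlashFilter` and the exact character set are illustrative assumptions, not the library's parser.

```typescript
// Sketch only: this classification rule is an assumption, not the library's parser.
type SlashFilterKind = "not-slash-delimited" | "plain" | "regexp";

// Characters treated here as RegExp-specific syntax.
const REGEXP_SYNTAX = /[\\^$.*+?()[\]{}|]/;

function classifySlashFilter(filter: string): SlashFilterKind {
  const delimited =
    filter.length >= 3 &&
    filter.charAt(0) === "/" &&
    filter.charAt(filter.length - 1) === "/";
  if (!delimited) {
    return "not-slash-delimited";
  }
  // Only the text between the two slashes decides the outcome.
  const body = filter.slice(1, -1);
  return REGEXP_SYNTAX.test(body) ? "regexp" : "plain";
}
```

With this rule, /ads/ would be kept as a plain pattern, while a filter such as /ads[0-9]+/ would still be dropped.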

Add more dynamic optimizations

It would be nice to extend the set of optimizations performed at runtime by the adblocker to speed-up matching. We could also consider optimizing individual filters; currently some of them are performed directly while parsing, but we could simplify (and speed-up) this phase and instead rely on the optimizer for this.

Clean-up request's types

We currently still handle legacy codes from Firefox Bootstrap extensions, which could be removed. It's also a good occasion to make sure all types are properly handled (they should be).

Firefox documentation: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType
Chrome documentation: https://developer.chrome.com/extensions/webRequest

"main_frame", "sub_frame", "stylesheet", "script", "image", "font", "object", "xmlhttprequest", "ping", "csp_report", "media", "websocket", "other"
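One way to make the handling of these types explicit is a string union with a type guard. This is only a sketch: `WebRequestType`, `isWebRequestType`, and `normalizeRequestType` are hypothetical names, and falling back to "other" for unknown values is an assumption.

```typescript
// The list of types quoted above, as a readonly tuple.
const WEB_REQUEST_TYPES = [
  "main_frame", "sub_frame", "stylesheet", "script", "image", "font", "object",
  "xmlhttprequest", "ping", "csp_report", "media", "websocket", "other",
] as const;

type WebRequestType = (typeof WEB_REQUEST_TYPES)[number];

// Type guard: narrows an arbitrary string to one of the known types.
function isWebRequestType(value: string): value is WebRequestType {
  return (WEB_REQUEST_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Assumption: anything unknown (e.g. a legacy code) is treated as "other".
function normalizeRequestType(value: string): WebRequestType {
  return isWebRequestType(value) ? value : "other";
}
```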

Update adblockplus.js for thirdParty parameter

We removed the thirdParty parameter of matchesAny() recently (adblockplus/adblockpluscore@23cd2b9). After this commit, the adblockplus.js file here no longer works.

Here are the changes I made locally to fix the issue:

diff --git a/bench/comparison/blockers/adblockplus.js b/bench/comparison/blockers/adblockplus.js
index 2125c2f..3410e00 100644
--- a/bench/comparison/blockers/adblockplus.js
+++ b/bench/comparison/blockers/adblockplus.js
@@ -6,7 +6,7 @@ const { URL } = require('url');

 const { CombinedMatcher } = require('./adblockpluscore/lib/matcher.js');
 const { Filter, RegExpFilter } = require('./adblockpluscore/lib/filterClasses.js');
-const { parseURL, isThirdParty } = require('./adblockpluscore/lib/url.js');
+const { parseURL } = require('./adblockpluscore/lib/url.js');

 // Chrome can't distinguish between OBJECT_SUBREQUEST and OBJECT requests.
 RegExpFilter.typeMap.OBJECT_SUBREQUEST = RegExpFilter.typeMap.OBJECT;
@@ -75,12 +75,10 @@ module.exports = class AdBlockPlus {
   match(request) {
     const url = parseURL(request.url);
     const sourceURL = parseURL(request.frameUrl);
-    const thirdParty = isThirdParty(url, sourceURL.hostname);
     const filter = this.matcher.matchesAny(
-      url.href,
+      url,
       RegExpFilter.typeMap[resourceTypes.get(request.type) || 'OTHER'],
       sourceURL.hostname,
-      thirdParty,
       null,
       false,
     );

PS: The thirdParty parameter is no longer needed because the function calculates this on its own as needed based on the request URL and the hostname of the document making the request (Adblock Plus issue #7260).

Consider supporting globbing in regexps

I still don't think we should support full regexps in filter syntax, although a more limited form like globbing could allow more efficient filters in some cases.

||foo.bar/{scripts,ads,tracking}$xhr

This would allow encoding several possibilities in a single filter, with a clear syntax.
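As a rough illustration of the idea, a globbing filter could be expanded into plain filters before parsing. The sketch below assumes a hypothetical `expandBraces` helper, handles only non-nested `{a,b,c}` groups, and leaves unbalanced input untouched; it is not part of the library.

```typescript
// Sketch: expand `{a,b,c}` groups into one filter per alternative.
// Assumptions: no nested groups, and `{` / `}` are not otherwise used in the filter.
function expandBraces(filter: string): string[] {
  const open = filter.indexOf("{");
  if (open === -1) {
    return [filter];
  }
  const close = filter.indexOf("}", open + 1);
  if (close === -1) {
    return [filter]; // unbalanced: keep the filter as-is
  }

  const prefix = filter.slice(0, open);
  const alternatives = filter.slice(open + 1, close).split(",");
  // The suffix may contain further groups, so expand it recursively.
  const suffixes = expandBraces(filter.slice(close + 1));

  const results: string[] = [];
  for (const alternative of alternatives) {
    for (const suffix of suffixes) {
      results.push(prefix + alternative + suffix);
    }
  }
  return results;
}
```

Applied to the example above, this yields three filters: one each for scripts, ads, and tracking, all keeping the `$xhr` option.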

Improve API of filters engine

Consider implementing strict semantic of elemhide

We currently implement elemhide and generichide the same way. Let's keep in mind that the real semantics might be needed in the future.

Handle new syntax

The script:inject is now +js

gorhill/uBlock@ec56165

Support $webrtc option

Implement 'doctests' for filters

It would be nice to have a way to specify tests inline for filters, similar to how doctest would work in Python.

! This filter blocks ads on foo.com
! >>> https://foo.com/js
||foo.com$script

This would make it possible to both document filters and test them easily. One thing which is not clear is the nicest way to specify the test cases (URL, source URL, type of request, etc.).
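A possible starting point, sketched under the assumption that a test case is a comment line of the form `! >>> <url>` placed directly above the filter it exercises. `extractDoctests` and `FilterDoctest` are hypothetical names; source URL and request type are left out on purpose since their syntax is still open.

```typescript
interface FilterDoctest {
  filter: string;
  urls: string[]; // URLs the filter is expected to match
}

// Sketch: collect `! >>> url` comments and attach them to the next filter line.
function extractDoctests(list: string): FilterDoctest[] {
  const doctests: FilterDoctest[] = [];
  let pendingUrls: string[] = [];

  for (const rawLine of list.split("\n")) {
    const line = rawLine.trim();
    if (line.length === 0) {
      continue;
    }
    if (line.charAt(0) === "!") {
      // Comment line: keep it only if it carries a doctest URL.
      const found = /^!\s*>>>\s*(\S+)/.exec(line);
      if (found !== null && found[1] !== undefined) {
        pendingUrls.push(found[1]);
      }
      continue;
    }
    // Filter line: attach any doctests collected just above it.
    if (pendingUrls.length > 0) {
      doctests.push({ filter: line, urls: pendingUrls });
      pendingUrls = [];
    }
  }
  return doctests;
}
```

Each extracted pair could then be fed to the engine, asserting that every URL matches its filter.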

Support $inline-script option

Fix regex hostname-matching logic

Some rare filters are not matched properly. For instance:

||geo*.hltv.org^ should match https://geo2.hltv.org/rekl13.php
||www*.swatchseries.to^$script should match https://www1.swatchseries.to/sw.js
||imp*.tradedoubler.com^$third-party should match https://impde.tradedoubler.com/imp
||www*.swatchseries.to^$script should match https://www1.swatchseries.to/public/js/bootstrap-modal.js

It seems that the meaning of * depends on the context in which it appears. It also raises the question of what should be considered the end of the hostname in a hostname anchor; currently I'm guessing this should always be the next separator. I'm hoping that all filters are consistent and follow this implicit rule, but this will have to be investigated.
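The guess above (the hostname part ends at the next separator, and * may stand for any run of characters inside it) can be sketched as follows. This is an illustration under those assumptions, not the library's matching logic; `matchesHostnameAnchor` is a hypothetical helper and it only checks the hostname, not the rest of the URL or the options.

```typescript
// Escape RegExp metacharacters so literal parts of the pattern match as-is.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumption: inside the hostname part, `*` matches any run of non-slash characters.
function hostnamePatternToRegExp(hostnamePattern: string): RegExp {
  const body = hostnamePattern.split("*").map(escapeRegExp).join("[^/]*");
  // `||` anchors at the start of the hostname or at a label boundary.
  return new RegExp("(?:^|\\.)" + body + "$");
}

function matchesHostnameAnchor(filter: string, hostname: string): boolean {
  if (filter.slice(0, 2) !== "||") {
    return false;
  }
  const rest = filter.slice(2);
  // Assumption: the hostname part ends at the first `^`, `/`, or `$`.
  const end = rest.search(/[\^\/$]/);
  const hostnamePattern = end === -1 ? rest : rest.slice(0, end);
  return hostnamePatternToRegExp(hostnamePattern).test(hostname);
}
```

Under these assumptions, the three hostname patterns listed above match geo2.hltv.org, www1.swatchseries.to, and impde.tradedoubler.com respectively.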

Inline matching logic in each class (NetworkFilter and CosmeticFilter)

This will open new opportunities to have custom logic for different kinds of filters in the future.

Engine on update, check checksums to not re-parse same lists

Allow serialization of engine even after internal optimizations triggered

Currently it is not possible to serialize the engine after internal optimizations have been triggered, because they change the structure of the buckets. It would be nice to still keep the list of original filters before optimizations to allow for serialization at any point in time. (Note: keeping the id of filters might be enough).

Reduce serialized size further using adaptative coding of length

StaticDataView currently needs to store the size/length of some elements (e.g.: size of the string in pushASCII). We could make the representation more compact by using the strict minimum number of bytes to represent the size. We currently use either a 16-bit or a 32-bit number depending on the type of data stored, but they could all benefit from a smarter encoding. For example we could use only 1 byte for length <= 127, then 2 or 4 bytes for higher values.
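A minimal sketch of such an encoding, assuming a simplified two-tier scheme: one byte for lengths up to 127, and otherwise four big-endian bytes whose top bit is set as a marker (so lengths up to 2^31 - 1). `writeLength` and `readLength` are hypothetical helpers, not StaticDataView methods.

```typescript
// Write `length` at `pos` and return the position just after it.
function writeLength(out: Uint8Array, pos: number, length: number): number {
  if (!Number.isInteger(length) || length < 0 || length > 0x7fffffff) {
    throw new RangeError("length out of range: " + length);
  }
  if (length <= 127) {
    out[pos] = length; // top bit clear: single-byte form
    return pos + 1;
  }
  out[pos] = 0x80 | (length >>> 24); // top bit set: four-byte form
  out[pos + 1] = (length >>> 16) & 0xff;
  out[pos + 2] = (length >>> 8) & 0xff;
  out[pos + 3] = length & 0xff;
  return pos + 4;
}

// Read a length written by `writeLength` and report where the next field starts.
function readLength(input: Uint8Array, pos: number): { length: number; next: number } {
  const first = input[pos];
  if (first < 0x80) {
    return { length: first, next: pos + 1 };
  }
  const length =
    ((first & 0x7f) << 24) | (input[pos + 1] << 16) | (input[pos + 2] << 8) | input[pos + 3];
  return { length, next: pos + 4 };
}
```

A three-tier variant (1, 2, or 4 bytes) would follow the same pattern with a second marker bit.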

Support $popunder option

Support $popup option

Extend cosmetic filter to support full network filter pattern

Currently cosmetic filters can only be scoped by hostname (a.com##.ads), but it would also be nice to support filters like a.com/b/c/d/##.ads
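To make the proposal concrete, here is a sketch of how such a filter might be split into hostname, path prefix, and selector. The names `parseScopedCosmeticFilter` and `appliesTo` are hypothetical, and treating the path as a simple prefix (rather than a full network filter pattern) is a simplifying assumption.

```typescript
interface ScopedCosmeticFilter {
  hostname: string; // e.g. "a.com"
  path: string;     // e.g. "/b/c/d/", or "" when no path was given
  selector: string; // e.g. ".ads"
}

function parseScopedCosmeticFilter(line: string): ScopedCosmeticFilter | null {
  const separator = line.indexOf("##");
  if (separator === -1) {
    return null;
  }
  const scope = line.slice(0, separator);
  const selector = line.slice(separator + 2);
  if (selector.length === 0) {
    return null;
  }
  // Everything from the first slash onwards is treated as a path prefix.
  const slash = scope.indexOf("/");
  const hostname = slash === -1 ? scope : scope.slice(0, slash);
  const path = slash === -1 ? "" : scope.slice(slash);
  return { hostname, path, selector };
}

// Assumption: the filter applies to the hostname and its subdomains,
// and only to pages whose pathname starts with the given path.
function appliesTo(filter: ScopedCosmeticFilter, hostname: string, pathname: string): boolean {
  const hostnameMatches =
    filter.hostname === "" ||
    hostname === filter.hostname ||
    hostname.endsWith("." + filter.hostname);
  return hostnameMatches && pathname.startsWith(filter.path);
}
```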

Create unified config

Currently multiple entities have similar configs, but each keeps its own. Let's have one shared instance instead.

Find a way to dynamically resize data view

Currently StaticDataView will not perform any resizing (for performance reasons). It would still be nice to have a way to do so without a performance penalty.
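One common approach is geometric growth: double the capacity whenever a write would overflow, so that the cost of copying is amortized over many writes. The class below is a standalone sketch of that technique with hypothetical names; it does not reflect how StaticDataView is implemented.

```typescript
// Sketch: a byte buffer that doubles its capacity on demand.
class GrowableBytes {
  private buffer: Uint8Array;
  private used = 0;

  constructor(initialCapacity: number = 16) {
    this.buffer = new Uint8Array(Math.max(1, initialCapacity));
  }

  get capacity(): number {
    return this.buffer.length;
  }

  private ensureCapacity(extra: number): void {
    const needed = this.used + extra;
    if (needed <= this.buffer.length) {
      return; // fast path: no allocation, no copy
    }
    let capacity = this.buffer.length;
    while (capacity < needed) {
      capacity *= 2;
    }
    const next = new Uint8Array(capacity);
    next.set(this.buffer.subarray(0, this.used));
    this.buffer = next;
  }

  pushByte(value: number): void {
    this.ensureCapacity(1);
    this.buffer[this.used] = value & 0xff;
    this.used += 1;
  }

  pushBytes(values: Uint8Array): void {
    this.ensureCapacity(values.length);
    this.buffer.set(values, this.used);
    this.used += values.length;
  }

  // Returns a copy trimmed to the bytes actually written.
  toUint8Array(): Uint8Array {
    return this.buffer.slice(0, this.used);
  }
}
```

The bounds check on every write is the remaining cost; whether it is measurable in practice would need benchmarking.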

Support $frame option

Consider adding tldts in the bundle again

We currently do not depend directly upon any library to parse URLs, but some public APIs require injection of a parse function for that purpose. We usually use tldts for this. We should consider adding it again as a dependency for convenience so that @cliqz/adblocker can be used and works out of the box. We could still provide the ability to bypass the use of tldts.

Offer commonjs distribution

For use cases like Ghostery, where the hosting project controls the build system, Adblocker should come in source code form so that the build system can optimize dependency loading.
Currently the adblocker bundle comes with tldts embedded, which has a fairly high loading cost. If we distribute CommonJS sources, build systems can bundle tldts (or other dependencies) only once (or not at all if it gets externalized).

Increase test coverage for lists management

Automatically invalidate serialized engine when version changes

We currently need to bump the version of the adblocker whenever the format of the cached engine changes, which is error-prone. It would be nice to have something built-in for this.
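One possible built-in mechanism, sketched here as an assumption rather than a description of the library: derive a 32-bit tag from a string describing the serialization layout (FNV-1a is used below purely as a simple hash), store it at the start of the buffer, and refuse to load buffers whose tag differs. The descriptor string and all function names are hypothetical.

```typescript
// FNV-1a 32-bit hash of an ASCII string.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i) & 0xff;
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Prefix the payload with a 4-byte tag derived from the format descriptor.
function wrapWithFormatTag(payload: Uint8Array, formatDescriptor: string): Uint8Array {
  const out = new Uint8Array(payload.length + 4);
  new DataView(out.buffer).setUint32(0, fnv1a32(formatDescriptor));
  out.set(payload, 4);
  return out;
}

// Return the payload if the tag matches the current format, otherwise null
// so that the caller can rebuild the engine from the lists instead.
function unwrapIfCurrent(serialized: Uint8Array, formatDescriptor: string): Uint8Array | null {
  if (serialized.length < 4) {
    return null;
  }
  const view = new DataView(serialized.buffer, serialized.byteOffset, serialized.byteLength);
  if (view.getUint32(0) !== fnv1a32(formatDescriptor)) {
    return null;
  }
  return serialized.subarray(4);
}
```

If the descriptor were generated from the serialization code itself (or from the package version), a format change would invalidate old caches without a manual bump.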

[Question] How to use with Puppeteer

Hello,

First of all, thanks for creating this awesome project!

I'm trying to leverage adblock capabilities into my Puppeteer code.

Basically, Puppeteer allows you to abort requests, so the only thing I need is to determine whether an ad request should be aborted.

As a reference, that's my current implementation for aborting tracking requests: https://github.com/Kikobeats/browserless/blob/master/packages/goto/src/index.js#L46

If I read docs correctly, I need to create a FiltersEngine instance and check for match property.

Something like this can replace the previous code:

if (abortTrackers) {
  const { match } = engine.match(req)
  if (match) {
    debug(`abort:tracker:${++reqCount.abort}`, resourceUrl)
    return req.abort()
  }
}

I created a FiltersEngine instance from a big rules file, the result of concatenating the most popular rule lists (easylist, etc).

I'm not sure if I'm using the wrong API method, but it never matches a rule, even when I try sites where I know it should match since I have the same rules in the browser 🤔

I suppose the problem is that req from Puppeteer is not the same as your Request object? Not sure.

Any idea about what is happening is welcome 😅

Support $empty option

Make sure #@# unhide rules apply on +js(...) injection rules

Networkfilter matches when it should not

||foo.co^aaa/ will match https://bar.foo.com/bbb/aaa/ when it probably should not

Optimize in-memory representation of cosmetic bucket

We do not need to keep the full instance of CosmeticFilter in the list of generic rules.

Edit: after running some benchmarks on loading popular domains, it seems like a major part of the CPU time is spent in getCosmeticsFilters and createStylesheet. This should be optimized away.

Investigate ways to reduce memory usage

The memory usage could probably be reduced. Some ideas:

Use smaz.js to compress patterns. It should be possible to compress both patterns and urls and perform the matching on compressed typed arrays directly. This would benefit both memory usage and serialized engine size.
Improve optimizer to allow fusion of more filters, potentially in custom forms (e.g.: automata for plain patterns)
Find ways to reduce the size of NetworkFilter and CosmeticFilter objects

`engine.match` sometimes returns wrongly constructed filter

For example, when the filter is ||amazon.*/gp/product/, the engine.match returns ||amazon.^.*\\/gp\\/product\\/ as the matching filter

Optimize matching with faster uint32 access in data view

Currently one of the bottlenecks in matching is getUint32() from StaticDataView. We could fix this by making sure arrays of 32-bit numbers are aligned on 4 bytes and then using a Uint32Array directly to access these sections of the view.
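The alignment idea can be sketched in isolation as below. The helper names are hypothetical and the code works on a raw ArrayBuffer rather than on StaticDataView. One caveat worth stating: typed arrays use the platform's byte order, so a buffer written this way is only portable between machines of the same endianness.

```typescript
// Round an offset up to the next multiple of 4.
function alignTo4(offset: number): number {
  return (offset + 3) & ~3;
}

// Write `values` as a 4-byte-aligned section starting at or after `offset`.
function writeUint32Section(
  buffer: ArrayBuffer,
  offset: number,
  values: number[],
): { start: number; end: number } {
  const start = alignTo4(offset);
  // A Uint32Array view requires its byte offset to be a multiple of 4.
  const section = new Uint32Array(buffer, start, values.length);
  section.set(values);
  return { start, end: start + values.length * 4 };
}

// Reading back is a zero-copy view; element access is a plain indexed load.
function readUint32Section(buffer: ArrayBuffer, start: number, count: number): Uint32Array {
  return new Uint32Array(buffer, start, count);
}
```

The padding costs at most 3 bytes per section, in exchange for replacing each getUint32() call with an array index.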

Optimize punycode implementation

The punycode implementation is already pretty fast but unnecessarily generic. We could specialize it for our exact needs and probably make it a bit more efficient.

Benchmarking code does not work out of the box

The current version of the benchmarking code (in particular the code in bench/comparison) does not work without modifications to the makefile.

I had to make the following changes to bench/comparison/Makefile to make it work:

Change git:// URLs to https:// URLs
Add a rule for the target adblockpluscore, which clones github.com/adblockplus/adblockpluscore and checks out a specific commit
Replace requests.json with ../dataset/requests.json

Add tests for cosmetic injections

There is already a small POC for testing cosmetic injections using JSDOM; it would be nice to add more tests.

Add helpers to create Request from puppeteer/electron

The library could expose helpers to help in matching/blocking requests in common environments like web-extension, puppeteer, electron, etc. This could take the form of:

new helpers or methods exposed as part of the public API (e.g.: makeRequestFromElectron, makeRequestFromPuppeteer, matchPuppeteerRequest, etc.)
add more examples in the example folders to show different but common use-cases of the library
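As an illustration of what such a helper could look like for Puppeteer, the sketch below maps Puppeteer's resource type names onto the webRequest type names and builds a plain details object. `toRawRequestDetails`, the `PuppeteerLikeRequest` interface, and the `sourceUrl` field are assumptions made for this example; the resulting object would still need to be passed to the library's own request constructor.

```typescript
// Minimal structural view of the Puppeteer request methods used here.
interface PuppeteerLikeRequest {
  url(): string;
  resourceType(): string;
  frame(): { url(): string } | null;
}

interface RawRequestDetails {
  url: string;
  type: string;
  sourceUrl: string;
}

// Puppeteer resource types that differ from, or map onto, webRequest types.
// Assumption: "document" is treated as a top-level navigation.
const PUPPETEER_TO_WEBREQUEST_TYPE = new Map<string, string>([
  ["document", "main_frame"],
  ["stylesheet", "stylesheet"],
  ["image", "image"],
  ["media", "media"],
  ["font", "font"],
  ["script", "script"],
  ["xhr", "xmlhttprequest"],
  ["fetch", "xmlhttprequest"],
  ["websocket", "websocket"],
  ["ping", "ping"],
]);

function toRawRequestDetails(request: PuppeteerLikeRequest): RawRequestDetails {
  const frame = request.frame();
  return {
    url: request.url(),
    type: PUPPETEER_TO_WEBREQUEST_TYPE.get(request.resourceType()) ?? "other",
    // Assumption: without a frame, fall back to the request's own URL.
    sourceUrl: frame === null ? request.url() : frame.url(),
  };
}
```

Passing a raw Puppeteer request to the engine without this kind of conversion is a plausible reason for rules never matching.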

Support $jsinject option

Node support

Hi,

I'm trying to get this to work with Puppeteer, but the fetch is failing so far.

There has been an error ReferenceError: fetch is not defined
    at fetchResource (../node_modules/@cliqz/adBlocker/dist/adblocker.cjs.js:3552:5

Is there a way to support this in Node?

When updating lists return true if the engine was updated

Support $document option

Some filters specify a $document option, which means they should apply to cpt 6 (to block the loading of a document completely). Additionally, we might want to trigger the display of a warning page in this case to explain to the user why the page was blocked and to offer the option of visiting the site anyway.

Dependabot couldn't authenticate with registry.npmjs.org

Dependabot couldn't authenticate with registry.npmjs.org.

You can provide authentication details in your Dependabot dashboard by clicking into the account menu (in the top right) and selecting 'Config variables'.

You can mention @dependabot in the comments below to contact the Dependabot team.

Allow running benchmarks in browsers (Firefox + Chrome)

Implement event logger

To help debugging, it would be nice to implement an event logger for the adblocker. This would optionally keep track of requests blocked/redirected, exceptions as well as cosmetics injected into the page.
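A small sketch of what such a logger might look like, assuming a bounded in-memory buffer that can be switched off. The event shapes and the `EventLogger` class are hypothetical and only mirror the kinds of events listed above.

```typescript
// Hypothetical event shapes covering the cases mentioned above.
type AdblockerEvent =
  | { kind: "request-blocked"; url: string; filter: string }
  | { kind: "request-redirected"; url: string; redirect: string }
  | { kind: "exception"; url: string; filter: string }
  | { kind: "cosmetics-injected"; hostname: string; selectors: number };

class EventLogger {
  private readonly events: AdblockerEvent[] = [];
  private readonly capacity: number;
  enabled: boolean;

  constructor(capacity: number = 100, enabled: boolean = true) {
    this.capacity = Math.max(1, capacity);
    this.enabled = enabled;
  }

  // Record an event, dropping the oldest one once the buffer is full.
  log(event: AdblockerEvent): void {
    if (!this.enabled) {
      return;
    }
    if (this.events.length >= this.capacity) {
      this.events.shift();
    }
    this.events.push(event);
  }

  count(kind: AdblockerEvent["kind"]): number {
    return this.events.filter((event) => event.kind === kind).length;
  }

  // Copy of the retained events, oldest first.
  snapshot(): AdblockerEvent[] {
    return this.events.slice();
  }
}
```

Keeping the logger optional and bounded means it should add no cost when disabled and cannot grow without limit when enabled.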

Deserialize multiple cache buffers to one FiltersEngine instance

Hi, I have one cache file with easylist, easyprivacy etc. and I would like to also include optional, regional filters, but it seems there's no other way to do that without creating multiple FiltersEngine instances. Is there a way to do it like this?

const engine = new FiltersEngine();
engine.deserialize(buffer1);
engine.deserialize(buffer2);

I've also seen the update method, but why does it distinguish cosmetic filters from network filters when the FiltersEngine.parse method already does that? And why can parse be used only once per instance?