

badger-sett's Issues

We lose snitch map entries for blocked domains after restarting the browser

We export and re-import Privacy Badger data after unexpected WebDriver exceptions as of 54cee3b, but data import currently loses snitch map entries for already-blocked (and cookieblocked?) domains. We don't have a dedicated issue for this problem yet, but it's mentioned in EFForg/privacybadger#1972.

Since this data import bug will now lead to incomplete crawl data, and fixing it might be relatively straightforward, we should definitely fix it.

Crawl beyond the home page

Currently, we only visit the home page of each domain in the scan. Many sites might have different kinds of trackers on different pages. We could modify the crawler to randomly "click" around on the different first-party links on each site it visits.
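A rough sketch of what that could look like with Selenium (assuming a recent Selenium API; the helper name and the same-site check here are illustrative, not the crawler's actual code):

import random
from urllib.parse import urlparse

from selenium.webdriver.common.by import By

def click_random_first_party_link(driver, domain):
    """Navigate to a randomly chosen same-site link on the current page."""
    links = []
    for el in driver.find_elements(By.TAG_NAME, "a"):
        href = el.get_attribute("href") or ""
        host = urlparse(href).hostname or ""
        if host == domain or host.endswith("." + domain):
            links.append(href)
    if links:
        # navigating directly is more robust than el.click() on hidden elements
        driver.get(random.choice(links))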

docker build failed - hardcoded username

The username and group for the chown command on privacybadger in the Dockerfile are hardcoded (as "bennett"), causing the Docker build to fail with the following:

Step 25/28 : COPY --chown=bennett:bennett privacybadger $PBPATH
unable to convert uid/gid chown string to host mapping: can't find uid for user bennett: no such user: bennett
Docker build failed.

Do we need another `git pull` after `git checkout`?

Without it, the newly checked-out branch could be out of date. Scenario:

  1. some-branch is checked out.
  2. git pull brings in updates to master.
  3. git checkout master because we want to run on master.
  4. Status is:
On branch master
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)

Could some additional settings improve resource usage during training?

Since we're just trying to observe certain types of information per domain, could any of these settings improve performance during training, or reduce the resources needed to get through it, without compromising which domains are caught?

Some candidates:
permissions.default.image = 2  -- disable images
browser.display.use_document_fonts = 0  -- disable document fonts
gfx.downloadable_fonts.enabled = false  -- disable downloadable fonts
dom.serviceWorkers.enabled = false  -- disable creation of service workers
media.peerconnection.enabled = false  -- disable WebRTC
security.OCSP.enabled = 0  -- disable querying OCSP for every certificate check
webgl.disabled = true  -- disable WebGL
layout.spellcheckDefault = 0  -- disable the spellchecker
network.dns.disablePrefetch = true  -- disable DNS prefetching, since we aren't clicking on everything during the scan

Curious about tightening these limits:
network.cookie.maxPerHost = 32
dom.workers.maxPerDomain = 48
network.http.max-connections = 48
javascript.options.mem.max = 16384
network.websocket.max-connections = 20

Another question: should we increase network.cookie.maxNumber?

Miscellaneous: anything matching network.*max or javascript.options that might make the browser hang on a page longer than it needs to in order to catch what needs to be caught.

We could also disable some media-related features, or lower the max sizes, retries, and timeouts of various things; I'm not sure. A sketch of wiring a few of the settings above into Selenium follows.
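If we wanted to experiment with these, here is a minimal sketch of setting a few of them via Selenium (assuming a recent Selenium and that the crawler builds its driver from FirefoxOptions; the exact integration point in crawler.py may differ):

from selenium import webdriver

RESOURCE_SAVING_PREFS = {
    "permissions.default.image": 2,           # disable images
    "browser.display.use_document_fonts": 0,  # disable document fonts
    "gfx.downloadable_fonts.enabled": False,  # disable downloadable fonts
    "media.peerconnection.enabled": False,    # disable WebRTC
    "webgl.disabled": True,                   # disable WebGL
    "layout.spellcheckDefault": 0,            # disable the spellchecker
    "network.dns.disablePrefetch": True,      # disable DNS prefetching
}

opts = webdriver.FirefoxOptions()
for pref, value in RESOURCE_SAVING_PREFS.items():
    opts.set_preference(pref, value)
driver = webdriver.Firefox(options=opts)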

Ideas for increasing scan efficiency (decreasing the error rate)

9bbeb45

Of the 6,000 sites tested, the log files show the scan erroring on 24% of the websites it tried. A summary of the error types:

UnexpectedAlertPresentException: 3
Reached error page: 1043
Reached Cloudflare security page: 65
Encountered unhandled user prompt dialog: 3
Timed out loading...: 311
NoSuchWindowException: 1
Likely bug: 1
InvalidSessionIdException: 1
InsecureCertificateException: 22

  1. One thing I noticed is that the scan excludes the .mil and .gov suffixes. I'd suggest also excluding .edu domains: while some have trackers, it's probably not worth training on them, since those trackers will already show up on other sites. Sites with the .org suffix might offer a small benefit to test, but not as strong as your typical .com site.

  2. Other countries may use .gov, .edu, or .mil earlier in their suffix. For example: epfindia.gov.in, sbv.gov.vn, conicet.gov.ar, nsw.gov.au.

  3. I still think more sites could be tested, as long as the scan can complete before the next scheduled scan starts. And they don't necessarily need to be the top 6,000 sites. Even a randomized sample of 20,000 of the top million sites, for example, might be more effective (see the sketch below). Imagine google.com and its country-specific sites are all in the top 6,000 websites visited: Google isn't a tracker-heavy site, but my local car dealership is probably in the top million and stuffed with trackers.
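For illustration, a small sketch of sampling 20,000 domains at random from a larger list file (the "domain" or "rank,domain" line format is an assumption about the list file, not a statement about how badger-sett currently reads its domain lists):

import random

def sample_domains(list_path, n=20000):
    """Randomly sample n domains from a top-million style list."""
    with open(list_path) as f:
        # accept either "domain" or "rank,domain" lines
        domains = [line.strip().split(",")[-1] for line in f if line.strip()]
    return random.sample(domains, min(n, len(domains)))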

Most action map entries are lost during restarts

Probably because load_user_data calls merge on action_map (which no longer copies action_map entries wholesale) but doesn't call merge on snitch_map (which is what would recreate the missing action_map entries):

badger-sett/crawler.py

Lines 192 to 200 in da047d2

def load_user_data(driver, browser, ext_path, data):
    load_extension_page(driver, browser, ext_path, OPTIONS)
    script = '''
data = JSON.parse(arguments[0]);
badger.storage.action_map.merge(data.action_map);
for (let tracker in data.snitch_map) {
    badger.storage.snitch_map._store[tracker] = data.snitch_map[tracker];
}'''
    driver.execute_script(script, json.dumps(data))

We should maybe call mergeUserData instead.
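A minimal sketch of that change, assuming badger.mergeUserData accepts the same parsed export object (whether it fully replaces the per-map handling above needs to be verified against Privacy Badger's code):

def load_user_data(driver, browser, ext_path, data):
    load_extension_page(driver, browser, ext_path, OPTIONS)
    # Delegate merging to Badger's own import routine instead of writing
    # to snitch_map._store directly.
    driver.execute_script(
        "badger.mergeUserData(JSON.parse(arguments[0]));", json.dumps(data))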

scan in docker failed - missing log file

When running ./runscan.sh I get the following error:

Running scan in Docker...
Ctrl-C to break.
Traceback (most recent call last):
  File "./crawler.py", line 620, in <module>
    crawler = Crawler(**vars(ap.parse_args()))
  File "./crawler.py", line 186, in __init__
    fh = logging.FileHandler(os.path.join(out_path, 'log.txt'))
  File "/usr/lib/python3.5/logging/__init__.py", line 1008, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python3.5/logging/__init__.py", line 1037, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: '/home/<local user>/out/log.txt'
mv: cannot stat '/home/<local user>/<local path to repo>/docker-out/log.txt': No such file or directory
Scan failed. See log.txt for details.

scan in docker failed - permission denied on home folder

When I run ./runscan.sh it fails with the following error:

Running scan in Docker...
Ctrl-C to break.
Traceback (most recent call last):
  File "./crawler.py", line 620, in <module>
    crawler = Crawler(**vars(ap.parse_args()))
  File "./crawler.py", line 186, in __init__
    fh = logging.FileHandler(os.path.join(out_path, 'log.txt'))
  File "/usr/lib/python3.5/logging/__init__.py", line 1008, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python3.5/logging/__init__.py", line 1037, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: '/home/<local user>/out/log.txt'
Scan failed. See log.txt for details.

Log all crawl parameters

We should log all parameters used for a crawl so that it's easy to know which crawl was run with which browser, and so on. Basically, expand the "starting new crawl with timeout 30 n_sites 10" message to span a few lines and include everything relevant, such as $BROWSER.
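A possible sketch, assuming the crawl arguments come from argparse as in the tracebacks elsewhere on this page (the function and logger names are placeholders):

def log_crawl_parameters(logger, args):
    """Log every crawl parameter so each log file records how it was produced."""
    logger.info("Starting new crawl with the following parameters:")
    for name, value in sorted(vars(args).items()):
        logger.info("  %s = %s", name, value)

# e.g. log_crawl_parameters(self.logger, ap.parse_args())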

docker build failed - missing domain-lists file

When running ./runscan.sh for the first time, I was getting the following error:

Step 23/28 : COPY domain-lists $HOME/domain-lists
COPY failed: stat /var/lib/docker/tmp/docker-builder530191829/domain-lists: no such file or directory
Docker build failed.

Add automated linting

Via Travis CI and GitHub Checks. Should we reuse the same Travis account we use for Privacy Badger?

We should probably use Prospector, as it bundles several tools with sensible defaults, although I haven't checked what's new in this space in a few years.

Failed to decode response from marionette

This happened at the end of a crawl. All sites following the "failed to decode response" message failed; we should catch this error and restart whatever needs to be restarted (a rough sketch follows the log below).

2018-06-15 00:52:11,847 visiting flavors.me
2018-06-15 00:52:21,768 visiting fastcodesign.com
2018-06-15 00:52:23,197 https://fastcodesign.com/ Failed to decode response from marionette
2018-06-15 00:52:23,197 trying http://fastcodesign.com/
2018-06-15 00:52:23,199 fastcodesign.com Tried to run command without establishing a connection
2018-06-15 00:52:23,199 visiting zippyshare.com
2018-06-15 00:52:23,203 https://zippyshare.com/ Tried to run command without establishing a connection
2018-06-15 00:52:23,203 trying http://zippyshare.com/
2018-06-15 00:52:23,204 zippyshare.com Tried to run command without establishing a connection.
...
2018-06-15 00:52:24,246 visiting realtor.com
2018-06-15 00:52:24,247 https://realtor.com/ Tried to run command without establishing a connection
2018-06-15 00:52:24,247 trying http://realtor.com/
2018-06-15 00:52:24,251 realtor.com Tried to run command without establishing a connection
2018-06-15 00:52:24,251 Scan complete. Saving data...
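A rough sketch of one way to handle it, assuming the crawler has (or grows) helpers for visiting a domain and relaunching the browser; the helper names here are placeholders, not existing methods:

from selenium.common.exceptions import WebDriverException

FATAL_MESSAGES = (
    "failed to decode response from marionette",
    "tried to run command without establishing a connection",
)

def visit_with_restart(crawler, domain):
    """Visit a domain; if the driver connection died, restart the browser and retry once."""
    try:
        crawler.visit_domain(domain)      # placeholder for the existing visit logic
    except WebDriverException as err:
        if any(msg in str(err).lower() for msg in FATAL_MESSAGES):
            crawler.restart_browser()     # placeholder: tear down and relaunch the driver
            crawler.visit_domain(domain)
        else:
            raise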

Support specifying browser version

When run via Docker now, we get the latest stable version of Firefox/Chrome/Edge. While this is good default behavior, we should also be able to specify a particular browser version (Chrome 103.0.5056.0, Firefox 100.0.1, ...), or at least a branch (Beta, Dev/Canary, ...).

This probably involves adding a script that figures out the appropriate Firefox/Chrome/WebDriver version (for example), and builds a custom image for the given browser and corresponding driver versions. We will then update runscan.sh to use the new script, and update the Dockerfile to inherit from the generated image (instead of the selenium/standalone-firefox (or -chrome or -edge) image from Docker Hub).

Running with Chrome is broken

It broke when I removed CRX2 creation from Badger's Makefile as part of the CRX3 switchover.

make: *** No rule to make target `travisbuild'.  Stop.
Traceback (most recent call last):
  File "./crawler.py", line 623, in <module>
    crawler.crawl()
  File "./crawler.py", line 373, in crawl
    self.start_browser()
  File "./crawler.py", line 320, in start_browser
    self.start_driver()
  File "./crawler.py", line 204, in start_driver
    build = subprocess.check_output(cmd).strip().decode('utf8').split()[-1]
  File "/usr/lib/python3.4/subprocess.py", line 620, in check_output
    raise CalledProcessError(retcode, process.args, output=output)
subprocess.CalledProcessError: Command '['make', '-sC', '/home/zzz/privacybadger/', 'travisbuild']' returned non-zero exit status 2

Since CRX2 is going away, the solution should be to load Privacy Badger from source ("unpacked" vs. packaged extension loading): EFForg/privacybadger@88c862c

Error loading extension page: Tried to run command without establishing a connection

Tried running a scan against master this morning; the scan failed with the following output:

...
2018-06-28 16:08:02,757 visiting 510: ihg.com
2018-06-28 16:08:09,800 visiting 511: oxfordjournals.org
2018-06-28 16:08:22,455 visiting 512: ucsd.edu
2018-06-28 16:08:31,176 visiting 513: news.google.com
2018-06-28 16:08:46,126 timeout on https://news.google.com/
2018-06-28 16:08:47,716 trying http://news.google.com/
2018-06-28 16:08:58,895 timeout on news.google.com
2018-06-28 16:08:59,501 visiting 514: theregister.co.uk
2018-06-28 16:09:09,631 visiting 515: duke.edu
2018-06-28 16:09:10,534 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:10,534 duke.edu SessionNotCreatedException: Tried to run command without establishing a connection
2018-06-28 16:09:10,535 visiting 516: scoop.it
2018-06-28 16:09:10,546 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:10,547 scoop.it SessionNotCreatedException: Tried to run command without establishing a connection
2018-06-28 16:09:10,548 visiting 517: wikidot.com
2018-06-28 16:09:10,551 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:10,552 wikidot.com SessionNotCreatedException: Tried to run command without establishing a connection
...
2018-06-28 16:09:17,296 visiting 2000: kissmetrics.com
2018-06-28 16:09:17,299 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:17,299 kissmetrics.com SessionNotCreatedException: Tried to run command without establishing a connection
2018-06-28 16:09:17,300 Finished scan. Getting data from browser storage...
2018-06-28 16:09:17,303 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:17,303 Could not get badger storage.

Might be related to fixes for #8.

Recommendation on range and some filters

Of the top 2000 domains on the current Tranco list, .org, .gov, and .edu account for approximately 275 entries, or about 13.75% of the domains being tested.

These are domains for organizations, government, and military use (at least in the United States) and are likely to have significantly less potential for commercial tracking. More broadly, the top 2000 sites are not going to be full of trackers (e.g. google, youtube, facebook, microsoft, wikipedia, twitter, pinterest, amazon, netflix, vimeo, wordpress, github, windowsupdate, etc.). How many third-party domains do these sites have? Not many.

I suggest training further down the list: say, 4000 through 10000 (or more), excluding any domains that end in .org, .edu, or .gov, or that contain .gov. in the middle (the last rule to exclude government sites from other countries).

Or, better yet, use a list of sites in certain categories (health/beauty/medical, crafts and hobbies, food, music, movies, entertainment, blogs, news, etc.)

One resource that does categorize sites is the popular Shalla list: http://www.shallalist.de/categories.html

Privacy Badger's default data is pretty small and should pick up your local newspapers and game sites... in short, the top sites do the tracking on the bottom sites to get you to visit the top sites.

Check this: https://webcookies.org/number-of-cookies

google.com: 2 cookies

Compare that to any of these sites:

www.ogaracoach.com
http://www.10greatlines.com/
https://newsinfo.inquirer.net
https://www.favecrafts.com
www.ibtimes.co.uk

Log Privacy Badger branch and commit hash

Branch for human readability and hash for exactness. Logging the hash will make it possible to know precisely which Privacy Badger code was exercised by the scan.
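A sketch of how that could look, assuming the path to the Privacy Badger checkout is available to the crawler (pb_path is a placeholder):

import subprocess

def log_pb_version(logger, pb_path):
    """Log the Privacy Badger branch name and commit hash used for this scan."""
    branch = subprocess.check_output(
        ["git", "-C", pb_path, "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()
    commit = subprocess.check_output(
        ["git", "-C", pb_path, "rev-parse", "HEAD"]).decode().strip()
    logger.info("Privacy Badger branch %s, commit %s", branch, commit)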

Notify devs (via email) (from Cron) upon failures

Probably easiest to notify via emails from Cron. I think Cron emails on any output by default. We can write any failures to stderr from the script and then silence regular stdout and send stderr to stdout (... 2>&1 >/dev/null) in the crontab.

Related to #19.

What is the DNS Server to use?

Hello,
since you introduced PyFunceble into #23 (note that #21 might be closed), followed by the merge of the maintenance PRs I submitted (#39, #40), I'm now wondering which DNS server you recommend or use?

I'm mainly asking so I can write a PR: since the latest version of PyFunceble (2.0.0), which I released yesterday (Berlin time), it's possible to set a custom DNS server for the DNS lookups PyFunceble does.

For our implementation in this project, the following (pseudo) line:

"dns_server": ["first_dns_server", "second_dns_server"]

might be added to this dictionary:

OUR_PYFUNCEBLE_CONFIG = {"share_logs": False}

As a side note, I dropped log sharing by default, so "share_logs": False might not be needed; we can keep it, though, in case we choose to use PyFunceble < 2.0.0 instead of the newly released version.
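Putting the two snippets together, the change would look something like this (the DNS server entries are placeholders, and whether "share_logs" stays depends on which PyFunceble version we pin):

OUR_PYFUNCEBLE_CONFIG = {
    "share_logs": False,  # the default in PyFunceble >= 2.0.0, kept for older versions
    "dns_server": ["first_dns_server", "second_dns_server"],  # placeholder values
}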

Have a nice day/night.

Cheers,
Nissar

Docker not working

When I try to run it I get this error:

ERROR: failed to solve: failed to parse stage name "selenium/standalone-/vscode/bin/linux-x64/1a5daa3a0231a0fbba4f14db7ec463cf99d7768e/bin/helpers/browser.sh": invalid reference format
Docker build failed.

Do we have a race condition with dump_data?

Do we have a race condition with dump_data()? We load Badger's background page and query its storage. Whenever we load the background page, we reinitialize another copy of background.js, with everything that entails. What if Badger's storage isn't ready yet? Basically, if we depend on anything asynchronous, it may not be ready yet.

If this is a problem, it may only show up in Chrome (the faster browser).

Compare to the waiting logic in Privacy Badger's tests.
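If it turns out to be a problem, one possible mitigation is to poll the background page until Badger reports that initialization has finished before dumping data. A sketch, where the badger.INITIALIZED flag is an assumption about Privacy Badger's internals and would need to be verified:

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_badger(driver, timeout=10):
    """Block until the extension's asynchronous initialization appears to be done."""
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script(
            "return typeof badger != 'undefined' && !!badger.INITIALIZED"))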

Allow loading of arbitrary extensions

Now that we can run scans to evaluate the effect of (strict) Tracking Protection on Privacy Badger's learning (c6c932e), it would be nice to also have the ability to load arbitrary other extensions to make similar comparisons. So, a --load-extension flag.

Loading packed extensions in Chrome should be simple with opts.add_extension(PATH_TO_EXTENSION_CRX). Firefox is probably more complicated, so I suggest getting this working on Chrome first.
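A sketch of the Chrome side, following the opts.add_extension() suggestion above (the flag name and paths are placeholders, not an existing interface):

import argparse
from selenium import webdriver

ap = argparse.ArgumentParser()
ap.add_argument("--load-extension",
                help="path to a packed .crx to load alongside Privacy Badger")
args = ap.parse_args()

opts = webdriver.ChromeOptions()
if args.load_extension:
    opts.add_extension(args.load_extension)  # packed (.crx) extensions only
driver = webdriver.Chrome(options=opts)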

In crawler.py, NUM_SITES = 2000? Is this how many sites Privacy Badger currently trains on?

Can you all raise it to 100,000, a million, or as many as possible? It seems awfully low, given that users are not training their own copies of the extension against their own browsing habits.

The value of a tracker's tracking depends on each person's unique browsing habits. It makes sense to train on as many sites as possible rather than limiting the data set to the 2000 most commonly visited sites.

I can think of car dealerships, smaller storefronts, travel sites, and local news sites with loads of trackers that are not likely in the top 2000 sites in the world.

I just opened up all of my bookmarks and Privacy Badger learned a new tracking domain.
I'd also argue that simply visiting a website is not enough. If I owned a shopping store, my trackers might not be on the main page; I would want to track each product that was clicked, which may mean my trackers are on each product page. If the test emulated clicking on a few product pages, you might find a different set of trackers on each one.

Clear previously learned Badger data before running

We are not clearing pre-trained data right now, which means the data isn't getting properly refreshed between runs. One problem with this is that Badger will never forget about a tracking domain, regardless of whether it's still in use.
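One possible approach is to reset Badger's tracker data from an extension page before the crawl starts; a sketch, where the clearTrackerData call is an assumption about Privacy Badger's storage API and would need checking:

def clear_learned_data(driver, browser, ext_path):
    """Wipe previously learned tracker data so each crawl starts fresh."""
    load_extension_page(driver, browser, ext_path, OPTIONS)
    driver.execute_script("badger.storage.clearTrackerData();")  # assumed API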

How to run this against specific branches?

I am looking to start using this as a manual version of EFForg/privacybadger#1019. Could I do so without setting up my own instance? Also, the code is currently hardcoded to grab the latest Privacy Badger release.

For example, there should be a way for me to kick off a run against the EFForg/privacybadger#2024 branch and then compare results to those from master.

This may mostly be a documentation task for the internal Privacy Badger wiki.

Add an option to specifically exclude or include user-provided TLDs

We should add the ability to either exclude or include specific user-provided TLDs when we do the scan. That way, if we don't want to scan domains in certain TLDs, the user can configure that when they run the scan; or, if a user only wants to scan domains in a given TLD, they can do that instead.
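A sketch of what the filtering could look like (the function name and the last-label check are illustrative; a real implementation might want proper public-suffix handling for cases like .gov.in):

def filter_by_tld(domains, include_tlds=None, exclude_tlds=None):
    """Keep or drop domains based on their last label (e.g. 'com', 'gov')."""
    def tld(domain):
        return domain.rstrip(".").rsplit(".", 1)[-1].lower()
    if include_tlds:
        domains = [d for d in domains if tld(d) in include_tlds]
    if exclude_tlds:
        domains = [d for d in domains if tld(d) not in exclude_tlds]
    return domains

# e.g. filter_by_tld(domains, exclude_tlds={"gov", "mil", "edu"})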

Increase default timeout

Should we increase the default timeout from 10 to 20 seconds? This seems to significantly decrease the number of timed-out sites. For example, about 20% of sites from the most recent scheduled crawl errored, and roughly two thirds of those errors were timeouts. (Do we have any historical logs to see what the timeout rate generally is? That would be useful for gauging effectiveness; see the sketch below.)

Fewer sites timing out should mean more stable crawl results, which means more meaningful comparisons between runs.
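For gauging this from past runs, a small helper sketch that estimates the timeout rate from a crawl log, relying on the "visiting ..." and "timeout on ..." lines visible in the logs elsewhere on this page (visits are counted per domain while timeouts are per URL attempt, so treat the result as a rough estimate):

def timeout_rate(log_path="log.txt"):
    """Rough timeout rate: timed-out fetches divided by visit attempts."""
    visits = timeouts = 0
    with open(log_path) as f:
        for line in f:
            if " visiting " in line:
                visits += 1
            elif " timeout on " in line:
                timeouts += 1
    return timeouts / visits if visits else 0.0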
