

badger-sett's Issues

We lose snitch map entries for blocked domains after restarting the browser

We export and re-import Privacy Badger data after unexpected WebDriver exceptions as of 54cee3b, but data import currently loses snitch map entries for already-blocked (and cookieblocked?) domains. We don't have a dedicated issue for this problem yet, but it's mentioned in EFForg/privacybadger#1972.

Since this data import bug will now lead to incomplete crawl data, and fixing it might be relatively straightforward, we should definitely fix it.

Crawl beyond the home page

Currently, we only visit the home page of each domain in the scan. Many sites might have different kinds of trackers on different pages. We could modify the crawler to randomly "click" around on the different first-party links on each site it visits.
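A rough sketch of what that could look like with Selenium (assuming a recent Selenium API; the helper name and the same-site check here are illustrative, not the crawler's actual code):

import random
from urllib.parse import urlparse

from selenium.webdriver.common.by import By

def click_random_first_party_link(driver, domain):
    """Navigate to a randomly chosen same-site link on the current page."""
    links = []
    for el in driver.find_elements(By.TAG_NAME, "a"):
        href = el.get_attribute("href") or ""
        host = urlparse(href).hostname or ""
        if host == domain or host.endswith("." + domain):
            links.append(href)
    if links:
        # navigating directly is more robust than el.click() on hidden elements
        driver.get(random.choice(links))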

docker build failed - hardcoded username

The username and group for the chown command on privacybadger in the Dockerfile are hardcoded (as "bennett"), causing the Docker build to fail with the following:

Step 25/28 : COPY --chown=bennett:bennett privacybadger $PBPATH
unable to convert uid/gid chown string to host mapping: can't find uid for user bennett: no such user: bennett
Docker build failed.

Do we need another `git pull` after `git checkout`?

Without it, the newly checked-out branch could be out of date. Scenario:

  1. some-branch is checked out.
  2. git pull brings in updates to master.
  3. git checkout master because we want to run on master.
  4. Status is:
On branch master
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)

Could some additional settings improve resource usage during training?

Since we're just trying to observe certain types of information per domain, could any of these settings improve performance during training, or reduce the resources needed to get through it, without compromising which domains are caught?

Some candidates:
permissions.default.image = 2  -- disable images
browser.display.use_document_fonts = 0  -- disable document fonts
gfx.downloadable_fonts.enabled = false  -- disable downloadable fonts
dom.serviceWorkers.enabled = false  -- disable creation of service workers
media.peerconnection.enabled = false  -- disable WebRTC
security.OCSP.enabled = 0  -- disable querying OCSP for every certificate check
webgl.disabled = true  -- disable WebGL
layout.spellcheckDefault = 0  -- disable the spellchecker
network.dns.disablePrefetch = true  -- disable DNS prefetching, since we aren't clicking on everything during the scan

Curious about tightening these limits:
network.cookie.maxPerHost = 32
dom.workers.maxPerDomain = 48
network.http.max-connections = 48
javascript.options.mem.max = 16384
network.websocket.max-connections = 20

Another question: should we increase network.cookie.maxNumber?

Miscellaneous: anything matching network.*max or javascript.options that might make the browser hang on a page longer than it needs to in order to catch what needs to be caught.

We could also disable some media-related features, or lower the max sizes, retries, and timeouts of various things; I'm not sure. A sketch of wiring a few of the settings above into Selenium follows.
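If we wanted to experiment with these, here is a minimal sketch of setting a few of them via Selenium (assuming a recent Selenium and that the crawler builds its driver from FirefoxOptions; the exact integration point in crawler.py may differ):

from selenium import webdriver

RESOURCE_SAVING_PREFS = {
    "permissions.default.image": 2,           # disable images
    "browser.display.use_document_fonts": 0,  # disable document fonts
    "gfx.downloadable_fonts.enabled": False,  # disable downloadable fonts
    "media.peerconnection.enabled": False,    # disable WebRTC
    "webgl.disabled": True,                   # disable WebGL
    "layout.spellcheckDefault": 0,            # disable the spellchecker
    "network.dns.disablePrefetch": True,      # disable DNS prefetching
}

opts = webdriver.FirefoxOptions()
for pref, value in RESOURCE_SAVING_PREFS.items():
    opts.set_preference(pref, value)
driver = webdriver.Firefox(options=opts)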

Ideas for increasing scan efficiency (decreasing the error rate)

9bbeb45

Of the 6,000 sites tested, the log files show the scan erroring on 24% of the websites it tried. A summary of the error types:

UnexpectedAlertPresentException: 3
Reached error page: 1043
Reached Cloudflare security page: 65
Encountered unhandled user prompt dialog: 3
Timed out loading...: 311
NoSuchWindowException: 1
Likely bug: 1
InvalidSessionIdException: 1
InsecureCertificateException: 22

  1. One thing I noticed is that the scan excludes the .mil and .gov suffixes. I'd suggest also excluding .edu domains: while some have trackers, it's probably not worth training on them, since those trackers will already show up on other sites. Sites with the .org suffix might offer a small benefit to test, but not as strong as your typical .com site.

  2. Other countries may use .gov, .edu, or .mil earlier in their suffix. For example: epfindia.gov.in, sbv.gov.vn, conicet.gov.ar, nsw.gov.au.

  3. I still think more sites could be tested, as long as the scan can complete before the next scheduled scan starts. And they don't necessarily need to be the top 6,000 sites. Even a randomized sample of 20,000 of the top million sites, for example, might be more effective (see the sketch below). Imagine google.com and its country-specific sites are all in the top 6,000 websites visited: Google isn't a tracker-heavy site, but my local car dealership is probably in the top million and stuffed with trackers.
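For illustration, a small sketch of sampling 20,000 domains at random from a larger list file (the "domain" or "rank,domain" line format is an assumption about the list file, not a statement about how badger-sett currently reads its domain lists):

import random

def sample_domains(list_path, n=20000):
    """Randomly sample n domains from a top-million style list."""
    with open(list_path) as f:
        # accept either "domain" or "rank,domain" lines
        domains = [line.strip().split(",")[-1] for line in f if line.strip()]
    return random.sample(domains, min(n, len(domains)))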

Most action map entries are lost during restarts

Probably because load_user_data calls merge on action_map (which no longer copies action_map entries wholesale) but doesn't call merge on snitch_map (which is what would recreate the missing action_map entries):

badger-sett/crawler.py

Lines 192 to 200 in da047d2

def load_user_data(driver, browser, ext_path, data):
    load_extension_page(driver, browser, ext_path, OPTIONS)
    script = '''
data = JSON.parse(arguments[0]);
badger.storage.action_map.merge(data.action_map);
for (let tracker in data.snitch_map) {
    badger.storage.snitch_map._store[tracker] = data.snitch_map[tracker];
}'''
    driver.execute_script(script, json.dumps(data))

We should maybe call mergeUserData instead.
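A minimal sketch of that change, assuming badger.mergeUserData accepts the same parsed export object (whether it fully replaces the per-map handling above needs to be verified against Privacy Badger's code):

def load_user_data(driver, browser, ext_path, data):
    load_extension_page(driver, browser, ext_path, OPTIONS)
    # Delegate merging to Badger's own import routine instead of writing
    # to snitch_map._store directly.
    driver.execute_script(
        "badger.mergeUserData(JSON.parse(arguments[0]));", json.dumps(data))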

scan in docker failed - missing log file

When running ./runscan.sh I get the following error:

Running scan in Docker...
Ctrl-C to break.
Traceback (most recent call last):
  File "./crawler.py", line 620, in <module>
    crawler = Crawler(**vars(ap.parse_args()))
  File "./crawler.py", line 186, in __init__
    fh = logging.FileHandler(os.path.join(out_path, 'log.txt'))
  File "/usr/lib/python3.5/logging/__init__.py", line 1008, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python3.5/logging/__init__.py", line 1037, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: '/home/<local user>/out/log.txt'
mv: cannot stat '/home/<local user>/<local path to repo>/docker-out/log.txt': No such file or directory
Scan failed. See log.txt for details.

scan in docker failed - permission denied on home folder

When I run ./runscan.sh it fails with the following error:

Running scan in Docker...
Ctrl-C to break.
Traceback (most recent call last):
  File "./crawler.py", line 620, in <module>
    crawler = Crawler(**vars(ap.parse_args()))
  File "./crawler.py", line 186, in __init__
    fh = logging.FileHandler(os.path.join(out_path, 'log.txt'))
  File "/usr/lib/python3.5/logging/__init__.py", line 1008, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python3.5/logging/__init__.py", line 1037, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: '/home/<local user>/out/log.txt'
Scan failed. See log.txt for details.

Log all crawl parameters

We should log all parameters used for a crawl so that it's easy to know which crawl was run with which browser, and so on. Basically, expand the "starting new crawl with timeout 30 n_sites 10" message to span a few lines and include everything relevant, such as $BROWSER.
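A possible sketch, assuming the crawl arguments come from argparse as in the tracebacks elsewhere on this page (the function and logger names are placeholders):

def log_crawl_parameters(logger, args):
    """Log every crawl parameter so each log file records how it was produced."""
    logger.info("Starting new crawl with the following parameters:")
    for name, value in sorted(vars(args).items()):
        logger.info("  %s = %s", name, value)

# e.g. log_crawl_parameters(self.logger, ap.parse_args())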

docker build failed - missing domain-lists file

When running ./runscan.sh for the first time, I was getting the following error:

Step 23/28 : COPY domain-lists $HOME/domain-lists
COPY failed: stat /var/lib/docker/tmp/docker-builder530191829/domain-lists: no such file or directory
Docker build failed.

Add automated linting

Via Travis CI and GitHub Checks. Should we reuse the same Travis account we use for Privacy Badger?

We should probably use Prospector, as it bundles several tools with sensible defaults, although I haven't checked what's new in this space in a few years.

Failed to decode response from marionette

This happened at the end of a crawl. All sites following the "failed to decode response" message failed; we should catch this error and restart whatever needs to be restarted (a rough sketch follows the log below).

2018-06-15 00:52:11,847 visiting flavors.me
2018-06-15 00:52:21,768 visiting fastcodesign.com
2018-06-15 00:52:23,197 https://fastcodesign.com/ Failed to decode response from marionette
2018-06-15 00:52:23,197 trying http://fastcodesign.com/
2018-06-15 00:52:23,199 fastcodesign.com Tried to run command without establishing a connection
2018-06-15 00:52:23,199 visiting zippyshare.com
2018-06-15 00:52:23,203 https://zippyshare.com/ Tried to run command without establishing a connection
2018-06-15 00:52:23,203 trying http://zippyshare.com/
2018-06-15 00:52:23,204 zippyshare.com Tried to run command without establishing a connection.
...
2018-06-15 00:52:24,246 visiting realtor.com
2018-06-15 00:52:24,247 https://realtor.com/ Tried to run command without establishing a connection
2018-06-15 00:52:24,247 trying http://realtor.com/
2018-06-15 00:52:24,251 realtor.com Tried to run command without establishing a connection
2018-06-15 00:52:24,251 Scan complete. Saving data...
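A rough sketch of one way to handle it, assuming the crawler has (or grows) helpers for visiting a domain and relaunching the browser; the helper names here are placeholders, not existing methods:

from selenium.common.exceptions import WebDriverException

FATAL_MESSAGES = (
    "failed to decode response from marionette",
    "tried to run command without establishing a connection",
)

def visit_with_restart(crawler, domain):
    """Visit a domain; if the driver connection died, restart the browser and retry once."""
    try:
        crawler.visit_domain(domain)      # placeholder for the existing visit logic
    except WebDriverException as err:
        if any(msg in str(err).lower() for msg in FATAL_MESSAGES):
            crawler.restart_browser()     # placeholder: tear down and relaunch the driver
            crawler.visit_domain(domain)
        else:
            raise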

Support specifying browser version

When run via Docker now, we get the latest stable version of Firefox/Chrome/Edge. While this is good default behavior, we should also be able to specify a particular browser version (Chrome 103.0.5056.0, Firefox 100.0.1, ...), or at least a branch (Beta, Dev/Canary, ...).

This probably involves adding a script that figures out the appropriate Firefox/Chrome/WebDriver version (for example), and builds a custom image for the given browser and corresponding driver versions. We will then update runscan.sh to use the new script, and update the Dockerfile to inherit from the generated image (instead of the selenium/standalone-firefox (or -chrome or -edge) image from Docker Hub).

Running with Chrome is broken

It broke when I removed CRX2 creation from Badger's Makefile as part of the CRX3 switchover.

make: *** No rule to make target `travisbuild'.  Stop.
Traceback (most recent call last):
  File "./crawler.py", line 623, in <module>
    crawler.crawl()
  File "./crawler.py", line 373, in crawl
    self.start_browser()
  File "./crawler.py", line 320, in start_browser
    self.start_driver()
  File "./crawler.py", line 204, in start_driver
    build = subprocess.check_output(cmd).strip().decode('utf8').split()[-1]
  File "/usr/lib/python3.4/subprocess.py", line 620, in check_output
    raise CalledProcessError(retcode, process.args, output=output)
subprocess.CalledProcessError: Command '['make', '-sC', '/home/zzz/privacybadger/', 'travisbuild']' returned non-zero exit status 2

Since CRX2 is going away, the solution should be to load Privacy Badger from source ("unpacked" vs. packaged extension loading): EFForg/privacybadger@88c862c

Error loading extension page: Tried to run command without establishing a connection

Tried running a scan against master this morning; the scan failed with the following output:

...
2018-06-28 16:08:02,757 visiting 510: ihg.com
2018-06-28 16:08:09,800 visiting 511: oxfordjournals.org
2018-06-28 16:08:22,455 visiting 512: ucsd.edu
2018-06-28 16:08:31,176 visiting 513: news.google.com
2018-06-28 16:08:46,126 timeout on https://news.google.com/
2018-06-28 16:08:47,716 trying http://news.google.com/
2018-06-28 16:08:58,895 timeout on news.google.com
2018-06-28 16:08:59,501 visiting 514: theregister.co.uk
2018-06-28 16:09:09,631 visiting 515: duke.edu
2018-06-28 16:09:10,534 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:10,534 duke.edu SessionNotCreatedException: Tried to run command without establishing a connection
2018-06-28 16:09:10,535 visiting 516: scoop.it
2018-06-28 16:09:10,546 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:10,547 scoop.it SessionNotCreatedException: Tried to run command without establishing a connection
2018-06-28 16:09:10,548 visiting 517: wikidot.com
2018-06-28 16:09:10,551 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:10,552 wikidot.com SessionNotCreatedException: Tried to run command without establishing a connection
...
2018-06-28 16:09:17,296 visiting 2000: kissmetrics.com
2018-06-28 16:09:17,299 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:17,299 kissmetrics.com SessionNotCreatedException: Tried to run command without establishing a connection
2018-06-28 16:09:17,300 Finished scan. Getting data from browser storage...
2018-06-28 16:09:17,303 Error loading extension page: Tried to run command without establishing a connection
2018-06-28 16:09:17,303 Could not get badger storage.

Might be related to fixes for #8.

Recommendation on range and some filters

Of the top 2000 domains on the current Tranco list, .org, .gov, and .edu account for approximately 275 entries, or about 13.75% of the domains being tested.

These are domains for organizations, government, and military use (at least in the United States) and are likely to have significantly less potential for commercial tracking. More broadly, the top 2000 sites are not going to be full of trackers (e.g. google, youtube, facebook, microsoft, wikipedia, twitter, pinterest, amazon, netflix, vimeo, wordpress, github, windowsupdate, etc.). How many third-party domains do these sites have? Not many.

I suggest training further down the list: say, 4000 through 10000 (or more), excluding any domains that end in .org, .edu, or .gov, or that contain .gov. in the middle (the last rule to exclude government sites from other countries).

Or, better yet, use a list of sites in certain categories (health/beauty/medical, crafts and hobbies, food, music, movies, entertainment, blogs, news, etc.)

One resource that does categorize sites is the popular Shalla list: http://www.shallalist.de/categories.html

Privacy Badger's default data is pretty small and should pick up your local newspapers and game sites... in short, the top sites do the tracking on the bottom sites to get you to visit the top sites.

Check this: https://webcookies.org/number-of-cookies

google.com: 2 cookies

Compare that to any of these sites:

www.ogaracoach.com
http://www.10greatlines.com/
https://newsinfo.inquirer.net
https://www.favecrafts.com
www.ibtimes.co.uk

Log Privacy Badger branch and commit hash

Branch for human readability and hash for exactness. Logging the hash will make it possible to know precisely which Privacy Badger code was exercised by the scan.
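A sketch of how that could look, assuming the path to the Privacy Badger checkout is available to the crawler (pb_path is a placeholder):

import subprocess

def log_pb_version(logger, pb_path):
    """Log the Privacy Badger branch name and commit hash used for this scan."""
    branch = subprocess.check_output(
        ["git", "-C", pb_path, "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()
    commit = subprocess.check_output(
        ["git", "-C", pb_path, "rev-parse", "HEAD"]).decode().strip()
    logger.info("Privacy Badger branch %s, commit %s", branch, commit)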

Notify devs (via email) (from Cron) upon failures

Probably easiest to notify via emails from Cron. I think Cron emails on any output by default. We can write any failures to stderr from the script and then silence regular stdout and send stderr to stdout (... 2>&1 >/dev/null) in the crontab.

Related to #19.

What is the DNS Server to use?

Hello,
since you introduced PyFunceble into #23 (note that #21 might be closed), followed by the merge of the maintenance PRs I submitted (#39, #40), I'm now wondering which DNS server you recommend or use?

I'm mainly asking so I can write a PR: since the latest version of PyFunceble (2.0.0), which I released yesterday (Berlin time), it's possible to set a custom DNS server for the DNS lookups PyFunceble does.

For our implementation in this project, the following (pseudo) line:

"dns_server": ["first_dns_server", "second_dns_server"]

might be added to this dictionary:

OUR_PYFUNCEBLE_CONFIG = {"share_logs": False}

As a side note, I dropped log sharing by default, so "share_logs": False might not be needed; we can keep it, though, in case we choose to use PyFunceble < 2.0.0 instead of the newly released version.
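Putting the two snippets together, the change would look something like this (the DNS server entries are placeholders, and whether "share_logs" stays depends on which PyFunceble version we pin):

OUR_PYFUNCEBLE_CONFIG = {
    "share_logs": False,  # the default in PyFunceble >= 2.0.0, kept for older versions
    "dns_server": ["first_dns_server", "second_dns_server"],  # placeholder values
}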

Have a nice day/night.

Cheers,
Nissar

Docker not working

When I try to run it I get this error:

ERROR: failed to solve: failed to parse stage name "selenium/standalone-/vscode/bin/linux-x64/1a5daa3a0231a0fbba4f14db7ec463cf99d7768e/bin/helpers/browser.sh": invalid reference format
Docker build failed.

Do we have a race condition with dump_data?

Do we have a race condition with dump_data()? We load Badger's background page and query its storage. Whenever we load the background page, we reinitialize another copy of background.js, with everything that entails. What if Badger's storage isn't ready yet? Basically, if we depend on anything asynchronous, it may not be ready yet.

If this is a problem, it may only show up in Chrome (the faster browser).

Compare to the waiting logic in Privacy Badger's tests.
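If it turns out to be a problem, one possible mitigation is to poll the background page until Badger reports that initialization has finished before dumping data. A sketch, where the badger.INITIALIZED flag is an assumption about Privacy Badger's internals and would need to be verified:

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_badger(driver, timeout=10):
    """Block until the extension's asynchronous initialization appears to be done."""
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script(
            "return typeof badger != 'undefined' && !!badger.INITIALIZED"))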

Allow loading of arbitrary extensions

Now that we can run scans to evaluate the effect of (strict) Tracking Protection on Privacy Badger's learning (c6c932e), it would be nice to also have the ability to load arbitrary other extensions to make similar comparisons. So, a --load-extension flag.

Loading packed extensions in Chrome should be simple with opts.add_extension(PATH_TO_EXTENSION_CRX). Firefox is probably more complicated, so I suggest getting this working on Chrome first.
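A sketch of the Chrome side, following the opts.add_extension() suggestion above (the flag name and paths are placeholders, not an existing interface):

import argparse
from selenium import webdriver

ap = argparse.ArgumentParser()
ap.add_argument("--load-extension",
                help="path to a packed .crx to load alongside Privacy Badger")
args = ap.parse_args()

opts = webdriver.ChromeOptions()
if args.load_extension:
    opts.add_extension(args.load_extension)  # packed (.crx) extensions only
driver = webdriver.Chrome(options=opts)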

In crawler.py, NUM_SITES = 2000? Is this how many sites Privacy Badger currently trains on?

Can you all raise it to 100,000, a million, or as many as possible? It seems awfully low, given that users are not training their own copies of the extension against their own browsing habits.

The value of a tracker's tracking depends on each person's unique browsing habits. It makes sense to train on as many sites as possible rather than limiting the data set to the 2000 most commonly visited sites.

I can think of car dealerships, smaller storefronts, travel sites, and local news sites with loads of trackers that are not likely in the top 2000 sites in the world.

I just opened up all of my bookmarks and Privacy Badger learned a new tracking domain.
I'd also argue that simply visiting a website is not enough. If I owned a shopping store, my trackers might not be on the main page; I would want to track each product that was clicked, which may mean my trackers are on each product page. If the test emulated clicking on a few product pages, you might find a different set of trackers on each one.

Clear previously learned Badger data before running

We are not clearing pre-trained data right now, which means the data isn't getting properly refreshed between runs. One problem with this is that Badger will never forget about a tracking domain, regardless of whether it's still in use.
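One possible approach is to reset Badger's tracker data from an extension page before the crawl starts; a sketch, where the clearTrackerData call is an assumption about Privacy Badger's storage API and would need checking:

def clear_learned_data(driver, browser, ext_path):
    """Wipe previously learned tracker data so each crawl starts fresh."""
    load_extension_page(driver, browser, ext_path, OPTIONS)
    driver.execute_script("badger.storage.clearTrackerData();")  # assumed API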

How to run this against specific branches?

I am looking to start using this as a manual version of EFForg/privacybadger#1019. Could I do so without setting up my own instance? Also, the code is currently hardcoded to grab the latest Privacy Badger release.

For example, there should be a way for me to kick off a run against the EFForg/privacybadger#2024 branch and then compare results to those from master.

This may mostly be a documentation task for the internal Privacy Badger wiki.

Add an option to specifically exclude or include user-provided TLDs

We should add the ability to either exclude or include specific user-provided TLDs when we do the scan. That way, if we don't want to scan domains in certain TLDs, the user can configure that when they run the scan; or, if a user only wants to scan domains in a given TLD, they can do that instead.
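A sketch of what the filtering could look like (the function name and the last-label check are illustrative; a real implementation might want proper public-suffix handling for cases like .gov.in):

def filter_by_tld(domains, include_tlds=None, exclude_tlds=None):
    """Keep or drop domains based on their last label (e.g. 'com', 'gov')."""
    def tld(domain):
        return domain.rstrip(".").rsplit(".", 1)[-1].lower()
    if include_tlds:
        domains = [d for d in domains if tld(d) in include_tlds]
    if exclude_tlds:
        domains = [d for d in domains if tld(d) not in exclude_tlds]
    return domains

# e.g. filter_by_tld(domains, exclude_tlds={"gov", "mil", "edu"})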

Increase default timeout

Should we increase the default timeout from 10 to 20 seconds? This seems to significantly decrease the number of timed-out sites. For example, about 20% of sites from the most recent scheduled crawl errored, and roughly two thirds of those errors were timeouts. (Do we have any historical logs to see what the timeout rate generally is? That would be useful for gauging effectiveness; see the sketch below.)

Fewer sites timing out should mean more stable crawl results, which means more meaningful comparisons between runs.
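For gauging this from past runs, a small helper sketch that estimates the timeout rate from a crawl log, relying on the "visiting ..." and "timeout on ..." lines visible in the logs elsewhere on this page (visits are counted per domain while timeouts are per URL attempt, so treat the result as a rough estimate):

def timeout_rate(log_path="log.txt"):
    """Rough timeout rate: timed-out fetches divided by visit attempts."""
    visits = timeouts = 0
    with open(log_path) as f:
        for line in f:
            if " visiting " in line:
                visits += 1
            elif " timeout on " in line:
                timeouts += 1
    return timeouts / visits if visits else 0.0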
