toripscanner's Introduction

Tor IP Scanner

There are many projects out there that produce a list of Tor exit IPs, but this one is the best. For us. Definitely. This is useful.

This scanner is better than all the rest because it does everything it can to find as many of the IPs that exits may use as possible:

  • It records ORPort IPs (both v4 and v6) from the server descriptors of all relays that can exit to OFTC infrastructure, regardless of whether or not they have the Exit flag.

  • Through the relays that can connect to OFTC's user-facing ircds, we build circuits and connect to an ircd in order to get it to report to us what hostname/IP we are coming from. If we get a hostname, we look up its A and AAAA records.

  • For all relays that can connect to OFTC infrastructure, we see if the IPs in their descriptors have rDNS entries, and if so, we look up the A and AAAA records of the resulting hostnames.

To be explicit: "OFTC infrastructure" includes our user-facing ircds and our web irc client. For relays that can only exit to our web IRC client, we only check their descriptors and do DNS queries.

Tech

  • Tor
  • Stem
  • Python 3.7

Install

This will install Tor IP Scanner and its dependencies into a virtualenv suitable for development work.

$ cd to/this/directory
$ python3 -m venv venv
$ . venv/bin/activate
$ pip install -U pip
$ pip install -e .[dev]

Using

Start the scanner with toripscanner scan. It runs in the foreground and stays running forever, periodically scanning new relays.

Periodically run toripscanner parse data/results/* to parse the scanner's results into a plaintext list of IPv4/6 addresses. The command only uses "recent" results even if the input files it reads contain old results.

toripscanner's People

Contributors

pastly

Forkers

zyguy

toripscanner's Issues

Try both IPv4 and IPv6 for all (?) relays

Right now we connect directly to a hardcoded IP of one of OFTC's servers.

We should try connecting to both irc4.oftc.net and irc6.oftc.net.

When doing this, we need to be careful not to do it back-to-back, as the ircd will complain that you're reconnecting too quickly.
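One way to avoid back-to-back connections is a small guard that remembers when each exit was last used. This is only a sketch: `ReconnectGuard`, the `exit_fp` key, and the gap length are made up for illustration, and the real limit the ircd enforces is not stated here.

```python
import time


class ReconnectGuard:
    """Enforce a minimum gap between connections made through the same exit."""

    def __init__(self, min_gap, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap  # seconds; the right value is an assumption
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # exit fingerprint -> time of its last connection

    def wait(self, exit_fp):
        """Block until `exit_fp` may be used again, then record the use."""
        now = self._clock()
        last = self._last.get(exit_fp)
        if last is not None and now - last < self.min_gap:
            self._sleep(self.min_gap - (now - last))
            now = self._clock()
        self._last[exit_fp] = now
```

Calling `guard.wait(fingerprint)` before each of the irc4 and irc6 attempts would space them out. Blocking is the lazy version; a real scanner would probably measure some other relay in the meantime instead of sleeping.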

First measurement after startup succeeds weirdly

For example:

[2021-07-29 16:36:51,376] [INFO] [MainThread] [scan] [[email protected]:214] Measuring 1DBACC31486FC670FBD403FAE877342EC696D598. 707 relays remain
[2021-07-29 16:36:51,378] [DEBUG] [MainThread] [test_utils] [get_good_relays@test_utils.py:154] 7 good-relays seem to be running.
[2021-07-29 16:36:52,621] [DEBUG] [MainThread] [scan] [[email protected]:154] Will measure 1DBACC31486FC670FBD403FAE877342EC696D598 on 8 [Piratenpartei00 (1CD48F4E) -> hviv128 (1DBACC31)]
[2021-07-29 16:36:52,633] [DEBUG] [Event notifier] [test_utils] [closure_stream_event_listener@test_utils.py:166] Attaching stream 16 to circ 8 [Piratenpartei00 (1CD48F4E) -> hviv128 (1DBACC31)]
[2021-07-29 16:36:52,633] [WARNING] [Event notifier] [test_utils] [closure_stream_event_listener@test_utils.py:171] Couldn't attach stream to circ 8: Connection is not managed by controller.

Idk why stem doesn't think circuit 8 is managed by us. We just built it!

I think the confusion/bug must come from Tor still building preemptive circuits at this point.

A hacky but easy fix would be to sleep for like 10 seconds after bootstrapping but before doing anything real.

ISTR a torrc option to disable preemptive circuits. Try it. But make sure Tor can still bootstrap with and without an existing data directory.

Look into why schedule_new_relays takes so long and if it can be improved

A wild guess of mine is that it takes a lot of time to keep checking exit policies. If the relay doesn't allow exiting to, say, 6667, we have to go through all the servers w/port 6667 to conclude that.

It would be nice if we could check *:6667 and if it is not allowed, just skip that relay. But it needs to be checked that we won't miss a relay that allows 6667 to only a small part of the internet. I bet we would miss it. So idk if this will actually work.

(The above description oversimplifies because there are multiple ports we check)

Anyway, look into why the func takes so long and look into what is taking so long. Maybe my guess is wrong.
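If the guess about exit policies is right, a cheap pre-filter might help without missing relays that allow a port to only part of the internet. The sketch below only answers "can this relay reach nothing at all on this port?" and says "maybe" in every other case, so the full per-server check still runs for those. It works on plain rule strings, not on Stem's policy objects, and the function names are invented for this example.

```python
def port_in_spec(port, spec):
    """Return True if `port` falls inside a port spec: '*', 'N', or 'N-M'."""
    if spec == '*':
        return True
    low, _, high = spec.partition('-')
    return int(low) <= port <= int(high or low)


def definitely_rejects_port(rules, port):
    """Return True only if no destination at all can be reached on `port`.

    `rules` is an ordered list of strings such as 'accept 1.2.3.0/24:6667'
    or 'reject *:*'. The first matching rule wins, so we walk them in order.
    """
    for rule in rules:
        action, target = rule.split()
        address, _, port_spec = target.rpartition(':')
        if not port_in_spec(port, port_spec):
            continue  # This rule says nothing about our port.
        if action == 'accept':
            return False  # Some address may be allowed: do the full check.
        if address == '*':
            return True  # Every address is rejected on this port from here on.
        # A reject for a specific address range: keep looking.
    return False  # No catch-all reject reached; treat as "maybe".
```

A relay with `accept 1.2.3.0/24:6667` followed by `reject *:*` comes back as "maybe" and is not skipped, which is the case the worry above is about. Only the literal `*` address is treated as a catch-all, which is conservative.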

Avoid a thundering herd

If we need to measure 90% of the relays in the network, we schedule ourselves to do so ASAP. Thus they will all need measuring again at roughly the same time. So every few days we'll do a ton of work, and otherwise do very little.

We should spread out the load we create.

As an easy hack for my relayscanner, I didn't let more than 200 relays get queued up for testing at a time. Assuming you can do >=200 relays before the next consensus, this spread out the measurements across ~half a day. This still means there are busy days and not busy days.

Perhaps a better idea would be to have each relay's measurement expire at a random time X..Y days from now. After the first huge all-at-once scan, the ~entire network will next be measured slowly over the course of a few days.
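The random-expiry idea could look roughly like this. The function name and the 2..4 day window used in the note below are placeholders; the text above leaves X and Y open.

```python
import random

DAY = 24 * 60 * 60  # seconds


def next_measurement_time(now, min_days, max_days, rng=random):
    """Pick when a just-measured relay should be measured again.

    Returns a timestamp drawn uniformly from [now + min_days, now + max_days],
    so relays measured together do not all expire together.
    """
    return now + rng.uniform(min_days, max_days) * DAY
```

For example, `next_measurement_time(time.time(), 2, 4)` would give each relay its own expiry somewhere between two and four days out.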

Also scan 443: We miss exits that can connect to the web IRC client

I skimmed my #tor backlog, found an obvious exit (~[email protected]), and found the (or a) relay related to that address.

There are >50 IPv6 addresses we're missing with just this one hostname:

$ dig tor-exit-anonymizer-v6.appliedprivacy.net AAAA | grep AAAA | wc -l
59

The scanner missed this relay. It doesn't allow IRC ports 6667/6697, but it does allow 443.

We should also scan exits that allow 443 to our web IRC client.

How to do the scan

Option 1: Request to IP checking website

I think this would work in practice 95-100% of the time. I don't think it's likely an exit would allow exiting to the IP checking service but not to our web IRC client (or vice versa).

This sucks because we now have another scan type to implement/maintain. Not hard, but not ideal.

Option 2: Run IP-getting service ourselves

For best results it should be on the same IP(s) as the web IRC client.

I know it is dead simple to set up with plain nginx. Don't need anything else: not even files.

Again, this sucks because we now have another scan type to implement/maintain. Not hard, but not ideal.

Option 3: Don't do an active scan

Just take the IPs from the server descriptor. Stay with me here as I move on to the next section. I think we'll still get reasonable results.

What to do with the IP we get

We can discover more IPs by doing a couple DNS requests. Watch:

  1. Get 109.70.100.9 and 2a03:e600:100::9 from the relay's server descriptor.
  2. If possible, get the hostname for these IPs:
>>> socket.gethostbyaddr('2a03:e600:100::9')
('tor-exit-anonymizer-v6.appliedprivacy.net', [], ['2a03:e600:100::9'])
  3. Get all IPs for that hostname:
>>> len(socket.getaddrinfo('tor-exit-anonymizer-v6.appliedprivacy.net', 443, proto=socket.IPPROTO_TCP))
57
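The three steps above can be strung together like this. It is a sketch, not the scanner's code: `expand_ips` is a made-up name, and the `reverse`/`forward` parameters exist only so the lookups can be swapped out (for testing, or for a different resolver).

```python
import socket


def expand_ips(ips, reverse=socket.gethostbyaddr, forward=socket.getaddrinfo):
    """Return `ips` plus every address their rDNS hostnames resolve to."""
    found = set(ips)
    for ip in ips:
        try:
            hostname = reverse(ip)[0]
        except OSError:
            continue  # No rDNS entry (or the lookup failed): nothing to add.
        try:
            results = forward(hostname, None, proto=socket.IPPROTO_TCP)
        except OSError:
            continue  # Hostname does not resolve forward.
        # Each result is (family, type, proto, canonname, sockaddr) and the
        # address string is always the first item of sockaddr.
        found.update(result[4][0] for result in results)
    return found
```

Both `socket.herror` and `socket.gaierror` are subclasses of `OSError`, so one `except` covers either kind of lookup failure.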

Solve the mystery of ::84.209.139.0

This relay somehow resulted in both ::84.209.139.0 and 84.209.139.0

https://metrics.torproject.org/rs.html#details/AC717A01B8E3C00E7617EF65117A4E99C02DC7A0

This script shows me the only way I can get the '::' prefix is by supplying it myself.

#!/usr/bin/env python3
import socket
def f(host: str):
    for ret in socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP):
        print(ret)

if __name__ == '__main__':
    f('cm-84.209.139.0.getinternet.no')
    print('---')
    f('84.209.139.0')
    print('---')
    f('::84.209.139.0')

$ python3 a.py
(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('84.209.139.0', 0))
---
(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('84.209.139.0', 0))
---
(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::84.209.139.0', 0, 0, 0))
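This doesn't explain where the '::' prefix comes from, but until that is known the output could at least be cleaned up. An address like ::84.209.139.0 is an IPv6 address whose upper 96 bits are zero, so the IPv4 address sits in the low 32 bits. The helper below (name invented here) spots that shape with the standard `ipaddress` module.

```python
import ipaddress


def embedded_ipv4(text):
    """Return the IPv4 address hidden in '::a.b.c.d', or None if there is none."""
    addr = ipaddress.ip_address(text)
    if addr.version != 6:
        return None
    value = int(addr)
    # Upper 96 bits all zero means only the last 32 bits carry information.
    # Skip '::' (unspecified) and '::1' (loopback), which also fit that shape.
    if value >> 32 == 0 and value > 1:
        return ipaddress.IPv4Address(value)
    return None
```

Anything for which this returns an address is almost certainly a mangled IPv4 address and could be dropped or folded into its IPv4 form before being written to the list.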

Also test relays that allow 6697

We only test one destination on one port, 6667. Perhaps this will be taken care of by other work, but if not:

Look for relays that allow exiting to 6697. Make sure we scan them too.

We're trying IPv6 addresses on exits that don't support it

Apparently the exit policy can indicate that a relay accepts IPv6 traffic even when the exit doesn't actually support IPv6.

Be smarter about when to try IPv6.

Tons of exits don't support it, so even with a relatively short 15 second timeout, we're looking at up to 4 hours to scan the entire network. Before closing #1, we were at 90 minutes.

Exclude private IPs

We're including 127.0.0.1, and I'm going to deploy a hack to exclude it, but all private IPs should be excluded.
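A possible general fix, sketched with the standard `ipaddress` module. `public_only` is a made-up name, and using `is_global` as the test is a choice made here: it drops loopback, RFC 1918, link-local, and similar ranges for both IPv4 and IPv6.

```python
import ipaddress


def public_only(ips):
    """Keep only the addresses that are globally routable, preserving order."""
    return [ip for ip in ips if ipaddress.ip_address(ip).is_global]
```

This could run as a last step right before the plaintext list is written, so every source of IPs gets filtered the same way.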

Check all irc ports, not just 6667 and 6697

Test all the publicly documented ports on https://oftc.net/

ircs://irc.oftc.net:6697 for SSL (alternative port: 9999), IPv4 and IPv6.
irc://irc.oftc.net:6667 for non-SSL (alternative ports: 6668-6670, 7000), IPv4 and IPv6.

I worry about the impact on the time it takes to run schedule_new_relays(). Measure the before/after and judge whether the extra coverage, if any, is worth it. This function is run every 1-2 hours, and takes 5s for me on my desktop.
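For the before/after measurement, something as small as this would do. The port lists just restate the ports quoted above; `timed` is a throwaway helper, not part of the scanner.

```python
import time

# Ports listed on https://oftc.net/, as quoted above.
SSL_PORTS = [6697, 9999]
PLAIN_PORTS = [6667, 6668, 6669, 6670, 7000]


def timed(func, *args, **kwargs):
    """Call `func` and return (its result, seconds it took)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping the scheduling call once with the current port list and once with `SSL_PORTS + PLAIN_PORTS` gives the two numbers to compare against the 5s baseline.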
