
domain-scan's Introduction


A simple scanning system for the cloud

A lightweight scan pipeline for orchestrating third party tools, at scale and (optionally) using serverless infrastructure.

The point of this project is to make it easy to coordinate and parallelize third party tools with a simple scanning interface that produces consistent and database-agnostic output.

Outputs aggregate CSV for humans and machines, and detailed JSON for machines.

Can scan websites and domains for data on their HTTPS and email configuration, third party service usage, accessibility, and other things. Adding new scanners is relatively straightforward.

All scanners can be run locally using native Python multi-threading.

Some scanners can be executed inside Amazon Lambda for much higher levels of parallelization.

Most scanners work by using specialized third party tools, such as SSLyze or trustymail. Each scanner in this repo is meant to add the smallest wrapper possible around the responses returned from these tools.

There is also built-in support for using headless Chrome to efficiently measure sophisticated properties of web services. Especially powerful when combined with Amazon Lambda.

Requirements

domain-scan requires Python 3.6 or 3.7.

To install core dependencies:

pip install -r requirements.txt

You can install scanner- or gatherer-specific dependencies yourself. Or, you can "quick start" by just installing all dependencies for all scanners and/or all gatherers:

pip install -r requirements-scanners.txt
pip install -r requirements-gatherers.txt

If you plan on developing/testing domain-scan itself, install development requirements:

pip install -r requirements-dev.txt

Usage

Scan a domain. You must specify at least one "scanner" with --scan.

./scan whitehouse.gov --scan=pshtt

Scan a list of domains from a CSV. The CSV's header row will be ignored if the first cell starts with "Domain" (case-insensitive).

./scan domains.csv --scan=pshtt
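The first column of the CSV should hold the domain. An illustrative domains.csv (the hostnames are placeholders):

Domain
whitehouse.gov
gsa.gov
18f.gsa.gov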

Run multiple scanners on each domain:

./scan whitehouse.gov --scan=pshtt,sslyze

Append columns to each row with metadata about the scan itself, such as how long each individual scan took:

./scan example.com --scan=pshtt --meta

Scanners

Parallelization

It's important to understand that scans run in parallel by default, and data is streamed to disk immediately after each scan is done.

This makes domain-scan fast, as well as memory-efficient (the entire dataset doesn't need to be read into memory), but the order of result data is unpredictable.

By default, each scanner will spin up 10 parallel threads. You can override this value with --workers. To disable this and run sequentially through each domain (1 worker), use --serial.

If row order is important to you, either disable parallelization, or use the --sort parameter to sort the resulting CSVs once the scans have completed. (Note: Using --sort will cause the entire dataset to be read into memory.)
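For example, to scan a large list with a bigger worker pool and sort the results afterward (the worker count here is arbitrary):

./scan domains.csv --scan=pshtt --workers=50 --sort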

Lambda

The domain-scan tool can execute certain compatible scanners in Amazon Lambda, instead of locally.

This can allow the use of hundreds of parallel workers, and can speed up large scans by orders of magnitude. (Assuming that the domains you're scanning are disparate enough to avoid DDoS-ing any particular service!)

See docs/lambda.md for instructions on configuring scanners for use with Amazon Lambda.

Once configured, scans can be run in Lambda using the --lambda flag, like so:

./scan example.com --scan=pshtt,sslyze --lambda

Headless Chrome

This tool has some built-in support for instrumenting headless Chrome, both locally and inside of Amazon Lambda.

Install a recent version of Node (using a user-space version manager such as nvm or nodeenv is recommended).

Then install dependencies:

npm install

Chrome-based scanners use Puppeteer, a Node-based wrapper for headless Chrome that is maintained by the Chrome team. This means that Chrome-based scanners make use of Node, even while domain-scan itself is instrumented in Python. This makes initial setup a little more complicated.

  • During local scans, Python will shell out to Node from ./scanners/headless/local_bridge.py by executing ./scanners/headless/local_bridge.js, which expects /usr/bin/env node to resolve to a usable Node executable. The data sent into the Node scanner, including the original CLI options and environment data, is passed as a serialized JSON string in a CLI parameter, and the Node scanner returns data back to Python by emitting JSON over STDOUT (see the sketch after this list).

  • During Lambda scans, local execution remains exclusively in Python, and Node is never used locally. However, the Lambda function itself is expected to be in the node6.10 runtime, and uses a special Node-based Lambda handler in lambda/headless/handler.js for this purpose. There is a separate lambda/headless/deploy script for the building and deployment of Node/Chrome-based Lambda functions.
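Conceptually, the local Python-to-Node handoff is simple: serialize the inputs to JSON, pass that string as a CLI argument, run the Node script, and parse the JSON it prints. Below is a minimal sketch of that pattern; it is illustrative only, not the actual scanners/headless/local_bridge.py, and the exact argument layout is an assumption.

import json
import subprocess

def scan_with_node(domain, environment, options):
    # Serialize everything the Node scanner needs into one JSON string.
    payload = json.dumps({
        "domain": domain,
        "environment": environment,
        "options": options,
    })

    # Shell out to Node (resolved via /usr/bin/env node) with the JSON
    # payload as a CLI argument; the Node side prints its results as JSON.
    result = subprocess.run(
        ["node", "./scanners/headless/local_bridge.js", payload],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )

    # Whatever the Node scanner emitted on STDOUT becomes the result dict.
    return json.loads(result.stdout)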

It is recommended to use Lambda in production for Chrome-based scanners -- not only for the increased speed, but because they use a simpler and cleaner method of cross-language communication (the HTTP-based function call to Amazon Lambda itself).

Support for running headless Chrome locally is intended mostly for testing and debugging with fewer moving parts (and without risk of AWS costs). Lambda support is the expected method for production scanning use cases.

See below for how to structure a Chrome-based scanner.

See docs/lambda.md for how to build and deploy Lambda-based scanners.

Options

General options:

  • --scan - Required. Comma-separated names of one or more scanners.
  • --sort - Sort result CSVs by domain name, alphabetically. (Note: this causes the entire dataset to be read into memory.)
  • --serial - Disable parallelization, force each task to be done sequentially (one at a time). Helpful for testing and debugging.
  • --debug - Print out more stuff. Useful with --serial.
  • --workers - Limit parallel threads per-scanner to a number.
  • --output - Where to output the cache/ and results/ directories. Defaults to ./.
  • --cache - Use previously cached scan data to avoid scans hitting the network where possible.
  • --suffix - Add a suffix to all input domains. For example, a --suffix of virginia.gov will add .virginia.gov to the end of all input domains.
  • --lambda - Run certain scanners inside Amazon Lambda instead of locally. (See the Lambda instructions for how to use this.)
  • --lambda-profile - When running Lambda-related commands, use a specified AWS named profile. Credentials/config for this named profile should already be configured separately in the execution environment.
  • --meta - Append some additional columns to each row with information about the scan itself. This includes start/end times and durations, as well as any encountered errors. When also using --lambda, additional Lambda-specific information will be appended.
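As a combined illustration (the suffix and path are made up), the following appends .virginia.gov to each input domain, writes the cache/ and results/ directories under /tmp/domain-scan, and records scan metadata:

./scan domains.csv --scan=pshtt --suffix=virginia.gov --output=/tmp/domain-scan --meta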

Output

All output files are placed into cache/ and results/ directories, whose location defaults to the current directory (./). Override the output home with --output.

  • Cached full scan data about each domain is saved in the cache/ directory, named after each scan and each domain, in JSON.

Example: cache/pshtt/whitehouse.gov.json

  • Formal output data about all domains is saved in the results/ directory in CSV form, named after each scan.

Example: results/pshtt.csv


It's possible for scans to save multiple CSV rows per-domain. For example, the a11y scan will have a row with details for each detected accessibility error.

  • Scan metadata with the start time, end time, and scan command will be placed in the results/ directory as meta.json.

Example: results/meta.json
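Putting that together, a run of ./scan whitehouse.gov --scan=pshtt with the default --output leaves a layout roughly like this:

cache/
  pshtt/
    whitehouse.gov.json
results/
  pshtt.csv
  meta.json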

Using with Docker

If you're using Docker Compose, run:

docker-compose up

(You may need to use sudo.)

To scan, prefix commands with docker-compose run:

docker-compose run scan <domain> --scan=<scanner>

Gathering hostnames

This tool also includes a facility for gathering domain names that end in one or more given suffixes (e.g. .gov or yahoo.com or .gov.uk) from various sources.

By default, the gatherer only fetches third-level and higher domains (second-level domains are excluded).

Usage:

./gather [source] [options]

Or gather hostnames from multiple sources separated by commas:

./gather [source1,source2,...,sourceN] [options]

Right now there's one specific source (Censys.io), and then a general way of sourcing URLs or files by whatever name is convenient.

Censys.io - The censys gatherer uses data from Censys.io, which has hostnames gathered from observed certificates, through the Google BigQuery API. Censys provides certificates observed from a nightly zmap scan of the IPv4 space, as well as certificates published to public Certificate Transparency logs.

Remote or local CSV - By using any other name besides censys, this will define a gatherer based on an HTTP/HTTPS URL or local path to a CSV. Its only option is a flag named after itself. For example, using a gatherer name of dap will mean that domain-scan expects --dap to point to the URL or local file.

Hostnames found from multiple sources are deduped, and filtered by suffix or base domain according to the options given.

The resulting gathered.csv will have the following columns:

  • the hostname
  • the hostname's base domain
  • one column for each checked source, with a value of True/False based on the hostname's presence in each source
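For example, a gathered.csv built from the censys and dap sources might look roughly like this (the header spellings are illustrative, but the column scheme follows the list above):

Domain,Base Domain,censys,dap
www.example.gov,example.gov,True,True
staging.example.gov,example.gov,True,False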

See specific usage examples below.

General options:

  • --suffix: Required. One or more suffixes to filter on, separated by commas as necessary. (e.g. .gov or .yahoo.com or .gov,.gov.uk)
  • --parents: A path or URL to a CSV whose first column is second-level domains. Any subdomain not contained within these second-level domains will be excluded.
  • --include-parents: Include second-level domains. (Defaults to false.)
  • --ignore-www: Ignore the www. prefixes of hostnames. If www.staging.example.com is found, it will be treated as staging.example.com.
  • --debug: display extra output

censys: Data from Censys.io via Google BigQuery

Gathers hostnames from Censys.io via the Google BigQuery API.

Before using this, you need to:

  • Create a Project in Google Cloud, and an associated service account with access to create new jobs/queries and get their results.
  • Give this Google Cloud service account to Censys.io, so that they can grant it access.

For details on these concepts, and on how to test access in the web console, see the Censys.io and Google BigQuery documentation.

Note that the web console access is based on access given to a Google account, but BigQuery API access via this script depends on access given to Google Cloud service account credentials.

To configure access, set one of two environment variables:

  • BIGQUERY_CREDENTIALS: JSON data that contains your Google BigQuery service account credentials.
  • BIGQUERY_CREDENTIALS_PATH: A path to a file with JSON data that contains your Google BigQuery service account credentials.
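For example, to point the gatherer at a service account key file on disk (the path is hypothetical):

export BIGQUERY_CREDENTIALS_PATH=/path/to/service-account.json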

Options:

  • --timeout: Override the 10 minute job timeout (specify in seconds).
  • --cache: Use locally cached data instead of hitting BigQuery.

Example:

Find hostnames ending in either .gov or .fed.us from within Censys.io's certificate database

./gather censys --suffix=.gov,.fed.us

Gathering Usage Examples

To gather .gov hostnames from Censys.io:

./gather censys --suffix=.gov --debug

To gather .gov hostnames from a hosted CSV, such as one from the Digital Analytics Program:

./gather dap --suffix=.gov --dap=https://analytics.usa.gov/data/live/sites-extended.csv

Or to gather federal-only .gov hostnames from Censys' API, a remote CSV, and a local CSV:

./gather censys,dap,private --suffix=.gov --dap=https://analytics.usa.gov/data/live/sites-extended.csv --private=/path/to/private-research.csv --parents=https://github.com/GSA/data/raw/master/dotgov-domains/current-federal.csv

a11y setup

pa11y expects a config file at config/pa11y_config.json. Details and documentation for this config can be found in the pa11y repo.
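As a starting point, a minimal config that only suppresses notices and warnings might look like the following (consult the pa11y documentation for the full set of supported options):

{
  "ignore": [
    "notice",
    "warning"
  ]
}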


A brief note on redirects:

For the accessibility scans we're running at 18F, we're using the pshtt scanner to follow redirects before the accessibility scan runs. Pulse.cio.gov is set up to show accessibility scans for live, non-redirecting sites. For example, if aaa.gov redirects to bbb.gov, we will show results for bbb.gov on the site, but not aaa.gov.

However, if you want to include results for redirecting sites, note the following: if aaa.gov redirects to bbb.gov, pa11y will run against bbb.gov, but the result will be recorded for aaa.gov.

In order to get the benefits of the pshtt scanner, all a11y scans must include it. For example, to scan gsa.gov:

./scan gsa.gov --scan=pshtt,a11y

Because of domain-scan's caching, the results of a pshtt scan will be saved in the cache/pshtt folder, and probably do not need to be regenerated for every single a11y scan.

Developing new scanners

Scanners are registered by creating a single Python file in the scanners/ directory, where the file is given the name of the scanner (plus the .py extension).

(Scanners that use Chrome are slightly different, require both a Python and JavaScript file, and their differences are documented below.)

Each scanner should define a few top-level functions and one variable that will be referenced at different points.

For an example of how a scanner works, start with scanners/noop.py. The noop scanner is a test scanner that does nothing (no-op), but it implements and documents a scanner's basic Python contract.

Scanners can implement 4 functions (2 required, 2 optional). In order of being called:

  • init(environment, options) (Optional)

    The init() function will be run only once, before any scans are executed.

    Returning a dict from this function will merge that dict into the environment dict passed to all subsequent function calls for every domain.

    Returning False from this function indicates that the scanner is unprepared, and the entire scan process (for all scanners) will abort.

    Useful for expensive actions that shouldn't be repeated for each scan, such as downloading supplementary data from a third party service. See the pshtt scanner for an example of downloading the Chrome preload list once, instead of for each scan.

    The init function is always run locally.

  • init_domain(domain, environment, options) (Optional)

    The init_domain() function will be run once per-domain, before the scan() function is executed.

    Returning a dict from this function will merge that dict into the environment dict passed to the scan() function for that particular domain.

    Returning False from this function indicates that the domain should not be scanned. The domain will be skipped and no rows will be added to the resulting CSV. The scan function will not be called for this domain, and cached scan data for this domain will not be stored to disk.

    Useful for per-domain preparatory work that needs to be performed locally, such as taking advantage of scan information cached on disk from a prior scan. See the sslyze scanner for an example of using available pshtt data to avoid scanning a domain known not to support HTTPS.

    The init_domain function is always run locally.

  • scan(domain, environment, options) (Required, unless using headless Chrome)

    The scan function performs the core of the scanning work.

    Returning a dict from this function indicates that the scan has completed successfully, and that the returned dict is the resulting information. This dict will be passed into the to_rows function described below, and used to generate one or more rows for the resulting CSV.

    Returning None from this function indicates that the scan has completed unsuccessfully. The domain will be skipped, and no rows will be added to the resulting CSV.

    In all cases, cached scan data for the domain will be stored to disk. If a scan was unsuccessful, the cached data will indicate that the scan was unsuccessful. Future scans that rely on cached responses will skip domains for which the cached scan was unsuccessful, and will not execute the scan function for those domains.

    The scan function is run either locally or in Lambda. (See docs/lambda.md for how to execute functions in Lambda.)

    If using headless Chrome, this method is defined in a corresponding Node file instead, and scan_headless must be set to True as described below.

  • to_rows(data) (Required)

    The to_rows function converts the data returned by a scan into one or more rows, which will be appended to the resulting CSV.

    The data argument passed to the function is the return value of the scan function described above.

    The function must return a list of lists, where each contained list is the same length as the headers variable described below.

    For example, a to_rows function that always returns one row with two values might be as simple as return [[ data['value1'], data['value2'] ]].

    The to_rows function is always run locally.

Scanners can implement a few top-level variables (1 required, others sometimes required):

  • headers (Required)

    The headers variable is a list of strings to use as column headers in the resulting CSV. These headers must be in the same order as the values in the lists returned by the to_rows function.

    The headers variable is always referenced locally.

  • lambda_support (Required if using --lambda)

    Set lambda_support to True to have the scanner "opt in" to being runnable in Lambda.

    If this variable is not set, or set to False, then using --lambda will have no effect on this scanner, and it will always be run locally.

  • scan_headless (Required if using headless Chrome)

    Set scan_headless to True to have the scanner indicate that its scan() method is defined in a corresponding Node file, rather than in this Python file.

    If this variable is not set, or is set to False, then the scan() method must be defined. See the documentation below for details on developing Chrome scanners.

In all of the above functions that receive it, environment is a dict that will contain (at least) a scan_method key whose value is either "local" or "lambda".

The environment dict will also include any key/value pairs returned by previous function calls. This means that data returned from init will be contained in the environment dict sent to init_domain. Similarly, data returned from both init and init_domain for a particular domain will be contained in the environment dict sent to the scan method for that domain.

In all of the above functions that receive it, options is a dict that contains a direct representation of the command-line flags given to the ./scan executable.

For example, if the ./scan command is run with the flags --scan=pshtt,sslyze --lambda, they will translate to an options dict that contains (at least) {"scan": "pshtt,sslyze", "lambda": True}.
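Tying the contract together, here is a minimal hypothetical scanner (call it scanners/example.py), sketched from the functions and variables described above and modeled loosely on scanners/noop.py rather than copied from it:

import logging

# Column headers for the resulting CSV, in the same order as the values
# returned by to_rows().
headers = ["Constant", "Variable"]

# Opt in to Lambda execution (only meaningful when --lambda is used).
lambda_support = True


def init(environment, options):
    # Runs once, locally, before any scans. Returned keys are merged into
    # the environment passed to every subsequent call.
    logging.warning("Starting up the example scanner.")
    return {"constant": 12345}


def init_domain(domain, environment, options):
    # Runs once per domain, locally, before scan(). Returning False here
    # would skip the domain entirely.
    return {"variable": domain}


def scan(domain, environment, options):
    # The core scanning work; runs locally or in Lambda. Returning None
    # marks the scan as unsuccessful and skips the domain in the CSV.
    return {
        "constant": environment["constant"],
        "variable": environment["variable"],
    }


def to_rows(data):
    # Convert the dict returned by scan() into a list of CSV rows, each
    # the same length as headers.
    return [[data["constant"], data["variable"]]]

Given the mechanics described above, running ./scan example.com --scan=example would then write cached JSON under cache/example/ and produce results/example.csv with the Constant and Variable columns.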

Developing Chrome scanners

This tool has some built-in support for instrumenting headless Chrome, both locally and inside of Amazon Lambda.

To make a scanner that uses headless Chrome, create two files:

  • A Python file, e.g. scanners/third_parties.py, that does not have a scan() function, but does have the standard init(), init_domain(), and to_rows() functions and the headers variable, as described above.

  • A Node file, e.g. scanners/third_parties.js, that has a scanning function as described below.

The Node file must export the following method as part of its module.exports:

  • scan(domain, environment, options, browser, page) (Required)

    The domain, environment, and options parameters are identical to the Python equivalent. The environment dict is affected by the init() and init_domain() functions in the corresponding Python file for this scanner.

    The browser parameter is an instance of Puppeteer's Browser class. It will already be connected to a running Chromium instance.

    The page parameter is an instance of Puppeteer's Page class. It will already have been instantiated through await browser.newPage(), but not set to any particular URL.

    Returning data from this function has identical effects to its Python equivalent: the return value is sent into the to_rows() Python function, and is cached to disk as JSON, etc.

Below is a simplified example of a scan() method. A full scanner will be a bit more complicated -- see scanners/third_parties.js for a real use case.

module.exports = {
  scan: async (domain, environment, options, browser, page) => {

    // Catch each HTTP request made in the page context.
    page.on('request', (request) => {
      // process the request somehow
    });

    // Navigate to the page
    try {
      await page.goto(environment.url);
    } catch (exc) {
      // Error handling, including timeout handling.
    }

  }
}

Note that the corresponding Python file (e.g. scanners/third_parties.py) is still needed, and its init() and init_domain() functions can affect the environment object.

This can be used, for example, to provide a modified starting URL to the Node scan() function in the environment object, based on the results of previous (Python-based) scanners such as pshtt.
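For example, the Python side's init_domain() might (hypothetically) read cached pshtt data and hand a starting URL to the Node scan() through the environment dict. The "Canonical URL" field name below is an assumption about pshtt's cached output, not a documented guarantee:

import json
import os


def init_domain(domain, environment, options):
    # Hypothetical example: look for cached pshtt data for this domain and,
    # if present, pass its canonical URL along to the Node scan() function.
    cache_path = os.path.join("cache", "pshtt", "%s.json" % domain)
    url = "http://%s" % domain

    if os.path.exists(cache_path):
        with open(cache_path) as f:
            data = json.load(f)
        url = data.get("Canonical URL", url)  # field name is an assumption

    # Everything returned here shows up in `environment` in scan(),
    # including the Node-side scan() for Chrome-based scanners.
    return {"url": url}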

Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.


domain-scan's Issues

Sort output results alphabetically by domain

It should be easy to compare the results of one output to another, using diff tools, without worrying about the order that all the asynchronous parallelized tasks happened to complete in.

The various output .csv's should be ordered alphabetically, by domain name, with the header intact at the top. Hopefully this can be done more-or-less in place in a quick, streaming way.

pshtt scan exception case drops record from report

When the version of pshtt that is loaded by domain-scan lands in a ConnectionError or a RequestException exception case in its basic_check method, it dumps an error and returns an empty string where a JSON object representing the domain's test results should be. This ripples through sslyze and errors out of the domain-scan pshtt invocation altogether (i.e., "Bad news scanning"), and never writes an entry to the results. Here's an example (domain redacted; hit me up for a live example):

› docker-compose run scan domain.tld --scan=pshtt --debug --force
[domain.tld][pshtt]
	 /opt/pyenv/versions/2.7.11/bin/pshtt domain.tld
Failed to connect.
Certificate did not match expected hostname: domain.tld. Certificate: {[certificate chain information]}
Traceback (most recent call last):
  File "/opt/pyenv/versions/2.7.11/bin/pshtt", line 9, in <module>
    load_entry_point('pshtt==0.1.5', 'console_scripts', 'pshtt')()
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/cli.py", line 54, in main
    results = pshtt.inspect_domains(domains, options)
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 882, in inspect_domains
    results.append(inspect(domain))
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 63, in inspect
    basic_check(domain.https)
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 173, in basic_check
    https_check(endpoint)
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 331, in https_check
    cert_plugin_result = cert_plugin.process_task(server_info, 'certinfo_basic')
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/sslyze/plugins/certificate_info_plugin.py", line 115, in process_task
    if scan_command.custom_ca_file:
AttributeError: 'str' object has no attribute 'custom_ca_file'
Error running eval "$(pyenv init -)" && pyenv shell 2.7.11 && /opt/pyenv/versions/2.7.11/bin/pshtt domain.tld --json --user-agent "github.com/18f/domain-scan, pshtt.py" --timeout 30 --preload-cache ./cache/preload-list.json.
	Bad news scanning, sorry!
Results written to CSV.

That last bit is a lie... Nothing is written because the raw data out of the pshtt invocation is None.

This ultimately is an issue with the way exception cases in pshtt are ordered in the overall flow of logic, and I'll open up an issue there and eventually offer a solution. The reason I'm bringing it up here, is because pshtt will produce a report with these exception cases properly reflected (i.e., as failing), but domain-scan just ends up dropping them from the report all together, which can be confusing as anything when your target list of 12K domains only results in a results.csv of 11,999 rows (gah!). So this is more just an "awareness" issue.

Change the way `try_command()` checks command

I'm trying to call pshtt as a docker container, which results in something like the following command being executed by domain-scan:

docker run --rm -e USER_ID=1042 -e GROUP_ID=1042 -v $(pwd):/data dockerpulse_c-pshtt rijksoverheid.nl

This fails when a scanner tries this command using try_command(), because it runs which on the whole command, including parameters. This could be 'fixed' if only the actual executable of the command were passed to which, e.g.:

subprocess.check_call(["which", command.split(' ')[0]], ...)

I'm not sure if this has any implications, so I made an issue out of this instead of a pull request.

Tindel's data wrangling script

@jtexnl's script, saving here for posterity:

import csv
import re
import json
import collections

def readData(inputFile):
    outList = []
    with open(inputFile, 'rU') as infile:
        reader = csv.reader(infile)
        firstRow = True
        for row in reader:
            if firstRow == True:
                firstRow = False
                continue
            else:
                outList.append(row)
    return outList

def writeJson(inputData, fileName):
    with open(fileName, 'w+') as outfile:
        json.dump(inputData, outfile, indent = 4)

def makeAgencyOutput(inputList, errorDict, errorTypeDict):
    output = []
    for row in inputList:
        subSet = row[1]
        subDict = collections.OrderedDict({})
        subDict['Agency'] = row[0]
        subDict['Errors'] = errorDict[row[0]]
        for key, value in errorTypeDict.items():
            k = key
            try:
                subDict[k] = subSet[value]
            except KeyError:
                subDict[k] = 0
            except TypeError:
                subDict[k] = 0
        output.append(subDict)
    return output

def getKey(item): 
    return item[0]

def trimErrorField(errorField):
    pieces = re.split('.*(Guideline)', errorField)
    shortened = pieces[-1]
    pieces = shortened.split('.')
    num = pieces[0]
    return num

def categorize(dataset, referenceDict, colNum, altName):
    for row in dataset:
        if row[colNum] in referenceDict.keys():
            row.append(referenceDict[row[colNum]])
        else:
            row.append(altName)
    return dataset

def countDict(dataset, colIndex):
    output = {}
    for row in dataset:
        if row[colIndex] in output:
            output[row[colIndex]] += 1
        else:
            output[row[colIndex]] = 1
    return output

#Read in a11y.csv for errors and domains.csv for agencies
ally1 = readData('a11y.csv')
domains = readData('domains.csv')
#need to remove ussm.gov, whistleblower.gov, and safeocs.gov from ally due to discrepancies between the datasets. Solve at some point
ally = []
for row in ally1:
    if row[0] != 'safeocs.gov' and row[0] != 'whistleblower.gov' and row[0] != 'ussm.gov':
        ally.append(row)

#Truncate the a11y file so that it's a bit more manageable. Need the domain name [0] and the principle [4]
main = []
for row in ally:
    main.append([row[0], trimErrorField(row[4])])

#Add the information on the agency [1] and branch [2]
for error in main:
    for domain in domains:
        if error[0] == domain[0].lower():
            error.append(domain[1])
            error.append(domain[2])

#Dictionaries; branches = branch lookup, errorCats = error category lookup
branches = {"Library of Congress":"Legislative","The Legislative Branch (Congress)":"Legislative",
"Government Printing Office":"Legislative","Congressional Office of Compliance":"Legislative",
"The Judicial Branch (Courts)":"Judicial"}
errorCats = {'1_4':'Color Contrast Error', '1_1':'Alt Tag Error', '4_1':'HTML/Attribute Error', '1_3':'Form Error'}

#define branches for the 'main' and 'domains' sets, define error categories for 'main'
main = categorize(main, branches, -1, 'Executive')
domains = categorize(domains, branches, 2, 'Executive')
main = categorize(main, errorCats, 1, 'Other Error')

totalErrorsByDomain = countDict(main, 0)
totalErrorsByAgency = countDict(main, 3)

#create dict of base vs. canonical domains
canonicals = {}
for row in ally:
    try:
        if row[0] in canonicals.keys():
            continue
        else:
            canonicals[row[0]] = row[1]
    except KeyError:
        continue


noErrors = []
errors = []
for domain in domains:
    if not domain[0].lower() in totalErrorsByDomain.keys():
        noErrors.append(domain)
    else:
        errors.append(domain)

for row in noErrors:
    row.append(0)
    row.append({})
    try:
        if row[0] in canonicals.keys():
            row.append('http://' + canonicals[row[0].lower()])
        else:
            row.append('http://' + row[0].lower())
    except TypeError:
        continue

for row in errors:
    row.append(totalErrorsByDomain[row[0].lower()])
    subset = []
    for line in main:
        if line[0] == row[0].lower():
            subset.append(line)
    errorDict = countDict(subset, -1)
    row.append(errorDict)
    try:
        if row[0] in canonicals.keys():
            row.append('http://' + canonicals[row[0].lower()])
        else:
            row.append('http://' + row[0].lower())
    except TypeError:
        continue

domains = errors + noErrors
domains = sorted(domains, key = getKey)

dictList = []
for row in domains:
    subDict = collections.OrderedDict({})
    subDict['agency'] = row[2]
    subDict['branch'] = row[5]
    subDict['canonical'] = row[8]
    subDict['domain'] = row[0].lower()
    subDict['errors'] = row[6]
    subDict['errorlist'] = row[7]
    dictList.append(subDict)

finalDict = {}
finalDict['data'] = dictList

writeJson(finalDict, 'domains.json')

agencyList = []
for row in main:
    if row[3] in agencyList:
        continue
    else:
        agencyList.append(row[3])

agencyErrorSets = []
for agency in agencyList:
    subList = []
    sub = {}
    for row in main:
        if row[3] == agency:
            if row[-1] in sub:
                sub[row[-1]] += 1
            else:
                sub[row[-1]] = 1
    subList.append(agency)
    subList.append(sub)
    agencyErrorSets.append(subList)

errorTypes = {'Color Contrast Errors':'Color Constrast Error', 'HTML/Attribute Errors':'HTML/Attribute Error', 
'Form Errors':'Form Error', 'Alt Tag Errors':'Alt Tag Error', 'Other Errors':'Other Error'}

output = makeAgencyOutput(agencyErrorSets, totalErrorsByAgency, errorTypes)
finalOutput = {}
finalOutput['data'] = output

writeJson(finalOutput, 'agencies.json')

process_a11y.py script is not removing other branches automatically

In step 6 of the a11y scanning process, the process_a11y.py script is not removing domains from the legislative, judicial, and 'non-federal' branches automatically. Right now, to get by, I go in and hand-remove them from the domains.csv file as an extra part of step 3 of the a11y scanning process, but this is laborious and error prone.

I don't believe that this is broken functionality so much as functionality that has never existed. It'll be helpful to automate this. It doesn't actually matter at what step this occurs, so long as the domains from these other branches are not present in the final files generated in step 6.

Here is the list of agencies that shouldn't be included in the a11y scan.

Nothing is returned for a specific domain name

$ docker-compose run scan particulier.api.gouv.fr --scan=tls
Results written to CSV.

But the file results/tls.csv is empty

$ cat results/tls.csv 
Domain,Base Domain,Grade,Signature Algorithm,Key Type,Key Size,Forward Secrecy,OCSP Stapling,Fallback SCSV,RC4,SSLv3,TLSv1.2,SPDY,Requires SNI,HTTP/2

But it works well in the web interface

This command docker-compose run scan geo.api.gouv.fr --scan=tls works.

Nothing is printed on stderr. Do you know where the problem is?

Report on any errors found during the process

At the end of the scan, --debug or not, list any errors that gave an invalid: true response in their cached data. Possibly include it in the meta.json file for the scan, too. It should be easy to find these and re-scan them.

Submit either/both HTTPS-enabled endpoints for a domain to ssllabs-scan

In its next version, the SSL Labs API will stop automatically guessing whether or not the domain needs a www prefix:

http://sourceforge.net/p/ssllabs/mailman/message/34661550/

We do make a best guess at the "canonical" form of a domain in the inspect step, using site-inspector, so we can use this to submit the right endpoint.

That said, since that canonical prefix detection is buggy (and arguably has been giving us incomplete data anyway) we may be better off submitting either or both of the root and www prefix, based on whether or not we detect HTTPS as available on that endpoint. That will leave less room for bugs and give us more data.

process_a11y.py script is not factoring in domains with no errors but it should

In step 6 of the a11y scanning process, the process_a11y.py script is not factoring in domains that have no errors, since it is just building from the results of the a11y.csv file generated in step 5; however, it needs to.

Imagine an executive branch agency with three active, non-redirecting domains. After step 5 completes, there are only error results for two domains (either because the third domain did not scan successfully or because no errors were detected). The problem is that step 6 computes based on the a11y.csv file of individual error results and does not factor in the total domain set that it should be considering.

sslyze error where (apparently) no certificates are delivered

Observing this sslyze error during scans:

Traceback (most recent call last):

  File "/opt/scan/domain-scan/scan", line 120, in process_scan
    rows = list(scanner.scan(domain, options))

  File "/opt/scan/domain-scan/scanners/sslyze.py", line 75, in scan
    data = parse_sslyze(xml)

  File "/opt/scan/domain-scan/scanners/sslyze.py", line 205, in parse_sslyze
    issuer = certificates[-1].select_one("issuer commonName")

IndexError: list index out of range

These appear to happen after a long timeout, suggesting that there could be a connection/timeout error that results in no certificate data being available.

Expand the `inspect.py` script

Expand that script, which runs first in the a11y scan process, so that the resulting inspect.csv includes columns for Agency and Branch.

  • Agency could carry over from the domains.csv file that inspect.py is running on.
  • Branch - I would think that this could be done using the same method that is applied in this later script.

We would need to ensure that these changes do not adversely impact the workflows for the HTTPS or DAP sections. cc @konklone

The benefit of making these changes is that it would make #101 and, to a degree, #102 easier to resolve.

Gathering hostnames via Docker

As it currently stands, I don't think there is a way to use the gather tool from within the Docker images built from this repo? I think it's worth creating a dockerfile for building an image used to gather hostnames before scanning. I would love some feedback on this idea, and would be more than happy to help out if it is something that is wanted.

a11y scanner freezes on certain domains

The following domains break the a11y scan such that I have to stop it, remove the domain, and restart the scan all over again.

Two problems result:

  • We don't get any a11y scan results for these domains.
  • Having to restart the scan significantly adds to the time and effort that goes into it.
afadvantage.gov
ama.gov
banknet.gov
biomassboard.gov
broadband.gov
dea.gov
disasterhousing.gov
export.gov
flightschoolcandidates.gov
grantsolutions.gov
gsaadvantage.gov
gsaauctions.gov
hrsa.gov
hydrogen.gov
idmanagement.gov
invasivespecies.gov
myfdicinsurance.gov
nationalbank.gov
nationalbanknet.gov
nationalhousing.gov
nationalhousinglocator.gov
nhl.gov
nls.gov
onhir.gov
pay.gov
realestatesales.gov
safetyact.gov
sciencebase.gov
segurosocial.gov
selectusa.gov
stopfakes.gov
tvaoig.gov
usdebitcard.gov

ImportError: No module named 'requests'

thx to #132 !!

but when i worked this command

./gather censys \
  --suffix=.gov \
  --censys_id=id \
  --censys_key=key \
  --start=1 \
  --end=2 \
  --delay=5 \
  --debug

Traceback (most recent call last):
File "./gather", line 6, in
import requests

ImportError: No module named 'requests'

I am getting this error. What is the problem?

I already tried pip install requests.

Suggestion: Split up the Dockerfile

Just took a look at the Dockerfile for the first time, and was surprised to see how much is in there. I guess it's because the various tools being used all have different dependencies?

Having multiple languages in a single Dockerfile is an antipattern (IMHO), and I think the setup for each scanner could be a lot simpler if you isolated each tool to its own Dockerfile. These could then be run independently, or via a domain-scan Dockerfile that calls out to docker run <scanner> and then stitches the results together.

I got this idea from the architecture of the Code Climate CLI, so you could look there for inspiration if you're interested in pursuing this.

No module named 'gatherers.'

When i use

sudo ./gather censys,dap, \
  --suffix=.gov \
  --censys_id=id \
  --censys_key=key \
  --dap=https://analytics.usa.gov/data/live/sites-extended.csv \
  --parents=https://raw.githubusercontent.com/GSA/data/gh-pages/dotgov-domains/current-federal.csv

this error was detected

Done fetching from API.
Results written to CSV.
rootk@ubuntu:~/domain-scan-master$ ./st2.sh
Fetching up to 100 records, starting at page 1.
[1] Cached page.
[] Gatherer not found, or had an error during loading.
ERROR: <class 'ImportError'>

No module named 'gatherers.' 

What is the problem?

docker-compose up fails

$ docker --version
Docker version 1.9.1, build a34a1d5
$ docker-compose up                                    
Building scan
Step 1 : FROM ubuntu:14.04.3
 ---> 6cc0fc2a5ee3
Step 2 : MAINTAINER V. David Zvenyach <[email protected]>
 ---> Using cache
 ---> 9c7124f58945
Step 3 : RUN apt-get update         -qq     && apt-get install         -qq         --yes         --no-install-recommends         --no-install-suggests       build-essential=11.6ubuntu6       curl=7.35.0-1ubuntu2.5       git=1:1.9.1-1ubuntu0.1       libc6-dev=2.19-0ubuntu6.6       libfontconfig1=2.11.0-0ubuntu4.1       libreadline-dev=6.3-4ubuntu2       libssl-dev=1.0.1f-1ubuntu2.15       libssl-doc=1.0.1f-1ubuntu2.15       libxml2-dev=2.9.1+dfsg1-3ubuntu4.4       libxslt1-dev=1.1.28-2build1       libyaml-dev=0.1.4-3ubuntu3.1       make=3.81-8.2ubuntu3       nodejs=0.10.25~dfsg2-2ubuntu1       npm=1.3.10~dfsg-1       python3-dev=3.4.0-0ubuntu2       python3-pip=1.5.4-1ubuntu3       unzip=6.0-9ubuntu1.3       wget=1.15-1ubuntu1.14.04.1       zlib1g-dev=1:1.2.8.dfsg-1ubuntu1       autoconf=2.69-6       automake=1:1.14.1-2ubuntu1       bison=2:3.0.2.dfsg-2       gawk=1:4.0.1+dfsg-2.1ubuntu2       libffi-dev=3.1~rc1+r3.0.13-12       libgdbm-dev=1.8.3-12build1       libncurses5-dev=5.9+20140118-1ubuntu1       libsqlite3-dev=3.8.2-1ubuntu2.1       libtool=2.4.2-1.7ubuntu1       pkg-config=0.26-1ubuntu4       sqlite3=3.8.2-1ubuntu2.1     && apt-get clean     && rm -rf /var/lib/apt/lists/*
 ---> Running in eb3383000cf7
E: Version '7.35.0-1ubuntu2.5' for 'curl' was not found
E: Version '1:1.9.1-1ubuntu0.1' for 'git' was not found
E: Version '1.0.1f-1ubuntu2.15' for 'libssl-dev' was not found
E: Version '1.0.1f-1ubuntu2.15' for 'libssl-doc' was not found
E: Version '2.9.1+dfsg1-3ubuntu4.4' for 'libxml2-dev' was not found
E: Version '6.0-9ubuntu1.3' for 'unzip' was not found
ERROR: Service 'scan' failed to build: The command '/bin/sh -c apt-get update         -qq     && apt-get install         -qq         --yes         --no-install-recommends         --no-install-suggests       build-essential=11.6ubuntu6       curl=7.35.0-1ubuntu2.5       git=1:1.9.1-1ubuntu0.1       libc6-dev=2.19-0ubuntu6.6       libfontconfig1=2.11.0-0ubuntu4.1       libreadline-dev=6.3-4ubuntu2       libssl-dev=1.0.1f-1ubuntu2.15       libssl-doc=1.0.1f-1ubuntu2.15       libxml2-dev=2.9.1+dfsg1-3ubuntu4.4       libxslt1-dev=1.1.28-2build1       libyaml-dev=0.1.4-3ubuntu3.1       make=3.81-8.2ubuntu3       nodejs=0.10.25~dfsg2-2ubuntu1       npm=1.3.10~dfsg-1       python3-dev=3.4.0-0ubuntu2       python3-pip=1.5.4-1ubuntu3       unzip=6.0-9ubuntu1.3       wget=1.15-1ubuntu1.14.04.1       zlib1g-dev=1:1.2.8.dfsg-1ubuntu1       autoconf=2.69-6       automake=1:1.14.1-2ubuntu1       bison=2:3.0.2.dfsg-2       gawk=1:4.0.1+dfsg-2.1ubuntu2       libffi-dev=3.1~rc1+r3.0.13-12       libgdbm-dev=1.8.3-12build1       libncurses5-dev=5.9+20140118-1ubuntu1       libsqlite3-dev=3.8.2-1ubuntu2.1       libtool=2.4.2-1.7ubuntu1       pkg-config=0.26-1ubuntu4       sqlite3=3.8.2-1ubuntu2.1     && apt-get clean     && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100

For a11y, include the following options as default

{
  "ignore": [
    "notice",
    "warning",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G18.BgImage",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G18.Abs",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G145.Abs",
    "WCAG2AA.Principle3.Guideline3_1.3_1_1.H57.2",
    "WCAG2AA.Principle3.Guideline3_1.3_1_1.H57.3",
    "WCAG2AA.Principle3.Guideline3_1.3_1_2.H58.1",
    "WCAG2AA.Principle4.Guideline4_1.4_1_1.F77"
  ]
}

remove need for a11y.py to run alongside inspect.py

In the current workflow, in step 5, I run the following command: docker-compose run scan domains.csv --scan=inspect,a11y --debug. However, if I am using a domains.csv that has been derived straight from recent DAP results, there's no need to run the inspect command (which adds a decent bit of time to the scan). It would be faster if I could just run the a11y scan without the inspect scan: `docker-compose run scan domains.csv --scan=a11y --debug`.

This does not work though, as it seems that the a11y scan depends on the inspect scan having already been run. You can see the error message below.

It would be handy to be able to run the a11y scan without it needing the inspect scan cache results.


[youthrules.gov][a11y]
Traceback (most recent call last):

  File "./scan", line 120, in process_scan
    rows = list(scanner.scan(domain, options))

  File "/home/scanner/scanners/a11y.py", line 197, in scan
    inspect_data = get_from_inspect_cache(domain)

  File "/home/scanner/scanners/a11y.py", line 23, in get_from_inspect_cache
    inspect_raw = open(inspect_cache).read()

FileNotFoundError: [Errno 2] No such file or directory: './cache/inspect/youthrules.gov.json'

Gracefully handle unauthenticated use of Censys.io Export API

Starting new HTTPS connection (1): www.censys.io
https://www.censys.io:443 "GET /api/v1/account HTTP/1.1" 200 243
Censys query:
SELECT parsed.subject.common_name, parsed.extensions.subject_alt_name.dns_names from FLATTEN([certificates.certificates], parsed.extensions.subject_alt_name.dns_names) where parsed.subject.common_name LIKE "%.gov" OR parsed.extensions.subject_alt_name.dns_names LIKE "%.gov";

Kicking off SQL query job.
https://www.censys.io:443 "POST /api/v1/export HTTP/1.1" 403 115
Traceback (most recent call last):

File "/home/user/domain-scan/gatherers/censys.py", line 194, in export_mode
job = export_api.new_job(query, format='csv', flatten=True)

File "/home/user/domain-scan/censys/export.py", line 25, in new_job
return self._post("export", data=data)

File "/home/user/domain-scan/censys/base.py", line 111, in _post
return self._make_call(self._session.post, endpoint, args, data)

File "/home/user/domain-scan/censys/base.py", line 105, in _make_call
const=const)

censys.base.CensysUnauthorizedException: 403 (unauthorized): Unauthorized. You do not have access to this service.

Censys error, aborting.
Downloading results of SQL query.
Traceback (most recent call last):
File "./gather", line 175, in
run(options)
File "./gather", line 73, in run
for domain in gatherer.gather(suffix, options, extra):
File "/home/user/domain-scan/gatherers/censys.py", line 66, in gather
hostnames_map = export_mode(suffix, options, uid, api_key)
File "/home/user/domain-scan/gatherers/censys.py", line 231, in export_mode
utils.download(results_url, download_file)
File "/home/user/domain-scan/scanners/utils.py", line 34, in download
filename, headers = urllib.request.urlretrieve(url, destination)
File "/usr/lib/python3.4/urllib/request.py", line 184, in urlretrieve
url_type, path = splittype(url)
File "/usr/lib/python3.4/urllib/parse.py", line 857, in splittype
match = _typeprog.match(url)
TypeError: expected string or buffer

How can I run this?

Dockerfile needs an update?

Hey! Just checking out this repo and noticed that Dockerfile as-is doesn't build for me:

docker@boot2docker:~/domain-scan$ docker run 5740b16bdfb7
Traceback (most recent call last):
  File "/tmp/scan", line 6, in <module>
    from scanners import utils
  File "/tmp/scanners/utils.py", line 10, in <module>
    import strict_rfc3339
ImportError: No module named 'strict_rfc3339'

Running in boot2docker 1.6.2, and docker version yields:

docker@boot2docker:~/domain-scan$ docker version
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 7c8fca2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2
OS/Arch (server): linux/amd64

undefined method `[]' for nil:NilClass (NoMethodError)

Looks like a few of them ran, but then I got an error.

[acus.gov]
[acus.gov]
Fetched, cached.
[achp.gov]
[achp.gov]
Fetched, cached.
[preserveamerica.gov]
[preserveamerica.gov]
Fetched, cached.
[adf.gov]
[adf.gov]
Fetched, cached.
[usadf.gov]
[usadf.gov]
Fetched, cached.
[abmc.gov]
[abmc.gov]
Fetched, cached.
[amtrakoig.gov]
[amtrakoig.gov]
Fetched, cached.
[arc.gov]
[arc.gov]
Fetched, cached.
[afrh.gov]
[afrh.gov]
Fetched, cached.
[cia.gov]
[cia.gov]
Fetched, cached.
[ic.gov]
[ic.gov]
/Library/Ruby/Gems/2.0.0/gems/site-inspector-1.0.0/lib/site-inspector/headers.rb:24:in `strict_transport_security': undefined method `[]' for nil:NilClass (NoMethodError)
	from /Library/Ruby/Gems/2.0.0/gems/site-inspector-1.0.0/lib/site-inspector/headers.rb:10:in `strict_transport_security?'
	from ./https-scan.rb:137:in `domain_details'
	from ./https-scan.rb:105:in `check_domain'
	from ./https-scan.rb:51:in `block (2 levels) in go'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1716:in `each'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1120:in `block in foreach'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1266:in `open'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1119:in `foreach'
	from ./https-scan.rb:31:in `block in go'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1266:in `open'
	from ./https-scan.rb:17:in `go'
	from ./https-scan.rb:152:in `<main>'

Converts a11y scan result to a format pulse can use

These need major refactoring

formats csv into a json format

require 'bundler/setup'
require 'pry'
require 'csv'
require 'json'
require 'parallel'

def get_scan_error row_hash
  {
    "code" => row_hash["code"], 
    "typeCode" => row_hash["typeCode"],
    "message" => row_hash["message"],
    "context" => row_hash["context"],
    "selector" => row_hash["selector"],
    "type" => row_hash["typeCode"] == "1" ? "error" : "other"
  }
end

Dir.chdir(File.dirname(__FILE__))

csv_scan = File.read('../data/a11y-8-4-2016-no-2_csv.csv')
inspect_domains = File.read('../data/inspect-domains.csv')
domains = {}

# create domains hash with just domains from inspect file
CSV.parse(inspect_domains, headers: true) do |row|
  row_hash = row.to_hash
  if row_hash["Live"] != "False"
    domains[row_hash["Domain"]] = {
      "Domain Name" => row_hash["Domain"],
      "scan" => []
    }
  end
end

#go through get each error, add to scan output
CSV.parse(csv_scan, headers: true) do |row|
  row_hash  = row.to_hash
  if !domains[row_hash["Domain"]]
    domains[row_hash["Domain"]] = {
      "Domain Name" => row_hash["Domain"],
      "scan" => [get_scan_error(row_hash)]
    }
  else
    domains[row_hash["Domain"]]["scan"] << get_scan_error(row_hash)
  end
end

combined_domains = []
domains.each do |domain|
  combined_domains << domain[1]
end

File.open("../data/a11y-8-4-2016-no-2_csv.json","w") do |f|
  f.write(combined_domains.to_json)
end

takes that JSON and makes the 3 files needed for pulse

require 'bundler/setup'
require 'pry'
require 'csv'
require 'json'

def total_errors domain
  errors = domain["scan"].select{|row|
    row["type"] == "error"
  }
  errors.length
end

def get_branch domain, sample
  url = domain["Domain Name"].downcase
  puts "Get Branch url = #{url}"
  branch = ""
  sample["data"].each do |sample|
    if sample["domain"] == url
      branch = sample["branch"]
    end
  end
  puts "Branch = #{branch}"
  branch
end

def get_agency domain, sample
  url = domain["Domain Name"].downcase
  puts "Get Branch url = #{url}"
  agency = ""
  sample["data"].each do |sample|
    if sample["domain"] == url
      agency = sample["agency"]
    end
  end
  puts "Agency = #{agency}"
  agency
end

def get_error_cat_count domain
  errorlist = {
    "Alt Tag Errors" => 0,
    "Color Contrast Errors" => 0,
    "Form Errors" => 0,
    "HTML/Attribute Errors" => 0,
    "Other Errors" => 0
  }
  codes = {
    "1_4." => "Color Contrast Errors"
  }
  domain["scan"].each do |error|
    if error["code"].include? "1_4."
      errorlist["Color Contrast Errors"] = errorlist["Color Contrast Errors"] + 1
    elsif error["code"].include? "1_1."
      errorlist["Alt Tag Errors"] = errorlist["Alt Tag Errors"] + 1
    elsif error["code"].include? "4_1."
      errorlist["HTML/Attribute Errors"] = errorlist["HTML/Attribute Errors"] + 1
    elsif error["code"].include? "1_3."
      errorlist["Form Errors"] = errorlist["Form Errors"] + 1
    else
      errorlist["Other Errors"] = errorlist["Other Errors"] + 1
    end
  end
  errorlist
end

def get_cat_errors domain
  errorlist = {
    "Alt Tag Errors" => [],
    "Color Contrast Errors" => [],
    "Form Errors" => [],
    "HTML/Attribute Errors" => [],
    "Other Errors" => []
  }
  domain["scan"].each do |error|
    if error["code"].include? "1_4."
      errorlist["Color Contrast Errors"] << error
    elsif error["code"].include? "1_1."
      errorlist["Alt Tag Errors"] << error
    elsif error["code"].include? "4_1."
      errorlist["HTML/Attribute Errors"] << error
    elsif error["code"].include? "1_3."
      errorlist["Form Errors"] << error
    else
      errorlist["Other Errors"] << error
    end
  end
  errorlist
end



Dir.chdir(File.dirname(__FILE__))

scans = File.read('../data/a11y-8-4-2016-no-2_csv.json')

domains_sample = JSON.parse(File.read("../data/domains-sample.json"))
error_cats = JSON.parse(File.read('../config/error_cat.json'))

puts domains_sample["data"].length

scans = JSON.parse(scans)
all_errors_count = 0

domains = {}
domains["data"] = []

a11y = {}
a11y["data"] = {}
scans.each do |scan|
  puts scan["Domain Name"]
  puts "Total Errors = #{total_errors scan}"
  puts "Branch = #{get_branch(scan, domains_sample)}"
  all_errors_count += total_errors scan

  domains["data"] << {
    "agency": get_agency(scan, domains_sample),
     "branch": get_branch(scan, domains_sample),
     "canonical": "http://#{scan["Domain Name"].downcase}",
     "domain": scan["Domain Name"].downcase,
     "errors": total_errors(scan),
     "errorlist": get_error_cat_count(scan)
  }
  a11y["data"][scan["Domain Name"].downcase] = get_cat_errors(scan)
end

agencies = {}
agency_hash = {}

domains["data"].each do |domain|
  if agency_hash[domain[:agency]]
    agency = agency_hash[domain[:agency]]
    domain_error_list = domain[:errorlist]
    agency["Average Errors per Page"] += domain[:errors]
    agency["Alt Tag Errors"] += domain_error_list["Alt Tag Errors"]
    agency["HTML/Attribute Errors"] += domain_error_list["HTML/Attribute Errors"]
    agency["Form Errors"] += domain_error_list["Form Errors"]
    agency["Color Contrast Errors"] += domain_error_list["Color Contrast Errors"]
    agency["Other Errors"] += domain_error_list["Other Errors"]
  else
    agency_hash[domain[:agency]] = {}

    agency = agency_hash[domain[:agency]]
    domain_error_list = domain[:errorlist]
    # binding.pry
    agency["Agency"] = domain[:agency]
    agency["Average Errors per Page"] = domain[:errors]
    agency["Alt Tag Errors"] = domain_error_list["Alt Tag Errors"]
    agency["HTML/Attribute Errors"] = domain_error_list["HTML/Attribute Errors"]
    agency["Form Errors"] = domain_error_list["Form Errors"]
    agency["Color Contrast Errors"] = domain_error_list["Color Contrast Errors"]
    agency["Other Errors"] = domain_error_list["Other Errors"]
  end
  if domain["agency"] == "Department of Defense"
    puts agency_hash["Department of Defense"]
  end
end

agencies["data"] = agency_hash.map{|agency| 
  agency[1]
}



puts domains["data"].first

puts "Total Errors #{all_errors_count}"


File.open("../data/domains.json","w") do |f|
  f.write(domains.to_json)
end

File.open("../data/a11y.json","w") do |f|
  f.write(a11y.to_json)
end

File.open("../data/agencies.json","w") do |f|
  f.write(agencies.to_json)
end

process_a11y.py script is not removing inactive, redirecting domains automatically

In step 6 of the a11y scanning process, the process_a11y.py script is not removing inactive and redirecting domains automatically. Right now, to get by, I go in and hand-remove them from the domains.csv file as an extra part of step 3 of the a11y scanning process, but this is laborious and error prone.

I don't believe that this is broken functionality so much as functionality that has never existed. It'll be helpful to automate this. It doesn't actually matter at what step this occurs, so long as the domains from these other branches are not present in the final files generated in step 6.

a11y.py scan is not ignoring individual errors

In step 5 of the a11y scanning process, the a11y.py scan is not excluding individual errors that are listed in the ignore list. Right now, to get by, I go back in and hand remove them from the a11y.csv file that is generated after step 5, but this is laborious and error prone.

Notices and warnings are correctly excluded but not the individual errors. It's as if I hadn't included them there.

I suspect that this comes from them being improperly formatted or referenced, though I don't know how. Here's some documentation that I've found:

Pshtt version

Perhaps this is a Docker-ism that I'm not as familiar with, but is there a reason to pin pshtt to a specific version, rather than leaving the version off and getting the latest from PyPI? Is this something to optimize the container image building?

RUN pip3 install pshtt==0.2.1

sslyze appearing to stall out on some domains

I don't have enough information to report a bug yet. When I checked on our server, the sslyze scans had stalled out with 9 in-flight, with a bunch of defunct sslyze processes. These were the domains:

nces.ed.gov
autodiscover.ors.od.nih.gov
stg-reg2.hcia.cms.gov
vpn1.cjis.gov
portcullis.nlrb.gov
safesupportivelearning.ed.gov
my.uscis.gov
www.educationusa.state.gov
pittsburgh.feb.gov

But when I ran a scan using sslyze on all 9 of those in a row, using --serial, none of them stalled out. So I'm not totally sure how to reproduce this.

Finish integrating semantic changes to a11y scans

Right now, the following edits are manually made during the a11y scan process. We should go through and change the scripts and scans to address these so that I no longer need to manually make them:

  • Alt Text => Missing Image Description - issue
  • add http:// to canonical domains - issue
  • "errors" - "initial findings" - issue
  • agency -> Agency - issue

Tests

I know we're all busy, but I figured it's at least worth starting the discussion around this repo's lack of tests. Right now, only a Python linter (?) is run.

I'll take a first stab at what I think should be tested:

  • each scanner with unit tests
  • the ./scan command with unit tests (throw all different args at it, see if it still works)
  • an integration test for a big scan

Build scripts for remote compilation of dependencies for domain-scan.zip

The lambda/remote_build.sh script has the commands I use to build the domain-scan Lambda environment, but it's not repeatable, and rebuilds require me to copy/paste manual subsets of the instructions.

This is going to become more of a burden over time, as any updates to dependencies will require a rebuild to capture these changes (and pshtt itself is likely to keep rapidly improving in very relevant ways), followed by a re-upload to Lambda.

A recursive web crawler to gather domains

Note: this is a potentially big task that should be broken into smaller tasks/stages. But also, there is value in starting with a naive, simple crawler and leveling it up in stages.

Either baked into domain-scan, or finding/making a separate tool that does this. We could also potentially use Common Crawl data.

But the basic need is to gather domains through web crawling, as this is a fertile source for hostnames that do not appear in Censys.io. For .gov, both Censys and the LOC's web crawl (the End of Term Archive) each had ~50% of unique domains not found through any other public method. The LOC crawl data, performed in late 2016, is getting more stale by the month, and also won't be helpful for non-USG sources.

sslyze calls deadlocking due to combination of threads/processes/logging

There are no more defunct processes after #151, and our bulk scans go for much longer before they become an issue, but eventually they do just get stuck. Or as this says:

If a process is forked while an io lock is being held, the child process will deadlock on the next call to flush.

After looking at the stuck processes' trace with gdb, I'm convinced I'm facing the same issue described here:
https://stackoverflow.com/questions/39884898/large-amount-of-multiprocessing-process-causing-deadlock

And that this bug, opened in 2009 and still quite actively discussed in October 2017, is the cause:
https://bugs.python.org/issue6721

The folks on that bug thread seem to be converging on a fix that is specific to logging calls and buffered IO, which I suspect would be enough to fix our case. There's also some related discussion on this bug, with Guido indicating he believes something should be done for the GC interrupt case.

There are a few ways to work around this I can think of:

  • Use sslyze's SynchronousScanner instead of the ConcurrentScanner. However, this is both much slower and results in a distinct memory leak, as noted in #151.
  • Don't ever write to stdout from the child processes. However, this exposes some of my not-totally-complete understanding here -- I am not sure whether it's my own logging calls or something inside sslyze that is the problem. Workers at the domain-scan level are done as threads via a ThreadPoolExecutor, whereas I believe it's SSLyze's ConcurrentScanner that forks off processes. So I may have limited control here. The only reference to using stdout in SSLyze's core is this emergency shutdown message, so I am not sure where in SSLyze this might be happening.
  • This Python module that lets you register an after-fork hook that clears up any held locks the child copied over. It's not on PyPI and would need to be installed from the repo (pip supports GitHub repo syntax).

Right now, I'm leaning toward the 3rd option, the Python module. I'll try it out and see how it goes.
