drivendataorg / deon

A command line tool to easily add an ethics checklist to your data science projects.

Home Page: https://deon.drivendata.org/

License: MIT License

Makefile 7.37% Python 92.63%

ethics data-science machine-learning data-ethics

deon's Introduction

Read more about deon on the project homepage

An ethics checklist for data scientists

deon is a command line tool that allows you to easily add an ethics checklist to your data science projects. We support creating a new, standalone checklist file or appending a checklist to an existing analysis in many common formats.

To help get started, deon includes a default Data Science Ethics Checklist along with a list of real-world examples connected with each item. Users can draw on the default list or develop their own.

δέον • (déon) [n.] (Ancient Greek) Wiktionary

Duty; that which is binding, needful, right, proper.

The conversation about ethics in data science, machine learning, and AI is increasingly important. The goal of deon is to push that conversation forward and provide concrete, actionable reminders to the developers that have influence over how data science gets done.

Quickstart

You only need two lines of code to get started!

First, install deon:

$ pip install deon

Then, write out the default checklist to a markdown file called ETHICS.md:

$ deon -o ETHICS.md

Dig into the checklist questions to identify and navigate the ethical considerations in your data science project.

For more configuration details, see the sections on command line options, supported output file types, and custom checklists.

Background and perspective

We have a particular perspective with this package that we will use to make decisions about contributions, issues, PRs, and other maintenance and support activities.

First and foremost, our goal is not to be arbitrators of what ethical concerns merit inclusion. We have a process for changing the default checklist, but we believe that many domain-specific concerns are not included and teams will benefit from developing custom checklists. Not every checklist item will be relevant. We encourage teams to remove items, sections, or mark items as N/A as the concerns of their projects dictate.

Second, we built our initial list from a set of proposed items on multiple checklists that we referenced. This checklist was heavily inspired by an article written by Mike Loukides, Hilary Mason, and DJ Patil and published by O'Reilly: "Of Oaths and Checklists". We owe a great debt to the thinking that preceded this, and we look forward to thoughtful engagement with the ongoing discussion about checklists for data science ethics.

Third, we believe in the power of examples to bring the principles of data ethics to bear on human experience. This repository includes a list of real-world examples connected with each item in the default checklist. We encourage you to contribute relevant use cases that you believe can benefit the community by their example. In addition, if you have a topic, idea, or comment that doesn't seem right for the documentation, please add it to the wiki page for this project!

Fourth, it's not up to data scientists alone to decide what the ethical course of action is. This has always been a responsibility of organizations that are part of civil society. This checklist is designed to provoke conversations around issues where data scientists have particular responsibility and perspective. This conversation should be part of a larger organizational commitment to doing what is right.

Fifth, we believe the primary benefit of a checklist is ensuring that we don't overlook important work. Sometimes it is difficult with pressing deadlines and a demand to multitask to make sure we do the hard work to think about the big picture. This package is meant to help ensure that those discussions happen, even in fast-moving environments. Ethics is hard, and we expect some of the conversations that arise from this checklist may also be hard.

Sixth, we are working at a level of abstraction that cannot concretely recommend a specific action (e.g., "remove variable X from your model"). Nearly all of the items on the checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Because of this, most of the items are framed as prompts to discuss or consider. Teams will want to document these discussions and decisions for posterity.

Seventh, we can't define exhaustively every term that appears in the checklist. Some of these terms are open to interpretation or mean different things in different contexts. We recommend that when relevant, users create their own glossary for reference.

Eighth, we want to avoid any items that strictly fall into the realm of statistical best practices. Instead, we want to highlight the areas where we need to pay particular attention above and beyond best practices.

Ninth, we want all the checklist items to be as simple as possible (but no simpler), and to be actionable.

Using this tool

Prerequisites

Python >3.6: Your project need not be Python 3, but you need Python 3 to execute this tool.

Installation

$ pip install deon

$ conda install deon -c conda-forge

Simple usage

We recommend adding a checklist as the first step in your data science project. After creating your project folder, you could run:

$ deon -o ETHICS.md

This will create a markdown file called ETHICS.md that you can add directly to your project.

For simple one-off analyses, you can append the checklist to a Jupyter notebook or RMarkdown file using the -o flag to indicate the output file. deon will automatically append if that file already exists.

$ jupyter notebook my-analysis.ipynb

...

$ deon -o my-analysis.ipynb  # append cells to existing output file

This checklist can be used by individuals or teams to ensure that reviewing the ethical implications of their work is part of every project. The checklist is meant as a jumping-off point, and it should spark deeper and more thorough discussions rather than replace those discussions.

Proudly display your Deon badge

You can add a Deon badge to your project documentation, such as the README, to encourage wider adoption of these ethical practices in the data science community.

HTML badge

<a href="http://deon.drivendata.org/">
    <img src="https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square" alt="Deon badge" />
</a>

Markdown badge

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

Supported file types

Here are the currently supported file types. We will accept pull requests with new file types if there is a strong case for widespread use of that filetype.

.txt: ascii
.html: html
.ipynb: jupyter
.md: markdown
.rmd: rmarkdown
.rst: rst

Command line options

Usage: deon [OPTIONS]

  Easily create an ethics checklist for your data science project.

  The checklist will be printed to standard output by default. Use the --output
  option to write to a file instead.

Options:
  -l, --checklist PATH  Override default checklist file with a path to a custom
                        checklist.yml file.
  -f, --format TEXT     Output format. Default is "markdown". Can be one of
                        [ascii, html, jupyter, markdown, rmarkdown, rst].
                        Ignored and file extension used if --output is passed.
  -o, --output PATH     Output file path. Extension can be one of [.txt, .html,
                        .ipynb, .md, .rmd, .rst]. The checklist is appended if
                        the file exists.
  -w, --overwrite       Overwrite output file if it exists. Default is False,
                        which will append to existing file.
  -m, --multicell       For use with Jupyter format only. Write checklist with
                        multiple cells, one item per cell. Default is False,
                        which will write the checklist in a single cell.
  --help                Show this message and exit.
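The help text above says that --format is ignored when --output is passed and the file extension is used instead. The sketch below illustrates that rule with the extension list from "Supported file types". The names resolve_format, EXTENSION_TO_FORMAT, and DEFAULT_FORMAT are hypothetical and are not taken from deon's source, and the case-insensitive extension match is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extension-to-format pairs copied from the "Supported file types" list.
EXTENSION_TO_FORMAT = {
    ".txt": "ascii",
    ".html": "html",
    ".ipynb": "jupyter",
    ".md": "markdown",
    ".rmd": "rmarkdown",
    ".rst": "rst",
}
DEFAULT_FORMAT = "markdown"  # the default named in the help text


def resolve_format(output_path=None, requested_format=None):
    """Pick an output format following the rule in the help text (hypothetical helper)."""
    if output_path is not None:
        # --output wins: the format comes from the file extension.
        # Lower-casing the extension is an assumption, so ".Rmd" matches ".rmd".
        extension = Path(output_path).suffix.lower()
        if extension not in EXTENSION_TO_FORMAT:
            raise ValueError(f"Unsupported file extension: {extension!r}")
        return EXTENSION_TO_FORMAT[extension]
    if requested_format is None:
        return DEFAULT_FORMAT
    if requested_format not in EXTENSION_TO_FORMAT.values():
        raise ValueError(f"Unsupported format: {requested_format!r}")
    return requested_format


print(resolve_format("my-analysis.ipynb", requested_format="html"))
print(resolve_format(requested_format="rst"))
```

Under these assumptions, the first call resolves to the Jupyter format because the output path takes precedence over the requested format.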

Default checklist

Data Science Ethics Checklist

A. Data Collection

A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
A.4 Downstream bias mitigation: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

B. Data Storage

B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?

C. Analysis

C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
C.2 Dataset bias: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
C.3 Honest representation: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
C.4 Privacy in analysis: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

D. Modeling

D.1 Proxy discrimination: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
D.2 Fairness across groups: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
D.3 Metric selection: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
D.4 Explainability: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
D.5 Communicate limitations: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

E. Deployment

E.1 Monitoring and evaluation: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
E.2 Redress: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
E.3 Roll back: Is there a way to turn off or roll back the model in production if necessary?
E.4 Unintended use: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

Data Science Ethics Checklist generated with deon.

Custom checklists

This is not meant to be the only ethical checklist; instead, we try to capture reasonable defaults that are general enough to be widely useful. For your own projects with particular concerns, we recommend creating your own checklist.yml file that is maintained by your team and passed to this tool with the -l flag.

Custom checklists must follow the same schema as checklist.yml. There must be a top-level title which is a string, and sections which is a list. Each section in the list sections must have a title, a section_id, and then a list of lines. Each line must have a line_id, a line_summary which is a 1-3 word shorthand, and a line string which is the content. The format is as follows:

title: TITLE
sections:
  - title: SECTION TITLE
    section_id: SECTION NUMBER
    lines:
        - line_id: LINE NUMBER
          line_summary: LINE SUMMARY
          line: LINE CONTENT
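As a rough illustration of the schema above, the following sketch checks an already-parsed checklist (for example, the dictionary a YAML parser would return for checklist.yml) against the required keys. The function validate_checklist is hypothetical and is not part of deon; it only mirrors the rules stated in the paragraph above.

```python
def validate_checklist(data):
    """Return a list of schema problems; an empty list means none were found."""
    problems = []
    if not isinstance(data.get("title"), str):
        problems.append("top-level 'title' must be a string")
    sections = data.get("sections")
    if not isinstance(sections, list):
        problems.append("top-level 'sections' must be a list")
        return problems
    for i, section in enumerate(sections):
        # Each section needs a title, a section_id, and a list of lines.
        for key in ("title", "section_id", "lines"):
            if key not in section:
                problems.append(f"sections[{i}] is missing '{key}'")
        lines = section.get("lines", [])
        if not isinstance(lines, list):
            problems.append(f"sections[{i}] 'lines' must be a list")
            continue
        for j, line in enumerate(lines):
            # Each line needs an id, a short summary, and the content itself.
            for key in ("line_id", "line_summary", "line"):
                if key not in line:
                    problems.append(f"sections[{i}].lines[{j}] is missing '{key}'")
    return problems


example = {
    "title": "Team Ethics Checklist",
    "sections": [
        {
            "title": "Data Collection",
            "section_id": "A",
            "lines": [
                {
                    "line_id": "A.1",
                    "line_summary": "Informed consent",
                    "line": "Have human subjects given informed consent?",
                }
            ],
        }
    ],
}
print(validate_checklist(example))
```

Running a check like this before passing the file with -l may help catch a missing key early, though the tool itself remains the authority on what it accepts.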

Changing the checklist

Please see the framing for an understanding of our perspective. Given this perspective, we will consider changes to the default checklist that fit with that perspective and follow this process.

Our goal is to have checklist items that are actionable as part of a review of data science work or as part of a plan. Please avoid suggesting items that are too vague (e.g., "do no harm") or too specific (e.g., "remove social security numbers from data").

Note: This process is an experiment and is subject to change based on how well it works.

A pull request to add an item should change:

deon/assets/checklist.yml: contains the default checklist items
deon/assets/examples_of_ethical_issues.yml: contains examples of harms caused when the item was not considered

The description in the pull request must include:

A justification for the change
A consideration of related items that already exist, and why this change is different from what exists
A published example (academic or press article) of where neglecting the principle has led to concrete harm (articles that discuss potential or hypothetical harm will not be considered sufficient)

See detailed contributing instructions here.

Discussion and commentary

In addition to this documentation, the wiki pages for the GitHub repository are enabled. This is a good place for sharing links and discussing how the checklists are used in practice.

If you have a topic, idea, or comment that doesn't seem right for the documentation, please add it to the wiki!

References, reading, and more

A robust discussion of data ethics is important for the profession. The goal of this tool is to make it easier to implement ethics review within technical projects. There are lots of great resources if you want to think about data ethics, and we encourage you to do so!

Checklist citations

We're excited to see so many articles popping up on data ethics! The short list below includes articles that directly informed the checklist content as well as a few case studies and thought-provoking pieces on the big picture.

Of oaths and checklists
How to build ethics into AI (Part I and Part II)
An ethical checklist for data science
How to recognize exclusion in AI
Case studies in data ethics
Technology is biased too. How do we fix it?
The dark secret at the heart of AI

Where things have gone wrong

To make the ideas contained in the checklist more concrete, we've compiled examples of times when things have gone wrong. They're paired with the checklist questions to help illuminate where in the process ethics discussions may have helped provide a course correction.

We welcome contributions! Follow these instructions to add an example.

Related tools

There are other groups working on data ethics and thinking about how tools can help in this space. Here are a few we've seen so far:

Aequitas (github)
Ethical OS Toolkit
Ethics & Algorithms Toolkit: A risk management framework for governments
Ethics and Data Science (free ebook) and (write-up)

deon was created and is maintained by the team at DrivenData. Our mission is to bring the power of data science to social impact organizations.

deon's Issues

2018-08-14 Feedback from meeting

Framing (for section in the documentation) [tracked in #25]:

Not all have to be relevant to every project (N/A is an ok answer)
It is ethical to be good at your job, but this is not a list of statistical best practices
Provoke discussion, not recommending particular fixes or policies
Don't create comprehensive lists of what is unfair or harmful, this is up to the teams that use the checklist

Other ideas:

Make it easy to contribute links with examples (should we have an examples.yml tied to checklist.yml that generates the table of examples that goes in the docs?)
Create a separate "data managers checklist" (team composition, culture of data, transparency)
include link to project homepage in the checklist for every format
Do we need to include something on data being sold / shared?
Explanation that is blog post style that goes in the documentation and can be shared in other places on launch
first rev of feedback that is private

Team Composition

Remove this as a separate item
incorporate these concerns into other checklist items (maybe a preamble) -- i.e. Have we noted where we are vulnerable to bias from missing perspectives?

Data Collection

(Bullets 3 + 4 combined and changed to): "Have you considered ways to minimize exposure of PII for example through anonymization or not collecting information that isn't relevant for analysis."

Data Storage

Combine 1+2: Keep bullet the same, and the e.g. becomes: "(e.g. encryption at rest and in transit, access controls, access logs, and up-to-date software)"
Add link to data security best practices in references?

Exploratory analysis

Change section title to "Analysis"
1st bullet, edit: replace "studied and understood" with "examined the data for possible sources of bias and taken steps to mitigate or address these biases"
1st bullet, add: "(e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)"
3rd bullet, discussion: "PII are not used or displayed unless necessary for the analysis." Is this sufficiently clear?
3rd bullet, discussion: consider synonyms for "auditable" that are less formal

Modeling

Move 1st bullet to "Analysis"
"discriminative" to "discriminatory"
remove "assumptions built into model"

Deployment

"appeal and redress" to "response"
"discussed with our organization" instead of creating a system for "appeal" or "redress"
last bullet to modeling section
remove "malicious attacks"

Add `xclip` to prereqs on Linux

Add option to exclude ids from rendering

We currently always render ids because the include_ids parameter defaults to True. We may want to add an option to turn off the checklist item numbering if desired.
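A minimal sketch of what such an option could look like for a single checklist item. The render_line helper and the markdown checkbox layout are assumptions made for illustration; only the include_ids name and its default of True come from the issue text.

```python
def render_line(line, include_ids=True):
    """Render one checklist item as a markdown checkbox (hypothetical helper)."""
    label = line["line_summary"]
    if include_ids:
        # Prefix the summary with the item id, e.g. "A.1 Informed consent".
        label = f"{line['line_id']} {label}"
    return f"- [ ] **{label}**: {line['line']}"


item = {
    "line_id": "B.3",
    "line_summary": "Data retention plan",
    "line": "Is there a schedule or plan to delete the data after it is no longer needed?",
}
print(render_line(item))
print(render_line(item, include_ids=False))
```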

Consider adding a "data managers" checklist

Per our discussion, we may want a separate set of items for data team managers that don't fit directly into the goals for a practitioner-oriented list. For example, team composition and a culture of data and transparency.

Web application for use and accountability

CLI wrapper should be thinner

We should expose something like deon.create() that has all the same options as the current CLI. The CLI can just be a thin wrapper for this function. This will make integration easier for Python tools and scripts.

Format test suite

Move to travisci.org for CI since codeship can't run against PRs submitted from forks

Codeship worked when everything we did was in a branch, but now that we have PRs from contributor forks we need to move over to travis.

Implement docs page for best practices / resources

We may want to include links to different kinds of best practices somewhere that we don't want to manage directly in this project (e.g., data storage best practices, statistical best practices).

Find a home for these kinds of links in the docs

Would be helpful to have "what to do" not just "what not to do"

Navigation menu on side

Override block in sidebars/navigation.html which ignores H1

{% macro toc_tree(toc) -%}
  {# This ignores H1s #}
  {% for toc_item in toc %}
    {{ _toc_tree_inner(toc_item.children) }}
  {% endfor %}
{%- endmacro %}

References of ethical issues

Include markdown of current version w/ table that has where things went wrong + citations. Goal is one article (the most salient) for each item in checklist. Backup is a few articles, pointing out which issues from the checklist happened in each.

Landing page

Add supported format to help text

Help text for formats and Usage error should use the format dictionaries to be helpful to users.

Add to PyPI

tag release
build and upload to pypi
test install from pypi
add badge to readme.md

Launch blog post

intro with announcement and motivations
use existing framing and description

Override alabaster table styles

Prettify references table

Remove explicit horizontal rules from md and rst

RST template seems to have too many blank lines.

Create template to render readme and docs

Want to programmatically generate them so they both get updated when checklist or references are updated.

Find a better name for table of ethical issues

Examples header in the sidebar feels like it should link to checklist examples (i.e. the rendered versions) rather than a table of ethical issues.

One idea is to call this what not to do and then pair it with a what to do page of helpful resources.

bandit security report

I ran bandit over the code base excluding tests. There are basically two issues reported.

This is the report

bandit --exclude test -r .
[main]	INFO	profile include tests: None
[main]	INFO	profile exclude tests: None
[main]	INFO	cli include tests: None
[main]	INFO	cli exclude tests: None
[main]	INFO	running on Python 3.7.0
Run started:2018-10-24 10:41:31.438775

Test results:
>> Issue: [B506:yaml_load] Use of unsafe yaml load. Allows instantiation of arbitrary objects. Consider yaml.safe_load().
   Severity: Medium   Confidence: High
   Location: ./deon/parser.py:14
   More Info: https://bandit.readthedocs.io/en/latest/plugins/b506_yaml_load.html
13	        with open(filepath, "r") as f:
14	            data = yaml.load(f)
15	

--------------------------------------------------
>> Issue: [B701:jinja2_autoescape_false] By default, jinja2 sets autoescape to False. Consider using autoescape=True or use the select_autoescape function to mitigate XSS vulnerabilities.
   Severity: High   Confidence: High
   Location: ./docs/render_templates.py:13
   More Info: https://bandit.readthedocs.io/en/latest/plugins/b701_jinja2_autoescape_false.html
12	
13	env = Environment(
14	    loader=FileSystemLoader('md_templates'),
15	)

--------------------------------------------------
>> Issue: [B506:yaml_load] Use of unsafe yaml load. Allows instantiation of arbitrary objects. Consider yaml.safe_load().
   Severity: Medium   Confidence: High
   Location: ./docs/render_templates.py:52
   More Info: https://bandit.readthedocs.io/en/latest/plugins/b506_yaml_load.html
51	    with open(root / 'examples_of_ethical_issues.yml', 'r') as f:
52	        refs = yaml.load(f)
53	

--------------------------------------------------

Code scanned:
	Total lines of code: 390
	Total lines skipped (#nosec): 0

Run metrics:
	Total issues (by severity):
		Undefined: 0.0
		Low: 0.0
		Medium: 2.0
		High: 1.0
	Total issues (by confidence):
		Undefined: 0.0
		Low: 0.0
		Medium: 0.0
		High: 3.0
Files skipped (0):

Click should show help when improper arguments are passed

Currently just shows Usage: deon [OPTIONS] and then a custom error message. Ideal would be to show the options from help

Support RST

Clipboard option for html and jupyter does not return full doc

clipboard option only calls .render, not .write, meaning that clipboard paired with html or jupyter will not return the full document (i.e. will not be the same as what is returned with output).

For example

ethics-checklist -f html only returns the html starting with <h1> (excludes doc_template).
ethics-checklist -f jupyter only returns the json starting with cell_type (excludes blank_jupyter_notebook).

Support LaTeX format

Do we need to require certain packages to support a checklist in LaTeX?

Checklist formats should be yaml or toml

Currently the example is a json file which we don't want long term because it doesn't play nicely with source control

Add cli tests

Remaining content links + study formatting

reference link for E.2
reference link for E.3
best way to include second link for reference: wording will be "(Related academic study)" on bullet directly below
add links to propublica commentary / rebuttals
add cambridge analytica article

asciinema

Support RTF

Link to rendered examples on Github

The documentation does not currently link to the examples folder of rendered checklists

Name and logo

Add to README

Prettier theme for docs

Potentially

lux
litera
alabaster

Turn on wiki

Custom checklist parameter should accept urls

Currently, we enforce this is a local file path, but we should also handle URLs to make integration easier.
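One possible way to tell a URL from a local path, sketched with the standard library only. The is_url helper is hypothetical, and restricting the check to http and https is an assumption about which schemes would be worth supporting.

```python
from urllib.parse import urlparse


def is_url(checklist_location):
    """Return True if the value looks like an http(s) URL rather than a local path."""
    parsed = urlparse(checklist_location)
    # Requiring both a known scheme and a host keeps plain file paths out.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for location in ("https://example.com/checklist.yml", "team/checklist.yml"):
    kind = "URL" if is_url(location) else "local path"
    print(f"{location}: {kind}")
```

A caller could then fetch the content for URLs and fall back to opening the file for anything else.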

Consider way to include definitions of key concepts

There is a trade off between fully defining each concept (e.g. informed consent) and having clean and simple checklist. Can we have an option to append a glossary or links to definitions?

Add bolding or callout of specific terms in the checklist items

Support ascii format

Refactor `make_table_of_links` to iterate over sections rather than references

Currently this function (in utils.py) iterates over references and then if the reference is the first in a chunk, it grabs the appropriate section. This is a legacy of having written this table without section titles first. Instead, it should iterate over sections and then grab references for a section.
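The proposed iteration order could look roughly like the sketch below. It is not the real make_table_of_links: the function name group_references_by_section is hypothetical, sections are assumed to follow the checklist.yml schema, and references are assumed to be a list of entries with a line_id and a list of links.

```python
def group_references_by_section(sections, references):
    """Walk sections first, then look up the references for each line."""
    # Index references by line id once so each lookup is a dictionary access.
    links_by_line_id = {ref["line_id"]: ref["links"] for ref in references}
    table = []
    for section in sections:
        rows = []
        for line in section["lines"]:
            links = links_by_line_id.get(line["line_id"], [])
            rows.append((line["line_id"], line["line_summary"], links))
        table.append((section["title"], rows))
    return table


sections = [
    {
        "title": "Data Collection",
        "section_id": "A",
        "lines": [
            {"line_id": "A.1", "line_summary": "Informed consent", "line": "..."},
            {"line_id": "A.2", "line_summary": "Collection bias", "line": "..."},
        ],
    }
]
# Placeholder reference data, not taken from the project's examples file.
references = [
    {"line_id": "A.1", "links": [{"text": "Example article", "url": "https://example.com/a1"}]}
]

for title, rows in group_references_by_section(sections, references):
    print(title)
    for line_id, summary, links in rows:
        print(f"  {line_id} {summary}: {len(links)} link(s)")
```

Iterating this way means the section title is available directly, without detecting the first reference in each chunk.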

Make docs

remember to add to CI

add horizontal rules before and after default checklist in docs
shrink h2 font size so it is smaller than h1

Clarify CLI help text

Here's an edited version:

  -l, --checklist PATH  Override default checklist file with a path to a custom checklist.yml file.
  -f, --format TEXT     Output format. Default is "markdown". Can be one of
                        [markdown, rst, jupyter, html]. Ignored and file extension used if
                        --output is passed.
  -o, --output PATH     Output file path. Extension can be one of [.md, .rst,
                        .ipynb, .html]. The checklist is appended if file exists.
  -c, --clipboard       Whether or not to copy the output to the clipboard.
  -w, --overwrite       Overwrite output file if it exists. Default is False, which will append to existing file.
  --help                Show this message and exit.

Add CI

Run tests
Render examples

Do we support RMarkdown?

Markdown templating should work, but the extension should be .Rmd. We'll have to test this.

bonus fix: make extensions case insensitive.

Minor text edits

inconsistency on project v. projects (a command line to add an ethics checklist to data science projects / to a data science project); this appears in site description and github repo description as well
inconsistency on command line v. command-line
flame emoji doesn't render

Add link from README to documentation page

Is C.4 sufficiently clear?

This came up in discussions, and some people thought it was clear, some thought it was not.

Opening this issue to continue the discussion.
