Introduction

Access to new types of data has revolutionized much of science. That revolution, however, has yet to fully reach the scientific study of human beings and their interactions, where progress has been hindered by the legal, technical, and operational obstacles to sharing and accessing sensitive data about individuals. New paradigms and platforms are required to enable sensitive data from different sources to be discovered, integrated, and analyzed in an appropriately controlled manner, while also allowing researchers to share analysis methods, results, and expertise in ways not easily possible today. Success in this endeavor may enable a fundamental change in how data on human beings from governments, statistical agencies, research institutions, and other organizations are made available for research, and thus a flowering of new methods for studying human subjects.

Our goal in this class is to study the technical problems raised by sensitive data and the technical solutions that have been developed for working with this data. We will read and discuss scientific papers in the area and hear from distinguished invited speakers with experience in the development and application of those solutions. We’ll investigate not just the technologies but also the practical applications that make secure access to sensitive data so important: applications such as new scientific approaches to understanding human beings and human societies, and evidence-based policy for healthcare, poverty, and crime. Topics to be discussed include:

  • Opportunities for social science, evidence-based policy, program evaluation, and healthcare
  • Building and operating secure data enclaves
  • Secure multi-party computation
  • Statistical disclosure limitation methods such as differential privacy
  • Privacy-preserving data mining
  • Search and discovery with sensitive data
  • Data de-identification, linkage, and re-identification
  • Deployment scenarios in cities, state governments, and the federal government
  • Regulatory challenges and regimes
  • Methods for sharing code and reproducible research
  • The Globus safe data platform

Class organization

The class is held Tue/Thu 9:00-10:20am in Ryerson 277. It runs from March 28 to June 1. The instructor is Ian Foster [email protected], whose office is in Searle 222. Please feel free to email anytime with questions or to set up a meeting.

Along with this website, we'll use Piazza for course announcements, submitting paper reviews, posting lecture slides, and general discussion and questions about course material.

Grading

  • Paper reviews — 25%
  • Paper presentations — 20%
  • Participation — 5%
  • Course project — 50%

Separate pages provide guidance for paper reviews and presentations, and for class projects.

Schedule (subject to change)

Date Content
3/28 Introduction. Defining the space. Slides.
3/30 Technological challenges and opportunities. Required: Privacy and security with big data, Simson Garfinkel, 2017; Privacy protecting research: Challenges and opportunities, Daniel Goroff and Jules Polonetsky, 2017.
4/4 Guest lecture: Matt Gee, Harris School. Read this blog post and this paper on Enigma.
4/6 Guest lecture: Charlie Catlett, Argonne National Laboratory. The Array of Things and opportunities and challenges in urban data. Please read and comment on this paper on AoT.
4/11 Safe data enclaves and related topics. Required: Five safes: designing data access for research, Tanvi Desai, Felix Ritchie, Richard Welpton, 2016. (Notes); Research infrastructures for the safe analysis of sensitive data, Ian Foster, 2017; Recommended: NORC Data Enclave: Providing Secure Remote Access to Sensitive Microdata, Julia Lane et al., 2009; Data Access in a Cyber World: Making Use of Cyberinfrastructure, Julia Lane et al., 2008.
4/13 Safe data enclaves, contd. Required: Advancing Integrated Data Systems by States and Local Governments, Dennis Culhane et al., 2017; Cloud Kotta: Enabling Secure and Scalable Data Analytics in the Cloud, Yadu Babuji et al., 2016.
4/18 Guest lecture (remote): Julia Lane, New York University. Big data for public policy: The quadruple helix. Background reading: P1, P2, P3, P4, P5.
4/20 Homomorphic encryption. Three papers: Technical Perspective: A First Glimpse of Cryptography's Holy Grail, Daniele Micciancio, 2010; Computing Arbitrary Functions of Encrypted Data, Craig Gentry, 2010; What is Homomorphic Encryption, and Why Should I Care?, Craig Stuntz, 2010.
4/25 More on homomorphic encryption, etc. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy, Dowlin et al., 2016; CryptDB: Protecting Confidentiality with Encrypted Query Processing, Raluca Ada Popa et al., 2011.
4/27 Differential privacy. Privacy by the Numbers: A New Approach to Safeguarding Data, Erica Klarreich, 2012; A firm foundation for private data analysis, Cynthia Dwork, 2011; Privacy-Preserving Data Analysis for the Federal Statistical Agencies, John Abowd et al., 2017; The algorithmic foundations of differential privacy, Cynthia Dwork and Aaron Roth, 2014.
5/2 Synthetic data and statistical disclosure limitation. How Protective Are Synthetic Data?, Abowd and Vilhuber, 2008; Statistical Disclosure Control for Survey Data, Chris Skinner, 2009.
5/4 Guest lecture: Simson Garfinkel, US Census Bureau. Technical challenges in disclosure control.
5/9 Differential privacy. On Significance of the Least Significant Bits For Differential Privacy, Ilya Mironov, 2012. Verifiable Differential Privacy, Arjun Narayan et al., 2015.
5/11 Multiparty and masked. Multiparty Computation Goes Live, Peter Bogetoft et al., 2008. Computing on Masked Data to improve the Security of Big Data, Vijay Gadepally et al., 2015.
5/16 Project reviews
5/18 Guest lecture: Brett Goldstein, University of Chicago. Responsible data mining.
5/23 Guest lecture: Bruce Meyer, University of Chicago.
5/25 Malaria paper and Communication-Efficient Learning of Deep Networks from Decentralized Data, McMahan et al., 2016, and see blog post, 2017.
5/30 Project presentations for those graduating this quarter
6/1 Reading period
6/6 No class (Ian at conference)
6/8 Project presentations

Papers to be discussed (a work in progress)

Overview

Safe data enclaves

Anonymization and de-identification

Reidentification risks

Statistical disclosure control (Notes)

Secure multi-party computation

Computing on masked data

Residual information in documents

Responsible data

Other
