whythawk / data-as-a-science
Lesson guide and textbook for "Data as a Science" course.
Appraise the challenges inherent in evaluating and communicating analysis and results.
Analytical models are built on complex data. If the data or the methods are opaque or inaccessible, then the outcome is uncertain, or open to dispute.
Example: Google Flu Trends
Discover data for reuse and appreciate the value of data retention beyond the scope for which it was collected or sampled.
Cohort data and open repositories … leading towards the project for this module.
E.g. the various cohort studies (INDEPTH, etc.)
Determine appropriate sample sizes and analysis of variance (ANOVA) methods.
Sample sizes, ANOVA, testing means across many groups.
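A one-way ANOVA test of means across several groups can be sketched in a few lines of Python. This is a minimal illustration with synthetic exam-style scores for three hypothetical groups, not data from the course:

```python
# One-way ANOVA: do the means of several groups differ more than
# chance would suggest? Data are synthetic and illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(70, 5, 30)   # mean 70, sd 5, n = 30
group_b = rng.normal(75, 5, 30)   # mean 75 (the different group)
group_c = rng.normal(70, 5, 30)   # mean 70

# f_oneway tests the null hypothesis that all group means are equal.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests at least one group mean differs; ANOVA does not say which, which motivates follow-up pairwise comparisons.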
Synthesize all you have learned to present new insight from old data.
Prepare students for the project.
Example of coffee and depression (could reverse data to demonstrate what the sample may have looked like, and how it could be spurious).
Student project will be to identify a cohort dataset online, and then study that data to produce a short statistical study. They need to take into account all that they have learned.
Recognise issues in analysing and exploring data for analysis.
Importance of prepublication (pre-registration) in countering bias; given the law of large numbers, any sufficiently large dataset will yield some correlations.
Example: Abortion and crime.
Infer interpolated data values using other data as input.
This is not necessarily a "danger" (e.g. Net = Gross – Other), but the temptation could be to "create" data that was never observed.
Note, this is also a data normalisation step (e.g. convert Y/N to True/False).
Always publish workings and definitions.
Assess expected statistical outcomes using geometric, binomial, and empirical distributions.
As above, using computational means to assess distributions.
Include simulations and case study.
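A computational comparison of empirical and theoretical distributions can be sketched as below. The parameters are illustrative, not from the course materials:

```python
# Compare an empirical (simulated) binomial distribution with the
# theoretical one, and check the geometric distribution's mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, trials = 10, 0.3, 100_000

# Empirical: simulate many 10-trial experiments, count 4-success runs.
samples = rng.binomial(n, p, size=trials)
empirical_p4 = np.mean(samples == 4)

# Theoretical probability of exactly 4 successes in 10 trials.
theoretical_p4 = stats.binom.pmf(4, n, p)

# Geometric: expected number of trials to first success is 1/p.
geom_mean = rng.geometric(p, size=trials).mean()
```

With enough simulated trials the empirical frequencies converge on the theoretical probabilities, which is the point of the simulation exercise.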
Construct multiple plots, of different plot types, on a single set of axes.
Showing different data series, generated in different ways, and presented in different formats, on a single set of axes.
E.g. dot plots + histograms + line chart.
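The combination above can be sketched with matplotlib; the data here are synthetic, and the choice of series is illustrative:

```python
# Several plot types on one set of axes: a histogram (distribution),
# a scatter of raw observations, and a line for the underlying trend.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(0, 2, 50)   # noisy linear data

fig, ax = plt.subplots()
ax.hist(y, bins=15, alpha=0.3, label="distribution of y")
ax.scatter(x, y, s=10, label="observations")
ax.plot(x, 2 * x, color="red", label="underlying trend")
ax.legend()
fig.savefig("combined_plot.png")
```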
Vaccination coverage for herd immunity. E.g. measles vaccination, and the impact of anti-vaccination movements.
Synthesize the ethical lessons into a framework for professional behaviour.
Synthesis of lessons into a single set of "Hippocratic" data ethics principles.
Consolidate curation methods into an integrated data and knowledge management process.
Challenge yourself with machine learning and an introduction to TensorFlow.
Machine learning with image data; intro to image classification with TensorFlow; MNIST and reference to the Kaggle Lung Cancer Challenge.
Assemble a data visualisation toolkit and reference framework.
Setting the final project using Randomised Control Trial data.
Resolve conflicts between the need for human agency with algorithmic decision-making.
Cultural critique, who “owns” it? Do you take instructions from the machine, or is it a tool? Will it change behaviour?
Example: machine says “no”; or abstracting users away from the decision-making process and only permitting actions based on machines.
Develop systems which support automated data collection and analysis while recognising user agency.
Automated systems of individual / machine data collection; agency, society, trust and data authenticity; cf. "press 1 for x" systems where no human is involved and subjects cannot deviate from a predetermined path.
Emergent systems, automated diagnostics (cf telephone helplines like 111 in the UK)
Decide whether quantitative variables are related using permutation testing.
Permutation testing for categorical distributions; Radial basis function, and Support Vector Machines.
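A permutation test can be sketched directly in Python without any specialist library. This is a minimal version with synthetic treatment and control groups, not the course dataset:

```python
# Permutation test: is the observed difference in group means larger
# than expected if the group labels were assigned at random?
import numpy as np

rng = np.random.default_rng(7)
treatment = rng.normal(5.0, 1.0, 40)
control = rng.normal(4.5, 1.0, 40)

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])

count = 0
n_perm = 5_000
for _ in range(n_perm):
    rng.shuffle(pooled)                       # random relabelling
    diff = pooled[:40].mean() - pooled[40:].mean()
    if abs(diff) >= abs(observed):            # as extreme as observed?
        count += 1

p_value = count / n_perm
```

The p-value is simply the fraction of random relabellings that produce a difference at least as extreme as the one observed.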
Plot categorical data – and interim states of categorisation – using violin plots and small multiples.
Continue with cancer data, or a new study from Dataverse.
Determine the impact of marginalised populations in data sampling, and risk of spurious correlations.
Sampling methods and impacts on likely results, danger of spurious correlations.
Examples: how to sample drug addiction, or other stigmatised social characteristics and norms.
Identify and apply licences and accessibility for data use and reuse.
Types of licence, and access for data (from CC to embargoed release) as well as the process of pre-prints and publication.
Evaluate standard errors on confidence intervals.
Point estimates, sampling distribution, standard error, and confidence intervals.
Plot confidence intervals and standard deviations from the mean.
Charts of confidence intervals from mean, standard deviation charts.
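The standard error and confidence interval calculations above can be sketched as follows, using the normal approximation and synthetic data:

```python
# Standard error and a 95% confidence interval for a sample mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(100, 15, 200)   # a measured characteristic

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error
z = stats.norm.ppf(0.975)                        # ~1.96 for 95% CI
ci_low, ci_high = mean - z * se, mean + z * se
```

The interval narrows as the sample grows, since the standard error shrinks with the square root of n.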
Chronic illness population? CDC alcohol abuse data… leads to sampling bias (cf social stigma) and alcohol abuse goes with other social problems (which came first? Alcohol or social problems?)
Determine the implications in the collection, mining and recombination of open- and digital data.
As we use online and specialist data sources for analysis, the risks of de-anonymisation rise.
Examples: Netflix de-anonymization, NY Taxis, Genome recovery, fitness trackers.
Employ methods for presenting data for synthesis and usage, and for data maintenance.
Knowledge management systems, APIs, data standards and approaches to archival.
Methods for moving, tracking, cleaning, ETL (extract, transform, load), history and ownership.
Perform techniques in randomness and probability to understand distribution and likelihood.
Randomness, probability, generating datasets, tree diagrams, sampling techniques.
From software random numbers, to people (ie. before we get to people) … (samples with / without replacement), law of large numbers / averages.
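Sampling with and without replacement, and the law of large numbers, can be demonstrated with a simulated die. A minimal sketch:

```python
# Sampling with and without replacement; the running mean of fair-die
# rolls converges on the expected value of 3.5 (law of large numbers).
import numpy as np

rng = np.random.default_rng(11)
population = np.arange(1, 7)    # faces of a die

# With replacement: the same face can come up repeatedly.
with_repl = rng.choice(population, size=10_000, replace=True)

# Without replacement: each face can only be drawn once.
without_repl = rng.choice(population, size=6, replace=False)

running_mean = with_repl.mean()
```

Sampling without replacement of the whole population is just a shuffle, which is why the six draws contain every face exactly once.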
Apply histograms, line charts and scatter plots to illustrate probability.
Using charts from previous lessons.
False positives / negatives from a universal breast cancer screening program, including cost and individual anxiety. Also consider risk from universal datasets of this nature.
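The screening example is a direct application of Bayes' theorem. The numbers below are illustrative, not clinical figures:

```python
# False positives in universal screening: when a condition is rare,
# even an accurate test produces mostly false positives.
prevalence = 0.005        # 0.5% of screened population has the disease
sensitivity = 0.90        # P(test positive | disease)
specificity = 0.95        # P(test negative | no disease)

# Total probability of a positive result (true + false positives).
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))

# Bayes' theorem: probability of disease given a positive test.
p_disease_given_positive = sensitivity * prevalence / p_positive
```

With these figures, a positive result implies only about an 8% chance of actually having the disease, which frames the anxiety and cost arguments against universal screening.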
Evaluate methods for balancing right and wrong in trolley problems.
Introduction to trolley problems and the conflicting ethical considerations they raise. Foundation for the module.
Example: Would you kill the fat man? Torture a kidnapper to find a child? Should Facebook hide Nazis to “pretend” they don’t exist? YouTube, terrorism and research.
Differentiate between data which should be archived, and which should be deleted.
Personal data, financial transactions, research / cohort data, legal responsibilities.
Predict trends and future data with correlation and least squares regression.
Correlation, regression and least squares;
Outliers and influence for linear regression and least squares regression.
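The influence of a single outlier on a least squares fit can be shown directly. A minimal sketch with synthetic data:

```python
# Least squares fit with numpy.polyfit, and the pull a single
# outlier exerts on the fitted slope.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = 3 * x + 1 + rng.normal(0, 0.5, 50)   # true slope 3, intercept 1

slope, intercept = np.polyfit(x, y, 1)

# Add one extreme outlier at the edge of the range and refit.
x_out = np.append(x, 10.0)
y_out = np.append(y, -50.0)
slope_out, _ = np.polyfit(x_out, y_out, 1)
```

A single high-leverage point drags the slope well away from the true value, which is why outlier diagnostics matter for regression.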
Apply line and scatter techniques to demonstrate and visualise predictive models.
Data visualisation methods for presenting uncertainty and variance.
Orlistat to reduce fat absorption; consider drugs and cost vs efficiency, NHS NICE.
Examine notions of fairness between competing interest groups and amongst inexpert stakeholders.
Ultimatum games, and notions of fairness. What happens when data and analysis contradict notions of fairness? Which "fairness", for whom, and why?
Examples: carbon emissions and accountability; GM foods and ownership; medicines and CRISPR; i.e. what is “fair” when most people cannot understand the data or methodology?
Consider the risks and requirements for data disclosure when legal notions of privacy change.
Fairness in data curation; non-competing disclosure when some authorities / data owners refuse to publish; Freedom of Information and Tony Blair’s regret.
Could it appear on the front page of the New York Times … would you be embarrassed?
Identify appropriate variables for inclusion in a model and consider the variability of predictions.
Prediction intervals, and the variability of predictions; identifying variables for exclusion, and selecting between candidate models. The p-value approach, adjusted R², and confidence intervals.
“All models are wrong, but some are useful.”
Present variability of predictions to reflect confidence and uncertainty in the underlying data and methods.
Continuing from #12.
Continuing previous leukaemia case study; raise considerations about notifiability (HIV, Ebola … Trump’s response to banning re-entry of volunteer nurses).
Develop mechanisms to explore the risks of counterfactual and unknowable consequences.
Counterfactuals (pp 58 in Ethics), and actual vs expected consequences; immediate vs distant;
Example: Google AI in China potentially used to support police state.
Determine methods to permit data sources who are private individuals to access, modify or remove their personal records.
Individual access for inspecting / correction of data; Know Your Customer, and methods for secure identification of “owners” / sources of data;
Example: credit reports on individuals, some countries require users to have access. What about patient records?
Implement, test and optimise classifiers using a variety of methods.
Multiple attributes to k-nearest, and a return to linear regression; plus, maybe, stochastic optimisation (from “Collective Intelligence”)?
Reveal classification clustering in multiple dimensions on 2D plots using random “jittering”.
Adding a very small random “jitter” to scatter plots to ensure that overlaps can be seen.
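Jittering is simply adding small random noise to the plotted positions. A minimal sketch with synthetic categorical data:

```python
# Random jitter: add small uniform noise to overlapping categorical
# x-positions so that individual points become visible when plotted.
import numpy as np

rng = np.random.default_rng(6)
categories = rng.integers(0, 3, size=300)    # three category codes
values = rng.normal(categories * 2.0, 0.5)   # values vary by category

# Spread points within +/- 0.2 of their category position.
jittered_x = categories + rng.uniform(-0.2, 0.2, size=300)
```

The jitter only affects display positions; the underlying category codes and values are untouched, so no analysis should ever be run on the jittered coordinates.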
Continue with cancer data, or a new study from Dataverse.
Appraise the risk of bias in p-hacking, and the risk to scientific self-correction from stigmatising researchers.
P-hacking happens unintentionally; review of mechanisms by which they occur and how to avoid, and calculate what to do about it.
However, how ethical is it to stigmatise researchers where research subsequently turns out to be p-hacked?
Examples: Amy Cuddy, Data Colada and guidelines from “False-positive psychology”
Prepare data for long-term accessibility through persistent digital object identifiers and platforms that support them.
DOI and URN are essential to ensure persistent referencing and discovery; data don’t exist if they keep moving … plus, leads into discussion and value of long-term cohort studies.
Assess sample robustness using the Central Limit Theorem, and infer statistical significance based on inference for numerical data.
Central Limit Theorem, variability of the sample mean; determine approaches using point estimates or test statistics.
Difference of means, and hypothesis testing based on difference of means.
Plot sampling distributions for the mean of different sample sizes, and distribution of different sample means.
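The shrinking spread of sample means can be simulated directly. A minimal sketch, using a deliberately skewed source population:

```python
# Central Limit Theorem: the spread of sample means shrinks roughly
# as sigma / sqrt(n), even for a skewed source distribution.
import numpy as np

rng = np.random.default_rng(8)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, mean 2

def sampling_sd(n, reps=2_000):
    """Standard deviation of the mean across reps samples of size n."""
    means = [rng.choice(population, n).mean() for _ in range(reps)]
    return np.std(means)

sd_small = sampling_sd(10)
sd_large = sampling_sd(100)
```

The ratio of the two spreads should be close to sqrt(10), matching the theoretical sigma / sqrt(n) scaling.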
Impact of mothers who smoke on birth weight; alternatively, breast-feeding and baby head circumference. How might one p-hack these data?
Interpret the Doctrine of Double Effect as it applies to negative and positive duties.
DDE, trolley problems related to “extra push” and “two loops”.
Example: Facebook / Twitter and elections (Nazis)
Organise and present both data and methodology that supports trust in analysis.
At some point, findings will be challenged. If the process has ensured that data and methodology can be found and assessed independently, then this becomes answerable.
Example: Ratings agencies during subprime; World Bank Doing Business Report;
Interpret residuals and correlation using visual and numerical diagnostics.
Residuals, linear relationships with correlation, extrapolation, and outliers.
Present uncertainty, and build trust, with data and methods.
As we reach more speculative methods of forecasting and predicting, presentation also needs to display and support transparent and trusted methods.
Ethics Toolkit pp 56 – Consequentialism
Alcohol and gender / body weight
Recognise the importance and process for applying concepts of privacy and anonymity.
Concepts in privacy, techniques in anonymity.
Examples: Facebook, Yahoo, syphilis in the US
Integrate methods for metadata and archival into data management.
Descriptive and structural metadata.
Methods for sourcing data: acquisition, entry, reception
Investigate data distribution and confidence, and reshape using Pandas.
Experiments, numerical and categorical data.
Populations, sampling and observational data + standard deviation and mean.
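Since the course introduces Pandas here, the summary and reshaping steps can be sketched as below. The table and column names are illustrative:

```python
# Reshaping with Pandas: summarise a long-format table of observations
# by group (mean, standard deviation), then pivot it to wide format.
import pandas as pd

df = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "week": [1, 2, 1, 2],
    "value": [10.0, 12.0, 14.0, 18.0],
})

# Mean and sample standard deviation per group.
summary = df.groupby("group")["value"].agg(["mean", "std"])

# Wide format: one row per group, one column per week.
wide = df.pivot(index="group", columns="week", values="value")
```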
Illustrate core analysis with histograms and box plots.
Scatterplots of paired data.
Histograms and shape.
Box plots, quartiles and medians.
Judge ethical responses – and states of reflective equilibrium – when justifying causing harm based on analysis.
John Rawls’ “Reflective equilibrium” and the trolley problems of “tractor” and “tumble”. Effectively: what happens when your choice enables a third-party to kill a person?
Example: Microsoft and US Supreme Court re data privacy in Ireland; Yahoo and Chinese Journalists; Can you kill a clinically dead (but still “alive”) person for their organs?
Assess whether data – which in the wrong hands could cause harm – should have a purge function.
Dead man’s switches and data; if a dataset could be used to cause harm should there be a process to auto-delete if certain conditions occur? How do you manage backups in such a case (data-at-rest)?
Example: New York register of undocumented immigrants under Donald Trump; algorithms that ID gay people.
Prepare multiple regressions and inference for true slopes.
Includes adjusted R2 and bootstrapping scatter plots, plus null hypotheses of true slopes.
Ensure ambiguity in data, and risks in outcomes, are reflected in visualisation and results.
"As scientists aggregate and summarise large heterogeneous data, risks of ambiguity rise", cf. Anscombe's Quartet;
Error bars, and uncertainty in visualisation.
Multiple predictors for a single condition, and identifying which is the priority; recognising uncertainty and potential for wrong policy change.
National Cancer Institute NIH/GDC Genomic Data Commons Leukaemia dataset
Assess the risks of population exclusion and hazardous externalities on data quality.
Statistics are not collected in a vacuum. They can inform and lead to policy implementation.
Consider the implications for policy when data samples are not representative or exclude the most at risk.
Examples: the marginalisation of marginalised groups reinforced through policy (cf. African American pain management).
Employ and assess methods for data collection and sampling.
Producing new data requires a protocol and a clear description of process and assumptions.
Any bias or oversampling to represent specific groups must be documented, and must serve research objectives.
Validate statistical data through hypotheses testing and error probabilities.
Hypothesis testing frameworks, skew and p-values, error probabilities.
p-thresholds (e.g. 0.05), and significance levels.
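A two-sided hypothesis test for a difference of means can be sketched with SciPy. The samples here are synthetic, and the 0.05 threshold is the conventional default:

```python
# Two-sided t-test for a difference of means, judged against a
# significance threshold of 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
sample_a = rng.normal(10.0, 2.0, 60)
sample_b = rng.normal(11.0, 2.0, 60)

# Null hypothesis: the two samples share the same population mean.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
significant = p_value < 0.05
```

The threshold is a convention, not a law of nature, which connects back to the p-hacking discussion elsewhere in the course.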
Present p-values, confidence and significance to support analysis.
Example where risk of exclusion leads to poor policy. E.g. testing for food safety (bias by geography) – dataset has geospatial data. Build hypothesis testing into case study.
Navigate competing ethical claims inherent to liquid societies.
Liquid Society / Modernity by Zygmunt Bauman; who are creators of sophisticated analytical systems responsible to in a global, fragmented and liquid society? Ethics is a framework for answering questions where everyone is “indignant”.
Example: sexual harassment training in Saudi Arabia; Union Carbide in Bhopal; cf Relativism, Subjectivism and Virtue Ethics.
Resolve and integrate the technical implications for data curation in multiple jurisdictions.
Maintaining data, methods, evidence, and systems in compliance with multi-jurisdictional activity;
Example: EU hate-speech laws aimed at Twitter / Facebook. GDPR in EU, China Great Firewall.
Evaluate causality in analysis, and gain insight into genetic algorithms.
Causality, and intro to meta-analysis; intro to genetic algorithms;
Present proportions and decisions as probability trees.
Decisions trees, predictions, and tree diagrams.
Randomised control trial; treatment / control data to assess causality;
Dataverse: use of low cost Android tablets to train community health workers.
Consider the strengths and limitations of machine intelligence in the Chinese Room Experiment.
John Searle’s Chinese Room Experiment; strong vs weak AI. Intentionality and consciousness.
Example: Facebook Go; Alibaba and Microsoft AI reading test;
Determine methods to secure and share data to support requirements for machine intelligence.
Requirements for data to support machine intelligence (cf car vision and snow); methods for securing personal data and individual consent where data are collected autonomously by always-on machines;
Example: UK backlash against centralised patient records; Norway’s hack of patient records; the personal media devices: Alexa, Cortana, Siri, Google …
Create data classifiers using logistic regression.
Classification using logistic regression, and modelling probability.
Plot classification outcomes for logistic regression using mosaic plots.
Mosaic plots designed to present categorical data.
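A logistic regression classifier can be sketched with scikit-learn; the two-class dataset below is synthetic, not the course's leukaemia data:

```python
# Logistic regression: model the probability of class membership
# on a synthetic, well-separated one-feature dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Class 0 centred at -2, class 1 centred at +2.
X = np.concatenate([rng.normal(-2, 1, 100),
                    rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression().fit(X, y)
accuracy = model.score(X, y)

# Probabilities near the decision boundary sum to 1 across classes.
probs = model.predict_proba([[0.0]])[0]
```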
Leukaemia dataset for classification.
Differentiate as to when algorithmic decision-making has the potential to cause harm.
Automation lowers costs but also leads to “machine says no” situations. What can be done to “soften” this?
Example: automated diagnostics, insurance claims, filtered views (e.g. Facebook/ Twitter)
Determine appropriate methods to classify and process data to aid discovery and analysis.
“Aboutness” of data is critical to discovery, but data creators don’t always classify appropriately.
Discussion of methods to classify aboutness, both manual and automatic, recognising that automation may misclassify (since it cannot understand context).
Eg. automated trading platforms response to news “events”.
Simulate resampling by generating new random samples using bootstrapping.
Bootstrap method, and using confidence intervals.
Present replicated samples as a stacked interval chart, displaying median and interval.
Chart is a variation of that used in meta-analysis for comparison.
Demonstrate how samples may show a tremendous bias, even inadvertently (hence why resampling important) and ties in above. E.g. patient waiting times.
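The bootstrap can be sketched with synthetic waiting-time data, echoing the example above. A minimal version:

```python
# Bootstrap: resample with replacement to estimate the sampling
# distribution of the median, and a 95% percentile interval.
import numpy as np

rng = np.random.default_rng(13)
waiting_times = rng.exponential(scale=30.0, size=200)  # minutes

boot_medians = np.array([
    np.median(rng.choice(waiting_times,
                         size=len(waiting_times), replace=True))
    for _ in range(5_000)
])
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
```

The percentile interval comes straight from the resampled medians, with no distributional assumptions, which is the bootstrap's appeal for skewed data like waiting times.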
Consider the implications of strange loops – self-referential programs with emergent properties – on analysis.
Douglas Hofstadter developed the concept of strange loops; self-referential and paradoxical systems. DNA acts upon itself, and machine intelligence systems can adjust their own programs. When a system enters a strange loop, who is responsible for the outcomes of its decisions?
Example: machine intelligence systems where we don’t know how they make decisions.
Analyse how data and methodology are curated from emergent systems.
Getting AI to explain its methodology and document how it does things. Bias in machine intelligence, from failing to recognise female voices to racial bias, is the result of a chain of events. How do we curate these properties?
Categorise data using k-nearest neighbours for unsupervised learning clustering of data.
K-nearest neighbours, systems for training and testing;
Supervised vs unsupervised learning; k-means clustering vs k-nearest
Plot supervised and unsupervised learning outcomes using decision boundaries.
Dendrograms and decision boundaries using meshgrid.
Chronic Kidney Disease? Continue leukaemia, or look for something new on Figshare or Dataverse?
Acknowledge the privacy and confidentiality issues in data storage and security of personal data.
Consider issues in data loss, responsibility, and whether the risk from data loss outweighs the research value of collecting it.
Example: ICE and DACA deportation, National Immigration Law Centre, HIV and social stigma … cf “My daughter refused medicine because of stigma”.
Recognise responsibilities and mechanisms for securing data-at-rest and data-in-motion.
Data storage requires recognition of responsibilities for securing that data, including encryption, authentication and authorisation. This is true for owners, users, and intermediaries (chain of custody).
Methods for securing and authentication with storage.
Apply linear and continuous sampling methods to assess normal distributions.
Linear and continuous probability; normal distribution and standardisation.
Some Bayes.
Plot distributions as normal histograms and continuous curves.
Applying previously learned charts, and learning how to present normal distributions.
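Overlaying a fitted normal curve on a histogram can be sketched as below. The birth weights are simulated with illustrative parameters, not real records:

```python
# Histogram of simulated birth weights with the fitted normal
# probability density curve overlaid.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
weights = rng.normal(3.4, 0.5, 1_000)   # kg, illustrative parameters

fig, ax = plt.subplots()
ax.hist(weights, bins=30, density=True, alpha=0.5)
xs = np.linspace(weights.min(), weights.max(), 200)
ax.plot(xs, stats.norm.pdf(xs, weights.mean(), weights.std()),
        color="red")
fig.savefig("birthweight.png")
```

Setting density=True scales the histogram so the fitted curve and the bars share the same vertical axis.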
Birthweight should follow a normal distribution; also raises personal data security issues since birth certificates must be secured indefinitely but used continuously.