whythawk / data-as-a-science
Lesson guide and textbook for "Data as a Science" course.
Appraise the challenges inherent in evaluating and communicating analysis and results.
Analytical models are built on complex data. If the data or the methods are opaque or inaccessible, then the outcome is uncertain, or open to dispute.
Example: Google Flu Trends
Discover data for reuse and appreciate the value of data retention beyond the scope for which it was collected or sampled.
Cohort data and open repositories … leading towards the project for this module.
E.g. the various cohort studies (INDEPTH, etc.)
Determine appropriate sample sizes and analysis of variance (ANOVA) methods.
Sample sizes, ANOVA, testing means across many groups.
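A one-way ANOVA test of means across several groups can be sketched in a few lines of Python. This is a minimal illustration with synthetic exam-style scores for three hypothetical groups, not data from the course:

```python
# One-way ANOVA: do the means of several groups differ more than
# chance would suggest? Data are synthetic and illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(70, 5, 30)   # mean 70, sd 5, n = 30
group_b = rng.normal(75, 5, 30)   # mean 75 (the different group)
group_c = rng.normal(70, 5, 30)   # mean 70

# f_oneway tests the null hypothesis that all group means are equal.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests at least one group mean differs; ANOVA does not say which, which motivates follow-up pairwise comparisons.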
Synthesize all you have learned to present new insight from old data.
Prepare students for the project.
Example of coffee and depression (could reverse data to demonstrate what the sample may have looked like, and how it could be spurious).
Student project will be to identify a cohort dataset online, and then study that data to produce a short statistical study. They need to take into account all that they have learned.
Recognise issues in analysing and exploring data for analysis.
Importance of prepublication (pre-registration) in countering bias; given the law of large numbers, any sufficiently large dataset will yield some correlations.
Example: Abortion and crime.
Infer interpolated data values using other data as input.
This is not necessarily a "danger" (e.g. Net = Gross – Other), but the temptation could be to "create" data that was never observed.
Note, this is also a data normalisation step (e.g. convert Y/N to True/False).
Always publish workings and definitions.
Assess expected statistical outcomes using geometric, binomial, and empirical distributions.
As above, using computational means to assess distributions.
Include simulations and case study.
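A computational comparison of empirical and theoretical distributions can be sketched as below. The parameters are illustrative, not from the course materials:

```python
# Compare an empirical (simulated) binomial distribution with the
# theoretical one, and check the geometric distribution's mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, trials = 10, 0.3, 100_000

# Empirical: simulate many 10-trial experiments, count 4-success runs.
samples = rng.binomial(n, p, size=trials)
empirical_p4 = np.mean(samples == 4)

# Theoretical probability of exactly 4 successes in 10 trials.
theoretical_p4 = stats.binom.pmf(4, n, p)

# Geometric: expected number of trials to first success is 1/p.
geom_mean = rng.geometric(p, size=trials).mean()
```

With enough simulated trials the empirical frequencies converge on the theoretical probabilities, which is the point of the simulation exercise.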
Construct multiple plots, of different plot types, on a single set of axes.
Showing different data series, generated in different ways, and presented in different formats, on a single set of axes.
E.g. dot plots + histograms + line chart.
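The combination above can be sketched with matplotlib; the data here are synthetic, and the choice of series is illustrative:

```python
# Several plot types on one set of axes: a histogram (distribution),
# a scatter of raw observations, and a line for the underlying trend.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(0, 2, 50)   # noisy linear data

fig, ax = plt.subplots()
ax.hist(y, bins=15, alpha=0.3, label="distribution of y")
ax.scatter(x, y, s=10, label="observations")
ax.plot(x, 2 * x, color="red", label="underlying trend")
ax.legend()
fig.savefig("combined_plot.png")
```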
Vaccination coverage for herd immunity. E.g. measles vaccination, and the impact of anti-vaccination movements.
Synthesize the ethical lessons into a framework for professional behaviour.
Synthesis of lessons into a single set of "Hippocratic" data ethics principles.
Consolidate curation methods into an integrated data and knowledge management process.
Challenge yourself with machine learning and an introduction to TensorFlow.
Machine learning with image data; intro to image classification with TensorFlow; MNIST and reference to the Kaggle Lung Cancer Challenge.
Assemble a data visualisation toolkit and reference framework.
Setting the final project using Randomised Control Trial data.
Resolve conflicts between the need for human agency with algorithmic decision-making.
Cultural critique, who “owns” it? Do you take instructions from the machine, or is it a tool? Will it change behaviour?
Example: machine says “no”; or abstracting users away from the decision-making process and only permitting actions based on machines.
Develop systems which support automated data collection and analysis while recognising user agency.
Automated systems of individual / machine data collection; agency, society, trust and data authenticity; cf. "press 1 for x" systems where no human is involved and subjects cannot deviate from a predetermined path.
Emergent systems, automated diagnostics (cf telephone helplines like 111 in the UK)
Decide whether quantitative variables are related using permutation testing.
Permutation testing for categorical distributions; Radial basis function, and Support Vector Machines.
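A permutation test can be sketched directly in Python without any specialist library. This is a minimal version with synthetic treatment and control groups, not the course dataset:

```python
# Permutation test: is the observed difference in group means larger
# than expected if the group labels were assigned at random?
import numpy as np

rng = np.random.default_rng(7)
treatment = rng.normal(5.0, 1.0, 40)
control = rng.normal(4.5, 1.0, 40)

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])

count = 0
n_perm = 5_000
for _ in range(n_perm):
    rng.shuffle(pooled)                       # random relabelling
    diff = pooled[:40].mean() - pooled[40:].mean()
    if abs(diff) >= abs(observed):            # as extreme as observed?
        count += 1

p_value = count / n_perm
```

The p-value is simply the fraction of random relabellings that produce a difference at least as extreme as the one observed.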
Plot categorical data – and interim states of categorisation – using violin plots and small multiples.
Continue with cancer data, or a new study from Dataverse.
Determine the impact of marginalised populations in data sampling, and risk of spurious correlations.
Sampling methods and impacts on likely results, danger of spurious correlations.
Examples: how to sample drug addiction, or other stigmatised social characteristics and norms.
Identify and apply licences and accessibility for data use and reuse.
Types of licence, and access for data (from CC to embargoed release) as well as the process of pre-prints and publication.
Evaluate standard errors on confidence intervals.
Point estimates, sampling distribution, standard error, and confidence intervals.
Plot confidence intervals and standard deviations from the mean.
Charts of confidence intervals from mean, standard deviation charts.
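The standard error and confidence interval calculations above can be sketched as follows, using the normal approximation and synthetic data:

```python
# Standard error and a 95% confidence interval for a sample mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(100, 15, 200)   # a measured characteristic

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error
z = stats.norm.ppf(0.975)                        # ~1.96 for 95% CI
ci_low, ci_high = mean - z * se, mean + z * se
```

The interval narrows as the sample grows, since the standard error shrinks with the square root of n.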
Chronic illness population? CDC alcohol abuse data… leads to sampling bias (cf social stigma) and alcohol abuse goes with other social problems (which came first? Alcohol or social problems?)
Determine the implications in the collection, mining and recombination of open- and digital data.
As we use online and specialist data sources for analysis, the risks of de-anonymisation rise.
Examples: Netflix de-anonymization, NY Taxis, Genome recovery, fitness trackers.
Employ methods for presenting data for synthesis and usage, and for data maintenance.
Knowledge management systems, APIs, data standards and approaches to archival.
Methods for moving, tracking, cleaning, ETL (extract, transform, load), history and ownership.
Perform techniques in randomness and probability to understand distribution and likelihood.
Randomness, probability, generating datasets, tree diagrams, sampling techniques.
From software random numbers, to people (ie. before we get to people) … (samples with / without replacement), law of large numbers / averages.
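Sampling with and without replacement, and the law of large numbers, can be demonstrated with a simulated die. A minimal sketch:

```python
# Sampling with and without replacement; the running mean of fair-die
# rolls converges on the expected value of 3.5 (law of large numbers).
import numpy as np

rng = np.random.default_rng(11)
population = np.arange(1, 7)    # faces of a die

# With replacement: the same face can come up repeatedly.
with_repl = rng.choice(population, size=10_000, replace=True)

# Without replacement: each face can only be drawn once.
without_repl = rng.choice(population, size=6, replace=False)

running_mean = with_repl.mean()
```

Sampling without replacement of the whole population is just a shuffle, which is why the six draws contain every face exactly once.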
Apply histograms, line charts and scatter plots to illustrate probability.
Using charts from previous lessons.
False positives / negatives from a universal breast cancer screening program, including cost and individual anxiety. Also consider risk from universal datasets of this nature.
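The screening example is a direct application of Bayes' theorem. The numbers below are illustrative, not clinical figures:

```python
# False positives in universal screening: when a condition is rare,
# even an accurate test produces mostly false positives.
prevalence = 0.005        # 0.5% of screened population has the disease
sensitivity = 0.90        # P(test positive | disease)
specificity = 0.95        # P(test negative | no disease)

# Total probability of a positive result (true + false positives).
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))

# Bayes' theorem: probability of disease given a positive test.
p_disease_given_positive = sensitivity * prevalence / p_positive
```

With these figures, a positive result implies only about an 8% chance of actually having the disease, which frames the anxiety and cost arguments against universal screening.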
Evaluate methods for balancing right and wrong in trolley problems.
Introduction to trolley problems and the conflicting ethical considerations they raise. Foundation for the module.
Example: Would you kill the fat man? Torture a kidnapper to find a child? Should Facebook hide Nazis to “pretend” they don’t exist? YouTube, terrorism and research.
Differentiate between data which should be archived, and which should be deleted.
Personal data, financial transactions, research / cohort data, legal responsibilities.
Predict trends and future data with correlation and least squares regression.
Correlation, regression and least squares;
Outliers and influence for linear regression and least squares regression.
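The influence of a single outlier on a least squares fit can be shown directly. A minimal sketch with synthetic data:

```python
# Least squares fit with numpy.polyfit, and the pull a single
# outlier exerts on the fitted slope.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = 3 * x + 1 + rng.normal(0, 0.5, 50)   # true slope 3, intercept 1

slope, intercept = np.polyfit(x, y, 1)

# Add one extreme outlier at the edge of the range and refit.
x_out = np.append(x, 10.0)
y_out = np.append(y, -50.0)
slope_out, _ = np.polyfit(x_out, y_out, 1)
```

A single high-leverage point drags the slope well away from the true value, which is why outlier diagnostics matter for regression.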
Apply line and scatter techniques to demonstrate and visualise predictive models.
Data visualisation methods for presenting uncertainty and variance.
Orlistat to reduce fat absorption; consider drugs and cost vs efficiency, NHS NICE.
Examine notions of fairness between competing interest groups and amongst inexpert stakeholders.
Ultimatum games, and notions of fairness. What happens when data and analysis contradict notions of fairness? Which "fairness", for whom, and why?
Examples: carbon emissions and accountability; GM foods and ownership; medicines and CRISPR; i.e. what is “fair” when most people cannot understand the data or methodology?
Consider the risks and requirements for data disclosure when legal notions of privacy change.
Fairness in data curation; non-competing disclosure when some authorities / data owners refuse to publish; Freedom of Information and Tony Blair’s regret.
Could it appear on the front page of the New York Times … would you be embarrassed?
Identify appropriate variables for inclusion in a model and consider the variability of predictions.
Prediction intervals, and the variability of predictions; identifying variables for exclusion, and selecting between candidate models. The p-value approach, adjusted R², and confidence intervals.
“All models are wrong, but some are useful.”
Present variability of predictions to reflect confidence and uncertainty in the underlying data and methods.
Continuing from #12.
Continuing previous leukaemia case study; raise considerations about notifiability (HIV, Ebola … Trump’s response to banning re-entry of volunteer nurses).
Develop mechanisms to explore the risks of counterfactual and unknowable consequences.
Counterfactuals (pp 58 in Ethics), and actual vs expected consequences; immediate vs distant;
Example: Google AI in China potentially used to support police state.
Determine methods to permit data sources who are private individuals to access, modify or remove their personal records.
Individual access for inspecting / correction of data; Know Your Customer, and methods for secure identification of “owners” / sources of data;
Example: credit reports on individuals, some countries require users to have access. What about patient records?
Implement, test and optimise classifiers using a variety of methods.
Multiple attributes to k-nearest, and a return to linear regression; plus, maybe, stochastic optimisation (from “Collective Intelligence”)?
Reveal classification clustering in multiple dimensions on 2D plots using random “jittering”.
Adding a very small random “jitter” to scatter plots to ensure that overlaps can be seen.
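Jittering is simply adding small random noise to the plotted positions. A minimal sketch with synthetic categorical data:

```python
# Random jitter: add small uniform noise to overlapping categorical
# x-positions so that individual points become visible when plotted.
import numpy as np

rng = np.random.default_rng(6)
categories = rng.integers(0, 3, size=300)    # three category codes
values = rng.normal(categories * 2.0, 0.5)   # values vary by category

# Spread points within +/- 0.2 of their category position.
jittered_x = categories + rng.uniform(-0.2, 0.2, size=300)
```

The jitter only affects display positions; the underlying category codes and values are untouched, so no analysis should ever be run on the jittered coordinates.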
Continue with cancer data, or a new study from Dataverse.
Appraise the risk of bias in p-hacking, and the risk to scientific self-correction from stigmatising researchers.
P-hacking happens unintentionally; review of mechanisms by which they occur and how to avoid, and calculate what to do about it.
However, how ethical is it to stigmatise researchers where research subsequently turns out to be p-hacked?
Examples: Amy Cuddy, Data Colada and guidelines from “False-positive psychology”
Prepare data for long-term accessibility through persistent digital object identifiers and platforms that support them.
DOI and URN are essential to ensure persistent referencing and discovery; data don’t exist if they keep moving … plus, leads into discussion and value of long-term cohort studies.
Assess sample robustness using the Central Limit Theorem, and infer statistical significance based on inference for numerical data.
Central Limit Theorem, variability of the sample mean; determine approaches using point estimates or test statistics.
Difference of means, and hypothesis testing based on difference of means.
Plot sampling distributions for the mean of different sample sizes, and distribution of different sample means.
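The shrinking spread of sample means can be simulated directly. A minimal sketch, using a deliberately skewed source population:

```python
# Central Limit Theorem: the spread of sample means shrinks roughly
# as sigma / sqrt(n), even for a skewed source distribution.
import numpy as np

rng = np.random.default_rng(8)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, mean 2

def sampling_sd(n, reps=2_000):
    """Standard deviation of the mean across reps samples of size n."""
    means = [rng.choice(population, n).mean() for _ in range(reps)]
    return np.std(means)

sd_small = sampling_sd(10)
sd_large = sampling_sd(100)
```

The ratio of the two spreads should be close to sqrt(10), matching the theoretical sigma / sqrt(n) scaling.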
Impact of mothers who smoke on birth weight; alternatively, breast-feeding and baby head circumference. How might one p-hack these data?
Interpret the Doctrine of Double Effect as it applies to negative and positive duties.
DDE, trolley problems related to “extra push” and “two loops”.
Example: Facebook / Twitter and elections (Nazis)
Organise and present both data and methodology that supports trust in analysis.
At some point, findings will be challenged. If the process has ensured that data and methodology can be found and assessed independently, then this becomes answerable.
Example: Ratings agencies during subprime; World Bank Doing Business Report;
Interpret residuals and correlation using visual and numerical diagnostics.
Residuals, linear relationships with correlation, extrapolation, and outliers.
Present uncertainty, and build trust, with data and methods.
As we reach more speculative methods of forecasting and predicting, presentation also needs to display and support transparent and trusted methods.
Ethics Toolkit pp 56 – Consequentialism
Alcohol and gender / body weight
Recognise the importance and process for applying concepts of privacy and anonymity.
Concepts in privacy, techniques in anonymity.
Examples: Facebook, Yahoo, syphilis in the US
Integrate methods for metadata and archival into data management.
Descriptive and structural metadata.
Methods for sourcing data: acquisition, entry, reception
Investigate data distribution and confidence, and reshape using Pandas.
Experiments, numerical and categorical data.
Populations, sampling and observational data + standard deviation and mean.
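Since the course introduces Pandas here, the summary and reshaping steps can be sketched as below. The table and column names are illustrative:

```python
# Reshaping with Pandas: summarise a long-format table of observations
# by group (mean, standard deviation), then pivot it to wide format.
import pandas as pd

df = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "week": [1, 2, 1, 2],
    "value": [10.0, 12.0, 14.0, 18.0],
})

# Mean and sample standard deviation per group.
summary = df.groupby("group")["value"].agg(["mean", "std"])

# Wide format: one row per group, one column per week.
wide = df.pivot(index="group", columns="week", values="value")
```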
Illustrate core analysis with histograms and box plots.
Scatterplots of paired data.
Histograms and shape.
Box plots, quartiles and medians.
Judge ethical responses – and states of reflective equilibrium – when justifying causing harm based on analysis.
John Rawls’ “Reflective equilibrium” and the trolley problems of “tractor” and “tumble”. Effectively: what happens when your choice enables a third-party to kill a person?
Example: Microsoft and US Supreme Court re data privacy in Ireland; Yahoo and Chinese Journalists; Can you kill a clinically dead (but still “alive”) person for their organs?
Assess whether data – which in the wrong hands could cause harm – should have a purge function.
Dead man’s switches and data; if a dataset could be used to cause harm should there be a process to auto-delete if certain conditions occur? How do you manage backups in such a case (data-at-rest)?
Example: New York register of undocumented immigrants under Donald Trump; algorithms that ID gay people.
Prepare multiple regressions and inference for true slopes.
Includes adjusted R2 and bootstrapping scatter plots, plus null hypotheses of true slopes.
Ensure ambiguity in data, and risks in outcomes, are reflected in visualisation and results.
"As scientists aggregate and summarise large heterogeneous data, risks of ambiguity rise", cf. Anscombe's Quartet;
Error bars, and uncertainty in visualisation.
Multiple predictors for a single condition, and identifying which is the priority; recognising uncertainty and potential for wrong policy change.
National Cancer Institute NIH/GDC Genomic Data Commons Leukaemia dataset
Assess the risks of population exclusion and hazardous externalities on data quality.
Statistics are not collected in a vacuum. They can inform and lead to policy implementation.
Consider the implications for policy when data samples are not representative or exclude the most at risk.
Examples: the marginalisation of marginalised groups reinforced through policy (cf. African American pain management).
Employ and assess methods for data collection and sampling.
Producing new data requires a protocol and a clear description of process and assumptions.
Any bias or oversampling to represent specific groups must be documented, and must serve research objectives.
Validate statistical data through hypotheses testing and error probabilities.
Hypothesis testing frameworks, skew and p-values, error probabilities.
p-thresholds (e.g. 0.05), and significance levels.
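A two-sided hypothesis test for a difference of means can be sketched with SciPy. The samples here are synthetic, and the 0.05 threshold is the conventional default:

```python
# Two-sided t-test for a difference of means, judged against a
# significance threshold of 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
sample_a = rng.normal(10.0, 2.0, 60)
sample_b = rng.normal(11.0, 2.0, 60)

# Null hypothesis: the two samples share the same population mean.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
significant = p_value < 0.05
```

The threshold is a convention, not a law of nature, which connects back to the p-hacking discussion elsewhere in the course.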
Present p-values, confidence and significance to support analysis.
Example where risk of exclusion leads to poor policy. E.g. testing for food safety (bias by geography) – dataset has geospatial data. Build hypothesis testing into case study.
Navigate competing ethical claims inherent to liquid societies.
Liquid Society / Modernity by Zygmunt Bauman; who are creators of sophisticated analytical systems responsible to in a global, fragmented and liquid society? Ethics is a framework for answering questions where everyone is “indignant”.
Example: sexual harassment training in Saudi Arabia; Union Carbide in Bhopal; cf Relativism, Subjectivism and Virtue Ethics.
Resolve and integrate the technical implications for data curation in multiple jurisdictions.
Maintaining data, methods, evidence, and systems in compliance with multi-jurisdictional activity;
Example: EU hate-speech laws aimed at Twitter / Facebook. GDPR in EU, China Great Firewall.
Evaluate causality in analysis, and gain insight into genetic algorithms.
Causality, and intro to meta-analysis; intro to genetic algorithms;
Present proportions and decisions as probability trees.
Decisions trees, predictions, and tree diagrams.
Randomised control trial; treatment / control data to assess causality;
Dataverse: use of low cost Android tablets to train community health workers.
Consider the strengths and limitations of machine intelligence in the Chinese Room Experiment.
John Searle’s Chinese Room Experiment; strong vs weak AI. Intentionality and consciousness.
Example: Facebook Go; Alibaba and Microsoft AI reading test;
Determine methods to secure and share data to support requirements for machine intelligence.
Requirements for data to support machine intelligence (cf car vision and snow); methods for securing personal data and individual consent where data are collected autonomously by always-on machines;
Example: UK backlash against centralised patient records; Norway’s hack of patient records; the personal media devices: Alexa, Cortana, Siri, Google …
Create data classifiers using logistic regression.
Classification using logistic regression, and modelling probability.
Plot classification outcomes for logistic regression using mosaic plots.
Mosaic plots designed to present categorical data.
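A logistic regression classifier can be sketched with scikit-learn; the two-class dataset below is synthetic, not the course's leukaemia data:

```python
# Logistic regression: model the probability of class membership
# on a synthetic, well-separated one-feature dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Class 0 centred at -2, class 1 centred at +2.
X = np.concatenate([rng.normal(-2, 1, 100),
                    rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression().fit(X, y)
accuracy = model.score(X, y)

# Probabilities near the decision boundary sum to 1 across classes.
probs = model.predict_proba([[0.0]])[0]
```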
Leukaemia dataset for classification.
Differentiate as to when algorithmic decision-making has the potential to cause harm.
Automation lowers costs but also leads to “machine says no” situations. What can be done to “soften” this?
Example: automated diagnostics, insurance claims, filtered views (e.g. Facebook/ Twitter)
Determine appropriate methods to classify and process data to aid discovery and analysis.
“Aboutness” of data is critical to discovery, but data creators don’t always classify appropriately.
Discussion of methods to classify aboutness, both manual and automatic, recognising that automation may misclassify (since it cannot understand context).
Eg. automated trading platforms response to news “events”.
Simulate resampling by generating new random samples using bootstrapping.
Bootstrap method, and using confidence intervals.
Present replicated samples as a stacked interval chart, displaying median and interval.
Chart is a variation of that used in meta-analysis for comparison.
Demonstrate how samples may show a tremendous bias, even inadvertently (hence why resampling important) and ties in above. E.g. patient waiting times.
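The bootstrap can be sketched with synthetic waiting-time data, echoing the example above. A minimal version:

```python
# Bootstrap: resample with replacement to estimate the sampling
# distribution of the median, and a 95% percentile interval.
import numpy as np

rng = np.random.default_rng(13)
waiting_times = rng.exponential(scale=30.0, size=200)  # minutes

boot_medians = np.array([
    np.median(rng.choice(waiting_times,
                         size=len(waiting_times), replace=True))
    for _ in range(5_000)
])
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
```

The percentile interval comes straight from the resampled medians, with no distributional assumptions, which is the bootstrap's appeal for skewed data like waiting times.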
Consider the implications of strange loops – self-referential programs with emergent properties – on analysis.
Douglas Hofstadter developed the concept of strange loops; self-referential and paradoxical systems. DNA acts upon itself, and machine intelligence systems can adjust their own programs. When a system enters a strange loop, who is responsible for the outcomes of its decisions?
Example: machine intelligence systems where we don’t know how they make decisions.
Analyse how data and methodology are curated from emergent systems.
Getting AI to explain its methodology and document how it does things. Bias in machine intelligence, from failing to recognise female voices to racial bias, is the result of a chain of events. How do we curate these properties?
Categorise data using k-nearest neighbours for unsupervised learning clustering of data.
K-nearest neighbours, systems for training and testing;
Supervised vs unsupervised learning; k-means clustering vs k-nearest
Plot supervised and unsupervised learning outcomes using decision boundaries.
Dendrograms and decision boundaries using meshgrid.
Chronic Kidney Disease? Continue leukaemia, or look for something new on Figshare or Dataverse?
Acknowledge the privacy and confidentiality issues in data storage and security of personal data.
Consider issues in data loss, responsibility, and whether the risk from data loss outweighs the research value of collecting it.
Example: ICE and DACA deportation, National Immigration Law Centre, HIV and social stigma … cf “My daughter refused medicine because of stigma”.
Recognise responsibilities and mechanisms for securing data-at-rest and data-in-motion.
Data storage requires recognition of responsibilities for securing that data, including encryption, authentication and authorisation. This is true for owners, users, and intermediaries (chain of custody).
Methods for securing and authentication with storage.
Apply linear and continuous sampling methods to assess normal distributions.
Linear and continuous probability; normal distribution and standardisation.
Some Bayes.
Plot distributions as normal histograms and continuous curves.
Applying previously learned charts, and learning how to present normal distributions.
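Overlaying a fitted normal curve on a histogram can be sketched as below. The birth weights are simulated with illustrative parameters, not real records:

```python
# Histogram of simulated birth weights with the fitted normal
# probability density curve overlaid.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
weights = rng.normal(3.4, 0.5, 1_000)   # kg, illustrative parameters

fig, ax = plt.subplots()
ax.hist(weights, bins=30, density=True, alpha=0.5)
xs = np.linspace(weights.min(), weights.max(), 200)
ax.plot(xs, stats.norm.pdf(xs, weights.mean(), weights.std()),
        color="red")
fig.savefig("birthweight.png")
```

Setting density=True scales the histogram so the fitted curve and the bars share the same vertical axis.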
Birthweight should follow a normal distribution; also raises personal data security issues since birth certificates must be secured indefinitely but used continuously.