Start Date: October 10, 2022
- This Dropbox folder contains all of the videos from our Zoom course sessions and recordings from a previous MEG bioinformatics workshop.
These lessons are designed to introduce researchers to the R programming language for statistical analysis of metagenomic sequencing data. While we are primarily developing these training resources for the Microbial Ecology Group (MEG), we would love to get your input on improvements to any component so that we can one day provide this as a useful public resource. As the lessons are meant to be an informal collection of resources and tutorials, we have liberally used parts and pieces of other online lessons and tailored them to our purposes. We attempt to give credit when possible by linking the original source, and we are happy to hear recommendations for other resources to include.
We wholeheartedly encourage students to independently troubleshoot the majority of problems they might encounter by:
- googling it (or using another search engine)
- getting help from other students in our Slack channel #2021-AMR++workshop
- searching bioinformatics forums such as stackoverflow.com, biostars.org, and seqanswers.com
Upon completion of these lessons, students will:
- have their computer set up with the R and RStudio software
- know how to read in count matrices from bioinformatic analysis of sequence data
- be able to explore and summarize bioinformatic results using:
- diversity indices and box plots
- ordination with non-metric multidimensional scaling (NMDS)
- heatmaps
- be familiar with common statistical techniques such as:
- Wilcoxon test
- Generalized linear models
- Analysis of similarities (ANOSIM)
- Differential abundance testing using a zero-inflated Gaussian (ZIG) model
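As a rough illustration of two of the techniques listed above, the sketch below computes a Shannon diversity index per sample and compares two groups with a Wilcoxon rank-sum test. It uses only base R. The count matrix, the sample names, and the `group` labels are made up for this example, and the hand-written `shannon()` helper is a hypothetical stand-in, not a function from the workshop scripts.

```r
# Toy count matrix: rows are features (e.g. taxa or AMR genes), columns are samples.
# All names and values are invented for illustration.
counts <- matrix(
  c(10,  0,  5, 20,
    30, 12,  0,  8,
     0, 40, 25,  2,
     5,  5,  5,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("feature", 1:4), paste0("sample", 1:4))
)

# Hypothetical grouping of the four samples
group <- factor(c("control", "control", "treated", "treated"))

# Shannon diversity index for one sample: H = -sum(p * log(p)),
# where p is the proportion of each observed feature
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

# Richness (number of observed features) and Shannon diversity per sample
richness <- colSums(counts > 0)
shannon_index <- apply(counts, 2, shannon)

# Wilcoxon rank-sum test comparing Shannon diversity between the two groups
wilcox_result <- wilcox.test(shannon_index ~ group)
wilcox_result$p.value
```

Passing the same formula to `boxplot()` draws the matching box plot. With only two samples per group the test has very little power, so treat the p-value here as a demonstration of the call rather than a meaningful result.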
Group email: [email protected]
Dr. Paul Morley -- [email protected]
Dr. Noelle Noyes -- [email protected]
Peter Ferm -- [email protected]
Dr. Lee Pinnell -- [email protected]
Dr. Enrique Doster -- [email protected]
Dr. Lisa Perez -- [email protected]
The metagenomic sequencing approach determines the type of analysis you can perform:
- Shotgun metagenomic sequencing: can analyze both the microbiome and resistome, in addition to other sequences such as plasmid-associated sequences or virulence factors
- Target-enriched resistome sequencing (MEGARes baits): can only analyze the resistome
- 16S rRNA amplicon sequencing: can only analyze the microbiome
In this repository, we'll show you examples of running variants of the AMR++ pipeline to achieve your bioinformatic analysis goals. We'll be using code found in this repository of bioinformatic pipelines:
- AMR++ pipeline
- The main_AmrPlusPlus_v2_withKraken.nf script analyzes shotgun metagenomic sequencing reads to characterize the microbiome using the taxonomic classifier kraken2, and aligns reads to our MEGARes database to characterize the resistome.
- The main_AmrPlusPlus_v2.nf script is simply a subset of the entire pipeline and only performs the resistome analysis.
- Qiime2 pipeline
- We use the Qiime2 pipeline to analyze 16S rRNA reads and export the results to a file format that we can analyze with R.
Remember, the analysis will always have to be based on your study design and performed with the goal of testing your a priori hypotheses. The scripts in this repository are merely meant to provide an outline for you to begin your analysis and branch off as needed.
Using RStudio, download everything in this repository and change your working directory to the newly downloaded AMRplusplus_bioinformatic_workshop directory. Start by opening the script on the main page, Stats_overview_script.R, and follow along for a brief explanation of how each of the scripts below fits into your analysis.
If you don't have RStudio installed, click on the link below to explore our test dataset using Binder and RStudio:
Otherwise, follow the instructions in this tutorial for installing R and RStudio on your personal computer.
The data exploration and statistical analysis we will cover are divided into four main steps, each with associated scripts:
- Loading count matrix results from bioinformatic analyses into R
- Calculating summary statistics
- Normalizing counts and creating exploratory figures
- Running some common statistical tests
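To make the first three steps concrete, here is a minimal base R sketch, assuming a count matrix with features in rows and samples in columns. The inline CSV text, the feature and sample names, and the use of total-sum scaling as the normalization are all assumptions chosen to keep the example self-contained; the workshop scripts may load different files and use a different normalization method.

```r
# Step 1: load a count matrix. A real analysis would read a file from disk;
# here we parse inline text so the example runs on its own.
# All names and counts are invented for illustration.
csv_text <- "feature,sampleA,sampleB,sampleC
gene1,120,30,0
gene2,15,60,45
gene3,0,10,255"
counts <- as.matrix(read.csv(text = csv_text, row.names = 1))

# Step 2: summary statistics
depth <- colSums(counts)             # total counts per sample
prevalence <- rowSums(counts > 0)    # number of samples in which each feature appears

# Step 3: normalize counts. Total-sum scaling divides each column by its
# sample total, giving relative abundances (an assumption for this sketch).
rel_abund <- sweep(counts, 2, depth, "/")

summary(depth)
round(rel_abund, 3)
```

Relative abundances make samples with different sequencing depths comparable at a glance, which is usually enough for a first exploratory figure. Which normalization is appropriate for formal testing depends on your data and study design.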
MEG resources
- MEG bioinformatic term glossary
- AMR++ pipeline overview
- Bioinformatic AMR and 16S pipeline overview
- Bioinformatics statistics overview
- RStudio cheatsheets
- This website has tons of helpful cheatsheets for various R packages and analysis methods. It also includes cheatsheets translated into other languages.
- YaRrr! The Pirate’s Guide to R
- This is a free online book that goes over many useful topics in a quirky, but fun way! Follow along with our simplified R scripts in Lesson 1 and reference this book if you have any other questions.
- R programming Coursera course
- This free Coursera course goes in-depth with all of the functionality of R. It combines videos with example R scripts for you to follow along with. We recommend this course after you have been playing around with R a bit and want to learn more about the details of how R works.
- Introduction to R workshop
- We haven't personally tried this workshop, but they have a combination of videos, slides, and R code for various topics.
- ggpubr
- Nice package for "publication-ready" figures.
- Harvard's Data Science: R Basics
- dataviz project
- This website is for a private company, but they have a great interface for exploring different figure types
- Visual vocabulary
- Handy outline and explanation for the uses of different plots.
- You can also check out this interactive figure of the same material
- FT Visual Journalism Team
- Awesome site with articles covering various topics and with the emphasis on creating awesome graphics to convey
- Interactive Jupyter notebooks
- GGplot colors and themes
- More ggplot colors and themes
- Interactive heatmaps
- Explain shell
- Cool website that explains bash commands piece by piece.
- GUide to STatistical Analysis in Microbial Ecology (GUSTA ME)
- Diversity indices
- LHS 610: Exploratory Data Analysis for Health
- We haven't personally tried this course, but they provide great videos and code examples for learning how to explore data using R.
- R-specific resources
- Batch effects
- Tackling the widespread and critical impact of batch effects in high-throughput data
- Why Batch Effects Matter in Omics Data, and How to Avoid Them
- Beware the bane of batch effects
- Mitigating the adverse impact of batch effects in sample pattern detection
- Identifying and mitigating batch effects in whole genome sequencing data
- Misc
The development of this tutorial was supported in part by USDA NIFA Grant No. 2018-51300-28563, University of Minnesota College of Veterinary Medicine, The VERO Program at Texas A&M University and West Texas A&M University, and the State of Minnesota Agricultural Research, Education, Extension and Technology Transfer program.