Giter Club home page Giter Club logo

ms_scallop_popgen's Introduction

Scallop Population Genomics

This repository is specifically for analyzing the data to support the following manuscript, and comes with no guarantees of usefulness for other purposes.

Requirements

Software:
Stacks v2 http://catchenlab.life.illinois.edu/stacks/
cutadapt https://cutadapt.readthedocs.io/en/stable/index.html
bwa http://bio-bwa.sourceforge.net/
samtools http://www.htslib.org/
R https://www.r-project.org
FastQC https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
multiQC https://multiqc.info/
plink https://zzz.bwh.harvard.edu/plink/plink2.shtml
fineRADstructure https://www.milan-malinsky.org/fineradstructure
NeEstimator v.2 http://www.molecularfisherieslaboratory.com.au/neestimator-software/
note: see R scripts for individual package requirements

Additional analytic repositories:
stacks_workflow https://github.com/enormandeau/stacks_workflow
simple_pop_stats https://github.com/bensutherland/simple_pop_stats

Clone ms_scallop_popgen, stacks_workflow, and simple_pop_stats in a common folder, with all repos at the same level.

0. Project planning

Before starting, plan the appropriate RADseq enzymes for use with the scallop, considering target number of loci and relation to sequencing depth.
Use the following script, after updating the locations for the scallop and Pacific oyster genome (as a reference) reference genomes:
01_scripts/scallop_genome_digest.R

1. stacks_workflow - prepare data

Execute all commands from within the stacks_workflow repository.

a. Setup

  1. Copy links to all raw data in 02-raw (use cp -l)
  2. Prepare the tab-delimited sample_information.csv as explained in stacks_workflow (also see template sample_information.csv).
  3. Download reference genome: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_002113885.1/
    note: source citation is Wang et al. 2017, Nat Ecol Evol. https://pubmed.ncbi.nlm.nih.gov/28812685/

b. Quality control and trim

View raw data with fastqc and multiqc:

mkdir 02-raw/fastqc_raw    
fastqc 02-raw/*.fastq.gz -o 02-raw/fastqc_raw/ -t 5    
multiqc -o 02-raw/fastqc_raw/ 02-raw/fastqc_raw   

Prepare lane_info.txt file with automated script:
./00-scripts/00_prepare_lane_info.sh

Trim adapters and too short reads:
./00-scripts/01_cutadapt.sh <numCPUs>

View trimmed data with fastqc and multiqc:

mkdir 02-raw/trimmed/fastqc_trimmed/    
fastqc -t 5 02-raw/trimmed/*.fastq.gz -o 02-raw/trimmed/fastqc_trimmed/
multiqc -o 02-raw/trimmed/fastqc_trimmed/ 02-raw/trimmed/fastqc_trimmed       

c. De-multiplex reads

Detect cut sites and barcodes to de-multiplex and truncate reads to 80 bp with process_radtags in parallel:
00-scripts/02_process_radtags_2_enzymes_parallel.sh 80 pstI mspI 8

Collect samples from multiple runs together and rename samples:
./00-scripts/03_rename_samples.sh

Prepare the population map file:
./00-scripts/04_prepare_population_map.sh

2. stacks_workflow - align samples and quality filter on samples

Execute all commands from within the stacks_workflow repository.

a. Align samples against the reference genome

Index the reference genome with bwa:
bwa index <ref_genome>

Point the alignment script to the reference genome (full path) using the GENOMEFOLDER and GENOME variables, and run:
00-scripts/bwa_mem_align_reads.sh 6

b. Inspect alignment results

Compare per-sample reads and alignments, and per-sample reads and number of aligned scaffolds:
./../ms_scallop_popgen/01_scripts/assess_results.sh

Produces:

04-all_samples/reads_per_sample_table.txt
04-all_samples/mappings_per_sample.txt
# A graph of number reads and aligned reads in the main directory

Optional other calculations:

# Total reads in all samples:     
awk '{ print $2 } ' 04-all_samples/reads_per_sample_table.txt | paste -sd+ - | bc
# Total reads before de-multiplexing (note: divide by 4 due to fastqc):   
for i in $(ls 02-raw/*.fastq.gz) ; do echo $i ; gunzip -c $i | wc -l ; done

c. Filter out low reads/alignments individuals

Make directory to remove samples:
mkdir 04-all_samples/removed_samples

  1. Manually remove specific individuals with too few reads:
    mv 04-all_samples/PIP_631.* 04-all_samples/removed_samples/
    note: you can now go back and recalculate sample stats if you choose, but save the original outputs to avoid writing over.

  2. Manually remove these entries from the pop map

3. stacks_workflow - genotype

Execute all commands from within the stacks_workflow repository.

a. Run Stacks v.2.0 genotyper

Update the number of cores, then run:
00_scripts/stacks2_gstacks_reference.sh

b. Run populations to output VCF to identify outlier het samples

This step will be used to identify any problematic samples. Use single SNP per locus VCF:

# Calculate inbreeding coefficient per sample
vcftools --het --vcf 05-stacks/populations.snps.vcf --out ./07-filtered_vcfs/samples

# Format data
awk '{ print $5, $1, $1 }' 07-filtered_vcfs/samples.het | cut -d "_" -f 1,2 > 07-filtered_vcfs/samples.het.data

# Plot
./00-scripts/utility_scripts/plot_heterozygozity.R 07-filtered_vcfs/samples.het.data

# Observe output and view any problematic individuals
 

If you are concerned about any individuals, remove them from the population map, then re-run gstacks. In our case, we will remove individual VIU002, due to a beyond high level of heterozygosity.

c. Re-run Stacks v.2.0 genotyper

00_scripts/stacks2_gstacks_reference.sh

d. Filter using the populations module

Update the populations module script 00-scripts/stacks2_populations_reference.sh as follows, run once for single-snp and once for microhaps:

# single SNP per locus
populations -P "$STACKS_FOLDER" -M "$INFO_FILES_FOLDER"/"$POP_MAP" \
    -t "$NUM_CPU" -p 3 -r 0.7 \
    --ordered-export --fasta-loci --vcf \
    --min-maf 0.01 --hwe --plink --write-single-snp

# Save out all populations files appropriately
mkdir 05-stacks/popn_out_single_snp
mv 05-stacks/populations.* 05-stacks/popn_out_single_snp/

# microhaplotypes
populations -P "$STACKS_FOLDER" -M "$INFO_FILES_FOLDER"/"$POP_MAP" \
    -t "$NUM_CPU" -p 3 -r 0.7 \
    --ordered-export --fasta-loci --vcf \
    --min-maf 0.01 --hwe --plink \
    --radpainter

# Save out all populations files appropriately
mkdir 05-stacks/popn_out_microhaps
mv 05-stacks/populations.* 05-stacks/popn_out_microhaps/

e. General characterization of microhaplotype data

Determine the number of variants per RAD-tag:
grep -vE '^#' 05-stacks/popn_out_microhaps/populations.haps.vcf | awk '{ print $4 }' - | awk ' { print length } ' - | sort -nr | uniq -c | sort -nrk2 | less
...this works on the fourth column, which is the reference allele (e.g., TTTT).

Total the number of variants in all RAD-tags to be sure the above worked:
grep -vE '^#' 05-stacks/popn_out_microhaps/populations.haps.vcf | awk '{ print $4 }' - | awk ' { print length } ' - | sort -nr | uniq -c | sort -nrk2 | awk '{sum+=$1} END {print sum}'

Determine the number of alleles per RAD-tag:

# First determine the number of alternate alleles that are present, as per commas in col 5
grep -vE '^#' 05-stacks/popn_out_microhaps/populations.haps.vcf | awk '{ print $5 }' - | while read line ; do echo "$line" | tr -cd "," | wc -c ; done > allele_count.txt

# Second, add two to each (one for three records per two commas, one for the REF allele)     
awk '$1+=2' allele_count.txt | sort -rnk1 | uniq -c | less

Note: this may need to be done after filtering for HWE, or at least state that these loci may be deviating from HWE.

Create a tab-delimited file with the chromosome name, the position of the microhap, the catalog ID from stacks, and the alleles (alleles are separated by a comma):
grep -vE '^#' 05-stacks/popn_out_microhaps/populations.haps.vcf | awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "," $5 } ' - > 05-stacks/popn_out_microhaps/populations.haps_alleles.txt

Open Rscript ms_scallop_popgen/01_scripts/count_alleles_per_marker.R and run interactively. This will output a file 05-stacks/popn_out_microhaps/populations.haps_alleles_counts.txt

f. Convert output plink files

For these steps, you will use the single SNP per locus data.

Convert plink files to a useable format for adegenet:
plink --ped 05-stacks/popn_out_single_snp/populations.plink.ped --map 05-stacks/popn_out_single_snp/populations.plink.map --maf 0.01 --recode A --allow-extra-chr --out 05-stacks/popn_out_single_snp/populations_single_snp

Notes: --recode creates a new text fileset, and the A modifier causes additive (0/1/2) format to be generated, which is useful for input into R.

Output: 05-stacks/popn_out_single_snp/populations_single_snp.raw, the input file for R analyses.

4. Population genetic analyses

The following will be conducted in the repo ms_scallop_popgen and will use the R environment.

a. Import data

Import the formatted plink data from above into R and convert to genind:
ms_scallop_popgen/01_scripts/01_import_plink_to_genind.R
Output: ./03_results/prepared_genind.RData.

b. Characterize and filter data

Characterize the genind file and filter based on missing data:
ms_scallop_popgen/01_scripts/02_sps_char_and_filt.R

This will require you to source simple_pop_stats/01_scripts/simple_pop_stats_start.R.
Output will be in simple_pop_stats_pyes/03_results

This will do the following:

  • load ../ms_scallop_popgen/03_results/prepared_genind.RData
  • designate colours for each collection
  • calculate missing data by individual, plot, and remove excess missing individuals
  • calculate per locus HWE deviation and excess obs. heterozygosity, and filter
    ...and will output 03_results/post_all_filters.RData.

c. Population genetic analysis

After filtering, open and run interactively:
ms_scallop_popgen/01_scripts/03_sps_analysis.R

This will do the following:

  • load 03_results/post_all_filters.Rdata
  • create separate datasets for VIU or JPN, then filter by low MAF, plot MAF distributions
  • using sep VIU and JPN datasets, calc per loc stats (for HOBS), merge and plot (see 03_results/popn_sp_hobs_calc/
  • multivariate statistics (i.e., PCA, DAPC)
  • genetic differentiation (FST) calculation
  • identify private alleles for BC, JPN, VIU
  • estimate inter-individual relatedness between individuals
  • estimate inbreeding coefficient (F)

5. Relatedness

Use the output radpainter file, with fineRADstructure.
If stacks workflow is at same level as current repo:
cp ../stacks_workflow/05-stacks/populations.haps.radpainter ./02_inputs/
note: future version should create its own folder, as output of initial command outputs to input folder too.
note: future versions should ensure populations output has RAD loci ordered according to genomic coord

fineRADstructure process:

# Infer the co-ancestry matrix from RAD-seq data
RADpainter paint 02_inputs/populations.haps.radpainter

# Infer recent shared ancestry using haplotype linkage information and coalescence; assign indiv to popn
finestructure -x 100000 -y 100000 -z 1000 02_inputs/populations.haps_chunks.out 02_inputs/populations.haps_chunks.mcmc.xml

# Build tree
finestructure -m T -x 10000 02_inputs/populations.haps_chunks.out 02_inputs/populations.haps_chunks.mcmc.xml 02_inputs/populations.haps_chunks.mcmcTree.xml

# Copy R scripts from your programs folder (included with installation of fineRADstructure)
cp ~/programs/fineRADstructure/fineRADstructurePlot.R ./01_scripts/
cp ~/programs/fineRADstructure/FinestructureLibrary.R ./01_scripts/

# Edit the required sections of fineRADstructurePlot.R and run interactively
# The output figures will be in 04_fineRADstructure/ 

6. Effective population size

Export genepop file with microhaplotype or single SNP per locus data from stacks, rename with *.gen file extension, and place in ms_scallop_popgen/02_inputs folder. Use the software NeEstimator v2 to estimate effective population size (Ne) per population. Suggested output folder is 05_NeEst.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.