
metagwastoolkit's Introduction

MetaGWASToolKit

v1.1 "Willibrord"


A ToolKit to perform a Meta-analysis of Genome-Wide Association Studies (GWAS). Check out the wiki for more details.

MetaGWASToolKit is a set of scripts that executes a fully automated meta-analysis of GWAS. It is an extension of MANTEL, originally developed by Paul I.W. de Bakker, Sara L. Pulit, and Jessica van Setten; many of its features were described previously and were later extended by Winkler T.W. et al.

In the first step, MetaGWASToolKit automatically parses, harmonizes, and cleans the summary statistics of each individual GWAS. In the second step, the user inspects the summary plots for each individual GWAS, including Manhattan plots, QQ plots, Z-P plots, frequency plots, distributions of effect sizes, etc. In the third and fourth steps, the meta-analysis is prepared and subsequently executed. In the fifth step, the results of the meta-analysis can be inspected, as the filtered and annotated summary statistics and plots are created. Fixed-effects and random-effects models, as well as Z-score-based analyses, are executed by default. Heterogeneity among cohorts is quantified using the I² and Q statistics. When genome-wide significant hits are present, clumping is done automatically and regional association plots are generated.
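For reference, the two heterogeneity measures are directly related: with k cohorts, I² = 100% × max(0, (Q − df)/Q), where df = k − 1, so I² expresses the proportion of the variation in effect sizes across cohorts that is due to heterogeneity rather than chance.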

The necessary files for post-GWAS analyses are also produced, including those for Mendelian randomization through MR-Base and LD Score regression through LD Hub. Currently, meta-analyses using 1000G phase 1, 1000G phase 3, and HRC r1.1 as a reference are supported. Note that MetaGWASToolKit will accept multi-allelic variants coded as bi-allelic variants (each allele combination written on its own line/row); however, it adheres to strict rules: a variant is only valid when it can be precisely matched to the chosen reference. Variants that cannot be matched will be analyzed, but flagged. In principle it is possible to make it work for legacy references too, e.g. HapMap2; please raise an issue for support on this.
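For example, a tri-allelic variant at one position would enter the summary statistics as two bi-allelic rows, one allele combination per row (the column layout here is purely illustrative):

	CHR  POS     VariantID  EffectAllele  OtherAllele
	1    123456  rs1234     C             A
	1    123456  rs1234     T             A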

Future versions

Scripts to execute fine-mapping, create regional (Miami) plots to compare results across traits, and perform formal colocalization analyses, as well as PolarMorphism, will be added in future versions.


The MIT License (MIT)

Copyright (c) 2015-2022 Sander W. van der Laan | s.w.vanderlaan [at] gmail [dot] com | vanderlaanand.science.

metagwastoolkit's People

Contributors

moez-baksi, mvpuijk, swvanderlaan, xemz


metagwastoolkit's Issues

Meta-analysis: param-file generator

Add in a params-file generator (cohort name, lambda [after QC], average sample size, beta-correction factor) 🚧
If you have many cohorts (10+) or many (sub-)meta-analyses, creating the params-file by hand can be a nuisance. We should make a Perl/Python script (probably fastest) to print out these things in one go; a minimal sketch of the idea follows below.
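A minimal sketch of the idea in bash (the input file cohort.stats.txt is hypothetical, holding one line per cohort with name, lambda, average sample size, and beta-correction factor; the space-delimited params-file layout is an assumption for illustration):

	# Emit one params-file line per cohort from a pre-collected stats table.
	while read -r COHORT LAMBDA AVGN BETACOR; do
		printf "%s %s %s %s\n" "${COHORT}" "${LAMBDA}" "${AVGN}" "${BETACOR}"
	done < cohort.stats.txt > metagwastoolkit.params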

Clumps are 'cleared'

For some reason parseClumps.pl effectively empties the *.clumped files while reading them, so their contents are cleared. It is not really a problem, since the contents are parsed and printed to a new file, but ideally the original results should be kept in the input files (i.e. the *.clumped files).
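Until this is fixed, a minimal safeguard (a bash sketch; file locations are illustrative) is to back up the PLINK output before parsing:

	# Keep a pristine copy of each *.clumped file before parseClumps.pl touches it;
	# -n never overwrites an existing backup.
	for CLUMPFILE in *.clumped; do
		cp -n "${CLUMPFILE}" "${CLUMPFILE}.original"
	done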

Meta-analysis: add --verbose function

Add --verbose and other flags of METAGWAS.pl to metagwastoolkit.conf. Currently, as a default, the meta-analysis is done in --verbose mode, i.e. all relevant data of each cohort are added to the final meta-analysis output. This can be troublesome when tens or hundreds of GWAS datasets are analyzed. In the next version this behaviour can be changed by setting the appropriate flag in metagwastoolkit.conf; see the sketch below. Note: this script needs fixing. 🚧
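Something along these lines could go into metagwastoolkit.conf (a sketch only; METAVERBOSE is a hypothetical variable name, not an existing setting):

	# Hypothetical setting: pass --verbose to METAGWAS.pl to keep per-cohort
	# data in the final output; set to "" for compact output.
	METAVERBOSE="--verbose"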

Array jobs

Is your feature request related to a problem? Please describe.
Currently the main scripts chunk the files and submit jobs for each individual file. This increases the load on the server and eats into your fair-use quota.

Describe the solution you'd like
The individual submission scripts should be wrapped in an array job to reduce the load on the server; a minimal sketch is given at the end of this issue.

Describe alternatives you've considered
An alternative would be not to chunk the files, but this would be inefficient and significantly increase the time needed to run an analysis.

Additional context
None.
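A minimal sketch of the array-job approach, assuming the chunked files are listed one per line in a (hypothetical) chunk.list and handled by a (hypothetical) process_chunk.sh:

	#!/bin/bash
	#SBATCH --job-name=metagwastoolkit.chunks
	#SBATCH --output=chunk_%A_%a.log
	#SBATCH --array=1-100%25    # one task per chunk; at most 25 run at once
	#SBATCH --time=01:00:00
	#SBATCH --mem=8G
	# Each array task looks up its own chunk by task ID and processes it.
	CHUNK=$(sed -n "${SLURM_ARRAY_TASK_ID}p" chunk.list)
	bash process_chunk.sh "${CHUNK}"

One sbatch submission then covers all chunks, which schedulers typically count far more favorably against fair-use limits than hundreds of separate jobs.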

Stratified QQ plots with lambda per bin

Is your feature request related to a problem? Please describe.
To assess the trade-off between allele frequency and the imputation-quality (INFO) metric, it would be great to calculate the lambda per bin too, in addition to the number of variants.

Describe the solution you'd like
Add the lambda per bin, next to the number of SNPs per bin. In addition, increase the font size of the legend (specifically the diamonds).

Describe alternatives you've considered
No alternatives were considered, other than running QQ plots separately for pre-calculated bins.

Additional context
See the example script below.

## READS input options
rm(list=ls())
input=commandArgs()[7]
input=substr(input,2,nchar(input))

output=commandArgs()[8]
output=substr(output,2,nchar(output))

## Plot function: plots significant points individually, samples the rest ##
plotQQ <- function(z,color,cex){
	p <- 2*pnorm(-abs(z))
	p <- sort(p)
	expected <- c(1:length(p))
	lobs <- -(log10(p))
	lexp <- -(log10(expected / (length(expected)+1)))

	# plots all points with p < 1e-3
	p_sig = subset(p,p<0.001)
	points(lexp[1:length(p_sig)], lobs[1:length(p_sig)], pch=23, cex=.3, col=color, bg=color)

	# samples ~5000 points from p > 1e-3
	n=5001
	i <- c(length(p) - c(0,round(log(2:(n-1))/log(n)*length(p))),1)
	lobs_bottom=subset(lobs[i],lobs[i] <= 3)
	lexp_bottom=lexp[i[1:length(lobs_bottom)]]
	points(lexp_bottom, lobs_bottom, pch=23, cex=cex, col=color, bg=color)
}

## Plot function: plots all points ##
plotQQ2 <- function(z,color,cex){
	p <- 2*pnorm(-abs(z))
	p <- sort(p)
	expected <- c(1:length(p))
	lobs <- -(log10(p))
	lexp <- -(log10(expected / (length(expected)+1)))

	points(lexp[1:length(p)], lobs[1:length(p)], pch=23, cex=.3, col=color, bg=color)
}
## Reads data and bins variants by coded allele frequency (CAF)
S <- read.table(input,header=T)
z=qnorm(S$P/2)
z_lo00=subset(S, ( S$CAF > 0.99 | S$CAF < 0.01 ))
z_lo01=subset(S, ( S$CAF > 0.20 & S$CAF < 0.80 ))
z_lo02=subset(S, ( S$CAF < 0.20 & S$CAF > 0.05 ) | ( S$CAF > 0.80 & S$CAF < 0.95 ))
z_lo03=subset(S, ( S$CAF < 0.05 & S$CAF > 0.01 ) | ( S$CAF > 0.95 & S$CAF < 0.99 ))

z_lo0=qnorm(z_lo00$P/2)
z_lo1=qnorm(z_lo01$P/2)
z_lo2=qnorm(z_lo02$P/2)
z_lo3=qnorm(z_lo03$P/2)

## Calculates lambda (overall and per MAF bin)
lambda = round(median(z^2)/qchisq(0.5,df=1),3)
l0 = round(median(z_lo0^2)/qchisq(0.5,df=1),3)
l1 = round(median(z_lo1^2)/qchisq(0.5,df=1),3)
l2 = round(median(z_lo2^2)/qchisq(0.5,df=1),3)
l3 = round(median(z_lo3^2)/qchisq(0.5,df=1),3)

## Plots axes and null distribution
pdf(paste(output,"qqplot_maf.pdf",sep="."), width=6, height=6)
plot(c(0,8), c(0,8), col="red", lwd=3, type="l", xlab="Expected Distribution (-log10 of P value)", ylab="Observed Distribution (-log10 of P value)", xlim=c(0,8), ylim=c(0,8), las=1, xaxs="i", yaxs="i", bty="l",main=c(substitute(paste("QQ plot: ",lambda," = ", lam),list(lam = lambda)),expression()))

## Plots data per MAF bin
plotQQ(z,"black",0.4)
plotQQ(z_lo1,"olivedrab1",0.3)
plotQQ(z_lo2,"orange",0.3)
plotQQ(z_lo3,"lightskyblue",0.3)
plotQQ(z_lo0,"purple",0.3)

## Provides legend (per-bin lambdas; the commented version shows per-bin variant counts)
#legend(.25,8,legend=c("Expected (null)","Observed",
#paste("MAF > 0.20 [",length(z_lo1),"]"),
#paste("0.05 < MAF < 0.2 [",length(z_lo2),"]"),
#paste("0.01 < MAF < 0.05 [",length(z_lo3),"]"),
#paste("MAF < 0.01 [",length(z_lo0),"]")),
#pch=c((vector("numeric",6)+1)*23), cex=c((vector("numeric",6)+0.8)), pt.bg=c("red","black","olivedrab1","orange","lightskyblue","purple"))

legend(.25,8,legend=c("Expected (null)","Observed",
substitute(paste("MAF > 0.20 [", lambda," = ", lam, "]"),list(lam = l1)),expression(),
substitute(paste("0.05 < MAF < 0.20 [", lambda," = ", lam, "]"),list(lam = l2)),expression(),
substitute(paste("0.01 < MAF < 0.05 [", lambda," = ", lam, "]"),list(lam = l3)),expression(),
substitute(paste("MAF < 0.01 [", lambda," = ", lam, "]"),list(lam = l0)),expression()),
pch=c((vector("numeric",6)+1)*23), cex=c((vector("numeric",6)+0.8)), pt.bg=c("red","black","olivedrab1","orange","lightskyblue","purple"))

rm(z)
dev.off()
## (plotQQ and plotQQ2 as defined above)
## Bins variants by imputation quality (INFO); S is read above
z=qnorm(S$P/2)
z_lo01=subset(S, S$INFO > 0.75)
z_lo02=subset(S, ( S$INFO < 0.75 & S$INFO > 0.5 ) )
z_lo03=subset(S, ( S$INFO < 0.5 & S$INFO > 0.25 ) )
z_lo04=subset(S, ( S$INFO < 0.25 ) )

z_lo1=qnorm(z_lo01$P/2)
z_lo2=qnorm(z_lo02$P/2)
z_lo3=qnorm(z_lo03$P/2)
z_lo4=qnorm(z_lo04$P/2)

## Calculates lambda (overall and per INFO bin)
lambda = round(median(z^2)/qchisq(0.5,df=1),3)
l1 = round(median(z_lo1^2)/qchisq(0.5,df=1),3)
l2 = round(median(z_lo2^2)/qchisq(0.5,df=1),3)
l3 = round(median(z_lo3^2)/qchisq(0.5,df=1),3)
l4 = round(median(z_lo4^2)/qchisq(0.5,df=1),3)

## Plots axes and null distribution
pdf(paste(output,"qqplot_impq.pdf",sep="."), width=6, height=6)
plot(c(0,8), c(0,8), col="red", lwd=3, type="l", xlab="Expected Distribution (-log10 of P value)", ylab="Observed Distribution (-log10 of P value)", xlim=c(0,8), ylim=c(0,8), las=1, xaxs="i", yaxs="i", bty="l",main=c(substitute(paste("QQ plot: ",lambda," = ", lam),list(lam = lambda)),expression()))

## Plots data per INFO bin
plotQQ(z,"black",0.4)
plotQQ(z_lo1,"olivedrab",0.3)
plotQQ(z_lo2,"olivedrab1",0.3)
plotQQ(z_lo3,"orange",0.3)
plotQQ(z_lo4,"lightskyblue",0.3)

## Provides legend (per-bin lambdas; the commented version shows per-bin variant counts)
#legend(.25,8,legend=c("Expected (null)","Observed",
#paste("impq > 0.75 [",length(z_lo1),"]"),
#paste("0.5 < impq < 0.75 [",length(z_lo2),"]"),
#paste("0.25 < impq < 0.5 [",length(z_lo3),"]"),
#paste("impq < 0.25 [",length(z_lo4),"]")),
#pch=c((vector("numeric",6)+1)*23), cex=c((vector("numeric",6)+0.8)), pt.bg=c("red","black","olivedrab","olivedrab1","orange","lightskyblue"))
legend(.25,8,legend=c("Expected (null)","Observed",
substitute(paste("imp qual > 0.75 [", lambda," = ", lam, "]"),list(lam = l1)),expression(),
substitute(paste("0.5 < imp qual < 0.75 [", lambda," = ", lam, "]"),list(lam = l2)),expression(),
substitute(paste("0.25 < imp qual < 0.5 [", lambda," = ", lam, "]"),list(lam = l3)),expression(),
substitute(paste("imp qual < 0.25 [", lambda," = ", lam, "]"),list(lam = l4)),expression()),
pch=c((vector("numeric",6)+1)*23), cex=c((vector("numeric",6)+0.8)), pt.bg=c("red","black","olivedrab","olivedrab1","orange","lightskyblue"))

rm(z)
dev.off()
## Manhattan plot
#sig <- subset(S,S$P <= 1e-2)
#nonsig <- subset(S,S$P > 1e-2)
#sampled <- sample(seq(1,nrow(nonsig),1),500000, replace = FALSE, prob = NULL)
#nonsigout <- nonsig[sampled,]
#p <- rbind(sig,nonsigout)

p <- S

p$POS <- p$POS/100
offset <- 0
color="red"
pos <- c()
pos_odd <- c()
p_odd <- c()
pos_even <- c()
p_even <- c()
xAT <- c()
xE <- c(0)
xO <- c(0)
xEND <- c(0)

#pos <- subset(p$POS,p$CHR == 1)
cols <- rainbow(23)

# total plot width: sum of the maximum position per chromosome
maxX = 0
for (i in 1:23){
	pos_i <- subset(p$POS,p$CHR == i)
	maxX = maxX + max(pos_i)
}

#pdf(paste(output,"manhattan.pdf",sep="."), width=16, height=8)
png(paste(output,"manhattan.png",sep="."), width=1500, height=750)

# split at genome-wide significance (P = 5e-8): small squares below, large diamonds above
p1 <- subset(p, p$P >= 5e-8)
p2 <- subset(p, p$P < 5e-8)

for (i in 1:23){
	pos_i <- subset(p1$POS,p1$CHR == i)
	p_i <- subset(p1$P,p1$CHR == i)

	pos_j <- subset(p2$POS,p2$CHR == i)
	p_j <- subset(p2$P,p2$CHR == i)

	# alternate colors per chromosome; chromosome 1 also sets up the canvas
	if (i == 1){
		plot(pos_i,-log10(p_i), pch=15, cex=.5, col="#1E90FF",ylim=c(0,15),xlim=c(0,maxX),xlab="Chromosome",ylab="",main="Genome-wide results",axes=F)
		points(pos_j + offset,-log10(p_j), pch=18, cex=1, col="#1E90FF")
	} else if (i %% 2 == 0){
		points(pos_i + offset,-log10(p_i), pch=15, cex=.5, col="#104E8B")
		points(pos_j + offset,-log10(p_j), pch=18, cex=1, col="#104E8B")
	} else {
		points(pos_i + offset,-log10(p_i), pch=15, cex=.5, col="#1E90FF")
		points(pos_j + offset,-log10(p_j), pch=18, cex=1, col="#1E90FF")
	}

	if (color == "red"){
		color <- "blue"
		pos_odd <- c(pos_odd, pos_i + offset)
		p_odd <- c(p_odd, p_i)
		xO <- c(xO, (max(pos_i) + min(pos_i))/2 + offset)
	} else {
		color <- "red"
		pos_even <- c(pos_even, pos_i + offset)
		p_even <- c(p_even, p_i)
		xE <- c(xE, (max(pos_i) + min(pos_i))/2 + offset)
	}

	pos <- c(pos, pos_i + offset)
	xAT <- c(xAT, (max(pos_i) + min(pos_i))/2 + offset)
	xEND <- c(xEND, max(pos_i) + offset)

	offset <- max(pos)
}

# dotted line at genome-wide significance, -log10(5e-8) = 7.3
lines(c(min(pos), max(pos)), c(7.3,7.3), lty="dotted", lwd=1, col="black")
mtext("-log10 of P value",side=2, at=7.5, line=1)
for (i in 1:23){
	axis(1, at=xAT[i], labels=c(i), cex.axis=1.5, tick=FALSE)
}

axis(1, at=xEND, labels=rep("", length(xEND)), tick=TRUE, cex.axis=0.8)
axis(2, at=c(0,5,10,15), labels=c(0,5,10,15), pos=c(0,0), las=1)

dev.off()

Meta-analysis: total tallies over variants

In the summary of the meta-analysis (below) there is a total tally of variants per category. However, this seems to be a bit off (the error file reports many more variants that were not found in the reference) and therefore needs double-checking.

Example error-file output:

* chrX:86581529:A_G in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86775594:C_T in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86777100:C_A in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86780351:T_C in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86783054:G_T in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86783867:G_A in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86799180:G_T in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86855260:T_C in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86859763:T_C in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.
* chrX:86860584:C_A in [ DATA/MODEL1/META/TEMP/meta.all.unique.variants.reorder.split ] is not present in the Variant Annotation File  -- skipping it.

In this example 4,856 variants were skipped, but they are not reported in the summary below as uninformative or otherwise; a quick cross-check is sketched after the summary.

Example summary:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Summarizing this meta-analysis.

* Number of variants in meta-analysis       : 122019.
* Number of variants not in the Reference   : 0.
* Number of uninformative variants skipped  : 0.

          Study name     Allele flips     Sign [beta] flips     Informative variants
          ----------     ------------     -----------------     --------------------
          study1         0                0                     122019
          study2         0                0                     106835
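As a quick cross-check of these tallies, the skipped variants can simply be counted in the step's error log (the file shown above; the file name here is illustrative):

	# Count the "skipping it" messages in the annotation step's error log;
	# ideally this number would appear in the summary as skipped/uninformative.
	grep -c "skipping it" meta.annotation.errors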

Meta-analysis: automatic checking of each cohort

Add in automagical checking of each cohort after cleaning 🚧
Although this is typically something you'd want to check by hand, a basic reporting function covering general cohort statistics and whether certain steps were successful could be very useful. Again, going over each cohort manually when there are a lot of them is quite some work; a rough sketch is given below.
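A rough sketch of what such a report could look like (bash; the cleaned-file extension *.cdat.gz and the location are assumptions for illustration):

	# Hypothetical per-cohort report after cleaning: file name and variant count.
	for CLEANED in ${PROJECTDIR}/*.cdat.gz; do
		NVARIANTS=$(zcat "${CLEANED}" | tail -n +2 | wc -l)
		echo "$(basename "${CLEANED}"): ${NVARIANTS} variants after cleaning"
	done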

(Plotter) scripts that rely on generating data before queuing slurm command could be sped up

Any part of a script that first generates some data and then writes ".sh" files to queue with sbatch could instead generate that data inside the ".sh" files themselves.
For example:

	echo "- producing normal QQ-plots..." # P-value
	zcat ${PROJECTDIR}/${COHORTNAME}.${DATAEXT} | ${SCRIPTS}/parseTable.pl --col P | tail -n +2 > ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt

	printf "#!/bin/bash\nRscript ${SCRIPTS}/plotter.qq.R --projectdir ${PROJECTDIR} --resultfile ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt --outputdir ${PROJECTDIR} --stattype ${STATTYPE} --imageformat ${IMAGEFORMAT}" > ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
	## qsub -S /bin/bash -N ${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log -e ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors -l h_vmem=${QMEMPLOTTER} -l h_rt=${QRUNTIMEPLOTTER} -wd ${PROJECTDIR} ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
	QQ_ID=$(sbatch --parsable --job-name=${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log --error ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors --time=${QRUNTIMEPLOTTER} --mem=${QMEMPLOTTER} --mail-user=${QMAIL} --mail-type=${QMAILOPTIONS} --chdir=${PROJECTDIR}/ ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh)

Would become something like:

	echo "- producing normal QQ-plots..." # P-value
	printf "#!/bin/bash\nzcat ${PROJECTDIR}/${COHORTNAME}.${DATAEXT} | ${SCRIPTS}/parseTable.pl --col P | tail -n +2 > ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt\n"
	printf "Rscript ${SCRIPTS}/plotter.qq.R --projectdir ${PROJECTDIR} --resultfile ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt --outputdir ${PROJECTDIR} --stattype ${STATTYPE} --imageformat ${IMAGEFORMAT}" >> ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
	## qsub -S /bin/bash -N ${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log -e ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors -l h_vmem=${QMEMPLOTTER} -l h_rt=${QRUNTIMEPLOTTER} -wd ${PROJECTDIR} ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
	QQ_ID=$(sbatch --parsable --job-name=${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log --error ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors --time=${QRUNTIMEPLOTTER} --mem=${QMEMPLOTTER} --mail-user=${QMAIL} --mail-type=${QMAILOPTIONS} --chdir=${PROJECTDIR}/ ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh)

This should speed things up by making the generation of the necessary data part of the sbatch commands themselves, so it can run at the same time as other similar commands instead of stopping the script dead in its tracks until the data is generated.

WARNING: This approach will only work if the generated data is used by a single resulting sbatch command, like the QQ plot in "gwas.plotter.sh". If the generated data is used by multiple commands, like the Manhattan plots generated in "gwas.plotter.sh", then the data generation will need to stay separate (although it could still be turned into an sbatch command of its own and made a dependency of the Manhattan plots).

QQ-plot: maximize -log10(P)

Add an option to cap the maximum -log10(P): for instance, everything with p < 5.0e-10 is set to p = 5.0e-10, while the lambdas are still calculated from the original p-values.
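A pre-processing sketch of the capping idea (bash/awk; assumes the one-column P-value files produced for the QQ plots, and that lambda is computed upstream from the uncapped file):

	# Cap P at 5.0e-10 for plotting only; lambda should be computed
	# from the original, uncapped file.
	awk '{ p = $1 + 0; if (p < 5.0e-10) p = 5.0e-10; print p }' \
		cohort.QQ.txt > cohort.QQ.capped.txt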
