laninsky / gbs_snp_filter Goto Github PK

View Code? Open in Web Editor NEW

9.0 4.0 5.0 307 KB

filtering SNPs based on LD and HWE status

License: MIT License

R 97.13% Shell 2.87%

gbs_snp_filter's Introduction

GBS_SNP_filter v1.17

bug notice: GBS_SNP_filter is not curently filtering for missingness correctly. The current workaround is to use VCF tools to do this in a stand-alone step. Thanks to Didier for pointing out this issue.

Summary

After you get your ddRADseq/GBS variant dataset through your favourite pipeline, you might want to further filter the SNPs contained in the vcf file before doing downstream analysis. This set of scripts allows you to further filter to keep:

Bi-allelic variants
One SNP per locus
SNPs that have genotypes across most of the individuals (i.e. completeness)
Individuals that have genotypes across most of the SNPs (i.e. filtering out individuals with high levels of missing data)
SNPs in Hardy Weinberg Equilibrium (LWE)
Unlinked SNPs (based on LD)

It outputs vcf files at each filtering stage. A tutorial for running GBS_SNP_filter is available here.

Dependencies

GBS_SNP_filter requires the following R packages:

dplyr
readr
tibble
stringr

(if these are not previously installed, they will be installed when GBS_SNP_filter runs)

Other dependencies

vcftools
PLINK (but make sure this is not installed in your main folder with all your files)

Required inputs

This pipeline requires you have a vcf file (example.vcf) output from your favourite pipeline (e.g. ANGSD, ipyrad, stacks etc.), a file called popmap.txt which contains the population assignment code for each individual (popmap.txt), and a parameters file called GBS_SNP_filter.txt. These files are described below and examples are available in example_files.

Input vcf file

Header lines starting with "##" will be ignored. The script expects your first sample to be in column 10, and for the format to be GT:DP:Other_stuff (e.g. GT in the first position, followed by DP, and then other info that will be ignored). Sample names should not have the search term "_cov" in them as this will be used for filtering coverage within the script.

##fileformat=VCFv4.0
##fileDate=2018/05/09
##source=ipyrad_v.0.7.24
##reference=pseudo-reference (most common base at site)
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  B104599-GAAGCATCA       B104930-TGGCAGCA        B104957-GAACCGTAA       B109462-TTGTCAGAA       B109466-GAACCTGA
locus_2 3       .       G       A       13      PASS    NS=21;DP=179    GT:DP:CATG      1/1:9:0,9,0,0   0/0:6:0,0,0,6   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   0/1:12:0,9,0,3  ./.:0:
locus_20        22      .       C       T       13      PASS    NS=99;DP=1182   GT:DP:CATG      0/0:8:8,0,0,0   0/0:14:14,0,0,0 0/0:9:9,0,0,0   0/0:8:8,0,0,0   0/0:8:8,0,0,0   0/0:15:15,0,0,
locus_21        11      .       C       T       13      PASS    NS=8;DP=65      GT:DP:CATG      0/0:8:8,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0
locus_21        23      .       C       T       13      PASS    NS=8;DP=65      GT:DP:CATG      0/0:8:8,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0
locus_21        40      .       G       A       13      PASS    NS=8;DP=65      GT:DP:CATG      0/0:8:0,0,0,8   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0
locus_29        18      .       A       G       13      PASS    NS=113;DP=1538  GT:DP:CATG      1/1:8:0,0,0,8   0/0:14:0,14,0,0 0/0:16:0,16,0,0 1/0:12:0,6,0,6  0/0:10:0,10,0,0 0/0:16:0,16,0,
locus_29        44      .       T       C       13      PASS    NS=113;DP=1538  GT:DP:CATG      1/1:8:8,0,0,0   0/1:14:10,0,4,0 0/0:16:0,0,16,0 0/1:12:6,0,6,0  1/1:10:10,0,0,0 0/1:16:7,0,9,0
locus_32        28      .       G       A       13      PASS    NS=53;DP=569    GT:DP:CATG      1/1:8:0,8,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   0/0:12:0,0,0,12 1/1:10:0,10,0,
locus_33        38      .       C       T       13      PASS    NS=110;DP=1543  GT:DP:CATG      1/1:8:0,0,8,0   1/0:17:6,0,11,0 1/1:10:0,0,10,0 1/1:18:0,0,18,0 1/0:13:5,0,8,0  0/0:16:16,0,0,
locus_35        72      .       C       A       13      PASS    NS=53;DP=72     GT:DP:CATG      0/0:1:1,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0
locus_35        74      .       G       A       13      PASS    NS=47;DP=56     GT:DP:CATG      0/0:1:0,0,0,1   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0   ./.:0:0,0,0,0

Depending on how you generated your vcf file, you might have different ideas of what a "locus" is e.g. in the file above, a de novo assembled RAD/GBS dataset, the locus identifier is in the #CHROM column, and the position of the SNP within that locus is given in the POS column. In the following reference-assembled file:

##fileformat=VCFv4.2
##fileDate=20180310
##source="Stacks v1.48"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
##INFO=<ID=locori,Number=1,Type=Character,Description="Orientation the corresponding Stacks locus aligns in">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GFO-1   GFO-10  GFO-11  GFO-12  GFO-14  GFO-15  GFO-18  GFO-19  GFO-2
NC_001606.1     14019   223111_6        A       G       .       PASS    NS=17;AF=0.059;locori=p GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     109273  376732_24       A       G       .       PASS    NS=47;AF=0.074;locori=m GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     113922  416173_45       G       C       .       PASS    NS=10;AF=0.050;locori=p GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     114000  416174_43       T       G       .       PASS    NS=10;AF=0.050;locori=m GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     114021  416174_22       C       G       .       PASS    NS=10;AF=0.050;locori=m GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     114037  416174_6        G       A       .       PASS    NS=10;AF=0.050;locori=m GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     121273  468073_5        T       A       .       PASS    NS=2;AF=0.250;locori=p  GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     123294  408724_12       A       C       .       PASS    NS=11;AF=0.091;locori=m GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     133307  88175_45        T       C       .       PASS    NS=7;AF=0.071;locori=p  GT:DP:AD        ./.:0:.,.       0/0:3:3,0
NC_031697.1     141382  223355_61       A       T       .       PASS    NS=27;AF=0.093;locori=p GT:DP:AD        ./.:0:.,.       ./.:0:.,.
NC_031697.1     147737  88272_58        C       A       .       PASS    NS=12;AF=0.083;locori=m GT:DP:AD        ./.:0:.,.       0/0:3:3,0
NC_031697.1     147748  88272_47        C       A       .       PASS    NS=8;AF=0.125;locori=m  GT:DP:AD        ./.:0:.,.       0/0:3:3,0

The scaffold the locus was assembled against is given in the #CHROM column, while the locus name is given in the ID column separated by the position within that locus by an underscore (e.g. LocusID_SNPLocation). For the GBS_SNP_filter.txt file below, you'll need to know what column you want the locus ID based on, and if there is a regular expression needed to remove everything except the locus ID from the contents of that column e.g. for the example (LocusID_SNPLocation) above _.*. The locus name can have colons, but anything before the colon needs to be unique because everything following the colon will be stripped away following the LD step (you'll also need to add :.* to the 8th line of the GBS_SNP_filter.txt or you'll get an error about not enough lines.

popmap.txt

Sample name (exactly matching that in the vcf file) in the left hand column, separated by white space from population name in the right column

B104599-GAAGCATCA Kai
B104930-TGGCAGCA  Kai
B104957-GAACCGTAA Kai
B109462-TTGTCAGAA Kai
B109466-GAACCTGA  Nuku
B109467-AAGAGTCG  Nuku
B109469-AACATCGCA Nuku
B109470-CGTGGTGCA Nuku

GBS_SNP_filter.txt

On the first line you need to give the name of your original vcf file that will be processed. On the second line you should give the proportion of samples that has to be equalled or exceeded for a SNP to be retained in the dataset (e.g 0.85 = a SNP needs to be found in 85% or more of total samples to be retained). On the third line you need to give the proportion of missing loci that will be tolerated for individual samples before they will be removed from the dataset (e.g. 0.9 = a sample can have up to 90% missing data before it is removed from the dataset). On the fourth line you need to give the p-value cut-off for determining whether a locus is out of HWE within a population. On the fifth line you need to give the r^2 cut-off for determining whether SNPs are in LD with each other within a population. On the sixth line you need to give the number of populations a locus has to be out of HWE/in LD across in before that locus is discarded. On the seventh line you need to give the column header for where the locus id you wish to use resides. On the eighth line you need to give any regular expression needed to modify the contents of the column specified on Line 7 to give just the locus name.

e.g. for the first vcf file presented above (make sure to leave a blank line on Line 8 if no regex pattern required)

robin.vcf
0.85
0.9
0.05
0.5
3
#CHROM

e.g. for the reference-guided vcf file presented above

batch_1.vcf
0.65
0.9
0.01
0.5
3
ID
_.*

To run GBS_SNP_filter

Make sure GBS_SNP_filter.sh; GBS_SNP_filter_HWE.R; GBS_SNP_filter_rsq.R; GBS_SNP_filter_chrom_modifier.R, your input vcf file, your popmap.txt, and your GBS_SNP_filter.txt files are located in your working directory. Also make sure you have previously installed the R packages dplyr, readr, tibble and stringr, and have R, vcftools, and PLINK in your path (but not in the directory you are running GBS_SNP_filter.sh in, or it will be deleted). Then on the terminal:

bash GBS_SNP_filter.sh

If you are submitting through a slurm system, you might need to preface the bash command with srun within your sbatch script e.g.

srun bash GBS_SNP_filter.sh

Memory requirements: You should allow for approximately seven times the size of your original *.vcf file in RAM.

Detailed workflow

This bash/Rscript pipeline first filters for bi-allelic SNPs (and writes out *.biallelic.vcf), then filters for one SNP/locus (prioritizing the SNP site found in the most individuals. If this is a tie, then the SNP with the highest average coverage is retained. If this is a tie, then GBS_SNP_filter randomly selects a SNP and writes this to *.oneSNP.vcf). Following this, SNPs are filtered for completeness (according to the parameters you set in GBS_SNP_filter.txt. A vcf file with the format *.X_Y.vcf will be written out, where X = the proportion of completeness for loci, Y = the proportion of missing data allowed per individual), then for HWE (*.X_Y.Z_P.HWE.vcf, where Z = the p-value threshold used as a cut-off to suggest a locus is in HWD, and P = the threshold for the number of populations where HWD/LD could occur before that locus was tossed out), and finally for LD (*.X_Y.Z_P.HWE.Q.ld.vcf, where Q = the Rsq threshold used to chuck out one out of a pair of loci in LD across P populations).

In addition to the filtered vcf files from each step, there will be a number of other files written out: *.log (contains the number of SNPs/samples across each *.vcf file, and which samples/loci were binned during any of the steps); *.X_Y.Z_P.HWE (contains the HWE p-values by row for each locus/by column for each population. This will only include loci that passed the completeness filter); *.X_Y.Z_P.HWE.*.pop.vcf (population specific vcf files used for the ld steps. These include only loci that passed the completeness and HWE filters); *.X_Y.Z_P.HWE.*.pop.ld (per population files containing pairs of loci in LD as defined by having a RSq >= Q); *.X_Y.Z_P.HWE.*.pop.log (log files from PLINK identifying the pairs of loci in LD); and *.X_Y.Z_P.HWE.Q.rsq (contains loci removed due to LD, and the locus retained that they were in linkage with).

Troubleshooting

If you find that too many SNPs are being discarded based on the SNP completeness filter (e.g. being found in >= 85% of the samples), it may be that you have had a larger-than-expected number of samples fail. I would suggest changing the second line of GBS_SNP_filter.txt to 0.0 and to therefore not filter SNPs based on completeness the first time around. Following filtering of the datasets for samples with high levels of missing data, you could then take the output vcf and run it through another round of filtering, bumping this second line back up to a more stringent value (e.g. 0.85)

The following warning is safe to ignore:

Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(origcolnumber)` instead of `origcolnumber` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

In the LD step, the following message is also safe to ignore (introduced in v 1.11 with the switch to read_table2 - see version history below)

Warning message:
Missing column names filled in: 'X8' [8]

If you get the following error, double check your vcf is formatted to have GT:DP:Other_stuff in the FORMAT column (instead of GT:AD:DP, for example).

Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
length of 'dimnames' [2] not equal to array extent
Calls: unlist -> lapply -> FUN -> which -> Ops.data.frame -> matrix
Execution halted

Make sure plink is not installed in the directory you are running GBS_SNP_filter.sh in, because one of the steps is cleaning up all the files that plink makes (rm -rf *plink*). If plink is located in your actual directory, it will be removed too!

Further utilities

In the utilities folder are scripts for divvying out your "final" vcf into population-specific vcf files, and using reshape, vcfR, hierfstat and inbreedR to calculate individual-level (multi-locus heterozygosity) and population-level (He) estimates of heterozygosity.

Suggested citation

This code was first published in: Robin TBD

If you could cite the pub, and the scripts as below, that would be lovely:

Alexander, A. 2020. GBS_SNP_filter v1.x.x. Available from https://github.com/laninsky/GBS_SNP_filter

This pipeline also wouldn't be possible without the programs/packages listed below. Please cite them as well:

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Hadley Wickham (2018). stringr: Simple, Consistent Wrappers for Common String Operations. R package version X.X.X. https://CRAN.R-project.org/package=stringr

Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (2017). dplyr: A Grammar of Data Manipulation. R package version X.X.X. https://CRAN.R-project.org/package=dplyr

Hadley Wickham, Jim Hester and Romain Francois (2017). readr: Read Rectangular Text Data. R package version X.X.X. https://CRAN.R-project.org/package=readr

Kirill Müller and Hadley Wickham (2018). tibble: Simple Data Frames. R package version 1.4.2. https://CRAN.R-project.org/package=tibble

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007). PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.

Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group. 2011. The Variant Call Format and VCFtools. Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btr330

Knaus BJ and Grünwald NJ (2017). “VCFR: a package to manipulate and visualize variant call format data in R.” Molecular Ecology Resources, 17(1), pp. 44–53. ISSN 757, http://dx.doi.org/10.1111/1755-0998.12549.

Stoffel, M. A., Esser, M., Kardos, M., Humble, E., Nichols, H., David, P., & Hoffman, J. I. (2016). inbreedR: An R package for the analysis of inbreeding based on genetic markers. Methods in Ecology and Evolution.

H. Wickham. Reshaping data with the reshape package. Journal of Statistical Software, 21(12), 2007.

Goudet, J., 2005. Hierfstat, a package for R to compute and test hierarchical F‐statistics. Molecular Ecology Notes, 5(1), pp.184-186.

Version history

1.17: It appears that the plink files that VCFtools outputs will use #CHROM:POS as an identifier for the SNP_A and SNP_B columns, and POS as an identifier for the BP_A and BP_B columns, if ID is not present in the vcf file. In contrast, if ID is present in the vcf file, it will be used for the SNP_A and SNP_B columns, with POS again used for the BP_A and BP_B columns. The code in GBS_SNP_filter_rsq.R has been updated to reflect this. If this had affected you, an error would have killed GBS_SNP_filter.sh, and/or potentially SNPs in LD may not have been removed.

1.16: Updated the code to remove one of the warnings detailed at #10 . Also made a note in the readme that the other warning can be safely ignored. This will not have affected the script running correctly.

1.15: Removed a check for a previous row being NA in the while (j <= SNP_length) loop of GBS_SNP_filter_rsq.R. Thanks to OmidJa for logging this issue. If you were affected, the pipeline would have failed with:
Error in if (is.na(SNP_record[(j - 1), 8])) { : argument is of length zero Execution halted

1.14: Fixed bugs introduced in 1.13 that lead to the final GBS_SNP_filter_rsq.R script failing if no regex pattern was provided. If your analysis was affected by this, it would have failed with an error message. Also added some code to only try to remove 'temp' if it is actually present. Finally, a bug was present (now fixed) that could cause the final sample in your vcf file to be silently dropped if not starting the analysis from scratch, sorry!

1.13: Addressed a number of little bugs that led to #5. Thanks to OmidJa for logging this issue. If you had been affected by this bug, the code would have failed and/or you might have had an extra "coverage" column appear in your final vcf.

1.12: Reformatted the readme and a few of the associated repo files based on a pull request from ldutoit. Thanks Ludo!

1.11: Addressed #3 which actually resulted from two different things: I had no test for whether the GBS_SNP_filter.txt file that the user provides actually had enough lines, and also the code couldn't handle a null genotype where an explicit coverage of 0 was not given. Thanks to ldutoit for raising this. If your code ran through fine without R warnings, then you were not affected by these bugs!

1.10: Allowed the column that locus names would be based on (for identifying multiple SNPs/locus etc) to be changed. This allows for reference-guided RADseq assemblies to retain multiple SNPs per scaffold (not possible if the #CHROM column is always used - this ditches everything but one SNP/scaffold). Thanks to OmidJa for this suggestion. Also modified the name of the *.rsq file as it was printed out to the log.

1.04: Modified GBS_SNP_filter_rsq.R in response to #1: I had been assuming that at least one population would end being dropped due to individual samples failing!

1.0.3: Fixed a bug introduced in 1.0.2 that was not coping with writing out population specific vcfs for calculating LD if any of the populations were completely missing populations. Added the scripts in the utilities folder.

1.0.2: Fix for bug that meant the first sample in the original vcf appeared in all population-specific vcfs used for calculating HWE and rsq. Modified Rscripts so dplyr, readr and stringr packages were installed if they weren't already when first called.

1.0.1: Fix for bug that arose when all samples in a population were removed due to the missing filter. Minor changes to README.md.

1.0.0: Tested on a single data set so far, and working. If any issues on your data set, please log an issue. Scripts are confirmed to work with these versions of software/libraries: VCFtools/0.1.14-gimkl-2017a-Perl-5.24.1, PLINK/1.09b3.32, R/3.5.0-gimkl-2017a, stringr v 1.3.1, readr v 1.1.1, and dplyr v 0.7.4

gbs_snp_filter's People

Contributors

Stargazers

Watchers

Forkers

ldutoit davidswang marionamujal sethmusker roseannagg

gbs_snp_filter's Issues

ld performance in the package

Dear Alana,
Hope you are fine.
Actually I just have a question about the ld option in your GBS_SNP_filter package. If two loci are in ld, then both of them will be removed or one of them will be retained?

Thank you for all your supports.

Regards,
Omid

Error in generating .HWE.vcf file

Dear Alana,
hope you are doing well.
This is Omid Jafari. Previously I was working with stacks ver.1 pipeline and you helped me with solving my error in using your package by updating it. But now I have generated a vcf file from stacks ver 2 which in column ID there is some changes and I think the error backs to that. It should be mentioned that my pipeline was genome reference-based.

Error: 'populations.0.65_0.9.0.01_3.HWE.vcf' does not exist in current working directory ('/home/omid/Khazar-vcf-mehrshad/hwe').
Execution halted

In the GBS_SNP_filter.txt file, at the last lane I changed _.* to :.* and it gets a bit running but then again I face with error.

There were 50 or more warnings (use warnings() to see the first 50)
[1] "Up to 1 out of 5 populations"
Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn,  :
  length of 'dimnames' [2] not equal to array extent
Calls: unlist -> lapply -> FUN -> which -> Ops.data.frame -> matrix
In addition: Warning message:
Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `enframe(name = NULL)` instead.
This warning is displayed once per session.
Execution halted
ls: cannot access *pop.vcf: No such file or directory
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: readr
Loading required package: stringr
Loading required package: tibble
Parsed with column specification:
cols(
  X1 = col_character()
)
Error: 'populations.recode.0.65_0.9.0.01_3.HWE.vcf' does not exist in current working directory ('/home/omid/Khazar-vcf-mehrshad/hwe').
Execution halted

I'll share my original vcf file, popmap.txt and GBS_SNP_filter.txt and will be so grateful if you can help me to pass over this error.

I should add the point that I think the the error backs to my vcf file, because when apply for using the package on some other .vcf files (generated from stacks 2) with the change in last line code of GBS_SNP_filter.txt file (as mentioned above) it works fluently, so I think there is some thing wrong in that vcf file!!

Regards,
Omid

Address some deprecating in code

Warning message:
funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(origcolnumber)` instead of `origcolnumber` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

Add some guidance on what vcf to use

Tweak the readme to point out what vcf is the one to use, because people are not necessarily going to scroll through the novel that is the readme :P

Check the formatting of the vcf

Should write a check to make sure the vcf file is in the GT:DP:other_stuff format

Error in -removepops

Hi,
This is Omid and I am working on GBS data for a fish species and the vcf file is generated from stacks pipeline. I am using your package for filtering and I think it works well until the time of LD filtering. I should add that I have the .ld files per each pop but for the final file in .vcf format I have nothing. Here is the error which I faced and I will be so grateful if you can help me to solve this error.

Error in -removepops : invalid argument to unary operator 
Execution halted

Regards,
Omid

unknown issue

Hi laninsky,

thank you very much for your sweet package!

I am running into an issue

Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: readr
Loading required package: stringr
Loading required package: tibble
Parsed with column specification:
cols(
  .default = col_character(),
  `1864` = col_double(),
  `15` = col_double()
)
See spec(...) for full column specifications.
Error in stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
  object 'ALT' not found
Calls: %>% ... mutate.tbl_df -> mutate_impl -> str_count -> stri_count_regex
In addition: Warning message:
Duplicated column names deduplicated: '.' => '._1' [6], '0/0:25:25,0:40:-0.00,-9.18,-124.38' => '0/0:25:25,0:40:-0.00,-9.18,-124.38_1' [17], '0/0:18:18,0:40:-0.00,-7.07,-90.18' => '0/0:18:18,0:40:-0.00,-7.07,-90.18_1' [24], '0/0:23:23,0:40:-0.00,-8.57,-114.61' => '0/0:23:23,0:40:-0.00,-8.57,-114.61_1' [26], '0/0:40:40,0:40:-0.00,-13.69,-197.66' => '0/0:40:40,0:40:-0.00,-13.69,-197.66_1' [30], '0/0:28:28,0:40:-0.00,-10.08,-139.03' => '0/0:28:28,0:40:-0.00,-10.08,-139.03_1' [31], '0/0:30:30,0:40:-0.00,-10.68,-148.81' => '0/0:30:30,0:40:-0.00,-10.68,-148.81_1' [32], '0/0:15:15,0:40:-0.00,-6.17,-75.53' => '0/0:15:15,0:40:-0.00,-6.17,-75.53_1' [34], '0/0:35:35,0:40:-0.00,-12.19,-173.23' => '0/0:35:35,0:40:-0.00,-12.19,-173.23_1' [36], '0/0:23:23,0:40:-0.00,-8.57,-114.61' => '0/0:23:23,0:40:-0.00,-8.57,-114.61_2' [37], '0/0:14:14,0:40:-0.00,-5.86,-70.64' => '0/0:14:14,0:40:-0.00,-5.86,-70.64_1' [39], '0/0:18:18,0:40:-0.00,-7.07,-90.18' => '0/0:18:18,0:40:-0.00,-7.07,-90.18_2' [43], '0/0:19:1 [... truncated]
Execution halted
ls: cannot access *pop.vcf: No such file or directory
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: readr
Loading required package: stringr
Loading required package: tibble
Parsed with column specification:
cols(
  X1 = col_character()
)

grateful for any help!

My parameters file:

cat GBS_SNP_filter.txt


test.vcf
0.0
1
0.05
0.5
1

file attached ( test.vcf but called test.vcf.txt for upload authorisations
test.vcf.txt
)!

Error in generating the .ld.vcf file (the final step)

Dear Alana,
Hope you are fine.
Again this is Omid and sorry because of a new error. Actually I am working on the same samples as previous but today I decided to generate a vcf file including two populations and when I decided to use your package for filtering I faced with the following error. I will be so grateful if you can help to overcome this error.

Parsed with column specification:
cols(
  CHR_A = col_character(),
  BP_A = col_double(),
  SNP_A = col_character(),
  CHR_B = col_character(),
  BP_B = col_double(),
  SNP_B = col_character(),
  R2 = col_double(),
  X8 = col_logical()
)
Parsed with column specification:
cols(
  CHR_A = col_character(),
  BP_A = col_double(),
  SNP_A = col_character(),
  CHR_B = col_character(),
  BP_B = col_double(),
  SNP_B = col_character(),
  R2 = col_double(),
  X8 = col_logical()
)
Warning messages:
1: Missing column names filled in: 'X8' [8]
2: Missing column names filled in: 'X8' [8]
[1] "Up to 2 out of 13 pairwise LD comparisons"
Error in if (is.na(SNP_record[(j - 1), 8])) { :
  argument is of length zero
Execution halted

I also will send you my files through Dropbox.

Thank you very much in advance.

Regards,

Omid

Failing after PLINK for stickleback test data

Warning messages:
1: Missing column names filled in: 'X8' [8] 
2: Missing column names filled in: 'X8' [8] 
3: Missing column names filled in: 'X8' [8] 
Error in gsub(parameters[8, 1], "", as.matrix(temp %>% select(!!parameters[7,  : 
  incorrect number of dimensions
Execution halted

Suspect it is something to do with no regex pattern being present for this dataset - drill into it more tomorrow.

locus ID regex pattern when column has colons

I'm putting togther the GBS_SNP_filter.txt file and am a bit confused about line 8: the locus ID regex pattern. I have a reference-aligned vcf file output from Stacks v2. The #CHROM column is the scaffold the locus was assembled against, followed by POS and ID columns. The locus ID is in the 'ID' column but also has additional information, separated by colons. For example: (columns #CHROM, POS, ID) - ENA|CAJHIB020000001|CAJHIB020000001.1 13970 3:73:+.

From the ReadMe, it looks like I need to use a regex pattern because of the formatting of the locus ID column. I'm a bit confused about what that would look like in my case and if the colons in the ID column will be a problem, given what's mentioned in the ReadMe ("the locus name should not have a colon in it, because everything following the colon will be stripped away following the LD step").

I would appreciate any guidance on this!

Error using HWE.R

Hello,
I have come across an error at the HWE step. The error message reads,
"Error in names(hwetablebin) <- c("snp", popnames) :
'names' attribute [9] must be the same length as the vector [1]
Execution halted
ls: cannot access *pop.vcf: No such file or directory"

I do get the HWE table in the outfiles, but the HWE.vcf does not get written.
I would appreciate any suggestions.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.