Giter Club home page Giter Club logo

asdpex's Introduction

ASDPex

Introduction

The application ASDPex and the scripts in this repository can be used to improve the alignments of the alternate loci provided by GRC, to search for alignable scaffold-discrepant positions (ASDPs) in the alignments, and to use the resulting ASDP file to screen sample VCF files from whole-genome sequencing for ASDPs (which are likely to be false-positive variant calls).

See Jäger et al. (2016) Alternate-locus aware variant calling in whole genome sequencing. Genome Medicine 8:130

Prerequisites:

In addition to Java 8, you will need to install the tabix, bedtools, and samtools packages. If you are on a debian-based system, enter

sudo apt-get install tabix
sudo apt-get install bedtools
sudo apt-get install samtools

Alternatively you can install these tools using conda:

conda install tabix
conda install bedtools
conda install samtools

This can also be done in a separate environment, which then has to be loaded before running ASDPex.

Getting started

git clone https://github.com/charite/asdpex.git
cd asdpex

Download data

cd to the 'scripts' directory, make the "downloadData.sh" executable, and execute the following command to download all the data we will need.

cd scripts
chmod +x downloadData.sh
./downloadData.sh all data
cd ..

After you have downloaded the data, you will need to index the genome using samtools (the script will produce a message with the command you need, and if samtools is not in your path adjust the command accordingly).

Build the executables

regionalign2vcf

The aligner was written using the SeqAn C++ library. Therfore the library has to be downloaded and the tool compiled. We have to change to the seqan folder and run the Makefile.

cd seqan
make
cd ..

This command should result in an executable programm called regionalign2vcf, which is later on needed.

asdpex

We use the maven build system to compile the code. First cd back to the main folder.

mvn package

If everything goes well, you will see a message including the words BUILD SUCCESS.

The jar file asdpex-cli-0.3.jar contains the main code used in this project. An overview of available command are shown with the following command.

java -jar asdpex-cli/target/asdpex-cli-0.3.jar
Program: de.charite.compbio.asdpex (functional annotation of VCF files)
Version: 0.3
Contact: Marten Jäger <[email protected]>
         Peter N. Robinson <[email protected]>

Usage: java -jar asdpex.jar <command> [options]

Command: align       construct fasta and seed files and do the alignments
         annotate    functional annotation of VCF files
         create-db   creates a SQLite database used for this tool
         create-fa   construct fasta files for the alignments
         create-seed construct seed files for the alignments from the NCBI alignments

Example: java -jar asdpex.jar create-db -s asdpex.sqlite -d data
         java -jar asdpex.jar create-fa -o data


Create database and init

Create the SQLite database and inititate with the downloaded data.

java -jar asdpex-cli/target/asdpex-cli-0.3.jar create-db -s asdpex.sqlite -d data

Alignment and Identification of ASDPs

The following command will perform the alignment of the alternate loci with the corresponding regions of the primary assembly and will store the discrepant alignment positions for each of the alternate loci in one individual VCF file (e.g., chr5_KI270897v1_alt.vcf). The files will be written to the (new) directory "alignresults" (as indicated by the -o flag). The -q flag indicates the SQlite database that we created in the previous step.

java -jar asdpex-cli/target/asdpex-cli-0.3.jar \
  align -d data/ -s seqan/regionalign2vcf -o alignresults -q asdpex.sqlite

There should now be 261 separate VCF files in the alignresults directory. We merge these VCF files into a single file allASDPs.vcf.gz and filter for single nucleotide variants (SNVs). This and the following scripts assume that BGZIP and TABIX are defined as environment variables. If this is not the case in your system, you will need to modify the scripts accordingly (or set the environment variables).

./scripts/mergeVCFs.sh alignresults/ allASDPs.vcf

The file allASDPs.vcf that is created in this step should contain 768316 ASDP candidates with differences less than 50 nt. However, as mentioned in the manuscript, our procedure applies a additional criteria for the goodness of the alignment in regions surrounding discrepant positions on the basis of alignment windows of 50 nt that were allowed to contain up to 1 mismatch per 5 bases (1:5). This is implemented by the following Perl script.

perl scripts/filterBadASDPs.pl -i allASDPs.SNV.vcf.gz -w 50 -m 10 -c

This command will iterate aver all SNV ASDPs and merge them into MNV ASDPs. The above uses a window of 50 bases (-w 50), allowing up to 10 ASDPs (-m 10). O therwise it will mark and remove the ASDPs in the window. The ASDPs that survive this filtering step are stored in the file allASDPs.SNV.50_10.valid.vcf.gz.

Upload the variants into the SQLite database to store all needed information together in a more portable way.

java -jar asdpex-cli/target/asdpex-cli-0.3.jar \
create-db -s asdpex.sqlite -a allASDPs.SNV.50_10.valid.vcf.gz

We have copied the files allASDPs.SNV.50_10.valid.vcf.gz and allASDPs.SNV.50_10.valid.vcf.gz.tbi into the directory vcf in this repository. These are the files that are created by the code described above and that were used for the analysis described in the manuscript.

##Postprocess VCF files from Whole-Genome Sequencing (WGS) As described in the main manuscript, we can now use the ASDP file generated above (allASDPs.SNV.50_10.valid.vcf.gz) to postprocess a WGS file to mark up called variants that correspond to ASPDs. The following command performs this analysis and outputs a file (.vcf.gz) that contains the annotations.

java -jar asdpex-cli/target/asdpex-cli-0.3.jar \
  annotate -a allASDPs.SNV.50_10.valid.vcf.gz -d data/ -v <infile>.vcf.gz \
  -o <annot>.vcf.gz

##Example Here, we will take GRCh38 high-confidence calls for NA12878 to show how to use the ASDPex program on real data. Download the VCF and the index (tbi) files from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.1/GRCh38/. Then (assuming we are in the same directory as in the above example), enter the following command:

$ java -jar asdpex-cli/target/asdpex-cli-0.3.jar annotate -a allASDPs.SNV.50_10.valid.vcf.gz\
   -d data/ -v HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.1_highconf_phased.vcf.gz \
   -o HG001-annot.vcf.gz

This will output an annotated (compressed) VCF file called HG001-annot.vcf.gz. The lines of the VCF file will be modified to indicate the presence of ASDPs inferred by ASDPex. For example,

chr22	42252402	.	C	T	50	ASDP	ALTGENOTYPE=HET;ALTLOCUS=chr22_GL383582v2_alt; ...

This line states that the variant called on chromosome 22 at position 42252402 was inferred to be a heterozygous ASDP-associated variant related to the alternate locus GL383582v2. Similarly,

chr18	43747118	.	A	T	50	ASDP	ALTGENOTYPE=HOM_VAR;ALTLOCUS=chr18_KI270864v1_alt;...

This line states that the variant called on chromosome 18 at position 43747118 was inferred to be a homozygous ASDP-associated variant related to the alternate locus KI270864v1. In total, 2653 ASDP-associated variants are called (homozygous: 441; heterozygous: 2212). The following table shows the number of ASDP-associated variants called for 32 regions.

Alternate Locus ASDPs (n)
chr4_GL000257v2_alt 470
chr8_KI270822v1_alt 382
chr8_KI270810v1_alt 184
chr7_KI270808v1_alt 157
chr4_KI270788v1_alt 116
chr11_KI270832v1_alt 114
chr9_GL383542v1_alt 89
chr6_KI270800v1_alt 87
chr1_KI270760v1_alt 79
chr12_GL383550v2_alt 78
chr11_JH159136v1_alt 71
chr4_KI270790v1_alt 70
chr5_GL383531v1_alt 69
chr14_KI270844v1_alt 69
chr3_KI270781v1_alt 64
chr18_KI270864v1_alt 63
chr2_GL383521v1_alt 59
chr22_GL383582v2_alt 48
chr4_GL383527v1_alt 46
chr3_KI270777v1_alt 44
chr2_GL383522v1_alt 42
chr15_GL383555v2_alt 35
chr13_KI270839v1_alt 34
chr1_KI270763v1_alt 30
chr13_KI270843v1_alt 27
chr22_KI270876v1_alt 26
chr16_GL383557v1_alt 23
chr6_KI270802v1_alt 23
chr21_GL383580v2_alt 17
chr6_GL383533v1_alt 16
chr18_GL383571v1_alt 15
chr6_KI270801v1_alt 6

asdpex's People

Contributors

martenj avatar pnrobinson avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

snashraf anykine

asdpex's Issues

download data- no genome

I followed the instructions on the README.
I executed downloadData.sh, which threw no error. All subsequent steps were also successful until I wanted to start alignment. The index file for the genome was missing. I failed to index it with samtools. After checking the file (the genome), I saw it was empty.
I checked the script downloadData.sh for the genome download link and found this:

url38="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz"

Trying to execute:
wget --progress=bar -O test.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
Gave me error:

No such directory ‘genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids’.

After checking the ftp page I found this to be the correct link, with which the download of the genome worked and subsequently also the indexing with samtools and the rest of the following steps.

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/

[ERROR] The FastA is not index - please index file first and run again.

java -jar asdpex-cli/target/asdpex-cli-0.2.jar align -d scripts/data/ -s seqan/regionalign2vcf -o alignresults -q asdpex.sqlite
[ERROR] The FastA is not index - please index file first and run again.

Even though I have indexed the files but still getting same error .

These are files that I already have so I have indexed it already.

[nsyed@hpclogin1 asdpex]$ ls scripts/data/genome/GRCh38.*
scripts/data/genome/GRCh38.fa scripts/data/genome/GRCh38.fa.gz scripts/data/genome/GRCh38.fa.gz.fai scripts/data/genome/GRCh38.fa.gz.gzi

Thanks
Najeeb

missing genomic_regions_definitions.txt

I'm an admin with no omics background attempting to install this package on behalf of one of my users.

I'm following the instructions in the README.md. I'm able to download the data (though I also ran into the issue in #12 and had to obtain the data manually). I'm also able to index the data. But when I try to build and initialize the database I encounter the following:

$ java -jar asdpex-cli/target/asdpex-cli-0.2.jar create-db -s asdpex.sqlite -d data
[INFO] Create database
[INFO] Opened database successfully
Table created successfully
Table created successfully
Table created successfully
[INFO] Updated alt scaffold accessions
[INFO] Updated chromosome accessions
[INFO] Updated regions
java.io.FileNotFoundException: data/genomic_regions_definitions.txt (No such file or directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:213)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:155)
        at de.charite.compbio.asdpex.util.IOUtil.getBufferedReaderFromFileName(IOUtil.java:76)
        at de.charite.compbio.asdpex.io.parser.RegionInfoParser.parse(RegionInfoParser.java:58)
        at de.charite.compbio.asdpex.cmd.CreateDatabaseCommand.createAlternativeScaffoldTables(CreateDatabaseCommand.java:112)
        at de.charite.compbio.asdpex.cmd.CreateDatabaseCommand.run(CreateDatabaseCommand.java:72)
        at de.charite.compbio.asdpex.App.main(App.java:58)
[INFO] Updated alt scaffold placement

Here is what my data directory looks like:

$ tree data
data
├── TMP
│   └── alt_scaffold_placement.txt
├── all_alt_scaffold_placement.txt
├── alts_accessions_GRCh38.p2
├── chr_accessions_GRCh38.p2
└── genome
    ├── GRCh38.fa
    ├── GRCh38.fa.fai
    └── GRCh38.fa.gz

2 directories, 7 files

Am I supposed to have genomic_regions_definitions.txt somewhere? How am I supposed to obtain or generate this file? Thank you.

Error: Features added out of order

Hi,
When I run asdpex-cli-0.2.jar, I get the following error on a few samples. I have ensured that the input files have been sorted. Any help would be greatly appreciated! Thanks!

**************[WARN] found no 'first' ASDP for region REGION160 set to region start.
[WARN] found no 'last' ASDP for region REGION160 set to region stop.
****Exception in thread "main" java.lang.IllegalArgumentException: Features added out of order: previous (TabixFeature{referenceIndex=13, start=105573251, end=105573293, featureStartFilePosition=451911835841, featureEndFilePosition=-1}) > next (TabixFeature{referenceIndex=13, start=105573221, end=105573267, featureStartFilePosition=451911836245, featureEndFilePosition=-1})
at htsjdk.tribble.index.tabix.TabixIndexCreator.addFeature(TabixIndexCreator.java:89)
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:170)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:219)
at de.charite.compbio.asdpex.cmd.AnnotatedVariantWriter.put(AnnotatedVariantWriter.java:62)
at de.charite.compbio.asdpex.cmd.AnnotateVCFCommand.writeVariants(AnnotateVCFCommand.java:320)
at de.charite.compbio.asdpex.cmd.AnnotateVCFCommand.run(AnnotateVCFCommand.java:292)
at de.charite.compbio.asdpex.App.main(App.java:58)
make: *** [sortvcf] Error 1

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.