gfedonin / virgena

A reference guided assembler of highly variable viral genomes

Java 100.00%

bioinformatics genome-assembly ngs viral-genomics viral-ngs

virgena's Introduction

Welcome to VirGenA home page

VirGenA is a reference-guided assembler of highly variable viral genomes, based on iterative mapping and de novo reassembly of highly variable regions. Thanks to a specially designed read mapper, it can handle a distant reference sequence. VirGenA can separate mixtures of strains belonging to different intraspecies genetic groups (genotypes, subtypes, clades, etc.) and assemble a separate consensus sequence for each group in the mixture.

If provided with a multiple sequence alignment (MSA) of target references, VirGenA selects an optimal reference set, sorts reads to the selected references, and outputs the consensus sequences corresponding to these references. For each consensus sequence, the multiple sequence alignment of its constituent reads is written in BAM format.

If no MSA is provided, VirGenA works in single-reference mode and uses the user-provided reference.

Multi-fragment references are supported in single-reference mode.

You can use VirGenA for full genome assembly, or just to find the optimal reference set for given FASTQ files with Illumina paired-end reads.

Documentation

Complete documentation is provided in wiki format.

Installation

VirGenA is a Java application: it runs on any platform that supports the JVM. Simply download the latest release file and run it according to the usage instructions.

Required dependencies

The following are required to run VirGenA:

-Java version 8 or higher

-VSEARCH binary in any location. The path to the binary is set in the configuration file. The recommended version is included in the distribution.

-BLAST installed locally

Toy example

To run VirGenA with the test data, download and unzip the release files.

On Windows:

You can set the number of threads in config_test_win.xml by changing the value of the ThreadNumber element.

Using the Windows command prompt, change to the unzipped folder and type:

java -jar ./VirGenA.jar assemble -c config_test_win.xml

On Linux:

You can set the number of threads in config_test_linux.xml by changing the value of the ThreadNumber element.

Change the permissions of ./tools/vsearch to make it executable. Then, in a shell, change to the unzipped folder and type:

java -jar VirGenA.jar assemble -c config_test_linux.xml
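Before launching a run it can be handy to confirm which ThreadNumber value a config file actually carries. The following is a minimal sketch, not part of VirGenA: the class name ConfigPeek and the helper firstElementText are hypothetical, and the only thing taken from this page is the ThreadNumber element name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not shipped with VirGenA.
public class ConfigPeek {

    // Returns the trimmed text of the first element with the given tag,
    // or null if the document has no such element.
    static String firstElementText(InputStream xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0
                    ? null
                    : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse config XML", e);
        }
    }

    // Convenience overload for XML held in a string.
    static String firstElementText(String xml, String tag) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return firstElementText(new ByteArrayInputStream(bytes), tag);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the config file sits in the current directory.
        String path = args.length > 0 ? args[0] : "config_test_linux.xml";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println("ThreadNumber = " + firstElementText(in, "ThreadNumber"));
        }
    }
}
```

Run from the unzipped folder, it prints the value found in the file; pass another config path as the first argument to inspect a different file.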

The test data is an artificial mixture of 100000 paired HIV reads from three different subtypes (01_AE, B and C) in equal proportions. VirGenA should detect these components and assemble a genome-length consensus sequence for each of them.

Results will be stored in the ./res/ folder. The expected output is:

Files (fasta) with assemblies of three mixture components named after the selected references: 01_AE.TH.90.CM240.U54771_assembly.fasta, B.FR.83.HXB2_LAI_IIIB_BRU.K03455_assembly.fasta, C.BW.96.96BW0502.AF110967_assembly.fasta
Sorted bam files with read alignments and corresponding index files (bai): 'reference_name'_mapped_reads.bam and 'reference_name'_mapped_reads.bai
Log file.
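
The list above can be turned into a quick post-run check. This is a minimal sketch under stated assumptions: the class ExpectedOutputCheck and its methods are hypothetical, the file-name patterns are copied from the list above, and the log file is not checked because its name is not given there.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not shipped with VirGenA.
public class ExpectedOutputCheck {

    // Reference names taken from the expected assembly file names listed above.
    static final List<String> TOY_REFERENCES = Arrays.asList(
            "01_AE.TH.90.CM240.U54771",
            "B.FR.83.HXB2_LAI_IIIB_BRU.K03455",
            "C.BW.96.96BW0502.AF110967");

    // Builds the three file names expected for one selected reference.
    static List<String> expectedFiles(String referenceName) {
        return Arrays.asList(
                referenceName + "_assembly.fasta",
                referenceName + "_mapped_reads.bam",
                referenceName + "_mapped_reads.bai");
    }

    // Returns the expected files that are absent from the result directory.
    static List<String> missingFiles(Path resultDir, List<String> referenceNames) {
        List<String> missing = new ArrayList<>();
        for (String ref : referenceNames) {
            for (String name : expectedFiles(ref)) {
                if (!Files.isRegularFile(resultDir.resolve(name))) {
                    missing.add(name);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path resultDir = Paths.get(args.length > 0 ? args[0] : "./res");
        List<String> missing = missingFiles(resultDir, TOY_REFERENCES);
        if (missing.isEmpty()) {
            System.out.println("All expected output files are present.");
        } else {
            for (String name : missing) {
                System.out.println("Missing: " + name);
            }
        }
    }
}
```

An empty "missing" list only shows that the files exist; it says nothing about the quality of the assemblies.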

How to cite:

Fedonin GG, Fantin YS, Favorov AV, Shipulin GA, Neverov AD. VirGenA: a reference-based assembler for variable viral genomes. Brief Bioinform, 2017 Jul 28. doi: 10.1093/bib/bbx079.

virgena's Issues

getting error in Map module

I ran the tool as follows, on a machine with 35 GB of RAM:

java -jar VirGenA.jar map -c config_copy.xml

and got this error:
DEBUG 2021-11-12 13:14:59 BlockCompressedOutputStream Using deflater: Deflater
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:127)
at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:190)
at BamPrinter.printBAM(BamPrinter.java:321)
at Mapper.run(Mapper.java:574)
at VirGenA.main(VirGenA.java:31)
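
The command in this report does not pass a -Xmx option, so the JVM picks its own default maximum heap, which is typically only a fraction of physical memory (commonly one quarter) rather than the full 35 GB. "GC overhead limit exceeded" is the JVM's way of saying that it is spending almost all of its time in garbage collection while reclaiming very little heap. The following is a minimal sketch for checking which heap limit a JVM really gets; the class name HeapReport is hypothetical and unrelated to VirGenA, and whether a larger heap is enough for a given data set is not something this page confirms.

```java
// Hypothetical helper, not shipped with VirGenA.
public class HeapReport {

    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Describes the maximum heap the current JVM will attempt to use.
    static String describeMaxHeap() {
        long maxBytes = Runtime.getRuntime().maxMemory();
        return "Max heap available to this JVM: " + toMiB(maxBytes) + " MiB";
    }

    public static void main(String[] args) {
        System.out.println(describeMaxHeap());
    }
}
```

Compile it with javac and run it with the same heap option intended for VirGenA, for example java -Xmx30G HeapReport, then compare the printed value with a run that omits -Xmx.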

getting error

java -Xmx30G -jar /home/iipruser/VirGenA_v1.4/VirGenA.jar assemble -c /home/iipruser/VirGenA_v1.4/config_test_linux.xml
java.io.IOException: File /media/iipruser/shanmu_data/Sanjay_Viral_whole_genome/denovo_with_reference_alignment_27th_Nov_2021/ALL_NPV/AllNPV_samtools_reads_1.96m_reads/all_npv_samtools_R1_paired.fastq.gz have incorrect sequence identifier string
at DataReader.readFilesWithReads(DataReader.java:142)
at DataReader.readData(DataReader.java:41)
at DataReader.<init>(DataReader.java:75)
at DataReader.getInstance(DataReader.java:102)
at KMerCounter.<init>(KMerCounter.java:17)
at KMerCounter.getInstance(KMerCounter.java:59)
at Mapper.<init>(Mapper.java:29)
at ConsensusBuilderSimple.<init>(ConsensusBuilderSimple.java:23)
at ConsensusBuilderWithReassembling.<init>(ConsensusBuilderWithReassembling.java:41)
at RefBasedAssembler.run(RefBasedAssembler.java:665)
at VirGenA.main(VirGenA.java:34)
java.lang.NullPointerException
at KMerCounter.<init>(KMerCounter.java:40)
at KMerCounter.getInstance(KMerCounter.java:59)
at Mapper.<init>(Mapper.java:29)
at ConsensusBuilderSimple.<init>(ConsensusBuilderSimple.java:23)
at ConsensusBuilderWithReassembling.<init>(ConsensusBuilderWithReassembling.java:41)
at RefBasedAssembler.run(RefBasedAssembler.java:665)
at VirGenA.main(VirGenA.java:34)
java.io.IOException: File /media/iipruser/shanmu_data/Sanjay_Viral_whole_genome/denovo_with_reference_alignment_27th_Nov_2021/ALL_NPV/AllNPV_samtools_reads_1.96m_reads/all_npv_samtools_R1_paired.fastq.gz have incorrect sequence identifier string
at DataReader.readFilesWithReads(DataReader.java:142)
at DataReader.readData(DataReader.java:41)
at DataReader.<init>(DataReader.java:75)
at DataReader.getInstance(DataReader.java:102)
at ConsensusBuilderWithReassembling.assemble(ConsensusBuilderWithReassembling.java:762)
at RefBasedAssembler.run(RefBasedAssembler.java:666)
at VirGenA.main(VirGenA.java:34)
java.lang.NullPointerException
at ConsensusBuilderWithReassembling.assemble(ConsensusBuilderWithReassembling.java:764)
at RefBasedAssembler.run(RefBasedAssembler.java:666)
at VirGenA.main(VirGenA.java:34)

I am using the same reads for de novo assembly with SPAdes, and that works fine, but here I get this error.

getting error while running map and assemble

I could run the .jar, but I am getting this error:
java -jar /home/shanmu/VirGenA/release_v1.4/VirGenA.jar assemble -c /home/shanmu/VirGenA/release_v1.4/config.xml
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.BufferedReader.readLine(BufferedReader.java:356)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at DataReader.readFilesWithReads(DataReader.java:144)
at DataReader.readData(DataReader.java:41)
at DataReader.<init>(DataReader.java:75)
at DataReader.getInstance(DataReader.java:102)
at KMerCounter.<init>(KMerCounter.java:17)
at KMerCounter.getInstance(KMerCounter.java:59)
at Mapper.<init>(Mapper.java:29)
at ConsensusBuilderSimple.<init>(ConsensusBuilderSimple.java:23)
at ConsensusBuilderWithReassembling.<init>(ConsensusBuilderWithReassembling.java:41)
at RefBasedAssembler.run(RefBasedAssembler.java:665)
at VirGenA.main(VirGenA.java:34)

java -jar /home/shanmu/VirGenA/release_v1.4/VirGenA.jar assemble -c /home/shanmu/VirGenA/release_v1.4/config.xml
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
at java.lang.StringCoding.encode(StringCoding.java:344)
at java.lang.StringCoding.encode(StringCoding.java:387)
at java.lang.String.getBytes(String.java:958)
at DataReader.readFilesWithReads(DataReader.java:152)
at DataReader.readData(DataReader.java:41)
at DataReader.<init>(DataReader.java:75)
at DataReader.getInstance(DataReader.java:102)
at KMerCounter.<init>(KMerCounter.java:17)
at KMerCounter.getInstance(KMerCounter.java:59)
at Mapper.<init>(Mapper.java:29)
at ConsensusBuilderSimple.<init>(ConsensusBuilderSimple.java:23)
at ConsensusBuilderWithReassembling.<init>(ConsensusBuilderWithReassembling.java:41)
at RefBasedAssembler.run(RefBasedAssembler.java:665)
at VirGenA.main(VirGenA.java:34)

Please help me solve this.

Release 1.4 does not contain java executable

As stated in the header, there is no jar file included with the 1.4 release.

Java memory issues

Hi Gennady,

I am trying to run VirGenA on a cluster but have been running into Java memory errors (both heap space and GC overhead limit exceeded). The trial data run fine. For the experimental data, I am running against a single reference with paired-end reads that do not map to the host genome. The number of reads that should map to the virus genome is a small fraction of the total (expected to be ~500-1000 reads out of 20 million). Is the low mapping rate an issue here?

-Keir

./res/clusters_reads_0_670.uc (No such file or directory)

Dear Gennady G. Fedonin,

First, thank you very much for your tool, which will be very useful for us :)

I am trying the toy data set (01_AE, B, C) on macOS to test the installation, with config_test_linux.xml.
I get an error about a missing output file, shown below:

################################

me@MacPro:~/VirGenA/release_v1.4$ java -jar VirGenA.jar assemble -c config_test_linux.xml
java.io.FileNotFoundException: ./res/clusters_reads_0_670.uc (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at java.io.FileReader.<init>(FileReader.java:58)
at ReferenceFinder.readClustersAndBuildContigs(ReferenceFinder.java:233)
at ReferenceFinder.selectReferences(ReferenceFinder.java:545)
at RefBasedAssembler.assemble(RefBasedAssembler.java:552)
at RefBasedAssembler.run(RefBasedAssembler.java:663)
at VirGenA.main(VirGenA.java:34)

################################

The ./res/ directory exists and contains log.txt:

################################

Creating random reads model from MSA with the parameters given in the config file in
Creating random reads model from reference with the parameters given in the config file in
Mapping to MSA stats:

Total read pairs: 50000
Total pairs with both reads exists: 50000
Total reads: 100000
Total reads mapped from (3): 1,00, forward: 0,50, reverse: 0,50
Total pairs with both reads mapped from (2): 49923, 1,00
Concordant pairs from (5): 1,00
Total score: 21268249, average score: 212,92
Total time, s: 26
Time: 26
Reads after filtering by length >= 50: 99638
Preprocess time: 0
UClust time: 0

################################

Could you help us please ? :)

Best regards,
Nicolas

Not enough reads to assemble

I am using VirGenA in single-reference mode to assemble genomes from a mixture of virus strains, but for some samples it fails with the message "Not enough reads to assemble". I used the example config file and changed only the input, the insert size and the reference genome. Did I get something wrong?

virGenA run for viral genome database

Dear Gennady G. Fedonin,

Greetings of the day!

I am trying to run VirGenA on a sample (paired-end .fastq.gz files) against a viral database (multi-FASTA file).
As a first step, I tried to create an MSA FASTA file for the database with the T-Coffee tool, but it failed: the sequences were too long for the algorithm.
So I disabled the reference selector, but VirGenA kept running without writing any output file (memory used: 350 GB).
I am stuck at this stage. Please help me out. Also, is there any way to make an MSA FASTA (with or without an MSA tool)?

Command & config file used -

nohup java -Xmx350g -jar /home/softwares/virgenA_release_v1.4/VirGenA.jar assemble -c ./config1.xml

<config>
    <Data>
        <pathToReads1>./5x_R1_clean.fastq.gz</pathToReads1>
        <pathToReads2>./5x_R2_clean.fastq.gz</pathToReads2>
        <InsertionLength>1000</InsertionLength>
    </Data>
    <Reference>/home/phase_2/viruSITE/viruSITE_genomes.fasta</Reference>
    <OutPath>./out</OutPath>
    <ThreadNumber>-1</ThreadNumber>
    <BatchSize>1000</BatchSize>
    <ReferenceSelector>
        <Enabled>false</Enabled>
        <UseMajor>false</UseMajor>
        <ReferenceMSA>Path to reference MSA in FASTA format</ReferenceMSA>
        <PathToUsearch>/home/softwares/virgenA_release_v1.4/tools/vsearch</PathToUsearch>
        <UclustIdentity>0.95</UclustIdentity>
        <MinReadLength>50</MinReadLength>
        <MinContigLength>1000</MinContigLength>
        <Delta>0.05</Delta>
        <MaxNongreedyComponentNumber>5</MaxNongreedyComponentNumber>
        <MapperToMSA>
            <K>7</K>
            <pValue>0.01</pValue>
            <IndelToleranceThreshold>1.5</IndelToleranceThreshold>
            <RandomModelParameters>
                <Order>4</Order>
                <ReadNum>10000</ReadNum>
                <Step>10</Step>
            </RandomModelParameters>
        </MapperToMSA>
        <Graph>
            <MinReadNumber>5</MinReadNumber>
            <VertexWeight>10</VertexWeight>
            <SimilarityThreshold>0.5</SimilarityThreshold>
            <Debug>false</Debug>
        </Graph>
        <Debug>false</Debug>
    </ReferenceSelector>
    <Mapper>
        <K>5</K>
        <pValue>0.01</pValue>
        <IndelToleranceThreshold>1.25</IndelToleranceThreshold>
        <RandomModelParameters>
            <Order>4</Order>
            <ReadNum>1000</ReadNum>
            <Step>10</Step>
        </RandomModelParameters>
        <Aligner>
            <Match>2</Match>
            <Mismatch>-3</Mismatch>
            <GapOpenPenalty>5</GapOpenPenalty>
            <GapExtensionPenalty>2</GapExtensionPenalty>
        </Aligner>
    </Mapper>
    <ConsensusBuilder>
        <IdentityThreshold>0.9</IdentityThreshold>
        <CoverageThreshold>0</CoverageThreshold>
        <MinIntersectionLength>10</MinIntersectionLength>
        <MinTerminationReadsNumber>1</MinTerminationReadsNumber>
        <Reassembler>
            <IdentityThreshold>0.9</IdentityThreshold>
            <MinTerminatingSequenceCoverage>0</MinTerminatingSequenceCoverage>
            <PairReadTerminationThreshold>0.1</PairReadTerminationThreshold>
            <MinReadLength>50</MinReadLength>
        </Reassembler>
        <Debug>false</Debug>
    </ConsensusBuilder>
    <Postprocessor>
        <Enabled>true</Enabled>
        <MinFragmentLength>500</MinFragmentLength>
        <MinIdentity>0.99</MinIdentity>
        <MinFragmentCoverage>0.99</MinFragmentCoverage>
        <Debug>false</Debug>
    </Postprocessor>
</config>

{sample}.fastq.gz have incorrect sequence identifier string

Dear Fedonin,

I ran VirGenA with the option "assembling using reference, without msa" on reads that had been cleaned by alignment against a reference genome (Bowtie2 + Samtools). Other tools, such as fastqc, fastqscreen and DNAstar, accept the resulting fastq, but with VirGenA I get this error:

java.io.IOException: File {sample}.fastq.gz have incorrect sequence identifier string

Some parameters:

Mode:
Reference Selector: false
Use Major: true

Data:
Reads Insertion Length: 1000 nt

Computing:
Thread Number: -1 threads
Batch Size: 1000 reads

Assembling:
Reference: {my_ref}.fasta
MSA: {my_msa}.fasta
Minimum Read Length: 50 nt
Uclust Identity (%): 0.95
Minimum Contig Length: 1000 nt
Delta (%): 0.05

My fastq format (head) before and after cleaning :

BEFORE

@FS10001377:5:BPA73114-2327:1:1101:1140:1000 1:N:0:4
AACATTGGCCGTGACAGCTTGACAAATGTTAAAAACACTATTAGCATA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:1360:1000 1:N:0:4
GCACATCACTACGCAACTTTAGAGCACATCACTACGCAACTTTAGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:2240:1000 1:N:0:4
GCTTATTGTTGGCGTTGCACTTCTTGCTGTTTTTCAG

AFTER

@FS10001377:5:BPA73114-2327:1:1101:1000:1260
GAGTTTAGTTCCCTTCCATCATATGCAGCTTTTGCTACTGTTCAAGAAGCTTATGAGCAGGCTGTTGCTAATGGTGATTCTGAAGTTGTTCTTAAAAAGTTGAAGAAGTCTTTGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:1000:1530
CTGCTTGCACTGATGACAATGCTTTAGCTTACTACAACACAACAAAGGGAGGTAGGTTTGTACTTTCACTGTTATCCGATTTACAGGATTTGAAATGGGCTAGATTCCCTAAGAGTGATGGAACTGGTACTATC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:1000:2010
GCCATTGTGTATTTAGTAAGACGTTGACGTGATATATGTGGTACCATGTCACCGTCTATTCTAAACTTAAAGAAGTCATGTTTAGCAACAGCTGGACAATCCTTAAGTAAATTATAAATTGTTTCTTCATGTTGGTAG

Is the missing last part of the header (1:N:0:4) the problem?
Or maybe something else?

Thank you very much,
Nicolas
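
The difference between the two excerpts above is that the cleaned file has lost the description that follows the first space in each identifier line (for example "1:N:0:4"). Whether this is what VirGenA rejects is not confirmed anywhere on this page. The following is a minimal sketch for measuring how many identifier lines in a FASTQ file carry such a description; the class FastqHeaderCheck and its methods are hypothetical, and it assumes strict four-line records.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical diagnostic, not shipped with VirGenA.
public class FastqHeaderCheck {

    // True when the identifier line has text after its first space,
    // e.g. "@FS10001377:5:BPA73114-2327:1:1101:1140:1000 1:N:0:4".
    static boolean hasDescription(String identifierLine) {
        int space = identifierLine.indexOf(' ');
        return space > 0 && space < identifierLine.length() - 1;
    }

    // Returns {identifiers with description, identifiers without description}.
    // Assumes four lines per record; blank lines between records are skipped.
    static long[] countHeaders(Iterable<String> lines) {
        long with = 0;
        long without = 0;
        long physicalLine = 0;
        long recordLine = 0;
        for (String line : lines) {
            physicalLine++;
            if (recordLine % 4 == 0) {
                if (line.isEmpty()) {
                    continue;
                }
                if (line.charAt(0) != '@') {
                    throw new IllegalArgumentException(
                            "Line " + physicalLine + " is not a FASTQ identifier: " + line);
                }
                if (hasDescription(line)) {
                    with++;
                } else {
                    without++;
                }
            }
            recordLine++;
        }
        return new long[] {with, without};
    }

    public static void main(String[] args) throws IOException {
        InputStream in = Files.newInputStream(Paths.get(args[0]));
        if (args[0].endsWith(".gz")) {
            in = new GZIPInputStream(in);
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long[] counts = countHeaders(reader.lines()::iterator);
            System.out.println("Identifiers with description:    " + counts[0]);
            System.out.println("Identifiers without description: " + counts[1]);
        }
    }
}
```

Running it on the file before and after cleaning shows whether the cleaning step dropped the description from every record or only from some of them.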
