Giter Club home page Giter Club logo

giraffe-sv-paper's People

Contributors

adamnovak avatar cmarkello avatar jmonlong avatar jonassibbesen avatar xchang1 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

giraffe-sv-paper's Issues

readSVvcf is not work in findDups.R

Hello,

I want to construct the SV graph from several assemblies. Before starting my own works, I am testing the pipeline according to your manual, first merge SVs from multiple datasets, and then remap-to-dedup-merged-svs. In the second step, when I run the findDups.R, the readSVvcf does not work with the error massage "unused argument (vcf.object = TRUE)". When I remove the parameter "vcf.object=TRUE", the readSVvcf file could work but the command "svs.gr = rowRanges(svs)" gives another error message:

Error in MatrixGenerics:::.load_next_suggested_package_to_search(x) : 
  Failed to find a rowRanges() method for GRanges objects.

Any suggestions with this issue? Thanks in advance.

Best,
Zhongqu

SV calling using giraffe/vg

It's not very clear to me how exactly VG or Giraffe carries out SV calling (if it truly does). The tutorials mention how we can do genotyping using existing SV calls, but I was wondering if VG or Giraffe can actually identify novel SVs in genomes that were not used to construct the VG graph.

The figure in the "Research article summary" page in the Science paper (Dec 2021) has a legend that says "Call variants and genotype", while the title of the article is "Pangenomics enables genotyping of known structural variants in 5202 diverse genomes". The term "known" in the title indicates that the SVs were detected in an upstream step and giraffe used those SV calls for genotyping.

I'd appreciate it if you can clarify if Giraffe can actually do SV calling, or does it use existing calls for generating graphs that can genotype the calls.

sveval is not working in findDups.R

Hello,

When I repeat the example you provided, I updated my Sveval version to 1.2.1 and a new problem occurred. This error message is‘Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0), : key 60 (char '<') not in lookup table’.
Thank you for your busy schedule to answer for me.

Liubang

Error running vg construct on GIAB_0.6 sample ->> error[VPKG::load_one]: Correct input type not found while loading handlegraph::MutablePathDeletableHandleGraph

Hi,

I am trying to run and replicate the giraffe SV paper results for GIAB 0.6 sample, to genotype SVs from the Illumina paired reads. I am currently following the steps from the giraffe paper, where instead of using the toil-vg scripts, the steps mentioned in the Snakefile for giraffe paper are the ones i am referring to (https://github.com/vgteam/vg_snakemake/blob/master/Snakefile).

Input ref : hs37d5.fa.gz and its corresponding index
input vcf : HG002_SVs_Tier1_v0.6.vcf.gz and its index file
After running the "vg construct" step without pre-processing input VCF, I am getting the following error in the log files,

"stderr": "Restricting to 1 from 1 to end\n building graph for 1 [ ] 0.0%\rwarning:[vg::Constructor] Lowercase characters found in 1; coercing to uppercase.\nwarning:[vg::Constructor] Unsupported IUPAC ambiguity codes found in 1; coercing to N.\nerror:[vg::Constructor] unacceptable characters found in 1.\nerror[VPKG::load_one]: Correct input type not found in standard input while loading handlegraph::MutablePathMutableHandleGraph\n",

->> I figured out the initial warnings, but the error towards the end re occurs, even if I re-Index the the input files.
->> Hickey et al 2019, uses the GIAB 0.5 samples as here, (https://github.com/vgteam/sv-genotyping-paper/tree/master/human/giab),

but the pre-processing step for preparing the input SV catalog file, is what gives me null VCF with only headers when running for GIAB 0.6, similar to steps below:

wget -nc ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/svanalyzer_union_171212_v0.5.0_annotated.vcf.gz
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/svanalyzer_union_171212_v0.5.0_annotated.vcf.gz.tbi
vcfkeepinfo svanalyzer_union_171212_v0.5.0_annotated.vcf.gz NA | vcffixup - | bgzip > giab-0.5.vcf.gz
tabix -f -p vcf giab-0.5.vcf.gz

->> What I am trying to understand is how to prepare the input SV catalog file for GIAB 0.6 to be used in the "vg construct - ..." step, so as to avoid the above errors ?

1000 Genomes related SV calls

Hi,

Many thanks for making all code and data available. I am very keen to access this file (https://cgl.gi.ucsc.edu/data/giraffe/products/vggiraffe-sv-relkgp-raw.vcf.gz) reflecting the SV calls from the related subjects in 1000 Genomes. Currently this file is identical to the file containing unrelated subjects (https://cgl.gi.ucsc.edu/data/giraffe/products/vggiraffe-sv-2504kgp-raw.vcf.gz) and so this data is not available. I would be incredibly grateful if you can share this data with us.

Kind regards,

Dr Sam Kleeman MD
PhD Student
Cold Spring Harbor Laboratory, NY

Reference bias issue

Hi, I would like to test whether VG can efficiently reduce reference bias in potato as good as your team did in human genome analysis, thus I repeat the giraffe article with the following commands, but the results show giraffe is a little good than bwa, but not very good as your team did. Is it my command problem or my input vcf problem to construct the variation graph? do you have any suggestions?
The command for filtering input vcf from gatk variation calling

vcftools --gzvcf 44.raw.vcf.gz --minQ 30 --recode --minDP 3 --recode-INFO-all --out 44.snp.indel.vcf
bgzip 44.snp.indel.vcf
bcftools tabix 44.snp.indel.vcf.gz

Command for VG reference bias

#vg construct
ref=DM.fa
name=DM
vcf=44.snp.indel.vcf.gz
vg construct -r $ref -t 20 -v $vcf -a > x.$name.vg
vg index -x x.$name.xg -g x.$name.gcsa -k 16 -t 100  x.$name.vg

##gbwt
vg gbwt -p -g output.$name.gg -o output.$name.gbwt -E -x x.$name.xg 
vg minimizer -t 20 -p -i output.$name.min -g output.$name.gbwt -G output.$name.gg
vg snarls x.$name.xg >x.$name.snarls
vg index -j $name.dist -s x.$name.snarls x.$name.vg 

#real data for reference bias
f1=02_data/PG5002/s5002_HGVCKALXX_L1_1.clean.fq.gz
f2=02_data/PG5002/s5002_HGVCKALXX_L1_2.clean.fq.gz

#giraffe
vg giraffe -H output.$name.gbwt -g output.$name.gg -m output.$name.min -d $name.dist -f $f1 -f $f2 -t 24 | vg surject -x x.$name.xg -b -t 24 - > mapped.giraffe.bam
samtools sort mapped.giraffe.bam -@ 20 > mapped.sorted.giraffe.bam
samtools index -@ 20 mapped.sorted.giraffe.bam
#bwa
bwa mem -t 24 -p $ref $f1 $f2 > mapped.bwa.bam
samtools sort mapped.bwa.bam -@ 20 > mapped.sorted.bwa.bam
samtools index -@ 20 mapped.sorted.bwa.bam

bcftools mpileup --threads 20 -f $ref -E -a DP -a SP -a ADF -a ADR -a AD  -O u  mapped.sorted.giraffe.bam | bcftools call --threads 20 -m -o calls.giraffe.vcf.gz -O z -v
bcftools mpileup --threads 20 -f $ref -E -a DP -a SP -a ADF -a ADR -a AD  -O u  mapped.sorted.bwa.bam | bcftools call --threads 20 -m -o calls.bwa.vcf.gz -O z -v

bcftools sort -O z calls.bwa.vcf.gz >calls.bwa.sorted.vcf.gz
bcftools sort -O z calls.giraffe.vcf.gz >calls.gif.sorted.vcf.gz

bcftools index -f calls.bwa.sorted.vcf.gz
bcftools index -f calls.gif.sorted.vcf.gz

# vi  rename bam name to mapped.sorted.bam
bcftools merge calls.gif.sorted.vcf.gz calls.bwa.sorted.vcf.gz -O z > all_calls.vcf.gz

python3 plot_allele_bia.py  all_calls.vcf allel_bias.svg

The resulting figure.
image

sim error : sample "name" not found in the GBWT index`

Hi,
I simulate reads from vg for testing my species alignment based on your giraffe-sv-paper script. However, I get an error below
command :
vg sim -r -n 1000 -a -s 12345 -p 570 -v 165 -i 0.00029 -x x.xg -g x.gbwt --sample-name C151 --ploidy-regex "JTFH.*:0,KN70.*:0,Y:0,chrY_.*:0,chrEBV:0,.*:2" | vg annotate -p -x x.xg -a - > sim.gam
[vg sim] error: sample "C151" not found in the GBWT index

if I remove the -g option for simulate reads from vg, it's worked, the result shown as below(not remove segmental duplications from prep section since my species do not have segmental duplications bed file , and I used conda to install vg)
image

So, do you have any suggestion to get the ROC figure as the same good as your giraffe articles?
Many thanks for your help~

Giraffe SV genotypeing warning [VCFTraversalFinder] Warning: No alt path found '' in graph for variant. It will be ignored:

Hi,
I would like to genotype SV from Illumina paired reads. Thus I follow the giraffe article but get a warning at VG calling step. After running vg call, I found nothing in my result file except vcf header information.
my commands

##for autoindex
vg autoindex --workflow giraffe -r $ref -v $vcf -p $name -R XG
vg snarls -T $name.xg >$name.snarls
##maping
vg giraffe -x $name.xg -m $name.min -d $name.dist -b fast --rescue-algorithm "dozeu" -N $sample_name --gbwt-name $name.giraffe.gbwt -C 500 -o gaf -f $f1 -f $f2 -t $thread | gzip > $sample_name.gaf.gz
vg pack -x $name.xg -a $sample_name.gaf.gz -Q 5 -t $thread -o $sample_name.pack
##genotying
vg call -k $sample_name.pack  -t $thread -s $sample_name --snarls $name.snarls -v $vcf $name.xg > $sample_name.gif.vcf

The part of error info:
chr01 42052094 svim.DEL.5262 GAATTTTTGAGACCTTCTCAAAAATTCTGCCCAGTCTTGATTTGTGCAAAAACGCTCGTTGAGGAATTTTTGAGACCCTCTCAAAAATT G 29 PASS CHR2=chr01;CIEND=0,597;CIPOS=0,438;END=42052182;STRANDS=+-;SUPP=2;SUPP_VEC=00000000000001100000000000000000000000000000;SVLEN=-518;SVMETHOD=SURVIVOR1.0.7;SVTYPE=DEL [VCFTraversalFinder] Warning: No alt path (prefix=_alt_232b4364c8fa4532d6e2faf2485ea2d31050c1c2_) found in graph for variant. It will be ignored: chr01 42052195 svim.DEL.5142 CTGATCTGTGAGAAGAAATGCTTCTGATGGAATTTTTGAGTTCCTCTAAAAAATTCTTCCCCAGTATCGAGATTTCCAAACAGAATGCTCTTTGATCGAATATTTGAGACTCTATCAAAATTTTTGCCCTAGTTAC C 14 PASS CHR2=chr01;CIEND=0,902;CIPOS=0,941;END=42052330;STRANDS=+-;SUPP=2;SUPP_VEC=00000000000000000010000000100000000000000000;SVLEN=-135;SVMETHOD=SURVIVOR1.0.7;SVTYPE=DEL [VCFTraversalFinder] Warning: No alt path (prefix=_alt_7fe770621b89511fa1025b701f29877e91fa35fe_) found in graph for variant. It will be ignored:

QUESTIONS in code of SV genotyping

Hi,

When I was converting the cram to the fastq, I found the code in your WDL workflow:

seq 0 ~{in_nb_chunks} | head -n ~{in_max_chunks} | parallel -j ~{in_cram_convert_cores} "samtools collate -k {} -K ~{in_nb_chunks} --reference ~{in_ref_file} -Ouf ~{in_cram_file} {} | samtools fastq -1 reads.{}.R1.fastq.gz -2 reads.{}.R2.fastq.gz -0 reads.{}.o.fq.gz -s reads.{}.s.fq.gz -c 1 -N -"

However, it seems that samtools collate doesn't have the parameter "k" or "K". Could you please make an explanation for this and check which parameter was used in this step

Thanks!

Some questions about workflow?

Hello,

As is shown in your workflow, you have converted the bam to the fastq at the first step; then the SV calling, genotyping...

However, why don't we use the fastq file directly. I mean, it will save much time and avoid the problem of genome reference version.

But I am not sure whether it is right or suitable? Will it influence the results?

Hope for you response.
Best,

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.