eblerjana / pangenie
Pangenome-based genome inference
License: MIT License
Hi!
I used your pipeline pangenome-graph-from-assemblies
to construct a VCF file, but I noticed that some of the variants have an unusually high number of alleles; see below:
1 18300 . GACACACAACACACAC GACAGACAGACACACACAC,GACAGACAGACAGACAGACACAC,GACAGACAGACAGACAGACAGACACACACACACACACACAC,G,GACACACACACACACACACACAC,GACAGACAGACAGACACACACACACACAC
1 19171 . CAAAAAAAA CAA,CAAA,CAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAAAA,CAAAAA,CAAAA,CAAAAAAAAAA,C,CA,CAAAAAAAAAAA
I suspect that this may be due to repetitive sequences in my input assemblies, which were directly obtained from hifiasm assembly results without being masked.
Hi, @eblerjana
Can the variants VCF contain just a single chromosome? For parallelizing large-scale sample sets, can we use a per-chromosome VCF for genotyping/phasing, or does PanGenie need whole-genome variants for more precise k-mer counting and genotyping?
Thanks.
Zhigui Bao
Hi Jana,
I've noticed that the memory usage when counting k-mers directly from reads (rather than from a .jf file) is incredibly high, and it looks like PanGenie reads the entire FASTQ file into memory and then counts the k-mers. I think this also leads to under-reporting of usage, because the PanGenie log for a run says
Count kmers in genome ...
#### Memory usage until now: 61.7867 GB ####
Determine unique kmers ...
#### Memory usage until now: 79.1394 GB ####
....
Total maximum memory usage: 84.4493 GB
Meanwhile, the cluster reports a peak usage of 223.6 GB for the same run. At one point during k-mer counting it was 176.3 GB, so k-mer counting definitely seems to be a big bottleneck, especially for moderate/high-coverage samples.
Just running Jellyfish by itself to produce the .jf file peaked at only 42.1 GB, so that currently seems like the better approach.
I tried, and thoroughly failed, to implement a streaming feature in Jellyfish. It looks like they have some code set up for that, but I didn't get very far.
Currently, genotype likelihoods from subsampling steps are added and normalized after each step. A better approach would be to first add up all likelihoods, and then do the normalization once at the end. Otherwise, the likelihoods are not weighted equally, which might lead to genotyping errors.
Line 96 in ae47956
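The weighting issue can be seen with a toy sketch (hypothetical numbers, not PanGenie code): re-normalizing after every subsampling step gives later steps more weight than earlier ones, while a single normalization at the end weights all steps equally.

```python
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# genotype likelihoods from two hypothetical subsampling steps
steps = [[2.0, 6.0], [30.0, 10.0]]

# current behaviour: add and re-normalize after every step
acc = [0.0, 0.0]
for step in steps:
    acc = normalize([a + s for a, s in zip(acc, step)])

# proposed: add up all likelihoods first, normalize once at the end
once = normalize([sum(col) for col in zip(*steps)])

print(acc, once)  # the two results differ, i.e. the steps are weighted unequally
```

In the first variant, step one is squashed to sum 1 before step two's raw (larger) values are added, so later steps dominate.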
Hi Jana,
I ran PanGenie for genotyping on my own dataset, and it finished without errors or warnings. However, I found that on all 11 chromosomes the genotyping results stopped too early, around 11 Mb or 13 Mb, while the chromosome lengths should be close to 30 Mb.
What is your recommendation to solve this problem?
Thank you!
Honghe
Hello,
I'm trying to genotype short reads using PanGenie, but during the indexing step with the VCF output of minigraph-cactus I'm encountering this error:
terminate called after throwing an instance of 'std::runtime_error'
what(): GraphBuilder::GraphBuilder: Found invalid genotype. Genotypes must be diploid (.|. if missing).
Aborted (core dumped)
Here is a sample from my minigraph cactus vcf :
chr1 13551 >2370>2374 T C,A 60.0 . AC=1,16;AF=0.0434783
=>2370>2371>2374,>2370>2373>2374,>2370>2372>2374;NS=23;LV=0 GT 0 2 2 2 2 2 2 2 . . . . 2 2 0 . 2 . 0 0 2 2 0 1 . 2 .
I'm not working on the human genome, so I can't use the pre-processing pipeline. Does my VCF need pre-processing after minigraph-cactus (apart from vcfbub)?
Thank you beforehand.
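For what it's worth, the constraint behind this error can be sketched as a small check (illustrative Python, not PanGenie's actual code): every sample field needs a two-allele GT, with `.|.` for missing calls, whereas the minigraph-cactus record above carries haploid calls like `0` or `.`.

```python
def invalid_genotypes(vcf_line):
    """Return the GT values that are not diploid in a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    bad = []
    for col in fields[9:]:                        # sample columns start at index 9
        gt = col.split(":")[0]
        if len(gt.replace("/", "|").split("|")) != 2:
            bad.append(gt)                        # haploid calls like "0" or "." fail
    return bad

line = "chr1\t13551\t.\tT\tC,A\t60.0\t.\t.\tGT\t0\t2\t.|."
print(invalid_genotypes(line))   # flags the haploid "0" and "2"; ".|." passes
```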
Hi,
Thank you very much for such a nice tool. The usage says:
PanGenie [options] -i <reads.fa/fq> -r <reference.fa> -v <variants.vcf>
I found that the paper (https://www.biorxiv.org/content/10.1101/2020.11.11.378133v1) says: "we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation..."
May I ask whether PanGenie can only use short reads? Can long reads (such as Nanopore or PacBio) be used for <reads.fa/fq>? What is the advantage over second-generation (short-read) data?
The genotype in the output file is the genotype of the reads (i.e. -i <reads.fa/fq>) corresponding to the sample, right?
Thanks.
Running PanGenie with a Jellyfish database in .jf format and multiple threads (-t) produces randomized output.
Executing it with either a .jf file or multiple threads alone works fine, but not with both options provided.
jellyfish count -m 31 -s 3G -p 126 -c 7 -C -L 1 --if <path-segments.fasta> <input.fasta> -o input.jf
PanGenie -i input.jf -r <reference.fa> -v <variants.vcf> -t 2 -o V1
PanGenie -i input.jf -r <reference.fa> -v <variants.vcf> -t 2 -o V2
diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
Both genotyping files should be identical:
diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
0
However, there are many (small) differences:
diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
10790792
The problem seems to be in the JellyfishReader class. I'm not familiar with the Jellyfish source code, but I presume that the line abundance = this->db->check(jelly_kmer); in the function getKmerAbundance cannot be parallelized as is.
I tried guarding the line with a mutex, and this does resolve the problem. However, the locking overhead makes the "Determine unique kmers" section almost as slow as serial execution, so it's definitely not an optimal solution.
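Conceptually (a Python stand-in, not the actual C++ fix), the mutex serializes every lookup into the shared database, which is why the section becomes nearly as slow as single-threaded execution:

```python
import threading

_db_lock = threading.Lock()
_db = {"ACGT": 5}                 # stands in for the Jellyfish database

def get_kmer_abundance(kmer):
    # every thread queues here, so the hot loop is effectively serialized,
    # but results become deterministic again
    with _db_lock:
        return _db.get(kmer, 0)

print(get_kmer_abundance("ACGT"))   # 5
```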
Remark: The bug cannot be observed with the provided test data, because that VCF file only contains a single chromosome. In that case PanGenie will only use a single thread, regardless of how many were specified on the command line.
The PanGenie command line help indicates that a jellyfish database in jf format is an option for input.
-i VAL sequencing reads in FASTA/FASTQ format or Jellyfish database in jf format.
NOTE: INPUT FASTA/Q FILE MUST NOT BE COMPRESSED. (required).
What is the recommended Jellyfish command for creating the .jf file? How does the .jf file compare in size to the uncompressed FASTA or FASTQ file?
Hi, pangenie group,
I ran into a problem when running PanGenie with Singularity: it seems it cannot open the reference FASTA file, even though I can read it myself. The error is listed below. In addition, I'd like to ask how to select an appropriate reference FASTA file: is it also a part of our pangenome?
singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif cat /metadata/jellyfish.lib.version
libjellyfish-2.0-2:amd64 2.3.0-4build1 libjellyfish-2.0-dev:amd64 2.3.0-4build1
singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif cat /metadata/pangenie.git.version
6e0da25
singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif PanGenie -i $input.fastq -r $reference.fa -v $graph.vcf -o test
program: PanGenie - genotyping and phasing based on kmer-counting and known haplotype sequences.
author: Jana Ebler
Files and parameters used:
-c 0
-d 0
-e 3000000000
-g 1
-i $input.fastq
-j 1
-k 31
-o test
-p 0
-r $reference.fa
-s sample
-t 1
-u 0
-v $graph.vcf
Determine allele sequences ...
terminate called after throwing an instance of 'std::runtime_error'
what(): FastaReader::parse_file: reference file cannot be opened.
Aborted
Hi @eblerjana ,
For the newest code in the master branch, I encountered errors like this:
[ 1%] Building CXX object src/CMakeFiles/PanGenieLib.dir/emissionprobabilitycomputer.cpp.o
In file included from /public/home/wjwei/SFTW/pangenie-3.0.0/src/uniquekmers.hpp:9,
from /public/home/wjwei/SFTW/pangenie-3.0.0/src/emissionprobabilitycomputer.hpp:8,
from /public/home/wjwei/SFTW/pangenie-3.0.0/src/emissionprobabilitycomputer.cpp:5:
/public/home/wjwei/SFTW/pangenie-3.0.0/src/kmerpath.hpp:8:10: fatal error: cereal/access.hpp: No such file or directory
8 | #include <cereal/access.hpp>
| ^~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [src/CMakeFiles/PanGenieLib.dir/emissionprobabilitycomputer.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/PanGenieLib.dir/all] Error 2
make: *** [all] Error 2
After using conda to create an env, cereal is here:
ll /public/home/wjwei/SFTW/conda/envs/pangenie/include/cereal/
total 515K
-rw-rw-r-- 2 wjwei JBYan 15K Feb 28 2022 access.hpp
drwxr-xr-x 2 wjwei JBYan 4.0K Aug 22 21:48 archives/
-rw-rw-r-- 2 wjwei JBYan 47K Feb 28 2022 cereal.hpp
drwxr-xr-x 2 wjwei JBYan 4.0K Aug 22 21:48 details/
drwxr-xr-x 4 wjwei JBYan 4.0K Aug 22 21:48 external/
-rw-rw-r-- 2 wjwei JBYan 6.6K Feb 28 2022 macros.hpp
-rw-rw-r-- 2 wjwei JBYan 6.5K Feb 28 2022 specialize.hpp
drwxr-xr-x 3 wjwei JBYan 4.0K Aug 22 21:48 types/
-rw-rw-r-- 2 wjwei JBYan 2.4K Feb 28 2022 version.hpp
So, I'm confused about the source-code install.
Sincerely looking forward to your reply.
Best wishes, Wei
Hi Jana,
Since PanGenie relies on exact k-mer matches, I want to know whether it will perform well with input SVs that have imprecise breakpoints. Have you investigated how inaccurate breakpoints impact the genotyping results? Thank you.
Best regards,
Zheng zhuqing
Description:
I'm encountering an issue when running the following command:
PanGenie-index -r test.fa -v test.vcf -t 24 -o pan_test
The error message I'm receiving is:
program: PanGenie - genotyping based on kmer-counting and known haplotype sequences.
author: Jana Ebler
version: v3.0.1
Files and parameters used:
-e 3000000000
-k 31
-o pan_test
-r test.fa
-t 24
-v test.vcf
Determine allele sequences ...
Read reference genome ...
Found 168 chromosome(s) from the reference file.
Read input VCF ...
terminate called after throwing an instance of 'std::runtime_error'
what(): GraphBuilder: number of paths is limited to 254 in current implementation.
The test.vcf file contains 170 samples, but PanGenie supports a maximum of 254 haplotype paths, which corresponds to 127 diploid samples. Can I extract an individual from test.vcf, change all of that individual's genotypes to 1/1, and then proceed with PanGenie-index?
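Mechanically, extracting a single sample just means keeping the nine fixed VCF columns plus one sample column; a hedged sketch follows (whether re-setting genotypes to 1/1 afterwards is statistically sound is a separate question for the maintainers).

```python
def extract_sample(line, sample_index):
    """Keep the 9 fixed VCF columns plus the sample column at sample_index."""
    if line.startswith("##"):
        return line                                  # meta lines pass through
    f = line.rstrip("\n").split("\t")
    return "\t".join(f[:9] + [f[9 + sample_index]])  # works for #CHROM and data lines

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
print(extract_sample(header, 1))   # keeps only sample S2
```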
I want to install PanGenie on an ARM server, but it keeps producing errors. I don't know whether PanGenie supports the ARM architecture; if you know, please tell me. Thanks.
After downloading pangenie from GitHub and installing jellyfish-2.2.10 via conda, the error is:
-- The C compiler identification is GNU 8.5.0
-- The CXX compiler identification is GNU 8.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PkgConfig: /usr/bin/pkg-config (found version "1.4.2")
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.11")
-- Checking for module 'jellyfish-2.0'
-- Package 'jellyfish-2.0', required by 'virtual:world', not found
CMake Error at /usr/share/cmake/Modules/FindPkgConfig.cmake:556 (message):
A required package was not found
Call Stack (most recent call first):
/usr/share/cmake/Modules/FindPkgConfig.cmake:778 (_pkg_check_modules_internal)
CMakeLists.txt:11 (pkg_check_modules)
-- Configuring incomplete, errors occurred!
Hi @eblerjana,
I've been working with PanGenie more often since my last issue. I was keen on assessing its performance for genome inference of HG002, so I benchmarked my PanGenie outputs for this sample (obtained after running the tool on both my pangenome graphs, built with minigraph-CACTUS and PGGB) against the truth set from GIAB.
While not as good as with the HPRC graph, the recall is reasonable; the precision, on the other hand, is straight-up bad... After talking with the team at the HPRC here in Santa Cruz, they pointed out that most of the FPs lowering the precision are likely in satellite and other difficult regions (I guess?), and explained that most of a PanGenie run's precision improvement comes from post-processing filtering with an ML approach (I've been reading #14).
Therefore, my question is: how can I run this script for just my single HG002 sample? What I got from issue #14 is that it is particularly beneficial for large cohorts with trio data; is there a way to apply it to a single sample? Also, you pointed to a concept of genotype quality which might work in the absence of that information; how can I assess that?
Thanks in advance. Also, looking online I came across these two links:
Are they in any way related to how the ML approach works or what it is based on? This is just for my understanding. Thanks again!
I have VCF files generated by aligning FASTQ files to the pan-sequences FASTA. The resulting BAM files were used in CNVcaller to produce a VCF file. Can I use this VCF file for PanGenie? I read your paper on PanGenie, hence I wondered if I may ask you how to start to achieve this.
Is there a way to generate a joint-genotyped VCF with PanGenie? If I understand the software correctly, only the variants in the pangenome reference are genotyped across all samples, while any non-pangenome variants detected are called within individual samples.
Hi there,
My aim is to get a genome inference for a human sample (HG005) to benchmark my graph in terms of SV calling using PanGenie. I assembled it both with minigraph-CACTUS (which I'm testing right now) and with pggb.
With that said, although the pangenomes are quite small (only five individuals), after correctly producing the .vcf files for both of them, I run into an error when launching from the Snakefile. Below is the output for the command:
File "Snakefile", line 13
rule all:
^
SyntaxError: invalid syntax
Now, the reason I'm doing this is the non-overlapping variants requirement. Since my .vcf files weren't generated using your pipeline, I'm assuming they don't conform to the specific standards you mentioned (particularly the third one); hence why I proceeded this way.
One last thing: I'm not very experienced with Snakefiles, and from looking online I've seen they are launched by invoking python3; is that the correct way? Let me know, thanks in advance!
P.S. Attached is a version of the .yaml edited to conform to the inputs for the script (config.log).
Hello,
When I try to run PanGenie using my VCF file, I get an error message. Here is the error message I receive:
GraphBuilder: skip variant at chr1:1000404 since it is contained in a previous one.
PanGenie:/biosoft/pangenie/src/graph.cpp:80: void Graph::add_variant_cluster(std::vector<std::shared_ptr >*, std::vector<std::vector<std::__cxx11::basic_string > >&, bool): Assertion `defined_alleles.size() == (variant_ids[i].size()+1)' failed.
/var/spool/job55890972/slurm_script: line 30: 11280 Aborted
I ensured the reference genome matches the variants in the VCF file. However, I continue to encounter this error.
Would you be able to assist me in understanding what is causing this problem? Could you please let me know if there are any specific requirements or formats for the VCF file that I have overlooked?
Thank you for your help.
Best,
Hongbo
Hello, authors,
In the demo, 'test-variants.vcf' was a regular variant VCF file. I am confused about which step uses the graph pangenome format shown in Extended Data Fig. 1. I used the 'pangenome-graph-from-assemblies' pipeline to generate a callset.vcf file. What should I do next? Should I use vcfbub to convert it into a bubble file and then perform genotyping?
I would appreciate your help.
Tian Tian
Description:
I'm encountering an issue when running PanGenie:
Files and parameters used:
-a 0
-c 0
-d 0
-e 3000000000
-g 0
-i /BIGDATA2/scau_jyang_1/software/EVG-main/test/genotype/PanGenie/test1/test1.fq
-j 10
-k 31
-o test1
-p 0
-r /BIGDATA2/scau_jyang_1/software/EVG-main/test/convert/convert.test.fa
-s test1
-t 10
-u 0
-v /BIGDATA2/scau_jyang_1/software/EVG-main/test/genotype/PanGenie/test1/test1.vcf
Determine allele sequences ...
Found 1 chromosome(s) from the reference file.
Identified 769 variants in total from VCF-file.
#### Memory usage until now: 0.008824 GB ####
Write path segments to file: test1_path_segments.fasta ...
Found 1 chromosome(s) in the VCF.
#### Memory usage until now: 0.008824 GB ####
Count kmers in reads ...
Histogram peak: 1 (3323)
Computed kmer abundance peak: 1
Count kmers in genome ...
terminate called after throwing an instance of 'jellyfish::large_hash::array_base<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, jellyfish::large_hash::unbounded_array<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, allocators::mmap> >::ErrorAllocation'
what(): Failed to allocate 17333333368 bytes of memory.
I am confused by this issue and would appreciate any guidance on how to resolve it.
Hi,
Thanks for your work on this program. The method seems really elegant and we've had some really exciting results from it so far.
Here is a brief outline of our preliminary analysis:
I have a couple of questions to make sure I'm getting the most out of the program:
I initially had good results with minigraph-cactus for generating a comprehensive VCF file. However, I read on the PanGenie issues page that this is not a recommended program. There was a reference to vcfbub, but it's not clear to me what steps need to be taken to go from the reference + long-read assemblies to an appropriate pangenome graph. I didn't notice any obvious issues with the subsequent short-read genotyping results from minigraph-cactus, but I'm worried that I'm not getting the best results.
I was trying to be quite conservative with the genotyping results: I used the GQ flag as mentioned in the paper, and then also applied a minimum KC of 7 per sample. Is this a reasonable filtering approach?
I noticed that there is an experimental phasing option. We are quite interested in the haplotype structure of our data, and initially we used SHAPEIT (using the HiFi individuals as reference haplotypes) to phase the population data. Are there any more details available about PanGenie phasing? Is it worth applying PanGenie phasing prior to SHAPEIT, perhaps? Is it worth combining PanGenie genotyping with read-based phasing programs like WhatsHap?
Variants in the input VCF that are less than the k-mer size apart are internally merged by PanGenie and treated as a single variant. Alleles of this merged variant are composed of combinations of the original variants' alleles (defined by the paths). The number of unique k-mers per allele is computed only for this merged variant. When writing the output VCF, these merged variants are converted back to their original representation (recreating the original entries). The AK counts for each original allele are then computed by summing the unique k-mers of all combined alleles that include that particular allele. The problem is that the same k-mer can be counted several times when it occurs on several combined alleles, leading to AK values that exceed UK. There is not really any proper way of computing AK counts for the original alleles.
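The double counting can be illustrated with a toy example (hypothetical k-mer sets): summing unique k-mers over every merged allele that contains an original allele counts shared k-mers more than once, so AK can exceed UK.

```python
# merged alleles are combinations of original alleles (here A1 with B1 or B2),
# each with its set of unique kmers
merged_allele_kmers = {
    ("A1", "B1"): {"k1", "k2"},
    ("A1", "B2"): {"k2", "k3"},   # k2 occurs on both merged alleles
}

# UK: unique kmers over the whole merged variant
uk = len(set().union(*merged_allele_kmers.values()))

# AK for original allele A1: sum over every merged allele containing A1,
# so the shared kmer k2 is counted twice
ak_A1 = sum(len(kmers) for alleles, kmers in merged_allele_kmers.items()
            if "A1" in alleles)

print(uk, ak_A1)   # 3 4 -> AK exceeds UK because k2 is counted twice
```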
Hi,
When I tried to reproduce your pangenome graph, I noticed several errors in the vcf-merging/pangenome-graph-from-assemblies Snakemake file, mainly in generating the trio sample list and PED file.
According to the function parse_trios in scripts/mendelian-consistency.py:
def parse_trios(ped_file, samples):
    samples_to_include = set()
    with open(samples, 'r') as listing:
        for line in listing:
            samples_to_include.add(line.strip())
    trios = {}
    for line in open(ped_file, 'r'):
        if line.startswith('#'):
            continue
        fields = line.split()
        # map trio_name -> [child, parent1, parent2, nr consistent, nr_inconsistent, nr_untyped, nr total]
        if any([s not in samples_to_include for s in fields[1:4]]):
            continue
        trios[(fields[1], fields[2], fields[3])] = [fields[1], fields[2], fields[3], 0, 0, 0, 0]
    return trios
the trio sample list (results/trio-samples.txt) should look like:
HG00731
HG00732
HG00733
HG00512
HG00513
HG00514
NA19238
NA19239
NA19240
However, the Snakemake file generated a file like:
Puerto Rican
HG00731
HG00732
HG00733
Chinese
HG00512
HG00513
HG00514
Yoruban
NA19238
NA19239
NA19240
And the PED file (results/trios.ped) should look like:
Puerto_Rican HG00733 HG00731 HG00732
Chinese HG00514 HG00512 HG00513
Yoruban NA19240 NA19238 NA19239
instead of:
Puerto Rican Puerto Rican HG00731 HG00732
Chinese Chinese HG00512 HG00513
Yoruban Yoruban NA19238 NA19239
(Also, trio names should not contain spaces)
To solve the problems above, you may need to modify some lines in the Snakemake file:
#################################################################
# 5) Check mendelian consistency for trios and construct graph
#################################################################
# generate a file specifying the trio relationships
rule generate_ped_file:
    output:
        "{outdir}trios.ped"
    run:
        with open(output[0], "w") as ped_output:
            for trio in config['trios']:
                father = config['trios'][trio][0]
                mother = config['trios'][trio][1]
                # Original:
                # ped_output.write('\t'.join([trio, trio, father, mother]) + '\n')
                # MODIFIED:
                child = config['trios'][trio][2]
                ped_output.write('\t'.join([trio, child, father, mother]) + '\n')

rule generate_samples_file:
    output:
        "{outdir}trio-samples.txt"
    run:
        with open(output[0], "w") as sample_output:
            for trio in config['trios']:
                # Original:
                # sample_output.write(trio + '\n')
                # MODIFIED:
                for sample in config['trios'][trio]:
                    sample_output.write(sample + '\n')
Hi pangenie team,
First of all, thank you very much for developing such a beautiful tool for the community.
I am wondering: instead of whole-genome sequencing data, can PanGenie also work with genotyping-by-sequencing (GBS) data? In plant breeding, people normally do GBS to save money, since there might be hundreds of plants that need to be genotyped.
Looking forward to your answer and thank you very much in advance.
Best wishes,
Yutang
Any chance of pushing the singularity image to a cloud service?
It would be great for those of us who try to integrate this tool into nextflow genotyping pipelines.
Thank you.
As the title says, the usage output of the PanGenie-index command in the newest master branch is:
program: PanGenie - genotyping based on kmer-counting and known haplotype sequences.
author: Jana Ebler
version: v3.0.0
usage:
PanGenie-index [options] -i <reads.fa/fq> -r <reference.fa> -v <variants.vcf> -o <index-prefix>
options:
-e VAL size of hash used by jellyfish (default: 3000000000).
-k VAL kmer size (default: 31).
-o VAL prefix of the output files. NOTE: the given path must not include non-existent folders.
-r VAL reference genome in FASTA format. NOTE: INPUT FASTA FILE MUST NOT BE COMPRESSED.
-t VAL number of threads to use for kmer-counting (default: 1).
-v VAL variants in VCF format. NOTE: INPUT VCF FILE MUST NOT BE COMPRESSED.
Error: option -v is mandatory.
The usage info shows -i <reads.fa/fq>, but it's actually not needed there. It could be a trivial mistake :)
Best, Wei
Hello,
I've encountered the following error while running a program:
The input file used is the output from minigraph-cactus.
When I created the input VCF using only 3 genome sequences, the calling worked well. However, after adding 200 more genomes, I ran into the following error.
Can anyone help me resolve this issue?
Thank you.
$ singularity exec $programs/pangenie/container/pangenie.sif PanGenie -i 10-5.fastq -r $ref -v ./pangenome.vcf -l 31 -j 4 -t 4 -o 10-5 -s 10-5
program: PanGenie - genotyping and phasing based on kmer-counting and known haplotype sequences.
author: Jana Ebler
Files and parameters used:
-a 0
-c 0
-d 0
-e 3000000000
-g 0
-i 10-5.fastq
-j 4
-k 31
-o 10-5
-p 0
-r ./cw_final.fasta
-s 10-5
-t 4
-u 0
-v ./pangenome.vcf
Determine allele sequences ...
Found 11 chromosome(s) from the reference file.
terminate called after throwing an instance of 'std::runtime_error'
what(): VariantReader: number of paths is limited to 254 in current implementation.
[1] 229709 abort (core dumped) singularity exec $programs/pangenie/container/pangenie.sif PanGenie -i -r -
The command line help is accurate (reads.fa/fq), but since uncompressed read files are uncommon, it would be helpful to state that requirement more prominently, e.g.:
-i VAL sequencing reads in FASTA/FASTQ format or Jellyfish database in jf format (required).
NOTE: INPUT FASTA/Q FILE MUST NOT BE COMPRESSED
Hi again,
I'm having serious trouble processing the PGGB .vcf files... At first the tool pointed out that haplotypes were unphased; I fixed this by collapsing the two columns of each individual's haplotypes into a single sample column using the following awk command:
awk -F '\t' '/^#/ {print;next;} {OFS="\t";for(i=j=10; i < NF; i+=2) {$j = $i"|"$(i+1); j++} NF=j-1}1' pangenome_ref_guided_GRCh38.vcf > pangenome_ref_guided_GRCh38-phased.vcf
In fact, compared to minigraph-CACTUS, PGGB outputs a .vcf file where the haplotypes of each individual in the graph are split into two columns, which leads PanGenie to think they are unphased. Anyway, I then proceeded to fix the column headers for the respective samples, but despite running to completion the process still returns an empty .vcf.
I'm not sure what the cause was, and after talking with Glenn and looking online I realized I was missing some pre-processing steps, namely removing complex bubbles and decomposing nested alleles with vcfbub and vcfwave, respectively.
I ran the following command:
vcfbub -l 0 -r 10000 -i pangenome_ref_guided_GRCh38-phased.vcf.gz | vcfwave -L 10000 -t 16 -n | bgzip -c -@ 16 > decomposed_and_cleaned.vcf.gz
Unfortunately, I got the same result... So, going back to one of my hypotheses, this could be linked to having # in the filenames as per the PanSN naming standard used by PGGB.
I tested removing everything but the string chr(number) from the #CHROM column; however, even after this attempt the result is always an empty .vcf file.
I thought it could be helpful to re-run the vcfbub and vcflib commands, but this time I got the following error:
error: more sample names in header than sample fields
samples: CHM13 HG002 HG00514 HG00733 HG03492 NA19240
line: chr1 418412 >11061496>11061499 T C 60.0 . AC=1;AF=1;AN=1;AT=>11061496>11061497>11061499,>11061496>11061498>11061499;NS=1;LV=0 GT .|. .|. .|. .|. .|1
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })', src/main.rs:199:46
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This is strange, as I didn't get that error before; maybe because I previously ran the pre-processing before editing with awk, I don't know for sure...
Please, if you have any clue/idea of what is happening I would really appreciate some help. I tried several approaches but all failed and I wonder whether there is something I'm missing.
With that said, I'm happy to attach a screenshot of the original .vcf file if needed and I'm sorry for the long text.
Thanks in advance!
P.S. I always made sure the contig names in the reference and the #CHROM values in the .vcf were consistent in all trials.
Hi Jana,
Previously I thought PanGenie only worked with SNPs, insertions, and deletions. However, I read about a study with your contribution (https://doi.org/10.1101/2021.12.20.472354, "Haplotype-resolved inversion landscape reveals hotspots of mutational recurrence associated with genomic disorders") which used PanGenie to genotype inversions. May I ask how the inversions should be represented in the input VCF file of PanGenie?
Thank you!
Best regards,
Honghe
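For context, my understanding (to be confirmed by the authors) is that PanGenie needs sequence-resolved alleles, so an inversion would be written with the affected segment as REF and its reverse complement as ALT, rather than as a symbolic <INV> allele. A minimal sketch:

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

ref_allele = "ACCGTT"              # hypothetical inverted segment on the reference
alt_allele = revcomp(ref_allele)   # the inversion allele, spelled out explicitly
print(ref_allele, alt_allele)      # ACCGTT AACGGT
```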
It is not obvious how PanGenie handles the output prefix if its path includes non-existent folders. Judging by the runtime errors, I assume PanGenie does not create them.
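Assuming that is indeed the behavior, a small pre-step can create the missing folders before the run (the prefix value here is hypothetical):

```python
import os
import tempfile

def ensure_prefix_dirs(prefix):
    """Create any missing folders in an output prefix like results/run1/sample."""
    parent = os.path.dirname(prefix)
    if parent:
        os.makedirs(parent, exist_ok=True)

# hypothetical -o value inside a scratch directory
tmp = tempfile.mkdtemp()
ensure_prefix_dirs(os.path.join(tmp, "results", "run1", "sample"))
```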
Dear @eblerjana,
Thank you for this useful tool. I have a question about Pangenie installation.
#here is the pangenie installation method I chose
git clone https://github.com/eblerjana/pangenie.git
cd pangenie
conda env create -f environment.yml
conda activate pangenie
mkdir build; cd build; cmake .. ; make
#(pangenie) cche@login02 21:25:03
~/lixin/software/pangenie/build
$
make
[ 37%] Built target PanGenieLib
[ 40%] Built target PanGenie
[ 42%] Built target PanGenie-index
[100%] Built target tests
#but PanGenie can't run after 'make' process
(pangenie) cche@login02 22:07:42
~/lixin/software/pangenie/build
./build/src/PanGenie
program: PanGenie - genotyping based on kmer-counting and known haplotype sequences.
author: Jana Ebler
version: v3.0.0
usage:
PanGenie [options] -f -i <reads.fa/fq> -o
PanGenie [options] -i <reads.fa/fq> -r <reference.fa> -v <variants.vcf> -o
Update: actually I have no problem, and PanGenie can be run successfully.
Thanks and wish you all the best.
Li Xin
Hello.
I am a beginner in this field and I'm currently trying to learn how to use this tool. In regard to the input files for this tool, I have a very naive question.
I would like to use some larger datasets to run this program. I found some relevant information linked from the paper, such as this: https://zenodo.org/record/5607680. However, the FASTA files in this link seem to be categorized by genotypes. How can I extract information about a single chromosome from these files and then obtain corresponding reference and variant files that match each other?
Best.
Hi, is it possible to add PanGenie to Bioconda instead of having to build the tool manually?
Hello,
I have about 200 samples. There are two fastq.gz files (sample.R1.fq.gz and sample.R2.fq.gz) for each sample. Each file is very large (more than 50 GB compressed). How can I unzip them and combine the paired-end files into one file?
Thank you !
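Assuming the goal is the single uncompressed FASTQ that PanGenie expects, one way is to stream-decompress and concatenate each pair without ever holding a file in memory (filenames below are examples); a sketch:

```python
import gzip
import shutil

def combine_pair(r1_gz, r2_gz, out_fq):
    """Decompress two gzipped FASTQ files and concatenate them, streaming."""
    with open(out_fq, "wb") as dst:
        for path in (r1_gz, r2_gz):
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, dst)   # chunked copy; avoids loading 50 GB

# combine_pair("sample.R1.fq.gz", "sample.R2.fq.gz", "sample.fq")
```

Simple concatenation is fine here because PanGenie only counts k-mers from the reads; read order and pairing are not needed.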
Hi Jana,
I have some questions regarding the path-segments.fasta file; please correct me if I am wrong:
- path-segments.fasta is generated by PanGenie-graph -r <reference.fa> -v <variants.vcf> -o <output-prefix> and is not affected by the input FASTA file.
- it is then used for k-mer counting via jellyfish count --if <path-segments.fasta>
As PanGenie-graph seems to be a single-threaded process and takes me ~1 hour to generate this file (likely due to my HDD speed), I try to avoid regenerating it every time by separately using Jellyfish to count k-mers first. But I found that path-segments.fasta is always generated.
Hence, I am wondering if it would be possible to add an option to reuse path-segments.fasta when genotyping individual samples, or to skip generating this file when the input is a .jf database?
Thanks a lot.
Best,
Han
Hi,
Thank you very much for such a nice tool.
I would like to represent not only snps but also structural variants in the input VCF.
Could you please tell me how to do that?
This is my toy example:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1>
##ALT=<ID=DEL,Description="Deletion">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00732 NA12878
1 15951 . T . PASS . GT 0|0 1|1
End error message:
VariantReader: skip variant at 1:15950 since alleles contain undefined nucleotides:
Best,
Mehmet
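The error suggests PanGenie only accepts alleles spelled out as nucleotide sequences, so a symbolic deletion record would need to be rewritten with the deleted bases made explicit. A pure-Python sketch with a hypothetical helper and a toy reference sequence:

```python
def resolve_deletion(ref_seq, pos, del_len):
    """Return sequence-resolved (REF, ALT) for a deletion of del_len bases
    immediately after 1-based position pos: REF is the anchor base plus the
    deleted bases, ALT is the anchor base alone."""
    ref = ref_seq[pos - 1 : pos + del_len]   # anchor + deleted segment
    alt = ref_seq[pos - 1]                   # anchor base only
    return ref, alt

chrom_seq = "GGTACGTTAG"                      # toy reference sequence
print(resolve_deletion(chrom_seq, 3, 4))      # deletes "ACGT" after the T at pos 3
```

In a real VCF this means replacing the symbolic ALT with the short allele and widening REF accordingly, with the bases pulled from the actual reference FASTA.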
Hi there,
Thanks for this great tool. I noticed that, in both the HPRC reference release paper and the original paper describing this tool, you employ complex pre- and post-PanGenie variant filtering:
"We used vcfbub (version 0.1.0, (Garrison, 2021)) with parameters -l 0 and -r 100000 in order to filter the VCF..."
"We additionally remove all records for which more than 80% of all 88 haplotypes carry a missing allele..."
"we have implemented a decomposition approach which aims at detecting all variant alleles nested inside of multi-allelic top-level records..."
"[We] defined a high quality subset of SV alleles that we can reliably genotype. For this purpose, we applied a machine learning approach similar to what we have presented previously..."
How important are these methods to the quality/reliability of the final genotypes, especially for large SVs (> 50 bp)?
Does this package implement these filters, and if not, is there code that does?
Do you have pre-trained models to use for the machine learning/regression based filter?
Just trying to get high-quality SV genotypes on my diverse short-read dataset, so any help is greatly appreciated.
Thanks again!
-Joe
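Of the quoted filters, the missing-allele threshold is simple enough to sketch in pure Python; the 0.8 cutoff comes from the quoted text, while the parsing helpers and their names are illustrative (not code from the paper pipelines):

```python
# Sketch: drop VCF records where more than 80% of haplotype alleles across
# all samples are missing ('.'), as described in the quoted filtering step.

def missing_fraction(vcf_line):
    """Fraction of haplotype alleles that are '.' across all sample columns."""
    fields = vcf_line.rstrip("\n").split("\t")
    total, missing = 0, 0
    for sample in fields[9:]:              # sample columns start at index 9
        gt = sample.split(":")[0]          # GT is the first FORMAT field
        for allele in gt.replace("|", "/").split("/"):
            total += 1
            if allele == ".":
                missing += 1
    return missing / total if total else 1.0

def keep_record(vcf_line, max_missing=0.8):
    return missing_fraction(vcf_line) <= max_missing
```

Applied line by line to the body of a VCF, this reproduces only the missing-allele step; vcfbub, the decomposition, and the ML-based SV filter are separate tools.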
Hi,
Thanks for developing this great tool. I have one naive question regarding the input fastq files. Is it important to pre-process input fastq files (e.g., trimming by sequencing quality, filtering by length after trimming)? Can we directly use the raw sequencing reads?
Thanks,
Han
It would be useful to be able to work with, e.g., exome data. If variants outside the given region set could be imputed, that would be even better.
Hi
I started to install pangenie using the following code:
git clone https://github.com/eblerjana/pangenie.git
cd pangenie
conda activate pangenie
mkdir build; cd build; cmake .. ; make
Since I use an old Linux HPC with an old glibc, I got this message:
cmake: /lib64/libc.so.6: version `GLIBC_2.17' not found
I installed glibc 2.17 using this command:
conda install -c fridh glibc
Then I reactivated the pangenie environment and ran cmake again in the pangenie directory.
I get:
Segmentation fault
Thank you
Hi,
Thank you for developing PanGenie! Recently, I have been trying to evaluate PanGenie's output VCF with Truvari, but it turns out that entries with multiple sequences (comma-separated) in the ALT field are misclassified as false positives by Truvari. Is there a way or criterion to choose only one sequence from the ALT field, or to find the most likely variant corresponding to such an entry?
For example, given a record like this:
chr1 899960 . TGGGAGGGGTCCGCGCGTCCGCAGTGGGGATGTGCTGCGGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGA TGGGAGGGGTCCACGCGTCCGCAGTGGGGCTGCCGGGAGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGCTGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGCCGGGAGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGCTGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGTGCTGCGGGAAG,CCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGATGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGTGCTGCGGGAAGAGGGGGGCCGGGTCCGCAGTGGGGATGTGCTGCCGGGA,CGGGGA,CGGGGAGGGGGGCGCGGGTCCGTAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAA . PASS AF=0.05,0.05,0.15,0.05,0.1,0.15,0.05;UK=301;AK=52,112,95,44,47,70,54,97;MA=8;ID=chr1-899972-SNV-0-1:chr1-899987-DEL-0-5:chr1-899998-SNV-0-1:chr1-900009-DEL-1-1:chr1-900018-SNV-0-1:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1:chr1-900141-SNV-0-1,chr1-899987-DEL-0-5:chr1-899998-SNV-0-1:chr1-900009-DEL-1-1:chr1-900018-SNV-0-1:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1:chr1-900141-SNV-0-1,chr1-899989-SNV-0-1:chr1-900001-SNV-0-1:chr1-900003-DEL-0-151,chr1-899955-DEL-0-113:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1,chr1-899989-SNV-0-1:chr1-900000-INS-0-2:chr1-900009-DEL-2-115,chr1-899955-DEL-1-189:chr1-900150-SNV-0-1,chr1-899955-DEL-2-37:chr1-900019-SNV-0-1:chr1-900036-SNV-0-1:chr1-900039-SNV-0-1:chr1-900047-SNV-0-1:chr1-900048-INS-0-1:chr1-900056-SNV-0-1:chr1-900150-SNV-0-1:chr1-900153-SNV-0-1 GT:GQ:GL:KC 
3/3:10000:-82.8146,-359.1,-264.5,-291,-281.3,-185.3,-93.41,-258.7,-197.2,0,-138.8,-309.4,-199.5,-35.39,-57.8,-169.7,-335,-273.5,-101.3,-122.4,-72.88,-187.1,-362.2,-315.3,-117.7,-191.2,-194,-120.8,-247.5,-409.4,-342.2,-178.1,-270.1,-210.3,-201.2,-202.4:52
how should I interpret it, and how can I convert it (if possible) into a simple variant record with only one sequence in both the REF and ALT fields?
Thank you very much
Best
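For records like the one above, one pragmatic option is to keep only the allele(s) actually selected by the GT field (e.g. GT 3/3 picks the third comma-separated ALT sequence). A sketch with a helper of my own; it does not recover the nested alleles encoded in the ID annotation, it only extracts the called sequences:

```python
# Sketch: reduce a genotyped multi-allelic VCF record to the allele
# sequences that GT actually calls for one sample. Allele index 0 is REF;
# index N is the N-th comma-separated ALT sequence.

def called_alleles(ref, alt_field, gt):
    """Return the allele sequences called by GT (missing calls are skipped)."""
    alleles = [ref] + alt_field.split(",")
    calls = gt.replace("|", "/").split("/")
    return [alleles[int(c)] for c in calls if c != "."]
```

For a homozygous non-reference call such as 3/3, both returned sequences are identical, so the record can be rewritten as a bi-allelic REF/ALT pair for Truvari.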
PanGenie is a user-friendly and fast genotyper.
I generated a VCF from minigraph-cactus.
When I used PanGenie v2.1.0, it produced results, but too many variants (nearly 36%) were skipped. The command is
~/tool/pangenie-2.1.0/build/src/PanGenie -i $fq -r $ref -v $mcvcf -j $th -s $sample -o $sample -t 20
The log contains messages like
VariantReader: skip variant at Chr01:5201 since it is contained in a previous one.
...
The source VCF file had been converted to the diploid version. Here are the first and second variants.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10
Chr01 5194 >1008>1119 TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTAAACCCTAAACTCTAAACCCTAAACCCTAAACTCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAATCCTAAACCCTAAACCCTAAACCCTAAACCC,TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCTGAACCCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCTAAACCCTAAACCCTAACCCTAAGCCCTAAACCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAACCCTAAACCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TATT 60 . 
AC=1,1,1,1,1;AF=0.166667,0.166667,0.166667,0.166667,0.166667;AN=6;AT=>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1104>1106>1108>1109>1110>1112>1113>1115>1116>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1104>1106>1107>1109>1110>1112>1114>1115>1116>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1063>1065>1066>1068>1069>1071>1072>1074>1075>1077>1078>1080>1081>1083>1084>1086>1087>1089>1090>1092>1093>1095>1097>1099>1100>1102>1103>1104>1106>1107>1109>1111>1112>1113>1115>1117>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1041>1043>1044>1046>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1063>1065>1066>1068>1069>1071>1072>1074>1075>1077>1078>1080>1081>1083>1084>1086>1087>1089>1090>1092>1093>1095>1096>1097>1098>1100>1101>1103>1104>1106>1107>1109>1110>1112>1113>1115>1116>1118>1119,>1008>1009>1012>1013>1015>1016>1018>1019>1021>1022>1024>1025>1027>1028>1030>1031>1033>1034>1036>1037>1039>1040>1042>1043>1045>1046>1047>1049>1050>1052>1053>1055>1056>1058>1059>1061>1062>1064>1065>1067>1068>1070>1071>1073>1074>1076>1077>1079>1080>1082>1083>1085>1086>1088>1089>1091>1092>1094>1095>1097>1099>1100>1101>1103>1104>1105>1106>1107>1109>1110>1112>1114>1115>1116>1118>1119,>1008>1009>1010>1119;NS=6;LV=0 GT 5|5 .|. 3|3 1|1 0|0 2|2 .|. 4|4 .|. .|.
Chr01 5202 >1013>1016 G A 60 . AC=1;AF=0.2;AN=5;AT=>1013>1014>1016,>1013>1015>1016;NS=5;LV=1;PS=>1008>1119 GT .|. .|. 0|0 0|0 0|0 0|0 .|. 1|1 .|. .|.
...
I checked the REF/ALT sequences and found that the second variant is redundant.
REF [Sample5]: the base at position 5202 can be inferred from the first variant: TGAACCCT[g]AA
ALT [Sample8]: the base at position 5202 can likewise be inferred from the first variant: TAAACCCT[a]AA
I know that PanGenie only uses non-overlapping variants from the VCF file, but I wonder how to check whether a VCF file can be used as PanGenie input. Are there suggested steps for pre-processing the VCF file?
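As a pre-check, the containment condition from the log message ("contained in a previous one") can be approximated by comparing the reference intervals spanned by consecutive records; this is my reading of the message, not PanGenie's actual code:

```python
# Sketch: a record is flagged if the reference span of its REF allele lies
# entirely inside the span of the record before it. Positions are 1-based
# VCF POS values; a REF of length L covers [pos, pos + L - 1].

def contained_in_previous(prev_pos, prev_ref, pos, ref):
    prev_end = prev_pos + len(prev_ref) - 1  # last reference base covered
    end = pos + len(ref) - 1
    return prev_pos <= pos and end <= prev_end
```

In the example above, the first record starts at 5194 with a 209 bp REF allele (covering up to 5402), so the SNP at 5202 falls inside it and is skipped. Scanning a sorted VCF with this check would flag such records before genotyping.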
Hello,
Since I have interleaved reads ready to use, I was wondering whether it is OK to use them directly instead of concatenating the reads, or whether that would cause problems down the line?
Thanks!
Hello,
My sequencing files (.fastq) are paired-end, but I didn't find a PanGenie parameter for this kind of file. Could you give me an example of how to use paired-end files?
Thanks!
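As far as I understand, PanGenie works from k-mer counts, so read-pairing information is not used and paired-end files can simply be combined into a single input file beforehand. A small sketch (the helper name is mine):

```python
# Sketch: concatenate paired-end FASTQ files into one reads file, as a
# pre-step before passing a single file to PanGenie's reads input.
import shutil

def concat_fastqs(out_path, *fastq_paths):
    """Append each input FASTQ to out_path, in order."""
    with open(out_path, "wb") as out:
        for p in fastq_paths:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)  # stream copy, low memory use
```

The same effect can be had with a plain `cat r1.fastq r2.fastq > all.fastq` on the command line.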
Hi,
Thanks for developing this great tool. I would like to use the "HPRC-CHM13 (88 haplotypes)" variant reference. Based on the name, I think the reference genome I should use for genotyping is CHM13, but I am not sure which version of CHM13 to use. Could you provide links to the reference genome you used when generating the variant reference vcf?
Thanks,
Han
pangenie]$ cd build/
build]$ cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /xxx/software/miniconda3/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /xxx/software/miniconda3/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Checking for module 'jellyfish-2.0'
-- Found jellyfish-2.0, version 2.2.10
-- /Parastor300s_G30S/zhangjj/software/miniconda3/envs/pangenie/include/jellyfish-2.2.10
-- /parastor300/work01/zhangjj/software/PanGenie/pangenie/build/src
-- Configuring done
-- Generating done
-- Build files have been written to: /parastor300/work01/zhangjj/software/PanGenie/pangenie/build
build]$ make
Scanning dependencies of target PanGenieLib
[ 1%] Building CXX object src/CMakeFiles/PanGenieLib.dir/emissionprobabilitycomputer.cpp.o
[ 2%] Building CXX object src/CMakeFiles/PanGenieLib.dir/copynumber.cpp.o
[ 4%] Building CXX object src/CMakeFiles/PanGenieLib.dir/commandlineparser.cpp.o
[ 5%] Building CXX object src/CMakeFiles/PanGenieLib.dir/columnindexer.cpp.o
[ 7%] Building CXX object src/CMakeFiles/PanGenieLib.dir/dnasequence.cpp.o
[ 8%] Building CXX object src/CMakeFiles/PanGenieLib.dir/fastareader.cpp.o
[ 10%] Building CXX object src/CMakeFiles/PanGenieLib.dir/genotypingresult.cpp.o
[ 11%] Building CXX object src/CMakeFiles/PanGenieLib.dir/histogram.cpp.o
[ 13%] Building CXX object src/CMakeFiles/PanGenieLib.dir/hmm.cpp.o
...
[ 38%] Linking CXX executable PanGenie-graph
/xxx/software/miniconda3/bin/../lib/gcc/x86_64-conda-linux-gnu/9.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -ljellyfish-2.0
collect2: error: ld returned 1 exit status
make[2]: *** [src/CMakeFiles/PanGenie-graph.dir/build.make:104: src/PanGenie-graph] Error 1
make[1]: *** [CMakeFiles/Makefile2:166: src/CMakeFiles/PanGenie-graph.dir/all] Error 2
make: *** [Makefile:114: all] Error 2
build]$ jellyfish --version
jellyfish 2.2.10
Does the version of Jellyfish have to be 2.0?