longshot's People

Contributors

coppini, ftostevin-ont, jdmccauley, jts, kpalin, peterk87, pjedge, tprodanov, vibansal

longshot's Issues

VCF header is incorrect for genotype probabilities

The artic pipeline uses longshot after medaka to produce some ancillary statistics of SARS-CoV-2 variants. The header associated with some probabilities incorrectly specifies some numbers as Integer when in fact they are written as floats in the records of the file, see nanoporetech/medaka#213

From src/print_output.rs:

let headerstr3 = format!("##INFO=<ID=PH,Number=G,Type=Integer,Description=\"PHRED-scaled Probabilities of Phased Genotypes\">
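Until the header string in src/print_output.rs is corrected, a minimal downstream workaround (a hypothetical helper, not part of longshot) is to rewrite the Type of the PH header line before handing the VCF to strict parsers:

```python
def fix_ph_header(line: str) -> str:
    """Rewrite the PH INFO header to declare Float, matching the values
    actually written in the records. Leaves all other lines untouched."""
    if line.startswith("##INFO=<ID=PH,"):
        return line.replace("Type=Integer", "Type=Float")
    return line

# Example: apply to every header line of a VCF stream.
header = '##INFO=<ID=PH,Number=G,Type=Integer,Description="PHRED-scaled Probabilities of Phased Genotypes">'
print(fix_ph_header(header))
```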

thread 'main' panicked at 'assertion failed: p <= 0.0'

I am currently using longshot for variant calling in a reference-based bacterial genome assembly pipeline. For most genomes longshot runs without issues. However, for some genomes I encounter the following error:

thread 'main' panicked at 'assertion failed: p <= 0.0'

I am using ONT reads and have set -P 0.01, which does not resolve the issue in this case.

Thank you in advance.
Best regards,
Nick Vereecke

perform less pair-HMM realignment using dynamic programming

The pair-HMM realignment code currently iterates over 2^n short haplotypes for a variant cluster of size n and realigns to each one. This costs roughly 2^n * m * b, where m is the total sequence length and b is the band length. This makes the realignment step slow and hurts accuracy by requiring long potential variant clusters to be truncated into short ones (currently n is bounded to 3 by default). This matters more as the number of potential variants grows (e.g. ONT data). However, for every haplotype sharing a bit-string prefix of length k, the alignment probability of that prefix is recomputed unnecessarily across 2^(n-k) different pair-HMM realignments. Another level of dynamic programming could store and look up probabilities for shared portions of alignments and improve complexity somewhat. I don't have a complete solution at this time, but it may be worth considering in the future.
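The prefix-sharing idea can be illustrated with a toy sketch, where a placeholder scoring function stands in for the banded pair-HMM state: a DFS over the haplotype tree extends each shared prefix once, instead of recomputing it for all haplotypes that share it.

```python
from itertools import product

def step_cost(state, bit):
    """Placeholder for extending an alignment by one variant; in a real
    pair-HMM the state would be a band of probabilities, not a number."""
    return state + (2 if bit else 1)

def naive(n):
    """Recompute every haplotype from scratch: n * 2^n extension steps."""
    costs, steps = {}, 0
    for hap in product((0, 1), repeat=n):
        state = 0
        for bit in hap:
            state = step_cost(state, bit)
            steps += 1
        costs[hap] = state
    return costs, steps

def shared_prefix(n):
    """DFS that carries the prefix state down the haplotype tree, so each
    prefix is extended exactly once: 2^(n+1) - 2 steps instead of n * 2^n."""
    costs, steps = {}, 0
    def dfs(prefix, state):
        nonlocal steps
        if len(prefix) == n:
            costs[prefix] = state
            return
        for bit in (0, 1):
            steps += 1
            dfs(prefix + (bit,), step_cost(state, bit))
    dfs((), 0)
    return costs, steps
```

For n = 6 the DFS performs 2^7 - 2 = 126 extensions versus 6 * 2^6 = 384 for the naive loop, while producing identical scores.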

Feature request - multisample SNV calling

Hi, impressive and sorely needed tool - thanks.

I wonder if you are planning to implement multisample SNV calling in the future as this is very useful for comparison of multiple samples?

Thanks,
Colin

vcf format interpretations

May I ask about the annotations in the VCF files? Thanks.
What is "AM", and what is its relationship to AC's two numbers?
Do you have more detailed documentation of the annotations in the VCF file from longshot? Thanks.

Regards,
Shaokang

different number of reads reported and in bam file

Hello,

I recently aligned reads from a heterozygous individual to the species' reference genome. After identification of alignment breakpoints, I ran longshot on the conserved regions. Out of 1900 targets, 1780 were successfully split into haplotypes. However, I have noticed a few things:

  1. the number of reads reported in the log differs from the number of reads in the bam produced:
$ tail longshot.log
...
Separate fragments
2020-09-07 01:41:45     11668 reads (23.47%) assigned to haplotype 1
2020-09-07 01:41:45     11576 reads (23.28%) assigned to haplotype 2
2020-09-07 01:41:45     26476 reads (53.25%) unassigned.
2020-09-07 01:41:45 Writing haplotype-assigned reads to bam files...
2020-09-07 01:43:11 Printing VCF file...

that is 49720 reads phased and unphased, but

$ samtools view 10_1-18299080.bam | cut -f 1 | sort | uniq | wc -l
109628

the bam produced by longshot contains 109628 unique reads. That does not add up. Do you know what is going on here? Or am I reading it wrong?

BTW, the number of phased reads (HP:i:1 and HP:i:2) is correct, so the problem is from the unphased reads.
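To cross-check the log against the BAM, one can count unique read names per HP tag directly from `samtools view` text output; a sketch (hypothetical helper, assuming longshot's HP:i tag and treating reads without it as unassigned):

```python
def haplotype_counts(sam_lines):
    """Count unique read names per haplotype from SAM-format text lines.
    Reads lacking an HP:i tag are counted as 'unassigned'."""
    seen = {}
    for line in sam_lines:
        if line.startswith("@"):       # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, tags = fields[0], fields[11:]
        hap = next((t.split(":")[2] for t in tags if t.startswith("HP:i:")),
                   "unassigned")
        seen.setdefault(hap, set()).add(name)
    return {h: len(s) for h, s in seen.items()}
```

For example, it could be fed with the output of `samtools view 10_1-18299080.bam` line by line.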

Cheers,

Juan D. Montenegro

make it so that --region flag is not required

Currently (temporarily) the --region flag is required... this requirement needs to be lifted. This will go hand-in-hand with cleaning up some duplicate code -- BAM parsing was handled separately for regions vs. no region.

question about haplotype bam files

Hi, I am currently trying to phase a diploid genome assembled with nanopore data,
and I've been playing around with longshot. Everything looks good so far; I was just wondering what the actual output of the phased bam files is. Do they contain only the phased block, or the actual reads associated with the complete haplotype, including regions common to both haplotypes? I ask because I intend to extract those reads and reassemble every haplotype independently.

Thanks for your time.

Heterozygous non-ref variants

Hi everyone,

I was testing the tool with a real-life dataset of CLR PacBio reads mapped with pbmm2. The results puzzle me a little as:

  • I have no indels of sort in the calls (extremely unlikely to be true)
  • Variants are never heterozygous alternative (1/2), just 0/1 or 1/1, eventually phased (quite unlikely to be true, as I'm comparing related species)

Is this expected to happen? Why? Is there any solution?

Another thing I noticed is that the depth of coverage of the alleles is reported in the INFO field but not in the sample call. This makes the information get lost completely when merging the VCFs of different samples. Is there any way your tool can report this information along with the genotype call?

Andrea

structural variants

Dear Peter,

Thank you for a very useful tool. I was wondering if you will add support for structural variants. Or if the current version can handle structural variants identified by other tools as a user-supplied VCF file.

I am having a bit of a problem trying to resolve heterozygous regions with lots of repeats forming complex structural variants like tandem duplications, inversions, and large indels. I am using around 34X coverage (17X per chromatid) and have already identified quite a few of the heterozygous structural variants, and would like to phase them using longshot.

Any suggestions on how to use Longshot for this, would be more than welcome.

Juan D. Montenegro

improve pairHMM realignment by anchoring band to sparse alignment

Currently the alignment band for pairHMM realignment slides at a constant rate down the matrix. Sequences with uneven sizes are dealt with by increasing the band size proportionally. A better approach would be to perform a fast sparse alignment (available from rust-bio) and then move the pairHMM band along this sparse alignment. This is the approach used for rust-bio's implementation of banded alignment. This will improve efficiency (for alignments where sequence sizes differ greatly) and also improve accuracy (the banded pairHMM will stay very close to the best alignment at all times, which is currently not at all guaranteed).
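The idea can be sketched as interpolating the band center between sparse anchor points rather than sliding it at a constant rate (toy illustration only, not the rust-bio implementation):

```python
import bisect

def band_centers(anchors, qlen):
    """Interpolate a reference-coordinate band center for each query
    position along sparse alignment anchors (query_pos, ref_pos) pairs,
    assumed sorted by query position with at least two anchors."""
    qs = [a[0] for a in anchors]
    centers = []
    for q in range(qlen):
        i = bisect.bisect_right(qs, q)
        if i == 0:                      # before the first anchor: extrapolate
            (q0, r0), (q1, r1) = anchors[0], anchors[1]
        elif i == len(anchors):         # after the last anchor: extrapolate
            (q0, r0), (q1, r1) = anchors[-2], anchors[-1]
        else:                           # between two anchors: interpolate
            (q0, r0), (q1, r1) = anchors[i - 1], anchors[i]
        centers.append(r0 + (q - q0) * (r1 - r0) / (q1 - q0))
    return centers
```

The banded pair-HMM would then be evaluated only within a fixed half-width around each center, keeping the band close to the sparse chain even when sequence lengths differ greatly.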

FILTER 'dn' is not defined in the header

Hi,

Thanks for this tool - looks very promising. bcftools spits out an error though:

[W::vcf_parse] FILTER 'dn' is not defined in the header
Error: The filter is not defined in the header: dn
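Until the header is fixed upstream, a workaround is to inject the missing FILTER definition before the #CHROM line (the wording of the 'dn' description below is a guess; `bcftools annotate -h` with a header file achieves the same thing):

```python
def add_filter_header(vcf_lines, filter_id, description):
    """Insert a ##FILTER definition before the #CHROM line so downstream
    tools like bcftools accept records that use it. Does nothing if the
    filter is already defined."""
    defined = any(l.startswith(f"##FILTER=<ID={filter_id},") for l in vcf_lines)
    out = []
    for line in vcf_lines:
        if not defined and line.startswith("#CHROM"):
            out.append(f'##FILTER=<ID={filter_id},Description="{description}">')
        out.append(line)
    return out
```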

Regards,
Wouter

longshot and HiFi reads

To whom it may concern,

I am writing to inquire whether HiFi reads have been tested with longshot. I am testing the ability of longshot to call germline mutations with HiFi reads and wanted to get a second opinion on this. In addition, I was wondering if a --thread option could be added to longshot.

Best regards,
Sangjin Lee

test short indels

add a feature allowing an input VCF of candidate variants.
test every class of short indel separately -- 1 bp insertions, 2 bp, etc. Figure out the exact genotyping accuracy with reaper when the variant is known, so we know whether indels are even possible.

Feature request: Write empty VCF file if no variants are found

Hi,

I'm using longshot in a snakemake pipeline that parallelizes variant calling by invoking longshot on subregions of the genome and merging all the resulting VCF files in the end.

I found that longshot doesn't output a VCF file if you invoke it with a region that doesn't have a variant. Would it be possible to make it output an empty VCF file with a valid header instead?

That would make merging and running it as a (e.g. snakemake) pipeline much easier.
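As a stopgap inside such a pipeline, one can write a header-only VCF whenever longshot produced nothing for a region. A sketch (the minimal header below is a placeholder; for real merging, copy the full header from a non-empty region's output instead):

```python
import os

MINIMAL_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
)

def ensure_vcf(path, header=MINIMAL_HEADER):
    """If a region produced no VCF (missing or empty file), write a
    header-only VCF so downstream merging doesn't break."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        with open(path, "w") as f:
            f.write(header)
```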

Thanks,
Philipp

multithreaded operation

multithreading is really easy in rust because of its strict memory safety. It would be nice to parallelize certain steps, especially fragment extraction / realignment.

How to know why a true SNP is not called by longshot

Hi, I have a sample (nanopore data) where a variant known to be a true SNP is not called by longshot. This variant has enough reads (176) and a correct VAF (0.42) when viewed in IGV.

This is my command:

longshot -r 16:3292028-3306627 --bam $bam --ref $refDir/HG19/hg19.fasta \
    --out $workDir/$bam_name/$bam_name-longshot.vcf \
    --min_cov 10 \
    --min_alt_frac 0.3 \
    &> $workDir/$bam_name/$bam_name-longshot.log

Do you have a feeling for why the variant isn't called? Or do you have a parameter for listing the "failed" variants and which parameter triggered the failure?

Thanks

Here is the screenshot of IGV:
igv_snapshot

Reference genotype output

Hi,

Thank you for this tool! I would like to use longshot on trio data and am considering the following pipeline:

  • Running longshot on each individual separately.
  • bcftools merge on the calls from all individuals.
  • Running longshot on each individual again with the option --potential_variants using the merged calls as input.
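The three stages above might be wired up like this (illustrative paths and sample names only; the bgzip/tabix indexing that `bcftools merge` requires between stages is elided):

```python
def trio_commands(samples, ref, merged="merged.vcf.gz"):
    """Build the command list for the three-stage trio pipeline:
    per-sample calling, bcftools merge, then re-genotyping each sample
    against the merged candidate set with --potential_variants."""
    cmds = [["longshot", "--bam", f"{s}.bam", "--ref", ref,
             "--out", f"{s}.vcf"] for s in samples]
    cmds.append(["bcftools", "merge", "-o", merged]
                + [f"{s}.vcf.gz" for s in samples])
    cmds += [["longshot", "--bam", f"{s}.bam", "--ref", ref,
              "--potential_variants", merged,
              "--out", f"{s}.round2.vcf"] for s in samples]
    return cmds
```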

It would be great if I could get some output (e.g. genotype likelihoods) for variants genotyped as reference in the last step, too. This would make it easier to analyze the variants and genotypes in all three individuals of the trio together.

I spotted a parameter print_reference_genotype in your code in the function print_vcf but it seems to be always set to false. Would it be possible to make this parameter a program option?

Thanks,
Birte

VCF output missing header ##contig lines

Hello,
I suppose this is more of a feature request than bug.
It looks like the output VCF doesn't contain contig fields. Most command-line VCF parsers (e.g. GATK, bcftools) require contig fields even though they are optional under VCFv4.2. Judging by the code, this isn't a bug, but just in case, here are the parameters I used:

longshot -n -c 2 --ts_tv_ratio 2 -D 10:800:50 -E 0.05 -e 3 -r chr{} --bam minimap_bams/ADA_1_nanopore_passed.bam --ref genome.fa --out {}_ada1_longshot_out.vcf -d longshot/ada1/debug_{}
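As a workaround until longshot emits them itself, ##contig lines can be generated from the reference's .fai index and spliced into the header (hypothetical helper):

```python
def contig_header_lines(fai_text):
    """Build ##contig header lines from .fai index text; the first two
    tab-separated columns of each row are the sequence name and length."""
    lines = []
    for row in fai_text.splitlines():
        name, length = row.split("\t")[:2]
        lines.append(f"##contig=<ID={name},length={length}>")
    return lines
```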

Weird characters in error output

Hi,
I just tested v0.3.4 of longshot and had a typo in the command line; big +1 for a tool properly catching this and even suggesting the correct alternative! However, when I opened the error log in vim, it contained some weird characters - maybe you want to look into this at some point...

longhsot

+Peter

15x coverage and 3-6% error rate?

Hi there,

Do you think Longshot would work on a 15x coverage dataset of long reads (human) with an error rate somewhere in between 3 and 6%?

Thank you for the help.

Guillaume

multiple regions

Dear developers,
Thank you for this tool, it is very useful indeed. I wondered if there is a way to run multiple regions in a single command. I have a draft genome with multiple scaffolds (~1000) and I do not have the tools or budget to build pseudomolecules or chromosomes with those scaffolds. If I could run perhaps 200 jobs with 5 scaffolds each, that would really help. Is there any way to do it in the current release? Or is it planned for future releases?
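Until multi-region support lands, batching scaffolds across jobs is simple to script; each batch would then loop over its scaffolds, invoking longshot once per -r region (sketch):

```python
def batch_regions(scaffolds, per_job=5):
    """Group scaffold names into fixed-size batches, one batch per job."""
    return [scaffolds[i:i + per_job] for i in range(0, len(scaffolds), per_job)]
```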

Thank you,

Juan Montenegro

VCF specifications: GQ of type Integer

Hi,

The VCF specification says that FORMAT/GQ should be an Integer, and not a Float as output by longshot. While bcftools doesn't complain about it, pysam does throw an error when parsing the file:

[E::vcf_parse_format] Invalid character '.' in 'GQ' FORMAT field at chr10:38970
Traceback (most recent call last):
(...)
  File "pysam/libcbcf.pyx", line 4108, in pysam.libcbcf.VariantFile.__next__
OSError: unable to parse next record

I can round these now to work around this issue, so no real trouble. I just wanted to inform you.
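The rounding workaround can be done with a few lines of stream processing on the record lines (sketch, assuming a single-sample VCF):

```python
def round_gq(record_line):
    """Round the FORMAT/GQ value to an integer so strict parsers
    (e.g. pysam) accept the file."""
    fields = record_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    sample = fields[9].split(":")
    if "GQ" in fmt:
        i = fmt.index("GQ")
        sample[i] = str(round(float(sample[i])))
    fields[9] = ":".join(sample)
    return "\t".join(fields)
```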

Cheers,
Wouter

Calling potential SNVs using pileup... Error.

Hello longshot,

I am attempting to use longshot on a 14 GB genome and have recently run into the following error:

longshot: sam.c:2316: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.
/var/spool/slurmd/job133909/slurm_script: line 17: 226239 Aborted   

I was wondering if I could get some advice on how to approach this problem?

Kind Regards,

Alex

Unable to find libclang

Hi Peter,

I've been trying to compile this using cargo, but I am having a problem where it complains about not being able to find libclangAST.so.7 even though it exists.

AR_x86_64-unknown-linux-gnu = None
AR_x86_64_unknown_linux_gnu = None
HOST_AR = None
AR = None
running: "ar" "crs" "/home/username/tools/longshot/target/release/build/rust-htslib-007e912aeb150570/out/libwrapper.a" "/home/username/tools/longshot/target/release/build/rust-htslib-007e912aeb150570/out/wrapper.o"
exit code: 0
cargo:rustc-link-lib=static=wrapper
cargo:rustc-link-search=native=/home/username/tools/longshot/target/release/build/rust-htslib-007e912aeb150570/out

--- stderr
thread 'main' panicked at 'Unable to find libclang: "the libclang shared library at /usr/lib64/clang-private/libclang.so.7 could not be opened: libclangAST.so.7: cannot open shared object file: No such file or directory"', src/libcore/result.rs:999:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.

no output if 0 SNVs found

Hi,
I'm using snakemake with 10Mbp regions on human datasets, but there are a few regions for some datasets that have 0 SNVs. Longshot exits without creating an output file, which causes problems for snakemake. I would rather have a file printed with just the header so the snakemake workflow can still continue.

If I increase the size of my region to 30Mbp then there are no regions with 0 SNVs, but I would like to keep the regions small so everything can be run quickly (other programs use the same regions too). Is it possible to include this enhancement, and include it in the conda repo (I'm trying to have conda handle the dependencies)?

Thanks

Feature request: add tags to bam rather than writing 3 separate files

Hi,

Thanks for longshot, it's a great tool. I like the option to write out 3 bam files for the haplotype-separated reads, but have a related feature request.
Do you think it would be possible to write out only a single bam file, but add tags to each read specifying both the haplotype and the phase block identifier?

Thanks,
Wouter

change how SNV clustering works

a hard distance limit and hard short-haplotype limit aren't very elegant.
Ideally, each variant would be anchored independently and joined into disjoint sets if the SNVs fall into each other's anchorings. Then they'd maybe be recursively split based on descending distance, if need be, or the space of possible short haplotypes would be limited somehow (max number of SNVs??).

triallelic variants

currently not supported.
This is especially important if we support candidate variant input -- there are many tri-allelic variants in the Thousand Genomes set to consider, for instance.

VCF output updates

  • output descriptions of the different tags (PS, RA, ...) in the VCF file

  • per the VCF specification, the PS tag should be in the genotype column and correspond to the position of the first variant in the haplotype block

  • output the list of filtered variants in a separate VCF file
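The PS convention can be sketched as mapping each phase block to the position of its first variant (toy example; assumes variants are sorted by position):

```python
def assign_ps(variants):
    """Given (pos, block_id) pairs sorted by position, return (pos, PS)
    pairs where PS is the position of the block's first variant, as the
    VCF specification describes for the PS FORMAT tag."""
    first_pos = {}
    for pos, block in variants:
        first_pos.setdefault(block, pos)
    return [(pos, first_pos[block]) for pos, block in variants]
```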

read names

Hi,
thank you for this tool. It is working very nicely. One question though: what would happen if it is used on a polyploid species where you can get more than 2 haplotypes? Is it able to estimate likelihoods for more than two haplotypes? If not, would it be possible to write the names of the reads that support each SNV?

Cheers,

Juan D. Montenegro

add haplotype tag to reads

tag each read in the BAM file with a tag that captures the 'haplotype block' and 'allele' for each read

a compact output would be a list of read IDs and the haplotype information for each (similar to extracthairs output)

Cannot read FASTA

Hi,

When I try to run Longshot, it panics like this:

[erdi@login03 longshot]$ ./target/release/longshot --bam /ifs/home/erdi/pacbiocov/PacBio_deNovo/data/hg38.PBRT01-f__DNA17-07725_sorted1.bam --ref /ifs/home/erdi/pacbiocov/PacBio_deNovo/data/GRCh38_no_alt_plus_hs38d1_concat_nonprimary.fa --out /ifs/home/erdi/pacbiocov/PacBio_deNovo/results/SNVcalling/try1.vcf

2018-11-13 14:02:02 Estimating alignment parameters...
thread 'main' panicked at 'Failed to read fasta sequence record.: Custom { kind: Other, error: StringError("Unknown sequence name.") }', libcore/result.rs:1009:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

I haven't changed the sequence names in the ref file from the original file, so I couldn't figure out what is causing the issue.
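A common cause of this error is a naming mismatch between the BAM header contigs and the reference FASTA (e.g. a 'chr' prefix difference). A quick check (hypothetical helper, fed e.g. from the SN fields of `samtools view -H` and the first column of the reference's .fai):

```python
def name_mismatches(bam_contigs, fasta_names):
    """Return contigs present in the BAM header but missing from the
    reference FASTA, which would trigger 'Unknown sequence name.'"""
    fasta = set(fasta_names)
    return [c for c in bam_contigs if c not in fasta]
```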

Thanks a lot in advance!

Best,
Erdi
