Paragraph: a suite of graph-based genotyping tools

Introduction

Accurate genotyping of known variants is critical for the analysis of whole-genome sequencing data. Paragraph aims to facilitate this by providing an accurate genotyper for structural variations (SVs) with short-read data.

Please reference Paragraph using:

Genotyping data in this paper can be found at paper-data/download-instructions.txt

For details of population genotyping, please also refer to:

Installation

Please check doc/Installation.md for system requirements and installation instructions.

Run Paragraph from VCF

Test example

After installation, run the multigrmpy.py script from the build/bin directory on an example dataset as follows:

python3 bin/multigrmpy.py -i share/test-data/round-trip-genotyping/candidates.vcf \
                          -m share/test-data/round-trip-genotyping/samples.txt \
                          -r share/test-data/round-trip-genotyping/dummy.fa \
                          -o test

This runs a simple genotyping example for two test samples.

  • candidates.vcf: specifies candidate SV events in VCF format.
  • samples.txt: manifest that specifies some test BAM files. Tab or comma delimited.
  • dummy.fa: a short dummy reference which only contains chr1.

The output folder test then contains gzipped json for final genotypes:

$ tree test
test
├── grmpy.log            #  main workflow log file
├── genotypes.vcf.gz     #  Output VCF with individual genotypes
├── genotypes.json.gz    #  More detailed output than genotypes.vcf.gz
├── variants.vcf.gz      #  The input VCF with unique ID from Paragraph
└── variants.json.gz     #  The converted graphs from input VCF (no genotypes)

If successful, the last 3 lines of genotypes.vcf.gz will be the same as in the expected file.
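
The comparison described above can be scripted. The following is a minimal sketch, assuming the expected output is available as a plain-text or gzipped file; the helper names `tail_lines` and `same_tail` are hypothetical and not part of Paragraph.

```python
import gzip


def tail_lines(path, n=3):
    """Return the last n non-empty lines of a plain-text or gzipped file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle if line.strip()]
    return lines[-n:]


def same_tail(observed_path, expected_path, n=3):
    """True if the last n lines of both files are identical."""
    return tail_lines(observed_path, n) == tail_lines(expected_path, n)
```

A call such as `same_tail("test/genotypes.vcf.gz", expected_path)` would then report whether the test run matches, where `expected_path` is wherever your installation keeps the expected file.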

Input requirements

VCF format

paraGRAPH will independently genotype each entry of the input VCF. You can use either indel-style representation (full REF and ALT allele sequence in 4th and 5th columns) or symbolic alleles, as long as they meet the format requirement of VCF 4.0+.

Currently we support 4 symbolic alleles:

  • <DEL> for deletion
    • Must have END key in INFO field.
  • <INS> for insertion
    • Must have a key in INFO field for insertion sequence (without padding base). The default key is SEQ.
    • For blockwise swap, we strongly recommend using indel-style representation rather than symbolic alleles.
  • <DUP> for duplication
    • Must have END key in INFO field. paraGRAPH assumes the sequence between POS and END is duplicated one more time in the alternative allele.
  • <INV> for inversion
    • Must have END key in INFO field. paraGRAPH assumes the sequence between POS and END is reverse-complemented in the alternative allele.
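
The INFO requirements listed above can be checked before running the genotyper. This is a rough sketch, not Paragraph's own validation: the function names are hypothetical, it only looks at single-ALT records, and it assumes the default insertion key SEQ unless another key is passed in.

```python
# INFO key each supported symbolic allele needs, as listed above.
REQUIRED_INFO = {"<DEL>": "END", "<INS>": "SEQ", "<DUP>": "END", "<INV>": "END"}


def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flags map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def missing_info_key(vcf_line, ins_key="SEQ"):
    """Return the missing INFO key for a symbolic-allele record, or None."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt, info = fields[4], parse_info(fields[7])
    if alt not in REQUIRED_INFO:
        return None  # indel-style, multi-ALT or other alleles are not checked here
    required = ins_key if alt == "<INS>" else REQUIRED_INFO[alt]
    return None if required in info else required
```

Running `missing_info_key` over each data line of the input VCF gives a quick list of records that lack END or the insertion sequence key.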

Sample Manifest

Must be tab-delimited.

Required columns:

  • id: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest
  • path: Path to the BAM/CRAM file.
  • depth: Average depth across the genome. Can be calculated with bin/idxdepth (faster than samtools).
  • read length: Average read length (bp) across the genome.

Optional columns:

  • depth sd: Specify standard deviation for genome depth. Used for the normal test of breakpoint read depth. Default is sqrt(5*depth).
  • depth variance: Square of depth sd.
  • sex: Affects chrX and chrY genotyping. Allow "male" or "M", "female" or "F", and "unknown" (quotes shouldn't be included in the manifest). If not specified, the sample will be treated as unknown.
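
As an illustration of the manifest layout, the sketch below writes a tab-delimited file with the four required columns and computes the default depth standard deviation quoted above. The function names and the sample values are made up for the example.

```python
import csv
import math


def default_depth_sd(depth):
    """Default depth standard deviation stated above: sqrt(5 * depth)."""
    return math.sqrt(5 * depth)


def write_manifest(path, samples):
    """Write a tab-delimited manifest with the four required columns.

    `samples` is a list of dicts with keys: id, path, depth, read length.
    """
    columns = ["id", "path", "depth", "read length"]
    ids = [sample["id"] for sample in samples]
    if len(ids) != len(set(ids)):
        raise ValueError("sample IDs must be unique")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(samples)
```

For example, `write_manifest("samples.txt", [{"id": "sample1", "path": "/data/sample1.bam", "depth": 30, "read length": 150}])` produces a header line followed by one sample row.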

Run time

  • On a 30x HiSeqX sample, Paragraph typically takes 1-2 seconds to genotype a simple SV in confident regions.

  • If the SV is in a low-complexity region with abnormal read pileups, the running time could vary.

  • For efficiency, it is recommended to manually set the "-M" option (maximum allowed read count for a variant) to skip these high-depth regions. We recommend setting "-M" to 20 times your mean sample depth.

Population-scale genotyping

To efficiently genotype SVs across a population, we recommend running in single-sample mode as follows:

  • Create a manifest for each single sample
  • Run multigrmpy.py for each manifest. Be sure to set "-M" option for each sample according to its depth.
  • Multithreading (option "-t") is highly recommended for population-scale genotyping
  • Merge all genotypes.vcf.gz to create a big VCF of all samples. You can use either bcftools merge or your custom script.
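
The per-sample steps above can be sketched as a small command builder. It only uses options that appear in this README (-i, -m, -r, -o, -M, -t); the function names, file names, and the thread count are placeholders, and the 20x factor is the recommendation from the Run time section.

```python
import shlex


def max_reads(depth, factor=20):
    """-M value recommended above: 20 times the mean sample depth."""
    return int(round(factor * depth))


def multigrmpy_command(vcf, manifest, reference, out_dir, depth, threads=4):
    """Build one single-sample multigrmpy.py command as an argument list."""
    return [
        "python3", "bin/multigrmpy.py",
        "-i", vcf,
        "-m", manifest,
        "-r", reference,
        "-o", out_dir,
        "-M", str(max_reads(depth)),
        "-t", str(threads),
    ]


cmd = multigrmpy_command(
    "candidates.vcf", "sample1.manifest.txt", "ref.fa", "out/sample1", depth=30
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning an argument list keeps the command usable with `subprocess.run` as well as for printing into a job script, one command per sample manifest.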

Run Paragraph on complex variants

For more complicated events (e.g. genotyping a deletion together with its nearby SNP), you can provide a customized JSON to paraGRAPH:

Please follow the pattern in example JSON and make sure all required keys are provided. Here is a visualization of this sample graph.

To obtain graph alignments for this graph (including all reads), run:

bin/paragraph -b <input BAM> \
              -r <reference fasta> \
              -g <input graph JSON> \
              -o <output JSON path> \
              -E 1

To obtain the alignment summary, genotypes of each breakpoint, and the whole graph, run:

bin/grmpy -m <input manifest> \
          -r <reference fasta> \
          -i <input graph JSON> \
          -o <output JSON path> \
          -E 1

If you have multiple events listed in the input JSON, multigrmpy.py can help you to run multiple grmpy jobs together.

Further Information

Please check github wiki for common usage questions and errors.

Documentation

External links

  • The Illumina/Polaris repository gives the short-read sequencing data we used to test our method in population.

License

The LICENSE file contains information about libraries and other tools we use, and license information for these.

paragraph's People

Contributors

egor-dolzhenko, pkrusche, rizkg, traxexx

paragraph's Issues

idxdepth underestimating depth of coverage, possibly?

I realigned the 300x Genome in a Bottle AJ Trio samples with bwa since the Illumina paired-end alignments they host were done with novoalign (booo).

When running idxdepth for the samples I get

  • 197.58
  • 181.58
  • 192.29

in the "depth" entries

However using bedtools genome coverage the values are

  • 297.65
  • 323.51
  • 292.37

which is closer to the "advertised" coverage from GiaB.

Which aligner?

Hi,
I am just starting to investigate using Paragraph to genotype some SVs within a number of populations. I shall be doing a lot of alignments of short reads to my references. Is there a preferred aligner to use to generate the BAM file for input into Paragraph?

thanks.

build commands are out-dated

for example,

cmake ../paragraph-tools

as in the README has no target for paragraph-tools. Should this be: cmake ../ ?

when I fix that, I see:

/home/brentp/src/paragraph/build-paragraph/external/graphtools-src/src/graphIO/../../external/include/nlohmann/json.hpp:8678:43: error: logical ‘and’ of mutually exclusive tests is always false [-Werror=logical-op]
         const bool is_negative = (x <= 0) and (x != 0);  // see issue #755

if I manually edit that file, I see:

[ 67%] Linking CXX executable ../../../bin/idxdepth
/usr/bin/ld: cannot find -lBoost::boost
collect2: error: ld returned 1 exit status

but I have built boost as described and exported BOOST_ROOT:

ls $BOOST_ROOT/lib
libboost_atomic.a     libboost_exception.a   libboost_log_setup.a  libboost_math_tr1l.a         libboost_signals.a               libboost_test_exec_monitor.a    libboost_wserialization.a
libboost_chrono.a     libboost_filesystem.a  libboost_math_c99.a   libboost_prg_exec_monitor.a  libboost_stacktrace_addr2line.a  libboost_thread.a
libboost_container.a  libboost_graph.a       libboost_math_c99f.a  libboost_program_options.a   libboost_stacktrace_backtrace.a  libboost_timer.a
libboost_context.a    libboost_iostreams.a   libboost_math_c99l.a  libboost_random.a            libboost_stacktrace_basic.a      libboost_type_erasure.a
libboost_coroutine.a  libboost_locale.a      libboost_math_tr1.a   libboost_regex.a             libboost_stacktrace_noop.a       libboost_unit_test_framework.a
libboost_date_time.a  libboost_log.a         libboost_math_tr1f.a  libboost_serialization.a     libboost_system.a                libboost_wave.a

the docker build . from this repo also fails due to 404. Any ideas?

Error processing an INV

The error below occurs when processing this INV called by Manta. Any suggestions? The variant record supplies REF, ALT, SVLEN and SVINSSEQ.

chr6 2893187 MantaINV:4:20498:20498:4:0:0;MantaINV:61660:0:0:1:0:0 C <INV> 548 PASS END=2893191;SVTYPE=INV;SVLEN=4;CIPOS=0,4;CIEND=-4,0;HOMLEN=4;HOMSEQ=TATA;SVINSLEN=51;SVINSSEQ=ACGTATATATATACGTATATATAATATATATATTATATATACGTATATATA;INV5;AC=2;AN=9578;FIBC_P=-0.000417444;HWE_SLP_P=-0.195618;FIBC_I=-0.000417444;HWE_SLP_I=-0.195618;MAX_IF=0.749985;MIN_IF=0.749985;LLK0=-281075;BETA_IF=0.49997,-3.8073e-06,1.89345e-06,3.9655e-06,2.17351e-06;ANN=<INV>|intron_variant|MODIFIER|SERPINB9|ENSG00000170542|transcript|ENST00000380698.4|protein_coding|5/6|c.567+220_567+223inv||||||;NS=4789;AF=0.000208812;MAF=0.000208812;AC_Het=2;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=0.999896

It would be nice to have the chromosome included in the error message below, which would help when troubleshooting.

2020-03-09 10:26:55,417 ERROR    Exception when running vcf2paragraph on /scratch/tmppmn0a0d6.vcf.gz
2020-03-09 10:26:55,421 ERROR    Traceback (most recent call last):
2020-03-09 10:26:55,422 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 286, in run_vcf2paragraph    alt_paths=params["alt_paths"])
2020-03-09 10:26:55,422 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 86, in convert_vcf    ref, indexed_vcf.name, ins_info_key, chrom, start, end, ref_node_padding, allele_graph)
2020-03-09 10:26:55,422 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 128, in create_from_vcf    graph.add_record(record, allele_graph, varId, ins_info_key)
2020-03-09 10:26:55,422 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 202, in add_record    self.add_alt(vcf.pos, vcf.stop, ref_sequence, alt_sequence, alt_samples, refSamples)
2020-03-09 10:26:55,423 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 296, in add_alt    raise Exception("{}:{} missing REF or ALT sequence.".format(start, end))
2020-03-09 10:26:55,423 ERROR    Exception: 2893187:2893191 missing REF or ALT sequence.
2020-03-09 10:26:55,506 ERROR    VCF to JSON conversion failed.
2020-03-09 10:26:55,509 ERROR    multiprocessing.pool.RemoteTraceback: """Traceback (most recent call last):  File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 119, in worker    result = (True, func(*args, **kwds))  File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar    return list(map(*args))  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 286, in run_vcf2paragraph    alt_paths=params["alt_paths"])  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 86, in convert_vcf    ref, indexed_vcf.name, ins_info_key, chrom, start, end, ref_node_padding, allele_graph)  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 128, in create_from_vcf    graph.add_record(record, allele_graph, varId, ins_info_key)  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 202, in add_record    self.add_alt(vcf.pos, vcf.stop, ref_sequence, alt_sequence, alt_samples, refSamples)  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 296, in add_alt    raise Exception("{}:{} missing REF or ALT sequence.".format(start, end))Exception: 2893187:2893191 missing REF or ALT sequence."""
2020-03-09 10:26:55,509 ERROR    The above exception was the direct cause of the following exception:
2020-03-09 10:26:55,510 ERROR    Traceback (most recent call last):
2020-03-09 10:26:55,510 ERROR      File "/share/pkg.7/paragraph/2.4a/install/bin/multigrmpy.py", line 52, in load_graph_description    header, records, event_list = convert_vcf_to_json(args, alt_paths=True)
2020-03-09 10:26:55,510 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 156, in convert_vcf_to_json    variants = pool.map(run_vcf2paragraph, zip(to_process, itertools.repeat(params)))
2020-03-09 10:26:55,510 ERROR      File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 266, in map    return self._map_async(func, iterable, mapstar, chunksize).get()
2020-03-09 10:26:55,510 ERROR      File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 644, in get    raise self._value
2020-03-09 10:26:55,511 ERROR    Exception: 2893187:2893191 missing REF or ALT sequence.
2020-03-09 10:26:55,511 ERROR    multiprocessing.pool.RemoteTraceback: """Traceback (most recent call last):  File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 119, in worker    result = (True, func(*args, **kwds))  File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar    return list(map(*args))  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 286, in run_vcf2paragraph    alt_paths=params["alt_paths"])  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 86, in convert_vcf    ref, indexed_vcf.name, ins_info_key, chrom, start, end, ref_node_padding, allele_graph)  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 128, in create_from_vcf    graph.add_record(record, allele_graph, varId, ins_info_key)  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 202, in add_record    self.add_alt(vcf.pos, vcf.stop, ref_sequence, alt_sequence, alt_samples, refSamples)  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 296, in add_alt    raise Exception("{}:{} missing REF or ALT sequence.".format(start, end))Exception: 2893187:2893191 missing REF or ALT sequence."""
2020-03-09 10:26:55,511 ERROR    The above exception was the direct cause of the following exception:
2020-03-09 10:26:55,511 ERROR    Traceback (most recent call last):
2020-03-09 10:26:55,512 ERROR      File "/share/pkg.7/paragraph/2.4a/install/bin/multigrmpy.py", line 261, in run    graph_files = load_graph_description(args)
2020-03-09 10:26:55,512 ERROR      File "/share/pkg.7/paragraph/2.4a/install/bin/multigrmpy.py", line 52, in load_graph_description    header, records, event_list = convert_vcf_to_json(args, alt_paths=True)
2020-03-09 10:26:55,512 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 156, in convert_vcf_to_json    variants = pool.map(run_vcf2paragraph, zip(to_process, itertools.repeat(params)))
2020-03-09 10:26:55,512 ERROR      File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 266, in map    return self._map_async(func, iterable, mapstar, chunksize).get()
2020-03-09 10:26:55,512 ERROR      File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 644, in get    raise self._value
2020-03-09 10:26:55,513 ERROR    Exception: 2893187:2893191 missing REF or ALT sequence.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/share/pkg.7/python3/3.6.9/install/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 286, in run_vcf2paragraph
    alt_paths=params["alt_paths"])
  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcf2paragraph/__init__.py", line 86, in convert_vcf
    ref, indexed_vcf.name, ins_info_key, chrom, start, end, ref_node_padding, allele_graph)
  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 128, in create_from_vcf
    graph.add_record(record, allele_graph, varId, ins_info_key)
  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 202, in add_record
    self.add_alt(vcf.pos, vcf.stop, ref_sequence, alt_sequence, alt_samples, refSamples)
  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfgraph.py", line 296, in add_alt
    raise Exception("{}:{} missing REF or ALT sequence.".format(start, end))
Exception: 2893187:2893191 missing REF or ALT sequence.
"""

How to better call and genotype population SVs?

Hi @traxexx @KamilSJaron

Thanks for your contribution. I found that the Python script you uploaded (convertManta2Paragraph_compatible_vcf.py) omits the INSs whose INFO fields contain RIGHT_SVINSSEQ and LEFT_SVINSSEQ. Maybe Paragraph cannot genotype these variants, so you omitted them. Although short-read mapping cannot give us enough information about the complete inserted sequences, I have checked some insertions in IGV and found that some with the above two INFO fields actually make sense.

To @traxexx
It's difficult to assemble the complete inserted sequences (based on mapping to reference genome) when the INSs are longer than the read lengths. Thus, Manta reported the RIGHT_SVINSSEQ and LEFT_SVINSSEQ in these INSs. I wonder whether paragraph can handle these SVs properly.

Moreover, I found that deviations in breakpoint position greatly affect the genotyping results. These deviations may be over 100 bp or even larger, because I used SURVIVOR to merge all individual VCFs into one population VCF with a maximum allowed distance of 1 kb measured pairwise between breakpoints (begin1 vs begin2, end1 vs end2). Do you have any suggestions about the discovery of population SVs before genotyping?

Sincerely,
Zheng Zhuqing

[E::idx_find_and_load] Could not retrieve index file

Hello,
I used the test-data to run the multigrmpy.py

the command:

multigrmpy.py \
                          -i ./candidates.vcf \
                          -m ./samples.txt \
                          -r ./dummy.fa \
                          -o test

The error:

[E::idx_find_and_load] Could not retrieve index file for 'test/variants.vcf.gz'

In the end I still get the result file genotypes.vcf.gz. Does the error have any effect on the results, and how can I solve it?

Best wishes~

idxdepth error

Hi @traxexx

Thank you for this nice tool. I tried to run following command to generate the general statistics of the BAM file, but the program exited. Also, the warning is strange as the BAM file was generated by mapping to the reference.fa which was passed as an argument to option -r. Moreover, can we add an option to filter out the reads with low mapping quality? Thank you.
idxdepth -b $input.sort.dedup.bam --bam-index $input.sort.dedup.bai -r reference.fa --autosome-regex '[1-9][0-9]?' --sex-chromosome-regex '[XY]?' --threads 1 -o $input --log-level info
[2020-07-13 19:44:55.015] [idxdepth] [7450] [info] BAM: $input.sort.dedup.bam
[2020-07-13 19:44:55.023] [idxdepth] [7450] [info] Reference: reference.fa
[2020-07-13 19:44:55.023] [idxdepth] [7450] [info] Output path: $input
[2020-07-13 19:44:55.202] [idxdepth] [7450] [warning] BAM header only has a subset of the reference chromosomes -- please make sure they match!
[2020-07-13 19:44:55.209] [idxdepth] [7450] [critical] Assertion failed: index

Sincerely,
Zheng Zhuqing

Pysam error message after paragraph genotyping completes

Hi all,

I've been running into what I believe is a pysam error message when running multigrmpy.py. Before I go into too much detail, here is some background info on what I'm trying to do.

I've been working on an experiment where I take a mixture of two separate BAMS which have been sampled using samtools view -s and mixed with samtools merge. Then I try to genotype some variants from one of the two samples to see how robust Paragraph is to varying allele balance (by varying the mixture ratios of the two sample bams).

Paragraph seems to run just fine on a mixture BAM, since it seems to output and populate the genotypes.json.gz. In grmpy.log, the log messages do not indicate any errors and it seems to get through the entire genotyping process. It seems that when the time comes to create the corresponding genotypes.vcf.gz, I get a pysam error.

2020-11-03 06:18:01,921 ERROR Traceback (most recent call last):
2020-11-03 06:18:01,921 ERROR File "/mnt/local/paragraph/build/bin/multigrmpy.py", line 340, in run vcfupdate.update_vcf_from_grmpy(vcf_input_path, grmpyOutput, result_vcf_path, sample_names)
2020-11-03 06:18:01,921 ERROR File "/mnt/local/paragraph/build/lib/python3/grm/vcfgraph/vcfupdate.py", line 232, in update_vcf_from_grmpy set_record_for_sample(record, sample, grmpyRecord, alleleMap)
2020-11-03 06:18:01,921 ERROR File "/mnt/local/paragraph/build/lib/python3/grm/vcfgraph/vcfupdate.py", line 310, in set_record_for_sample record.samples[sample]["PL"] = pls_to_set
2020-11-03 06:18:01,921 ERROR File "pysam/libcbcf.pyx", line 3455, in pysam.libcbcf.VariantRecordSample.__setitem__
2020-11-03 06:18:01,921 ERROR File "pysam/libcbcf.pyx", line 859, in pysam.libcbcf.bcf_format_set_value
2020-11-03 06:18:01,921 ERROR File "pysam/libcbcf.pyx", line 597, in genexpr
2020-11-03 06:18:01,921 ERROR File "pysam/libcbcf.pyx", line 597, in genexpr
2020-11-03 06:18:01,921 ERROR File "pysam/libcutils.pyx", line 129, in pysam.libcutils.force_bytes
2020-11-03 06:18:01,921 ERROR TypeError: Argument must be string, bytes or unicode.
Traceback (most recent call last):
  File "/mnt/local/paragraph/build/bin/multigrmpy.py", line 353, in <module>
    main()
  File "/mnt/local/paragraph/build/bin/multigrmpy.py", line 349, in main
    run(args)
  File "/mnt/local/paragraph/build/bin/multigrmpy.py", line 340, in run
    vcfupdate.update_vcf_from_grmpy(vcf_input_path, grmpyOutput, result_vcf_path, sample_names)
  File "/mnt/local/paragraph/build/lib/python3/grm/vcfgraph/vcfupdate.py", line 232, in update_vcf_from_grmpy
    set_record_for_sample(record, sample, grmpyRecord, alleleMap)
  File "/mnt/local/paragraph/build/lib/python3/grm/vcfgraph/vcfupdate.py", line 310, in set_record_for_sample
    record.samples[sample]["PL"] = pls_to_set
  File "pysam/libcbcf.pyx", line 3455, in pysam.libcbcf.VariantRecordSample.__setitem__
  File "pysam/libcbcf.pyx", line 859, in pysam.libcbcf.bcf_format_set_value
  File "pysam/libcbcf.pyx", line 597, in genexpr
  File "pysam/libcbcf.pyx", line 597, in genexpr
  File "pysam/libcutils.pyx", line 129, in pysam.libcutils.force_bytes
TypeError: Argument must be string, bytes or unicode.

My initial thought was that there could have been something wrong with the input vcf format, but I'm not sure where to start based on the error alone.

Thanks for your help, and please let me know if you need more details to clear anything up.

Compatibility of paragraph and Sniffles VCF

Dear @traxexx

I ran the following command to genotype the candidate deletion variants. Here P1_DEL.vcf was generated by Sniffles; however, the program exited with "Exception: Different padding base for REF and ALT at 1:233140". Maybe I need some other custom script to convert the VCF file so that it is compatible with Paragraph.

multigrmpy.py -i P1_DEL.vcf -m manifest -o P1_DEL.genotype -r reference.fa

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       233140  .       TGTCCTGTGTCCGTGTCCCATGGTGTCCGTGTCTCAGTCTGTCCTGTGTCCGGTCCCGTGTCCGTGTCCCGTGTCCCACGTCCATGTCCCGTGTCCGTGTCTCATGTCCGGGTCCCGTGTCCGTGTCCCACGTCCATGTCCCGTGTCCGTGTCTCATGTCTGGGTCCTGTGTCCATGTCCCATGTCCATGTCCCGTGTCCGTGTCTCATGTCTGGGTCCTGTGTCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCTCATGTCCGCGTCCGTGTCCATGTCCATGTCCGTGTCCGTGTCTCATGTCCGGTCCTGTCCGGTCCCCTGTCCGTGTCCCGTGTCCGTGTCTCATGTCCGTGTCTCATGTCCGGGTCCGTTCCGTGTCCCTGTCCATGTCCCGTGTCCGTGTCTCATGTCTGGGTCCTGTGTCCGTGTCCCGTCCATGTCCCGTGTCCGTGTCCTGTCCGGTCCTGTCCGTGTCCGTGTCCATGTCCCGTGTCCGTGTCCGTGTCCATGTCCCGTGTCCGTGTCTCATGTCCCG   N       0       .       PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=233681;ZMW=21;STD_quant_start=0;STD_quant_stop=0;Kurtosis_quant_start=7;Kurtosis_quant_stop=2.0032;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-541;STRANDS=+-;STRANDS2=9,12,9,12;RE=21;REF_strand=13,14;AF=0.4375;MERGED_IDS=1,svim.DEL.5;NUM_JOINED_SVS=2;STDDEV_POS=0,2
1       233793  .       CACGTCCATGTCCCGTGTCCGTGTCTCATGTCCGGGTCCTGTGTCCGGTCCGTGTCCCGTGTCCGTGTCCCACGTCCATGTCCCGTGTCCGTGTCTCATGTCCGGTCCCGTGTCCGTGTCCCACGTCCATGTCCCGTGTCCGTGTCTCATGTCTCCGTGTCCTGTGTCCATGTCCGGTCCG   N       0       .       PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=233868;ZMW=26;STD_quant_start=0;STD_quant_stop=0;Kurtosis_quant_start=10;Kurtosis_quant_stop=10;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-75;STRANDS=+-;STRANDS2=13,13,13,13;RE=26;REF_strand=0,0;AF=1;MERGED_IDS=2,svim.DEL.7;NUM_JOINED_SVS=2;STDDEV_POS=0,0

sincerely,
Zheng Zhuqing

Errors on idxdepth

Hi,

I am trying to use the idxdepth to calculate the depth for the manifest file, but it always gives me a warning:

[warning] BAM header only has a subset of the reference chromosomes -- please make sure they match!

The issue occurs with many datasets that I tried. I use bwa for alignment, and gatk for adding read groups and removing duplicates.

Any hints as to how this might have happened?

Best,
Monica

Exception: Different padding base for REF and ALT

I am running paragraph for GIAB dataset and I used the following command

python3 .../bin/multigrmpy.py -i HG002_SVs_Tier1_v0.6_chr22.vcf -m samples.txt -r hg19.chr22.fa -o test

My samples.txt file looks like the following

id	path	depth	read length
sample1	/stornext/snfs5/next-gen/scratch/fritz/projects/Sairam/Proj1_nibSV/TEST/HG002.hg19.chr22.bam	60	250

Could you please check the following error that Paragraph throws?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../bin/multigrmpy.py", line 353, in <module>
    main()
  File ".../bin/multigrmpy.py", line 349, in main
    run(args)
  File ".../bin/multigrmpy.py", line 261, in run
    graph_files = load_graph_description(args)
  File ".../bin/multigrmpy.py", line 52, in load_graph_description
    header, records, event_list = convert_vcf_to_json(args, alt_paths=True)
  File ".../lib/python3/grm/vcf2paragraph/__init__.py", line 156, in convert_vcf_to_json
    variants = pool.map(run_vcf2paragraph, zip(to_process, itertools.repeat(params)))
  File ".../lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File ".../lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
Exception: Different padding base for REF and ALT at 22:18588640
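
To locate records that might trigger this exception before running the genotyper, something like the sketch below could be used. It rests on an assumption that is not confirmed anywhere in this thread: that the error is raised for sequence-resolved indel records whose REF and ALT do not share the same first (padding) base, as in the Sniffles records with ALT "N" reported in another issue. The function name is hypothetical.

```python
def differing_padding_records(vcf_lines):
    """Yield (chrom, pos) for indel-style records whose REF and ALT start differently."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], fields[1], fields[3]
        for alt in fields[4].split(","):
            if alt.startswith("<") or alt in (".", "*") or "[" in alt or "]" in alt:
                continue  # symbolic, missing or breakend alleles are skipped
            if len(ref) == len(alt):
                continue  # SNVs and equal-length substitutions are skipped
            if ref[0].upper() != alt[0].upper():
                yield chrom, pos
                break
```

Passing an open VCF file handle to `differing_padding_records` lists the positions to inspect or convert by hand.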

Test output not in expected format

In the documentation, it is indicated that the format of the output of paragraph-to-csv.py genotypes.json.gz --genotype-only should be an ID consisting of the chromosome and position, as below:

#FORMAT=GT
#ID SWAPS
chrA:1500-1509 REF/REF
chrB:1500-1509 S1/S1
chrC:1500-1699 REF/S1

Instead, the output looks like this:

#FORMAT=GT
#ID SWAPS
swaps.vcf@5a0b775f60ed1cd0b938ae09b753ad0207c5ba9f83679f894f17d3d1fd352b6f:2 swap2:1/swap2:1
swaps.vcf@5a0b775f60ed1cd0b938ae09b753ad0207c5ba9f83679f894f17d3d1fd352b6f:3 REF/swap3:1
swaps.vcf@5a0b775f60ed1cd0b938ae09b753ad0207c5ba9f83679f894f17d3d1fd352b6f:1 REF/REF

which, while traceable to the chromosome/position via the vcf, is, I think, not the expected format?

multigrmpy.py: Illegal header name "depth sd"

Running a test of paragraph and I got this error

$ python3 paragraph/bin/multigrmpy.py -i test.vcf \
        -m HG03097.manifest.txt \
         -r GRCh38_full_analysis_set_plus_decoy_hla.fa \
         -o paragraph_test

Traceback (most recent call last):
  File "paragraph/bin/multigrmpy.py", line 353, in <module>
    main()
  File "paragraph/bin/multigrmpy.py", line 349, in main
    run(args)
  File "paragraph/bin/multigrmpy.py", line 249, in run
    raise Exception("Illegal header name %s. Allowed headers:\n%s" % (field, header_str))
Exception: Illegal header name depth sd. Allowed headers:
id,path,idxdepth,depth,read length,sex,depth variance

In the README, "depth sd" is one of the options; is this "idxdepth" in the allowed headers? If I change "depth sd" to "idxdepth" the program begins to run.

Thanks!
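
A manifest header can be checked against the allowed names before launching a run. The list in this sketch is copied from the "Allowed headers" line of the error message above, so it reflects only the version used in this report; the function name is hypothetical.

```python
# Header names copied from the "Allowed headers" line of the error message above.
ALLOWED_HEADERS = ["id", "path", "idxdepth", "depth", "read length", "sex", "depth variance"]


def illegal_headers(header_line, delimiter="\t"):
    """Return manifest column names that are not in ALLOWED_HEADERS."""
    names = [name.strip() for name in header_line.rstrip("\n").split(delimiter)]
    return [name for name in names if name not in ALLOWED_HEADERS]
```

For the manifest in this report, `illegal_headers` would flag the "depth sd" column, matching the exception text.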

Temp file directory not settable

An option to set the temp directory manually when running multigrmpy.py would be useful. Currently it defaults to putting files in /tmp and the server I'm using does not have enough space in that location so I keep getting out of disk space errors.

I tried setting --scratch-dir but if that is what it is intended for, it did not solve the issue - the temporary files still get placed in /tmp, though files get placed in the specified scratch directory as well.

Furthermore, when multigrmpy.py errors out, the /tmp and --scratch-dir directories need to be cleared of their temp files manually.
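
One thing that may be worth trying, under an assumption not verified here: if multigrmpy.py creates these files with Python's standard `tempfile` module and does not pass an explicit directory, then the TMPDIR environment variable decides where they go. The sketch below only demonstrates that `tempfile` behaviour; the helper name is hypothetical.

```python
import os
import tempfile


def temp_dir_with(tmpdir):
    """Return the directory tempfile would use if TMPDIR were set to `tmpdir`."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = tmpdir
        tempfile.tempdir = None  # clear the cached value so TMPDIR is re-read
        return tempfile.gettempdir()
    finally:
        # Restore the previous state so the caller's environment is unchanged.
        tempfile.tempdir = saved_cache
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
```

If the assumption holds, exporting TMPDIR to a writable directory with enough space before calling multigrmpy.py would redirect the temporary files without any change to the script.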

SV calling and genotyping across a population

Hi @traxexx

I wonder how you generate the candidate SVs that are used as input for Paragraph when you have many short-read samples and several representative long-read samples. My plan would be to call SVs using both the long-read and short-read samples and then merge them together. During merging, I think the breakpoints will become coarse. Does this affect the genotyping results when I do not have a precise breakpoint?

Sincerely,
Zheng Zhuqing

required version of python

Hello folks,

I think your required version of python is too old. In the README you write

Python 3.4+ is required.

but in many places you use fstrings, which are a feature of python 3.6+.

We (with @ptranvan) tried to remove all the fstrings and run it on python 3.5, but we got a super-crazy-long error on the test example you provide. Selected lines of the error log:

2019-06-06 16:15:56,155 ERROR    Exception when running vcf2paragraph on /tmp/tmppbauhqg4.vcf.gz
2019-06-06 16:15:56,156 ERROR    Exception when running vcf2paragraph on /tmp/tmpk58pllu3.vcf.gz
2019-06-06 16:15:56,191 ERROR    Traceback (most recent call last):
2
...
2019-06-06 16:15:56,264 ERROR    VCF to JSON conversion failed.
...
Traceback (most recent call last):
  File "bin/multigrmpy.py", line 353, in <module>
    main()
  File "bin/multigrmpy.py", line 349, in main
    run(args)
  File "bin/multigrmpy.py", line 261, in run
    graph_files = load_graph_description(args)
  File "bin/multigrmpy.py", line 52, in load_graph_description
    header, records, event_list = convert_vcf_to_json(args, alt_paths=True)
  File "/stn4/ptranvan/Software/paragraph/paragraph-tools-build/lib/python3/grm/vcf2paragraph/__init__.py", line 156, in convert_vcf_to_json
    variants = pool.map(run_vcf2paragraph, zip(to_process, itertools.repeat(params))) 
  File "/software/lib64/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/software/lib64/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
AssertionError: ref-{refSpan}

I tried to copy the relevant lines. Also, it could be that we broke something when we were removing those bloody f-strings. Whichever it is, I suppose you should either remove the f-strings and test Paragraph on Python 3.4, or update the README to require a newer version of Python.
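The `AssertionError: ref-{refSpan}` message in the log above contains a literal `{refSpan}` placeholder. That may be consistent with an `f` prefix having been dropped without adding a `.format()` call, although this has not been confirmed against the Paragraph source. A minimal sketch of the difference, where `ref_span` and its value are made up for illustration:

```python
# Hypothetical value; the real variable in Paragraph may differ.
ref_span = "chr1:100-200"

# Python 3.6+ f-string: the placeholder is evaluated in place.
with_fstring = f"ref-{ref_span}"

# Dropping only the "f" prefix leaves the braces as literal text.
prefix_removed = "ref-{ref_span}"

# A Python 3.5-compatible rewrite needs an explicit format() call.
with_format = "ref-{}".format(ref_span)

print(with_fstring)
print(prefix_removed)
print(with_format)
```

Only the last form both runs on Python 3.5 and substitutes the value.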

Cheers :-)

how to run on large cohort

The example is effectively this:

python3 bin/multigrmpy.py -i $sites_vcf \
                          -m $manifest \
                          -r $reference_fasta \
                          -o $out_dir

If $manifest has hundreds of samples, can I genotype each sample separately? That is, will Paragraph give different results if I split by sample (with the same sites_vcf)?

Or should parallelization only be by site?
Also, what does Paragraph do with BND elements?
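If per-sample runs turn out to be appropriate, the manifest could be split into one file per sample. The sketch below only assumes that the manifest has a single header row followed by one row per sample (the README describes it as tab or comma delimited). The function name `split_manifest` and the output file naming are hypothetical. Whether per-sample genotyping gives identical results is the open question above and is not answered by this code.

```python
import os


def split_manifest(manifest_path, out_dir):
    """Write one manifest per sample, each with the original header row.

    Assumes the first non-empty line is a header and each later
    non-empty line describes one sample. Returns the paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(manifest_path) as handle:
        lines = [line for line in handle if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    written = []
    for index, row in enumerate(rows):
        out_path = os.path.join(out_dir, "manifest_{:04d}.txt".format(index))
        with open(out_path, "w") as out:
            out.write(header)
            # The last row of the input may lack a trailing newline.
            out.write(row if row.endswith("\n") else row + "\n")
        written.append(out_path)
    return written
```

Each resulting file could then be passed to `-m` in its own `multigrmpy.py` invocation, with the same `-i` and `-r` and a distinct `-o`.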

Issue with json when compiling

I'm trying to install Paragraph on Linux CentOS 6.9. I am currently using gcc 5.4.0, though I get the same error with 5.1.0.

After running cmake (version 3.8.2) successfully, I get the following error when running 'make' to compile.

Scanning dependencies of target graphIO
[ 64%] Building CXX object external/graphtools-build/src/graphIO/CMakeFiles/graphIO.dir/GraphJson.cpp.o
In file included from /home-4/[email protected]/bin/packages/paragraph-tools-build/external/graphtools-src/src/graphIO/../../include/graphIO/GraphJson.hh:30:0,
from /home-4/[email protected]/bin/packages/paragraph-tools-build/external/graphtools-src/src/graphIO/GraphJson.cpp:25:

...

/home-4/[email protected]/bin/packages/paragraph-tools-build/external/graphtools-src/src/graphIO/../../external/include/nlohmann/json.hpp:17216:25: required from here
/home-4/[email protected]/bin/packages/paragraph-tools-build/external/graphtools-src/src/graphIO/../../external/include/nlohmann/json.hpp:8678:43: error: logical ‘and’ of mutually exclusive tests is always false [-Werror=logical-op]
const bool is_negative = (x <= 0) and (x != 0); // see issue #755
^
cc1plus: all warnings being treated as errors
make[2]: *** [external/graphtools-build/src/graphIO/CMakeFiles/graphIO.dir/GraphJson.cpp.o] Error 1
make[1]: *** [external/graphtools-build/src/graphIO/CMakeFiles/graphIO.dir/all] Error 2
make: *** [all] Error 2

I found issue #755 in the json repo, but it seems to have been fixed in a release in 2017, and it seems to be related to the Intel icpc compiler, which I am not using. It does appear that it may also occur with gcc 5.2. I tried using gcc 4.9.2 instead of 5.1.0 or 5.4.0, but gcc 4.9.0 yielded other errors earlier in the compilation process. If this is related to the gcc version, which version(s) has Paragraph been successfully compiled with?

Make filter field for missing genotypes

In v2.4a some SVs with missing genotypes were still labeled as "PASS" in VCF filter field. We're going to revisit missing genotypes in v2.5 and adjust this filter properly...

Complex structural variants like TRA or INVDUP

Hi,
How can I construct a graph for complex structural variants like TRA or INVDUP from sniffles or nanosv? Could you give me an example? And is it possible to genotype translocations between chromosomes now?

Thanks!

Handling of temporary files by multigrmpy.py

I have noticed a few problems with the way temporary files are handled by multigrmpy.py:

1- vcf.gz files are still written to /tmp or /scratch even when the option --scratch-dir is explicitly set to another directory (the .json files are written to that directory, but not the .vcf.gz ones)
2- The index files of the .vcf.gz files (.vcf.gz.csi files) are not cleaned from the temporary directories, even when multigrmpy.py exits successfully
3- The .json files are also not cleaned from the temporary directory after running multigrmpy.py

I assume this is not the expected behavior of the program. In my case, I need to clean up the temporary directories after each run, but this prevents me from running several multigrmpy.py instances in parallel, because the cleanup could delete files that are still in use by another instance.

I saw that an issue was raised on this topic in the past and has been closed, but the behavior of the program has not changed since.
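As a possible workaround until the cleanup is fixed, each run could be given its own temporary directory through the `TMPDIR` environment variable, which Python's `tempfile` module consults when choosing its default directory. This sketch assumes, without having checked the source, that `multigrmpy.py` and the tools it launches create their temporary files through that mechanism. The helper name `run_isolated` is hypothetical.

```python
import os
import subprocess
import tempfile


def run_isolated(command, parent_dir=None):
    """Run `command` with a private TMPDIR that is deleted afterwards.

    `command` is a list of arguments, e.g. the multigrmpy.py call.
    The private directory is removed even if the command fails, so
    parallel runs never clean up each other's files.
    Returns the exit status of the command.
    """
    with tempfile.TemporaryDirectory(dir=parent_dir) as private_tmp:
        # Copy the current environment and override only TMPDIR.
        env = dict(os.environ, TMPDIR=private_tmp)
        return subprocess.run(command, env=env).returncode
```

A call might look like `run_isolated(["python3", "bin/multigrmpy.py", "-i", sites_vcf, "-m", manifest, "-r", reference_fasta, "-o", out_dir])`, with the variables set to the paths for a given run. If the .vcf.gz files still appear in /tmp with this in place, the assumption about `TMPDIR` does not hold for that step.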

Trouble genotyping lifted variants

I have a VCF of SVs in GRCh38, but I need to genotype a number of samples mapped to GRCh37. Remapping the samples is not an option. I converted my original variants from VCF to BED by just keeping the CHROM, BEGIN and END (from INFO) fields. I then lifted the BED file using UCSC and updated the VCF file with new coordinates (only BEGIN and END will change).
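Because liftover can move neighbouring records to distant positions, it may be worth checking that the updated VCF is still sorted by position within each chromosome before genotyping. Whether an unsorted file is the cause of the crash reported here is not established. The sketch below only reports out-of-order records, and the function name `find_unsorted_records` is hypothetical.

```python
def find_unsorted_records(vcf_lines):
    """Return (chrom, pos, id) for every record whose position is smaller
    than a position already seen on the same chromosome.

    `vcf_lines` is any iterable of VCF text lines; empty lines and
    header lines starting with '#' are skipped.
    """
    largest_pos = {}
    out_of_order = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Only the first three columns (CHROM, POS, ID) are needed.
        fields = line.split(None, 3)
        chrom, pos, record_id = fields[0], int(fields[1]), fields[2]
        if chrom in largest_pos and pos < largest_pos[chrom]:
            out_of_order.append((chrom, pos, record_id))
        else:
            largest_pos[chrom] = pos
    return out_of_order
```

In the lifted records shown in this report, INS0001 moves from chr1:191407 to chr1:114350134 while staying second in the file, so a check like this would flag the records that follow it.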

Paragraph could genotype the original calls without problem, but it crashes when trying to genotype the lifted ones. I created a new VCF file with only 10 of the SVs and passed that to Paragraph to try to figure out what's wrong. Here's the error I get:

2020-10-13 16:44:49,884 WARNING  chr1:114350134 Padding base in genome is different from VCF. Use the one from genome.
2020-10-13 16:44:54,586 ERROR    Traceback (most recent call last):
2020-10-13 16:44:54,586 ERROR      File "/share/binaries/Paragraph/bin/multigrmpy.py", line 315, in run    subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
2020-10-13 16:44:54,587 ERROR      File "/software/anaconda3/4.5.12/lssc0-linux/lib/python3.6/subprocess.py", line 311, in check_call    raise CalledProcessError(retcode, cmd)
2020-10-13 16:44:54,587 ERROR    subprocess.CalledProcessError: Command '/share/binaries/Paragraph/bin/grmpy --response-file=/tmp/tmpzmne8osh.txt' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/share/binaries/Paragraph/bin/multigrmpy.py", line 353, in <module>
    main()
  File "/share/binaries/Paragraph/bin/multigrmpy.py", line 349, in main
    run(args)
  File "/share/binaries/Paragraph/bin/multigrmpy.py", line 315, in run
    subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
  File "/software/anaconda3/4.5.12/lssc0-linux/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/share/hormozdiarilab/Codes/NebulousSerendipity/binaries/Paragraph/bin/grmpy --response-file=/tmp/tmpzmne8osh.txt' returned non-zero exit status 1.

The SVs being genotyped (original coordinates):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GENOTYPE        2:GENOTYPE      3:GENOTYPE
chr1    59605   INS0000 C       <INS>   30      .       END=59605;SVTYPE=INS;SEQ=tttcttttttttttttttttttttttgaggagttccttgtcgccgctgggtggcggcgcgattgctcctgcagctccgcccccgtccccattcctgcctcgcctcccaagtactggactcagcgccccctcgcccggctaatttttgtatttttagtaagacgtttccgtttagcggggttcgatctctgacttcgtgtcctccgcctcgctcccagtgtgattacagCTGACCACCCCCCCAG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60       GT      ./.     ./.     ./.
chr1    181325  DEL0000 G       <DEL>   30      .       END=181448;SVTYPE=DEL;SEQ=GCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGAGGAGGCGTGGCACAGGCGCAGAGACACATGCTAGCGCGCCCAGGGGAGGAGGCGTggcgcaggcgcagagaggcgcgCCGTGCTG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60       GT      ./.     ./.     ./.
chr1    191407  INS0001 A       <INS>   30      .       END=191407;SVTYPE=INS;SEQ=TGTCTCTAGCACCTGGGATGGGCCTGATGTGTAACAGCTGCTGGCTGAACAGAAAGtgacagatgagcaaacatctcaaggaggtgatgaggatggtgatgagtgagaactcccgacatgtgaagataactgaagatgttctggctaaagatccgaagactctaagaatatgatcattccctttgaatatcaaatatcaaaagggctgtcaggtggagaagtgagtaaacttgtatcagaatagcggcagagTTGCAAGGAAACAGATCTCTGTTCTGTTAAAAAAAAAAAATTCCATAAACAACTGCATCACTTTGCATAGCAATTAGGTTCCAGCTCACAAGCGCCTTCCGGGGTGCCCCAAGGGTGAATCCTGCTAAGGTGGAGGTAGAAGACATGACCCTGGGGCTCTTTCCTTAGCCAAGAGCCCATGAGACTAAGGAACATCGTGCTTGTTGACAAAGACCCCGGACAGTCTATTCTCTTACGGTCACAGGCTATGGTGCCAAGGACAAGTGCAGACTCAGGATCAGAAAGCTTGCAGCATATCTGCTATCTCCATGGATAGCAGGATGGTCTGGAAGGCTGTGTCGGAAGGCCCTTAGGCCTCACTGGGGCCAGGCCGTTGATGAACAATGTCCACCCTGAGGGTCGGGAATGGTGCCATTTGTTTGTCATTCCTGGTCCAGACGCCCTTGGCTTGGTGGCTACTCAAGTAGGTCAGTTTACAAGCTCAGTGCTGAACCCATACCCTATGGCACGCTCGCCAGCACTAGAGAGGAAGCTGCCTCTGTGGACATCAGGGACGGAAGTGGCTCACCCAGCCTGTTCTGCGCGTGTCTCACTAAGGGTCCATCTTCCTCTATCTGCCCCGGAGGGGACCATCTCCAAGCATCCCTTGCTTTCCTTCTCCCCCCTCCACCCTCACTGTTCAATAACTTGAGTGCATCCCATTTGTAGAGCACATGCTGGGCCGTGGAGTGAAAGACAATCAGACGACACACATCCACATTCAAAGGGACTCAGGGCTCCTGGGAAGTAGAAATGAATATCAATAACCAAACATCCCacagcctgggtttcacatctgtttagcagcaggtgaccctggggaggtcactaactggtctctgcctcagcttcttccactgaaaaacaggaatggtcccttctacatcatggatactgtgaggTGAGAAGGAGCTGATCATGGCCCATCAACCTCAGCACACCAGTCCCCCTAGAGGCTGCTGGGAGAAGAAGCAGGGAGCACCCACTCCTGACCCAGATTCACATTCACTGCTCTCCTCCCCTGCCTCTGTCATGACCCCAGGGAAGCAGACGCTGAACCTGGGCTCTTGCCTTCATCTTTATCTTCTCCACTCTGGGATAATTAAGAATGACTTGCTAATTATGCAGATCTAGTGCAATGTGTAACTTCGGGCCACCAGTGCCAATCAGTAGAGCGGAGATGACGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaTCAAATCAATTTAAAAAACAATAAACTCCACCACCTCCCCCCTCACCCTCCCGTCATCTGCACTGATTTGTTCTCCCGGGAGCTGGAGAGGAGGGGGGGGGGGCAGCG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60   GT      ./.     ./.     ./.
chr1    710579  INS0002 T       <INS>   30      .       END=710579;SVTYPE=INS;SEQ=AAAGAACTGCCCGCCggcgcggtggctcacgcctgtaatcccagcactttgggaggccgaggcgggcggatcacgaggtcaggagatcgagaccatcccggctaaaacggtgaaacccgtctctactaaaaatacaaaaattagccgggcgtagtggcggcgcctgtagtcccagctacttgggaggctgaggcaggagaatggcgtgaacccgggaggtggagcttgcagtgagccgagatcccgccactgcactccagcctgggcgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaa;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60        GT      ./.     ./.     ./.
chr1    732377  INS0003 A       <INS>   30      .       END=732377;SVTYPE=INS;SEQ=GACAGAGAGTAAAAAGAGAAATTAGGAAAGCATTCTACATGTTGAATAGGAAGACACTGGCCATGTTCGTGCAGCAGCAGTATGTCGTGACATGACATACCTTGGAGAGAAGTTAACAGATGAGGAAGTTGATAAAAATCATCAGAGAAGCAAAATACTGGTAGCGACACTCAAGTAAACCATGAAATTTCCATAACTTATGTCAGCAAAGTGGGAATATTGTACAGTGTGTGTTGAAGTTCCTATACAACATTGTTTATCTGCCTTTTGTTTGTTTGTAAGGAATGTACATACTAAAAGTTCTTCTTGCTGTCAAAAGAATATGCGTGAATAAGTCATTTTAACTTATTCTTCTGTTTTTCTTTTATCTTCCTGCCATCATCCCACAGCCTTACTTTAGAAATTTCTTTTTTAGAAAATTGAACAAGTGCTCCCTGTGGTGGCACATACCTCGAGGAtgggaggcagggtggaagggtcacttgaggccattagtttgacaccagcctggccaacaaagtgagaccccgtgtctacaaacaatttaaaaattagccaagtatcgtcatgtatacctacagtcccagctaTCTGAACTTACTGAGAATGTTCAGGGCCTGGAGAGAAGGCTGGGAGGCAGGAGCTGGGTCTAAAGAGGCCATTGTAACGATGGAGCTGTGCCTGTGGAGGCTGTTGTGAGGCAGTAGGCTCATCTGCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTGGGCCTGGAGAGGCTGCTGGGAGGTAGGAGCTGGGCCAAAAgatgtaagcacatttgcatttattaggcactttatttgcattattacactgtaatatataataaaataattatagaactcaccataatgtagaatcagtgggcgtgttaagcttgttttcctgcaactggatggtcccacctgagcgtgatgggagaaagtgacagatcaataggtattagattctcataaggacagcgcaacctagatccctcacatgcacggttcacaacagggtgcgttctcctatgagaatctaacgctgctgctcatctgagaaggtggagctcaggcgggaatgtgagcaaaggggagtggctgtaaatacagacgaagcttccctcactccctcactcgacaccgctcacctcctgctgtgtggctccttgcggctccatggctcaggggttggggacccctgCTCAAGTGCATCCAAAGCGACCCTTCCCACACCAGTCTTCACAGTGGTCAAGGGCAGCAACCACTTAGCTCCCAAGGCATGTGCCTCAGCTGGCATTTCGTCACAATCAACAGTAAGTGGTAGCTTGAGTCACTGTGAGGTCACCTACTGGAAATCACCAGCATCCCATTTCCCACTGGCAAAGAGCTCAGCACTGCCCCCGGGAAACCAAACCTATGCCCAAATCCATCTGTGTGGGTGTATCTCCTGGGACCCTTCCTAACAtattagtcagagtccaatcaggaagcataaaccactcaaaagtttaaagtggtaaaatttaatacagagaattattcattgtaacaggtgaacagcataatgagagattggctagcacaaagtaaacagaactctagagaatataggactagcCCAggccaggcatggtggctcaggcctgaaattccagcaatttgagaagctaatgcaggaggattgcttaaggccaggagctagagaccggtctggacgacacagtgagaccctgtctctatccaaaagaagaaaaaagttagctgggggtggtagtgcacacttgtagtcccagctactcggaatgcggaagtttgagcctgggaggtcaaggctgcagtgaggcatgattatgc
cactacagtccagcctggtgacagagcaagaccctgtctcaaagaacaaaaCAACAACAACCATTTACAGACAGAAAAGAAATAGAGCTAATAAGCTGAGGAAAGATGTTgaaatgtgacaagtaaagtaatatgagttcttttgtctatgtaaaataatcaaacaaaaaatgacttactaaattataataccctgtgctggcaaaggtgcagtgaaatgggcaccttcttatactatgaggggtgtttaaattgtgtataagccttcccgggtaaagcctgtcaattttttaaaataatggagacagggtctcaccatactgccatactgcctcctccaactcttggcctcaagcaatcctcctctcttagcctcccaaagtgctaagattatagctgggaggcaccCAAAACCCTGTCAATTTACATCAAGGGTAAGGAGAATGTCCATTCACCATGACTCACAGTAATCTTACTTCTGGGGAGACAATTCAATCTAAACAAAAGGTCATCTGTACACACACAGTAAAAATCTGGGAGTAACTGAAGACAGAGTTGGTAAGTGAAATAAGAAACAGTTATAAGAAATTAAACTATGGTATCAATAGGCACCTGGTAAAAGGTCAGTTGATGTTAGCTGCTACttttttgttgttttgagacagggtctcactctgtcacccaggctggagtgcagaggcctgatcatgactcactgcagtctcagcctccctgggctcaagtgatcctcccacctcagcctcccaagtagctgggactacaggaacatgccaccacactaggctaattcatgtatttttctgtagggatggtgactccccctttgtttccaaggcctatcgcaaactcttggcctcaagccatcctcctgcctcagcctcccaaagtgttgcgattaccagtgtgagccaccacacctggccAGCTGCTACTTTTATCAATATTATTCTTATTCCACTCAATTAAAAATTATTATTTTCAAGGCTATGCAACAGTATGTATCCCACAGCATAATTGTAAAAACATATAGTCgtcgtccctcagtatacagaattagttccagccccccatctctgcatataccaaaatccatgcttactcacgtttcgctgtcacccctctagaatccacgtatacgaaaattccaaatgttagttgggcatagtggcaagcacctgtagtctcagccacgtgggaggttgaggtgggaggatcgcttcagcctggaaggttgaggctgcagtcagctgcgatagcactactacactccagccttggacaacagagggagaccctgtctcagaaaaaaaacaaaataaaaCAGGTTAGAAATTGTAATGAGGTCTGCTGGGCAAAATTCCATATAAGCAAAGTATAAATTAATAAAGCAAATCGTGATAAATTAGTACGATTGACTTTCTGGAGTTTCTGACAATAAAAGTAAGGAAAATGCAGAACACAAA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60        GT      ./.     ./.     ./.
chr1    737102  INS0004 G       <INS>   30      .       END=737102;SVTYPE=INS;SEQ=GGCAGCAACCACTTAGCTCCCAAGGCATGTGCCTCAGCTGGCATTTCGTCACAATCAACAGTAAGTGGTAGCTTGAGTCACTGTGAGGTCACCTACTGGAAATCACCAGCATCCCATTTCCCACTGGCAAAGAGCTCAGCACTGCCCCCGGGAAACCAAACCTATGCCCAAATCCCATCTGTGTGGGTGTATCTCCTGGGACCCTTCCTAACAtattagtcagagtccaatcaggaagcataaaccactcaaaagtttaaagtggtaaaatttaatacagagaattattcattataacaggtgaacagcataatgagagattggctagcacaaagtaaacagaactctagagaatatggactagcCCAggccaggcatggtggctcagcctgaaattccagcaatttgagaagctaatgcaggaggattgcttaaggccaggagctagagaccggtctggacgacacagtgagaccctgtctctatccaaaagaagaaaaaagttagctgggggtggtagtgcacacttgtagtcccagctactcggaatgcgaagtttgagcctgggaggtcaaggctgcagtgaggcatgattatgccactacagtccagcctggtgacagagcaagacctgtctcaaagaacaaaacaacaacaaCCATTTACAGACAGAAAAGAAATAGAGCTAATAAGCTGAGGAAAGATGTTgaaatgtgacaagtaaagtaatatgagttcttttgtctatgtaaaataatcaaacaaaaaatgacttactaaattataataccctgtgctggcaaaggtgcagtgaaatgggcaccttcttatactatgaggggtgtttaaattgtgtataagccttccgggtaaagcCTGTCAATTTTTTAAAATAAtggagacagggtctcaccatactgccatactgcctcctccaactcttggcctcaagcaatcctcctctcttagcctcccaaagtgctaagattatagctgggaggcaccCAAAACCCTGTCAATTTACATCAAGGGTAAGGAGAATGTCCATTCACCATGACTCACAGTAATCTTACTTCTGGGGAGACAATTCAATCTAAACAAAAGGTCATCTGTACACACACAGTAAAAATCTGGGAGTAACTGAAGACAGAGTTGGTAAGTGAAATAAGAAACAGTTATAAGAAATTAAACTATGGTATCAATAGGCACCTGGTAAAAGGTCAGTTGATGTTAGCTGCTACttttttgttgttttgagacagggtctcactctgtcacccaggctggagtgcagaggcctgatcatgactcactgcagtctcagcctccctgggctcaagtgatcctcccacctcagcctcccaagtagctgggactacaggaacatgccaccacactaggctaattcatgtatttttctgtagggatggtgactccccctttgttccaaggcctatcgcaaactcttggcctcaagccatcctcctgcctcagcctcccaaagtgttgcgattaccagtgtgagccaccacacctggccAGCTGCTACTTTTATCAATATTATTCTTATTCCACTCAATTAAAAATTATTATTTTCAAGGCTATGCAACAGTATGTATCCACAGCATAATTGTAAAAACATATagtcgtcgtcctcagtatacagaattagttccagccccccatctctgcatataccaaaatccatgcttactcacgtttgctgtcacccctctggaatccacgtatacgaaaattccaaatttagttgggcatagtggcaagcacctgtagtctcagccacgtgggaggttgaggtgggaggatcgcttcagcctggaaggttgaggctgcagtcagctgcgatagcactactacactccagccttggacaacagagggagaccctgtctcagaa
aaaaaaaaaaataaaaCAGGTTAGAAACTGTAATGAGGTCTGCTGGGCAAAATTCCATATAAGCAAAGTATAAATTAATAAAGCAAATCGTGATAAATTAGTACGATTGGCTTTCTGGAGTTTCTGACAATAAAAGTAAGGAAAATGCAGAACACAAAGACAGAGAGTAAAAAGAGAAATTAGGAAAGCATTCTACATGTTGAATAGGAAGACACTGGCCATGTTCGTGCAGCGGCAGTATGTCGTGACATGACATACCTTGGAGAGAAGTTAACAGATGAGGAAGTTGATAAAAATCATCAGAGAAGCAAAATACTGGTAGCGACACTCAAGTAAACCATGAAATTTCCATAACTTATGTCAGCAAAGTGGGAATATTGTACAGTGTGTGTTGAAGTTCCTATACAACATTGTTTATCTGCCTTTTGTTTGTTTGTAAGGAATGTAATACTAAAAGTTCTTCTTGCTGTCAAAAGAATATGGTGAATAAGTCATTTTAACTTATTCTTCTGTTTTTCTTTATCTTCCTGCCATCATCCCACAGCCTTACTTTAGAAATTTTTTTTTTAGAAAATTGAACAAGTGCTCCTgtggtggcacatgcctcgaggatgggaggcaggggtggaagggtcacttgaggccattagtttgacaccagcctggccaacaaagtgagaccccgtgtctacaaaacaatttaaaaattagccaagtatcatcatgtatacctacagtcccagctacCTGAACTTACTGAGAAAGTTCAGGCCTGGAGAGAAGGCTGGGAGGCAGGAGCTGGGTCTAAAGAGGCCATTGTAACGATGGAGCTGTGCCTGTGGAGGCTGTTGTGAGGCAGTAGCTCATCTGCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTGGCCTGGAGAGGCTGCCGGGAGGTAGGAGCTGGGCCAAAAgatgtaagcacatttgcatttattaggcactttatttccattattacactgtaatatataataaaataattatagaactcaccataatgtagaatcagtgggcgtgttaagcttgttttcctgcaactggatgtcccacctgagcgtgatgggagaaagtaacagatcaataggtattagattctcataaggacagcgcaacctgatccctcacatgcacggttcacaacagggtgcgttctcctatgagaatctaacgctgctgctcatctgagaaggtggagctcaggcgggaatgtgagcaaaggggagtggctgtaaatacagacgaagcttccctcactccctcactcgacaccgctcacctcctgctgtgtgctccttgcggctccatggctcaggggttggggacccctgCTCAAGTGCATCCAAAGCGACCCTTCCCACACCAGTCTTCACAGTGGTCAA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60  GT      ./.     ./.     ./.
chr1    814625  INS0005 T       <INS>   30      .       END=814625;SVTYPE=INS;SEQ=GGAAATGTTAATTCTGAAAATAGGTTTCACATCTTTTTTTTAACTTATATAAAATTGACTGGATTTCTCTTCTGTGTGTTGTGTTAGATATTTAGGA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60 GT      ./.     ./.     ./.
chr1    831217  DEL0001 T       <DEL>   30      .       END=833736;SVTYPE=DEL;SEQ=TTGtcttatgtttaaaaatgtccttcagtcattgcaggtcacaagcaggctatcagctcagtaattaaaataattcggttcttcatagtgaatgtaattctaaattagattttaagttgtaactccctgcttcagcAATGGTGATGGGGCCTAGAAACCAGAGCACCTGAGCTCCATCCTACAGGGGGCCATACCGGGATCTTTCCATTTTCAGAGGCTTCTCTCTGACAGTGAAGTGTGATGACAGACTTGGGGGCAGGGCAATGGCTAGCTTCTGAAAGCCGCTGGCACTTTAGTGATAAATTTAAATTAAGTGACGGGTAGTGAGGTGTTTGTCAAGGAAAGTGCCGTCCAAATGCTAAATACTGATTATTTCTGCAGCAGTGACTGCAATACCTCACTCAATCTCTGTCTTTCTTGAAGAAGTCATAAATAAACACGATGAATCTATGTAGAAGCGGTAAGTCAGAAAAATCTGTGTGTTTCATTACATAAACAACGGTTTATCATTAATTGACAGGCTTGGATTGGGAGTTGTTAATGAAACTGATGAGATGTTGGACAGATGAGCTCCCTCTTATTTCGAAGAGCTTATCTAGGGCTGAGTCATGGGACCTGATAGCGTCTTGTGGTGCTGTCTTCTTGTAGATATATCCGTGTTTTAGAGGATTTAGTTTTTTAAAATTTCTCTTAGAATGTGAATTTTACAAAAAAGCACTTCCCAAATGGATGATTATTTGAAAAATGAATTGTCAGACAAAACTGACACATCAGTTATGGAGAAAACCCTTCAAGAACTGGCTTTAAATGTGTTTTAGTGGGAGCCACAGTGTGGAGAGAAACAGAAGAGGGAGGAGAGGGCGCCCCTTGTTTCTTCTCTCCACAGCCAGGCCTTCGCCACCTTTCTCAGTGTCTTCAAGAATAAAATGCCTCCGTTGTTGGTTTTAGCTGCTTTTCTCCCTCGGGGTAGGTAAAGTGGTTCCAAAACGACAAGCATCCTGTAAAGTCGGAAGAGCTGTGTCAACATTAAGCTGCGTGACTTTGGCTATGAGGGAAAAAAGGCTGGTGAGTGCAGAGAAGACAGAGCTGTGGCAGGGCTCCTCCCGCCAAGTCGCCATGGAGAGGGGCTGTGAGGTGTCCTTAAACGGCCTGGTCTCCAGGGTGACTCAGGAAGGGCTGAGAGTGGTCAGCTCCCTCACCTGCTAAACCCGCAGCGCCCCGCTCAGCACACACCCTCCACTCTCCAACCTTGCCCAAGTGCTGGTCCGTCACGGCACCAGGACAGGGCATGGAGACTTGGGCTGAttcttttctctcccttcctccctcttttttttcttctctcactcctccttttcctttcctgctgtttcctgctctcctgtttctGTCCTGCAGTGTCTGGAGCTCCAGAGAGGCTGGCCCTGGGGTGGGGTCCACATGGACATGGGCGTAAGCAGGTTTGATGGTCATGGGCATAGGCAGGTTCGATGGCCAGAGTTCTTTCAGCTCACAGTAAgttttgttttgttttgttttgttttgttttgttttgttttgttttagatggagtcttgctttgtcgcccaggctgtagtgcagtggcgtgatcttggctcactgcagcctccaccttagagcaatcctcttgcctcatcctcccgggtagttgggactacatgtgcatgccacatgcctggctaatttttgtatttttagtagagacacggtttcaccatgttggccaggctggtgtccaactcctgacctcaggtgatccatccgcctcagcctcccaaagtgccgggattacaggtatgagccactgcacctggccTCAGCTGACAGTAGGTTTTAGAGCCAGATATTTACACACTAACTTGCCAGAAACATATATGACTTTATTATTCTAATTGAT
TTTAAGAGATATTATGAACTCAAATCCAAAGTTACGTCCCACCTATCATGACAATTTCATTAAGGAAAAAGTCAAACCATTTTGGAAATGATTTAAGTGAGCAACTTGGAAAAATTTTCTACATTCCTAACTTACTTTCCAGGGGATCGTTCCTGACTTAACATCTATCAGGTGTCTTAGCTTAGCTCTCTTTTTACTTCAGGTTTTTCTTGCCTCCTCAGTGTGCTGGGAGTCCCACTCCACTCAAATGCCCTCAGGTCTAATAATTAACTTCATTGCAGGCTCCTGGCAGGCCTGGGTGGGCGGCAGCTGCATTGTGCTCCTGAAGAAGATTTAAGTTGGGTTTGGTGAACTGGTAGAATTTGCATTTTGCTGTTTCTTTCCCTCTCCCAGAATTTGTACCTTTAAATAGGTTTTTTAGTGTCATTAAGTATATCAAAAGGAAACCCAGTGGGGCAAATTGGCCGGGCTccatagaggtggccttgtctaagcctttcatcttatcgataaggaaagacaggaccagagaagtCGCCGACTGTCCCTGGTCCCACTGCTTGGTTTGGGGCAATTTCCTGAAAATAATATCCAAGATGCA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60   GT      ./.     ./.     ./.
chr1    1195963 INS0006 A       <INS>   30      .       END=1195963;SVTYPE=INS;SEQ=TGGGGTCTCACCATGTTGGCCAGGCTGgtctcaaactcctgagctcaagcgatcctcctgcctcagcctcccaaagtgctgggactacaggtgtgagccatgcgcccgaccaatttgtgtatttttagtagagatggggtctcaccatgttggccaggctggtctcaaactcctgagctcaagcgatcctcctgcctcagcctcccaaagtgctgggactacaggtgtgagccacgcgcctgaccAACTTGTGTATTTCTAGTAGAG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60      GT      ./.     ./.     ./.
chr1    1240675 INS0007 C       <INS>   30      .       END=1240675;SVTYPE=INS;SEQ=CAGCcccccgcccccattcaccccggccgtggtccctgccccagcccccgccgcccccattcaccccggccgtggtccctgccc;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60     GT      ./.     ./.     ./.
chr1    1248055 DEL0002 A       <DEL>   30      .       END=1248319;SVTYPE=DEL;SEQ=GGCTGGATCTCCAACTCTGACCTACAGGCAGGAAAGTGGGCAGCCCTGGGAGGCTGGACTGAGGGAGGCTGGACTTCCCACTCAGGCCTACACGCAGGAAAATGGGCAGCCCTGGGAGGCTGGACCGAGGGAGGCTGGGCCTCCCACTCCACCCTACAGGCCAGGACACGGGCAGCCCTGGGAGGCTAGACCGAGGGAGGCTGGGCCTCCCATCTACCCTACAGGCCGGGACACAGGCAGCCCTGGGAGGCTGTACCGAGGGAG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60 GT      ./.     ./.     ./.
chr1    1477854 INS0008 C       <INS>   30      .       END=1477854;SVTYPE=INS;SEQ=CccaccacgcctggctaatgttgtattttagtagagacggggtttctccatgttggtcaggctggtctctaactcccgacctcaggtgatccacccgcctcggcctctcaaactgttgggattacaggcatgT;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60    GT      ./.     ./.     ./.
chr1    1494665 INS0009 A       <INS>   30      .       END=1494665;SVTYPE=INS;SEQ=TGGTGTGCTGCTGCCCCTGCACCCCGTGAGATGAATCCTGCCTCTGGGAGGTACAGCTTCCTGGAGGGGTGGCCCTGTGAGCATCTGCGTAGCCCCTCTCCTCTGCTGGGCCCTGGGTGACGTGCAGCCACTCGGGTGGACCCTGAGGGTCCCTGCACCTGTTTGCCCTCTCTTGGGTGGGCTCAAGACCAAAAATGATGTTGAGCAGTCCTGGGCCCCTGAGCCACAGTGGCGGTGCGGCTCCGGTCAGTGTCTCCTGCGCTCCCGGGCCCCCGACCCACAGTGGCGGTCCGGCTCTGGTCAGTGTCTCCTGCGCTCCCGGGCCCCCGACCCACAGTGGCGGTCCGGCTCCGGTCGGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGGCAGAGGGGACGGGCACCACGTGGTCATCCCCATGACAGGTTCTGTCATGGTGACAGTGTTGTGGAGGA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60     GT      ./.     ./.     ./.
chr1    1565684 INS00010        T       <INS>   30      .       END=1565684;SVTYPE=INS;SEQ=GGTGCAGGCAGAGAACAGACGTCGCGATGGGCCCGACGGTGCTGGCTCCATGGGAACCGAGACCCAACACCCAAAGGAGTCCCACAGGCTCAGGGG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60 GT      ./.     ./.     ./.

After lifting to GRCh37, they look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GENOTYPE
chr1    59605   INS0000 C       <INS>   30      .       END=59605;SVTYPE=INS;SEQ=tttcttttttttttttttttttttttgaggagttccttgtcgccgctgggtggcggcgcgattgctcctgcagctccgcccccgtccccattcctgcctcgcctcccaagtactggactcagcgccccctcgcccggctaatttttgtatttttagtaagacgtttccgtttagcggggttcgatctctgacttcgtgtcctccgcctcgctcccagtgtgattacagCTGACCACCCCCCCAG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60       GT      ./.
chr1    114350134       INS0001 A       <INS>   30      .       END=114350134;SVTYPE=INS;SEQ=TGTCTCTAGCACCTGGGATGGGCCTGATGTGTAACAGCTGCTGGCTGAACAGAAAGtgacagatgagcaaacatctcaaggaggtgatgaggatggtgatgagtgagaactcccgacatgtgaagataactgaagatgttctggctaaagatccgaagactctaagaatatgatcattccctttgaatatcaaatatcaaaagggctgtcaggtggagaagtgagtaaacttgtatcagaatagcggcagagTTGCAAGGAAACAGATCTCTGTTCTGTTAAAAAAAAAAAATTCCATAAACAACTGCATCACTTTGCATAGCAATTAGGTTCCAGCTCACAAGCGCCTTCCGGGGTGCCCCAAGGGTGAATCCTGCTAAGGTGGAGGTAGAAGACATGACCCTGGGGCTCTTTCCTTAGCCAAGAGCCCATGAGACTAAGGAACATCGTGCTTGTTGACAAAGACCCCGGACAGTCTATTCTCTTACGGTCACAGGCTATGGTGCCAAGGACAAGTGCAGACTCAGGATCAGAAAGCTTGCAGCATATCTGCTATCTCCATGGATAGCAGGATGGTCTGGAAGGCTGTGTCGGAAGGCCCTTAGGCCTCACTGGGGCCAGGCCGTTGATGAACAATGTCCACCCTGAGGGTCGGGAATGGTGCCATTTGTTTGTCATTCCTGGTCCAGACGCCCTTGGCTTGGTGGCTACTCAAGTAGGTCAGTTTACAAGCTCAGTGCTGAACCCATACCCTATGGCACGCTCGCCAGCACTAGAGAGGAAGCTGCCTCTGTGGACATCAGGGACGGAAGTGGCTCACCCAGCCTGTTCTGCGCGTGTCTCACTAAGGGTCCATCTTCCTCTATCTGCCCCGGAGGGGACCATCTCCAAGCATCCCTTGCTTTCCTTCTCCCCCCTCCACCCTCACTGTTCAATAACTTGAGTGCATCCCATTTGTAGAGCACATGCTGGGCCGTGGAGTGAAAGACAATCAGACGACACACATCCACATTCAAAGGGACTCAGGGCTCCTGGGAAGTAGAAATGAATATCAATAACCAAACATCCCacagcctgggtttcacatctgtttagcagcaggtgaccctggggaggtcactaactggtctctgcctcagcttcttccactgaaaaacaggaatggtcccttctacatcatggatactgtgaggTGAGAAGGAGCTGATCATGGCCCATCAACCTCAGCACACCAGTCCCCCTAGAGGCTGCTGGGAGAAGAAGCAGGGAGCACCCACTCCTGACCCAGATTCACATTCACTGCTCTCCTCCCCTGCCTCTGTCATGACCCCAGGGAAGCAGACGCTGAACCTGGGCTCTTGCCTTCATCTTTATCTTCTCCACTCTGGGATAATTAAGAATGACTTGCTAATTATGCAGATCTAGTGCAATGTGTAACTTCGGGCCACCAGTGCCAATCAGTAGAGCGGAGATGACGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaTCAAATCAATTTAAAAAACAATAAACTCCACCACCTCCCCCCTCACCCTCCCGTCATCTGCACTGATTTGTTCTCCCGGGAGCTGGAGAGGAGGGGGGGGGGGCAGCG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60        GT      ./.
chr1    645959  INS0002 T       <INS>   30      .       END=645959;SVTYPE=INS;SEQ=AAAGAACTGCCCGCCggcgcggtggctcacgcctgtaatcccagcactttgggaggccgaggcgggcggatcacgaggtcaggagatcgagaccatcccggctaaaacggtgaaacccgtctctactaaaaatacaaaaattagccgggcgtagtggcggcgcctgtagtcccagctacttgggaggctgaggcaggagaatggcgtgaacccgggaggtggagcttgcagtgagccgagatcccgccactgcactccagcctgggcgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaa;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60        GT      ./.
chr1    667757  INS0003 A       <INS>   30      .       END=667757;SVTYPE=INS;SEQ=GACAGAGAGTAAAAAGAGAAATTAGGAAAGCATTCTACATGTTGAATAGGAAGACACTGGCCATGTTCGTGCAGCAGCAGTATGTCGTGACATGACATACCTTGGAGAGAAGTTAACAGATGAGGAAGTTGATAAAAATCATCAGAGAAGCAAAATACTGGTAGCGACACTCAAGTAAACCATGAAATTTCCATAACTTATGTCAGCAAAGTGGGAATATTGTACAGTGTGTGTTGAAGTTCCTATACAACATTGTTTATCTGCCTTTTGTTTGTTTGTAAGGAATGTACATACTAAAAGTTCTTCTTGCTGTCAAAAGAATATGCGTGAATAAGTCATTTTAACTTATTCTTCTGTTTTTCTTTTATCTTCCTGCCATCATCCCACAGCCTTACTTTAGAAATTTCTTTTTTAGAAAATTGAACAAGTGCTCCCTGTGGTGGCACATACCTCGAGGAtgggaggcagggtggaagggtcacttgaggccattagtttgacaccagcctggccaacaaagtgagaccccgtgtctacaaacaatttaaaaattagccaagtatcgtcatgtatacctacagtcccagctaTCTGAACTTACTGAGAATGTTCAGGGCCTGGAGAGAAGGCTGGGAGGCAGGAGCTGGGTCTAAAGAGGCCATTGTAACGATGGAGCTGTGCCTGTGGAGGCTGTTGTGAGGCAGTAGGCTCATCTGCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTGGGCCTGGAGAGGCTGCTGGGAGGTAGGAGCTGGGCCAAAAgatgtaagcacatttgcatttattaggcactttatttgcattattacactgtaatatataataaaataattatagaactcaccataatgtagaatcagtgggcgtgttaagcttgttttcctgcaactggatggtcccacctgagcgtgatgggagaaagtgacagatcaataggtattagattctcataaggacagcgcaacctagatccctcacatgcacggttcacaacagggtgcgttctcctatgagaatctaacgctgctgctcatctgagaaggtggagctcaggcgggaatgtgagcaaaggggagtggctgtaaatacagacgaagcttccctcactccctcactcgacaccgctcacctcctgctgtgtggctccttgcggctccatggctcaggggttggggacccctgCTCAAGTGCATCCAAAGCGACCCTTCCCACACCAGTCTTCACAGTGGTCAAGGGCAGCAACCACTTAGCTCCCAAGGCATGTGCCTCAGCTGGCATTTCGTCACAATCAACAGTAAGTGGTAGCTTGAGTCACTGTGAGGTCACCTACTGGAAATCACCAGCATCCCATTTCCCACTGGCAAAGAGCTCAGCACTGCCCCCGGGAAACCAAACCTATGCCCAAATCCATCTGTGTGGGTGTATCTCCTGGGACCCTTCCTAACAtattagtcagagtccaatcaggaagcataaaccactcaaaagtttaaagtggtaaaatttaatacagagaattattcattgtaacaggtgaacagcataatgagagattggctagcacaaagtaaacagaactctagagaatataggactagcCCAggccaggcatggtggctcaggcctgaaattccagcaatttgagaagctaatgcaggaggattgcttaaggccaggagctagagaccggtctggacgacacagtgagaccctgtctctatccaaaagaagaaaaaagttagctgggggtggtagtgcacacttgtagtcccagctactcggaatgcggaagtttgagcctgggaggtcaaggctgcagtgaggcatgattatgc
cactacagtccagcctggtgacagagcaagaccctgtctcaaagaacaaaaCAACAACAACCATTTACAGACAGAAAAGAAATAGAGCTAATAAGCTGAGGAAAGATGTTgaaatgtgacaagtaaagtaatatgagttcttttgtctatgtaaaataatcaaacaaaaaatgacttactaaattataataccctgtgctggcaaaggtgcagtgaaatgggcaccttcttatactatgaggggtgtttaaattgtgtataagccttcccgggtaaagcctgtcaattttttaaaataatggagacagggtctcaccatactgccatactgcctcctccaactcttggcctcaagcaatcctcctctcttagcctcccaaagtgctaagattatagctgggaggcaccCAAAACCCTGTCAATTTACATCAAGGGTAAGGAGAATGTCCATTCACCATGACTCACAGTAATCTTACTTCTGGGGAGACAATTCAATCTAAACAAAAGGTCATCTGTACACACACAGTAAAAATCTGGGAGTAACTGAAGACAGAGTTGGTAAGTGAAATAAGAAACAGTTATAAGAAATTAAACTATGGTATCAATAGGCACCTGGTAAAAGGTCAGTTGATGTTAGCTGCTACttttttgttgttttgagacagggtctcactctgtcacccaggctggagtgcagaggcctgatcatgactcactgcagtctcagcctccctgggctcaagtgatcctcccacctcagcctcccaagtagctgggactacaggaacatgccaccacactaggctaattcatgtatttttctgtagggatggtgactccccctttgtttccaaggcctatcgcaaactcttggcctcaagccatcctcctgcctcagcctcccaaagtgttgcgattaccagtgtgagccaccacacctggccAGCTGCTACTTTTATCAATATTATTCTTATTCCACTCAATTAAAAATTATTATTTTCAAGGCTATGCAACAGTATGTATCCCACAGCATAATTGTAAAAACATATAGTCgtcgtccctcagtatacagaattagttccagccccccatctctgcatataccaaaatccatgcttactcacgtttcgctgtcacccctctagaatccacgtatacgaaaattccaaatgttagttgggcatagtggcaagcacctgtagtctcagccacgtgggaggttgaggtgggaggatcgcttcagcctggaaggttgaggctgcagtcagctgcgatagcactactacactccagccttggacaacagagggagaccctgtctcagaaaaaaaacaaaataaaaCAGGTTAGAAATTGTAATGAGGTCTGCTGGGCAAAATTCCATATAAGCAAAGTATAAATTAATAAAGCAAATCGTGATAAATTAGTACGATTGACTTTCTGGAGTTTCTGACAATAAAAGTAAGGAAAATGCAGAACACAAA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60        GT      ./.
chr1    672482  INS0004 G       <INS>   30      .       END=672482;SVTYPE=INS;SEQ=GGCAGCAACCACTTAGCTCCCAAGGCATGTGCCTCAGCTGGCATTTCGTCACAATCAACAGTAAGTGGTAGCTTGAGTCACTGTGAGGTCACCTACTGGAAATCACCAGCATCCCATTTCCCACTGGCAAAGAGCTCAGCACTGCCCCCGGGAAACCAAACCTATGCCCAAATCCCATCTGTGTGGGTGTATCTCCTGGGACCCTTCCTAACAtattagtcagagtccaatcaggaagcataaaccactcaaaagtttaaagtggtaaaatttaatacagagaattattcattataacaggtgaacagcataatgagagattggctagcacaaagtaaacagaactctagagaatatggactagcCCAggccaggcatggtggctcagcctgaaattccagcaatttgagaagctaatgcaggaggattgcttaaggccaggagctagagaccggtctggacgacacagtgagaccctgtctctatccaaaagaagaaaaaagttagctgggggtggtagtgcacacttgtagtcccagctactcggaatgcgaagtttgagcctgggaggtcaaggctgcagtgaggcatgattatgccactacagtccagcctggtgacagagcaagacctgtctcaaagaacaaaacaacaacaaCCATTTACAGACAGAAAAGAAATAGAGCTAATAAGCTGAGGAAAGATGTTgaaatgtgacaagtaaagtaatatgagttcttttgtctatgtaaaataatcaaacaaaaaatgacttactaaattataataccctgtgctggcaaaggtgcagtgaaatgggcaccttcttatactatgaggggtgtttaaattgtgtataagccttccgggtaaagcCTGTCAATTTTTTAAAATAAtggagacagggtctcaccatactgccatactgcctcctccaactcttggcctcaagcaatcctcctctcttagcctcccaaagtgctaagattatagctgggaggcaccCAAAACCCTGTCAATTTACATCAAGGGTAAGGAGAATGTCCATTCACCATGACTCACAGTAATCTTACTTCTGGGGAGACAATTCAATCTAAACAAAAGGTCATCTGTACACACACAGTAAAAATCTGGGAGTAACTGAAGACAGAGTTGGTAAGTGAAATAAGAAACAGTTATAAGAAATTAAACTATGGTATCAATAGGCACCTGGTAAAAGGTCAGTTGATGTTAGCTGCTACttttttgttgttttgagacagggtctcactctgtcacccaggctggagtgcagaggcctgatcatgactcactgcagtctcagcctccctgggctcaagtgatcctcccacctcagcctcccaagtagctgggactacaggaacatgccaccacactaggctaattcatgtatttttctgtagggatggtgactccccctttgttccaaggcctatcgcaaactcttggcctcaagccatcctcctgcctcagcctcccaaagtgttgcgattaccagtgtgagccaccacacctggccAGCTGCTACTTTTATCAATATTATTCTTATTCCACTCAATTAAAAATTATTATTTTCAAGGCTATGCAACAGTATGTATCCACAGCATAATTGTAAAAACATATagtcgtcgtcctcagtatacagaattagttccagccccccatctctgcatataccaaaatccatgcttactcacgtttgctgtcacccctctggaatccacgtatacgaaaattccaaatttagttgggcatagtggcaagcacctgtagtctcagccacgtgggaggttgaggtgggaggatcgcttcagcctggaaggttgaggctgcagtcagctgcgatagcactactacactccagccttggacaacagagggagaccctgtctcagaa
aaaaaaaaaaataaaaCAGGTTAGAAACTGTAATGAGGTCTGCTGGGCAAAATTCCATATAAGCAAAGTATAAATTAATAAAGCAAATCGTGATAAATTAGTACGATTGGCTTTCTGGAGTTTCTGACAATAAAAGTAAGGAAAATGCAGAACACAAAGACAGAGAGTAAAAAGAGAAATTAGGAAAGCATTCTACATGTTGAATAGGAAGACACTGGCCATGTTCGTGCAGCGGCAGTATGTCGTGACATGACATACCTTGGAGAGAAGTTAACAGATGAGGAAGTTGATAAAAATCATCAGAGAAGCAAAATACTGGTAGCGACACTCAAGTAAACCATGAAATTTCCATAACTTATGTCAGCAAAGTGGGAATATTGTACAGTGTGTGTTGAAGTTCCTATACAACATTGTTTATCTGCCTTTTGTTTGTTTGTAAGGAATGTAATACTAAAAGTTCTTCTTGCTGTCAAAAGAATATGGTGAATAAGTCATTTTAACTTATTCTTCTGTTTTTCTTTATCTTCCTGCCATCATCCCACAGCCTTACTTTAGAAATTTTTTTTTTAGAAAATTGAACAAGTGCTCCTgtggtggcacatgcctcgaggatgggaggcaggggtggaagggtcacttgaggccattagtttgacaccagcctggccaacaaagtgagaccccgtgtctacaaaacaatttaaaaattagccaagtatcatcatgtatacctacagtcccagctacCTGAACTTACTGAGAAAGTTCAGGCCTGGAGAGAAGGCTGGGAGGCAGGAGCTGGGTCTAAAGAGGCCATTGTAACGATGGAGCTGTGCCTGTGGAGGCTGTTGTGAGGCAGTAGCTCATCTGCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTGGCCTGGAGAGGCTGCCGGGAGGTAGGAGCTGGGCCAAAAgatgtaagcacatttgcatttattaggcactttatttccattattacactgtaatatataataaaataattatagaactcaccataatgtagaatcagtgggcgtgttaagcttgttttcctgcaactggatgtcccacctgagcgtgatgggagaaagtaacagatcaataggtattagattctcataaggacagcgcaacctgatccctcacatgcacggttcacaacagggtgcgttctcctatgagaatctaacgctgctgctcatctgagaaggtggagctcaggcgggaatgtgagcaaaggggagtggctgtaaatacagacgaagcttccctcactccctcactcgacaccgctcacctcctgctgtgtgctccttgcggctccatggctcaggggttggggacccctgCTCAAGTGCATCCAAAGCGACCCTTCCCACACCAGTCTTCACAGTGGTCAA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60  GT      ./.
chr1    750005  INS0005 T       <INS>   30      .       END=750005;SVTYPE=INS;SEQ=GGAAATGTTAATTCTGAAAATAGGTTTCACATCTTTTTTTTAACTTATATAAAATTGACTGGATTTCTCTTCTGTGTGTTGTGTTAGATATTTAGGA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60 GT      ./.
chr1    766597  DEL0001 T       <DEL>   30      .       END=769116;SVTYPE=DEL;SEQ=TTGtcttatgtttaaaaatgtccttcagtcattgcaggtcacaagcaggctatcagctcagtaattaaaataattcggttcttcatagtgaatgtaattctaaattagattttaagttgtaactccctgcttcagcAATGGTGATGGGGCCTAGAAACCAGAGCACCTGAGCTCCATCCTACAGGGGGCCATACCGGGATCTTTCCATTTTCAGAGGCTTCTCTCTGACAGTGAAGTGTGATGACAGACTTGGGGGCAGGGCAATGGCTAGCTTCTGAAAGCCGCTGGCACTTTAGTGATAAATTTAAATTAAGTGACGGGTAGTGAGGTGTTTGTCAAGGAAAGTGCCGTCCAAATGCTAAATACTGATTATTTCTGCAGCAGTGACTGCAATACCTCACTCAATCTCTGTCTTTCTTGAAGAAGTCATAAATAAACACGATGAATCTATGTAGAAGCGGTAAGTCAGAAAAATCTGTGTGTTTCATTACATAAACAACGGTTTATCATTAATTGACAGGCTTGGATTGGGAGTTGTTAATGAAACTGATGAGATGTTGGACAGATGAGCTCCCTCTTATTTCGAAGAGCTTATCTAGGGCTGAGTCATGGGACCTGATAGCGTCTTGTGGTGCTGTCTTCTTGTAGATATATCCGTGTTTTAGAGGATTTAGTTTTTTAAAATTTCTCTTAGAATGTGAATTTTACAAAAAAGCACTTCCCAAATGGATGATTATTTGAAAAATGAATTGTCAGACAAAACTGACACATCAGTTATGGAGAAAACCCTTCAAGAACTGGCTTTAAATGTGTTTTAGTGGGAGCCACAGTGTGGAGAGAAACAGAAGAGGGAGGAGAGGGCGCCCCTTGTTTCTTCTCTCCACAGCCAGGCCTTCGCCACCTTTCTCAGTGTCTTCAAGAATAAAATGCCTCCGTTGTTGGTTTTAGCTGCTTTTCTCCCTCGGGGTAGGTAAAGTGGTTCCAAAACGACAAGCATCCTGTAAAGTCGGAAGAGCTGTGTCAACATTAAGCTGCGTGACTTTGGCTATGAGGGAAAAAAGGCTGGTGAGTGCAGAGAAGACAGAGCTGTGGCAGGGCTCCTCCCGCCAAGTCGCCATGGAGAGGGGCTGTGAGGTGTCCTTAAACGGCCTGGTCTCCAGGGTGACTCAGGAAGGGCTGAGAGTGGTCAGCTCCCTCACCTGCTAAACCCGCAGCGCCCCGCTCAGCACACACCCTCCACTCTCCAACCTTGCCCAAGTGCTGGTCCGTCACGGCACCAGGACAGGGCATGGAGACTTGGGCTGAttcttttctctcccttcctccctcttttttttcttctctcactcctccttttcctttcctgctgtttcctgctctcctgtttctGTCCTGCAGTGTCTGGAGCTCCAGAGAGGCTGGCCCTGGGGTGGGGTCCACATGGACATGGGCGTAAGCAGGTTTGATGGTCATGGGCATAGGCAGGTTCGATGGCCAGAGTTCTTTCAGCTCACAGTAAgttttgttttgttttgttttgttttgttttgttttgttttgttttagatggagtcttgctttgtcgcccaggctgtagtgcagtggcgtgatcttggctcactgcagcctccaccttagagcaatcctcttgcctcatcctcccgggtagttgggactacatgtgcatgccacatgcctggctaatttttgtatttttagtagagacacggtttcaccatgttggccaggctggtgtccaactcctgacctcaggtgatccatccgcctcagcctcccaaagtgccgggattacaggtatgagccactgcacctggccTCAGCTGACAGTAGGTTTTAGAGCCAGATATTTACACACTAACTTGCCAGAAACATATATGACTTTATTATTCTAATTGAT
TTTAAGAGATATTATGAACTCAAATCCAAAGTTACGTCCCACCTATCATGACAATTTCATTAAGGAAAAAGTCAAACCATTTTGGAAATGATTTAAGTGAGCAACTTGGAAAAATTTTCTACATTCCTAACTTACTTTCCAGGGGATCGTTCCTGACTTAACATCTATCAGGTGTCTTAGCTTAGCTCTCTTTTTACTTCAGGTTTTTCTTGCCTCCTCAGTGTGCTGGGAGTCCCACTCCACTCAAATGCCCTCAGGTCTAATAATTAACTTCATTGCAGGCTCCTGGCAGGCCTGGGTGGGCGGCAGCTGCATTGTGCTCCTGAAGAAGATTTAAGTTGGGTTTGGTGAACTGGTAGAATTTGCATTTTGCTGTTTCTTTCCCTCTCCCAGAATTTGTACCTTTAAATAGGTTTTTTAGTGTCATTAAGTATATCAAAAGGAAACCCAGTGGGGCAAATTGGCCGGGCTccatagaggtggccttgtctaagcctttcatcttatcgataaggaaagacaggaccagagaagtCGCCGACTGTCCCTGGTCCCACTGCTTGGTTTGGGGCAATTTCCTGAAAATAATATCCAAGATGCA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60   GT      ./.
chr1    1131343 INS0006 A       <INS>   30      .       END=1131343;SVTYPE=INS;SEQ=TGGGGTCTCACCATGTTGGCCAGGCTGgtctcaaactcctgagctcaagcgatcctcctgcctcagcctcccaaagtgctgggactacaggtgtgagccatgcgcccgaccaatttgtgtatttttagtagagatggggtctcaccatgttggccaggctggtctcaaactcctgagctcaagcgatcctcctgcctcagcctcccaaagtgctgggactacaggtgtgagccacgcgcctgaccAACTTGTGTATTTCTAGTAGAG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60      GT      ./.
chr1    1176055 INS0007 C       <INS>   30      .       END=1176055;SVTYPE=INS;SEQ=CAGCcccccgcccccattcaccccggccgtggtccctgccccagcccccgccgcccccattcaccccggccgtggtccctgccc;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60     GT      ./.
chr1    1183435 DEL0002 A       <DEL>   30      .       END=1183699;SVTYPE=DEL;SEQ=GGCTGGATCTCCAACTCTGACCTACAGGCAGGAAAGTGGGCAGCCCTGGGAGGCTGGACTGAGGGAGGCTGGACTTCCCACTCAGGCCTACACGCAGGAAAATGGGCAGCCCTGGGAGGCTGGACCGAGGGAGGCTGGGCCTCCCACTCCACCCTACAGGCCAGGACACGGGCAGCCCTGGGAGGCTAGACCGAGGGAGGCTGGGCCTCCCATCTACCCTACAGGCCGGGACACAGGCAGCCCTGGGAGGCTGTACCGAGGGAG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60 GT      ./.
chr1    1413234 INS0008 C       <INS>   30      .       END=1413234;SVTYPE=INS;SEQ=CccaccacgcctggctaatgttgtattttagtagagacggggtttctccatgttggtcaggctggtctctaactcccgacctcaggtgatccacccgcctcggcctctcaaactgttgggattacaggcatgT;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60    GT      ./.
chr1    1430045 INS0009 A       <INS>   30      .       END=1430045;SVTYPE=INS;SEQ=TGGTGTGCTGCTGCCCCTGCACCCCGTGAGATGAATCCTGCCTCTGGGAGGTACAGCTTCCTGGAGGGGTGGCCCTGTGAGCATCTGCGTAGCCCCTCTCCTCTGCTGGGCCCTGGGTGACGTGCAGCCACTCGGGTGGACCCTGAGGGTCCCTGCACCTGTTTGCCCTCTCTTGGGTGGGCTCAAGACCAAAAATGATGTTGAGCAGTCCTGGGCCCCTGAGCCACAGTGGCGGTGCGGCTCCGGTCAGTGTCTCCTGCGCTCCCGGGCCCCCGACCCACAGTGGCGGTCCGGCTCTGGTCAGTGTCTCCTGCGCTCCCGGGCCCCCGACCCACAGTGGCGGTCCGGCTCCGGTCGGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGGCAGAGGGGACGGGCACCACGTGGTCATCCCCATGACAGGTTCTGTCATGGTGACAGTGTTGTGGAGGA;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60     GT      ./.
chr1    1501064 INS00010        T       <INS>   30      .       END=1501064;SVTYPE=INS;SEQ=GGTGCAGGCAGAGAACAGACGTCGCGATGGGCCCGACGGTGCTGGCTCCATGGGAACCGAGACCCAACACCCAAAGGAGTCCCACAGGCTCAGGGG;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60 GT      ./.

The error isn't really helpful. What am I doing wrong?

Compatibility of paragraph and manta

Hello, thanks a lot for making paragraph!

I am still figuring out how it works, so I started by taking one .vcf file generated by manta and used the exact same .bam file to genotype the variants I had called (just to check the consistency of manta and paragraph), but I got back an error:

Exception: Distance between vcf position and chrom start is smaller than read length.

I tried to dig a bit, but the code did not make it clear why manta has no trouble calling variants close to the scaffold edges while paragraph does. I removed all variants that start fewer than 150 bases from the start of a scaffold and restarted genotyping, and now it seems to run.
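The workaround described above can be sketched as a small pre-filter. This is only an illustration: the function name and the default threshold of 150 (taken from the read length mentioned in the post) are assumptions, and the snippet says nothing about how Paragraph itself treats variants near the end of a scaffold.

```python
def split_by_distance_from_start(vcf_lines, min_distance=150):
    """Split VCF lines into (kept, dropped) by distance from the contig start.

    min_distance is an assumed threshold (the read length from the post).
    Header lines (starting with '#') are always kept; blank lines are skipped.
    """
    kept, dropped = [], []
    for line in vcf_lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            kept.append(line)
            continue
        pos = int(line.split("\t")[1])  # column 2 of a VCF record is the 1-based POS
        if pos >= min_distance:
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped
```

Writing the dropped records to a separate file makes it easy to see how many candidate SVs are lost to this restriction.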

So, I wanted to ask: what is the reason for this restriction? Why is it not possible to genotype SVs close to scaffold borders? And would it be worth making manta and paragraph compatible?

Thanks

background

I have a bunch of Illumina reseq data (1 reference, 5 reseq individuals) with reasonable coverage (~60x ref, ~15x reseq). I am working with a non-model species, i.e. one without a good library of SVs, but I still think that genotyping individuals is a far smarter idea than just merging SV calls. I am trying to figure out the best way to build a library of SVs out of SV calls, which I will then feed to Paragraph to genotype the same data against the pooled variants found in the population.

I was thinking before about using SURVIVOR, but the merging does not explicitly resolve the sequences of SVs (discussed here). Now I am thinking about just concatenating the .vcf files of all 6 individuals while filtering out only the exact overlaps. I am not sure what the best approach is here; any input is welcome.

Using more than one input VCF file

I have paragraph working using multigrmpy.py on test data using one VCF as input. My use case is actually having two different VCF files for the same sample, made using independent methods. Is it possible for paragraph to use both VCFs, i.e. build a graph from the union of all the variants from both VCF files, and then genotype? I've been trying out the vcf2paragraph.py and addVariants.py scripts, but have not managed to do it.

INS error

Hi, I'm trying paragraph for genotyping with the following command:

python ~/benchmark/tools/paragraph/paragraph-tools-build/bin/multigrmpy.py -i ~/benchmark/all_sv_grc37.vcf -m samples.txt -r ~/dataset/human_g1k_v37_gatk.fasta -o 50x

but I receive the following error:

Traceback (most recent call last):
  File "/home/asoylev/benchmark/tools/paragraph/paragraph-tools-build/bin/multigrmpy.py", line 34, in <module>
    from grm.vcf2paragraph import convert_vcf_to_json
  File "/mnt/compgen/homes/asoylev/benchmark/tools/paragraph/paragraph-tools-build/lib/python3/grm/vcf2paragraph/__init__.py", line 32, in <module>
    from grm.vcfgraph import VCFGraph, NoVCFRecordsException
  File "/mnt/compgen/homes/asoylev/benchmark/tools/paragraph/paragraph-tools-build/lib/python3/grm/vcfgraph/__init__.py", line 21, in <module>
    from grm.vcfgraph.vcfgraph import VCFGraph, NoVCFRecordsException
  File "/mnt/compgen/homes/asoylev/benchmark/tools/paragraph/paragraph-tools-build/lib/python3/grm/vcfgraph/vcfgraph.py", line 178
    f"Missing key {ins_info_key} for <INS> at {self.chrom}:{vcf.pos}; ")

Below is an INS line in the input VCF:

1 10028610 nssv14474350 A INS:ME:ALU . . DBVARID;SVTYPE=INS;END=10028610;SVLEN=140;EXPERIMENT=1;SAMPLE=HG00733;REGIONID=nsv3326290;SEQ=aggtcaggagtttgagaccagcctggccaacgtggtgaaaccccgactctactaaaaaaaaaagaacaaaaattaggcctggcgcggtggctcacgcctgtaatcccagcactttgggaggccgaggcgggcagatcacG;Eichler

any idea?

Thanks,
Arda

Missing key {ins_info_key} for <INS> at {self.chrom}:{vcf.pos};

I'm trying to run Paragraph on a small VCF file (~100 entries) with no Insertions, but I'm getting the below error:

  File "/share/Codes/binaries/Paragraph/bin/multigrmpy.py", line 34, in <module>
    from grm.vcf2paragraph import convert_vcf_to_json
  File "/share/Codes/binaries/Paragraph/lib/python3/grm/vcf2paragraph/__init__.py", line 32, in <module>
    from grm.vcfgraph import VCFGraph, NoVCFRecordsException
  File "/share/Codes/binaries/Paragraph/lib/python3/grm/vcfgraph/__init__.py", line 21, in <module>
    from grm.vcfgraph.vcfgraph import VCFGraph, NoVCFRecordsException
  File "/share/Codes/binaries/Paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 178
    f"Missing key {ins_info_key} for <INS> at {self.chrom}:{vcf.pos}; ")
                                                                      ^
SyntaxError: invalid syntax

All VCF entries have END and SEQ in the INFO field. Here's the header and one entry in the file:

##fileformat=VCFv4.2
##source=LUMPY
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=STRANDS,Number=.,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS95,Number=2,Type=Integer,Description="Confidence interval (95%) around POS for imprecise variants">
##INFO=<ID=CIEND95,Number=2,Type=Integer,Description="Confidence interval (95%) around END for imprecise variants">
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends">
##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Sequence of the structural variation">
##INFO=<ID=SECONDARY,Number=0,Type=Flag,Description="Secondary breakend in a multi-line variants">
##INFO=<ID=SU,Number=.,Type=Integer,Description="Number of pieces of evidence supporting the variant across all samples">
##INFO=<ID=PE,Number=.,Type=Integer,Description="Number of paired-end reads supporting the variant across all samples">
##INFO=<ID=SR,Number=.,Type=Integer,Description="Number of split reads supporting the variant across all samples">
##INFO=<ID=BD,Number=.,Type=Integer,Description="Amount of BED evidence supporting the variant across all samples">
##INFO=<ID=EV,Number=.,Type=String,Description="Type of LUMPY evidence contributing to the variant call">
##INFO=<ID=PRPOS,Number=.,Type=String,Description="LUMPY probability curve of the POS breakend">
##INFO=<ID=PREND,Number=.,Type=String,Description="LUMPY probability curve of the END breakend">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SU,Number=1,Type=Integer,Description="Number of pieces of evidence supporting the variant">
##FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end reads supporting the variant">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads supporting the variant">
##FORMAT=<ID=BD,Number=1,Type=Integer,Description="Amount of BED evidence supporting the variant">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00514
chr1    180321  .       C       <DEL>   30      PASS    END=180372;SVTYPE=DEL;SVLEN=-51;SEQ=CATAACCCTAAAACGCTAACCCTCATCCTCACCCTCACACCTCACCCTCAC;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;SU=0;PE=0;SR=0      GT:SU:PE:SR     ./.:0:0:0
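The SyntaxError above points at the closing quote of an f-string on line 178 of vcfgraph.py. F-strings were introduced in Python 3.6, so an older interpreter fails at that line at import time, before any VCF record is read. A quick check, assuming the interpreter version is the cause here (the helper name is hypothetical):

```python
import sys


def supports_fstrings(version_info=sys.version_info):
    """Return True if the given interpreter version can parse f-strings (3.6+)."""
    return tuple(version_info[:2]) >= (3, 6)


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    print("python", version, "f-strings supported:", supports_fstrings())
```

Running this with the same interpreter that launches multigrmpy.py (for example `python` versus `python3`) shows whether the two commands resolve to different versions.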

Runtime exceeds expectations

I have been testing Paragraph and have been experiencing longer than expected runtimes. The stated performance is:

It typically takes up to a few seconds to genotype a single event in one sample (single-threaded). It took us 30 minutes to genotype ~20,000 SVs using 20 CPU cores (with I/O).

This works out to roughly 1000 SVs per Core per 30min (1.8 seconds/sv single core).

My setup:

  • Paragraph version 2.2b via Docker
  • Google Cloud instance with 8 VCPU @ 2.3GHz, 7.2GB RAM, Local SSD
  • Single sample (23GB CRAM)
  • GRCh38 reference w alt+decoy
  • 1042 Insertions (called only on Chr 1-22,X,Y)

Results:

Threads   Runtime (min)   Seconds/SV/core
1         123             7.08
2         64              7.37
4         34              7.83
8         26              11.98
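The Seconds/SV/core column is the wall-clock time in seconds multiplied by the number of threads and divided by the number of SVs. A short sketch reproducing both the stated reference figure and the measured values (the function name is an assumption):

```python
def seconds_per_sv_per_core(runtime_min, threads, n_svs):
    """Core-seconds per SV: wall-clock seconds x threads / number of SVs."""
    return runtime_min * 60 * threads / n_svs


# Stated reference: ~20,000 SVs in 30 minutes on 20 cores
print(round(seconds_per_sv_per_core(30, 20, 20000), 2))  # 1.8

# Measured runs on 1042 insertions: (threads, runtime in minutes)
for threads, minutes in [(1, 123), (2, 64), (4, 34), (8, 26)]:
    print(threads, round(seconds_per_sv_per_core(minutes, threads, 1042), 2))
```

The per-core cost stays close to 7 seconds up to 4 threads and rises to about 12 seconds at 8 threads, so scaling falls off at the highest thread count on this 8-vCPU instance.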

Do you have any feedback as to why these runtimes seem significantly slower than your suggested times?

Cheers,
Wayne

Pysam error

When running paragraph on a test vcf with just one variant row, this error is triggered. Any suggestions?

2020-03-05 19:38:20,888 ERROR    Traceback (most recent call last):
2020-03-05 19:38:20,889 ERROR      File "/share/pkg.7/paragraph/2.4a/install/bin/multigrmpy.py", line 340, in run    vcfupdate.update_vcf_from_grmpy(vcf_input_path, grmpyOutput, result_vcf_path, sample_names)
2020-03-05 19:38:20,889 ERROR      File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfupdate.py", line 218, in update_vcf_from_grmpy    record.samples[sample][k] = v
2020-03-05 19:38:20,889 ERROR      File "pysam/libcbcf.pyx", line 3455, in pysam.libcbcf.VariantRecordSample.__setitem__
2020-03-05 19:38:20,889 ERROR      File "pysam/libcbcf.pyx", line 859, in pysam.libcbcf.bcf_format_set_value
2020-03-05 19:38:20,890 ERROR      File "pysam/libcbcf.pyx", line 583, in pysam.libcbcf.bcf_check_values
2020-03-05 19:38:20,890 ERROR    TypeError: values expected to be 3-tuple, given len=1
Traceback (most recent call last):
  File "/share/pkg.7/paragraph/2.4a/install/bin/multigrmpy.py", line 353, in <module>
    main()
  File "/share/pkg.7/paragraph/2.4a/install/bin/multigrmpy.py", line 349, in main
    run(args)
  File "/share/pkg.7/paragraph/2.4a/install/bin/multigrmpy.py", line 340, in run
    vcfupdate.update_vcf_from_grmpy(vcf_input_path, grmpyOutput, result_vcf_path, sample_names)
  File "/share/pkg.7/paragraph/2.4a/install/lib/python3/grm/vcfgraph/vcfupdate.py", line 218, in update_vcf_from_grmpy
    record.samples[sample][k] = v
  File "pysam/libcbcf.pyx", line 3455, in pysam.libcbcf.VariantRecordSample.__setitem__
  File "pysam/libcbcf.pyx", line 859, in pysam.libcbcf.bcf_format_set_value
  File "pysam/libcbcf.pyx", line 583, in pysam.libcbcf.bcf_check_values
TypeError: values expected to be 3-tuple, given len=1

install error

Hi,
I have downloaded the latest version and used the configure file to install it. I am a Linux user and do not have root privileges. It showed "Configuring incomplete, errors occurred."
Is that normal?
Cheers

Unable to compile without errors

Dear paragraph developers,
I am trying to compile paragraph on Ubuntu 14.04 using g++ 7.3.0 and cmake 3.14.0.
When I run:

cd /mnt/cifs01/simone/software/paragraph-build
/mnt/cifs01/simone/software/cmake-3.14.0/bin/cmake ../paragraph
make

I get error message:

[ 22%] Building CXX object src/c++/lib/CMakeFiles/grmpy_common.dir/grm/GraphAligner.cpp.o
In file included from /home/simone/home_disk/software/paragraph/external/gssw/gssw.h:19:0,
                 from /home/simone/home_disk/software/paragraph/src/c++/lib/grm/GraphAligner.cpp:30:
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/smmintrin.h:31:3: error: #error "SSE4.1 instruction set not enabled"
 # error "SSE4.1 instruction set not enabled"
   ^
In file included from /home/simone/home_disk/software/paragraph/src/c++/lib/grm/GraphAligner.cpp:30:0:
/home/simone/home_disk/software/paragraph/external/gssw/gssw.h:67:5: error: ‘__m128i’ does not name a type
     __m128i* pvE;
     ^
/home/simone/home_disk/software/paragraph/external/gssw/gssw.h:68:5: error: ‘__m128i’ does not name a type
     __m128i* pvHStore;
     ^
/home/simone/home_disk/software/paragraph/external/gssw/gssw.h:138:2: error: ‘__m128i’ does not name a type
  __m128i* profile_byte; // 0: none
  ^
/home/simone/home_disk/software/paragraph/external/gssw/gssw.h:140:2: error: ‘__m128i’ does not name a type
  __m128i* profile_word; // 0: none
  ^
make[2]: *** [src/c++/lib/CMakeFiles/grmpy_common.dir/build.make:414: src/c++/lib/CMakeFiles/grmpy_common.dir/grm/GraphAligner.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:486: src/c++/lib/CMakeFiles/grmpy_common.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

I saw a post on GitHub (vgteam/vg#99) where a similar error was solved by setting the CXXFLAGS environment variable, so I tried it out.

export CXXFLAGS=-msse4.1
/mnt/cifs01/simone/software/cmake-3.14.0/bin/cmake ../paragraph
make

That error seems to be partially solved, but then it stops again:

[ 26%] Building CXX object src/c++/lib/CMakeFiles/grmpy_common.dir/grmpy/AlignSamples.cpp.o
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp: In function ‘void grmpy::writeAlignments(Json::Value&, const grmpy::Parameters&, const paragraph::Parameters&, const string&, genotyping::SampleInfo&)’:
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:76:105: error: no matching function for call to ‘regex_replace(const string&, const regex&, const char [2])’
     const std::string safe_sample_name = std::regex_replace(sample.sample_name(), unsafe_characters, "_");
                                                                                                         ^
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:76:105: note: candidates are:
In file included from /usr/include/c++/4.8/regex:62:0,
                 from /home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:49:
/usr/include/c++/4.8/bits/regex.h:2162:5: note: template<class _Out_iter, class _Bi_iter, class _Rx_traits, class _Ch_type> _Out_iter std::regex_replace(_Out_iter, _Bi_iter, _Bi_iter, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
     regex_replace(_Out_iter __out, _Bi_iter __first, _Bi_iter __last,
     ^
/usr/include/c++/4.8/bits/regex.h:2162:5: note:   template argument deduction/substitution failed:
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:76:105: note:   deduced conflicting types for parameter ‘_Bi_iter’ (‘std::basic_regex<char>’ and ‘const char*’)
     const std::string safe_sample_name = std::regex_replace(sample.sample_name(), unsafe_characters, "_");
                                                                                                         ^
In file included from /usr/include/c++/4.8/regex:62:0,
                 from /home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:49:
/usr/include/c++/4.8/bits/regex.h:2182:5: note: template<class _Rx_traits, class _Ch_type> std::basic_string<_Ch_type> std::regex_replace(const std::basic_string<_Ch_type>&, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
     regex_replace(const basic_string<_Ch_type>& __s,
     ^
/usr/include/c++/4.8/bits/regex.h:2182:5: note:   template argument deduction/substitution failed:
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:76:105: note:   mismatched types ‘const std::basic_string<_Ch_type>’ and ‘const char [2]’
     const std::string safe_sample_name = std::regex_replace(sample.sample_name(), unsafe_characters, "_");
                                                                                                         ^
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:82:31: error: no matching function for call to ‘regex_replace(boost::iterators::iterator_value<boost::iterators::transform_iterator<boost::range_detail::default_constructible_unary_fn_wrapper<grmpy::writeAlignments(Json::Value&, const grmpy::Parameters&, const paragraph::Parameters&, const string&, genotyping::SampleInfo&)::__lambda3, std::basic_string<char> >, std::_List_const_iterator<common::Region>, boost::iterators::use_default, boost::iterators::use_default> >::type, const regex&, const char [2])’
         unsafe_characters, "_");
                               ^
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:82:31: note: candidates are:
In file included from /usr/include/c++/4.8/regex:62:0,
                 from /home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:49:
/usr/include/c++/4.8/bits/regex.h:2162:5: note: template<class _Out_iter, class _Bi_iter, class _Rx_traits, class _Ch_type> _Out_iter std::regex_replace(_Out_iter, _Bi_iter, _Bi_iter, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
     regex_replace(_Out_iter __out, _Bi_iter __first, _Bi_iter __last,
     ^
/usr/include/c++/4.8/bits/regex.h:2162:5: note:   template argument deduction/substitution failed:
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:82:31: note:   deduced conflicting types for parameter ‘_Bi_iter’ (‘std::basic_regex<char>’ and ‘const char*’)
         unsafe_characters, "_");
                               ^
In file included from /usr/include/c++/4.8/regex:62:0,
                 from /home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:49:
/usr/include/c++/4.8/bits/regex.h:2182:5: note: template<class _Rx_traits, class _Ch_type> std::basic_string<_Ch_type> std::regex_replace(const std::basic_string<_Ch_type>&, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
     regex_replace(const basic_string<_Ch_type>& __s,
     ^
/usr/include/c++/4.8/bits/regex.h:2182:5: note:   template argument deduction/substitution failed:
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:82:31: note:   mismatched types ‘const std::basic_string<_Ch_type>’ and ‘const char [2]’
         unsafe_characters, "_");
                               ^
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:99:90: error: no matching function for call to ‘regex_replace(std::string&, const regex&, const char [2])’
     const std::string safe_graph_id = std::regex_replace(graph_id, unsafe_characters, "_");
                                                                                          ^
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:99:90: note: candidates are:
In file included from /usr/include/c++/4.8/regex:62:0,
                 from /home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:49:
/usr/include/c++/4.8/bits/regex.h:2162:5: note: template<class _Out_iter, class _Bi_iter, class _Rx_traits, class _Ch_type> _Out_iter std::regex_replace(_Out_iter, _Bi_iter, _Bi_iter, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
     regex_replace(_Out_iter __out, _Bi_iter __first, _Bi_iter __last,
     ^
/usr/include/c++/4.8/bits/regex.h:2162:5: note:   template argument deduction/substitution failed:
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:99:90: note:   deduced conflicting types for parameter ‘_Bi_iter’ (‘std::basic_regex<char>’ and ‘const char*’)
     const std::string safe_graph_id = std::regex_replace(graph_id, unsafe_characters, "_");
                                                                                          ^
In file included from /usr/include/c++/4.8/regex:62:0,
                 from /home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:49:
/usr/include/c++/4.8/bits/regex.h:2182:5: note: template<class _Rx_traits, class _Ch_type> std::basic_string<_Ch_type> std::regex_replace(const std::basic_string<_Ch_type>&, const std::basic_regex<_Ch_type, _Rx_traits>&, const std::basic_string<_Ch_type>&, std::regex_constants::match_flag_type)
     regex_replace(const basic_string<_Ch_type>& __s,
     ^
/usr/include/c++/4.8/bits/regex.h:2182:5: note:   template argument deduction/substitution failed:
/home/simone/home_disk/software/paragraph/src/c++/lib/grmpy/AlignSamples.cpp:99:90: note:   mismatched types ‘const std::basic_string<_Ch_type>’ and ‘const char [2]’
     const std::string safe_graph_id = std::regex_replace(graph_id, unsafe_characters, "_");
                                                                                          ^
make[2]: *** [src/c++/lib/CMakeFiles/grmpy_common.dir/build.make:492: src/c++/lib/CMakeFiles/grmpy_common.dir/grmpy/AlignSamples.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:486: src/c++/lib/CMakeFiles/grmpy_common.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

I know Ubuntu 14.04 is not amongst the distributions you have tested paragraph on.
Do you have any ideas about how I could solve it? Thanks.
Simone
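One detail in the log above: although the post mentions g++ 7.3.0, every compiler header path in the errors contains 4.8 (for example /usr/include/c++/4.8/regex). That would be consistent with cmake having picked up the system GCC 4.8, whose libstdc++ ships an incomplete <regex> (complete from GCC 4.9). The sketch below pulls the GCC versions out of a build log so such a mismatch is easy to spot. The helper name and the path patterns are assumptions based on the paths in this log.

```python
import re

# Matches libstdc++ and GCC-internal header directories such as
# /usr/include/c++/4.8/ or /usr/lib/gcc/x86_64-linux-gnu/4.8/include/
_GCC_PATH = re.compile(r"/(?:include/c\+\+|lib/gcc/[^/\s]+)/(\d+(?:\.\d+)*)/")


def gcc_versions_in_log(log_text):
    """Return the sorted set of GCC versions implied by header paths in a build log."""
    return sorted(set(_GCC_PATH.findall(log_text)))
```

If the result differs from the compiler that was meant to be used, passing -DCMAKE_CXX_COMPILER and -DCMAKE_C_COMPILER to cmake in a fresh build directory is the usual way to select it explicitly.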

Genotyping unresolved insertions using reads around the breakpoint

Hi,

Some SV callers can identify unresolved insertions and report the partially assembled sequence around the breakpoint using the LEFT_SVINSSEQ and RIGHT_SVINSSEQ tags in the VCF. paragraph doesn't seem to be able to genotype such insertions, but since the algorithm uses reads around the breakpoint, it seems like it should be able to do so. Could I request such a feature in the next release?

Thanks,
Mo
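Until such a feature exists, it can help to separate these records from fully resolved insertions before genotyping. The sketch below flags INS records that carry LEFT_SVINSSEQ or RIGHT_SVINSSEQ but no SEQ key. The function names are hypothetical, and treating a missing SEQ as "unresolved" is an assumption drawn from the "Missing key ... for <INS>" error quoted earlier on this page.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict; flag entries map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def is_unresolved_insertion(vcf_line):
    """True for an INS record with partial flank assemblies and no full SEQ."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # column 8 of a VCF record is INFO
    has_flanks = "LEFT_SVINSSEQ" in info or "RIGHT_SVINSSEQ" in info
    return info.get("SVTYPE") == "INS" and has_flanks and "SEQ" not in info
```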

Native build seems to point to wrong boost library

I am attempting to build Paragraph (not from the docker file). I am using centos7, python 3.6, gcc/g++ 6.4.0, and cmake 3.12.1. I have pointed to the correct boost installation, installed as instructed, by setting $BOOST_ROOT, and cmake seems to recognize this based on the version number it reports finding. Note that below I am also setting -DCMAKE_INCLUDE_PATH, as otherwise I get the error -- Could NOT find LibLZMA (missing: LIBLZMA_INCLUDE_DIR). The directory I include has lzma.h and an lzma folder, and including it allows cmake to find these.

export BOOST_ROOT=/home-4/[email protected]/lib/boost_1_65_0_install
cmake ../paragraph_v2.2 -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_C_COMPILER=`which gcc` -DBOOST_ROOT=$BOOST_ROOT -DCMAKE_SYSTEM_LIBRARY_PATH=/software/centos7/usr/lib64 -DCMAKE_INCLUDE_PATH=/software/centos7/usr/include

-- using compiler: g++ version 6.4.0
-- Found LibLZMA: /software/centos7/usr/include (found version "5.2.2") 
Using included htslib
-- Configuring done
-- Generating done
-- Build files have been written to: /home-net/home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/htslib-build
[100%] Built target htslib
-- Configuring done
-- Generating done
-- Build files have been written to: /home-net/home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/googletest-build
Scanning dependencies of target googletest
[100%] Built target googletest
-- Configuring done
-- Generating done
-- Build files have been written to: /home-net/home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/graphtools-build
Scanning dependencies of target graphtools
[100%] Built target graphtools
-- Boost version: 1.65.0
-- Found the following Boost libraries:
--   program_options
--   filesystem
--   system
-- Configuring done
-- Generating done
-- Build files have been written to: /home-net/home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/spdlog-build
[100%] Built target spdlog
-- Boost version: 1.65.0
-- Found the following Boost libraries:
--   iostreams
--   program_options
--   filesystem
--   system
--   regex
-- Configuring done
-- Generating done
-- Build files have been written to: /home-4/[email protected]/bin/packages/paragraph_v2.2_build

However, when I run make, I get the following errors. This is just a snippet: the many errors are all similar, and it looks to me as though headers are being picked up from my included /software/centos7/usr/include/ folder that should not be, including an older boost installation located there:

[  2%] Built target external
[  3%] Building CXX object src/c++/lib/CMakeFiles/grmpy_common.dir/common/Alignment.cpp.o
In file included from /software/centos7/usr/include/boost/assert.hpp:50:0,
                 from /software/centos7/usr/include/boost/system/error_code.hpp:16,
                 from /software/centos7/usr/include/boost/filesystem/path_traits.hpp:23,
                 from /software/centos7/usr/include/boost/filesystem/path.hpp:25,
                 from /software/centos7/usr/include/boost/filesystem.hpp:16,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Error.hh:176,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/BCFHelpers.hh:58,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/variant/RefVar.hh:46,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Alignment.hh:43,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/lib/common/Alignment.cpp:36:
/software/centos7/usr/include/assert.h:68:13: error: redundant redeclaration of ‘void __assert_fail(const char*, const char*, unsigned int, const char*)’ in same scope [-Werror=redundant-decls]
 extern void __assert_fail (const char *__assertion, const char *__file,
             ^~~~~~~~~~~~~
In file included from /software/apps/compilers/gcc/6.4.0/include/c++/6.4.0/cassert:44:0,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/spdlog-src/include/spdlog/fmt/bundled/format.h:31,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/spdlog-src/include/spdlog/fmt/fmt.h:21,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/spdlog-src/include/spdlog/fmt/ostr.h:11,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Error.hh:44,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/BCFHelpers.hh:58,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/variant/RefVar.hh:46,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Alignment.hh:43,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/lib/common/Alignment.cpp:36:
/software/centos7/usr/include/assert.h:68:13: note: previous declaration of ‘void __assert_fail(const char*, const char*, unsigned int, const char*)’
 extern void __assert_fail (const char *__assertion, const char *__file,
             ^~~~~~~~~~~~~
In file included from /software/centos7/usr/include/boost/assert.hpp:50:0,
                 from /software/centos7/usr/include/boost/system/error_code.hpp:16,
                 from /software/centos7/usr/include/boost/filesystem/path_traits.hpp:23,
                 from /software/centos7/usr/include/boost/filesystem/path.hpp:25,
                 from /software/centos7/usr/include/boost/filesystem.hpp:16,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Error.hh:176,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/BCFHelpers.hh:58,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/variant/RefVar.hh:46,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Alignment.hh:43,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/lib/common/Alignment.cpp:36:
/software/centos7/usr/include/assert.h:73:13: error: redundant redeclaration of ‘void __assert_perror_fail(int, const char*, unsigned int, const char*)’ in same scope [-Werror=redundant-decls]
 extern void __assert_perror_fail (int __errnum, const char *__file,
       ^~~~~~~~~~~~~~~~~~~~
In file included from /software/apps/compilers/gcc/6.4.0/include/c++/6.4.0/cassert:44:0,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/spdlog-src/include/spdlog/fmt/bundled/format.h:31,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/spdlog-src/include/spdlog/fmt/fmt.h:21,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2_build/external/spdlog-src/include/spdlog/fmt/ostr.h:11,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Error.hh:44,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/BCFHelpers.hh:58,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/variant/RefVar.hh:46,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/include/common/Alignment.hh:43,
                 from /home-4/[email protected]/bin/packages/paragraph_v2.2/src/c++/lib/common/Alignment.cpp:36:
/software/centos7/usr/include/assert.h:73:13: note: previous declaration of ‘void __assert_perror_fail(int, const char*, unsigned int, const char*)’

...


/software/centos7/usr/include/assert.h:80:13: note: previous declaration of ‘void __assert(const char*, const char*, int)’
 extern void __assert (const char *__assertion, const char *__file, int __line)
             ^~~~~~~~
cc1plus: all warnings being treated as errors
make[2]: *** [src/c++/lib/CMakeFiles/grmpy_common.dir/common/Alignment.cpp.o] Error 1
make[1]: *** [src/c++/lib/CMakeFiles/grmpy_common.dir/all] Error 2
make: *** [all] Error 2

How to prepare vcf file for Duplication & Tandem Duplication?

Hi,
I'm trying paragraph for genotyping DUP & TDUP with the following command:
python3 ~/miniconda3/pkgs/paragraph-2.3-h8908b6f_0/bin/multigrmpy.py -i TDUP.vcf -m samples.txt -r ~/reference/genome.fa -o TDUP

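Here are some of the contents of the samples.txt and TDUP.vcf files:

[Screenshots of samples.txt and TDUP.vcf not preserved.]

However, 75% of the genotypes are missing when genotyping DUP & TDUP.

[Screenshot of the genotyping output not preserved.]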

Could you give me some advice? Thanks!

Zhiliang

How to reuse variants converted from vcf to json for population genotyping?

The repository README recommends running multigrmpy.py independently for several samples when running an analysis at population scale.

My understanding is that multigrmpy.py first converts the input vcf to a set of .json files written to a temporary directory. These .json files are then used by the grmpy program to carry out genotyping.

If my understanding is right, then the conversion step is conducted as many times as multigrmpy.py is launched, whereas we only really need conversion to happen once for a given vcf. This results in wasted computing time and a lot more temporary files than what is needed, which causes problems on my system because the large number of temporary files becomes hard to manage.

Therefore, I would like to know if there is a way that I could use the tools provided by Paragraph to first convert the vcf file to a set of .json files, and then use those as input for genotyping. I believe this would not be too complicated, but I can't figure out how to do this based on the information that is provided.
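
One way to avoid repeating the conversion, independent of Paragraph's own tooling, is to key the converted output on a hash of the VCF contents and reuse it whenever it already exists. The sketch below only illustrates that convert-once pattern. `convert` is a hypothetical callable standing in for whatever performs the VCF-to-JSON step, and `cached_conversion` and the cache layout are assumptions, not part of Paragraph.

```python
import hashlib
from pathlib import Path


def cached_conversion(vcf_path, cache_dir, convert):
    """Run convert(vcf_path, out_path) only once per distinct VCF content.

    `convert` is a hypothetical stand-in for the VCF-to-JSON step.
    """
    vcf_path = Path(vcf_path)
    # Identify the input by content so renamed copies share one cache entry.
    digest = hashlib.sha256(vcf_path.read_bytes()).hexdigest()
    out_path = Path(cache_dir) / f"{digest}.json"
    if not out_path.exists():
        out_path.parent.mkdir(parents=True, exist_ok=True)
        convert(vcf_path, out_path)
    return out_path
```

Each per-sample job would call `cached_conversion` with the same cache directory, and only the first call performs the conversion. With concurrent jobs, running the conversion once before submitting them avoids a race on the cache file.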

error reference_sequence.empty

Hello again,

My testing on a few individuals passed (#42), but when I ran it on all the data I have, I ran into one more issue:

[2020-05-08 15:03:10.357] [Genotyping] [16979] [info] [Done with alignment step 1250 total aligned (path: 0 [0 anchored] kmers: 0 / ksw: 0 / gssw: 1037) ; 213 were filtered]
[2020-05-08 15:03:10.358] [Genotyping] [16979] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:10.360] [Genotyping] [16973] [info] [Done with alignment step 1250 total aligned (path: 0 [0 anchored] kmers: 0 / ksw: 0 / gssw: 1051) ; 199 were filtered]
[2020-05-08 15:03:10.360] [Genotyping] [16973] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:10.362] [Genotyping] [16986] [info] [Done with alignment step 1250 total aligned (path: 0 [0 anchored] kmers: 0 / ksw: 0 / gssw: 1052) ; 198 were filtered]
[2020-05-08 15:03:10.363] [Genotyping] [16986] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:10.373] [Genotyping] [16977] [info] [Done with alignment step 1250 total aligned (path: 0 [0 anchored] kmers: 0 / ksw: 0 / gssw: 1047) ; 203 were filtered]
[2020-05-08 15:03:10.373] [Genotyping] [16977] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:10.379] [Genotyping] [16985] [info] [Done with alignment step 1250 total aligned (path: 0 [0 anchored] kmers: 0 / ksw: 0 / gssw: 1051) ; 199 were filtered]
[2020-05-08 15:03:10.379] [Genotyping] [16985] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:10.481] [Genotyping] [16976] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:10.693] [Genotyping] [16975] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:10.863] [Genotyping] [16974] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:11.080] [Genotyping] [16987] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:11.097] [Genotyping] [16982] [info] [Done with alignment step 1250 total aligned (path: 0 [0 anchored] kmers: 0 / ksw: 0 / gssw: 1042) ; 208 were filtered]
[2020-05-08 15:03:11.100] [Genotyping] [16982] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:11.100] [Genotyping] [16980] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:11.302] [Genotyping] [16972] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:11.524] [Genotyping] [16984] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:11.752] [Genotyping] [16981] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:11.961] [Genotyping] [16978] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:12.149] [Genotyping] [16979] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:12.322] [Genotyping] [16973] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:12.543] [Genotyping] [16986] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:12.791] [Genotyping] [16977] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:13.006] [Genotyping] [16985] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:13.246] [Genotyping] [16982] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:13.514] [Genotyping] [16980] [critical] ERROR: This thread also caught an exception
[2020-05-08 15:03:13.514] [Genotyping] [16972] [warning] WARNING: rethrowing a thread exception 
[2020-05-08 15:03:13.906] [Genotyping] [16972] [critical] Assertion failed: !reference_sequence.empty()

It complains about an empty reference, but I am not sure which SV is causing the trouble.

When I grepped 16972 out of the log file, the last-mentioned variant was tmpumt4us9n.json.zip, and indeed the JSON contains some NNNNNs.

The corresponding reference sequence is:

>4_Tte_b3v08_scaf031309
TAGCGTAATTAGTACTTAAGCGTATAACCCGAACGGGTAGTAAGACGCGGTGCTAAATAA
TAATAATAATAATAATAATCTAAACAATATACAAATTAAGAAGCGTACTATTCAATTCTT
AGTGGCATGGATTTACAAATGATTTGTTGGGTACAAGACAAACTATAATTCAAAGTTGAA
CTAATTTCTAATTGATGTAATTATGTAATATAAGTTATTACAAAAAAAAAAAGTGTGGCG
TCATTAAAATTAAGTTAGATCGTGTGATAAAATAAGCGTGTGTGTGTGTGTGCAAGACGA
CGCAATAAATCACGGTGTTAAAAATGACCGTTACGTCACTGTTTGTCGGCCAATGAATGA
ATCGCCATAGTCATATTTCTGCTAACGTCCGTGAGATCGGATGAAATATAACCTGAGAAC
CGTAGCCTGTTATTTTAGAACTGGAGACAAATAGATGTTAGTTCAACTCTGATGTAACTT
CTAGTACAGAAACAGGTGGTCAGAACATTCACTAGAATCAGGAGGTCCCTGCCGCTAATA
GCCCCCCTCCCGCGAAAACAAACCTTAACATAATAATACAAAGCGGTGTCTCCCAAACTG
GGGTACGCGTGACCCTGGAGGTATGCTACGCCACATGTGTGTGTTTTTTTTAGCACCGCG
TCTTACTACCAGTTCGGGGTATACGCTTGTATACTAATCATTACAGTAGTGGTTTTAAGG
ATTAGGAAGATTATATTTAGAGGAAGTGTACCCGCATTAACGTGGAGAGAGAGTGAAAAA
CCATTTTGGAAAAATAACCTTGGTACATCCAACAAATATTCGAACTTCGATCGCCGCGTC
ATCGGAAATCTAGTCTATTGCGAGAGTAGAGTAGCGACTTAGACCATGCGGCCAATCGTT
GTATTAAACATCTTTTGAATAACTTGTTGGTATTATGTAACTTTAATTCGATTTTGTTTG
GGTCTGTCAAATTCCACCGCGCGTGATTAAAGTGTCGATAGATTAAATCTAGAAGAATTG
ATGTTTTGTGTAATTTCGCTTTCAATCTTAAGCTTTTTTAAAAAAAAAAAAGATGTTGTA
ATGAAGTAGGACTAAGTATTAGGCCATAATACTGTCCAAACAATAATTTTAAACTAGTCA
TCGGACCTTGGGGGGAGAATCACAAGAGTCAACANTAGGAGTTTTATGCATTTCTACAAA
TCAATGCCTTACTTTAGAGATCATTCCCGATGTTTTATGACATTGGGAACTCTCATAATA
ACATCCATATATATTCGGTGATTAACGTAAACTTTATATGTATGTATACATTTTATATAA
CTAGGCATATATATGTATAATTTTACTATATAAAAATAAAAGGAATTGTTTGTCTGTGTT
TGTTTGTGTGCGATGCCCAGCCAAATTTACGGCACGCAGAGATCTAAAAAAATTTAACAT
AGGTGGACGGAAGGGGGTCCGAATGCACCTCGAAGCAGGATTTTTAAATTTTTAATTAGC
TTTTTAATTAGG

The nucleotide range of the variant (1176-1325) should probably be:

TAGGAGTTTTATGCATTTCTACAAATCAATGCCTTACTTTAGAGATCATTCCCGATGTTTTATGACATTGGGAACTCTCATAATAACATCCATATATATTCGGTGATTAACGTAAACTTTATATGTATGTATACATTTTATATAACTAGG

This certainly does not seem to be full of Ns. The sequence was originally masked (I tried Paragraph with both the masked and the unmasked reference, but I did not try remapping the reads to the unmasked reference).

Sorry to bother you again, but I think there is just something rather small I am missing now.
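
In the scaffold sequence quoted above, an N sits immediately before the start of the quoted range (the base preceding TAGGAG...), so a check that also looks at flanking bases can be informative. Whether that base is what triggers the assertion is not established here. The sketch below is a minimal helper for such a check. The function name and the `flank` parameter are assumptions for illustration.

```python
def find_n_positions(sequence, start, end, flank=0):
    """Return 1-based positions of N/n within [start - flank, end + flank].

    `start` and `end` are 1-based and inclusive, as in VCF coordinates.
    """
    lo = max(1, start - flank)
    hi = min(len(sequence), end + flank)
    return [i for i in range(lo, hi + 1) if sequence[i - 1] in "Nn"]
```

Calling it with `flank=0` inspects only the variant interval. A small positive flank also covers the padding base and nearby context.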

idxdepth error

I am trying to use the idxdepth utility, as I'm running Paragraph on a large number of already produced cram files. The human reference used included HLA decoy sequences, which have * and : in their names (I did not choose the reference, nor can I change it at this point). I think idxdepth is failing due to the * or : in the names -- it seems to run fine then errors when it gets to the HLA sequences.

...
[2019-06-19 15:04:39.748] [idxdepth] [3725] [info] Thread 47113698678528 estimating depth for chrUn_JTFH01001997v1_decoy
[2019-06-19 15:04:39.752] [idxdepth] [3722] [info] Thread 47113692374784 done estimating depth for chrUn_JTFH01001976v1_decoy ; DP = 35.23 after 182.823 us
[2019-06-19 15:04:39.752] [idxdepth] [3722] [info] Thread 47113692374784 estimating depth for chrUn_JTFH01001998v1_decoy
[2019-06-19 15:04:39.774] [idxdepth] [3733] [info] Thread 47117047957248 done estimating depth for chrUn_JTFH01001980v1_decoy ; DP = 2.38 after 196.527 us
[2019-06-19 15:04:39.774] [idxdepth] [3733] [info] Thread 47117047957248 estimating depth for HLA-A*01:01:01:01
[2019-06-19 15:04:39.815] [idxdepth] [3736] [info] Thread 47117054260992 done estimating depth for chrUn_JTFH01001983v1_decoy ; DP = 0.01 after 233.495 us
[2019-06-19 15:04:39.815] [idxdepth] [3736] [info] Thread 47117054260992 estimating depth for HLA-A*01:01:01:02N
[2019-06-19 15:04:39.815] [idxdepth] [3729] [info] Thread 47113707083520 done estimating depth for chrUn_JTFH01001984v1_decoy ; DP = 0.02 after 230.673 us
[2019-06-19 15:04:39.815] [idxdepth] [3729] [info] Thread 47113707083520 estimating depth for HLA-A*01:01:38L
[2019-06-19 15:04:39.815] [idxdepth] [3730] [info] Thread 47113709184768 done estimating depth for chrUn_JTFH01001985v1_decoy ; DP = 1.47 after 229.729 us
[2019-06-19 15:04:39.815] [idxdepth] [3730] [info] Thread 47113709184768 estimating depth for HLA-A*01:02
[2019-06-19 15:04:39.815] [idxdepth] [3720] [info] Thread 47113688172288 done estimating depth for chrUn_JTFH01001982v1_decoy ; DP = 0.74 after 234.657 us
[2019-06-19 15:04:39.815] [idxdepth] [3720] [info] Thread 47113688172288 estimating depth for HLA-A*01:03
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoll
Aborted
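
The report suspects that `*` or `:` in contig names breaks the parsing of region strings. The sketch below is not idxdepth's code. It only illustrates, under that assumption, how a `name[:start[-end]]` region can be parsed without being confused by colons inside the name, by checking against the known contig names first. The function name and return shape are hypothetical.

```python
import re

# A coordinate tail is "start" or "start-end", digits only.
_RANGE = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_region(region, known_contigs):
    """Split 'name[:start[-end]]' into (name, start, end).

    Contig names may themselves contain ':' (for example HLA alleles),
    so a whole-string match against known names is tried first.
    """
    if region in known_contigs:
        return region, None, None
    name, sep, tail = region.rpartition(":")
    match = _RANGE.match(tail) if sep else None
    if match and name in known_contigs:
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else None
        return name, start, end
    raise ValueError(f"cannot parse region: {region!r}")
```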

Latest docker image not working

Hi,

Could the latest Docker image be updated? The image from docker pull currently fails with the following error:
/usr/bin/python3: can't open file '/opt/paragraph/bin/runGraphTyping.py': [Errno 2] No such file or directory

Thanks!

Stripping END tag in INFO field

Observed behavior:

For non-symbolic alleles Paragraph seems to be stripping the END tag from the INFO field (see below). This isn't desired behavior as it can impact tools that rely on this tag (for example vcfToBedpe).

Expected behavior:

Preserve the original info fields as they were in the input and only append the GRMPY_ID tag.

Example

Manta
chr1 66160 MantaDEL:8:0:0:0:1:0 TTATATATATATATATTATATATACTATATATTTATATATATTACATATTATATATATAATATATATTATATAATATATATTATATTATATAATATATAATATAAATATAATATAAATTATATTATATAATATATAATATAAATATAATATAAATTATATAAATATAATATATATTTTATTATATAATATAATATATATTATATAAATATAATATATAAATTATATAATATAATATATATTATATAATATAATATATTTTATTATATAAATATATATTATATTATATAATATATATTTTATTATATAATATATATTATATATTTATAGAATATAATATATATTTTATTATATAATATATATTATATAATATATATTATATTTATATATAACATATATTATTATATAAAATATGTATAATATATATTATATAAATATATTTATATATTATATAAA T 196 PASS END=66613;SVTYPE=DEL;SVLEN=-453;CIGAR=1M453D;CIPOS=0,9;HOMLEN=9;HOMSEQ=TATATATAT

Paragraph
chr1 66160 MantaDEL:8:0:0:0:1:0 TTATATATATATATATTATATATACTATATATTTATATATATTACATATTATATATATAATATATATTATATAATATATATTATATTATATAATATATAATATAAATATAATATAAATTATATTATATAATATATAATATAAATATAATATAAATTATATAAATATAATATATATTTTATTATATAATATAATATATATTATATAAATATAATATATAAATTATATAATATAATATATATTATATAATATAATATATTTTATTATATAAATATATATTATATTATATAATATATATTTTATTATATAATATATATTATATATTTATAGAATATAATATATATTTTATTATATAATATATATTATATAATATATATTATATTTATATATAACATATATTATTATATAAAATATGTATAATATATATTATATAAATATATTTATATATTATATAAA T 196 PASS SVTYPE=DEL;SVLEN=-453;CIGAR=1M453D;CIPOS=0,9;HOMLEN=9;HOMSEQ=TATATATAT;GRMPY_ID=chr1.vcf@a66f377e14617d867835ed906c5d6b272b1c404e2263781380e6c6c1da4e9267:1 GT:DP:FT:AD:ADF:ADR:PL 0/0:54:PASS:119,0:70,0:49,0:0,167,781
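
Until the tag is preserved, downstream tools that need END can have it recomputed for sequence-resolved records, since END = POS + len(REF) - 1. For the Manta record above, POS 66160 with a 454-base REF (CIGAR 1M453D) gives 66613, matching the input. The sketch below is a post-processing workaround under that assumption. `restore_end` is a hypothetical helper, not part of Paragraph, and expects tab-separated VCF lines.

```python
def restore_end(vcf_line):
    """Add END=POS+len(REF)-1 to INFO when END is absent and ALT is a sequence.

    Symbolic (<DEL>) and breakend ALT alleles are left untouched because
    their END cannot be derived from the REF length.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    pos, ref, alt, info = int(fields[1]), fields[3], fields[4], fields[7]
    has_end = any(entry.startswith("END=") for entry in info.split(";"))
    symbolic = alt.startswith("<") or "[" in alt or "]" in alt
    if not has_end and not symbolic:
        end_tag = f"END={pos + len(ref) - 1}"
        fields[7] = end_tag if info in (".", "") else f"{end_tag};{info}"
    return "\t".join(fields)
```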

Genotype does not meet VCF 4.2 spec

In rare cases Paragraph returns a genotype that does not match the VCF spec: a single '.' genotype on a non-sex chromosome. For example, I have the following genotypes for 2 samples on chr3. My interpretation is that the second sample should be './.', since it is in a diploid region of the genome.

0/0:40:....:40,0:20,0:20,0:0,124,599 .:0:NO_VALID_GT,UNMATCHED:0,0:0,0:0,0:.,.,.
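
As a stopgap for downstream tools that expect one '.' per allele, the sample column can be padded after genotyping. The sketch below assumes GT is the first FORMAT key (as the VCF spec requires when GT is present) and that the expected ploidy is known to the caller. `pad_missing_gt` is a hypothetical helper, not part of Paragraph.

```python
def pad_missing_gt(sample_field, ploidy=2):
    """Expand a bare '.' genotype to one '.' per allele, e.g. './.' for diploid.

    Only the leading GT sub-field is touched; all other values are kept.
    """
    parts = sample_field.split(":")
    if parts[0] == ".":
        parts[0] = "/".join(["."] * ploidy)
    return ":".join(parts)
```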

Illegal character in reference sequence (W base and possibly other noncanonical bases)

I was running Paragraph on hg38 genomes and ran into this error:

Exception: chr3:90549400:<INV> illegal character in reference sequence

Traceback:

Traceback (most recent call last):
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 199, in add_record
    alt_sequence = ref_sequence[0] + reverse_complement(inv_ref)
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in reverse_complement
    return ''.join([complement[x] for x in seq[::-1]])
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in <listcomp>
    return ''.join([complement[x] for x in seq[::-1]])
KeyError: 'W'
$ samtools faidx /home/dantakli/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa chr3:90549400-91081922 | grep "W"
AAGTTTCTGAGAATCATTCTCTCTTGTTTTTCTGTGAAGWTATTGCCTTTTCTACCATAG

For now I will likely skip this SV, but just letting you all know that it seems that Paragraph doesn't support noncanonical bases.
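
The KeyError comes from a complement table that lacks IUPAC ambiguity codes. The sketch below is a reverse complement that covers them, written as a standalone illustration rather than a patch to vcfgraph.py. Whether the rest of Paragraph accepts ambiguity codes in graph sequences is not known here, so treat it as an assumption.

```python
# IUPAC nucleotide codes and their complements, upper and lower case.
_IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC ambiguity codes.

    Characters outside the table are passed through unchanged.
    """
    return seq.translate(_IUPAC_COMPLEMENT)[::-1]
```

W (A or T) and S (C or G) are their own complements, which is why the W in the chr3 sequence above maps to W.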

Libraries not linking

I'm having trouble with the install at the htslib build step. I'm using Linux CentOS 6.9 and gcc/g++ 5.1.0.

cmake (I'm using version 3.5.0) seems to find my installed lzma library files just fine:

-- Found ZLIB: /home-4/[email protected]/bin/packages/miniconda2/lib/libz.so (found version "1.2.11")
-- Found BZip2: /home-4/[email protected]/bin/packages/miniconda2/lib/libbz2.so (found version "1.0.6")
-- Looking for BZ2_bzCompressInit
-- Looking for BZ2_bzCompressInit - found
-- Looking for lzma_auto_decoder in /home-4/[email protected]/bin/packages/miniconda2/lib/liblzma.so
-- Looking for lzma_auto_decoder in /home-4/[email protected]/bin/packages/miniconda2/lib/liblzma.so - found
-- Looking for lzma_easy_encoder in /home-4/[email protected]/bin/packages/miniconda2/lib/liblzma.so
-- Looking for lzma_easy_encoder in /home-4/[email protected]/bin/packages/miniconda2/lib/liblzma.so - found
-- Looking for lzma_lzma_preset in /home-4/[email protected]/bin/packages/miniconda2/lib/liblzma.so
-- Looking for lzma_lzma_preset in /home-4/[email protected]/bin/packages/miniconda2/lib/liblzma.so - found
-- Found LibLZMA: /home-4/[email protected]/bin/packages/miniconda2/include (found version "5.2.3")

However, it is then unable to find lzma.h.

[ 75%] Performing build step for 'htslib'
cram/cram_io.c:61:18: fatal error: lzma.h: No such file or directory
compilation terminated.
make[3]: *** [cram/cram_io.o] Error 1
gmake[2]: *** [htslib-prefix/src/htslib-stamp/htslib-build] Error 2
gmake[1]: *** [CMakeFiles/htslib.dir/all] Error 2
gmake: *** [all] Error 2
CMake Error at src/cmake/GetHtslib.cmake:37 (message):
Build step for htslib failed: 2
Call Stack (most recent call first):
CMakeLists.txt:33 (include)

-- Configuring incomplete, errors occurred!

Pointing -DCMAKE_INCLUDE_PATH and -DCMAKE_SYSTEM_LIBRARY_PATH to the correct locations seems to have no effect. Manually modifying the #include <lzma.h> in cram_io.c to point to the correct location does fix the immediate problem, but then the following step just fails to find the lzma library.

/usr/bin/ld: cannot find -llzma
collect2: error: ld returned 1 exit status
make[3]: *** [libhts.so] Error 1
gmake[2]: *** [htslib-prefix/src/htslib-stamp/htslib-build] Error 2
gmake[1]: *** [CMakeFiles/htslib.dir/all] Error 2
gmake: *** [all] Error 2
CMake Error at src/cmake/GetHtslib.cmake:37 (message):
Build step for htslib failed: 2
Call Stack (most recent call first):
CMakeLists.txt:33 (include)

Any suggestions on how to get cmake to recognize these installed libraries? I can't figure out why it seems to find them in one step, and then not link to them in subsequent compilation steps.

Error messages for mismatched reference sequences would be helpful

Not so much an issue as a suggestion, but it would be great to have some sort of error message in the following cases regarding references not matching, all of which I have encountered and don't give the most informative errors:

If the reference for the short read bam and vcf use differing notation (eg "chr1" vs "1"), it'd be great if this could either be handled by Paragraph, or if Paragraph could do an initial check and report an error to the user.

If the reference used in multigrmpy.py with -r doesn't match the reference sequences from the header, it yields a very uninformative error of "subprocess.CalledProcessError: grmpy --response-file [tempFile] returned non-zero exit status 1". Having a check which tells the user that the reference used with -r does not match would be extremely helpful; it took me quite a while to figure out I was accidentally using the wrong reference (which used "1" instead of "chr1" etc).
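
The kind of up-front check suggested here can be sketched as a comparison of contig names before genotyping starts. The code below is an illustration, not Paragraph's behavior. `read_fai_names` reads names from a samtools-style .fai index (first tab-separated column), and `compare_contig_names` is a hypothetical helper that reports names missing from the reference and flags the "chr1" vs "1" case specifically.

```python
def read_fai_names(fai_path):
    """Return contig names from a FASTA index (.fai), in file order."""
    with open(fai_path) as handle:
        return [line.split("\t", 1)[0] for line in handle if line.strip()]


def compare_contig_names(vcf_contigs, ref_contigs):
    """Describe VCF contigs that are absent from the reference.

    Returns an empty list when every VCF contig exists in the reference.
    """
    def strip_chr(name):
        return name[3:] if name.startswith("chr") else name

    ref_set = set(ref_contigs)
    ref_stripped = {strip_chr(name) for name in ref_set}
    problems = []
    for name in sorted(set(vcf_contigs) - ref_set):
        if strip_chr(name) in ref_stripped:
            problems.append(f"{name}: reference uses a different 'chr' prefix convention")
        else:
            problems.append(f"{name}: not found in reference")
    return problems
```

The same comparison could be applied between the BAM header names and the reference, given a list of names from each.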

Distance between vcf position and chrom start is smaller than read length.

Hi All,

I tried to run Paragraph on my test dataset, but got the error below:

$ python3 ../../../Tools/paragraph/bin/multigrmpy.py -i ../pbsv_sample_Bonobo_sv.vcf.gz -m sample.txt -r ../ref.fa -o test &
[1] 39497
$ 2020-02-04 16:37:34,250 ERROR VCF to JSON conversion failed.
2020-02-04 16:37:34,303 ERROR Traceback (most recent call last):
2020-02-04 16:37:34,303 ERROR File "../../../Tools/paragraph/bin/multigrmpy.py", line 52, in load_graph_description header, records, event_list = convert_vcf_to_json(args, alt_paths=True)
2020-02-04 16:37:34,303 ERROR File "/net/eichler/vol26/projects/primate_sv/nobackups/Tools/paragraph/lib/python3/grm/vcf2paragraph/__init__.py", line 133, in convert_vcf_to_json header, records, block_ids = parse_vcf_lines(args.input, args.read_length, args.split_type)
2020-02-04 16:37:34,304 ERROR File "/net/eichler/vol26/projects/primate_sv/nobackups/Tools/paragraph/lib/python3/grm/vcf2paragraph/__init__.py", line 209, in parse_vcf_lines raise Exception("Distance between vcf position and chrom start is smaller than read length.")
2020-02-04 16:37:34,304 ERROR Exception: Distance between vcf position and chrom start is smaller than read length.
2020-02-04 16:37:34,305 ERROR Traceback (most recent call last):
2020-02-04 16:37:34,305 ERROR File "../../../Tools/paragraph/bin/multigrmpy.py", line 261, in run graph_files = load_graph_description(args)
2020-02-04 16:37:34,305 ERROR File "../../../Tools/paragraph/bin/multigrmpy.py", line 52, in load_graph_description header, records, event_list = convert_vcf_to_json(args, alt_paths=True)
2020-02-04 16:37:34,305 ERROR File "/net/eichler/vol26/projects/primate_sv/nobackups/Tools/paragraph/lib/python3/grm/vcf2paragraph/__init__.py", line 133, in convert_vcf_to_json header, records, block_ids = parse_vcf_lines(args.input, args.read_length, args.split_type)
2020-02-04 16:37:34,306 ERROR File "/net/eichler/vol26/projects/primate_sv/nobackups/Tools/paragraph/lib/python3/grm/vcf2paragraph/__init__.py", line 209, in parse_vcf_lines raise Exception("Distance between vcf position and chrom start is smaller than read length.")
2020-02-04 16:37:34,306 ERROR Exception: Distance between vcf position and chrom start is smaller than read length.
Traceback (most recent call last):
  File "../../../Tools/paragraph/bin/multigrmpy.py", line 353, in <module>
    main()
  File "../../../Tools/paragraph/bin/multigrmpy.py", line 349, in main
    run(args)
  File "../../../Tools/paragraph/bin/multigrmpy.py", line 261, in run
    graph_files = load_graph_description(args)
  File "../../../Tools/paragraph/bin/multigrmpy.py", line 52, in load_graph_description
    header, records, event_list = convert_vcf_to_json(args, alt_paths=True)
  File "/net/eichler/vol26/projects/primate_sv/nobackups/Tools/paragraph/lib/python3/grm/vcf2paragraph/__init__.py", line 133, in convert_vcf_to_json
    header, records, block_ids = parse_vcf_lines(args.input, args.read_length, args.split_type)
  File "/net/eichler/vol26/projects/primate_sv/nobackups/Tools/paragraph/lib/python3/grm/vcf2paragraph/__init__.py", line 209, in parse_vcf_lines
    raise Exception("Distance between vcf position and chrom start is smaller than read length.")
Exception: Distance between vcf position and chrom start is smaller than read length.

Here is my manifest file:

$ cat sample.txt
id path idxdepth
bonobo_10 aln_realigned_reads.bam aln_realigned_reads.bam.json
$ ls aln_realigned_reads.bam
aln_realigned_reads.bam
$ ls aln_realigned_reads.bam.json
aln_realigned_reads.bam.json

Do you have any idea about it?

Thank you so much.

Best,
Yafei
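
The exception is raised in parse_vcf_lines, which receives args.read_length, so one workaround is to set aside records that lie too close to the start of a contig before running multigrmpy.py. The sketch below assumes the failing condition is POS <= read length. The exact comparison Paragraph uses is not shown in the traceback, and `split_by_start_distance` is a hypothetical helper operating on uncompressed, tab-separated VCF lines.

```python
def split_by_start_distance(vcf_lines, read_length):
    """Separate records whose POS is within read_length of the contig start.

    Header lines are always kept. Returns (kept, dropped) so that the
    dropped records can be inspected or reported.
    """
    kept, dropped = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        pos = int(line.split("\t", 2)[1])
        # Assumed threshold: a record needs more than read_length bases upstream.
        if pos <= read_length:
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

Writing `kept` to a new VCF and keeping `dropped` in a side file makes it easy to see which calls were excluded and why.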

Errors when running make

Hello, I got the error below when running 'make' for Paragraph as described in Installation.md:

[ 64%] Building CXX object external/graphtools-
/gpfs/home/heyaoxi/boost_1_65_0/paragraph-tools-build/external/graphtools-src/src/graphIO/../../external/include/nlohmann/json.hpp:8678:43: error: logical ‘and’ of mutually exclusive tests is always false [-Werror=logical-op]
const bool is_negative = (x <= 0) and (x != 0); // see issue #755
cc1plus: all warnings being treated as errors
make[2]: *** [external/graphtools-build/src/graphIO/CMakeFiles/graphIO.dir/build.make:83: external/graphtools-build/src/graphIO/CMakeFiles/graphIO.dir/GraphJson.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:571: external/graphtools-build/src/graphIO/CMakeFiles/graphIO.dir/all] Error 2
make: *** [Makefile:150: all] Error 2

I saw issue #755 and changed the code as recommended, replacing "const bool is_negative = x < 0;" with "const bool is_negative = std::is_same<NumberType, number_integer_t>::value and (x < 0);", but then I got a new error:
collect2: error: ld returned 1 exit status
make[2]: *** [src/c++/main/CMakeFiles/grmpy.dir/build.make:115: bin/grmpy] Error 1
make[1]: *** [CMakeFiles/Makefile2:653: src/c++/main/CMakeFiles/grmpy.dir/all] Error 2
make: *** [Makefile:150: all] Error 2.

Could anyone help with this?

--Yaoxi

comparison to svtyper

I use svtyper in smoove. After reading your paper, I thought I might replace svtyper with paragraph.
I did a separate evaluation using the GiaB truthset from here
and using truvari.
I evaluated only on deletions > 300 bases.

When genotyping this large-DEL truthset, I get 81% recall from paragraph and 91% with svtyper.
I realize that you used a different call-set and did not limit to Tier 1 regions, but I am surprised the results are so different. Do you have any insight on this?

I used paragraph via the docker image (as updated with my pending pull-request) and the code below:

wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
bcftools view -f "PASS,." -O z -o HG002_SVs_Tier1_v0.6.DEL.vcf.gz -i 'SVTYPE == "DEL" & SVLEN < -40' HG002_SVs_Tier1_v0.6.vcf.gz
# paragraph complains about reference so manually change:
zcat HG002_SVs_Tier1_v0.6.DEL.vcf.gz | awk 'BEGIN{FS=OFS="\t"} ($0 ~ /^#/) { print } ($0 !~ /^#/ ) { $4="N"; $5 = "<DEL>"; print }' | bgzip -c > tmp
mv tmp HG002_SVs_Tier1_v0.6.DEL.vcf.gz
docker run -v $(pwd):/pwd -v /data/human:/data/human 5a75c4ae6ebc -m /pwd/manifest.txt -r /data/human/g1k_v37_decoy.fa --threads 4 -o /pwd/ -i /pwd/HG002_SVs_Tier1_v0.6.DEL.vcf.gz
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed

truth_del=HG002_SVs_Tier1_v0.6.DEL.vcf.gz
ODIR=evaluate/
sizemax=15000000
sizemin=300
bed=HG002_SVs_Tier1_v0.6.bed

set -euo pipefail

rm -r $ODIR
tabix -f  genotypes.vcf.gz
  
python ~/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b $truth_del -c genotypes.vcf.gz -o $ODIR/ --passonly --pctsim=0 -r 20 --giabreport -f /data/human/g1k_v37_decoy.fa --no-ref --includebed $bed -O 0.6
cat $ODIR/summary.txt

zcat $truth_del | ./add_ci 

svtyper -B /data/human/hg002.cram -T /data/human/g1k_v37_decoy.fa \
	-i with-ci.vcf \
	--max_ci_dist 0 \
	-o svtyper.genotyped.vcf

perl -pi -e 's/""/"/' svtyper.genotyped.vcf
bgzip svtyper.genotyped.vcf
tabix -f svtyper.genotyped.vcf.gz

ODIR=evaluate-svtyper/

rm -r $ODIR
python ~/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b $truth_del -c svtyper.genotyped.vcf.gz -o $ODIR/ --passonly --pctsim=0  -r 20 --giabreport -f /data/human/g1k_v37_decoy.fa --no-ref --includebed $bed -O 0.6
cat $ODIR/summary.txt
