vcf2tsvpy's Introduction

vcf2tsvpy: genomic VCF to tab-separated values (TSV)

A small Python program that converts genomic variant data encoded in VCF format into a tab-separated values (TSV) file.

The program uses the cyvcf2 library to parse the VCF file. By default, it prints the fixed VCF columns, all INFO tag values (as defined in the VCF header; an INFO tag not present in a given record is reported as '.'), and all genotype data (FORMAT columns) for heterozygotes and homozygotes. If genotype data is present, it prints one line per sample, with a VCF_SAMPLE_ID column identifying the sample each line belongs to. The program has optional arguments to:

  • skip sample genotype data (i.e. FORMAT columns) and print only variant information
  • keep rejected genotypes (i.e. FILTER != 'PASS' / GT == './.')
  • skip INFO data
  • compress the output TSV
  • print the data types of VCF columns as a header line

IMPORTANT: If you run vcf2tsvpy on a large multi-sample VCF file, the output TSV will quickly grow large, since by default it contains one line per sample genotype. Turn on --skip_genotype_data if you are primarily interested in the variant INFO elements; the output TSV will then also be considerably smaller.
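
The default output layout described above can be illustrated with a small pure-Python sketch. This is not the program's implementation (which parses records with cyvcf2); the function `flatten_record` and its parameters are hypothetical names, and only the './.' part of the rejection rule is shown.

```python
def flatten_record(fixed, info, info_tags, samples, keep_rejected=False):
    """Yield one TSV row (list of strings) per sample genotype.

    fixed     -- fixed VCF column values (CHROM, POS, ID, REF, ALT, ...)
    info      -- dict of INFO tag -> value found in this record
    info_tags -- all INFO tags declared in the VCF header, in output order
    samples   -- dict of sample ID -> genotype string (e.g. "0/1")
    """
    # INFO tags declared in the header but absent from the record become '.'
    info_values = [str(info.get(tag, ".")) for tag in info_tags]
    for sample_id, genotype in samples.items():
        # Missing genotypes ('./.') are skipped unless explicitly kept
        if genotype == "./." and not keep_rejected:
            continue
        yield fixed + info_values + [sample_id, genotype]


rows = list(flatten_record(
    fixed=["1", "776546", "rs12124819", "A", "G"],
    info={"DP": 31},
    info_tags=["DP", "AF"],
    samples={"S1": "0/1", "S2": "./.", "S3": "1/1"},
))
for row in rows:
    print("\t".join(row))
```

With three samples and one missing genotype, the sketch emits two rows for a single variant, which is why the output grows with the number of samples.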

News

  • March 9th 2023: 0.6.1 release
    • Handles cases where a tag is found in both the INFO and FORMAT columns of the VCF (e.g. DP). In such cases, the INFO tag name is now prefixed with INFO_ (e.g. INFO_DP), ensuring that column names in the output TSV file are unique.
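
As a rough illustration of the renaming rule in this release note, the sketch below prefixes any INFO tag that also appears among the FORMAT tags. The function name `info_column_names` is hypothetical, and the program's own code may differ.

```python
def info_column_names(info_tags, format_tags):
    """Return INFO column names, prefixing those that clash with FORMAT tags."""
    format_set = set(format_tags)
    return ["INFO_" + tag if tag in format_set else tag for tag in info_tags]


# DP is declared both as an INFO tag and as a FORMAT tag in this example
columns = info_column_names(["DP", "AF", "MQ"], ["GT", "DP", "GQ"])
print(columns)
```

Only the INFO side is renamed here, so the FORMAT column keeps its original name.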

Installation

The software can be installed with the Conda package manager, using the following command:

conda install -c bioconda vcf2tsvpy

Usage

vcf2tsvpy --input_vcf <INPUT_VCF> --out_tsv <OUTPUT_TSV>
       -h [options]

vcf2tsvpy:  Convert a VCF (Variant Call Format) file with genomic variants to a file with
        tab-separated values (TSV). One entry (TSV line) per sample genotype.

Required arguments:
    --input_vcf INPUT_VCF   Bgzipped input VCF file with input variants (SNVs/InDels)
    --out_tsv OUT_TSV       Output TSV file with one line per non-rejected sample genotype
                        (variant, genotype and annotation data as tab-separated values)

Optional arguments:
    --skip_info_data        Skip output of data in INFO column
    --skip_genotype_data    Skip output of genotype_data (FORMAT columns)
    --keep_rejected_calls   Output data also for rejected (non-PASS) calls
    --print_data_type_header    Output a header line with data types of VCF annotations
    --compress              Compress output TSV file with gzip
    --version               Show program's version number and exit
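
The output TSV can be read back with the Python standard library. The sketch below makes several assumptions that are not confirmed by the help text above: that the first non-comment line holds the column names, that any extra header lines start with '#', and that a compressed output file has a .gz suffix. The helper names `open_tsv` and `read_tsv_rows` are hypothetical.

```python
import csv
import gzip
import io


def open_tsv(path):
    """Open a plain or gzip-compressed TSV file in text mode."""
    # Assumption: files written with --compress carry a .gz suffix
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_tsv_rows(handle):
    """Yield one dict per data line, skipping lines that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t")


# Synthetic stand-in for a real output file; column names other than
# VCF_SAMPLE_ID are made up for the demonstration
demo = io.StringIO("#comment\nCHROM\tPOS\tVCF_SAMPLE_ID\n1\t776546\tS1\n")
records = list(read_tsv_rows(demo))
print(records[0]["VCF_SAMPLE_ID"])
```

For a real file, `read_tsv_rows(open_tsv(path))` streams rows one at a time, which helps with the large outputs that multi-sample VCFs produce.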

vcf2tsvpy's People

Contributors

razshaikh, sigven

vcf2tsvpy's Issues

default options excluding entries with PASS filter?

I have run vcf2tsvpy with default arguments, i.e. only the required arguments, using the following command:

vcf2tsvpy --input_vcf {in_vcf} --out_tsv {out_tsv} --skip_info_data

I have noticed, however, that some entries from the VCF which have a PASS value in the FILTER column were excluded from the output TSV file.

For example, the entry below is present in the VCF:

1       776546  rs12124819      A       G       .       PASS    .       GT:GQ:BAF:LRR   ./.:0:0.590754:0.0825162

but it is not present in the output TSV file unless I pass --keep_rejected_calls, in which case the TSV file is complete.

Below is a vimdiff screenshot: the left-hand side with --keep_rejected_calls, the right-hand side with only the required arguments.

[image: vimdiff screenshot]

Is this the expected behaviour? How come not passing --keep_rejected_calls excludes calls that have a PASS under FILTER?

Thanks in advance
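
Going by the README's description of rejected genotypes (FILTER != 'PASS' / GT == './.'), the record quoted above would count as rejected because its genotype is './.', even though FILTER is PASS. A minimal sketch of that rule, assuming the two conditions are combined with "or"; the function `is_rejected` is a hypothetical name and this is not the tool's own code:

```python
def is_rejected(filter_value, genotype):
    """Apply the README's rule: non-PASS FILTER or a missing genotype."""
    return filter_value != "PASS" or genotype == "./."


# The record from the report above, split on whitespace
record = (
    "1 776546 rs12124819 A G . PASS . "
    "GT:GQ:BAF:LRR ./.:0:0.590754:0.0825162"
).split()

filter_value = record[6]                      # FILTER column
format_keys = record[8].split(":")            # FORMAT keys
sample = dict(zip(format_keys, record[9].split(":")))

rejected = is_rejected(filter_value, sample["GT"])
print(rejected)  # True
```

Under this reading, --keep_rejected_calls is what brings such records back into the output.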

Segmentation fault

Hi, I am running vcf2tsv on an annotated VCF file, and getting this error:

> python3 vcf2tsv.py myhg38.hg38_multianno.vcf myhg38.hg38_multianno.tab
[W::vcf_parse] INFO '.' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'REVEL_rankscor' is not defined in the header, assuming Type=String
Segmentation fault

How can I resolve this? Any help would be greatly appreciated, thanks!
