edgardomortiz / vcf2phylip

Convert SNPs in VCF format to PHYLIP, NEXUS, binary NEXUS, or FASTA alignments for phylogenetic analysis

License: GNU General Public License v3.0

Python 100.00%

vcf snps phylip alignment phylogenetics nexus binary snapp fasta outgroup

vcf2phylip's Introduction

vcf2phylip

Brief description

This script works with Python 3. It takes a VCF file as input and uses the SNP genotypes to create a matrix for phylogenetic analysis in the PHYLIP (relaxed version), FASTA, NEXUS, or binary NEXUS formats. For heterozygous SNPs, a consensus is made and the IUPAC nucleotide ambiguity codes are written to the final matrix(ces). Any ploidy level is allowed and is detected automatically. The code is optimized for large VCF matrices (hundreds of samples and millions of genotypes); for example, in our tests it processed a 20 GB VCF (~3 million SNPs x 650 individuals) in ~27 minutes. The initial version of the script produced only a PHYLIP matrix, but we have since added other popular formats, including the binary NEXUS file used to run SNP analyses with the SNAPP plugin in BEAST (only for diploid genotypes).
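As a rough illustration of the consensus step, the sketch below maps the set of alleles in one genotype to an IUPAC ambiguity code. The function name `genotype_to_iupac`, the table layout, and the choice to treat any missing allele as `N` are assumptions made for this example; they are not taken from the script's source.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W",
    "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
    "ACGT": "N",
}

def genotype_to_iupac(gt, alleles):
    """Return a consensus code for a GT field such as '0/1' or '0|1|2'.

    `alleles` is [REF, ALT1, ALT2, ...]. Hypothetical helper for illustration;
    it works for any ploidy because it only looks at the set of alleles.
    """
    indices = gt.replace("|", "/").split("/")
    if any(i == "." for i in indices):
        return "N"  # assumption: any missing allele means missing data
    bases = sorted({alleles[int(i)] for i in indices})
    return IUPAC["".join(bases)]
```

For a diploid heterozygote `0/1` at an A/G site this yields `R`, which is why letters other than A, C, G, and T can appear in the output matrices.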

Additionally, you can set a minimum number of samples per SNP to control the final amount of missing data. Since phylogenetic software usually roots trees at the first sequence in the alignment (e.g. RAxML, IQTREE, and MrBayes), the script also lets you specify an OUTGROUP sequence that will be written first in the alignment.
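The two ideas in this paragraph can be sketched as follows. Both helper names are hypothetical, and the rule "a genotype counts as present if it contains no `.`" is an assumption for the example rather than the script's documented behavior.

```python
def passes_min_samples(genotypes, min_samples_locus=4):
    """True if at least `min_samples_locus` samples have a called genotype.

    `genotypes` holds one GT string per sample, e.g. ['0/1', './.', '1/1'].
    """
    called = sum(1 for gt in genotypes if "." not in gt)
    return called >= min_samples_locus

def outgroup_first(sample_names, outgroup):
    """Return the sample names with the outgroup moved to the front."""
    if outgroup not in sample_names:
        return list(sample_names)  # nothing to reorder
    return [outgroup] + [s for s in sample_names if s != outgroup]
```

A site with calls in only three samples would be dropped under the default threshold of 4, which is one reason the output can contain fewer sites than the input VCF.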

Compressed VCF files can be analyzed directly, but the extension must be .vcf.gz.
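A minimal sketch of extension-based opening is shown below. The helper names `open_vcf` and `read_sample_names` are hypothetical. Opening the gzipped file in text mode (`"rt"`) means every line is a `str`, so calls such as `line.startswith("#CHROM")` behave the same for both file types.

```python
import gzip

def open_vcf(path):
    """Open a plain or gzipped VCF in text mode so lines are always `str`."""
    if str(path).endswith(".vcf.gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")

def read_sample_names(path):
    """Return the sample names from the #CHROM header line (columns 10+)."""
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return line.rstrip("\n").split("\t")[9:]
    return []  # no header line found
```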

The script has been tested with VCF files produced by pyrad v.3.0.66, ipyrad v.0.7.x, Stacks v.1.47, dDocent, GATK, freebayes, and graphtyper.

Please don't hesitate to open an Issue if you find a problem or have a suggestion for a new feature.

Usage

Just type python vcf2phylip.py -h to show the help of the program:

usage: vcf2phylip.py [-h] -i FILENAME [--output-folder FOLDER]
                     [--output-prefix PREFIX] [-m MIN_SAMPLES_LOCUS]
                     [-o OUTGROUP] [-p] [-f] [-n] [-b] [-r] [-w] [-v]

The script converts a collection of SNPs in VCF format into a PHYLIP, FASTA,
NEXUS, or binary NEXUS file for phylogenetic analysis. The code is optimized
to process VCF files with sizes >1GB. For small VCF files the algorithm slows
down as the number of taxa increases (but is still fast).

Any ploidy is allowed, but binary NEXUS is produced only for diploid VCFs.

optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Name of the input VCF file, can be gzipped
  --output-folder FOLDER
                        Output folder name, it will be created if it does not
                        exist (same folder as input by default)
  --output-prefix PREFIX
                        Prefix for output filenames (same as the input VCF
                        filename without the extension by default)
  -m MIN_SAMPLES_LOCUS, --min-samples-locus MIN_SAMPLES_LOCUS
                        Minimum of samples required to be present at a locus
                        (default=4)
  -o OUTGROUP, --outgroup OUTGROUP
                        Name of the outgroup in the matrix. Sequence will be
                        written as first taxon in the alignment.
  -p, --phylip-disable  A PHYLIP matrix is written by default unless you
                        enable this flag
  -f, --fasta           Write a FASTA matrix (disabled by default)
  -n, --nexus           Write a NEXUS matrix (disabled by default)
  -b, --nexus-binary    Write a binary NEXUS matrix for analysis of biallelic
                        SNPs in SNAPP, only diploid genotypes will be
                        processed (disabled by default)
  -r, --resolve-IUPAC   Randomly resolve heterozygous genotypes to avoid IUPAC
                        ambiguities in the matrices (disabled by default)
  -w, --write-used-sites
                        Save the list of coordinates that passed the filters
                        and were used in the alignments (disabled by default)
  -v, --version         show program's version number and exit

Examples

In the following examples you can omit python if you change the permissions of vcf2phylip.py to executable.

Example 1: Use default parameters to create a PHYLIP matrix with a minimum of 4 samples per SNP:

python vcf2phylip.py --input myfile.vcf
# Which is equivalent to:
python vcf2phylip.py -i myfile.vcf
# This command will create a PHYLIP called myfile_min4.phy

Example 2: Create a PHYLIP and a FASTA matrix using a minimum of 60 samples per SNP:

python vcf2phylip.py --input myfile.vcf --fasta --min-samples-locus 60
# Which is equivalent to:
python vcf2phylip.py -i myfile.vcf -f -m 60
# This command will create a PHYLIP called myfile_min60.phy and a FASTA called myfile_min60.fasta

Example 3: Create all output formats, and select "sample1" as outgroup:

python vcf2phylip.py --input myfile.vcf --outgroup sample1 --fasta --nexus --nexus-binary
# Which is equivalent to:
python vcf2phylip.py -i myfile.vcf -o sample1 -f -n -b
# This command will create a PHYLIP called myfile_min4.phy, a FASTA called myfile_min4.fasta, a NEXUS called myfile_min4.nexus, and a binary NEXUS called myfile_min4.bin.nexus

Example 4: If, for example, you wish to disable the creation of the PHYLIP matrix and only create a NEXUS matrix:

python vcf2phylip.py --input myfile.vcf --phylip-disable --nexus
# Which is equivalent to:
python vcf2phylip.py -i myfile.vcf -p -n
# This command will create only a NEXUS matrix called myfile_min4.nexus

Example 5: If for some reason you don't want to have IUPAC ambiguities representing heterozygous genotypes:

python vcf2phylip.py --input myfile.vcf --resolve-IUPAC
# Which is equivalent to:
python vcf2phylip.py -i myfile.vcf -r
# This command will create only a PHYLIP matrix called myfile_min4.phy where IUPAC ambiguities have been randomly resolved

Example 6: Specify output folder and output prefix:

python vcf2phylip.py -i myfile.vcf.gz --output-folder /data/results --output-prefix mymatrix
# This command will create the file `mymatrix.min4.phy` in the folder `/data/results`

Example 7: Write a list of the sites that were used in the alignments:

python vcf2phylip.py -i myfile.vcf.gz -w
# This command will create the file `myfile.min4.phy` and the list `myfile.min4.used_sites.tsv`

Credits

Code: Edgardo M. Ortiz
Data and testing: Juan D. Palacio-Mejía

Citation

Ortiz, E.M. 2019. vcf2phylip v2.0: convert a VCF matrix into several matrix formats for phylogenetic analysis. DOI:10.5281/zenodo.2540861

vcf2phylip's Issues

Index Error: string out of range

Hello,
I am having a problem similar to the last issue reported, but mine is for the non-binary nexus format. The problem may be on my end, I'm new to coding. I am trying to merge two vcf files and then convert the merged file into nexus format. Both vcf files were made using GATK best practices, but when I try to combine the files with GATK "combine variants" I get errors and the files won't merge. When I used vcftools, I was able to merge the files (I am not sure why this worked and the other did not). I tried to convert the vcf file to nexus format using the following code:

python vcf2phylip.py --input MY_VCF_MERGED.vcf --phylip-disable --nexus

Then I get the following error:

File "vcf2phylip.py" , line 384, in main()
File "vcf2phylip.py" , line 209, in the main site_tmp = ' '.join([amb[(nuc[broken[i][0]], nuc[broken[i][2]])]) for i in range(9, index_last_sample)])
IndexError: string out of range

I tried going back and converting the files before I merged them, and for some reason, this works. I am able to convert each vcf file into nexus format, but when I try to convert the merged vcf file, it won't work. If you have any ideas on what I might need to change in order to get my merged file to convert to nexus format, I would really appreciate the help! Thank you very much :)

Leigh Ann

IndexError: string index out of range

Hi, I am trying to convert a multisample VCF file to binary NEXUS format for SNAPP.

I am getting the following error:

Traceback (most recent call last):
File "vcf2phylip.py", line 419, in
main()
File "vcf2phylip.py", line 239, in main
site_tmp = ''.join([amb[''.join(sorted(set([nuc[broken[i][j]] for j in gt_idx])))] for i in range(9, index_last_sample)])
IndexError: string index out of range

This is the command I am using:
python vcf2phylip.py -i /path/to/vcf -n

ori_5000.vcf.gz

Thanks a lot in advance!

Give option for name and directory of output file

I use snakemake, where one has to specify the name of the output file beforehand.

At the moment, vcf2phylip creates a file in the directory of the source file and replaces the vcf suffix with 'min4.phy' when using default options.

It's very useful if I can specify the name and directory of the output file. Saves me extra commands in snakemake. It also brings your program into line with standard unix utilities.

Reduced number of SNPs in the output file

Hi,

I'm using your python script to convert my vcf file to nexus and phylip formats. However, I realized that there is a discrepancy in the number of sites between the input and output files.
I am currently using a VCF file with no missing data and 681 variant sites, and both output files (nexus and phylip) have a total of 647 sites.
The same happens if I use a vcf file with more sites (including missing data).
Is there an explanation for such difference?

Thank you.

SyntaxError: invalid syntax

Hello, when I run this script in terminal I get the following error message:

'File "vcf2phylip.py", line 190
print str(snp_num)+" genotypes processed"
^
SyntaxError: invalid syntax'

Thanks!
Hannah

Output can't be recognized by Treebest to build NJ tree

I converted a VCF file to a FASTA file, then used Treebest to build an NJ tree with the command "treebest nj -W -t ntmm -b 1000 fastfile > result". The following error occurred:
[ma_trans_align] not seem to be a nucleotide alignment (218638).
[ma_nucl_filter] fail to translate a nucleotide alignment. Filtering abort.

I couldn't find what was wrong.

Issues converting VCF file to PHYLIP

Hi,
I have the following issue: vcf2phylip did not process the provided VCF file as expected. The output header 58 0 indicates that it detected 58 samples but 0 sites, which is not typical for a valid VCF file containing genotype information.
I used the following commands to generate the VCF file before running vcf2phylip:

enroot start --mount $HOME --root --rw staphb+bcftools sh -c "
bcftools view -h /home/carlos.carrion/output_filtered.vcf > /home/carlos.carrion/output_reformat.vcf &&
bcftools query -f "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO[\t%SAMPLE=%GP]\n" /home/carlos.carrion/output_filtered.vcf >> /home/carlos.carrion/output_reformat.vcf"

Thanks

Hours taken to find the ploidy

Hi there!

My multi-sample VCF file, which has 551 non-header lines (only 1 contig), takes hours to process. In fact, it has never finished.
This is RNA-seq data from multiple haploid parasites in the host, so ploidy does not matter. GATK haplotypecaller was used to produce GVCF files, then they were filtered with GATK VariantFiltration, and then indels were removed with bcftools. Finally the individual sample vcf files were merged with GATK CombineVariants.

I made a version of the vcf that only has the header and the first snp line, and that still takes a few minutes and has not finished as of this writing.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R47 R48 R49 R50 R51 R52
PRELSG_01_v1 470 . C A,<NON_REF> . . BaseQRankSum=-2.287;DP=10;ExcessHet=3.01;MQRankSum=0;RAW_MQandDP=36000,10;ReadPosRankSum=0.728 GT:AD:DP:GQ:PL:SB ./.:.:.:.:.:. ./.:8,2,0:10:10:0,10,256,24,261,275:7,1,2,0 ./.:.:.:.:.:. ./.:.:.:.:.:. ./.:.:.:.:.:. ./.:.:.:.:.:.

That is the first and only snp line in the vcf file that the script loops around.
I uncommented the print statements for 'missing', 'broken[j]', etc., but they never print.

Output file (.phy) couldn't be opened by Phylip and FastTree

Hi, Edgardo

I tried to use vcf2phylip.py to prepare the input file for Phylip as well as FastTree, and encountered the following problems. Could you please help me out? Thank you very much.

python ../vcf2phylip/vcf2phylip.py -i ../chr251.LDfilter.vcf -r

phylip-3.697 dnadist "ERROR: Unexpected end-of-file."

FastTree Version 2.1.10 "No sequence in phylip line"

Best,

Rong Liu

vcf file (monomorphic SNPs discarded) to phylip

Hi!
I'm interested in using RAxML to assess how my samples cluster based on RADseq population SNPs.
I converted my VCF file to phylip format using your Python code.
I'm completely sure I don't have monomorphic SNPs in my data set. However, when I run RAxML with the ASC correction option (recommended when using only variable sites = SNPs), the program displays an error: "For partition No Name Provided you specified that the likelihood score shall be corrected for invariant sites via an ascertainment bias correction. However, some sites in this partition are already invariant. This is not allowed, please remove all invariant sites and try again, exiting".
How is this possible? Can someone help me?
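One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that a column containing IUPAC ambiguity codes can still be compatible with a single base in every sample (for example `A`, `R`, `A`), and a program may then treat it as invariant. The hypothetical helpers below flag such columns in an alignment held as a list of equal-length strings.

```python
# Bases represented by each IUPAC code (N, ? and - are handled as missing).
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}
MISSING = set("N?-")

def potentially_invariant(column):
    """True if one base is compatible with every non-missing character.

    A column made only of missing characters also returns True.
    """
    shared = set("ACGT")
    for char in column.upper():
        if char in MISSING:
            continue
        shared &= set(IUPAC_SETS[char])
    return bool(shared)

def flag_columns(sequences):
    """Return 0-based indices of potentially invariant alignment columns."""
    return [i for i, col in enumerate(zip(*sequences))
            if potentially_invariant("".join(col))]
```

Columns flagged this way could be inspected or removed before rerunning the analysis; whether that matches the check RAxML performs should be verified against its documentation.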

output path issue

How to set a certain output path for the PHY? My VCF is too large and my C drive is too small to hold the output PHY. Thanks.

genotypes excluded even if missing data is less than -m 4

Hi
I have a multi-sample vcf which I have filtered to retain reference and SNP calls ONLY if at least 25 samples out of 31 total samples have non-missing data. However, when I convert that into phylip using your script, genotypes are still being excluded when they should not be. Or am I understanding the -m parameter wrong? I also tried -m 0 and still facing the same problem.
Is there any way to see the excluded genotypes to troubleshoot this?

vcf2phylip.py -i RMPs.vcf -m 4
Total of genotypes processed: 6372167
Genotypes excluded because they exceeded the amount of missing data allowed: 875118
Genotypes that passed missing data filter but were excluded for not being SNPs: 0
SNPs that passed the filters: 5497049
vcf2phylip.py -i RMPs.vcf -m 0
Total of genotypes processed: 6372167
Genotypes excluded because they exceeded the amount of missing data allowed: 810339
Genotypes that passed missing data filter but were excluded for not being SNPs: 0
SNPs that passed the filters: 5561828

Thanks.

Add --version flag

% vcf2phylip.py --version 2> /dev/null
vcf2phylip 2.1

% echo $?
0

transfer vcf file of SV into phylip failed

Hi!
I have a multi-sample VCF file that contains structural variants (SVs) instead of SNPs, and I want to know if I can use vcf2phylip to convert this SV VCF file to PHYLIP.
If so, I ran into trouble and the conversion failed.

code: vcf2phylip.py -i sheep_534sample.sv.vcf --output-prefix sheep_534sample.sv

The log is shown below:

Converting file 'sheep_534sample.sv.vcf':
Number of samples in VCF: 534
Total of genotypes processed: 62452
Genotypes excluded because they exceeded the amount of missing data allowed: 12493
Genotypes that passed missing data filter but were excluded for being MNPs: 49959
SNPs that passed the filters: 0

Another question:
In the SV VCF file, there are lots of "./." entries instead of "0/0" to represent genotypes that match the reference. When I replace "./." with "0/0" and then use vcf2phylip, it still fails.

I hope to hear from you. Thanks!

SNPs that passed the filters: 0

Hello,
when I tried to convert a VCF file (timetree_1.vcf.gz) to PHYLIP format using vcf2phylip, none of the SNPs were preserved. This data consists of 9 samples extracted from a complete VCF file, and the complete VCF file can be converted normally.

Using with Phylip

Hi- what Phylip programs does this produce inputs for? I tried using the file with the Neighbor program and got an error that the input was of the wrong type.

Thanks!

error

File "/Users/awaisrasheed/vcf2phylip.py", line 71
<title>vcf2phylip/vcf2phylip.py at master · edgardomortiz/vcf2phylip · GitHub</title>
^
SyntaxError: invalid character '·' (U+00B7)
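The `<title>` tag in this traceback suggests that the saved file is a GitHub web page rather than the raw script. That is an inference from the error text, not something confirmed in the thread. A quick heuristic check, with the hypothetical helper name `looks_like_html`, could look like this:

```python
def looks_like_html(path, probe_bytes=2048):
    """Heuristic: True if the start of the file contains HTML markers."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes).lower()
    return b"<!doctype html" in head or b"<html" in head
```

If the check returns `True` for `vcf2phylip.py`, downloading the script again from the repository's raw file view or from a release archive would be the next thing to try.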

alt allele assignment for heterozygous SNP site instead of IUPAC codes

Hi,

Is it possible with the current script to assign the alternate allele for all heterozygous sites, instead of IUPAC codes? With the default, while trying to translate the converted phylip file, many stop codons appear in the alignment because of the IUPAC codes, and it's a painful process to correct 1-2 million SNPs. Any suggestions? I am sure it won't be difficult to add this functionality as an option.

Many thanks.

Kumar

UnboundLocalError: local variable 'used_sites' referenced before assignment

Hi Edgardo,

When I used vcf2phylip.py to convert a SNP VCF file to NEXUS format, I got the following error message.

Converting file 'dqgg_20210720.filtered.vcf':

Number of samples in VCF: 31
Traceback (most recent call last):
  File "./vcf2phylip.py", line 513, in <module>
    main()
  File "./vcf2phylip.py", line 340, in main
    used_sites.write(record[0] + "\t" + record[1] + "\t" + str(num_samples_locus) + "\n")
UnboundLocalError: local variable 'used_sites' referenced before assignment

The code I used was

./vcf2phylip.py --input dqgg_20210720.filtered.vcf --phylip-disable --nexus --output-prefix 2ndconversion

The SNPs in the VCF file were called using bcftools mpileup, and the VCF file has been indexed using bcftools index. Do you know what I should do to fix my problem?

Best,
Xin

Getting an error on vcf2phylip

Hello,
I'm having issues converting my vcf file. I work with Ubuntu.

The command I entered: python ~/vcf2phylip/vcf2phylip.py -i Final.vcf
Traceback (most recent call last):
File "/home/computer/vcf2phylip/vcf2phylip.py", line 397, in
main()
File "/home/computer/vcf2phylip/vcf2phylip.py", line 220, in main
site_tmp = ''.join([(amb[(nuc[broken[i][0]], nuc[broken[i][2]])]) for i in range(9, index_last_sample)])
KeyError: '2'

I've attached the file here as well.
Final.vcf.gz

Any and all help is appreciated!

Convert VCF to interleaved PHYLIP format?

Hi Edgardo

I'm trying to convert a VCF assembly filtered with VCFtools into an interleaved-format .phylip file in order to perform a bpp run. When I use the script, a sequential PHYLIP file is returned. Is there a way to achieve this?
thanks!!

error in binary nexus

When converting the VCF to binary NEXUS, it throws this error:

Traceback (most recent call last):
File "vcf2phylip.py", line 384, in
main()
File "vcf2phylip.py", line 224, in main
binsite_tmp = ''.join([(gen_bin[broken[i][0:3]]) for i in range(9, index_last_sample)])
KeyError: '1|1'
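The key `'1|1'` is a phased genotype (VCF writes `|` between phased alleles and `/` between unphased ones), and the lookup in this traceback apparently has no entry for it. As one possible workaround, assuming phase information is not needed for the analysis, the phase separator can be rewritten before conversion. The helper name `unphase_record` is hypothetical, and it assumes GT is the first FORMAT subfield, as the VCF specification requires when GT is present.

```python
def unphase_record(line):
    """Replace '|' with '/' in the GT subfield of every sample column."""
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    for i in range(9, len(fields)):
        # Only the part before the first ':' is the genotype.
        gt, sep, rest = fields[i].partition(":")
        fields[i] = gt.replace("|", "/") + sep + rest
    return "\t".join(fields) + "\n"
```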

Max character length for seq IDs

VCF allows unlimited character length for sequence IDs, but apparently PHYLIP allows only 10. Would there be a way to force unique character names if the input is longer than 10 characters? Or, at least provide a warning that the output will be invalid?

Also, I got my vcf from graphtyper and the tool is working well!

input file can't be file.vcf.gz?

Hello, thanks for your software.
I got an error when I used the command "vcf2phylip.py -f -o O.chinensis_01,O.octandra_01,O.scaberrima_01,O.scaberrima_02 -i melastoma.filter.vcf.gz"
and received the following message.
Traceback (most recent call last):
File "/DATA4/Liang/yanzhong/software/vcf2phylip-master/vcf2phylip.py", line 430, in
main()
File "/DATA4/Liang/yanzhong/software/vcf2phylip-master/vcf2phylip.py", line 124, in main
if line.startswith("#CHROM"):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

Any reply would be welcome.

error: unexpected end-of-file.

I ran vcf2phylip to generate the phylip file and got an error:

error: unexpected end-of-file.

After checking the .phy file, I found that the length of name + padding was not equal to the 10 characters required by dnapars. After I manually padded the names to 10 characters, the problem went away.
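The manual fix described here can be sketched in code. The hypothetical helper below rewrites a sequential relaxed PHYLIP string so that every name occupies exactly 10 characters, and it refuses to continue if truncation would make two names identical. It assumes one "name sequence" pair per line, separated by whitespace.

```python
def to_strict_phylip(relaxed_text, width=10):
    """Rewrite a sequential relaxed PHYLIP string with fixed-width names.

    Raises ValueError if truncating names to `width` creates duplicates.
    """
    lines = [ln for ln in relaxed_text.splitlines() if ln.strip()]
    header, records = lines[0], lines[1:]
    out, seen = [header], set()
    for record in records:
        name, seq = record.split(None, 1)
        short = name[:width]
        if short in seen:
            raise ValueError("duplicate name after truncation: " + short)
        seen.add(short)
        # Pad short names with spaces; the sequence starts at column width+1.
        out.append(short.ljust(width) + seq.strip())
    return "\n".join(out) + "\n"
```

Renaming samples in the VCF before conversion avoids the truncation problem altogether when many sample names share a long prefix.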

error with using python 3

Hi,

Thank you for providing this useful tool. However, I got an "invalid syntax" error on line 197. I think it's because I'm using Python 3, where "print" is a function. I tried print() instead of print and it worked.
Thank you.

YiMing

Invalid character * in the alignment

Hello,
Thanks for this amazing script.
I converted a 264-sample VCF file into a FASTA alignment using vcf2phylip with default parameters and attempted to construct a phylogeny using iqtree. This was unsuccessful because of the presence of invalid * characters at multiple sites in the alignment.
Any recommendations?

Plans to port to python3 ?

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020.
 Please upgrade your Python as Python 2.7 won't be maintained after that date. 
A future version of pip will drop support for Python 2.7. 
More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support

output for phased data

Hi there,

Thank you for making this super useful tool! Issue #23 was a really helpful improvement for utilizing heterozygous sites. I was wondering if it would be feasible to include an option for an output file that is two alignments per diploid individual? E.g.

Ind1_A
ATGCAA
Ind1_B
GTACCG

This would provide a reasonable alternative to discarding het sites or selecting them randomly when the data is phased confidently.
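A minimal sketch of the requested output, assuming phased diploid genotypes, is shown below. The helper name `split_haplotypes` and the `_A`/`_B` suffixes follow the example in this post and are not an existing option of the script. Unphased or missing genotypes are written as `N` in both sequences here, which is a deliberately conservative assumption.

```python
def split_haplotypes(sample, genotypes, alleles_per_site):
    """Build two pseudo-haplotype sequences from phased diploid genotypes.

    genotypes:        one GT string per site, e.g. ['0|1', '1|1']
    alleles_per_site: one [REF, ALT1, ...] list per site
    """
    hap_a, hap_b = [], []
    for gt, alleles in zip(genotypes, alleles_per_site):
        if "|" not in gt or "." in gt:
            hap_a.append("N")  # unphased or missing: phase unknown
            hap_b.append("N")
            continue
        first, second = gt.split("|")
        hap_a.append(alleles[int(first)])
        hap_b.append(alleles[int(second)])
    return {sample + "_A": "".join(hap_a), sample + "_B": "".join(hap_b)}
```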

Thank you!
Erik

unexpected token 'newline' error

I'm having an issue converting a vcf created with pggb. It is in format ##fileformat=VCFv4.2
Any ideas?

vcf2phylip.py -i combined.fasta.a8a102b.7608fc1.afc7f52.smooth.final.Panubis1.1.vcf
/usr/local/bin/vcf2phylip.py: line 7: syntax error near unexpected token `newline'
/usr/local/bin/vcf2phylip.py: line 7: `<!DOCTYPE html>'

Deciding an outgroup to root the tree

Hi there,
I am not sure which sample I should use as an outgroup to root the tree. Is there any way to choose midpoint rooting while converting a VCF file to PHYLIP using your script?
Thanks

Phylogenetic tree

Dear Sir,
Any leads on how we can import the file generated by the .py script into RAxML, IQTREE, and MrBayes? I have not been able to do it successfully. Is there an example of this?
Thanks
Devender Arora

ImportError: No module named pathlib

My code is
python vcf2phylip.py --input of_snp.vcf.gz
and I got an error but don't know how to fix it. Can you help me? Thank you so much!

Traceback (most recent call last):
File "vcf2phylip.py", line 23, in
from pathlib import Path
ImportError: No module named pathlib
tree_out.log (END)
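`pathlib` has been part of the Python standard library since Python 3.4, so this ImportError usually means that `python` resolved to an older interpreter, often Python 2. The sketch below, with the hypothetical helper name `check_python`, shows one way to detect that before running the script.

```python
import sys

def check_python(version_info=sys.version_info, minimum=(3, 4)):
    """Return None if the interpreter is new enough, else a message string."""
    if tuple(version_info[:2]) >= minimum:
        return None
    return "Python {}.{} found; rerun with python3 (>= {}.{})".format(
        version_info[0], version_info[1], minimum[0], minimum[1])
```

Running the script explicitly with `python3 vcf2phylip.py ...` is the simplest thing to try when the check reports an old version.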

All SNPs were removed by vcf2phylip

Hello, I am trying to go from vcf file to Phylip file but it is not possible. I get this notice telling me that all my SNPs have been removed.

python3 vcf2phylip.py -i unido2.vcf -o prueb2.phylip -o muestra_03_sort.bam -m 4

Converting file unido2.vcf:

Number of samples in VCF: 6
Total of genotypes processed: 9305
Genotypes excluded because they exceeded the amount of missing data allowed: 9305
Genotypes that passed missing data filter but were excluded for not being SNPs: 0
SNPs that passed the filters: 0

Outgroup, muestra_03_sort.bam, added to the matrix(ces).
Sample 1 of 6, muestra_01_sort.bam, added to the nucleotide matrix(ces).
Sample 2 of 6, muestra_02_sort.bam, added to the nucleotide matrix(ces).
Sample 4 of 6, muestra_04_sort.bam, added to the nucleotide matrix(ces).
Sample 5 of 6, muestra_05_sort.bam, added to the nucleotide matrix(ces).
Sample 6 of 6, muestra_06_sort.bam, added to the nucleotide matrix(ces).

Done!

I did a manual inspection of the vcf file and I know that there are SNPs that are found in common in the 6 samples. Could you help me with this error please?

P.S. This is my file:
unido2.txt

I used it in .vcf but it doesn't allow me to upload it in that format.

all SNPs were removed when converting my vcf file by using vcf2phylip

Hello,

I found that all my SNPs were removed after converting my vcf file. I did that on the Linux system, the command I type is "python vcf2phylip.py -m 0 -fi 1111.filter.vcf".
And the result is
"Number of samples in VCF: 8
Total of genotypes processed: 130883
Genotypes excluded because they exceeded the amount of missing data allowed: 130883
Genotypes that passed missing data filter but were excluded for not being SNPs: 0
SNPs that passed the filters: 0"

I don't know why all my SNPs were removed, and how can I fix that problem?

I appreciate your help! ^_^

Error from vcf2phylip v2.3 when converting GATK vcf

Hello.
I got the following error when converting GATK vcf to phylip using vcf2phylip v2.3.
Typed command: ./vcf2phylip.py -i input.vcf -r

Traceback (most recent call last):
File "./vcf2phylip.py", line 447, in
main()
File "./vcf2phylip.py", line 65, in main
outgroup = args.outgroup.split(",").split(";")[0]
AttributeError: 'list' object has no attribute 'split'

Could you give me any solutions?
It didn’t happen when I used v2.0 instead of v2.3, but I would like to use the “-r, --resolve-IUPAC” option in v2.3.

In addition, I would like to know how the genotype is determined for heterozygous sites when using the “-r, --resolve-IUPAC” option in this tool.

Thank you very much.

AttributeError: 'str' object has no attribute 'decode'

Hi - Is this a Python 3 issue?
(base) Jessicas-MacBook-Pro-2:Monadenia_filtered_SNPs jessicaoswald$ python vcf2phylip.py -i DP3g95maf05.recode.vcf
Traceback (most recent call last):
File "vcf2phylip.py", line 431, in
main()
File "vcf2phylip.py", line 123, in main
line = line.decode("UTF-8")
AttributeError: 'str' object has no attribute 'decode'

Thanks,
Jess

Multiple VCF files?

Hi,

I would like to create a phylip file from several VCF files produced by ATLAS variant caller.
Since ATLAS seems not to support producing gvcf files to date, I have several VCFs from different individuals.
I tried vcf2phylip.py with options like

vcf2phylip.py -i Ind.vcf.gz -i Ind.vcf.gz --min-samples-locus 2

but got an error.

Does this script support multiple inputs?
Or could you suggest any alternative ways?(Merge VCFs before the script?)

Thank you for your kind help!

All of the SNPs were removed by vcf2phylip

Aloha,

For some reason all my SNPs were removed after converting my vcf file. Processed on a MacOS system, the command I used is "python vcf2phylip.py -i all.vcf".

The result is:
"Number of samples in VCF: 74
Total of genotypes processed: 11576930
Genotypes excluded because they exceeded the amount of missing data allowed: 11576930
Genotypes that passed missing data filter but were excluded for being MNPs: 0
SNPs that passed the filters: 0"

I used GATK to create this merged VCF file, and I suspect there's a formatting issue. To check this, I have included the first 1000 rows of the merged VCF as an attachment.

How can I fix this? Mahalo in advance for any help you can give.
vcf1000.txt

-Bjarne

Key Error '-'

Hello,

I'm trying to convert my vcf file to phylip, but I can't work out the error I get.

$ python vcf2phylip.py -i cfaplusoutD.recode.vcf -o AN02:C6G73ANXX:7:250444719

Converting file cfaplusoutD.recode.vcf:

Number of samples in VCF: 47
Traceback (most recent call last):
File "vcf2phylip.py", line 421, in
main()
File "vcf2phylip.py", line 241, in main
site_tmp = ''.join([amb[''.join(sorted(set([nuc[broken[i][j]] for j in gt_idx])))] for i in range(9, index_last_sample)])
KeyError: '-'

I'm attaching the first 2000 variants of the file
here, in case there is any way you can help.
Thank you very much for your time, in advance.

Installation problem

Hello
I downloaded the software with "wget https://github.com/edgardomortiz/vcf2phylip/archive/refs/tags/2.6.tar.gz" and decompressed it with "tar xf vcf2phylip-2.6.tar.gz". I then ran "python vcf2phylip -h" and encountered the following problem:

Traceback (most recent call last):
File "vcf2phylip.py", line 26, in
from pathlib import Path
ImportError: No module named pathlib

How can I fix it?
Best wishes,
Xin Long

Is there a way to get vcf2phylip to output the REF for non-variant positions rather than N?

Hi Edgardo

I have a series of large multi-sample VCF files of bacterial SNPs (so everything is haploid) that I want to convert to PHYLIP and ultimately use with FastTree (http://www.microbesonline.org/fasttree/#FAQ). I have used vcf2phylip on some test files and it all worked nicely and quickly. However, I then found that FastTree won't accept non-ACGT characters, so it doesn't like the output I now have. I would really like to use FastTree (this work is a comparison against another pipeline that uses FastTree), so is there a way to get vcf2phylip to output the REF nucleotide rather than N in the final PHYLIP file?

Thanks for your help

Richard

The result contains the characters "RY"

Hi,
When I run the code below, it produces a result file named "input.min1.phy".
python vcf2phylip.py -i input.vcf
However, the file "input.min1.phy" contains the characters "R" and "Y", which are not among the DNA nucleotides "AGTC".
Is this normal?
input.zip
sun,

Coordinates of alignment sites

Thank you for such a useful tool. Would it be possible to extract the coordinates of the sites included in the alignment?

KeyError: '2'

I'm trying to run vcf2phylip on a VCF obtained from GATK's CombineVariants.

However, when trying to run it I get:

$ python ~/bin/vcf2phylip/vcf2phylip.py --input all_variants.vcf -f -p
    Traceback (most recent call last):
      File "/home/mnguyen/bin/vcf2phylip/vcf2phylip.py", line 387, in <module>
        main()
      File "/home/mnguyen/bin/vcf2phylip/vcf2phylip.py", line 212, in main
        site_tmp = ''.join([(amb[(nuc[broken[i][0]], nuc[broken[i][2]])]) for i in range(9, index_last_sample)])
    KeyError: '2'

KeyError: '2': always met this error when converting a vcf file to phylip file.

I have modified the variety names to be four digits, such as 4111, ensuring they are within 10 characters. However, I am still encountering this error. Could you please teach me how to resolve it?

$ python3 vcf2phylip-2.8/vcf2phylip.py -i test.vcf

Converting file 'test.vcf':

Number of samples in VCF: 290
Traceback (most recent call last):
File "vcf2phylip-2.8/vcf2phylip.py", line 502, in
main()
File "vcf2phylip-2.8/vcf2phylip.py", line 316, in main
site_tmp = get_matrix_column(record, num_samples,
File "vcf2phylip-2.8/vcf2phylip.py", line 129, in get_matrix_column
column += AMBIG[geno_nuc]
KeyError: '2'
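The cause of this error is not established in the thread. As a generic sanity check, the hypothetical helper below summarizes one VCF data line: which alleles it lists, which of them are not single bases, and which allele indices the genotypes actually use. Running it over the records near the point of failure may show whether a site is multiallelic or whether the REF/ALT columns hold something unexpected.

```python
VALID = set("ACGTN*")

def summarize_record(record_line):
    """Summarize alleles and GT indices of one tab-separated VCF data line."""
    fields = record_line.rstrip("\n").split("\t")
    alleles = [fields[3]] + fields[4].split(",")  # REF plus each ALT
    odd = [a for a in alleles if len(a) != 1 or a.upper() not in VALID]
    used = set()
    for sample in fields[9:]:
        gt = sample.split(":", 1)[0]
        used.update(i for i in gt.replace("|", "/").split("/") if i.isdigit())
    return {
        "alleles": alleles,
        "odd_alleles": odd,  # anything that is not a single-base allele
        "gt_indices": sorted(int(i) for i in used),
        "multiallelic": len(alleles) > 2,
    }
```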