samuelesoraggi / hmmploidy Goto Github PK

A tool to calculate ploidy levels from genotype likelihoods and coverage using Hidden Markov Models

License: GNU Lesser General Public License v3.0

R 57.02% Shell 1.73% Python 41.25%

genotype-likelihoods alleles ploidy-level allele-counts ploidy-numbers

hmmploidy's Introduction

HMMploidy: genotype-likelihood- and coverage-based Hidden Markov Model for inference of ploidy levels.

Tools for inferring ploidy levels, testing for aneuploidy and other stuff. Calculating its allele frequencies and genotype likelihoods requires to download the followings:

python 3, with packages gzip, numpy, scipy, pandas, lmfit
R, with packages pracma, data.table, Rcpp, getopt
samtools

Simulation of polyploid data

Overview: simulate poliploidy organisms and output them in an mpileup.gz format

simulationScript.sh -p $FILE -d $DEPTH -l $LOCI -o $OUTNAME

Input

-p or --ploidyFile: file containing the desired simulated data (see below for the sintax). Each line of the ploidyFile contains the ploidy number and the number of individuals for $K$ adjacent segments of a number $J$ of genomes. Example: $K=3$ adjacent segments and $20$ genomes. The first segment is diploid for all $20$ genomes, the second with is diploid for 10 genomes and tetraploid for 10 genomes, and the third is triploid for all genomes.

2x20
2x10,4x10
3x20

The length of each segment is expressed by the --loci parameter below.

-l or --loci: sequence of $K$ number of loci $l_1,l_2,...,l_K$ used in each simulated segment of the genomes.
-d or --depth: vector of haploid depths to simulate. For each depth, a compressed mpileup file is created.
-o or --out: output file name (without extension). Default: out

Output

One .mpileup.gz file for each haploid depth, number of individuals $N$, with the name $OUTNAME.DP$depth.mpileup.gz.

Syntax Example

$SOFTWAREPATH/simulationScript.sh -p ploidy.file -d 10,20 -l 1000,1000,1000 -o outFile

with the file ploidy.file containing

2x20
2x10,4x10
3x20

The output will consists of two files: outFile.DP10.mpileup.gz and outFile.DP20.mpileup.gz

Calculate genotype likelihoods

Overview: calculate genotype likelihoods

Genotype_Likelihoods.py names.filelist

Input

The file names.filelist, that contains the prefix of each file (that is, the name of each file including the extension), for example:

file1.mpileup.gz
file2.mpileup.gz
file3.mpileup.gz

for the files file1.mpileup.gz, file2.mpileup.gz, file3.mpileup.gz. Supported file types are 'bam','mpileup' and 'mpileup.gz' (gzipped mpileup file).

-o or --outFolder: Output folder. Default: the folder of each input files
-i or --Inbreeding: Inbreeding coefficients for each sample accepted as a comma seperated list e.g 0.3,0.2,0.1 alternatively can take in the format 0.2x3,0.4 which is equivilent to 0.2,0.2,0.2,0.4. All values must be between 0 and 1. Default value is 0xNSAMS
-p or --ploidyFile: File containing the list of ploidy levels to be used in analysis, that contains one ploidy level per line, for example:

2
4
5

for diploidy (2), tetraploidy (4) and pentaploidy (5). If not set, default ploidy levels [1, 2, 3, 4, 5, 6] will be used in the analysis.

-d or --downsampling: Fraction of the data to be included included in the calculation of genotype likelihoods and aneuploidy inference. That is for a value v in (0,1] for each read there is a vx100% chance the base is included in the calculations. this can be used to speed up calculations for high coverage samples. Be careful using this argument for low coverage data. Default: 1
-m or --min_non_major_freq: Set the minimum frequency of non major alleles for bases to be included in the calculations. Default: 0.2
-M2 or --max_minor2_freq: Set the maximum frequency of third most prolific alleles for bases to be included in the calculations. Used to determine strengh of confidence on bases being biallelic. Default: 0.1
-M3 or --max_minor3_freq: Set the maximum frequency of fourth most prolific alleles for bases to be included in the calculations. Used to determine strengh of confidence on bases being biallelic. Default: 0.1
-dp or --min_global_depth: Set the minimum global depth of a base to be included in calculations. All bases with more than this number of reads, after filtering for other conditions mentioned above, across all bases will be included.
-dpInd or --min_ind_depth: Set the minimum depth of a base for each sample to included those in the calculations. A locus is not considered if one or more samples have depth lower than the minimum. Default: 0.
-s or --random_seed: Set the random seed to be included in the calculations. If not set, every repeat of analysis will give different results.
-r or --ref_fasta: File name of the reference fasta file. Use this if bam files are being used. e.g TAIR10.fa.

Output

A .genolikes.gz file for each prefix in the input file. The columns of the file represent: chromosome name, site number, individual number, ref.allele, site coverage, major allele, minor allele, major allele counts, minor allele counts, genotype likelihoods at ploidy 1 (2 columns), genotype likelihoods at ploidy 2 (3 columns), ..., genotype likelihoods at ploidy 8 (9 columns)

Syntax Example

python3 Genotype_Likelihoods.py names.filelist -p ploidyFile -i 0.1x7,0.15,0.1x2 -d 0.9 -m 0.2 -M2 0.15 -M3 0.1 -dp 5 -s 1

Inference of ploidy levels

Overview: infer ploidy levels of each individual in a .genolikes file (uncompressed file from the output of the Genotype_Likelihoods.py script) using sequencing coverage and genotype likelihoods.

Rscript HMMploidy.R fileList=$FILELIST wind=$WIND

Input

fileList: list of base names of files with formats .genolikes, uncompressed from the output of Genotype_Likelihoods.py. Alternatively, use file and write directly the prefix of the desired file that has format .genolikes. For example

file1
file2

for the files file1.genolikes, file2.genolikes.

wind: windows size, i.e. nr of loci whose means and genotype likelihoods are summed together
maxPloidy: max number of ploidy levels (default 6)
nameFile: file with names of all the individuals in the genotype likelihoods file (default: names are ind1, ind2, ...). For example

Cryptococcus
Streptococcus

will be used as names for the first and the second individual of the input file. This will work if the genotype likelihood files have the same number of individuals.

chosenInd: comma separated indices of individuals to analyze (default: analyzes all individuals). For example chosenInd=1,2,7 to choose the first, second and seventh individual.
quantileTrim: comma separated values of 2 quantiles to trim depth values. (default 0,1 = keep all data)
minInd: min number of individuals with data for which a locus is considered in the analysis (default 1)
eps: sequencing/mapping error rate (default 0.0005)

Note:

if working on simulations, the list of basenames is already given in output by the simulation script.
the .genolikes.gz files given in output by the Genotype_Likelihoods.py script must be gunzipped.

Output

For each base name, there are two outputs:

a .pdf file with inferred ploidy numbers for each individual
a .HMMploidy file where, for each individual, results are arranged on lines as it follows:
- File name and sample name
- starting probability vector $\delta$ of inferred ploidy numbers
- transition matrix $A$ printed on one line (column by column)
- $\alpha$ vector of parameters for the depth distribution
- $\beta$ vector of parameters for the depth distributions
- final loglikelihood of the model
- inferred ploidy numbers of the model
- posterior probabilities for the inferred states (column by column)
- proportion of posterior probability for each inferred ploidy number
- starting locus of each window of SNPs
- ending locus of each window of SNPs
- inferred ploidy level in each of the windows defined in the two lines above

Application Example: Analyze ploidy numbers from simulations

Simulate 5 genomes ina file called poliploidyGenome, with 4 segments having ploidy numbers 2-5-4-2 for all genomes, and simulating two different haploid depths, 3X and 8X. Consider 1000 simulated loci for each segment. Let SOFTWAREPATHPATH be the folder containing the scripts of this repository.

Simulate the dataset

$SOFTWAREPATHPATH/simulationScript.sh -p ploidy.file -d 3,8 -l 1000,1000,1000,1000 -o poliploidyGenome

with the file ploidy.file containing:

2x5
5x5
4x5
2x5

There are two simulated .mpileup.gz files in output: poliploidyGenome.DP3.mpileup.gz, poliploidyGenome.DP8.mpileup.gz, and a file containing the list of prefixes for calculating the genotype likelihoods: poliploidyGenome.filelist

Calculate genotype likelihoods without applying any filter (see the script options for filtering details):

python3 $SOFTWAREPATHPATH/Genotype_Likelihoods.py poliploidyGenome.filelist

Run the analysis of ploidy numbers for the two simulated datasets. Use window size 100, analyze all the individuals, consider the max ploidy number being 5, do not trim the data, and use loci where there is data for >=2 individuals

gunzip *.genolikes.gz #the R script needs gunzipped genolikes files
Rscript $SOFTWAREPATHPATH/HMMploidy.R  fileList=poliploidyGenome.filelist  maxPloidy=5  wind=100  minInd=2

In output you will find the pdf files poliploidyGenome.DP4.pdf,poliploidyGenome.DP8.pdf with a fancy plot of the inferred ploidy for each individual and the mean depth of each window. The text output is contained in the files poliploidyGenome.DP4.HMMploidy,poliploidyGenome.DP8.HMMploidy

Notes

Each .genolikes file given in input to HMMploidy.R is expected to have only one chromosome. However, in presence of more chromosomes/contigs, the software will automatically analyze the contigs as adjacent in a whole genome fashion. Windows of loci will be done in each chromosome/contig to avoid sharing loci in a window between chromosomes/contigs. Windows will be downsized in chromosomes/contigs not large enough, and chromosomes/contigs without SNPs will be automatically removed from the analysis.

hmmploidy's People

Contributors

Stargazers

Watchers

Forkers

olivertarrant17 isinaltinkaya

hmmploidy's Issues

Genotype_Likelihoods.py error

Hi there,

Thanks for creating the HMMploidy tool. I have met a problem when I use the Genotype_Likelihoods.py, my command is: python3 Genotype_Likelihoods.py test.filelist -p ploidylevel.txt -s 1 -r reference.fa , and inside the test.filelist there was a bam file from a CNV simulator.
After I run it, it shows:
Ploidy levels to be tested in analysis are: [1, 2, 3, 4, 5]
1 file found
['chr20/30X_cnv1.bam']
File type detected: bam
Output file is: chr20/30X_cnv1.genolikes.gz
[mpileup] 1 samples in 1 input files
[0.0]
Traceback (most recent call last):
File "Genotype_Likelihoods.py", line 225, in
nonMajorCount = generics.calcNonMajorCounts(myReads)
File "/Volumes/Grace/Work/HMMploidy/generics.py", line 43, in calcNonMajorCounts
counts[alleles.index(read.base[i])]+=1
ValueError: 'a' is not in list

I used HMMploidy tool for may times, but this is the first time I met the error. I don't know why. Could you please give me some suggestions? thanks a lot!

Value error ‘g’ not in list

Hello,
I’ve tried with both .bam and .mpileup files but keep getting the following error:

python3.5 Genotype_Likelihoods.py PileupList.txt
Seed is not set.
Default ploidy levels to be tested in analysis are: [1, 2, 3, 4, 5, 6]
112 files found
['FF0395.mpileup', 'FF0397.mpileup', 'FF0399.mpileup', 'FF0401.mpileup', 'FF0402.mpileup', 'FF0403.mpileup', 'FF0404.mpileup', 'FF0405.mpileup', 'FF0406.mpileup', 'FF0407.mpileup', 'FF0408.mpileup', 'FF0409.mpileup', 'FF0410.mpileup', 'FF0411.mpileup', 'FF0412.mpileup', 'FF0413.mpileup', 'FF0414.mpileup', 'FF0415.mpileup', 'FF0416.mpileup', 'FF0417.mpileup', 'FF0418.mpileup', 'FF0419.mpileup', 'FF0420.mpileup', 'FF0421.mpileup', 'FF0422.mpileup', 'FF0423.mpileup', 'FF0424.mpileup', 'FF0425.mpileup', 'FF0426.mpileup', 'FF0427.mpileup', 'FF0428.mpileup', 'FF0429.mpileup', 'FF0430.mpileup', 'FF0431.mpileup', 'FF0432.mpileup', 'FF0433.mpileup', 'FF0434.mpileup', 'FF0435.mpileup', 'FF0436.mpileup', 'FF0437.mpileup', 'FF0438.mpileup', 'FF0439.mpileup', 'FF0440.mpileup', 'FF0441.mpileup', 'FF0442.mpileup', 'FF0443.mpileup', 'FF0444.mpileup', 'FF0445.mpileup', 'FF0446.mpileup', 'FF0447.mpileup', 'FF0448.mpileup', 'FF0449.mpileup', 'FF0450.mpileup', 'FF0451.mpileup', 'FF0452.mpileup', 'FF0453.mpileup', 'FF0454.mpileup', 'FF0455.mpileup', 'FF0456.mpileup', 'FF0457.mpileup', 'FF0458.mpileup', 'FF0459.mpileup', 'FF0460.mpileup', 'FF0461.mpileup', 'FF0462.mpileup', 'FF0463.mpileup', 'FF0464.mpileup', 'FF0465.mpileup', 'FF0466.mpileup', 'FF0467.mpileup', 'FF0468.mpileup', 'FF0469.mpileup', 'FF0470.mpileup', 'FF0471.mpileup', 'FF0472.mpileup', 'FF0473.mpileup', 'FF0474.mpileup', 'FF0475.mpileup', 'FF0476.mpileup', 'FF0477.mpileup', 'FF0478.mpileup', 'FF0479.mpileup', 'FF0480.mpileup', 'FF0481.mpileup', 'FF0482.mpileup', 'FF0483.mpileup', 'FF0484.mpileup', 'FF0485.mpileup', 'FF0486.mpileup', 'FF0487.mpileup', 'FF0488.mpileup', 'FF0489.mpileup', 'FF0490.mpileup', 'FF0491.mpileup', 'FF0492.mpileup', 'FF0493.mpileup', 'FF0494.mpileup', 'FF0495.mpileup', 'FF0496.mpileup', 'FF0497.mpileup', 'FF0498.mpileup', 'FF0499.mpileup', 'FF0500.mpileup', 'FF0501.mpileup', 'FF0502.mpileup', 'FF0503.mpileup', 'FF0504.mpileup', 'FF0505.mpileup', 'FF0506.mpileup', 'FF0507.mpileup', 'FF0508.mpileup', 'FF0509.mpileup']
File type detected: mpileup
Output file is: ./FF0395.genolikes.gz
[0.0]
Traceback (most recent call last):
File "Genotype_Likelihoods.py", line 420, in
nonMajorCount = generics.calcNonMajorCounts(myReads)
File "/mnt/data/micksong/HMMploidy/generics.py", line 43, in calcNonMajorCounts
counts[alleles.index(read.base[i])]+=1
ValueError: 'g' is not in list

Any help would be appreciated.
Thanks!
Michael

hiddenMarkovPloidyShell.R error

Hi,

I've been trying to run the hiddenMarkovPloidyShell.R script on a set of genomes. The species has a genome size that's relatively small, ~ 20Mbp. The data is around 20X coverage. When I run the script with pretty much default parameters I get the following error:

Error in seq.default(loci[1], loci[length(loci) - 1], minSize) :
  'to' must be a finite number
Calls: windowsBuilder -> seq -> seq.default
Execution halted

I tried running the example generated through using the simulation script and the script finished successfully so I assume that I might be doing something incorrectly in either generating the mpileups or the way these pileups are being handled by the genotypelikelihood script? Anyway, please let me know if additional info would be useful in troubleshooting this error.

names.filelist

I've been trying to calculate genotype likelihoods on a test file (30gb), but I keep getting this error:

python3 Genotype_Likelihoods.py output.file.mpileup
Seed is not set.
Default ploidy levels to be tested in analysis are: [1, 2, 3, 4, 5, 6]
utg000001l 235 T 2 ^).^&. EE )& file is not supported. Supported file types are '.mpileup', '.mpileup.gz' and '.bam'.

I've tried with both .mpileup file and with .bam file. For the .bam the traceback is as follows:

Seed is not set.
Default ploidy levels to be tested in analysis are: [1, 2, 3, 4, 5, 6]
Traceback (most recent call last):
File "/Users/sashanikolaeva/Desktop/HMMploidy-master/Genotype_Likelihoods.py", line 73, in
line = line.decode().strip('\n') # convert bytes into strings
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I assume it must be an issue with my file, but I have no idea what it might be. My other guess was that maybe it is because I am giving it an individual file and not names.filelist?

Any suggestions?

Thank you!

HMMploidy.R per contig lists of sites out of order

Hi there,

I'm trying to use HMMploidy.R with genotype likelihoods estimated with the provided python script and come across the following error at line 1274 of HMMploidy.R:

Error in if (max(ctgSites) - min(ctgSites) == 0) { :
missing value where TRUE/FALSE needed
Execution halted

I added a print statement to see where the script was tripping up and came across this:

[1] "TRINITY_DN55323_c2_g1_i1"
[1] 876 897 898 902 904 953 956 959 76 79 87 130 969 NA NA NA NA NA

It's the NAs which are causing the problem, and looking upstream a number of ctgSites objects appear to have the sites listed out of order, e.g.:

[1] "TRINITY_DN56192_c3_g46_i1"
[1] 786 989 1001 68
[1] "TRINITY_DN56106_c1_g1_i2"
[1] 71

In the above case it seems like that trailing site at position 68 of the first contig should actually be at position 68 of the following contig.

Attached is the genotype likelihood file if it helps: F2.sub.genolikes.zip

It's roughly ~1000 contigs of ~1000 bp each (I've subsampled an alignment of a much larger number of contigs to get ~1 Mb of contigs with >100 mean depth) so I'm aware that it doesn't strictly meet the assumptions of the analysis. The distribution of sites across contigs is also quite uneven. I'm working with a not-even-close-to-model organism and am interested in corroborating some earlier findings with my own RNAseq data, so probably quite removed from what the script was written for.

In any case I hope this is useful information. Can you advise if it's a problem on my end?

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.