mgalardini / pyseer Goto Github PK

View Code? Open in Web Editor NEW

98.0 10.0 24.0 75.17 MB

SEER, reimplemented in python 🐍🔮

Home Page: http://pyseer.readthedocs.io

License: Apache License 2.0

Python 96.20% Shell 3.80%

gwas bacteria k-mer reimplementation reproducible-science

pyseer's Introduction

pyseer

SEER, reimplemented in python by Marco Galardini and John Lees

pyseer --phenotypes phenotypes.tsv --kmers kmers.gz --distances structure.tsv --min-af 0.01 --max-af 0.99 --cpu 15 --filter-pvalue 1E-8

Motivation

Kmers-based GWAS analysis is particularly well suited for bacterial samples, given their high genetic variability. This approach has been implemented by Lees, Vehkala et al., in the form of the SEER software.

The reimplementation presented here should be consistent with the current version of the C++ seer (though we do not guarantee this for all possible cases).

In this version, as well as all the original features, many new features (input types, association models and output parsing) have been implemented. See the documentation and paper for full details.

Citations

Unitigs and elastic net preprint: Lees, John A., Tien Mai, T., et al. Improved inference and prediction of bacterial genotype-phenotype associations using pangenome-spanning regressions. bioRxiv 852426 (2019) doi: 10.1101/852426

pyseer and LMM implementation paper: Lees, John A., Galardini, M., et al. pyseer: a comprehensive tool for microbial pangenome-wide association studies. Bioinformatics 34:4310–4312 (2018). doi: 10.1093/bioinformatics/bty539

Original SEER implementation paper: Lees, John A., et al. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nature communications 7:12797 (2016). doi: 10.1038/ncomms12797

Documentation

Full documentation is available at readthedocs.

You can also build the docs locally (requires sphinx) by typing:

cd docs/
make html

Prerequisites

Between parenthesis the versions the script was tested against:

python 3+ (3.6.6)
numpy (1.15.2)
scipy (1.1.0)
pandas (0.23.4)
scikit-learn (0.20.0)
statsmodels (0.9.0)
pysam (0.15.1)
glmnet_python (commit 946b65c)
DendroPy (4.4.0)

If you would like to use the scree_plot_pyseer script you will also need to have matplotlib installed. If you would like to use the scripts to map and annotate kmers, you will also need bwa, bedtools, bedops and pybedtools installed.

Installation

The easiest way to install pyseer and its dependencies is through conda::

conda install pyseer

If you need conda, download miniconda and add the necessary channels::

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

pyseer can also be installed through pip; download this repository (or one of the releases), cd into it, followed by:

python -m pip install .

If you want multithreading make sure that you are using a version 3 python interpreter:

python3 -m pip install .

If you want the next pre-release, just clone/download this repository and run:

python pyseer-runner.py

Copyright

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pyseer's People

Contributors

Stargazers

Watchers

pyseer's Issues

Must check for PerfectSeparationError

Using statsmodels 0.8.0 I get the following error:

$ ./pyseer firth_kmers.gz firth.pheno firth_distances.tsv                                                                                                               
kmer    af      filter-pvalue   lrt-pvalue      beta    beta-std-err    intercept       PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10
Traceback (most recent call last):
  File "./pyseer", line 590, in <module>
    ret = binary(*data)
  File "./pyseer", line 222, in binary
    res = mod.fit(start_params=start_vec, method='newton')
  File "/home/mgalactus/workspace/pyseer/venv/lib/python3.5/site-packages/statsmodels/discrete/discrete_model.py", line 1377, in fit
    disp=disp, callback=callback, **kwargs)
  File "/home/mgalactus/workspace/pyseer/venv/lib/python3.5/site-packages/statsmodels/discrete/discrete_model.py", line 204, in fit
    disp=disp, callback=callback, **kwargs)
  File "/home/mgalactus/workspace/pyseer/venv/lib/python3.5/site-packages/statsmodels/base/model.py", line 451, in fit
    full_output=full_output)
  File "/home/mgalactus/workspace/pyseer/venv/lib/python3.5/site-packages/statsmodels/base/optimizer.py", line 184, in _fit
    hess=hessian)
  File "/home/mgalactus/workspace/pyseer/venv/lib/python3.5/site-packages/statsmodels/base/optimizer.py", line 242, in _fit_newton
    callback(newparams)
  File "/home/mgalactus/workspace/pyseer/venv/lib/python3.5/site-packages/statsmodels/discrete/discrete_model.py", line 187, in _check_perfect_pred
    raise PerfectSeparationError(msg)
statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available

Will catch it and go on with computation accordingly.

Check that distance and similarity matrices have same samples with mixed effects analysis

When running --lmm with lineage effect, both the similarity and distance matrix are required. No preflight checks are done that they contain the same samples. It should be relatively simple to implement.

Fitting null with perfect separation

This dataset fails:
http://klif.vet.uu.nl/pyseer/genomes/

More concise output to save disk space

We are currently printing the full list of samples that have or don't have each kmer, which can lead to a very large output file size. We could default to a more concise output format, as the original seer did.

Spaces in file names

In bwa.py, change to:

if algorithm == "mem":
        command = "bwa mem -v 1 -k 8 '" + reference + "' '" + fasta + "'"
    elif algorithm == "fastmap":
        command = "bwa fastmap -l 9 " + reference + " " + fasta + "'"
    else:
        sys.stderr.write("Unknown algorithm type for bwa\n")
        raise ValueError(algorithm)

    try:
        devnull = stderr=sys.stderr
    except AttributeError:
        # python 2.x
        devnull = open(sys.stderr, 'w')
    bwa_p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=devnull, shell=True, universal_newlines=True)

Need to test python2 ok, and may want better error handling

Add tests and CI

GT KeyError with own data

Hello,

I'm really excited to use this software, I've been hoping someone produces a GWA method for prokaryotes that accounts for clonality and lineage effects!

I ran into a KeyError when I tried to run pyseer on my data:

Traceback (most recent call last):

File "/home/sbusby/miniconda3/bin/pyseer", line 10, in
sys.exit(main())
File "/home/sbusby/miniconda3/lib/python3.6/site-packages/pyseer/main.py", line 450, in main
options.cpu*options.block_size))
File "/home/sbusby/miniconda3/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/home/sbusby/miniconda3/lib/python3.6/multiprocessing/pool.py", line 376, in _map_async
iterable = list(iterable)
File "/home/sbusby/miniconda3/lib/python3.6/site-packages/pyseer/input.py", line 471, in iter_variants
sample_order)
File "/home/sbusby/miniconda3/lib/python3.6/site-packages/pyseer/input.py", line 298, in read_variant
var_name = read_vcf_var(line_in, d)
File "/home/sbusby/miniconda3/lib/python3.6/site-packages/pyseer/input.py", line 375, in read_vcf_var
for haplotype in call['GT']:
File "pysam/libcbcf.pyx", line 3439, in pysam.libcbcf.VariantRecordSample.getitem
File "pysam/libcbcf.pyx", line 792, in pysam.libcbcf.bcf_format_get_value
KeyError: 'invalid FORMAT: GT'

I have the genotype field populated in the VCFs and the merged VCF (I can attach my data if needed). I have pysam version 0.14.1. I'm using the distances flag with a phylogenetic tree that I generated using RAxML. I created the distances.tsv with the phylogeny_distance.py.

I also tried running pyseer using the test data, core_genome_aln.tree, but there is not output when I run: python pyseer-master/scripts/phylogeny_distance.py core_genome_aln.tree. I wanted to make sure it wasn't a formatting problem on my end but I can't test the --distances flag on the test data provided.

What could I be doing wrong?

Thanks for your help!

Sarah

Guidelines for contributing

Hi,
I was just curious if there are adopted guidelines for contributing to this project? I have just been through the tutorial and noticed a couple of minor typos but wasn't sure of the preferred route to make these known.
Cheers!

Input file formats

Hi,

Great to see a python version!

Can you please provide example files for phenotypes.tsv and structure.tsv?

Cheers,
Liam

annotate_hits: GFF shouldn't have quotes in attributes tags

I encountered some problems while running annotate_hits_pyseer on some data I have. The results looked like this:

TGTAGTTCAGTCGGTAGAACGGCGGACTGTTAATCCGTATGTCACTGGTTCGAGTCCAGTCAGAGGAGCCAATTT	6.03E-01	9.39E-25	1.80E-13	4.77E-01	6.09E-02	4.59E-01	MDS3	contig1:2106259-2106333;;;,contig1:2120658-2120732;;;,contig1:2124900-2124974;;;,contig1:2122491-2122565;;;
TGTAGTTCAGTCGGTAGAACGGCGGACTGTTAATCCGTATGTCACTGGTTCGAGTCCAGTCAGAGGAGCCAATTTTCTGTTTTC	5.60E-01	2.18E-21	1.91E-09	3.99E-01	6.38E-02	3.81E-01	MDS3	contig1:2122491-2122574;;;
TAGTTCAGTCGGTAGAACGGCGGACTGTTAATCCGTATGTCACTGGTTCGAGTCCAGTCAGAGGAGCCAATTT	6.03E-01	9.39E-25	1.80E-13	4.77E-01	6.09E-02	4.59E-01	MDS1	contig1:2106261-2106333;;;,contig1:2122493-2122565;;;,contig1:2124902-2124974;;;,contig1:2120658-2120730;;;
GTAGAACGGCGGACTGTTAATCCGTATGTCACTGGTTCGAGTCCAGTCAGAGGAGCCAATTT	6.12E-01	8.92E-25	2.24E-13	4.80E-01	6.16E-02	4.57E-01	MDS1	contig1:2122504-2122565;;;,contig1:2124913-2124974;;;,contig1:2106272-2106333;;;,contig1:2120658-2120719;;;

Notice that no gene ID or name ever appears (e.g. contig1:2122504-2122565;;;).

I then compared the GFF files we have in the tutorial:

FM211187 EMBL databank_entry 1 2221315 0.000 + . ID="FM211187.1";organism="Streptococcus pneumoniae ATCC 700669";strain="ATCC 700669";mol_type="genomic DNA";db_xref="taxon:561276"

and the one I have:

contig1 Prodigal:2.6 CDS 335 2797 . + 0 ID=XXX_00001;Parent=XXX_00001_gene;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:genome.faa:x0002;locus_tag=XXX_00001;product=Bifunctional aspartokinase/homoserine dehydrogenase 1

And I noticed that the attributes are quoted in the tutorial example and unquoted in my prokka-generated GFFs. The GFF3 specification states:

A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. Attribute values do not need to be and should not be quoted. The quotes should be included as part of the value by parsers and not stripped.

I'll implement a fix where the regular expression doesn't use the quotes anymore, but we would need to change the tutorial files, and maybe also have a workaround for those GFFs that use quotes?

Open to suggestions,
Marco

Option for multiple chromosomes when remapping kmers

Hi,
Would it be possible to implement use of multiple chromosome assemblies/species when mapping kmers back to a reference genome? I'm working with a species that has two chromosomes and currently the second chromosome appears to be ignored in the phandango_mapper step (or at least not output in the plot files). Apologies if this is an error on my part and this is already supported.

Thanks,

Derek

Covariates on binary phenotypes lead to very different results

There might be something going on here: to begin with pyseer outputs only 50 kmers, versus 174 of seer. pyseer is reporting an lrt-pvalue of 1 for all 50 kmers, while seer is reporting more realistic values. Any idea what the cause might be? I might have missed something on how the covariates are handled?

Commands used:

seer -k kmers.gz -p subset.pheno --struct all_structure_new --covar_file covariates.txt --covar_list 2q,3 --pval 1 --chisq 1

pyseer kmers.gz subset.pheno distances.tsv --filter-pvalue 1 --lrt-pvalue 1 --max-dimensions 3 --covariates covariates.txt --use-covariates 2q 3

Allow no distance matrix

It is possible to have the population structure control in the user-provided covariates (e.g. as BAPS clusters), and therefore run without a --distances argument. However, this argument is currently required.

We should update so that this requirement is removed when a --no-distances argument is explicitly provided (to avoid inadvertently forgetting to incorporate population structure).

Report regression status

It's useful to know whether k-mers were fit with Firth regression, and whether that was due to a large SE or low chisq table entries. The other comments available in seer won't be necessary here

Does not work if no covariates supplied

Perhaps should pass a NULL, which is checked by regression functions. See #13

bwa mapping with draft references

Problem is perhaps with multiple contigs
e.g.

Draft reference 1
Traceback (most recent call last):
  File "/usr/local/bin/annotate_hits_pyseer", line 9, in <module>
    load_entry_point('pyseer==1.1.2', 'console_scripts', 'annotate_hits_pyseer')()
  File "/usr/local/lib/python3.5/dist-packages/pyseer/kmer_mapping/annotate_hits.py", line 182, in main
    down_genes = extract_genes(sorted_query.closest(b=ref_annotation, s=False, D="ref", id=True, stream=True))
  File "/usr/local/lib/python3.5/dist-packages/pyseer/kmer_mapping/annotate_hits.py", line 55, in extract_genes
    for tag in match.fields[15].split(";"):
IndexError: list index out of range

Draft reference 1
Traceback (most recent call last):
  File "/usr/local/bin/annotate_hits_pyseer", line 9, in <module>
    load_entry_point('pyseer==1.1.2', 'console_scripts', 'annotate_hits_pyseer')()
  File "/usr/local/lib/python3.5/dist-packages/pyseer/kmer_mapping/annotate_hits.py", line 158, in main
    for mapping, kmer_line in zip(mapped_kmers, seer_remaining):
  File "/usr/local/lib/python3.5/dist-packages/pyseer/kmer_mapping/bwa.py", line 117, in bwa_iter
    (contig, pos) = hit.split(":")
ValueError: not enough values to unpack (expected 2, got 1)

Example files attached
bwa_issue.tar.gz

No output

Thanks for making this which looks awesome - I've spent a few days playing with it but have a few issues: firstly when running with kmers the programme runs fine but finds no significant kmers - I 26/87 samples witht the phenotype of interes, is my analysis underpowered? I know that real relevant SNPs exist from heuristic analysis but pyseer fails to find these - I'm not sure what's going wrong?

Secondly when attempting to use the output of roary gene_presence_abscence.Rtab, I get the error:
Read 87 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (87, 87)
Analysing 87 samples found in both phenotype and similarity matrix
ValueError: Unexpected mismatch between header and data row.

I think the Rtab file is fine (I've checked it in R) and I'm using the same phenotype file as before. Do you have any idea what is causing this?

Add user defined covariates

As in seer, use dummy encoding for categorical covariates

Pyseer prompts uncontrolled number of processes in computer clusters

I am trying to run pyseer on a series of datasets through a pipeline on a compute cluster (ComputeCanada) and it causes a strange error. Here is the note by cluster admin regarding the job running pyseer:

I noticed that your job is asking for 32 cpu cores, but is starting to use many more threads than that. This will cause many context switches, inefficient use of the processors on the node, and poor performance of your code.

And trying to run the job using a different setting, apparently the error appeared again

According to the Graham sysadmin your Python jobs are again creating havoc by spinning off a huge number of threads/processes:
[root@gra572:~] uptime
20:35:54 up 70 days, 0 min, 1 user, load average: 978.96, 972.12, 966.96

[root@gra572:~] ps -fu masih|wc -l
41

[root@gra572:~] ps -Lfu masih|wc -l
1069

I am using latest version of pyseer installed using conda and --cpu option is set to 32, so I was expecting 32 processes at max and and it is the first time I see a python code causes such error. Searching online and asking cluster admins, I could not find the reason why this is happening so I was wondering whether you know the cause of this behavior of pyseer and its solution (Although it might not be a real bug in pyseer).

Thanks

Matrix inversion error for null model

Hi Marco and John,

When I use pyseer, I get this error:

$ pyseer --phenotypes 33S.PA.pheno --kmers PA.kmer.out.gz --distances 33S.PA.tree.dist.tsv --cpu 20 > PA.kmer.pyseer.out

Read 33 phenotypes
Detected binary phenotype
Structure matrix has dimension (33, 33)
Analysing 33 samples found in both phenotype and structure matrix
Matrix inversion error for null model
Could not fit null model, exiting

Is it due to the numbers of strains and phenotypes are too small？Or something wrong with my inputs? I have only 33 strains and their phenotype. @mgalardini @johnlees

Problem with using multi-threading?

I'm using pyseer on Ubuntu 16.04, within a conda env running python3.7.

The below crash happens at the same point on multiple runs, the output file has reached 3.6 GB (it's the same size to the byte when it throws the error).

Any thoughts?

Read 744 phenotypes
Detected binary phenotype
Structure matrix has dimension (448, 448)
Analysing 448 samples found in both phenotype and structure matrix
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/envs/py37/lib/python3.7/site-packages/pyseer/input.py", line 476, in iter_variants
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/envs/py37/bin/pyseer", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda2/envs/py37/lib/python3.7/site-packages/pyseer/__main__.py", line 450, in main
    options.cpu*options.block_size))
  File "/home/ubuntu/anaconda2/envs/py37/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/ubuntu/anaconda2/envs/py37/lib/python3.7/multiprocessing/pool.py", line 383, in _map_async
    iterable = list(iterable)
RuntimeError: generator raised StopIteration

Allow concise distance matrix

Right now we are requiring the user to provide a square matrix with distances between strains, which might scale badly up to a few thousand samples. We could look into using a more concise format:

sample1    sample2    0.00025
sample1    sample3    0.05
[...]

ValueError: Unexpected mismatch between header and data row

hi,
I am trying to run pyseer on total number of isolate 1199. the command line as follows:
pyseer --phenotypes host.pheno --pres gene_presence_absence.Rtab --no-distances --cpu 20 >host_association.txt
But i am encountering the same error all the time. I have cross check the isolates name and number in both the input but didn't found any difference.

First few lines are the warning and then execution and later the error arived:
/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Read 1199 phenotypes
Detected binary phenotype
Traceback (most recent call last):
File "/bin/pyseer", line 11, in
sys.exit(main())
File "/lib/python3.6/site-packages/pyseer/main.py", line 450, in main
options.cpu*options.block_size))
File "/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/lib/python3.6/multiprocessing/pool.py", line 376, in _map_async
iterable = list(iterable)
File "/lib/python3.6/site-packages/pyseer/input.py", line 471, in iter_variants
sample_order)
File "/lib/python3.6/site-packages/pyseer/input.py", line 323, in read_variant
raise ValueError('Unexpected mismatch between header and data row')
ValueError: Unexpected mismatch between header and data row

Store decomposed matrices

So don't have to do cmdscale/rotation every time

Add to bioconda

Once the current issues are closed I'd say we'd have a good version to add to bioconda

Saving non-numerical covariates with the elastic net

With a string as the covariates using --enet and --save-model:

File "/nfs/users/nfs_j/jl11/installations/pyseer/pyseer-runner.py", line 8, in <module>
    main()
File "/nfs/users/nfs_j/jl11/installations/pyseer/pyseer/__main__.py", line 614, in main
pred_model[covariate] = (np.mean(covariate), enet_betas[cov_idx])
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2920, in mean
out=out, **kwargs)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 75, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type

Covariates are probably not being encoded correctly in the first place, or may be the wrong variable name.

SyntaxError: Non-ASCII character

Hi guys,
On both my laptop and my cluster, git clone followed by python pyseer-runner.py gives

  File "pyseer-runner.py", line 5, in <module>
    from pyseer.__main__ import main
  File "/home/chris/pyseer/pyseer/__main__.py", line 45, in <module>
    from .enet import load_all_vars
  File "/home/chris/pyseer/pyseer/enet.py", line 174
SyntaxError: Non-ASCII character '\xc2' in file /home/chris/pyseer/pyseer/enet.py on line 174, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Any ideas?
Cheers
Chris

Try alternative regression optimisers

Report total number of tests

This can be useful for checking filtering, and in gauging multiple-testing burden

Move ranges package from bedtools

There is a native python implementation of intervals here:
https://github.com/vsbuffalo/BioRanges

This might save on tmp files, dependencies and improve speed

Script taking up all available CPUs despite --cpu=1

I observed this on my laptop: even though I define --cpu 1, the script takes up all my four cores. There must be something up in the underlying libraries.

Code refactoring

It would be nice to refactor the code so that it becomes a proper python package, which will definitely improve things for future additions.

I suggest to follow the ideas exposed here: https://gehrcke.de/2014/02/distributing-a-python-command-line-application/

I can look into it whenever I've got some free time.

Add tab separated note to help in argparse

It is reasonable to assume inputs may be CSV rather than space separated - probably a good idea to note this in help args as well as docs

Results might be different on nearly separable k-mers

Trying both on the Trimethoprim k-mers from the paper, I get different results between pyseer and seer (also in Firth regression). The slope isn't in the right direction.

seer:

sequence        maf     chisq_p_val     wald_p_val      lrt_p_val       beta    se      comments
TCAAGACAAGAATCTCTATATTCTCGGTGG  0.066   1.032e-11       5.005e-06       1.597e-17       3.250e+00       7.121e-01       NA

pyseer:

sequence        maf     chisq_p_val     wald_p_val      lrt_p_val       beta    se      comments
TCAAGACAAGAATCTCTATATTCTCGGTGG  6.61E-02        1.16E-11       0.125       0.116       -0.265       0.173       NA

I think a comparison with glm() in R would help, but I think seer is correct.

It may also be a regression caused by my (as yet uncommitted) code, so I think some tests (#4) would be useful

phenotype file format

Hi,

My first time using pyseer and it seems great!

I found a very minor issue with the documentation (on readthedocs) - it says the phenotype file "format is whitespace separated" but for me it only works when this file is tab-separated.

Lineage association question

Good day,

I wanted to know if some results that Im obtain, and which wasnt expected could be fine or if there might be some issu with my approach and use of pyseer.

I have a set of isolates already classified into 5 different populations (lineages) of different sizes (number of isolates).

And wanted to know if there were some kmers associated to the lineage, which could explain why the populations were different.

To do so I used fms-lite to count the kmers and mash to obtain the distance matrix. Then I did 5 different analysis, one for each population, setting the pheno file with 1 if that isolate belonged to the queried population (case) and 0 if not (control), using the lineage option, given the labels for each population and forcing the use of continuous phenotype (as binary would perfectly separate the data and no regression was done).

In each experiment, the lineage_effects.txt file associated the corresponded population as the most important to the phenotype (expected result). But when looking into the kmers, in one case, there were no kmers related to the population.

My question is: What could be happening there that no kmer was found to be associated to the expected population, something that didnt happen to the other experiments?

Thank you,

Unit testing

I have a few unit tests made already, but it's probably best to address "version 2" so that some work doesn't get duplicated.

@johnlees would you have some VCFs and .RTab files I can play with? Thanks!

Phandango script

I would like to plot in phandango the mapped kmers to my reference which were associated to given lineages. An I used the phandango_mapper script to do so, but I wasnt able to find a way to map the results in the .plot file to the kmers file.

Is there any identification for the mapped kmers, or a table for the kmer and in which position it was mapped in the reference to be able to identify the relationship to the lineage?

Thanks,

Draw a scree plot

Helpful for picking the number of dimensions to include. Might be best to have a mode which does this and doesn't run the association.

covariates

Hi Marco,

I've been trying to use covariates and I have a few issues. First I get this error when using '--use-covariates':

Traceback (most recent call last):
File "/home/linuxbrew/.linuxbrew/bin/pyseer", line 11, in
load_entry_point('pyseer==0.3.1', 'console_scripts', 'pyseer')()
File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.6/site-packages/pyseer/main.py", line 162, in main
p)
File "/home/linuxbrew/.linuxbrew/opt/python3/lib/python3.6/site-packages/pyseer/input.py", line 97, in load_covariates
sys.stderr.write('Covariates columns values should be > 1 and lower ' +
NameError: name 'sys' is not defined

Not sure why its throwing that as 'sys' is clearly imported. Might be a system specific issue for me.

The main problem I have is that I don't understand your -h for '--use-covariates':

--use-covariates [USE_COVARIATES [USE_COVARIATES ...]]
Covariates to use. Format is "2 3q 4" (q for
quantitative) [Default: load covariates but don't use
them]
Can you please elaborate on 'Format is "2 3q 4" '?

The error that fails above would tell me 'Covariates columns values should be > 1 and lower than total number of columns (3)'. Which for me only leaves 2 as an option. I've modelled my covariates file on the example (covariates.txt) you give in the test folder. Tab separated [sample, sample count/number, covar]. I have 1 covar (country) that has 18 options (UK, US, Australia etc..).

Calling it as 'nice pyseer kmers.txt.gz invasive2.txt snp_dist.tsv --min-af 0.01 --max-af 0.99 --cpu 44 --filter-pvalue 1E-2 --max-dimensions 2 --covariates co_vars.txt --use-covariates 2' I get 'Matrix inversion error for null model. Could not fit null model, exiting'

bwa mem doesn't catch kmers shorter than 30bp

When running annotate_hits_pyseer on genomes indicated as references, bwa mem is used, allowing non-exact matches, while for draft genomes we use bwa fastmap looking exclusively for exact matches.

The problem is that with current settings, bwa mem only considers kmers longer than 30bp (and I don't think there is a way to force it to look for smaller sizes?). This is what I get when using two references and some drafts:

>>> python3 annotate_hits_pyseer-runner.py filtered_kmers.tsv references.txt test.txt
Reference 1                                         
31163 kmers remain                                  
Reference 2                                                                                              
31158 kmers remain                                  
Draft reference 3                                                                                        
20208 kmers remain                                                                                       
Draft reference 4                                                                                        
20060 kmers remain                                  
Draft reference 5                                   
90 kmers remain                                                                                          
[...]

And this when I use only draft genomes (same order):

>>> python3 annotate_hits_pyseer-runner.py filtered_kmers.tsv alldrafts.txt test.txt
Draft reference 1
31163 kmers remain
Draft reference 2
30838 kmers remain
Draft reference 3
729 kmers remain
Draft reference 4
728 kmers remain
Draft reference 5
0 kmers remain
0 kmers remain unannotated

This is not technically a bug, but if we want to prioritize reference genomes to map kmers we should run a two pass search: first with bwa mem and then bwa fastmap. I have it implemented, I'll do a pull request shortly.

Add Firth regression

When Newton fails, or SE is high

k-mer association fails with IndexError

When using the newest version of pyseer with python 3.6.7, I get the following issue. I have formatted my phenotype data to match the requirement ("sample \t continuous" header, two tab-delimited columns with sample names and continuous data).

I executed this command in a for loop:
nohup python ~/pyseer/pyseer-runner.py --phenotypes $METADATAPATH2/$i.pheno --kmers fsm_kmers2.txt.gz --distances mash.tsv --cpu 15 --output-patterns $i""kmer2_patterns.txt > $i""kmers2.txt &

This is the output after pyseer crashes
(py36) molleraj@loma:~/FASTA$ more pyopyseer_kmers2.txt
variant af filter-pvalue lrt-pvalue beta beta-std-err intercept PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 notes
/data1/home/molleraj/miniconda2/envs/py36/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/data1/home/molleraj/miniconda2/envs/py36/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(args, **kwds)
Read 259 phenotypes
Detected continuous phenotype
Structure matrix has dimension (264, 264)
Analysing 255 samples found in both phenotype and structure matrix
Traceback (most recent call last):
File "/data1/home/molleraj/pyseer/pyseer-runner.py", line 8, in
main()
File "/data1/home/molleraj/pyseer/pyseer/main.py", line 659, in main
options.cpuoptions.block_size))
File "/data1/home/molleraj/miniconda2/envs/py36/lib/python3.6/multiprocessing/pool.py", line 296, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/data1/home/molleraj/miniconda2/envs/py36/lib/python3.6/multiprocessing/pool.py", line 398, in _map_async
iterable = list(iterable)
File "/data1/home/molleraj/pyseer/pyseer/input.py", line 541, in iter_variants
sample_order)
File "/data1/home/molleraj/pyseer/pyseer/input.py", line 342, in read_variant
'|')[1].lstrip().split())
IndexError: list index out of range

I do not get this issue with SNP or COG association.

Handle missing data in VCF and Rtab files

I was made aware of the fact that we are considering . entries in VCFs as the variant being absent instead of uncalled. I think this should be changed to meet users' expectations. While we are at it it could good to add this feature to Rtab files, as some people use the format to encode their own data such as CNVs. Should be relatively easy, I will try a fix soon.

Generator raised stop-iteration

Hi, I have installed pyseer on python 3.7 and running the program, terminates premature, produces output including the analysis for only half of the variants with the following error:

Read 2997 phenotypes
Detected binary phenotype
Structure matrix has dimension (3000, 3000)
Analysing 2997 samples found in both phenotype and structure matrix
Traceback (most recent call last):
File "Sim/lib/python3.7/site-packages/pyseer/input.py", line 479, in iter_variants
raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "Sim/bin/pyseer", line 11, in
sys.exit(main())
File "Sim/lib/python3.7/site-packages/pyseer/main.py", line 453, in main
options.cpu*options.block_size))
File "python/3.7.0/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "python/3.7.0/lib/python3.7/multiprocessing/pool.py", line 383, in _map_async
iterable = list(iterable)
RuntimeError: generator raised StopIteration

Distances used for population structure estimation

Hello,

With regard to the distance matrix used in the fixed effects analysis, is there anything specific about the distance matrices calculated using mash or the patristic distances from a phylogenetic tree for the pyseer models to work? ie. does this have to be a distance matrix calculated using mash (estimated Jaccard distance) or one containing patristic distances?

Or are these distances interchangeable, and this could be a distance matrix calculated using one of the classical evolutionary models, such as the Jukes-Cantor or F84 models?

Any response is much appreciated!

Thank you,
Conrad Izydorczyk

Auto-detect continuous/binary

Still allow forcing of continuous
Also update README.md

Some lrt p-values are zero

Probably because arbitrary precision is needed in erfc when calculating the normal dist integral. See test results in #4

Performance improvements (reprise)

I think the Firth regression reintroduced some bottlenecks in the code; I'll do some profiling to see if there's anything we can do to speed it up. Hopefully it'll be easy.

installation testing

Hi guys, the test installation instructions say
Run unit tests:
pytest -v tests
Test functions and output:
cd tests/ && bash run_test.sh && cd ../
Can you clarify in what directory these commands should be run? If installed via a git clone it's fairly obvious; if via pip or conda it's not...
Cheers
Chris