admiralenola / scoary

Pan-genome wide association studies

License: GNU General Public License v3.0

Python 100.00%

gene-presence-absence gwas pan-genome bacteria genomics

scoary's Issues

Scoary not accepting my traits.csv

Hello.

I am trying to use Scoary to run a GWAS on my Roary pan-genome. However, the code insists that my traits.csv is incorrectly formatted, and then crashes. At first it complained that the box in the top-left should be blank (as per your instructions). However, the box was always blank. Now it is trying to tell me that the strain names are not the same as in the Roary output. However, these names were taken straight from the output and placed into a list. The local bioinformatician and I have been trying to fix this all day, and the problem persists.

The newest error is that the program reports that a strain is missing from the genes file despite no other program having this issue, and the strain in the traits list is straight from the gene file. I can manually find it within seconds by looking at the header for both files.

I have been googling this issue and it does not seem to be a common occurrence. Any advice for this problem?
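One way to narrow down a mismatch like this is to look at the raw cells with `repr()`, which exposes invisible characters (a byte-order mark, trailing spaces) that a spreadsheet view hides. The sketch below is only a diagnostic idea: the helper name `suspicious_cells` and the example file contents are made up for illustration and are not part of Scoary.

```python
import csv
import io

BOM = "\ufeff"

def suspicious_cells(csv_text, delimiter=","):
    """Return (row_number, repr(cell)) for first-column cells holding a BOM or stray whitespace."""
    findings = []
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if not row:
            findings.append((row_number, "<empty row>"))
            continue
        cell = row[0]
        # A clean cell is unchanged after removing a BOM and surrounding whitespace.
        if cell != cell.replace(BOM, "").strip():
            findings.append((row_number, repr(cell)))
    return findings

# Hypothetical traits file saved with a BOM and a trailing space in one name.
example = "\ufeff,Trait1\nStrainA,1\nStrainB ,0\n"
for row_number, cell in suspicious_cells(example):
    print(row_number, cell)
```

If the top-left cell or a strain name shows up here, the file only looks correct on screen.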

KeyError when trait file contains subset of genomes

If the traits file contains only a subset of the genomes in the Roary file, Scoary currently exits with a KeyError.

If you want to run Scoary on just a subset of the genomes that you ran Roary on (You might be missing phenotypic data for some isolates for example), there are currently two ways of handling this:

Using --restrict_to, pointing to a csv file which lists only the genomes you want to include.
Editing the Roary file by column-wise deletion of the genomes you don't have in your traits file. (Scoary doesn't use the summary statistics in the first columns of the Roary file, so this will not impact the analysis.)

Plan:

Should not throw a KeyError. Implement a formal check that the names in the two files are identical. Then, if one file contains fewer genomes, analyze only the subset, but give warning.
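The planned check could look roughly like the sketch below. It is not the actual Scoary implementation; the function name `shared_isolates` and the warning texts are assumptions for illustration.

```python
import warnings

def shared_isolates(gene_isolates, trait_isolates):
    """Return isolates present in both files, warning about any that are dropped."""
    gene_set, trait_set = set(gene_isolates), set(trait_isolates)
    only_genes = sorted(gene_set - trait_set)
    only_traits = sorted(trait_set - gene_set)
    if only_genes:
        warnings.warn("Not in traits file, skipped: " + ", ".join(only_genes))
    if only_traits:
        warnings.warn("Not in gene file, skipped: " + ", ".join(only_traits))
    # Preserve the column order of the gene presence/absence file.
    return [name for name in gene_isolates if name in trait_set]

print(shared_isolates(["A", "B", "C"], ["C", "A"]))  # ['A', 'C'], plus one warning
```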

Add functionality to handle missing values in the genetic data

Although not seen in Roary files, genetic data can also have missing components. Implement handling of this.

recording missing data in trait.csv file?

Hi @AdmiralenOla, thanks so much for your awesome tool!

I am getting an error and I think it's related to the fact that I have missing data in my matrix. I have it recorded as "NA". I don't want to code it as "0" because the value is unknown: I didn't collect that trait for that strain.

Is there a better way for me to code missing data in the traits.csv file?

Thanks so much! ~ Josh

Bug: Empirical p-values > 1.0

There is a bug that can sometimes cause the empirical p-values from permutations to become > 1.0 (Maximum 1.03).
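For reference, a commonly used estimator for a permutation p-value is (r + 1) / (n + 1), where n is the number of permutations and r is the number of permuted statistics at least as extreme as the observed one. By construction it cannot exceed 1.0. The sketch below shows that estimator only; it is not taken from Scoary's code and does not claim to identify the cause of the bug.

```python
import random

def empirical_p(observed, permuted):
    """(r + 1) / (n + 1), where r counts permuted statistics >= observed."""
    r = sum(1 for value in permuted if value >= observed)
    return (r + 1) / (len(permuted) + 1)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
null_stats = [rng.random() for _ in range(999)]
p = empirical_p(0.5, null_stats)
assert 0.0 < p <= 1.0
print(p)
```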

Scoary run in cluster MPI???

Hi!
I would like to set up Scoary on our cluster.
Is it possible to run it on a cluster?
Do you know if MPI is supported?

Regards

Error in detecting names for columns in Roary file (solved) + TypeError: binom_test() ...

Hi,
This is my first time using Scoary. I'm on version 1.3.3. I seem to be having trouble proceeding with the run because of the Roary file. I didn't change anything in gene_presence_absence.csv; it's just as Roary produced it.

Here is the error it throws:

Warning: Could not properly detect the correct names for all columns in the ROARY table.
Traceback (most recent call last):
  File "scoary.py", line 22, in <module>
    methods.main()
  File "/home/adgm/Scoary/scoary/methods.py", line 117, in main
    allowed_isolates=allowed_isolates)
  File "/home/adgm/Scoary/scoary/methods.py", line 218, in Csv_to_dic_Roary
    r[q[genecol]] = {"Non-unique Gene name": q[nugcol], "Annotation": q[anncol]} if roaryfile else {}
IndexError: list index out of range

Don't crash when newick file has node labels

Should be fairly simple to allow internal node labels in the newick file. Although this does not impact results, there is no reason for the program to crash.
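Until that is supported, one workaround is to strip the internal labels before passing the tree in. A minimal sketch, assuming unquoted labels (quoted Newick labels containing parentheses, commas or colons would need a real parser); the function name is hypothetical:

```python
import re

def strip_internal_labels(newick):
    """Remove labels that directly follow a closing parenthesis (internal nodes and root)."""
    return re.sub(r"\)[^():,;]+", ")", newick)

print(strip_internal_labels("((A:1,B:2)n1:0.5,C:3)root;"))  # ((A:1,B:2):0.5,C:3);
```

Branch lengths are kept because the pattern stops at the first colon.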

Maximum number of genomes tested?

I am doing a Scoary test with a 5,829-genome Roary file (~250 MB) and a custom tree. It works fine in the beginning, but crashes (out of memory?) when storing the pairs. The server I use is Ubuntu 14.04 LTS (Biolinux 8) with a 4-core Xeon processor (8 threads), 32 GB RAM and 32 GB swap.

Is there a maximum to the number of genomes for Scoary?

Problem adding long and float64

There is a strange bug where Scoary will sometimes crash with the following message:

TypeError: unsupported operand type(s) for +: 'long' and 'numpy.float64'

I don't know why only some systems are seeing this error. In fact, addition of long and numpy.float64 does not throw a TypeError on my 1.11.1 or 1.11.2 versions, but it does on some other systems.

Piggy output as input to use with Scoary?

Is it possible to use piggy output (IGR_presence_absence.csv) as input for Scoary?

How to filter Scoary Results

I've used Scoary to decipher COGs that might have different associations between Host Species, and everything worked like a charm. But now, I'm unsure of what columns I should use to extract the best observations. Sensitivity, Specificity, Odds ratio and the multiple p values that were outputted vary in interpretation (High Odds ratio, Low Sensitivity). If I want to prune the results, which column should I care the most about and filter?

Empty output files with k-mer GWAS

Hi,

I am running Scoary with a manually created k-mer file. While the job finishes OK, the output file is empty, and I don't see any errors. I have 60 different datasets for which the same occurs, so I believe I am either misusing the options, not properly formatting the k-mer file, or exceeding the array.

I used Scoary 1.6.16 with Python 3.6 installed using conda, and the command I use is:

scoary -s 2 -t gr1.csv -g gr1_matrix_kmer.csv --threads 16 \
    -o scoary_out_gr1 --delimiter ',' -c I EPW

"gr1.csv" has 1077 rows and looks like:

,serovar_phenotype
DRR106950,0
ERR023784,0

while "gr1_matrix_kmer.csv" has 1077 columns and 718522 rows, where the first column is the k-mer, and remaining columns are the samples (thus the -s 2 option).

I would really appreciate your input on this. If you need any additional information, or you have any ideas why I am not getting any output, please let me know.

Thank you,
Natasha

GenomeIDs not matching custom tree and gene pres/abs

Hi!

This is probably my issue but I've noticed a few things that are a bit confusing.

I'm running scoary for 535 genomes and have made sure that all genomeIDs in my custom newick file match the genomeIDs in the gene presence/absence file but scoary reports that they don't match?

Reading custom tree file
CRITICAL:
Traceback (most recent call last):
File "/Users/matt/bin/miniconda3/lib/python3.6/site-packages/scoary/methods.py", line 246, in main
sys.exit("CRITICAL: Please make sure that isolates in "
SystemExit: CRITICAL: Please make sure that isolates in your custom tree match those in your gene presence absence file.
CRITICAL: Please make sure that isolates in your custom tree match those in your gene presence absence file.

I checked that the IDs do match in a number of ways, one of which was to run without and have scoary output its own newick file. I noticed that scoary outputs the 'inference' column header as a leaf on the tree (so +1 leaves). I deleted this column and then scoary runs with my custom tree with no issues but for n = 534 genomes?

Thanks in advance for any help with this!
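A quick way to see exactly which names differ is to pull the leaf names out of the Newick string and compare them with the sample columns of the gene file. This is a rough sketch assuming unquoted leaf names; the helper names and the example tree and columns are hypothetical, not Scoary code.

```python
import re

def newick_leaves(newick):
    """Leaf names are the labels that follow '(' or ',' (unquoted names assumed)."""
    names = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in names if name.strip()]

def compare_names(tree_text, header_columns):
    """Return (only_in_tree, only_in_header) as sorted lists."""
    leaves, columns = set(newick_leaves(tree_text)), set(header_columns)
    return sorted(leaves - columns), sorted(columns - leaves)

tree = "((G1:0.1,G2:0.2)90:0.05,G3:0.3);"
columns = ["G1", "G2", "G3", "Inference"]  # hypothetical sample columns
print(compare_names(tree, columns))  # ([], ['Inference'])
```

In this made-up example the extra name is a metadata column being read as a sample, which matches the +1 leaf described above.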

Images

An issue for placing images to be linked in the readme.

Isolate_tree_pop_structure_highlysignificant_26May.pdf
Isolate_tree_pop_structure_notsignificant_26May.pdf

Don't enforce "Non-unique gene name" and "Annotation" columns

Stop enforcing the columns "Non-unique gene name" and "Annotation" in the output. Some users might have an input file with only a single identifier column (Gene ID) before the sample info starts, and want to run with -s 2.

In the current version, this will cause Scoary to fill in the "Non-unique Gene name" and "Annotation" columns with sample data. (Because it automatically assumes that this info can be found in columns 2 and 3). There is really no need to enforce any other columns than Gene ID.

/var/spool/gridengine/execd/cu17/job_scripts/371921: line 11: 9884 Killed

/var/spool/gridengine/execd/cu17/job_scripts/371921: line 11: 9884 Killed /gluster/home/yangtao/miniconda3/bin/scoary -g /gluster/home/yangtao/zhihe_seq/velvet/contigs/prokka/b/chrom/gene_presence_absence1.csv -t /gluster/home/yangtao/zhihe_seq/velvet/contigs/prokka/b/chrom/trait1.csv

There is no result.

Allow user to specify delimiter in output

Segmentation fault

I am trying to run Scoary on a dataset of 26 strains and 6927 genes. The program begins executing until a point where it is killed and returns a segmentation fault. Does this mean I do not have sufficient memory to run the analysis?

Gene enrichments across host sites

Do unequal sample sizes (strain counts) per host affect the enrichment analysis (using the --no_pairwise flag)?

RuntimeError: maximum recursion depth exceeded while calling a Python object

I am trying to run scoary on roary output from 3100 samples. I get the following error
RuntimeError: maximum recursion depth exceeded while calling a Python object

I tried both python versions 2.7 and 3.5 but the error remains the same

Separate trait files

Hi!

I was running out of RAM when running Scoary. I was able to solve it by using one traits.csv file per trait instead of a single traits.csv file with all the different traits. However, I am not sure if this is going to change the statistical calculations. My purpose is to find bacterial proteins associated with the isolation source of the bacteria, so I am following an approach similar to the one explained with "cattle, human, sheep and food" on the main Scoary page on GitHub ("Enrichment of genes in select host groups"). So, the question is:
Can I use one CSV file per trait and run the process several times instead of using a single CSV file with all the traits?

Thank you very much in advance

Not all genes are reported with "-p 1.0"

Hello,

I ran Scoary with the "-p 1.0" option to obtain p-values for all genes, but some of the genes in gene_presence_absence.csv were not reported.
Could you please tell me why Scoary does not output information for all genes even when I use "-p 1.0"?

Maximum recursion depth

As originally reported by @dutchscientist in #53 , Scoary currently throws a

RuntimeError: maximum recursion depth exceeded while calling a Python object

when attempting to perform pairwise comparisons on a dataset that is too large. The exact threshold is unknown to me. I have run it with ~3500 isolates. This error was reported with ~5800.

Full message:

Storing results: ST45
Calculating max number of contrasting pairs for each nominally significant gene
100.00%Traceback (most recent call last):
  File "/usr/local/bin/scoary", line 11, in <module>
    load_entry_point('scoary==1.6.9', 'console_scripts', 'scoary')()
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 244, in main
    delimiter=args.delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 813, in StoreResults
    num_threads, no_time, delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 920, in StoreTraitResult
    Threadresults = list(Threadresults)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
RuntimeError: maximum recursion depth exceeded while calling a Python object
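If the limit is being hit by a recursive walk over a very deep tree (an assumption; the traceback does not show which function recurses), the usual remedy is to replace the recursion with an explicit stack. The sketch below illustrates the idea on a nested-tuple tree, which is a made-up representation and not Scoary's internal data structure.

```python
def count_leaves(tree):
    """Count leaves of a nested-tuple tree without recursion."""
    stack, leaves = [tree], 0
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            stack.extend(node)  # children are visited on later iterations
        else:
            leaves += 1
    return leaves

# Build a caterpillar tree far deeper than the default recursion limit.
deep = "leaf0"
for i in range(1, 5000):
    deep = (deep, "leaf%d" % i)

print(count_leaves(deep))  # 5000
```

Raising the limit with `sys.setrecursionlimit` is the other common workaround, at the risk of crashing the interpreter if the C stack runs out.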

Bug: Pruning isolates from phylogenetic tree

I've been made aware of a bug that sometimes occurs when pruning many isolates from the phylogenetic tree calculated internally. The issue occurs when the following subtree is encountered, and X-es are isolates to prune:

     ---- [x]
----|     ---- [x]
     ---- |
           ---- [x]

(Excuse my horrible ASCII drawing)

Test data are missing so test suite does not succeed

Hi,
I want to build a Debian package, which usually includes running the test suite. Unfortunately the result file Tetracycline_resistance.results.csv is not part of the download tarball, so the test suite fails.
Kind regards, Andreas.

Bug found - adjusted p-value

I was testing your script and I noticed that it seems to be printing the same value for the Holm-Sidak_p and Benjamini_H_p columns for all genes.

Best regards

C Mendes
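For comparison, the two textbook corrections generally give different values, so identical columns do point to a bug. The sketch below implements the standard Holm-Sidak (step-down) and Benjamini-Hochberg (step-up) adjustments; it is written from the textbook definitions, not from Scoary's source, and the function names are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each value is capped by its successors.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

def holm_sidak(p_values):
    """Return Holm-Sidak (step-down Sidak) adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order, start=1):
        candidate = 1.0 - (1.0 - p_values[index]) ** (m - rank + 1)
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[index] = running_max
    return adjusted

p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(p))
print(holm_sidak(p))
```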

P value calculations

Hi,

Nice idea for utilising Roary output to do GWAS, but I am a little concerned that people are going to be using this and reporting results when it's not appropriate to do so.

For example, you have already pointed out that Fisher's test is just not appropriate for population structure reasons. To an untrained eye, low P-values are good, ergo there is good evidence for the hypothesis, when in actual fact that is not an appropriate statistic to apply.

It's a neat attempt to deal with this using non-intersecting contrasting pairs, although it would be nice if you had given a definition of what those actually are. You are still faced with the same problem (perhaps worse) of not really dealing with population structure, and of applying a test that hinges on independence of trials. Plus, you have picked p=0.5 for the binomial distribution, but do not justify it! Is it appropriate for all kinds of trees? Is it species specific? Why not 0.3657849, or 0.6903453?

Obviously I wish you luck and hope you find a good solution for dealing with population structure, but for people who are naive about statistics and are just looking for low P values, this is a dangerous tool. It should be made clear that users should talk to a statistician or bioinformatician if they don't know what they are doing and just want to apply it to their Roary analysis.

I hope you would add python bindings to existing tools for doing this from Roary output:

https://github.com/jessiewu/bacterialGWAS
and
https://github.com/sgearle/bugwas
are from the same paper.
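To make the p=0.5 question concrete: one common reading is that, under the null hypothesis of no association, each contrasting pair is equally likely to point in either direction, which turns the test into a sign test. The sketch below computes an exact two-sided binomial p-value under that assumption. It is a generic textbook calculation, not Scoary's code, and the pair counts are invented.

```python
from math import comb

def binomial_two_sided(successes, trials, p=0.5):
    """Exact two-sided test: sum the probabilities of outcomes no more likely than the observed one."""
    def pmf(k):
        return comb(trials, k) * p ** k * (1 - p) ** (trials - k)
    observed = pmf(successes)
    # The small relative tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-7))
    return min(1.0, total)

# Hypothetical case: 8 of 10 contrasting pairs support the association.
print(binomial_two_sided(8, 10))  # 0.109375
```

Changing `p` shows how strongly the conclusion depends on that choice, which is the point raised above.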

Splitting paralogs influence to Scoary?

@AdmiralenOla Does not splitting paralogs (-s in Roary) affect Scoary results?

Please switch to Python3

Hi,
in issue #19 you confirm compatibility with Python3. I'd recommend officially switching to Python3, since Python2 is EOL and distributions will stop distributing it soon.
Kind regards, Andreas.

Multiprocessing when using GUI

The GUI currently only uses one thread. I was getting the following error
fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
at the multiprocess stage of methods.py. I'm not sure what's causing it, but everything works fine when running with a single thread. (In fact when running with a single thread the multiprocessing.Pool object is never initiated.)

Scoary output for phandango

Hi,

I was wondering if there is any way to incorporate the output from Scoary into phandango for visualization. As far as I understand, phandango requires a Manhattan plot format for GWAS data https://github.com/jameshadfield/phandango/wiki/Input%20data%20formats#manhattan-plots

Any idea on how to proceed?

Thanks!!!

Missing genes in results

Hi,

I was using scoary for ~3,600 isolates to test trait association on ~22,000 genes; however, even when I specify -p 1.0, scoary only reports ~3,100 genes in the results rather than the complete set of ~22,000 genes. The analysis ran to completion without errors as well.

Here's the log file content:

08/13/2020 10:34:24 AM    ==== Scoary started ====
08/13/2020 10:34:24 AM    Command: /home/jimmy.liu/.conda/envs/scoary-1.6.16/bin/scoary --threads 32 -g /scratch/jimmy.liu/reference_structure_chewbbaca_res_2020/cluster_157_gwas/allelic_presence_roary.csv -t /scratch/jimmy.liu/reference_structure_chewbbaca_res_2020/cluster_157_gwas/Cluster_157_subset_metadata.csv -o /scratch/jimmy.liu/reference_structure_chewbbaca_res_2020/cluster_157_gwas/ -p 1.0 -m 22416
08/13/2020 10:34:24 AM    Reading gene presence absence file
08/13/2020 10:34:49 AM    Creating Hamming distance matrix based on gene presence/absence
08/13/2020 10:36:30 AM    Building UPGMA tree from distance matrix
08/13/2020 10:38:34 AM    Reading traits file
08/13/2020 10:38:34 AM    Finished loading files into memory.


08/13/2020 10:38:34 AM    ==== Performing statistics ====
08/13/2020 10:38:34 AM    -- Filtration options --
08/13/2020 10:38:34 AM    Individual (Naive):    1.0
08/13/2020 10:38:34 AM    Collapse genes:    False


08/13/2020 10:38:34 AM    Tallying genes and performing statistical analyses
08/13/2020 10:38:34 AM    Gene-wise counting and Fisher's exact tests for trait: grp
08/13/2020 10:39:50 AM    Adding p-values adjusted for testing multiple hypotheses
08/13/2020 10:39:50 AM    Storing results: grp
08/13/2020 10:39:50 AM    Calculating max number of contrasting pairs for each nominally significant gene
08/13/2020 10:41:04 AM    Storing results to file
08/13/2020 10:41:04 AM    

08/13/2020 10:41:04 AM    ==== Finished ====
08/13/2020 10:41:04 AM    Checked a total of 22416 genes for associations to 1 trait(s). Total time used: 399 seconds.
08/13/2020 10:41:04 AM    No warnings were recorded.

You can find my data here:
Trait file: https://drive.google.com/file/d/18nj3zFWS5OWONIn1xZhM_Uht6siOY6-n/view?usp=sharing
Gene presence/absence file: https://drive.google.com/file/d/1pWaDezegBbhc06yTV2OoiMcr3Es6SeRj/view?usp=sharing

Cheers,
Jimmy

gene_presence_absence.csv file from Roary

Dear developers,

I wonder if the input gene_presence_absence.csv file for Scoary should contain binary values (1 and 0) rather than Gene ID?

The example of the input file (https://raw.githubusercontent.com/AdmiralenOla/Scoary/master/scoary/exampledata/Gene_presence_absence.csv) contains binary values (1 and 0) indicating the presence and absence of each gene in each sample, like the gene_presence_absence.Rtab file with binary values (1 and 0) from Roary, rather than the gene_presence_absence.csv file with the Gene ID from Roary (https://github.com/haruosuz/mgsa/blob/master/roary/analysis/i95/gene_presence_absence.csv).
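If a binary matrix is needed, a gene-ID table can be converted by treating any non-empty cell as present. The sketch below does that under the assumption that empty cells (and literal "0" cells) mean absence; whether Scoary itself applies the same rule is not stated here. The table contents and helper name are hypothetical.

```python
import csv
import io

def binarize_row(cells):
    """Map each sample cell to 1 (a gene ID or '1') or 0 (empty or '0')."""
    return [0 if cell.strip() in ("", "0") else 1 for cell in cells]

# Hypothetical two-gene table: one ID column followed by three sample columns.
table = "Gene,S1,S2,S3\ngroup_1,S1_00012,,S3_00440\ngroup_2,,S2_00007,\n"
rows = list(csv.reader(io.StringIO(table)))
for row in rows[1:]:
    print(row[0], binarize_row(row[1:]))
# group_1 [1, 0, 1]
# group_2 [0, 1, 0]
```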

error

Hi,

I keep getting this error message when I ran Scoary on my Roary Output:

CRITICAL:
Traceback (most recent call last):
  File "/miniconda3/envs/scoary/lib/python3.6/site-packages/scoary/methods.py", line 268, in main
    strains)
  File "miniconda3/envs/scoary/lib/python3.6/site-packages/scoary/methods.py", line 568, in Csv_to_dic
    sys.exit("Make sure the top-left cell in the traits file "
SystemExit: Make sure the top-left cell in the traits file is either empty or 'Name'. Do not include empty rows
Make sure the top-left cell in the traits file is either empty or 'Name'. Do not include empty rows

This is my trait CSV file, and there seems to be no error in it.

Name,Abortive,Non_Abortive
B197.11581,0,1
B197.7887,0,1
B197.7889,0,1
B197.789,0,1
B197.7927,0,1

What could be the issue? Thanks

VCF GWAS

How do I use multiple vcf files to create the Scoary csv to be used with the main script? In other words, if I start with multiple vcf files of different isolates mapped to the same reference, how can I get Scoary results (similar to starting from Roary)?

FileNotFoundError: [Errno 2] No such file or directory: 'README_pypi.md'

I am unable to upgrade my existing Scoary installation:

pip3 install --upgrade scoary
Collecting scoary
  Using cached scoary-1.6.13.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/linuxbrew/pip-build-4n646sv8/scoary/setup.py", line 14, in <module>
        long_description=readme(),
      File "/tmp/linuxbrew/pip-build-4n646sv8/scoary/setup.py", line 7, in readme
        with open('README_pypi.md') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'README_pypi.md'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/linuxbrew/pip-build-4n646sv8/scoary/

IndexError: list index out of range

Hi, I'm new to using scoary and am running into an issue. Here is the full error that scoary gives me:

Traceback (most recent call last):
  File "/home/hcm59/miniconda3/envs/scoary/bin/scoary", line 8, in <module>
    sys.exit(main())
  File "/home/hcm59/miniconda3/envs/scoary/lib/python3.9/site-packages/scoary/methods.py", line 278, in main
    RES_and_GTC = Setup_results(genedic, traitsdic, args.collapse)
  File "/home/hcm59/miniconda3/envs/scoary/lib/python3.9/site-packages/scoary/methods.py", line 914, in Setup_results
    bh_c_p_v[s_p_v[len(s_p_v)-1][0]] = last_bh = s_p_v[len(s_p_v)-1][1]
IndexError: list index out of range

It seems to be working prior to this, but stops here and doesn't give any output files. I looked in the methods.py script but couldn't find anything obviously wrong.
My data are output from Roary, a phenotype file, both delimited with commas, and a Newick tree file from IQTree.

I found a previous issue that was similar (#23) but it looks like their problem was that their Roary file was delimited with semicolons, but I'm 99% sure mine is commas.

Any help is appreciated! I can send example files too.

Here's the script I used:

scoary -t /path/dog_verified_host_PhenoForScoary.csv \
    -g /path/gene_presence_absence_roary.csv \
    -o /path \
    -n /path/core_gene_alignment.aln-gb.nw \
    --delimiter , \
    --permute 1000 --threads 10

I'm using scoary in a conda environment that I built on a Linux server. Here are some specifications:

# packages in environment at /home/hcm59/miniconda3/envs/scoary:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
argparse                  1.4.0                    pypi_0    pypi
ca-certificates           2021.4.13            h06a4308_1  
certifi                   2020.12.5        py39h06a4308_0  
ete3                      3.1.2                    pypi_0    pypi
ld_impl_linux-64          2.33.1               h53a641e_7  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
ncurses                   6.2                  he6710b0_1  
numpy                     1.20.2                   pypi_0    pypi
openssl                   1.1.1k               h27cfd23_0  
pip                       21.0.1           py39h06a4308_0  
python                    3.9.2                hdb3f193_0  
readline                  8.1                  h27cfd23_0  
scipy                     1.6.2                    pypi_0    pypi
scoary                    1.6.16                   pypi_0    pypi
setuptools                52.0.0           py39h06a4308_0  
six                       1.15.0           py39h06a4308_0  
sqlite                    3.35.4               hdfb4753_0  
tk                        8.6.10               hbc83047_0  
tzdata                    2020f                h52ac0ba_0  
wheel                     0.36.2             pyhd3eb1b0_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3

Thanks!!
-Holly

Update: just found out we had used Panaroo, not Roary, so I will be looking into this and seeing if I can find a solution!!

Cryptic Python error

Hi,

I want to run Scoary on 384 genomes, for which I have 5 antibiotic resistance phenotypes (A,B,C,D,E) and – obviously – the Roary results. However, the following error occurs (macOS 10.12.2, Python 2.7.13, Scoary 1.6.10 installed via pip)

==== Scoary started ====
Reading gene presence absence file
Creating Hamming distance matrix based on gene presence/absence
Building UPGMA tree from distance matrix
Reading traits file
WARNING: Some isolates have missing values for trait C. Missing-value isolates will not be counted in association analysis towards this trait.
ERROR: Some isolates in your gene presence absence file were not represented in your traits file. These will count as MISSING data and will not be included.
Finished loading files into memory.


==== Performing statistics ====
-- Filtration options --
Individual (Naive):    0.05
Collapse genes:    False


Tallying genes and performing statistical analyses
Gene-wise counting and Fisher's exact tests for trait: C
0.00%Traceback (most recent call last):
  File "/usr/local/bin/scoary", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/scoary/methods.py", line 253, in main
    RES_and_GTC = Setup_results(genedic, traitsdic, args.collapse)
  File "/usr/local/lib/python2.7/site-packages/scoary/methods.py", line 715, in Setup_results
    stats = Perform_statistics(traitsdic[trait], genedic[gene])
  File "/usr/local/lib/python2.7/site-packages/scoary/methods.py", line 863, in Perform_statistics
    if int(traits[t]) == 1 and genes[t] == 1:
ValueError: invalid literal for int() with base 10: ''

The log file ends with the step Gene-wise counting and Fisher's exact tests, no further information is given. I am aware of the WARNING (missing data for some traits) and ERROR (more isolates than phenotyping results, for now).

My traits.csv file looks like this:

,A,B,C,D,E
CH2500,1,0,0,0,1
CH2502,NA,0,0,NA,1
...

Cutting the traits.csv into 5 individual files produced 2/5 results, with 3 Scoary runs still failing with the same error message. Any idea what's going on and how to proceed?

Thank you.
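The ValueError above comes from calling `int()` on an empty string. A defensive parser along the lines below would turn missing entries into `None` instead of crashing. This is only a sketch: the accepted spellings of "missing" are an assumption, and the function is not part of Scoary.

```python
MISSING = {"", "NA", "."}  # assumed spellings of a missing value

def parse_trait(cell):
    """Return 1, 0, or None for a missing value instead of raising on empty cells."""
    value = cell.strip()
    if value.upper() in MISSING:
        return None
    if value in ("0", "1"):
        return int(value)
    raise ValueError("Trait values must be 0, 1, or missing; got %r" % cell)

row = ["CH2502", "NA", "0", "0", "NA", "1"]
print([parse_trait(cell) for cell in row[1:]])  # [None, 0, 0, None, 1]
```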

Genomic associations between subspecies

From what I read, scoary is currently not able to work with non-binary traits.
I want to use scoary in order to determine the pangenomic differences between three apparent subspecies of my bacterium of interest. There appears to be a pretty strong signal, as the genomes cluster distinctly in a PCoA based on gene presence / absence data.
Specifically, I would like to find out which genes are differentially prevalent between the three clusters. Can I supply a trait file that has "dummy variables", something like the table below? My approach should work if Scoary simply removes those samples that have no information for a specific trait. What do you think about this?

Sample_name        Comp_clust_1_2    Comp_clust_1_3    Comp_clust_2_3
member_cluster_1   0                 0                 NA/empty
member_cluster_1   0                 0                 NA/empty
...
...
member_cluster_2   1                 NA/empty          0
member_cluster_2   1                 NA/empty          0
...
...
member_cluster_3   NA/empty          1                 1
member_cluster_3   NA/empty          1                 1
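A table like this can be generated from a simple sample-to-cluster mapping. The sketch below builds one 0/1 column per pair of clusters and writes "NA" for samples outside the pair; the sample names and the helper name are hypothetical, and the column names follow the table above.

```python
from itertools import combinations

def pairwise_dummy_columns(cluster_of):
    """For each pair of clusters build a 0/1 column; samples in neither cluster get 'NA'."""
    clusters = sorted(set(cluster_of.values()))
    columns = {}
    for low, high in combinations(clusters, 2):
        name = "Comp_clust_%s_%s" % (low, high)
        columns[name] = {
            sample: "0" if cluster == low else "1" if cluster == high else "NA"
            for sample, cluster in cluster_of.items()
        }
    return columns

membership = {"s1": 1, "s2": 1, "s3": 2, "s4": 3}  # hypothetical samples
for name, column in pairwise_dummy_columns(membership).items():
    print(name, column)
```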

Self created absence_presence file as input?

Is it possible to use self created absence_presence data file? For example for virulence genes detected with abricate?
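Other reports in this list run Scoary with `-s 2` on a file that has one identifier column followed by the sample columns. Assuming that layout is acceptable, a presence/absence table could be assembled from per-isolate hit lists as in the sketch below. The isolate names, gene names and helper name are invented, and the output has not been checked against Scoary itself.

```python
import csv
import io

def presence_absence_csv(hits, delimiter=","):
    """hits maps isolate -> set of detected genes; returns CSV text with genes as rows."""
    isolates = sorted(hits)
    genes = sorted(set().union(*hits.values()))
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["Gene"] + isolates)
    for gene in genes:
        writer.writerow([gene] + [1 if gene in hits[name] else 0 for name in isolates])
    return buffer.getvalue()

hits = {"iso1": {"stx1", "eae"}, "iso2": {"eae"}, "iso3": set()}  # hypothetical
print(presence_absence_csv(hits))
```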

How to output information concerning all genes?

Hello, Ola,

After the run completed, I found that only genes associated with the trait were reported. How can I get the information for all genes when using the GUI, even though most of them are not significant?

Kind regards,

Lanhong

Producing a log

Like any proper bioinformatics software, Scoary should produce a log file.

Feature: non-dichotomous trait table

I am looking at a particular trait (heat resistance) across a set of strains of a given species, and I was wondering if it would be possible to examine a continuous variation in the trait (normalized from 0-1?) or a discrete set of values (low, medium, high? preferably more = better).

Thanks

_csv.Error: field larger than field limit (131072)

Getting the following error while running scoary.
Reading gene presence absence file
Traceback (most recent call last):
  File "/home/ga23981/src/Scoary-master/scoary.py", line 25, in <module>
    methods.main()
  File "/home/ga23981/src/Scoary-master/scoary/methods.py", line 215, in main
    outdir=args.outdir)
  File "/home/ga23981/src/Scoary-master/scoary/methods.py", line 352, in Csv_to_dic_Roary
    header = next(csvfile)
_csv.Error: field larger than field limit (131072)
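Python's `csv` module caps field length at 131072 characters by default, and the cap can be raised with `csv.field_size_limit`. The sketch below shows the usual pattern; how and where Scoary would need to call it is an assumption. It can also be worth checking the delimiter, since a whole line read as a single field would trigger the same error.

```python
import csv
import io
import sys

def raise_field_limit():
    """Raise the csv field limit as far as the platform allows and return the new limit."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit //= 10  # a C long is smaller than sys.maxsize on some platforms

raise_field_limit()
long_cell = "x" * 200000  # longer than the default 131072 limit
row = next(csv.reader(io.StringIO("gene," + long_cell + "\n")))
print(len(row[1]))  # 200000
```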

Incomplete presence/absence data from Single-cell genomes

Hi Ola,

I'm working with a large number of single-cell amplified genomes, i.e. the individual assemblies are incomplete, ranging from ~30%-95% estimated completeness. This means that I do get reliable gene "presences", but "absences" can mean either true absence or just missed in the assembly.

I was wondering what your thoughts on this kind of data would be with respect to association testing. And do you think Scoary could be used or customized to analyze such data?

Cheers,
Thomas

SNP gwas

Hi,

I have previously got scoary to work with roary output but can't get it to work with SNP output created with the SNP2vcf.py script.

My traits file is definitely formatted correctly (it works with Roary input).

The SNP file looks OK but I get the error:
CRITICAL: Could not find 92dd9dbb-81ae-4faf-8867-4f27deef779f in the genes file.
CRITICAL:
Traceback (most recent call last):
  File "/home/ndm.local/sam/CSOLD/dev/lib/python2.7/site-packages/scoary/methods.py", line 278, in main
    RES_and_GTC = Setup_results(genedic, traitsdic, args.collapse)
  File "/home/ndm.local/sam/CSOLD/dev/lib/python2.7/site-packages/scoary/methods.py", line 798, in Setup_results
    stats = Perform_statistics(traitsdic[trait], genedic[gene])
  File "/home/ndm.local/sam/CSOLD/dev/lib/python2.7/site-packages/scoary/methods.py", line 979, in Perform_statistics
    sys.exit("Make sure strains are named the same in your "
SystemExit: Make sure strains are named the same in your traits file as in your gene presence/absence file
The vcf file definitely contains the isolate in question so I'm not sure what is going on?? Any ideas? (the names are definitely the same too!)

Python3 incompatibility

Scoary currently doesn't work with python3. It seems to freeze and eat up memory when trying to populate the quadtree with pairwise hamming distances.

Quoting fields in results file

Hi Ola,

I was giving a go with the latest version (the ascii logo is very neat!) and noticed that the non numeric fields in the output table are not quoted, which might cause problems when parsing the results, especially in the gene product field.

Example of a line which causes my parser to break:

group_3038,,outer membrane pore protein N, non-specific,3,12,3,332,50.0,96.511627907,27.6666666667,0.0011870360825,1.0,0.421032887975,3,3,1,0.125,0.5

(notice the "outer membrane pore protein N, non-specific" bit)

I'm hotfixing this issue by putting an empty string in the "Annotation" field of Roary's output, but I figured you might want to have a look into this potential issue.

Thanks a lot, Marco
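For illustration, Python's `csv` writer can quote such fields so that an embedded comma survives a round trip. The sketch uses a shortened version of the row above; whether Scoary writes its results through `csv.writer` is not known here.

```python
import csv
import io

row = ["group_3038", "", "outer membrane pore protein N, non-specific", 3, 12, 0.125]

buffer = io.StringIO()
# QUOTE_NONNUMERIC wraps every non-numeric field in double quotes.
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerow(row)
line = buffer.getvalue()
print(line)

# Reading the line back yields six fields, with the annotation intact.
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed), parsed[2])
```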

Add pairwise comp filter results option

The list of filter options should include an option that allows users to see only results where:

Best_pairwise_p < alpha
Best_pairwise_p AND Worst_pairwise_p < alpha
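Until such an option exists, the same filter can be applied to a results file after the run. The sketch below assumes the columns are named as in the two conditions above; the actual header of a results file should be checked first. The gene names and values are invented.

```python
import csv
import io

def filter_pairwise(rows, alpha=0.05, require_worst=False):
    """Keep rows whose Best_pairwise_p (and optionally Worst_pairwise_p) is below alpha."""
    kept = []
    for row in rows:
        best = float(row["Best_pairwise_p"])
        worst = float(row["Worst_pairwise_p"])
        if best < alpha and (worst < alpha or not require_worst):
            kept.append(row)
    return kept

# Hypothetical results table with assumed column names.
results = (
    "Gene,Best_pairwise_p,Worst_pairwise_p\n"
    "geneA,0.01,0.20\n"
    "geneB,0.001,0.03\n"
    "geneC,0.30,0.90\n"
)
rows = list(csv.DictReader(io.StringIO(results)))
print([r["Gene"] for r in filter_pairwise(rows)])                      # ['geneA', 'geneB']
print([r["Gene"] for r in filter_pairwise(rows, require_worst=True)])  # ['geneB']
```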
