
nonpareil's People

Contributors

bebatut, gunturus, lmrodriguezr


nonpareil's Issues

Fatal error: Reduce the number of query reads (-X)

I had the following fatal error while running Nonpareil v3.4 as follows:

nonpareil -T kmer -s ../xx.human/$DS.NonHuman.1.fa -b np$DS -t 16 -f fasta -X 100

Fatal error:
Reduce the number of query reads (-X) to 10%% of total reads
[ 0.0] Fatal error: Reduce the number of query reads (-X) to 10%% of total reads

It turns out I had an error in the path (my file names are *.fasta, not *.fa); once that was fixed it ran fine. But the error message is misleading.

Thank you!

Warnings compiling

The code produces the following warnings when compiling with gcc 4.6:

universal.cpp: In function ‘void say(const char*, ...)’:
universal.cpp:110:25: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
[...]
nonpareil_mating.cpp: In function ‘void nonpareil_count_mates_block(int*&, int, char**&, char**&, int, int, int, matepar_t)’:
nonpareil_mating.cpp:123:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
[...]
nonpareil_sampling.cpp: In function ‘int nonpareil_sample_portion(double*&, int, samplepar_t)’:
nonpareil_sampling.cpp:53:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

This problem was first reported as part of #4.

extracting out "final coverage" , "actual effort" and "required effort" in numbers from .npo file

Hey @lmrodriguezr @bebatut @gunturus

I was wondering if it's possible to extract the actual values of the statistics measured by nonpareil directly, without the R graph.

I am looking at 200 samples and visualizing all of them on one plot doesn't work. So I wanted to get the values of "final coverage", "actual sequencing effort" and "required sequencing effort" as numbers from the .npo file, so that I can compare among samples.

Thanks!
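Until such an option exists, the summary value can be pulled straight from the .npo text. A minimal sketch, assuming the .npo layout shown in other issues on this page (data rows follow the '# @' header lines; the last row covers the full dataset and column 2 is the estimated average coverage) and a hypothetical file name sample.npo:

```shell
# Hypothetical file name sample.npo: skip the '# @...' header lines,
# then take column 2 of the last data row as the estimated coverage
# of the full dataset (a fraction; multiply by 100 for percent).
coverage=$(grep -v '^#' sample.npo | tail -n 1 | awk '{print $2}')
echo "final coverage: ${coverage}"
```

The same one-liner in a loop over *.npo gives a quick per-sample table without plotting anything.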

Better visualization of Nonpareil.curve.batch for 200 samples

Hey @lmrodriguezr

I was wondering if it's possible to call ggplot or any other R visualization package to get a cleaner, publication-quality image of Nonpareil.curve for 200 samples?

My code:
$ls 230*.npo > sample_list_batch.txt #add .npo filenames to a text file
R

sample_names<-read.table("sample_list_batch.txt",header=FALSE)
colnames(sample_names)<-c("File") #the filenames is the first column called "File"
sample_names$Name<-gsub("_R1.fastq.gz.fa-nonpareli-output.npo","",sample_names$File) #create a new column called "Name"
attach(sample_names)
pdf("batch_curve_plot.pdf")
np<-Nonpareil.curve.batch(sample_names$File,label=sample_names$Name ,modelOnly=TRUE);
#>Nonpareil.legend(np)
detach(sample_names)
dev.off()

My plot:
(plot attached)

Thanks for help!
Jigyasa

Cannot install Nonpareil.R

See the following session reported in a GNU Linux box:

$ sudo make install
if [ ! -d /usr/local/bin ] ; then mkdir -p /usr/local/bin ; fi
if [ ! -d /usr/local/man/man1 ] ; then mkdir -p /usr/local/man/man1 ; fi
if [ -e nonpareil ] ; then install -m 0755 nonpareil /usr/local/bin/ ; fi
if [ -e nonpareil-mpi ] ; then install -m 0755 nonpareil-mpi /usr/local/bin/ ; fi
cp docs/_build/man/nonpareil.1 /usr/local/man/man1/nonpareil.1
R CMD INSTALL utils/Nonpareil
* installing to library ‘/usr/local/lib/R/site-library’
* installing *source* package ‘Nonpareil’ ...
** R
** preparing package for lazy loading
Error in eval(expr, envir, enclos) : object '..' not found
Error : unable to load R code in package 'Nonpareil'
ERROR: lazy loading failed for package 'Nonpareil'
* removing '/usr/local/lib/R/site-library/Nonpareil'
make: *** [install] Error 1

Segmentation fault

I am having problems running nonpareil: it runs fine until it starts the sub-sampling, at which point I get a segmentation fault:

[ 15.8] Thread 5 completed 85 comparisons, joining results
[ 15.8] Sub-sampling library
Segmentation fault (core dumped)00000%

I am also getting the following warnings when compiling; they may be related, but I do not know how to fix them.

cd enveomics/ && make nonpareil
make[1]: Entering directory '/home/erick/tools/nonpareil/enveomics'
g++ universal.cpp -c
universal.cpp: In function ‘void say(const char*, ...)’:
universal.cpp:110:25: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
g++ sequence.cpp -c
g++ nonpareil_mating.cpp -c
nonpareil_mating.cpp: In function ‘void nonpareil_count_mates_block(int*&, int, char**&, char**&, int, int, int, matepar_t)’:
nonpareil_mating.cpp:123:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
g++ nonpareil_sampling.cpp -c
nonpareil_sampling.cpp: In function ‘int nonpareil_sample_portion(double*&, int, samplepar_t)’:
nonpareil_sampling.cpp:53:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
make[1]: Leaving directory '/home/erick/tools/nonpareil/enveomics'
g++ nonpareil.cpp enveomics/universal.o enveomics/sequence.o -lpthread enveomics/nonpareil_mating.o enveomics/nonpareil_sampling.o -o nonpareil

Nonpareil.curve fails on test dataset

nonpareil -s test.fastq -T alignment -b test -f fastq

R version 3.2.5 (2016-04-14)

library(Nonpareil);
Nonpareil.curve('test.npo');
Error in if (a[twenty.pc, 5] == 0) { : argument is of length zero
In addition: Warning message:
In log(read.length) : NaNs produced

Strange output filenames

Hi,

I see some strange behavior, where some of the output files don't have the prefix assigned via -b:

$ ls
C1825_AAGACGA_L006.nobg.fastq
$ nonpareil -s C1825_AAGACGCA_L006.nobg.fastq -f fastq -T kmer -o C1825
Nonpareil v3.2                                      
 [      0.0]  reading C1825_AAGACGCA_L006.nobg.fastq
 [      0.0]  Picking 10000 random sequences        
 [      0.0]  Started counting                      
 [      6.5]  Read file with 18544670 sequences     
 [      6.5]  Average read length is 103.408643bp   
 [      6.5]  Sub-sampling library                  
 [      6.7]  Evaluating consistency                
 [      6.7]  Everything seems correct              
$ ls
C1825   C1825_AAGACGCA_L006.nobg.fastq  ??UH?????.npa   ??UH?????.npc   ??UH?????.npl

The files named ??UH????? are a bit annoying to work with on the terminal, and running nonpareil twice in the same dir overwrites these files, which is also a bit annoying.

My system is Linux:

$ uname -srvmpio
Linux 2.6.32-696.13.2.el6.x86_64 #1 SMP Thu Oct 5 17:13:40 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux

Is this the intended behavior?

Estimating coverage per sequencing effort

When running nonpareil from the command line and viewing the .npo output file in R, where does the "empty circle" value for estimated coverage vs sequencing effort come from? Presumably it is from the end of the .npo file (see output below), which in the case of the attached output suggests that the coverage was about 72.8% for my sequencing effort:
...
1732486 0.35225 0.05063 0.31818 0.35354 0.38462
2474980 0.40542 0.04240 0.37692 0.40458 0.43333
3535685 0.46346 0.03631 0.43787 0.46512 0.48837
5050979 0.51771 0.02825 0.49793 0.51894 0.53689
7215684 0.57511 0.02255 0.56012 0.57550 0.59053
10308120 0.63122 0.01838 0.61943 0.63125 0.64300
14725885 0.68184 0.01314 0.67288 0.68182 0.69086
21036979 0.72835 0.00807 0.72249 0.72833 0.73418

Yet when I plot the data in R, the empty circle appears about 8-10% higher than this, and this is the same for all six of the datasets I'm trying this on.

(plot attached)

Am I interpreting this correctly? Is the last line, second column of the .npo output file the estimate of coverage for my sequencing effort? If not, how do I determine this? Thanks for your help; I think your method has a lot of potential!

The problem with xlim

When I plot a curve with the code "Nonpareil.curve.batch(File, libnames=Name, xlim=c(1e+8, 1e+15))", the xlim does not work. The picture is below. I want the x-axis to extend further, so that it can show more information. The xlim parameter only works in Nonpareil.curve(); it does not work in Nonpareil.curve.batch(). I would also like to know what the "arrow" in the picture means.
(plot attached)

MPI version installation error

Dear authors,

It is me again :)

The regular version of nonpareil installed fine. I am now attempting the install of nonpareil-mpi. However, below is the error that I receive:

$ make nonpareil-mpi
cd enveomics/ && make nonpareil-mpi
make[1]: Entering directory '/mnt/gaiagpfs/projects/ecosystem_biology/local_tools/nonpareil/enveomics'
mpic++ -DENVEOMICS_MULTI_NODE universal.cpp -c -Wall -std=c++11
mpic++ -DENVEOMICS_MULTI_NODE multinode.cpp -c -Wall -std=c++11
multinode.cpp: In function ‘void init_multinode(int&, char**&, int&, int&)’:
multinode.cpp:16:4: error: ‘MPI’ has not been declared
    MPI::Init(argc, argv);
    ^~~
multinode.cpp:17:16: error: ‘MPI’ has not been declared
    processes = MPI::COMM_WORLD.Get_size();
                ^~~
multinode.cpp:18:16: error: ‘MPI’ has not been declared
    processID = MPI::COMM_WORLD.Get_rank();
                ^~~
multinode.cpp: In function ‘void finalize_multinode()’:
multinode.cpp:21:4: error: ‘MPI’ has not been declared
    MPI::Finalize();
    ^~~
multinode.cpp: In function ‘void barrier_multinode()’:
multinode.cpp:25:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Barrier();
    ^~~
multinode.cpp: In function ‘size_t broadcast_int(size_t)’:
multinode.cpp:31:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(buffer, 1, MPI::INT, 0);
    ^~~
multinode.cpp:31:37: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(buffer, 1, MPI::INT, 0);
                                     ^~~
multinode.cpp: In function ‘double broadcast_double(double)’:
multinode.cpp:40:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(buffer, 1, MPI::DOUBLE, 0);
    ^~~
multinode.cpp:40:37: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(buffer, 1, MPI::DOUBLE, 0);
                                     ^~~
multinode.cpp: In function ‘char* broadcast_char(char*, size_t)’:
multinode.cpp:47:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(value, size, MPI::CHAR, 0);
    ^~~
multinode.cpp:47:39: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(value, size, MPI::CHAR, 0);
                                       ^~~
multinode.cpp: In function ‘char broadcast_char(char)’:
multinode.cpp:53:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(buffer, 1, MPI::CHAR, 0);
    ^~~
multinode.cpp:53:37: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Bcast(buffer, 1, MPI::CHAR, 0);
                                     ^~~
multinode.cpp: In function ‘void reduce_sum_int(int*, int*, int)’:
multinode.cpp:60:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send, receive, size, MPI::INT, MPI::SUM, 0);
    ^~~
multinode.cpp:60:48: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send, receive, size, MPI::INT, MPI::SUM, 0);
                                                ^~~
multinode.cpp:60:58: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send, receive, size, MPI::INT, MPI::SUM, 0);
                                                          ^~~
multinode.cpp: In function ‘void reduce_sum_int(int, int)’:
multinode.cpp:65:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send_ar, receive_ar, 1, MPI::INT, MPI::SUM, 0);
    ^~~
multinode.cpp:65:51: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send_ar, receive_ar, 1, MPI::INT, MPI::SUM, 0);
                                                   ^~~
multinode.cpp:65:61: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send_ar, receive_ar, 1, MPI::INT, MPI::SUM, 0);
                                                             ^~~
multinode.cpp: In function ‘void reduce_sum_double(double*, double*, int)’:
multinode.cpp:70:4: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send, receive, size, MPI::DOUBLE, MPI::SUM, 0);
    ^~~
multinode.cpp:70:48: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send, receive, size, MPI::DOUBLE, MPI::SUM, 0);
                                                ^~~
multinode.cpp:70:61: error: ‘MPI’ has not been declared
    MPI::COMM_WORLD.Reduce(send, receive, size, MPI::DOUBLE, MPI::SUM, 0);
                                                             ^~~
Makefile:32: recipe for target 'nonpareil-mpi' failed
make[1]: *** [nonpareil-mpi] Error 1
make[1]: Leaving directory '/mnt/gaiagpfs/projects/ecosystem_biology/local_tools/nonpareil/enveomics'
Makefile:29: recipe for target 'nonpareil-mpi' failed
make: *** [nonpareil-mpi] Error 2

Below is the MPI version available on my cluster.

$ mpic++ --version
g++ (GCC) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Let me know if you need more information. It would be nice to implement the mpi version of the tool.

Best regards,
Shaman

Fatal error: Cannot open the file: reads.fa

Hello there,

I have an error when I tried to run nonpareil:

tanshiming@S620100019205:~/Documents/nonpareil-trial$ nonpareil -s reads.fa -b output
Nonpareil v3
[ 0.0] Counting sequences
Fatal error:
Cannot open the file:
reads.fa
[ 0.0] Fatal error: Cannot open the file: reads.fa

I have tried this on both FASTA and FASTQ files; the files are an Illumina-trimmed dataset. Can someone please advise?

Thanks!
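When nonpareil reports it cannot open a file, the usual causes are a wrong working directory, permissions, or a still-compressed input. A quick pre-flight sketch (hypothetical file name reads.fa):

```shell
# Hypothetical file name reads.fa: check existence and readability,
# and inspect the file type (nonpareil expects uncompressed input).
[ -e reads.fa ] || echo "reads.fa: no such file (check your working directory)"
[ -r reads.fa ] || echo "reads.fa: not readable (check permissions)"
file reads.fa   # 'gzip compressed data' here means decompress first
```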

nonpareil.curve.batch - double legend?

Hello,

I am trying to graph 24 samples together using Nonpareil.curve.batch, and the regular command outputs a legend that sits right on top of my curves, and the box isn't see-through. When I try to move the legend using Nonpareil.legend, it makes a second legend (so the graph has two). Is there a way to remove the legend from the Nonpareil.curve.batch command? Or am I doing something wrong? (Note that I am currently just working with 2 samples to get it to work.)

Here is my code:
samples <- read.table('Nonpareil.samples.txt', sep='\t', header=TRUE)
attach(samples)
non <- Nonpareil.curve.batch(File, libnames=Name, skip.model=TRUE, plot=TRUE)
(plot attached)

Nonpareil.legend(non, 'topright')
(plot attached)

Thanks!

Read length fatal error

Hello! I'm trying to use the software to estimate the coverage of paired-end metagenomes generated on the NovaSeq platform and many of the samples give me the same error.

Command (in folder containing files):

for f in *_L002_R1_001.fastq; do nonpareil -s $f -T kmer -f fastq -b /nonpareil_out/${f%.*}; done

Error:

Fatal error: Reads are required to have a minimum length of kmer size

My reads are approximately 250 bp long (verified during my quality-checking and preprocessing steps), so I'm not sure why I'm getting this error. I'm only using one of the pairs and am using default parameters. Oddly, some files that seem to have basically the same properties finish without issue. Any ideas why this might be happening? I've used this tool effectively with other samples and it's been very helpful, so I'm hoping to use it with these samples as well.

Thanks so much in advance!
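One possible culprit is a handful of reads shorter than the k-mer size left over after trimming, even when the average length is ~250 bp. A minimal pre-filter sketch in awk, assuming 4-line FASTQ records and a hypothetical minimum length of 24 (set it to whatever kmer size your nonpareil build actually uses):

```shell
# Drop FASTQ records whose sequence is shorter than MIN (hypothetical
# threshold; use your build's kmer size). Assumes 4-line FASTQ records.
MIN=24
awk -v min="$MIN" 'NR % 4 == 1 { h = $0 }
                   NR % 4 == 2 { s = $0 }
                   NR % 4 == 3 { p = $0 }
                   NR % 4 == 0 { if (length(s) >= min) print h "\n" s "\n" p "\n" $0 }' \
    input.fastq > filtered.fastq
```

Counting how many records the filter removes per sample would also confirm (or rule out) this explanation for why only some files fail.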

Not reproducible results

Hello,

I have a question regarding the reproducibility of the results: I ran nonpareil on the same input using the same command line and got slightly different results for both runs.
Is that something to be expected? Do you know what the source of this randomness is and whether the analysis could be made deterministic in the future?

Used version: nonpareil=3.3.3=r341h470a237_0 installed via conda

Thank you in advance!

Best,
Valentina

Throat metagenomes

Hello,

I have 150 bp paired-end reads from throat metagenomes. Human reads have been removed and I have quality-trimmed the data. Depth is highly variable because the human percentage varies.

I ran Nonpareil with overlap 100 and the default -S (0.95). Approximately half of the samples give the following warning and then a fatal error:

 [     35.6]  Evaluating consistency
 [     35.6]  WARNING: The curve reached near-saturation, hence coverage estimations could be unreliable
 [     35.6]  The overlap (-L) is currently set to the maximum, meaning that the actual coverage is probably above 100X
 [     35.6]  You could increase -S but values other than 0.95 are untested
Fatal error:
Sequencing depth above detection limit.

I don't fully understand what it means for depth to be "above" the detection limit, or the sense (and risk) of increasing -S. I am also confused that average coverage is reported as a percentage in the Nonpareil curves, but in this error it's reported as a fold value.

I am also unsure whether I should treat this as just a warning, or, since it is a fatal error, totally disregard the output.

Thanks,

Andrew

Conda Installation Breaks R (every time)

Hi There,

Installing nonpareil into a fresh conda environment with R-base produces the following error:
"/home/roli/anaconda3/envs/nonpareil/lib/R/bin/exec/R: error while loading shared libraries: libiconv.so.2: cannot open shared object file: No such file or directory"

Please advise.
(this error was replicated in my base conda environment: i.e. user beware)

number of initial reads

This is more of a question than an issue, but:
is nonpareil biased by the number of initial reads we give it? Should we increase -X if we have datasets with a great number of reads?

And also, just to be sure: this tool works with shotgun, not amplicon, metagenomes, right?

Thank you very much in advance,

Input from stdin

Is it possible to read the input file from standard input?
For example, all my data is compressed, and it would be more convenient to just pipe gunzip output to stdout and then use it as stdin for nonpareil.
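If the current binary only accepts a regular file (it may read the input more than once, so a plain pipe would be risky), a temporary decompressed copy is a workable stopgap. A sketch, with hypothetical file names:

```shell
# Hypothetical names: decompress to a temp file, run, then clean up.
# A plain pipe may not work if nonpareil re-reads or seeks in the input.
tmp=$(mktemp)
gunzip -c reads.fastq.gz > "$tmp"
nonpareil -s "$tmp" -T kmer -f fastq -b reads_np
rm -f "$tmp"
```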

Question on the use of paired end reads

You mention in the documentation that when paired-end reads are available, only one set of the pair should be used. Could you expand on the reason for this? Will running it with both pairs generate wrong results, or will it just decrease performance?
thanks

Erick

Failure to generate nonpareil curves

I have problems making the nonpareil curves in R.
I ran MPI nonpareil as usual, except with the -L 25 option because I expect poor coverage. I also know from pyrotag data that I have closely related species in my samples.
There were no errors running nonpareil and I got the "everything seems correct" result.
However when I try to get the curves I get the following warning messages:

Warning messages:
...Convergence failure: false convergence (8)
...Model didn't converge

My questions are:
- Is my coverage result still valid?
- Do I need to rerun nonpareil with different parameters? Or does the modelling of the curves only affect the estimate of the sequencing effort needed?

thanks,
Erick

error in Nonpareil.curve

Dear author
I have a problem with Nonpareil.curve(). When I run Nonpareil.curve("134.npo"), it shows the error "Error in if (x$y.p50[twenty.pc] == 0)".
The file "134.npo" is as follows; I cannot understand what is wrong with it.

# @impl: Nonpareil
# @ksize: 842085680
# @version: 3.20
# @maxL: 292
# @L: 244.584
# @R: 27811543
# @overlap: 50.00
# @divide: 0.70
0	0.00000	0.00000	0.00000	0.00000	0.00000
1	0.00000	0.00000	0.00000	0.00000	0.00000
2	0.00000	0.00000	0.00000	0.00000	0.00000
3	0.00000	0.00000	0.00000	0.00000	0.00000
4	0.00000	0.00000	0.00000	0.00000	0.00000
6	0.00000	0.00000	0.00000	0.00000	0.00000
9	0.00000	0.00000	0.00000	0.00000	0.00000
12	0.00000	0.00000	0.00000	0.00000	0.00000
18	0.00000	0.00000	0.00000	0.00000	0.00000
25	0.00000	0.00000	0.00000	0.00000	0.00000
36	0.00000	0.00000	0.00000	0.00000	0.00000
52	0.00000	0.00000	0.00000	0.00000	0.00000
74	0.00000	0.00000	0.00000	0.00000	0.00000
105	0.00000	0.00000	0.00000	0.00000	0.00000
151	0.00000	0.00000	0.00000	0.00000	0.00000
215	0.00000	0.00000	0.00000	0.00000	0.00000
307	0.00000	0.00000	0.00000	0.00000	0.00000
439	0.00000	0.00000	0.00000	0.00000	0.00000
627	0.00000	0.00000	0.00000	0.00000	0.00000
896	0.00000	0.00000	0.00000	0.00000	0.00000
1279	0.00000	0.00000	0.00000	0.00000	0.00000
1828	0.00000	0.00000	0.00000	0.00000	0.00000
2611	0.00000	0.00000	0.00000	0.00000	0.00000
3730	0.00000	0.00000	0.00000	0.00000	0.00000
5328	0.00000	0.00000	0.00000	0.00000	0.00000
7612	0.00000	0.00000	0.00000	0.00000	0.00000
10874	0.00000	0.00000	0.00000	0.00000	0.00000
15534	0.00000	0.00000	0.00000	0.00000	0.00000
22191	0.00000	0.00000	0.00000	0.00000	0.00000
31702	0.00049	0.01562	0.00000	0.00000	0.00000
45289	0.00179	0.03641	0.00000	0.00000	0.00000
64698	0.00049	0.01562	0.00000	0.00000	0.00000
92426	0.00068	0.01320	0.00000	0.00000	0.00000
132037	0.00371	0.02864	0.00000	0.00000	0.00000
188624	0.00358	0.02764	0.00000	0.00000	0.00000
269463	0.00342	0.01987	0.00000	0.00000	0.00000
384948	0.00835	0.02641	0.00000	0.00000	0.00000
549925	0.00910	0.02306	0.00000	0.00000	0.00000
785607	0.01305	0.02149	0.00000	0.00000	0.03030
1122296	0.01927	0.02201	0.00000	0.02083	0.02941
1603280	0.02646	0.02128	0.01538	0.02000	0.03774
2290400	0.03576	0.02038	0.02273	0.03297	0.04878
3272000	0.04735	0.01983	0.03279	0.04651	0.06087
4674286	0.06177	0.01718	0.04938	0.06135	0.07317
6677551	0.08123	0.01694	0.06944	0.08036	0.09266
9539359	0.10346	0.01524	0.09270	0.10826	0.11401
13627656	0.12778	0.01304	0.11895	0.12815	0.13608
19468080	0.15561	0.01099	0.14820	0.15549	0.16228
27811543	0.18379	0.00675	0.17907	0.18109	0.18813

make nonpareil-mpi failed: enveomics/SeqReader.o: No such file or directory

Hi,

Came across this running make nonpareil-mpi:

cd enveomics/ && make nonpareil-mpi
make[1]: Entering directory '/tmp/guix-build-nonpareil-2.4.1-1.72ce3c9e.drv-0/source/enveomics'
mpic++ -DENVEOMICS_MULTI_NODE universal.cpp -c -Wall -std=c++11
mpic++ -DENVEOMICS_MULTI_NODE multinode.cpp -c -Wall -std=c++11
mpic++ -DENVEOMICS_MULTI_NODE sequence.cpp -c -Wall -std=c++11
mpic++ -DENVEOMICS_MULTI_NODE nonpareil_mating.cpp -c -Wall -std=c++11
mpic++ -DENVEOMICS_MULTI_NODE nonpareil_sampling.cpp -c -Wall -std=c++11
make[1]: Leaving directory '/tmp/guix-build-nonpareil-2.4.1-1.72ce3c9e.drv-0/source/enveomics'
mpic++ nonpareil.cpp enveomics/universal.o enveomics/multinode.o enveomics/sequence.o enveomics/SeqReader.o  enveomics/KmerCounter.o enveomics/References.o enveomics/Hash.o enveomics/nonpareil_mating.o enveomics/nonpareil_sampling.o -lpthread -DENVEOMICS_MULTI_NODE -Wall -std=c++11 -o nonpareil-mpi
g++: error: enveomics/SeqReader.o: No such file or directory
g++: error: enveomics/KmerCounter.o: No such file or directory
g++: error: enveomics/References.o: No such file or directory
g++: error: enveomics/Hash.o: No such file or directory
make: *** [Makefile:30: nonpareil-mpi] Error 1

I believe this is caused by not all the objects being compiled in enveomics/Makefile. I was able to workaround this using something similar to make nonpareil; make nonpareil-mpi.

Can I also ask: does it make sense to use the git version or the release, given that the newest release is from 2015?
ta
ben

Average read length

Hello Miguel,
I've been using Nonpareil to calculate the average coverage of metagenome samples. However, for a sample with 159,997,927 sequences and a sequencing effort of 14,087,765,504 bp, the estimated average read length is ~7 bp, but it should be ~88 bp. I can provide more details if you need them.
Thanks,
Coto

Paired end reads (again!)

Can I ask again about paired-end reads? I have followed the advice to use only one mate of each pair. However, it leaves me with a challenge of interpretation. Say I determine that 1e10 bases gives 95% coverage, and this is what I wish to achieve; do I decide:

  • I need 2e10 bases to achieve this coverage (seems unlikely, since the second mate pair not analysed by nonpareil will contribute more coverage)
  • I need 1e10 bases (but second mate pair perhaps won't double coverage?)
  • Something in between

You warned in #8 that if paired-end reads are analysed together, the coverage may be underestimated. Having tested this (with the kmer approach), that does not seem to be the case. Perhaps it's because of the kmer approach:

(plot attached)

Is there a way to consider even non-overlapping paired reads as a single unit (e.g. concatenate with NNNN between), and would this be desirable?

Thanks for your help,

Andrew
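The concatenation idea can be tried without touching nonpareil itself. A sketch, assuming two single-line FASTA files with mates in the same order (hypothetical names reads_1.fa and reads_2.fa); whether the joined records behave sensibly under the kmer model is exactly the open question above:

```shell
# Join mate pairs with NNNN between them. Assumes single-line FASTA
# (header, sequence, header, sequence, ...) and identical read order
# in the two files (hypothetical names).
paste reads_1.fa reads_2.fa \
  | awk -F'\t' 'NR % 2 == 1 { print $1 }            # keep the R1 header
                NR % 2 == 0 { print $1 "NNNN" $2 }' > joined.fa
```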

Systematic bias at low coverage (under 20%)

(For others who come across this: this is an issue with an edge use case of Nonpareil; I’m otherwise very happy with the program and trust it for higher coverage samples).

I designed an experiment to see how the output of Nonpareil changes when a FASTQ is repeatedly halved in size. Above a redundancy value of 20%, the subsampled FASTQ files follow the Nonpareil curve of the larger FASTQ file. Under 20%, however, the data show a systematic bias towards low redundancy (an example within the affected range is shown in the plot below). The bias affects estimates of diversity and of how much additional sequencing effort is needed. I suspect that the issue may be, using the language of the original paper, in the assumptions about how the total number of reads affects the probability of observing matches between reads. At a low total number of reads, it becomes less and less likely to find matches between reads; is the binomial distribution still appropriate in such a context?

(plot attached)

How to set -X and -n

Hi,
I was wondering whether there is any need to deviate from the default values for -X and -n. I saw that -X should be at most one tenth of the complete dataset. All my datasets have at least 10 million reads; would it then be right to set -X to 1 million reads? What would be the advantage for the calculations? I noticed that the job takes quite a bit longer to run.

The same goes for -n, which defaults to 1024. Can you think of an argument for increasing that number, say to 2048?

Obtaining diversity estimate

Thanks for the great documentation and software design. Installation and running were super easy.

One question: how do I obtain the Nonpareil sequence diversity estimate? I may have missed it, but I couldn't find this info in the documentation. I assume I get it from the 'diversity' slot of the Nonpareil object after running the Nonpareil.curve function. Is that correct?

Error in file(con, "r") : invalid 'description' argument

Hello!

Thank you for your work with Nonpareil -- it's an incredibly useful tool! Unfortunately, I am running into some issues when attempting to run 'Nonpareil.set' and 'Nonpareil.curve.batch'. I didn't see this error popping up in any other issue posts, and I don't think the problem is necessarily related to Nonpareil itself but I was hoping to get some insight into what I may be doing wrong.

I have ~270 samples which I successfully ran through Nonpareil on a Linux system. Each of the output files contains a sample-specific identifier such as 'ER####.nonpareil.npo'. I used Python to compile the paths to each of these files in a dataframe with randomly generated colors associated in a separate column (similar to your suggestion on Read the Docs). So, I have a "File" column which contains the full path to all NPO files, an "ER_ID" column with the sample-specific identifier, and a "color" column (see attached).

ERIN_Nonpareil_npo_output_files.csv

However, when attempting to run Nonpareil.curve.batch() or Nonpareil.set(), I continue to be met with this error:
Error in file(con, "r") : invalid 'description' argument

The full traceback is:

8: file(con, "r")
7: readLines(x$file)
6: grep("^# @", readLines(x$file), value = TRUE)
5: gsub("^# @", "", grep("^# @", readLines(x$file), value = TRUE))
4: Nonpareil.read_metadata(np)
3: Nonpareil.curve(plot = FALSE, file = NA_character_, col = NA_character_, 
       label = "")
2: do.call("Nonpareil.curve", nonpareil.opts)
1: Nonpareil.set(files = coverage$File, col = coverage$color, labels = "", 
       plot = TRUE, plot.opts = list(plot.observed = FALSE))

I tried digging around Google, and most people say this is typically an issue when your Excel file is open while importing data; however, my CSV files are closed whenever I run my R code and this still occurs. Do you know what I may be missing? Did I improperly format my CSV containing the file paths and names? I'm sorry if this turns out to be a silly fix; I think I just need another set of eyes on it.

Thank you for your time!
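That traceback shows Nonpareil.curve receiving file = NA_character_, which usually means the File column did not survive the CSV import (column-name mismatch, stray header, or an empty row). Before debugging in R, it can help to confirm every path in the CSV actually exists on disk; a sketch, assuming a comma-separated file whose first column is the path and whose first line is a header (hypothetical name files.csv):

```shell
# Hypothetical files.csv: header line, then the .npo path in column 1.
# Print any listed path that does not exist on disk.
tail -n +2 files.csv | while IFS=, read -r path _; do
  [ -e "$path" ] || echo "missing: $path"
done
```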

Diversity index formula

In the latest version a new diversity index is calculated. What formula does it use, and how should the index values be interpreted?

The most recent R package version (3.3) lost all the explanations of the values; maybe that information should be put back there.

meaning of warning / use of -L in kmer mode

Hey & thanks Devs for the packages.

I have a persistent warning that my sequences are approaching saturation:

WARNING: The curve reached near-saturation, hence coverage estimations could be unreliable 
To avoid saturation increase the -L parameter, currently set at ...lots%

Is the overlap param, -L, used in kmer comparisons also (i.e. in -T kmer)?

And if so, is the unreliability only found at the extremes (near the asymptotic point of saturation), or could the values be unreliable across all values generated (and especially the kmer diversity, see #38)? I've increased -L to 90% but still receive the same warning (though these are particularly thorough sampling efforts: 15 GB of uncompressed FASTQ from a human gut faecal microbiome, so I'm not surprised).

Core dumped issue

First, thanks for the nice program you provide to the scientific community.
I hit a bug while running Nonpareil: ./nonpareil -s 20_1M.fastq -f fastq -t 12 -R 20000 -b test
Nonpareil v2.2
[ 0.0] Counting sequences
[ 0.0] The file 20_1M.fastq.enve-seq.18514 was just created
[ 0.0] Longest sequence has 150 characters
[ 0.0] Average read length is 150.000000 bp
[ 0.0] Reading file with 250000 sequences
[ 0.0] Sequences to store in 19997.909180Mb free: 137054208 (54821.683200%)
[ 0.0] Querying library with 0.004000 times the total size (1000 seqs)
[ 0.0] Building query set at 20_1M.fastq.enve-seq.18514.subsample.18514
[ 0.0] Query set built with 1004 sequences
[ 0.0] Designing the blocks scheme for 250000 sequences
[ 0.0] Qry blocks:1, seqs/block:1004
[ 0.0] Sbj blocks:1, seqs/block:250000
[ 0.0] Mating sequences in 1 by 1 blocks
[ 0.0] Allocating ~0 Mib in RAM for block qry:1
[ 0.0] Allocating ~35 Mib in RAM for block sbj:1
[ 0.0] Computing block 1/1
[ 0.0] Launching parallel comparisons to 12 threads
[ 110.8] Thread 0 completed 84 comparisons, joining results
[ 111.4] Thread 1 completed 84 comparisons, joining results
[ 111.5] Thread 4 completed 84 comparisons, joining results
[ 112.6] Thread 3 completed 84 comparisons, joining results
[ 113.2] Thread 2 completed 84 comparisons, joining results
[ 113.7] Thread 5 completed 84 comparisons, joining results
[ 117.7] Thread 11 completed 80 comparisons, joining results
[ 119.4] Thread 8 completed 84 comparisons, joining results
[ 119.6] Thread 6 completed 84 comparisons, joining results
[ 119.7] Thread 10 completed 84 comparisons, joining results
[ 119.7] Thread 7 completed 84 comparisons, joining results
[ 119.8] Thread 9 completed 84 comparisons, joining results
[ 119.8] Sub-sampling library
Segmentation fault (core dumped)0%

Any ideas of what could be wrong ?

All the best from Montréal,

Y.

-X not properly working

When the corresponding -X is too small, it collapses to zero and doesn't create a query set.

Infinite loop when subsampling more reads than available with -T kmer

When using -T kmer and the number of requested query reads (-X) is greater than the number of reads in the entire set, Nonpareil stays in an eternal loop (e.g., when using -s test/test.fasta). It should instead fail with an informative error or (even better) adjust -X to the total number of reads in the set.

Using np to target subsampling

Hello,

This is a question rather than an issue, apologies if this is a bad place to post.

Thanks for such a useful tool. I'd be interested in your views on using Nonpareil curves to guide subsampling.

Let's say we want to coassemble a large number of samples, and the size exceeds what can be done within resource limitations. One solution is to randomly subsample a proportion of the reads from each sample. At the expense of less information, and less depth for minority organisms, this may allow the assembly to proceed.

However, it may be that some samples are more informative for assembly than others: e.g. it may be better to more aggressively subsample a high-coverage and low diversity sample.

I am considering trying an approach whereby, for each sample, I estimate the total bp required to achieve (say) 0.95 coverage and express it as a proportion of the bp actually sequenced. Samples with proportion >= 1 are not subsampled, and those with proportion < 1 are subsampled to the required proportion of reads.
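In code, the per-sample decision described above could look like this (a Python sketch; `required_bp` would come from the Nonpareil model's estimate for the target coverage, and all names are illustrative):

```python
def subsample_fraction(required_bp, actual_bp):
    """Fraction of a sample's reads to keep so it just reaches the target coverage.

    required_bp: sequencing effort Nonpareil estimates for the target
                 coverage (e.g., 0.95) in this sample.
    actual_bp:   sequencing effort actually obtained for this sample.
    """
    if actual_bp <= 0:
        raise ValueError("actual_bp must be positive")
    return min(1.0, required_bp / actual_bp)
```

A high-coverage sample with 10 Gbp sequenced that only needs 2 Gbp would keep 20% of its reads, while an under-sequenced sample (proportion >= 1) is kept whole.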

In my head, it feels like this might provide an efficient way of subsampling for assembly (although I appreciate it will not save as much memory, since it will preserve more unique kmers). But it would be great to sense check the idea!

All thoughts welcome!

Thanks,

Andrew

make nonpareil-mpi

I am getting this problem when compiling.
The code produces the following errors when compiling with mpic++:
make mpicpp=/mnt/ilustre/users/chengchen.ye/software/openmpi/bin/mpic++ nonpareil-mpi
cd enveomics/ && make nonpareil-mpi
make[1]: Entering directory `/mnt/ilustre/users/chengchen.ye/project/Nonpareil/lmrodriguezr-nonpareil-92e9a25/enveomics'
/mnt/ilustre/users/chengchen.ye/software/openmpi/bin/mpic++ -DENVEOMICS_MULTI_NODE universal.cpp -c -Wall
/mnt/ilustre/users/chengchen.ye/software/openmpi/bin/mpic++ -DENVEOMICS_MULTI_NODE multinode.cpp -c -Wall
multinode.cpp: In function ‘void init_multinode(int&, char**&, int&, int&)’:
multinode.cpp:16:4: error: ‘MPI’ has not been declared
MPI::Init(argc, argv);
^
multinode.cpp:17:16: error: ‘MPI’ has not been declared
processes = MPI::COMM_WORLD.Get_size();
^
multinode.cpp:18:16: error: ‘MPI’ has not been declared
processID = MPI::COMM_WORLD.Get_rank();
^
multinode.cpp: In function ‘void finalize_multinode()’:
multinode.cpp:21:4: error: ‘MPI’ has not been declared
MPI::Finalize();
^
multinode.cpp: In function ‘void barrier_multinode()’:
multinode.cpp:25:4: error: ‘MPI’ has not been declared

How can I save the plot?

Hi,

How can I save the Nonpareil curves to files? I run it in an SSH client, without X Windows.

Thanks

Can't process gzipped fastq

Hi, I'm just getting started with Nonpareil, thanks for your work.

I'm unable to process my gzipped fastq. If I first uncompress the file, it processes as expected. The error:

$ nonpareil -s ETNP_120m_R2.name.fastq.gz -t 4 -T kmer -f fastq -b ETNP_120m_R2.nonpareil.k
Nonpareil v3.301
Fatal error:
The file provided does not have the proper fastq format
 [      0.0] Fatal error: The file provided does not have the proper fastq format
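Until gzip input is supported natively, decompressing first works (e.g., `gunzip -k ETNP_120m_R2.name.fastq.gz`). For what it's worth, gzip input can be detected from its two magic bytes; a Python sketch of the idea (the function name is illustrative, not Nonpareil's API):

```python
import gzip

def open_maybe_gzip(path):
    """Open a FASTQ file for reading, transparently handling gzip compression."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")
```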

Re-add factor to curve?

Hi,

earlier releases of the package (I checked v2.3) supported a 'factor' argument for the curve that allowed one to easily switch from bp to Mbp/Gbp/etc., but the documentation also mentioned that this could possibly influence the fit of the model. Would it be possible to re-introduce something similar, maybe not for the curve itself but only as an additional argument to the plot function (so the model fit wouldn't change)?
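The plot-time scaling being asked for could be as simple as the following sketch (illustrative names, not the package's actual API), which only rescales values for display and leaves the fitted model untouched:

```python
def scale_effort(bp):
    """Pick a display unit for sequencing effort without changing fitted values."""
    for factor, unit in ((1e9, "Gbp"), (1e6, "Mbp"), (1e3, "kbp")):
        if bp >= factor:
            return bp / factor, unit
    return bp, "bp"
```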

Unable to find required package ‘Nonpareil’ in the R package

Dear Luis,

I have installed nonpareil using conda, and have successfully obtained the .npo files. Unfortunately, I am not able to load the nonpareil package library in R.

> library(Nonpareil)
Error in library(Nonpareil) : there is no package called ‘Nonpareil’

I have also tried to install nonpareil using the "make" command, but to no avail.

I will appreciate any help that can be rendered.

Thank you!

"Picking N random sequences" forever

I'm running Nonpareil v3.303, and for one particular sample in a publicly available metagenome (accession SRS015133), nonpareil runs Picking N random sequences forever (I've let it run for many hours). The other samples in that bioproject work just fine.

My command:

nonpareil -T kmer   -R 3000    -t 8   -f fastq   -s $INPUT_FASTQ  -b $BASENAME

Even if I use -X 100, the command runs forever. If I append reads from another sample, nonpareil runs successfully. It appears that nonpareil runs a while loop to randomly select sequences, and breaks the loop once enough "good" sequences are found. However, in this case nonpareil is never able to find enough "good" sequences in sample SRS015133. Reducing the kmer size (-k) to 14 allows nonpareil to finish successfully, but raising it any higher causes an infinite loop. Why is nonpareil having a problem with a kmer length of 24? I don't see any cutoffs for sequence quality.
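The hang is consistent with an unbounded rejection-sampling loop; a bounded version fails with a diagnostic instead of hanging. A Python sketch of the idea (not Nonpareil's actual code; names are illustrative):

```python
import random

def pick_good_sequences(reads, n_wanted, is_good, max_attempts=1_000_000):
    """Randomly pick n_wanted reads that pass is_good, aborting instead of hanging."""
    picked = []
    attempts = 0
    while len(picked) < n_wanted:
        if attempts >= max_attempts:
            raise RuntimeError(
                f"only {len(picked)}/{n_wanted} suitable reads found after "
                f"{max_attempts} attempts; check read quality and length against -k"
            )
        attempts += 1
        read = random.choice(reads)
        if is_good(read):
            picked.append(read)
    return picked
```

With a cap like this, a sample where no read ever qualifies produces an informative error instead of an endless "Picking N random sequences".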

Here's the first couple of reads from the fastq (No. of reads = 1 mil):

@SRR061182.9 HWUSI-EAS677_102343534:3:1:1031:5135 length=100
CATGAATTGAGAGCNCTGGTCAGTGAATCGCTTCAAGATTTGGAAATGCACTTGCAGAAAGAGGAAAACGTATTGTTCCCTTATTTGTACGATTTGTATG
+
GGGGEGEBGDCCCC!AACA?E=EEEDEDD=EDEBD?EBEEAEDED?ED?EDBEEDAE?DAEEE=EE=E?E@CC,E@DA=CA->C=AA3565==<7:BC-@
@SRR061182.10 HWUSI-EAS677_102343534:3:1:1031:3309 length=100
CTGATTGCTAGTCTNTGCAGCCATTAGGACAGTTGATGAATAATCGTTATTCAACTTATGTTTAGGAGAATATACTGCACCAACCGTCAAAGAACTC
+
DFF:BFFF?FCCCC!6B@@@EAAEBEE:EE?AA:D?=5ADAED;EE?E5EE?A?AEE-AEEEEEE->AA,C=5C::??C@BD-?D57),6<>B4B:@
@SRR061182.11 HWUSI-EAS677_102343534:3:1:1031:3041 length=100
GGTGATTATCAAGGNAGAGAGTGGGAAAGGGAAGAGCAGTCTGCTGAACTTGATATCCCGTTTTTGCGGCACGGATGGCCAGGCGGTCAGCATG
+
EDEDE?E?EB?BAC!;@;C7@,@@;4:?AADBCDCCBB?DB5AB:C?@=B=DABBBC5:DAAB:9B=?,A>A:>55><<:??5::@=???@???

Nonpareil curve output file

It looks like the Nonpareil output file reports sequencing effort (bp) on the x axis, and that number is not accurate. The x axis should be sequencing effort in number of reads; that number should then be multiplied by the average read length to obtain the sequencing effort in bp.
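If that reading is correct, the conversion the report describes is straightforward (a sketch; the function name is illustrative):

```python
def effort_bp(n_reads, avg_read_length):
    """Sequencing effort in bp: number of reads times the average read length."""
    return n_reads * avg_read_length
```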

Segmentation fault (core dumped)

Hello,

I saw there were a couple of closed issues reporting something similar, but it seems like they were all corrected by updating. I am using Nonpareil v3.2, just installed today via conda.

I executed this command:
nonpareil -s ~/pathto/seqs.fastq -T kmer -f fastq -t 2

Which produced this output:
Nonpareil v3.2
[ 0.0] reading /home/imss/Vault/Community/NunavutWWTPMetagen/Reads/18SEPin4_R1.fastq
[ 0.0] Picking 10000 random sequences
[ 0.0] Started counting
Segmentation fault (core dumped)

Installation fails with R

Dear authors/developers,

I am attempting to install nonpareil on my cluster environment. Therefore, I do not have any root access.

However, I keep failing on the R package getting a "Permission denied" error, despite the R package install path belonging to me.

$ make prefix=/mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil R=/mnt/nfs/projects/ecosystem_biology/local_tools/R-ESB install
if [ ! -d /mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil/bin ] ; then mkdir -p /mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil/bin ; fi
if [ ! -d /mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil/man/man1 ] ; then mkdir -p /mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil/man/man1 ; fi
if [ -e nonpareil ] ; then install -m 0755 nonpareil /mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil/bin/ ; fi
if [ -e nonpareil-mpi ] ; then install -m 0755 nonpareil-mpi /mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil/bin/ ; fi
cp docs/_build/man/nonpareil.1 /mnt/nfs/projects/ecosystem_biology/local_tools/nonpareil/man/man1/nonpareil.1
/mnt/nfs/projects/ecosystem_biology/local_tools/R-ESB CMD INSTALL utils/Nonpareil
make: execvp: /mnt/nfs/projects/ecosystem_biology/local_tools/R-ESB: Permission denied
Makefile:44: recipe for target 'install' failed
make: *** [install] Error 127

I attempted the same by giving my $HOME/R as the path with the same outcome.

To further check, I also tried installing the packages outside make install like so:

$ R CMD INSTALL -l /mnt/nfs/projects/ecosystem_biology/local_tools/R-ESB utils/Nonpareil
* installing *source* package ‘Nonpareil’ ...
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (Nonpareil)

Despite the package already being installed, make install still attempts to install the R package.

Any idea why this is happening and if there is a fix for it?

I look forward to your reply.

Best regards,
Shaman
