Comments (21)

FaezeK commented on July 19, 2024

I upgraded Centrifuge to version 1.0.4, and it now reports 26,641,924 unclassified reads, which looks like the right number. But does it make sense to have more than 80% unclassified reads? My data is from horse feces (horse metagenomics) and I used the nt database for classification, so I expected far fewer unclassified reads.

from centrifuge.

bastian-wur commented on July 19, 2024

Bumping this.
I have samples where 40% of the reads are not analyzed... or whatever is (not) happening to them.
So I look at the kreport file, which says I have something like 0% unclassified, when in fact 40% of the reads were not analyzed.
This wouldn't be an issue if it were clear in the kreport, but as it stands it's highly misleading.

khyox commented on July 19, 2024

Do you mean that 40% of the reads are not classified but are also not listed as unclassified in the output?
I've never had 0% unclassified, even in the "better" samples and with the minimum hit length parameter set to 20. Is this some special sample?

bastian-wur commented on July 19, 2024

Okay, I guess I need to clarify ^^.

The kreport begins like this:

  0.00	0	0	U	0	unclassified
100.00	6017277	0	-	1	root
 90.48	5444425	0	-	131567	  cellular organisms
 86.06	5178498	181	D	2	    Bacteria
 42.25	2542251	0	-	1783270	      FCB group

So neat, right? 6 million reads, all assigned at least to root, so 100% of the reads are assigned to root.
The problem is that the sample has 14 million reads.
That should be reflected in the report, because as it stands it's pretty misleading.

EDIT: This is supposedly human feces, so the reference db should pretty much cover everything.
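
A rough way to check for this discrepancy is to compare the reads a kreport accounts for with the number of reads in the input. The sketch below assumes the tab-separated layout shown above (percentage, clade reads, taxon reads, rank, taxID, name); the function name `kreport_accounted_reads` is hypothetical, and the 14,000,000 total is only the approximate figure quoted in the comment.

```python
def kreport_accounted_reads(lines):
    """Sum the clade read counts of the 'unclassified' and 'root' rows."""
    accounted = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        name = fields[5].strip()
        if name in ("unclassified", "root"):
            accounted += int(fields[1])
    return accounted


# First rows of the kreport quoted above.
report = [
    "  0.00\t0\t0\tU\t0\tunclassified",
    "100.00\t6017277\t0\t-\t1\troot",
    " 90.48\t5444425\t0\t-\t131567\t  cellular organisms",
]

accounted = kreport_accounted_reads(report)
total_input = 14_000_000  # approximate input size quoted in the comment
print(f"accounted for: {accounted}, not in report: {total_input - accounted}")
```

If the second number is far from zero, the percentages in the kreport are relative to the reads that reached the report, not to the input.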

khyox commented on July 19, 2024

Thanks! I see... It looks like the rest of the reads really are unclassified and kreport is doing something odd. Have you counted the actual unclassified reads in the Centrifuge output? I mean, rather than relying on kreport for such statistics, use the direct Centrifuge output.
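
One way to do such a count is sketched below. It assumes the usual tab-separated classification output with a header row starting with `readID`, and it assumes that unclassified reads, when a version writes them at all, carry `unclassified` in the seqID column or taxID 0. Because a read can appear on several lines (one per assignment), distinct read IDs are counted rather than lines. The function name and the example rows are hypothetical.

```python
def count_reads(lines):
    """Return (classified, unclassified) counts of distinct read IDs."""
    classified, unclassified = set(), set()
    for index, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if index == 0 and fields[0] == "readID":
            continue  # header row
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        read_id, seq_id, tax_id = fields[0], fields[1], fields[2]
        if seq_id == "unclassified" or tax_id == "0":
            unclassified.add(read_id)
        else:
            classified.add(read_id)
    return len(classified), len(unclassified)


# Made-up rows for illustration only.
example_rows = [
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches",
    "r1\tNC_1\t562\t100\t0\t50\t150\t2",
    "r1\tNC_2\t561\t100\t0\t50\t150\t2",
    "r2\tunclassified\t0\t0\t0\t0\t150\t1",
]
print(count_reads(example_rows))
```

Comparing the sum of both counts with the number of input reads (or read pairs) shows whether any reads are missing from the output entirely.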

ffinfo commented on July 19, 2024

That doesn't work either, because the unclassified reads are not in the Centrifuge output at all. I think this is also the main cause of the problem.

Edit: The only place to find this number is the metrics file, but that is an optional output file and not an input to centrifuge-kreport.

khyox commented on July 19, 2024

I don't think the issue is related to the number of reads. I have never seen this effect of missing unclassified reads, even when I "centrifuged" samples with nearly 60 M paired-end reads. In all cases, when I count the reads in the Centrifuge output, they are exactly the same as in the input files (not even one read of difference). The same goes for unpaired reads, although in those cases I processed far fewer reads (but much longer ones in bp). This seems to be an issue that is not easy to reproduce.
I am sorry I cannot help any more! :-/

GabrieleNocchi commented on July 19, 2024

I just ran into the same problem. Has this been sorted out yet?

mourisl commented on July 19, 2024

Support for outputting unclassified reads is implemented in the new release version, Centrifuge v1.0.3 (not beta now :) ).

FaezeK commented on July 19, 2024

I use the 1.0.3-beta version of Centrifuge and I am losing more than 80% of my reads in the Centrifuge output. I don't receive any error messages, and there are zero unclassified reads in the kreport results. My original fastq file has 32,915,706 reads, while the Centrifuge output has 9,925,342 lines (including the header), and the kreport shows 6,274,630 reads and zero unclassified reads.
I also tried running Centrifuge with one assignment/hit per read, and the output contains 6,273,783 lines (including the header). The kreport for this output contains all 6,273,783 reads (zero unclassified). However, I still don't know what happened to the other 26,641,923 reads!!
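
The arithmetic behind these figures can be written out as a small sketch. It assumes one output line per read (the one-assignment run) and exactly one header line; the function name `missing_reads` is hypothetical. Subtracting the header gives 26,641,924, one more than the figure above and equal to the unclassified count reported elsewhere in this thread for v1.0.4.

```python
def missing_reads(total_input, output_lines, has_header=True):
    """Return (missing, fraction) for reads absent from the output."""
    reads_in_output = output_lines - 1 if has_header else output_lines
    missing = total_input - reads_in_output
    return missing, missing / total_input


missing, fraction = missing_reads(32_915_706, 6_273_783)
print(f"{missing} reads, about {fraction:.2%}")  # 26641924 reads, about 80.94%
```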

mourisl commented on July 19, 2024

@FaezeK The output of unclassified reads was introduced in version 1.0.3, so the 1.0.3-beta version does not output unclassified reads.

ypsung commented on July 19, 2024

I have the same problem. My data are from nasopharyngeal swabs, and I used v1.0.4. Whether I used the original paired reads or the human-removed clean reads (with the p_compressed+h+v index), there were many "Unclassified" reads AND warnings of the form "Warning: skipping read/mate....because it was < 2 characters long (or length (0) <= # seed mismatches (0))", as follows:
[image]

I'd like to ask whether there is an appropriate interpretation of, and a solution to, these errors.
Many thanks

mourisl commented on July 19, 2024

@ypsung Thanks for providing the warning messages. From those warnings, it seems your file is incorrectly formatted, with the read sequence showing up in the read ID field. Could you please show me a few lines of your fastq file?

ypsung commented on July 19, 2024

@mourisl Thanks for the quick reply. My fastq file looks like this:
[image]
and I converted it into a .fa file that Centrifuge could read using:
cat test.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n' > test.fa

When I checked my .fa files, it seemed that the description line and the sequence were actually on the same line:
[image]

I'm wondering whether this was the cause of the error. What do you think?
Thanks again,
YP
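
For comparison, here is a minimal Python sketch of the same FASTQ-to-FASTA conversion that always emits the header and the sequence as separate lines. It assumes strict four-line FASTQ records; the function name `fastq_to_fasta` is hypothetical. It strips carriage returns as a precaution only, since the thread does not establish what went wrong with the shell pipeline.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines (header, then sequence) from 4-line FASTQ records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            header, sequence = record[0], record[1]
            if not header.startswith("@"):
                raise ValueError(f"expected '@' header, got: {header!r}")
            yield ">" + header[1:]
            yield sequence
            record = []  # the '+' and quality lines are dropped


# Made-up record for illustration only.
example_fastq = ["@read1 1:N:0\n", "ACGTN\n", "+\n", "AAAA#\n"]
for fasta_line in fastq_to_fasta(example_fastq):
    print(fasta_line)
```

To convert a file, the generator can be fed an open file handle and its output written line by line with a trailing newline.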

mourisl commented on July 19, 2024

Yes, this will definitely cause the error. The conversion command looks right, so I don't know which part of the conversion went wrong.

FaezeK commented on July 19, 2024

I didn't convert my fastq files to fasta, since fastq is the default input format for Centrifuge:
Centrifuge version 1.0.4 by the Centrifuge developer team ([email protected])
Usage:
centrifuge [options]* -x {-1 -2 | -U | --sample-sheet } [-S ] [--report-file ]

. . .

Input:
-q query input files are FASTQ .fq/.fastq (default)

And I didn't get any error or warning messages:

report file centrifuge_B20_summary.tsv
Number of iterations in EM algorithm: 699
Probability diff. (P - P_prev) in the last iteration: 9.98508e-11
Calculating abundance: 00:00:10

Yet 80.94% of my reads are unclassified. Just to make sure my fastq file has the correct format, here are the first few lines:
@NB501138:169:HWT2VBGX5:1:11101:14983:1085 1:N:0:GTTTCG
GGATGTATGAATATGCCTTCGTCTTAAATTCACCGAACGTATACGAGTTGATGACACCGCTTAATTCAAGCAGCATGTAGTTCGCACCGCGTTTTTCGTTAATGCGCATNGTATCCATAACAACTTATCCCCTTACTTACTGNTGCACNT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEEEEEEEEEAAAAA<EEAE<EEEEEEEEEEE/EEEE#EEEAEAE<AA6E/EEE<EA/<AE<EEEAE<EE#EEA<<#<
@NB501138:169:HWT2VBGX5:1:11101:22359:1086 1:N:0:GTTTCG
CACTGCTCCTCCTTTCCCTGGATTTGACATCTGATTCTTCTTTTTATCAGAACCTGTCTTTTTACTTTCATCCGGCTGTTTCTCCGGATCTGTTTCATAGTAATTTTCTNTTTCATAACGACGGATCGCCGGACGCAGGAATNTGACANT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEAE<EEEEEAA<EEAEAAEEEAEEEEEEAEEEEEE#EEE<EEEEEAE/<AEE/6<AAAAAAA<EEEEE#AEEAE#E

And this is the command I used for classification:
centrifuge --phred33 --threads 16 --un-conc missingReads/ -x /dbs/centrifuge/nt -1 R1_001.fastq -2 R2_001.fastq --report-file centrifuge_B20_summary.tsv -S centrifuge_B20

mourisl commented on July 19, 2024

@FaezeK Does this happen with your other read files?
You can also send me some of the unclassified reads in missingReads/*.fq and we can take a look.

FaezeK commented on July 19, 2024

Yes, it happened with all 16 of my samples. The percentage varies slightly from one sample to another, but they all have approximately 80% unclassified reads.

Here is an example of the forward reads in my missingReads:
@NB501138:169:HWT2VBGX5:1:11101:20594:1086
CCCTTGAACTTCATGTTAGATGAACCTACCCATCTCAAGTACATCGATACGAGTCTTGCTTTGCACGCCGCACTTGGAGAACGTCTTGTTCAGGAATACAGGACCTCGGNTAAGGCTCCCTTTGTGGGACCTGCGGATCCGCNGGACGNT
+
AAAAAAEEEEEEEE/EEEEAEEEEEEEEEEAEEEEEEEEEEEEEAEEEEAEE6EEEEEEEE/E6EEE<EEE6EE//EEEEEAEE<AAE<EEEEEEEEEEEAEEEEEEE<#EEEEA<E<A/EEE6/A/AAE/AA/AAE<66A<#//EEA#<
@NB501138:169:HWT2VBGX5:1:11101:9173:1087
TTTCTCGACCCATATCAAAAAGGAGTATGTTCCGTCAATCAGGAACACGGCCCACATCGCCAGGATCGCCAGCGACGTGTCGTCCGTTTGCCACCACATGAAGACCTCGATGGCGTCGACCGCGAGCCACATCACCCACTGCNCGACANA
+
AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEAEEE<E<EAEEEEEEEEEEEEEEEEEAEEEEEAA<AEAE<EEAEEAAEEAAEEE/EEEEEEEEEEEEEAEEEEEA/<A<<E<EE/A6E<AAAA#6</<<#E
@NB501138:169:HWT2VBGX5:1:11101:20489:1087
ATGCAGGCAATCTTCTCTGCAAAGCTGGACAGGCCCCTGTCAAGGGTATCGATCATGGCATCATCAGCACCTTTCTGCCTTGCGAAGACGCAGGCCGCACCACGCGGTCCGCAGAACGGAGAGTCCACATCGCAGGCAATCCNGAATTNA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEE/EAAAE<6AEAEEEAEE/EEEEAEEAEEAE/AEEEEE/EEEA<A/EAA<E<EEE/EAA#<AAE<#E
@NB501138:169:HWT2VBGX5:1:11101:9334:1089
ATCCGGAACTCATACCCGGTTCCTGCCGGAGATTGGTAGCATCCCAGGTGAAGATGATGTCGTATTTGTTCTCGATCAGGCCCTCGGCCAGATAGTCACTGGCGTGGCGGTAGCAGGTGATCAGAACATTGGGGAAAAGGTGNTGGAANT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<<<<AEEEEA<AAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EAEEE<EEEEEEEE#EA<EE#E
@NB501138:169:HWT2VBGX5:1:11101:22359:1086
CACTGCTCCTCCTTTCCCTGGATTTGACATCTGATTCTTCTTTTTATCAGAACCTGTCTTTTTACTTTCATCCGGCTGTTTCTCCGGATCTGTTTCATAGTAATTTTCTNTTTCATAACGACGGATCGCCGGACGCAGGAATNTGACANT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEAE<EEEEEAA<EEAEAAEEEAEEEEEEAEEEEEE#EEE<EEEEEAE/<AEE/6<AAAAAAA<EEEEE#AEEAE#E

And these are from the reverse missing reads:
@NB501138:169:HWT2VBGX5:1:11101:20594:1086
NCAGNCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNA
+
#AAA#EE##############################################################################################################################################A
@NB501138:169:HWT2VBGX5:1:11101:9173:1087
TGTCNTTNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACNNNGNTNNCGANNNTGNNNNNNNNNNNNNNNNNNNTG
+
AAAA#AE##E######################################################################################################EE###E#A##EEA###EE###################A<
@NB501138:169:HWT2VBGX5:1:11101:20489:1087
NTCCNCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNNNNNNNNNNNNN
+
#AAA#EE##########################################################################################################################<###################
@NB501138:169:HWT2VBGX5:1:11101:9334:1089
CATTNCCNNCNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGCTNCNNNCNNNNNNGNNNNNGTNTNGNATNTTGNCNATNNNNANNNNNNNNNNNNNNGC
+
AAAA#EE##E#E#############################################################################<AEEE#E###E######E#####AE#E#E#EE#AEE#<#EE####E##############AA
@NB501138:169:HWT2VBGX5:1:11101:22359:1086
NGGCNTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
#AAA#EE#############################################################################################################################################

mourisl commented on July 19, 2024

It seems that you have many reads with the undetermined nucleotide "N", which therefore could not be classified. How many higher-quality reads cannot be classified?
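
One way to approach that question is to measure how much of each read is "N" and count the reads below some cutoff. The sketch below assumes four-line FASTQ records; the function names and the 10% cutoff (`max_n_fraction=0.1`) are hypothetical choices for illustration, not values recommended by Centrifuge.

```python
def n_fraction(sequence):
    """Fraction of bases in the sequence that are 'N' (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def count_low_n_reads(fastq_lines, max_n_fraction=0.1):
    """Return (reads at or below the N cutoff, total reads)."""
    total = kept = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the sequence is the second line of each record
            total += 1
            if n_fraction(line.strip()) <= max_n_fraction:
                kept += 1
    return kept, total


# Made-up records for illustration only.
example_reads = [
    "@a", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@b", "NNNNNNNNAC", "+", "##########",
]
print(count_low_n_reads(example_reads))
```

Running this separately on the forward and reverse files in missingReads/ would show whether the N-rich reads are confined to one mate.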

FaezeK commented on July 19, 2024

My forward reads file has very good quality, and even when I ran Centrifuge with the --norc parameter (do not align reverse-complement version of read (off)), I got exactly the same proportion of unclassified reads (80.94%):
centrifuge --phred33 --threads 32 --norc --un-conc missingReads/ -x /centrifuge/nt -1 R1_001.fastq -2 R2_001.fastq --report-file B20_fwd_only_summary.tsv -S B20_fwd_only

FaezeK commented on July 19, 2024

The first 5 fastq reads that I provided in the previous comment are forward reads, and the last 5 are reverse reads, as mentioned in the comment.
