I did run some samples with centrifuge but I notice that I'm missing some reads in the

Number of unclassified reads are not correct (centrifuge, CLOSED)

daehwankimlab commented on July 19, 2024

Number of unclassified reads are not correct

from centrifuge.

Comments (21)

FaezeK commented on July 19, 2024 1

I upgraded Centrifuge to version 1.0.4, and it now reports 26,641,924 unclassified reads, which looks like the right number. But does it make sense to have more than 80% unclassified reads? My data is from horse feces (horse metagenomics) and I used the nt database for classification, so I expected far fewer unclassified reads.

bastian-wur commented on July 19, 2024

Bumping this.
I have samples where 40% of the reads are not analyzed, or whatever is (not) happening to them.
So I look at the kreport file, which says I have something like 0% unclassified, when in fact 40% of the reads were not analyzed.
This wouldn't be an issue if it were clear in the kreport, but as it stands it's highly misleading.

khyox commented on July 19, 2024

Do you mean that 40% of the reads are not classified but are also not listed as unclassified in the output?
I've never seen 0% unclassified, even in the "better" samples and with the minimum hit length parameter set to 20. Is this a special sample?

bastian-wur commented on July 19, 2024

Okay, guess I need to clarify it ^^.

The kreport begins like this:

  0.00	0	0	U	0	unclassified
100.00	6017277	0	-	1	root
 90.48	5444425	0	-	131567	  cellular organisms
 86.06	5178498	181	D	2	    Bacteria
 42.25	2542251	0	-	1783270	      FCB group

So neat, right? 6 million reads, all at least assigned to root, 100% of the reads assigned to root.
The problem is that the sample has 14 million reads.
That should be reflected in the report, because like this it's pretty misleading.

EDIT: This is supposedly human feces, so the reference db should pretty much cover everything.
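
A minimal Python sketch of what the complaint above asks for: re-express the kreport percentages against the true number of input reads, so that reads missing from the report show up as unclassified. It assumes the six tab-separated kreport columns shown above (percentage, clade reads, direct reads, rank, taxid, name). The function name `rescale_kreport` and the exact total of 14,000,000 are illustrative assumptions and not part of Centrifuge; the thread only says "14 million".

```python
def rescale_kreport(lines, total_input_reads):
    """Recompute kreport percentages relative to the real input read count.

    Reads that appear neither under 'unclassified' (taxid 0) nor under
    'root' (taxid 1) are added to the unclassified row.
    """
    rows = []
    for line in lines:
        _pct, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
        rows.append([int(clade), int(direct), rank, taxid, name])

    # Reads the report accounts for: unclassified row plus root row.
    reported = sum(r[0] for r in rows if r[3] in ("0", "1"))
    missing = total_input_reads - reported
    if missing < 0:
        raise ValueError("total_input_reads is smaller than the reported reads")

    out = []
    for clade, direct, rank, taxid, name in rows:
        if taxid == "0":  # fold the missing reads into 'unclassified'
            clade += missing
            direct += missing
        pct = 100.0 * clade / total_input_reads
        out.append(f"{pct:6.2f}\t{clade}\t{direct}\t{rank}\t{taxid}\t{name}")
    return out


# First two rows of the report quoted above; 14,000,000 is an assumed total.
report = [
    "  0.00\t0\t0\tU\t0\tunclassified",
    "100.00\t6017277\t0\t-\t1\troot",
]
for row in rescale_kreport(report, 14_000_000):
    print(row)
```

With that assumed total, the root row drops to about 43% and the unclassified row rises to about 57%, which is the picture the report should have given.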

khyox commented on July 19, 2024

Thanks! I see... It looks like the rest of the reads really are unclassified and kreport is doing something odd. Have you counted the real unclassified reads in the centrifuge output itself? I mean, relying on the direct centrifuge output for such statistics rather than on kreport.
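
One way to run that check, sketched in Python: count the records in the input FASTQ and the distinct read IDs in the Centrifuge classification output, then compare. This assumes the classification file starts with one header line and carries the read ID in its first tab-separated column, and that the FASTQ uses four lines per record. The helper names are hypothetical, and holding all IDs in a set can need a lot of memory for tens of millions of reads.

```python
import io


def count_fastq_reads(handle):
    """Count FASTQ records, assuming exactly 4 non-empty lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4:
        raise ValueError("FASTQ line count is not a multiple of 4")
    return n_lines // 4


def count_output_read_ids(handle):
    """Count distinct read IDs in a Centrifuge classification output.

    A read with several assignments occupies several lines, so lines are
    deduplicated by read ID rather than simply counted.
    """
    next(handle, None)  # skip the header line
    read_ids = set()
    for line in handle:
        if line.strip():
            read_ids.add(line.split("\t", 1)[0])
    return len(read_ids)


# Tiny in-memory example; real use would pass open file handles instead.
fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n")
output = io.StringIO("readID\tseqID\ttaxID\nr1\tseqA\t2\nr1\tseqB\t2\nr3\tseqC\t9\n")

n_in = count_fastq_reads(fastq)
n_out = count_output_read_ids(output)
print(f"input reads: {n_in}, reads in output: {n_out}, missing: {n_in - n_out}")
```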

ffinfo commented on July 19, 2024

This does not work either, because the unclassified reads are not in the centrifuge output at all; I think this is the main cause of the problem.

Edit: The only place to find this number is the metrics file, but that is an optional output file and not an input to centrifuge-kreport.

khyox commented on July 19, 2024

I don't think the issue is related to the number of reads. I have never seen this effect of missing unclassified reads, even when I "centrifuged" samples with nearly 60 million paired-end reads. In all cases, when I count the reads in the centrifuge output, the number is exactly the same as in the input files (not even one read of difference). The same holds for unpaired reads, although in those cases I processed far fewer reads (but much longer ones in bp). This seems to be an issue that is not easy to reproduce.
I am sorry I cannot help any more! :-/

GabrieleNocchi commented on July 19, 2024

I just ran into the same problem. Has this been sorted out yet?

mourisl commented on July 19, 2024

Support for outputting unclassified reads is implemented in the new release of Centrifuge, v1.0.3 (no longer beta :) ).

FaezeK commented on July 19, 2024

I use the 1.0.3-beta version of centrifuge and I am losing more than 80% of my reads in the centrifuge output. I don't receive any error messages, and there are zero unclassified reads in the kreport results. My original fastq file has 32,915,706 reads, while the centrifuge output has 9,925,342 lines (including the header), and the kreport shows 6,274,630 reads and zero unclassified reads.
I also tried running centrifuge with one assignment/hit per read, and the output contains 6,273,783 lines (including the header). The kreport for this output contains all 6,273,783 reads (zero unclassified). However, I still don't know what happened to the other 26,641,923 reads!
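
The arithmetic behind these numbers, as a short Python check. It assumes that the one-assignment-per-read output has exactly one line per read plus a single header line; the variable names are illustrative. Under that assumption the gap is 26,641,924 reads, one more than quoted above because the header line is not a read, and it matches both the unclassified count reported after upgrading to 1.0.4 and the 80.94% figure quoted elsewhere in this thread.

```python
# Figures quoted in the comment above.
input_reads = 32_915_706
output_lines_with_header = 6_273_783  # one assignment per read, plus header

# Assumption: exactly one header line, so subtract it to get reads.
reported_reads = output_lines_with_header - 1
missing_reads = input_reads - reported_reads
missing_pct = 100.0 * missing_reads / input_reads

print(f"reads in output:  {reported_reads:,}")
print(f"reads unaccounted: {missing_reads:,} ({missing_pct:.2f}%)")
```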

mourisl commented on July 19, 2024

@FaezeK Output of unclassified reads was introduced in version 1.0.3, so the 1.0.3-beta version does not output unclassified reads.

ypsung commented on July 19, 2024

I have the same problem. My data are from nasopharyngeal swabs and I used v1.0.4. Whether I tried the original paired reads or the human-removed clean reads (with the p_compressed+h+v index), there were many "Unclassified" reads AND warnings such as "Warning: skipping read/mate....because it was < 2 characters long (or length (0) <= # seed mismatches (0))", as follows:

I'd like to ask whether there is an appropriate interpretation of, and solution to, these errors.
Many thanks

mourisl commented on July 19, 2024

@ypsung Thanks for providing the warning messages. From those warnings, it seems your file is incorrectly formatted: the read sequence shows up in the read ID field. Could you please show me a few lines of your fastq file?

ypsung commented on July 19, 2024

@mourisl Thanks for the quick reply. My fastq file was like:

and I converted it into a .fa file that Centrifuge could read using:
cat test.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n' > test.fa

When I checked my .fa files as well, it seemed that the description line and the sequence were actually on the same line:

I'm wondering whether this was the cause of the error. What do you think?
Thanks again,
YP

mourisl commented on July 19, 2024

Yes, this will definitely cause the error. The command for conversion looks right, so I don't know which part of the conversion went wrong.
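
For comparison, a minimal Python sketch of the same FASTQ-to-FASTA conversion that writes the description line and the sequence on separate lines and fails loudly on malformed records. It is not the poster's method, only an alternative under stated assumptions: four-line FASTQ records, and a hypothetical helper name `fastq_to_fasta`.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert 4-line FASTQ records to 2-line FASTA records.

    Returns the number of records written. Line endings (both \n and
    \r\n) are stripped so header and sequence always end up on their
    own lines in the output.
    """
    n_records = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = fastq_handle.readline().rstrip("\r\n")
        plus = fastq_handle.readline()
        fastq_handle.readline()  # quality line, not needed for FASTA
        header = header.rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        fasta_handle.write(">" + header[1:] + "\n" + seq + "\n")
        n_records += 1
    return n_records


# In-memory example; real use would pass open("test.fastq") and
# open("test.fa", "w") instead.
src = io.StringIO("@read1 1:N:0\nACGTN\n+\nAAAA#\n@read2 1:N:0\nTTGCA\n+\nEEEEE\n")
dst = io.StringIO()
print(fastq_to_fasta(src, dst), "records written")
print(dst.getvalue(), end="")
```

Checking that every odd line of the result starts with `>` and every even line does not is a quick way to confirm the header and sequence did not end up on the same line.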

FaezeK commented on July 19, 2024

I didn't convert my fastq files to fasta since fastq is the default format for centrifuge:
Centrifuge version 1.0.4 by the Centrifuge developer team ([email protected])
Usage:
centrifuge [options]* -x {-1 -2 | -U | --sample-sheet } [-S ] [--report-file ]

. . .

Input:
-q query input files are FASTQ .fq/.fastq (default)

And I didn't have any error or warning messages:

report file centrifuge_B20_summary.tsv
Number of iterations in EM algorithm: 699
Probability diff. (P - P_prev) in the last iteration: 9.98508e-11
Calculating abundance: 00:00:10

Yet, 80.94 % of my reads are unclassified. Just to make sure my fastq file has the correct format, here are the first few lines:
@NB501138:169:HWT2VBGX5:1:11101:14983:1085 1:N:0:GTTTCG
GGATGTATGAATATGCCTTCGTCTTAAATTCACCGAACGTATACGAGTTGATGACACCGCTTAATTCAAGCAGCATGTAGTTCGCACCGCGTTTTTCGTTAATGCGCATNGTATCCATAACAACTTATCCCCTTACTTACTGNTGCACNT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEEEEEEEEEAAAAA<EEAE<EEEEEEEEEEE/EEEE#EEEAEAE<AA6E/EEE<EA/<AE<EEEAE<EE#EEA<<#<
@NB501138:169:HWT2VBGX5:1:11101:22359:1086 1:N:0:GTTTCG
CACTGCTCCTCCTTTCCCTGGATTTGACATCTGATTCTTCTTTTTATCAGAACCTGTCTTTTTACTTTCATCCGGCTGTTTCTCCGGATCTGTTTCATAGTAATTTTCTNTTTCATAACGACGGATCGCCGGACGCAGGAATNTGACANT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEAE<EEEEEAA<EEAEAAEEEAEEEEEEAEEEEEE#EEE<EEEEEAE/<AEE/6<AAAAAAA<EEEEE#AEEAE#E

And this is the command I used for classification:
centrifuge --phred33 --threads 16 --un-conc missingReads/ -x /dbs/centrifuge/nt -1 R1_001.fastq -2 R2_001.fastq --report-file centrifuge_B20_summary.tsv -S centrifuge_B20

mourisl commented on July 19, 2024

@FaezeK Does this happen with your other read files?
You can also send me some of the unclassified reads from missingReads/*.fq and we can take a look.

FaezeK commented on July 19, 2024

Yes, it happened to all 16 of my samples. The percentage varies slightly from one sample to another, but they all have approximately 80% unclassified reads.

Here is an example of the forward reads in my missingReads:
@NB501138:169:HWT2VBGX5:1:11101:20594:1086
CCCTTGAACTTCATGTTAGATGAACCTACCCATCTCAAGTACATCGATACGAGTCTTGCTTTGCACGCCGCACTTGGAGAACGTCTTGTTCAGGAATACAGGACCTCGGNTAAGGCTCCCTTTGTGGGACCTGCGGATCCGCNGGACGNT
+
AAAAAAEEEEEEEE/EEEEAEEEEEEEEEEAEEEEEEEEEEEEEAEEEEAEE6EEEEEEEE/E6EEE<EEE6EE//EEEEEAEE<AAE<EEEEEEEEEEEAEEEEEEE<#EEEEA<E<A/EEE6/A/AAE/AA/AAE<66A<#//EEA#<
@NB501138:169:HWT2VBGX5:1:11101:9173:1087
TTTCTCGACCCATATCAAAAAGGAGTATGTTCCGTCAATCAGGAACACGGCCCACATCGCCAGGATCGCCAGCGACGTGTCGTCCGTTTGCCACCACATGAAGACCTCGATGGCGTCGACCGCGAGCCACATCACCCACTGCNCGACANA
+
AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEAEEE<E<EAEEEEEEEEEEEEEEEEEAEEEEEAA<AEAE<EEAEEAAEEAAEEE/EEEEEEEEEEEEEAEEEEEA/<A<<E<EE/A6E<AAAA#6</<<#E
@NB501138:169:HWT2VBGX5:1:11101:20489:1087
ATGCAGGCAATCTTCTCTGCAAAGCTGGACAGGCCCCTGTCAAGGGTATCGATCATGGCATCATCAGCACCTTTCTGCCTTGCGAAGACGCAGGCCGCACCACGCGGTCCGCAGAACGGAGAGTCCACATCGCAGGCAATCCNGAATTNA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEE/EAAAE<6AEAEEEAEE/EEEEAEEAEEAE/AEEEEE/EEEA<A/EAA<E<EEE/EAA#<AAE<#E
@NB501138:169:HWT2VBGX5:1:11101:9334:1089
ATCCGGAACTCATACCCGGTTCCTGCCGGAGATTGGTAGCATCCCAGGTGAAGATGATGTCGTATTTGTTCTCGATCAGGCCCTCGGCCAGATAGTCACTGGCGTGGCGGTAGCAGGTGATCAGAACATTGGGGAAAAGGTGNTGGAANT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<<<<AEEEEA<AAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EAEEE<EEEEEEEE#EA<EE#E
@NB501138:169:HWT2VBGX5:1:11101:22359:1086
CACTGCTCCTCCTTTCCCTGGATTTGACATCTGATTCTTCTTTTTATCAGAACCTGTCTTTTTACTTTCATCCGGCTGTTTCTCCGGATCTGTTTCATAGTAATTTTCTNTTTCATAACGACGGATCGCCGGACGCAGGAATNTGACANT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEAE<EEEEEAA<EEAEAAEEEAEEEEEEAEEEEEE#EEE<EEEEEAE/<AEE/6<AAAAAAA<EEEEE#AEEAE#E

And these are from the reverse missing reads:
@NB501138:169:HWT2VBGX5:1:11101:20594:1086
NCAGNCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNA
+
#AAA#EE##############################################################################################################################################A
@NB501138:169:HWT2VBGX5:1:11101:9173:1087
TGTCNTTNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACNNNGNTNNCGANNNTGNNNNNNNNNNNNNNNNNNNTG
+
AAAA#AE##E######################################################################################################EE###E#A##EEA###EE###################A<
@NB501138:169:HWT2VBGX5:1:11101:20489:1087
NTCCNCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNNNNNNNNNNNNN
+
#AAA#EE##########################################################################################################################<###################
@NB501138:169:HWT2VBGX5:1:11101:9334:1089
CATTNCCNNCNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGCTNCNNNCNNNNNNGNNNNNGTNTNGNATNTTGNCNATNNNNANNNNNNNNNNNNNNGC
+
AAAA#EE##E#E#############################################################################<AEEE#E###E######E#####AE#E#E#EE#AEE#<#EE####E##############AA
@NB501138:169:HWT2VBGX5:1:11101:22359:1086
NGGCNTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
#AAA#EE#############################################################################################################################################

mourisl commented on July 19, 2024

It seems that you have many reads consisting mostly of the undetermined nucleotide "N", which therefore could not be classified. How many of the higher-quality reads cannot be classified?
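
To put a number on that question, a rough Python sketch that measures the fraction of "N" bases per read and counts how many reads exceed a cutoff. The 0.5 cutoff and the function names are arbitrary assumptions for illustration, not Centrifuge settings, and the FASTQ is assumed to use four lines per record.

```python
import io


def n_fraction(seq):
    """Fraction of bases in seq that are 'N' (an empty read counts as 1.0)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def count_high_n_reads(fastq_handle, max_n_fraction=0.5):
    """Return (total_reads, reads whose N fraction exceeds max_n_fraction)."""
    total = high_n = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            total += 1
            if n_fraction(line.strip()) > max_n_fraction:
                high_n += 1
    return total, high_n


# Shortened stand-ins for a forward and a reverse read like those above.
fastq = io.StringIO(
    "@fwd\nCCCTTGAACTNCATG\n+\nAAAAAEEEEE#EEEE\n"
    "@rev\nNCAGNCTNNNNNNNN\n+\n#AAA#EE########\n"
)
total, high_n = count_high_n_reads(fastq)
print(f"{high_n} of {total} reads are more than 50% N")
```

Running this on the files written by --un-conc would separate unclassified reads that are mostly N from unclassified reads of otherwise good quality.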

FaezeK commented on July 19, 2024

My forward reads file has very good quality, and even when I run centrifuge with the --norc parameter (do not align reverse-complement version of read (off)), I get exactly the same fraction of unclassified reads (80.94%):
centrifuge --phred33 --threads 32 --norc --un-conc missingReads/ -x /centrifuge/nt -1 R1_001.fastq -2 R2_001.fastq --report-file B20_fwd_only_summary.tsv -S B20_fwd_only

FaezeK commented on July 19, 2024

The first 5 fastq reads that I provided in the previous comment are forward reads and the last 5 are reverse reads, as mentioned in the comment.
