Classifier for metagenomic sequences
License: GNU General Public License v3.0
Hi Daehwan,
I think the manual does not describe Centrifuge; it contains the HISAT manual instead.
https://github.com/infphilo/centrifuge/blob/master/MANUAL
Thanks.
% make install
make: *** No rule to make target `install'. Stop.
This is using your latest public release and following your Install docs.
If this is fixed in HEAD, could you please make a new release for packaging in Homebrew Science?
12 -rwxrwxr-x. 1 linuxbrew linuxbrew 12122 Aug 18 04:05 centrifuge-BuildSharedSequence.pl
4 -rw-rw-r--. 1 linuxbrew linuxbrew 403 Aug 18 04:05 centrifuge-RemoveEmptySequence.pl
4 -rw-rw-r--. 1 linuxbrew linuxbrew 1002 Aug 18 04:05 centrifuge-RemoveN.pl
16 -rwxrwxr-x. 1 linuxbrew linuxbrew 12918 Aug 18 04:05 centrifuge-compress.pl
4 -rwxrwxr-x. 1 linuxbrew linuxbrew 1564 Aug 18 04:05 centrifuge-sort-nt.pl
Has anyone come across this error so far? Both my input sequence file and my seqid2taxa.map file contain this ID, yet centrifuge-build still reports this error.
I've encountered a weird bug which happens whenever --metric-file or --report-file is given as a parameter. The output of report.tsv (or whatever I named it) is incomplete (or rather blank) and contains only one line of headers. If I don't specify --metric-file and/or --report-file, the standard centrifuge_report.tsv is written normally.
Classification results are always written, though.
Hi,
The --report-file option seems to generate files with missing name, genome_size, and avg_size values.
I am attaching a sample report file.
Thanks
-rwxrwxr-x. 1 linuxbrew linuxbrew 8506 Aug 18 04:05 tinythread.cpp
-rwxrwxr-x. 1 linuxbrew linuxbrew 21220 Aug 18 04:05 tinythread.h
-rwxrwxr-x. 1 linuxbrew linuxbrew 6940 Aug 18 04:05 fast_mutex.h
Hello,
I'm kind of at my wits' end with Centrifuge, as I've been trying to get it to work with my own database, and with NCBI bacteria & viruses, for a long time now. To paraphrase Roseanne Roseannadanna, "It's always something..."
I recently gave it another go with your pre-made indices, just to see if I could get it to run at all. Before running a bunch of my samples through, I used centrifuge-inspect together with grep to determine whether all of my target organisms were indeed in the database...
$ centrifuge-inspect --name-table p+h+v > nametable.txt
$ grep "Zika" nametable.txt
$
From what I can tell, Zika virus is not in the p+h+v index (the pre-made bacteria, viruses, archaea, and human index listed on the right margin of your website)? All of my other target organisms (Human papillomavirus type 132 and Variola virus, for example) are included in this index.
$ grep "Human papillomavirus type 132" nametable.txt
909331 Human papillomavirus type 132
$ grep "Variola" nametable.txt
10255 Variola virus
ALSO...
Since Zika did not seem to be included, I tried using centrifuge-download again, but I get an error. The connection to NCBI's FTP site seems to be blocked or otherwise failing. Below is the error I get...
$ centrifuge-download -o taxonomy taxonomy
Downloading NCBI taxonomy ...
rsync: failed to connect to ftp.ncbi.nih.gov (130.14.250.7): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nih.gov (2607:f220:41e:250::13): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.0]
Hi,
Thanks for writing to us.
The issue is mostly the HTTP protocol used by the tool. With the switch to HTTPS late last year, NCBI now requires that HTTP access to our FTP site be switched to HTTPS. You will need to contact the Centrifuge code provider for them to update their code to use the HTTPS protocol instead.
A minor issue is the ftp.ncbi.nih.gov domain. Even though it may still work for historical reasons, it may not. The domain should be fully specified with .nlm included, i.e. ftp.ncbi.nlm.nih.gov.
Regards,
I dove into the centrifuge-download script to see if I could manually update the web address the script points to. There was only one place where the address was missing the '.nlm', and that was line 194. I added the '.nlm' to the address on that line, saved, and re-ran, but I got the same error. I didn't see any references to http and/or https in the centrifuge-download source code.
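For anyone hitting the same thing, patching every occurrence of the short domain in one pass may be safer than hunting individual line numbers. A sketch, demonstrated on a scratch file rather than the real centrifuge-download script (sed -i.bak keeps a backup of the original):

```shell
# Create a scratch stand-in for the centrifuge-download script.
printf 'rsync ftp.ncbi.nih.gov::path\n' > demo-centrifuge-download

# Replace every ftp.ncbi.nih.gov with the fully specified domain.
sed -i.bak 's/ftp\.ncbi\.nih\.gov/ftp.ncbi.nlm.nih.gov/g' demo-centrifuge-download

cat demo-centrifuge-download   # -> rsync ftp.ncbi.nlm.nih.gov::path
```

Note this fixes only the domain, not the HTTP-vs-HTTPS issue NCBI describes above.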
Also, where does one manually retrieve the names.dmp and nodes.dmp files from NCBI? Weren't those files phased out when they updated to the new format without GI numbers?
Any help ironing out these problems would be much appreciated.
Thank you.
When I execute centrifuge I see this as usage:
Centrifuge version v1.0.1-beta-40-g689d12fbd0 by Daehwan Kim ([email protected], www.ccb.jhu.edu/people/infphilo)
Usage:
hisat [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <filename>] [--report-file <report>]
...
As you can see, it says to execute hisat.
Is there a way I could index custom sequences?
head -n 1 centrifuge-RemoveN.pl
#/bin/perl
should be #!/usr/bin/env perl
I am getting the following results for a read:
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template no rank 0 256 256 31 5226 4
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template no rank 0 256 256 31 5226 4
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template family 10699 256 256 31 5226 4
cc722ee0-14cf-43a0-b53c-75ba10f250e4_Basecall_Alignment_template no rank 0 256 256 31 5226 4
Is it normal to get taxID 0? It is not present when using 'centrifuge-inspect --taxonomy-tree'.
best,
c
Read: ACATACTTTACGTTCAGTTACGTATTGCTCAGCACCATCTATAGGTGGCAATGGCTCATTCAATTATTCTAAAACAATTAGTTATACCCAAAGAGTTATGTCAGTGAAGTAGACAAGCAAAACTCAAAATCTACTGTTAAATGATGTTCAAAGCAAACGAATTTGTACATACGATGGAAAAATCTGCGCATGATAGTATTTATTCGTACAAAGTCAAATGGTCCAGCAGTTTCAGCAAGAATATTTTGCTCCTGATAATCAGTACCACCTTTAGTTCAAGTGGCTTTAATCCATCGTTTATCACTACACTATCACATGAAAAGGTTCAAGTGATGAGTGAATTGAAATTTCATATGGTAGAAACTTAGATATTACATATGCGACTTTATTCCTAAATTTAGTATTTGCAGAAAGAAAGCATAATGCATTTGTAAATAGAAACTTTGTAGTTAGATATGAGTTAATTGGAAAACACGGGAATTAAGAGTGAAAGGACGCAATTAATATGAAATGAAAAATTGAGTCAAATCATCAGTTGCTTCATCGTTGCACTGCTTTTGCTATCGAATACAGTTGATGCAGCTCAACATATCACACCTG
Index: refseq-viral
I ran some samples with Centrifuge, but I notice that I'm missing some reads in the final results. As an example, here is a summary of the numbers I find:
fastq file = 27587 reads
centrifuge metrics - Read = 27587
centrifuge metrics - UnfilteredRead = 27587
cat <centrifuge output> | cut -f1 | uniq | wc -l
= 22428 (including header)
centrifuge kreport - unclassified = 32
centrifuge kreport - root = 22394
So something about these numbers is not correct. In the Centrifuge output I'm missing 27587 - 22427 = 5160 reads. What happened to those reads? From the metrics file I could not directly see why those reads are not used; I do know they were read, because the numbers in the metrics file are correct.
When I add unclassified to root I get 22394 + 32 = 22426. This means that 1 read is getting lost between the Centrifuge output and the kreport. This looks like an off-by-one error (an array starting from 1 instead of 0)?
I see this pattern on all samples I ran; it does not matter whether it's Illumina or Nanopore data.
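As an aside, the counting pipeline above includes the header line and relies on uniq, which only collapses *adjacent* duplicates. A slightly more defensive count, shown on a fabricated output file (the real file would be the Centrifuge -S output):

```shell
# Fabricated Centrifuge per-read output: header + 3 hit lines for 2 reads.
printf 'readID\tseqID\ttaxID\nr1\ts1\t9606\nr1\ts2\t9606\nr2\ts3\t562\n' > demo_output.tsv

# Drop the header with tail, then count distinct read IDs with sort -u.
tail -n +2 demo_output.tsv | cut -f1 | sort -u | wc -l   # -> 2
```

Comparing that count directly against the fastq read count avoids the header off-by-one noted above.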
Hi, I ran into an issue when I tried to add Saccharomyces cerevisiae genomes and others to my Centrifuge database. I get a warning about 55,542 sequence IDs.
I created my own "seqid2taxid.map" file and I can grep the sequence IDs in it; for instance, in vim the sequence ID NC_001224.1 is associated with tax ID 559292 (tab-separated, with no other characters on the line, i.e.: NC_001224.1^I559292$).
Then I looked in the names.dmp and nodes.dmp files, and 559292 is associated with:
grep -w "^559292" taxonomy/names.dmp
559292 | Saccharomyces cerevisiae S288c | | scientific name |
And on nodes.dmp:
grep -w "^559292" taxonomy/nodes.dmp
559292 | 4932 | no rank | | 4 | 1 | 1 | 1 | 3 | 1 | 1 | 0 | |
Could you explain whether I am doing something wrong that I can correct? I really want to count S. cerevisiae in my data and I can't associate taxonomy with its sequences.
Thanks,
Alban
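One sanity check worth running on a hand-built map (demonstrated on a scratch file, not Alban's actual map): flag any line that is not exactly two tab-separated fields, since stray spaces or extra columns would make centrifuge-build fail to match the IDs:

```shell
# Scratch map: one well-formed line, one malformed line (space, not tab).
printf 'NC_001224.1\t559292\nBAD_LINE 559292\n' > demo_seqid2taxid.map

# Print the line number and content of every line that does not have
# exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { print NR": "$0 }' demo_seqid2taxid.map   # flags line 2
```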
After running the following command with 10 cores and 150GB of memory (of which 87.140GB were used) on CentOS Linux release 7.1.1503:
$CENTRIFUGE_HOME/centrifuge -q -x datasets/centrifuge/nt/nt -1 f1.R1.fastq -2 f1.R2.fastq
Centrifuge classifies sequences from the fastq files, but errors out when generating centrifuge_report.csv.
The error output is here:
error.txt
Am I doing something wrong?
In the manual, there is an example for building a reference:
$CENTRIFUGE_HOME/centrifuge-build --conversion-table $CENTRIFUGE_HOME/example/reference/gi_to_tid.dmp --taxonomy-tree $CENTRIFUGE_HOME/example/reference/nodes.dmp --name-table $CENTRIFUGE_HOME/example/reference/names.dmp $CENTRIFUGE_HOME/example/reference/test.fa test
However, the example/reference directory only contains test.fa, so the command fails because the other files are missing.
Please make them all use #!/usr/bin/env perl:
head -n 1 *.pl
==> centrifuge-BuildSharedSequence.pl <==
#!/bin/perl
==> centrifuge-RemoveEmptySequence.pl <==
#!/bin/perl
==> centrifuge-RemoveN.pl <==
#/bin/perl
==> centrifuge-compress.pl <==
#!/usr/bin/perl
==> centrifuge-sort-nt.pl <==
#! /usr/bin/env perl
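A sketch of normalizing these shebangs in one pass with sed, demonstrated here on a scratch file (on the real tree you would run it over the .pl scripts; BSD sed wants -i '' instead of bare -i):

```shell
# Scratch script with the broken shebang from centrifuge-RemoveN.pl.
printf '#/bin/perl\nprint "hi\\n";\n' > demo-script.pl

# On line 1, match "#/bin/perl" or "#!/bin/perl" and rewrite it to the
# portable env form.
sed -i '1s|^#!\{0,1\}/bin/perl$|#!/usr/bin/env perl|' demo-script.pl

head -n 1 demo-script.pl   # -> #!/usr/bin/env perl
```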
Hi there,
I installed Centrifuge successfully and ran the test data provided with the tool; it all worked fine. However, when I ran my own dataset with the command:
centrifuge -f -x abv ~/Documents/HC.650.fa
it gives me this error:
(ERR): centrifuge-class died with signal 11 (SEGV) (core dumped)
At first I was working locally; then I thought this could be a memory issue, so I moved to the server and increased the memory allocation too, but the problem remains the same.
I would appreciate it if you could guide me through this.
Thanks
Gaurav
Hello,
Is it possible to implement a Kraken-style mpa-report which gives the standard lineage for each OTU?
Here is the script url:
https://github.com/DerrickWood/kraken/blob/master/scripts/kraken-mpa-report
I also see that Pavian (https://github.com/fbreitwieser/pavian/) can create a "taxonstring" column, but one cannot export all the rows at once.
Thanks
Ashish
Hello,
I was trying to use Centrifuge on several bacterial shotgun metagenomic datasets.
It stops with the following error message:
report file study.18512
Number of iterations in EM algorithm: 8415
Probability diff. (P - P_prev) in the last iteration: 9.99892e-11
(ERR): centrifuge-class died with signal 11 (SEGV)
The report file was empty.
Hi,
Is there any option to specify the taxonomic level at which to report the classification, or even the full lineage? Currently I'm getting only species/genus classifications, which aren't really informative for assessing environmental metagenome bins.
Thanks,
Ruben
The centrifuge_report.csv file is not a comma-separated file.
make b_compressed+h+v THREADS=72
Making: b_compressed+h+v: b_compressed+h+v
make -f Makefile IDX_NAME=b_compressed+h+v
make[1]: Entering directory `/mnt/seq/KRAKEN/centrifuge'
mkdir -p reference-sequences
[[ -d tmp_b_compressed+h+v ]] && rm -rf tmp_b_compressed+h+v; mkdir -p tmp_b_compressed+h+v
Downloading and dust-masking viral
centrifuge-download -o tmp_b_compressed+h+v -m -d "viral" -P 72 refseq > \
tmp_b_compressed+h+v/all-viral.map
grep: .listing: No such file or directory
viral is not a valid domain - use one of the following:
grep: .listing: No such file or directory
make[1]: *** [reference-sequences/all-viral.fna] Error 2
make[1]: Leaving directory `/mnt/seq/KRAKEN/centrifuge'
make: *** [b_compressed+h+v] Error 2
Here's the file status:
% find .
.
./Makefile
./reference-sequences
./tmp_b_compressed+h+v
./tmp_b_compressed+h+v/all-viral.map
and that last .map file is empty.
It would be nice if the kreport script supported zipped input files. The output is currently one line per read/pair, which can become a very large file. Piping from centrifuge to gzip is easy, but kreport does not accept gzipped input.
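Until gzip support lands, one possible workaround is bash process substitution, since kreport takes a file argument. With assumed index and file names, the real invocation would look something like `centrifuge-kreport -x idx <(zcat results.tsv.gz)`; here is the mechanism demonstrated generically on a fabricated gzipped file:

```shell
# Fabricated gzipped classification output.
printf 'line1\nline2\n' | gzip > demo_results.tsv.gz

# Process substitution hands the decompressed stream to a tool that
# expects a file path, with no uncompressed copy on disk (bash-only).
bash -c 'cat <(zcat demo_results.tsv.gz) | wc -l'   # -> 2
```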
Hi,
some of my samples are not generating a report file, and the program exits with the following message.
report file /home/people/user/centrifuge/S1134.report
Number of iterations in EM algorithm: 1118
Probability diff. (P - P_prev) in the last iteration: 9.70335e-11
*** Error in `/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class': free(): invalid next size (normal): 0x00000004819753a0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7d023)[0x7ffff7375023]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x445f7b]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x41cecb]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x41f3fb]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x49812b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff7319b15]
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class[0x40f6e9]
======= Memory map: ========
00400000-00747000 r-xp 00000000 00:2c 6914555853 /home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class
00946000-0095e000 rw-p 00346000 00:2c 6914555853 /home/people/user/apps/centrifuge-1.0.1-beta/centrifuge-class
0095e000-481985000 rw-p 00000000 00:00 0 [heap]
7fed60000000-7fed62b0f000 rw-p 00000000 00:00 0
7fed62b0f000-7fed64000000 ---p 00000000 00:00 0
7fed88000000-7fed8903e000 rw-p 00000000 00:00 0
7fed8903e000-7fed8c000000 ---p 00000000 00:00 0
7fedd0000000-7fedd26b5000 rw-p 00000000 00:00 0
7fedd26b5000-7fedd4000000 ---p 00000000 00:00 0
7fedd4000000-7fedd55a4000 rw-p 00000000 00:00 0
7fedd55a4000-7fedd8000000 ---p 00000000 00:00 0
7fedd8000000-7fedd8fa1000 rw-p 00000000 00:00 0
7fedd8fa1000-7feddc000000 ---p 00000000 00:00 0
7fee40000000-7fee41254000 rw-p 00000000 00:00 0
7fee41254000-7fee44000000 ---p 00000000 00:00 0
7fee44000000-7fee45d13000 rw-p 00000000 00:00 0
7fee45d13000-7fee48000000 ---p 00000000 00:00 0
7fee48000000-7fee49d19000 rw-p 00000000 00:00 0
7fee49d19000-7fee4c000000 ---p 00000000 00:00 0
7ff5a6e85000-7ff5c7286000 rw-p 00000000 00:00 0
7ff5c8000000-7ff5c9b01000 rw-p 00000000 00:00 0
7ff5c9b01000-7ff5cc000000 ---p 00000000 00:00 0
7ff5cc000000-7ff5cd9b5000 rw-p 00000000 00:00 0
7ff5cd9b5000-7ff5d0000000 ---p 00000000 00:00 0
7ff5d0000000-7ff5d16f4000 rw-p 00000000 00:00 0
7ff5d16f4000-7ff5d4000000 ---p 00000000 00:00 0
7ff5d6c86000-7ff5d6c87000 ---p 00000000 00:00 0
7ff5d6c87000-7ff5d7487000 rw-p 00000000 00:00 0
7fff457a4000-7fffa63a6000 rw-p 00000000 00:00 0
7fffa67fd000-7fffa67fe000 ---p 00000000 00:00 0
7fffa67fe000-7fffa6ffe000 rw-p 00000000 00:00 0
7fffa6ffe000-7fffa6fff000 ---p 00000000 00:00 0
7fffa6fff000-7fffa77ff000 rw-p 00000000 00:00 0
7fffa77ff000-7fffa7800000 ---p 00000000 00:00 0
7fffa7800000-7fffa8000000 rw-p 00000000 00:00 0
7fffa8000000-7fffa9884000 rw-p 00000000 00:00 0
7fffa9884000-7fffac000000 ---p 00000000 00:00 0
7fffac000000-7fffacb5c000 rw-p 00000000 00:00 0
7fffacb5c000-7fffb0000000 ---p 00000000 00:00 0
7fffb04e2000-7fffb04f8000 r-xp 00000000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb04f8000-7fffb06f7000 ---p 00016000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb06f7000-7fffb06f8000 r--p 00015000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb06f8000-7fffb06f9000 rw-p 00016000 08:01 925922 /cm/local/apps/gcc/5.1.0/lib64/libgcc_s.so.1
7fffb06f9000-7fffb07f9000 rw-p 00000000 00:00 0
7fffb07f9000-7fffb07fa000 ---p 00000000 00:00 0
7fffb07fa000-7fffb0ffa000 rw-p 00000000 00:00 0
7fffb0ffa000-7fffb0ffb000 ---p 00000000 00:00 0
7fffb0ffb000-7fffb17fb000 rw-p 00000000 00:00 0
7fffb17fb000-7fffb17fc000 ---p 00000000 00:00 0
7fffb17fc000-7fffb1ffc000 rw-p 00000000 00:00 0
7fffb1ffc000-7fffb1ffd000 ---p 00000000 00:00 0
7fffb1ffd000-7fffb27fd000 rw-p 00000000 00:00 0
7fffb27fd000-7fffb27fe000 ---p 00000000 00:00 0
7fffb27fe000-7fffb2ffe000 rw-p 00000000 00:00 0
7fffb2ffe000-7fffb2fff000 ---p 00000000 00:00 0
7fffb2fff000-7fffb37ff000 rw-p 00000000 00:00 0
7fffb37ff000-7fffb3800000 ---p 00000000 00:00 0
7fffb3800000-7fffb4000000 rw-p 00000000 00:00 0
7fffb4000000-7fffb4e77000 rw-p 00000000 00:00 0
7fffb4e77000-7fffb8000000 ---p 00000000 00:00 0
7fffb8000000-7fffb9e85000 rw-p 00000000 00:00 0
7fffb9e85000-7fffbc000000 ---p 00000000 00:00 0
7fffbc000000-7fffbd4a1000 rw-p 00000000 00:00 0
7fffbd4a1000-7fffc0000000 ---p 00000000 00:00 0
7fffc002f000-7fffc00af000 rw-p 00000000 00:00 0
7fffc00af000-7fffc00b0000 ---p 00000000 00:00 0
7fffc00b0000-7fffe4b31000 rw-p 00000000 00:00 0
7fffe4b6d000-7fffe4b6e000 ---p 00000000 00:00 0
7fffe4b6e000-7fffe536e000 rw-p 00000000 00:00 0
7fffe536e000-7fffe536f000 ---p 00000000 00:00 0
7fffe536f000-7fffe5b6f000 rw-p 00000000 00:00 0
7fffe5b6f000-7fffe5b70000 ---p 00000000 00:00 0
7fffe5b70000-7fffe6370000 rw-p 00000000 00:00 0
7fffe6370000-7fffe6371000 ---p 00000000 00:00 0
7fffe6371000-7ffff72f8000 rw-p 00000000 00:00 0
7ffff72f8000-7ffff74ae000 r-xp 00000000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff74ae000-7ffff76ae000 ---p 001b6000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff76ae000-7ffff76b2000 r--p 001b6000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff76b2000-7ffff76b4000 rw-p 001ba000 08:01 542276 /usr/lib64/libc-2.17.so
7ffff76b4000-7ffff76b9000 rw-p 00000000 00:00 0
7ffff76b9000-7ffff77ba000 r-xp 00000000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff77ba000-7ffff79b9000 ---p 00101000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff79b9000-7ffff79ba000 r--p 00100000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff79ba000-7ffff79bb000 rw-p 00101000 08:01 542597 /usr/lib64/libm-2.17.so
7ffff79bb000-7ffff79be000 r-xp 00000000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff79be000-7ffff7bbd000 ---p 00003000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff7bbd000-7ffff7bbe000 r--p 00002000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff7bbe000-7ffff7bbf000 rw-p 00003000 08:01 556500 /usr/lib64/libdl-2.17.so
7ffff7bbf000-7ffff7bd5000 r-xp 00000000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7bd5000-7ffff7dd5000 ---p 00016000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7dd5000-7ffff7dd6000 r--p 00016000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7dd6000-7ffff7dd7000 rw-p 00017000 08:01 556539 /usr/lib64/libpthread-2.17.so
7ffff7dd7000-7ffff7ddb000 rw-p 00000000 00:00 0
7ffff7ddb000-7ffff7dfc000 r-xp 00000000 08:01 541997 /usr/lib64/ld-2.17.so
7ffff7e57000-7ffff7fe1000 rw-p 00000000 00:00 0
7ffff7ff8000-7ffff7ffa000 rw-p 00000000 00:00 0
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00021000 08:01 541997 /usr/lib64/ld-2.17.so
7ffff7ffd000-7ffff7ffe000 rw-p 00022000 08:01 541997 /usr/lib64/ld-2.17.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffdd000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
(ERR): centrifuge-class died with signal 6 (ABRT)
The command to run the sample:
/home/people/user/apps/centrifuge-1.0.1-beta/centrifuge -k 1 -q -p 16 --reorder -x /home/people/user/apps/centrifuge-1.0.1-beta/indices/b+h+v/b+h+v -1 /home/people/user/data_trimmed/S1134_R1.trim.fq -2 /home/people/user/data_trimmed/S1134_R2.trim.fq --report-file /home/people/user/centrifuge/S1134.report -S /home/people/user/centrifuge/S1134.summary
It looks like the summary report may be reporting wrong genome sizes.
For human (taxID 9606):
From report: 6,339,524,059 (2X bigger than expected)
From NCBI: median total length (Mb): 2996.43
For gorilla (taxID 9593):
From report: 19,140,263 (100X smaller than expected)
From NCBI: median total length (Mb): 3058.03
For Picea glauca (taxID 3330):
From report: 26,852,969 (1,000X smaller than expected)
From NCBI: median total length (Mb): 25784.7
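For what it's worth, the ratios can be reproduced from the numbers quoted above (NCBI medians converted from Mb); the gorilla and Picea discrepancies come out nearer 160x and 960x, i.e. the same orders of magnitude as described:

```shell
# Ratio of reported size to NCBI median (human), and NCBI median to
# reported size (gorilla, Picea glauca), using the figures in this report.
awk 'BEGIN {
  printf "human:   %.1fx too big\n",   6339524059  / 2996430000;
  printf "gorilla: %.0fx too small\n", 3058030000  / 19140263;
  printf "picea:   %.0fx too small\n", 25784700000 / 26852969;
}'
```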
Hi Daehwan,
Just wanted to notify you about a minor thing I noticed: while there were reads missing from the main Centrifuge output (presumably unaligned?), the file for unaligned reads specified with the --un option remained empty.
This is not an issue for me personally, since it was easy to find the unaligned reads by other means.
Cheers,
Moritz
FYI, the examples below appear to be missing 'refseq' at the end, before the '>>'.
Just stepping through.
Thanks,
Bob
# download mouse and human reference genomes
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606,10090 -c 'reference genome' >> seqid2taxid.map
# only human
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' >> seqid2taxid.map
# only mouse
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 10090 -c 'reference genome' >> seqid2taxid.map
Wouldn't the repo be cleaner with all the code in a src/ subfolder?
Hi, I am trying to set up Centrifuge on my system. However, I am getting the following error in the index-building step.
The command I am running:
centrifuge-build -p 4 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna abc
Output:
Settings:
Output files: "abv..cf"
Line rate: 7 (line is 128 bytes)
Lines per side: 1 (side is 128 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
input-sequences.fna
Warning: Empty fasta file: 'input-sequences.fna'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Centrifuge exception (#1)
Thanks
Gaurav
I've been trying to get databases set up for Centrifuge for a few weeks now, and I've been casually banging my head against this problem. I get the same error on two different macOS Sierra machines. I suspect it's a permission issue: it seems that after the files are downloaded, the data can't be written to the database. I have tried modifying the makefile to use 'sudo', but none of my changes seem to work.
Here's my error message:
Progress : [######################################--] 97% 5855/5983
make[1]: *** [reference-sequences/all-viral.fna] Error 1
make: *** [b_compressed+h+v] Error 2
Any help here would be appreciated! Thanks!
Hello,
Oops, I pressed the submit button too early for my previous request.
Here I will also include the log file and the error message below.
Could you please let me know what could possibly be wrong?
I am using centrifuge-1.0.1-beta.
Thanks very much,
Josef,
report file study.18512
Number of iterations in EM algorithm: 8415
Probability diff. (P - P_prev) in the last iteration: 9.99892e-11
(ERR): centrifuge-class died with signal 11 (SEGV)
The output of centrifuge-download contaminants looks like:
gnl|uv 32630
gnl|uv 32630
gnl|uv 32630
gnl|uv 32630
...
The URL http://ccb.jhu.edu/software/centrifuge/downloads/centrifuge-suppl/data.tar.gz is not working.
Hello,
Not sure if this is the right place, but in your documentation there are no examples in the 'Custom Database' section, just a TODO list :-)
If you could add these in, that would be really useful.
Thanks!
Phil
Hi,
I am trying to use Centrifuge for a metagenomics project. I only need one taxonomy ID for each read, but there are too many of them, and I have no idea which parameter controls how many taxonomy IDs a read can output. Could you please show me which parameter limits the number of output taxonomy IDs?
Thanks a lot.
It should be make p+h+v instead of b+h+v; b should be replaced with p for the compressed indices as well.
centrifuge-kreport currently outputs hierarchical two-space-indented reports which do not correspond to Kraken's usual report format.
Downstream tools that depend on this, such as Krona, do not accept hierarchical formats, only the classical P;C;O;(...) output Kraken delivers. It is also somewhat difficult to convert the current output to a Krona-friendly format.
Any chance a 'root;cellular organisms;Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Mycobacteriaceae;Mycobacterium;'-like format can be included in this command?
Either that or does someone have a workaround for this?
Thanks!
Maybe there is already a way to extract this information, but would it be possible to add the number of unknown or unmatched reads to the report table? As far as I can tell, only matches are currently reported.
Hi,
I'm looking forward to trying your tool! Thanks for making the source code and the pre-print available.
I realized the Makefile does not have an install target, and I was curious whether you are planning to change that.
Best wishes,
I tried running Centrifuge on the test data and it works, but it keeps failing on my own data, using the same reference.
$CENTRIFUGE_HOME/centrifuge -f -x ~/ref/Centrifuge/b+h+v/b+h+v ./testfile.fasta
I see the results, but the report fails:
report file centrifuge_report.csv
Number of iterations in EM algorithm: 4
Probability diff. (P - P_prev) in the last iteration: 3.52546e-11
*** glibc detected *** /ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class: munmap_chunk(): invalid pointer: 0x000000000185a7f0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75f4e)[0x2aaaab3e8f4e]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x445fd2]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x41cecb]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x41f3fb]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x49812b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2aaaab391d5d]
/ifs/home/id/software/centrifuge-1.0.1-beta/centrifuge-class[0x40f6e9]
This happened with 1,000 50 bp sequences, but not when I used 100.
The refseq_microbial database target attempts to obtain archaea-chromosome_level and fails:
$ make refseq_microbial
.....
Making: refseq_microbial: refseq_microbial
make -f Makefile IDX_NAME=refseq_microbial
make[1]: Entering directory `/ceph/mgx-sw/src/centrifuge/indices'
[[ -d tmp_refseq_microbial ]] && rm -rf tmp_refseq_microbial; mkdir -p tmp_refseq_microbial
Downloading and dust-masking archaea-chromosome_level
centrifuge-download -o tmp_refseq_microbial -m -d "archaea-chromosome_level" -P 1 refseq > \
tmp_refseq_microbial/all-archaea-chromosome_level.map
archaea-chromosome_level is not a valid domain - use one of the following:
make[1]: *** [reference-sequences/all-archaea-chromosome_level.fna] Error 1
make[1]: Leaving directory `/ceph/mgx-sw/src/centrifuge/indices'
make: *** [refseq_microbial] Error 2
$ cat tmp_refseq_microbial/all-archaea-chromosome_level.map
archaea
bacteria
fungi
invertebrate
plant
protozoa
vertebrate_mammalian
vertebrate_other
viral
$ ls reference-sequences/
all-archaea.fna all-bacteria.fna all-fungi.fna all-protozoa.fna all-viral.fna
all-archaea.map all-bacteria.map all-fungi.map all-protozoa.map all-viral.map
Hi,
I am trying this beta build; it worked well for a few files and then stopped working with the message below.
(ERR): centrifuge-class died with signal 11 (SEGV) (core dumped)
This machine has 448 GB of RAM and 32 cores.
ashish4@ashish4:/mnt/centrifuge/library$ ~/centri*/centrifuge --version
/home/ashish4/centrifuge/centrifuge-class version 1.0.0-beta
64-bit
Built on ashish
Sun Jan 31 01:03:16 UTC 2016
Compiler: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
centrifuge-build-bin crashed on me attempting to build refseq_microbial with THREADS=150.
In addition, indices/Makefile does not check the return code of centrifuge-build, thus indicating success instead of failing.
Last lines of output:
bmax according to bmaxDivN setting: 3991473403
Using parameters --bmax 2993605053 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 2993605053 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
Core was generated by `centrifuge-build-bin --wrapper basic-0 -p 150 --ftabchars 14 --conversion-table'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000040c0ce in try_lock (this=0x0) at fast_mutex.h:161
161 );
(gdb) bt
#0 0x000000000040c0ce in try_lock (this=0x0) at fast_mutex.h:161
#1 lock (this=0x0) at fast_mutex.h:125
#2 ThreadSafe (locked=true, ptr_mutex=0x0, this=<synthetic pointer>) at threading.h:42
#3 VSorting_worker<SString<char> > (vp=0x1d4de30) at diff_sample.h:696
#4 0x000000000046265f in tthread::thread::wrapper_function (aArg=0x1450a20) at tinythread.cpp:169
#5 0x00002b14eadf5184 in start_thread (arg=0x2b1507e54700) at pthread_create.c:312
#6 0x00002b14eb92537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) up 3
#3 VSorting_worker<SString<char> > (vp=0x1d4de30) at diff_sample.h:696
696 ThreadSafe ts(param->mutex, true);
(gdb) p param
$1 = (VSortingParam<SString<char> > *) 0x1d4de30
(gdb) p param->mutex
$2 = (tthread::fast_mutex *) 0x0
(gdb) l
691 const size_t hlen = host.length();
692 uint32_t v = dcs->v();
693 while(true) {
694 size_t cur = 0;
695 {
696 ThreadSafe ts(param->mutex, true);
697 cur = *(param->cur);
698 (*param->cur)++;
699 }
700 if(cur >= param->boundaries->size()) return;
Centrifuge version 1.0.3-beta by Daehwan Kim ([email protected], www.ccb.jhu.edu/people/infphilo)
Usage:
hisat [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <filename>] [--report-file <report>]
It is not parsable by software with the name "genome research" and cannot be sorted or compared.
I notice the VERSION file says 1.0.3-beta, but centrifuge --version doesn't print that.
Working with a database I created using bacterial, archaeal, and fungal genomes, I am now getting the following error when trying to run it with some reads:
Error reading _ebwt[] array: 1352, 7627660928
Error: Encountered internal Centrifuge exception (#1)
Command: /usr/local/bin/centrifuge-class --wrapper basic-0 -S /mnt/e/reads_output.txt --report-file /mnt/e/reads_report.tsv -f -p 8 -U /mnt/e/reads.fasta /mnt/e/centrifuge_database/abv
(ERR): centrifuge-class exited with value 1
Not sure what to do here. Any help, @infphilo?
I'm still banging my head against this problem. I've worked on it off and on for the last few months and can't seem to get this database to index. I'm trying to build the nt database index, and I am now getting this error:
$ centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fna nt
Settings:
Output files: "nt.*.cf"
Line rate: 7 (line is 128 bytes)
Lines per side: 1 (side is 128 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Max bucket size: 1342177280
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: default
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
nt.fna
Reading reference sizes
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Time reading reference sizes: 00:29:08
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Killed: 9
Can you shed some light on what is happening here? Do I need to specify a different value for memory? Thanks!
./centrifuge-inspect -s indices/bacteria/bacteria
Error: Encountered exception: 'Cannot open file indices/bacteria/bacteria.rev'
Command: centrifuge-inspect --wrapper basic-0 -s indices/bacteria/bacteria